We’re seeing some weird behaviour: when CPU and memory usage get close to 50%, the entire VM locks up and grinds to a halt. No processes can be started; everything just sort of… hangs.
Looking at msacc, I see stuff like this:
Average thread real-time : 2023907 us
Accumulated system run-time : 122979882 us
Average scheduler run-time : 1181889 us
Thread aux check_io emulator gc other port sleep
Stats per thread:
async( 0) 0.00% 0.00% 0.00% 0.00% 0.01% 0.01% 99.98%
aux( 1) 95.46% 0.02% 0.00% 0.00% 0.02% 0.00% 4.50%
dirty_cpu_( 1) 0.00% 0.00% 0.02% 99.54% 0.00% 0.00% 0.44%
dirty_cpu_( 2) 0.00% 0.00% 0.02% 97.90% 0.00% 0.00% 2.08%
dirty_cpu_( 3) 0.00% 0.00% 0.02% 99.08% 0.00% 0.00% 0.90%
dirty_cpu_( 4) 0.00% 0.00% 0.02% 99.80% 0.00% 0.00% 0.18%
dirty_cpu_( 5) 0.00% 0.00% 0.02% 97.32% 0.00% 0.00% 2.66%
dirty_cpu_( 6) 0.00% 0.00% 0.02% 97.90% 0.00% 0.00% 2.08%
dirty_cpu_( 7) 0.00% 0.00% 0.02% 98.89% 0.00% 0.00% 1.09%
dirty_cpu_( 8) 0.00% 0.00% 0.02% 99.03% 0.00% 0.00% 0.94%
dirty_cpu_( 9) 0.00% 0.00% 0.02% 97.67% 0.00% 0.00% 2.31%
dirty_cpu_(10) 0.00% 0.00% 0.02% 96.13% 0.00% 0.00% 3.85%
dirty_cpu_(11) 0.00% 0.00% 0.02% 98.25% 0.00% 0.00% 1.72%
dirty_cpu_(12) 0.00% 0.00% 0.01% 99.84% 0.00% 0.00% 0.14%
dirty_cpu_(13) 0.00% 0.00% 0.02% 98.90% 0.00% 0.00% 1.08%
dirty_cpu_(14) 0.00% 0.00% 0.02% 97.25% 0.00% 0.00% 2.73%
dirty_cpu_(15) 0.00% 0.00% 0.02% 97.06% 0.00% 0.00% 2.92%
dirty_cpu_(16) 0.00% 0.00% 0.02% 99.75% 0.00% 0.00% 0.23%
dirty_cpu_(17) 0.00% 0.00% 0.02% 98.95% 0.00% 0.00% 1.04%
dirty_cpu_(18) 0.00% 0.00% 0.02% 98.81% 0.00% 0.00% 1.17%
dirty_cpu_(19) 0.00% 0.00% 0.02% 98.35% 0.00% 0.00% 1.63%
dirty_cpu_(20) 0.00% 0.00% 0.02% 97.21% 0.00% 0.00% 2.76%
dirty_cpu_(21) 0.00% 0.00% 0.02% 98.94% 0.00% 0.00% 1.04%
dirty_cpu_(22) 0.00% 0.00% 0.02% 97.40% 0.00% 0.00% 2.58%
dirty_cpu_(23) 0.00% 0.00% 0.01% 97.60% 0.00% 0.00% 2.38%
dirty_cpu_(24) 0.00% 0.00% 0.02% 98.98% 0.00% 0.00% 1.00%
dirty_cpu_(25) 0.00% 0.00% 0.02% 99.22% 0.00% 0.00% 0.76%
dirty_cpu_(26) 0.00% 0.00% 0.02% 99.65% 0.00% 0.00% 0.33%
dirty_cpu_(27) 0.00% 0.00% 0.02% 99.50% 0.00% 0.00% 0.48%
dirty_cpu_(28) 0.00% 0.00% 0.02% 98.95% 0.00% 0.00% 1.03%
dirty_cpu_(29) 0.00% 0.00% 0.02% 99.35% 0.00% 0.00% 0.63%
dirty_cpu_(30) 0.00% 0.00% 0.01% 98.52% 0.00% 0.00% 1.46%
dirty_cpu_(31) 0.00% 0.00% 0.02% 98.69% 0.00% 0.00% 1.29%
dirty_cpu_(32) 0.00% 0.00% 0.02% 99.04% 0.00% 0.00% 0.94%
dirty_cpu_(33) 0.00% 0.00% 0.01% 99.42% 0.00% 0.00% 0.56%
dirty_cpu_(34) 0.00% 0.00% 0.10% 99.78% 0.00% 0.00% 0.12%
dirty_cpu_(35) 0.00% 0.00% 0.02% 99.58% 0.00% 0.00% 0.39%
dirty_cpu_(36) 0.00% 0.00% 0.02% 99.16% 0.00% 0.00% 0.82%
dirty_cpu_(37) 0.00% 0.00% 0.02% 99.96% 0.00% 0.00% 0.02%
dirty_cpu_(38) 0.00% 0.00% 0.02% 95.51% 0.00% 0.00% 4.47%
dirty_io_s( 1) 0.00% 0.00% 0.03% 0.00% 0.00% 0.00% 99.97%
dirty_io_s( 2) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00%
dirty_io_s( 3) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00%
dirty_io_s( 4) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00%
dirty_io_s( 5) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00%
dirty_io_s( 6) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00%
dirty_io_s( 7) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00%
dirty_io_s( 8) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00%
dirty_io_s( 9) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00%
dirty_io_s(10) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00%
poll( 0) 0.00% 2.06% 0.00% 0.00% 0.00% 0.00% 97.94%
scheduler( 1) 2.50% 0.18% 64.17% 3.96% 1.35% 0.91% 26.93%
scheduler( 2) 3.11% 0.20% 60.10% 4.77% 1.15% 1.07% 29.61%
scheduler( 3) 3.00% 0.15% 57.79% 5.28% 1.33% 1.02% 31.42%
scheduler( 4) 2.27% 0.17% 55.32% 3.10% 1.34% 0.97% 36.84%
scheduler( 5) 1.68% 0.14% 60.57% 3.27% 1.00% 0.61% 32.72%
scheduler( 6) 3.82% 0.16% 62.58% 3.67% 1.20% 0.94% 27.63%
scheduler( 7) 2.69% 0.15% 52.60% 3.22% 1.15% 0.63% 39.55%
scheduler( 8) 3.26% 0.17% 53.29% 3.69% 1.03% 0.89% 37.68%
scheduler( 9) 2.58% 0.16% 56.15% 2.95% 1.16% 0.70% 36.30%
scheduler(10) 2.47% 0.16% 51.61% 4.94% 1.11% 0.74% 38.96%
scheduler(11) 3.04% 0.17% 56.17% 4.63% 1.37% 1.00% 33.63%
scheduler(12) 3.02% 0.14% 56.15% 3.35% 1.05% 0.76% 35.53%
scheduler(13) 3.01% 0.15% 53.35% 6.15% 0.99% 0.85% 35.49%
scheduler(14) 2.34% 0.14% 55.08% 3.92% 0.97% 0.76% 36.79%
scheduler(15) 3.31% 0.16% 59.34% 4.52% 1.11% 0.69% 30.87%
scheduler(16) 3.30% 0.17% 52.17% 6.91% 1.20% 0.86% 35.38%
scheduler(17) 3.25% 0.13% 52.67% 2.67% 1.06% 0.56% 39.67%
scheduler(18) 1.85% 0.14% 48.81% 2.29% 1.02% 0.61% 45.28%
scheduler(19) 3.30% 0.16% 57.76% 3.47% 1.36% 0.99% 32.95%
scheduler(20) 2.13% 0.17% 51.69% 3.73% 1.03% 1.85% 39.41%
scheduler(21) 2.61% 0.19% 53.44% 3.25% 1.14% 1.25% 38.12%
scheduler(22) 3.33% 0.16% 54.18% 4.06% 1.22% 0.87% 36.18%
scheduler(23) 2.94% 0.13% 52.54% 3.76% 1.02% 0.97% 38.63%
scheduler(24) 2.27% 0.13% 60.31% 3.38% 1.14% 0.88% 31.88%
scheduler(25) 2.13% 0.14% 58.30% 4.44% 1.01% 0.72% 33.26%
scheduler(26) 4.31% 0.14% 55.71% 2.91% 1.09% 0.77% 35.08%
scheduler(27) 2.36% 0.22% 50.89% 3.36% 1.36% 1.16% 40.66%
scheduler(28) 2.10% 0.16% 59.33% 3.63% 1.13% 0.74% 32.92%
scheduler(29) 2.79% 0.16% 54.97% 3.87% 0.99% 0.84% 36.37%
scheduler(30) 3.44% 0.14% 53.27% 4.43% 1.16% 2.45% 35.12%
scheduler(31) 2.07% 0.11% 50.37% 3.36% 0.85% 0.58% 42.64%
scheduler(32) 2.63% 0.14% 42.84% 2.42% 0.97% 0.68% 50.31%
scheduler(33) 3.13% 0.10% 39.32% 2.02% 0.67% 0.48% 54.29%
scheduler(34) 2.25% 0.11% 33.59% 1.75% 0.79% 0.47% 61.04%
scheduler(35) 2.27% 0.08% 21.66% 1.42% 0.65% 0.39% 73.54%
scheduler(36) 2.84% 0.08% 17.69% 1.12% 0.42% 0.30% 77.56%
scheduler(37) 1.86% 0.06% 19.05% 1.29% 0.43% 0.21% 77.10%
scheduler(38) 1.92% 0.09% 22.21% 1.21% 0.48% 0.32% 73.77%
Stats per type:
async 0.00% 0.00% 0.00% 0.00% 0.01% 0.01% 99.98%
aux 95.46% 0.02% 0.00% 0.00% 0.02% 0.00% 4.50%
dirty_cpu_sche 0.00% 0.00% 0.02% 98.60% 0.00% 0.00% 1.37%
dirty_io_sched 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00%
poll 0.00% 2.06% 0.00% 0.00% 0.00% 0.00% 97.94%
scheduler 2.72% 0.15% 50.44% 3.48% 1.04% 0.83% 41.35%
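For reference, a snapshot like the one above can be grabbed from a remote shell on the node with the msacc module; a minimal sketch (the 10-second window is an arbitrary choice):

```erlang
%% Reset the counters, record microstate accounting for 10 s, then stop
%% and print the per-thread / per-type breakdown shown above.
msacc:start(10000),
msacc:print().
```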
I checked How to debug/profile scheduler utilization, which sounded similar, but I’ve determined it’s a slightly different problem. The odd thing is that I can’t seem to reproduce this with artificial load, yet it’s trivially reproducible when our load scales up.
msacc seems to suggest that most of the time is spent doing GC. I’ve checked etop (nothing suspicious) and done some tracing with redbug and observer, but haven’t found the culprit.
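(For illustration, one way to catch individual processes that trigger heavy GC is erlang:system_monitor/2 with the long_gc / large_heap options; a rough sketch, where the 500 ms and 50 M-word thresholds are arbitrary examples, not tuned values:)

```erlang
%% Spawn a watcher that gets a {monitor, Pid, Tag, Info} message whenever any
%% process has a GC pause over 500 ms or a heap above ~50 M words, and logs it.
Watcher = spawn(fun Loop() ->
              receive
                  {monitor, Pid, Tag, Info} when Tag =:= long_gc;
                                                 Tag =:= large_heap ->
                      io:format("~p ~p: ~p~n", [Tag, Pid, Info]),
                      Loop()
              end
          end),
erlang:system_monitor(Watcher, [{long_gc, 500}, {large_heap, 50000000}]).
```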
I was wondering if anyone has seen similar problems, or if there’s a way to see what’s happening on a specific scheduler. I checked Erlang --, but that only gives me overall scheduler utilization.
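(For context, scheduler:utilization/1 from runtime_tools at least breaks utilization down per scheduler thread, normal as well as dirty CPU and dirty IO, rather than giving one aggregate number, though it still doesn’t show what each scheduler is actually running. A minimal sketch, with an arbitrary 10-second window:)

```erlang
%% Sample per-scheduler utilization for 10 s on the running node.
%% Returns entries like {normal, SchedulerId, Util, Percent} for each
%% scheduler thread, plus totals.
scheduler:utilization(10).
```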