This is about an ancient version of Erlang, R15 and earlier. I’m hoping someone recognises this and can tell me what’s going on.
I use Erlang to rotate logfiles, on an embedded single-core system. So no SMP. When rotating the logfiles, I notice that other processes sometimes get starved for CPU an unreasonably long time. I can reproduce the problem on a bog-standard x86 linux box, but I can’t explain what’s going on.
Complete code to reproduce:
In a nutshell, the code runs two processes. One compresses a logfile, using Erlang’s internal gzio, 500k at a time. The other just sleeps 10 ms at a time in a receive. The sleeping process typically gets starved for 700 ms while the compression runs. That’s unexpected.
Here’s what a run looks like in the logfile, the first column is a timestamp in microseconds:
3317458395 sleeping. relative 22
3317469395 sleeping. relative 33
3317480395 sleeping. relative 44
3317528571 compressing. relative 92 ms
3317568268 compressing. relative 131 ms
3317608473 compressing. relative 172 ms
3317651503 compressing. relative 215 ms
3317697024 compressing. relative 260 ms
3317738523 compressing. relative 302 ms
3317778853 compressing. relative 342 ms
3317819305 compressing. relative 382 ms
3317859849 compressing. relative 423 ms
3317900184 compressing. relative 463 ms
3317940621 compressing. relative 504 ms
3317981149 compressing. relative 544 ms
3318021277 compressing. relative 584 ms
3318061482 compressing. relative 625 ms
3318101664 compressing. relative 665 ms
3318142969 compressing. relative 706 ms
3318143101 sleeping. relative 706
So, the ‘compression’ loop takes about 40 ms per iteration, and the sleeping loop normally takes 10 or 11 ms, and then it gets starved for 700 ms.
I instrumented the VM to try and figure out what was going on, but I don’t properly understand how scheduling is meant to work—big thanks though to Happi for the BEAM book, which was a good way to get started. I can see that beam_emu:schedule() is called after about 300 ms of starvation, but it still doesn’t decide to schedule the starving process. So “not enough reductions being counted for gzio” doesn’t seem to explain it all.
Clues?