Detecting blocked dirty I/O schedulers

fhunleth · February 23, 2022, 1:47pm

I was wondering how others detect blocked dirty I/O schedulers or how they would do this. The problem that came up was that each time a particular call was made, it would block a dirty I/O scheduler for a long time - like minutes. Once all of the dirty I/O schedulers were blocked, the system was unusable and not debuggable. However, before that, everything seemed fine even if there was only one dirty I/O scheduler left.

The issue has been fixed, but this was painful to debug, so I’d like to be better prepared next time.

I “think” that what I’d like to track is how busy the dirty I/O schedulers are, and to get an alarm that can be logged before the system becomes unusable because all of them are blocked. It feels like erlang:system_monitor/2, erlang:system_info/1 and erlang:statistics/1 come close to addressing this. They’re not quite what I want, though. erlang:system_monitor/2 can help me find long schedules, but I’d be ok with some long schedules. It looks like microstate accounting has useful information, but the docs indicate that it’s a profiling tool.

Are there other tools in OTP that I should be looking at? Or perhaps a library?

rickard · February 25, 2022, 3:32am

I guess you want to look at scheduler utilisation of dirty I/O schedulers. If all of them are closing in on 100% you may be closing in on trouble.

If you take samples using scheduler:sample_all/0 you can calculate the scheduler utilisation of all schedulers, including dirty I/O schedulers, using scheduler:utilization/2. You may also utilise the primitive statistics(scheduler_wall_time_all) directly and calculate the utilisation yourself.
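For anyone finding this later, here is a rough sketch of both approaches, not a definitive implementation. The module name dirty_io_check, the function names, and the interval and threshold arguments are made up for illustration. It assumes the scheduler module from runtime_tools is available, and it turns on the scheduler_wall_time flag explicitly, since samples may not be usable without it. The second function relies on my reading of the erlang:statistics/1 docs that dirty I/O scheduler ids come after the normal and dirty CPU scheduler ids.

```erlang
-module(dirty_io_check).
-export([busy/2, utilisation/1]).

%% Take two samples IntervalMs apart and return [{SchedulerId, Utilisation}]
%% for every dirty I/O scheduler at or above Threshold (a float in 0.0..1.0).
busy(IntervalMs, Threshold) ->
    %% Wall time accounting must be on. The flag is tied to the calling
    %% process, so a real monitor should enable it from a long-lived process.
    _ = erlang:system_flag(scheduler_wall_time, true),
    S1 = scheduler:sample_all(),
    timer:sleep(IntervalMs),
    S2 = scheduler:sample_all(),
    %% Dirty I/O schedulers are reported with the type 'io'.
    [{Id, U} || {io, Id, U, _Percent} <- scheduler:utilization(S1, S2),
                U >= Threshold].

%% The same calculation done by hand from the primitive:
%% utilisation = (Active1 - Active0) / (Total1 - Total0) per scheduler.
utilisation(IntervalMs) ->
    _ = erlang:system_flag(scheduler_wall_time, true),
    T0 = lists:sort(erlang:statistics(scheduler_wall_time_all)),
    timer:sleep(IntervalMs),
    T1 = lists:sort(erlang:statistics(scheduler_wall_time_all)),
    %% Assumption: ids above this value belong to dirty I/O schedulers.
    LastNonIo = erlang:system_info(schedulers)
        + erlang:system_info(dirty_cpu_schedulers),
    [{Id, (A1 - A0) / (W1 - W0)}
     || {{Id, A0, W0}, {Id, A1, W1}} <- lists:zip(T0, T1),
        Id > LastNonIo,
        W1 > W0].
```

A periodic process could call dirty_io_check:busy(1000, 0.95) and log a warning when the length of the result gets close to erlang:system_info(dirty_io_schedulers). A scheduler stuck in a long blocking call should show up as close to fully utilised for the whole interval, so a few consecutive hits on the same scheduler is the pattern to look for.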

fhunleth · February 26, 2022, 4:39pm

Thanks @rickard for the reply. I’ll go that direction.