I was wondering how others detect blocked dirty I/O schedulers or how they would do this. The problem that came up was that each time a particular call was made, it would block a dirty I/O scheduler for a long time - like minutes. Once all of the dirty I/O schedulers were blocked, the system was unusable and not debuggable. However, before that, everything seemed fine even if there was only one dirty I/O scheduler left.
The issue has been fixed, but this was painful to debug, so I’d like to be better prepared next time.
I “think” that what I’d like to track is how busy the dirty I/O schedulers are and get an alarm that can be logged before the system becomes unusable if all are blocked. It feels like erlang:system_monitor/2, erlang:system_info/1 and erlang:statistics/1 come close to addressing this. They’re not quite what I want, though. erlang:system_monitor/2 can help me find long schedules, but I’d be ok with some long schedules. It looks like microstate accounting has useful information, but the docs indicate that it’s a profiling tool.
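To make the idea concrete, here is a rough sketch of the kind of check I have in mind, built on erlang:statistics(scheduler_wall_time_all). The module name dirty_io_watch and all of its functions are made up for illustration. It assumes that dirty I/O schedulers are reported with ids above erlang:system_info(schedulers) + erlang:system_info(dirty_cpu_schedulers), and that a dirty I/O scheduler stuck in a blocking call is counted as active rather than idle. I have not verified either assumption under load, so treat this as a starting point rather than a known-good solution.

```erlang
-module(dirty_io_watch).
-export([watch/2, sample/0, utilization/2, all_busy/2]).

%% Hypothetical helper: run this in its own process. The process that
%% enables scheduler_wall_time has to stay alive, which the loop ensures.
watch(IntervalMs, Threshold) ->
    _ = erlang:system_flag(scheduler_wall_time, true),
    loop(sample(), IntervalMs, Threshold).

loop(Old, IntervalMs, Threshold) ->
    timer:sleep(IntervalMs),
    New = sample(),
    Utils = utilization(Old, New),
    case all_busy(Utils, Threshold) of
        true  -> logger:warning("all dirty I/O schedulers busy: ~p", [Utils]);
        false -> ok
    end,
    loop(New, IntervalMs, Threshold).

%% Snapshot of {Id, ActiveTime, TotalTime} for dirty I/O schedulers only.
%% Assumption: their ids come after the normal and dirty CPU schedulers.
%% Crashes if scheduler_wall_time has not been enabled (statistics/1
%% returns undefined in that case).
sample() ->
    LastNonIo = erlang:system_info(schedulers)
              + erlang:system_info(dirty_cpu_schedulers),
    All = lists:sort(erlang:statistics(scheduler_wall_time_all)),
    [S || {Id, _, _} = S <- All, Id > LastNonIo].

%% Fraction of wall time each scheduler was active between two snapshots.
utilization(Old, New) ->
    [{Id, ratio(A1 - A0, T1 - T0)}
     || {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(Old, New)].

ratio(_Active, 0) -> 0.0;
ratio(Active, Total) -> Active / Total.

%% True only if there is at least one scheduler and every one of them
%% was at or above the threshold for the whole interval.
all_busy([], _Threshold) -> false;
all_busy(Utils, Threshold) ->
    lists:all(fun({_Id, U}) -> U >= Threshold end, Utils).
```

Something like spawn(dirty_io_watch, watch, [5000, 0.99]) would then log a warning whenever every dirty I/O scheduler was busy for essentially the whole 5-second window. The watcher itself runs on a normal scheduler, so it should keep working while the dirty I/O schedulers are stuck. A lower bar (for example, warning when all but one are busy) would give earlier notice, and alarm_handler:set_alarm/1 from SASL could replace the log call.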
Are there other tools in OTP that I should be looking at? Or perhaps a library?