"Wedged" processes - detecting a wedged vs slow process?

I recently had an issue where a gen_server got wedged and its message queue grew rather large.

This got me thinking: as far as I know, there is no way to tell a wedged process from a merely slow one. I know there are good arguments against priority messages, but how about something in the sys or erlang module that sends a signal to the head of the message queue — invisible to the user — that can assist in determining the nature of the problem?


% Default timeout of 5 seconds?
sys:ping(Pid) -> {ok, TimeElapsed} | {error, timeout}
sys:ping(Pid, Timeout) -> {ok, TimeElapsed} | {error, timeout}
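Something in this spirit can be approximated today: sys:get_status/2 is answered by the gen_server receive loop like any other queued message, so timing it (or catching its timeout) tells you whether the queue is draining at all. A minimal sketch — the module and function names are mine, not an existing API:

```erlang
-module(ping_probe).
-export([ping/2]).

%% Measure how long the process takes to answer a system message.
%% Because the request goes to the rear of the queue, the elapsed time
%% reflects the full queue-drain latency, not just the ping itself.
ping(Pid, Timeout) ->
    Start = erlang:monotonic_time(millisecond),
    try sys:get_status(Pid, Timeout) of
        _Status ->
            {ok, erlang:monotonic_time(millisecond) - Start}
    catch
        %% sys exits with {timeout, ...} when the process never answers.
        exit:{timeout, _} ->
            {error, timeout}
    end.
```

Note the caveat: this measures queue latency from the rear, which is exactly what the proposed head-of-queue ping would avoid.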

AFAIK, there is no way to insert messages at arbitrary places in a message queue, front or otherwise; they all go to the rear. Messages can only be received out of order via selective receives — and gen_server and friends don't do that, at least not without some weird gymnastics. It might be possible for the message-handling mechanism in gen or gen_* to use a selective receive to grab ping messages before any others, but that would slow everything down.
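To illustrate those gymnastics, here is a hand-rolled sketch (not anything gen_server does) of a loop that answers pings before any other queued message, using a zero-timeout selective receive. The `'$ping'` protocol is made up for the example:

```erlang
-module(prio_ping).
-export([loop/1]).

%% First scan the whole mailbox for a ping; only if none is queued,
%% fall through and take the next message in arrival order.
loop(State) ->
    receive
        {'$ping', From, Ref} ->
            From ! {Ref, pong},
            loop(State)
    after 0 ->
        receive
            {'$ping', From, Ref} ->
                From ! {Ref, pong},
                loop(State);
            Msg ->
                loop(handle(Msg, State))
        end
    end.

%% Placeholder for real message handling.
handle(_Msg, State) -> State.
```

This shows the cost directly: the `after 0` selective receive scans the entire mailbox on every iteration, so a long queue — the very thing you are trying to diagnose — makes each loop turn slower.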

Also, the distinction between “wedged” and “just very slow” is not clear. In general, processes should be designed so that they can get slow but not “wedged”, by using timeouts internally or by delegating maybe-slow work to spawned processes, preferably under a dedicated simple_one_for_one supervisor. timer:kill_after is another option to enforce a maximum runtime for a process; the timer module has just recently been modernized and streamlined :wink:
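A sketch of that delegation pattern, assuming illustrative names: the caller spawns a worker for the maybe-slow work, caps its runtime with timer:kill_after/2, and therefore can never get wedged itself:

```erlang
-module(delegate).
-export([run_bounded/2]).

%% Run Fun in a worker process, killing it after MaxMs milliseconds.
run_bounded(Fun, MaxMs) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Ref, Fun()} end),
    {ok, TRef} = timer:kill_after(MaxMs, Pid),
    receive
        {Ref, Result} ->
            timer:cancel(TRef),
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        %% Worker crashed or was killed by the timer (Reason = killed).
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    end.
```

In a real gen_server you would not block in this receive either; you would keep the monitor reference in state and match the reply or 'DOWN' in handle_info/2.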


The standard way to achieve that is to use primitives that break the message-ordering guarantees. For example, ETS tables: a process can periodically check an ETS table for a {Pid, ping, Seq} record and, if there is one, send a message with that Seq to the “monitoring” process.
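A minimal sketch of that scheme — table and function names are illustrative. The monitor writes the ping into a public table; the monitored process checks the table from inside its working loop, bypassing its own (possibly huge) mailbox:

```erlang
-module(ets_ping).
-export([init/0, request_ping/2, maybe_ack/1]).

%% One public table shared by monitor and monitored processes.
init() ->
    ets:new(ping_tab, [named_table, public, set]).

%% Called by the monitoring process: leave a ping for Pid.
request_ping(Pid, Seq) ->
    ets:insert(ping_tab, {Pid, ping, Seq}).

%% Called periodically by the monitored process, regardless of how
%% deep its message queue is. ets:take/2 reads and removes atomically.
maybe_ack(Monitor) ->
    case ets:take(ping_tab, self()) of
        [{_Pid, ping, Seq}] -> Monitor ! {pong, self(), Seq};
        [] -> ok
    end.
```

If the monitor's {pong, Pid, Seq} never arrives, the process is not even reaching its periodic check — a stronger signal of “wedged” than a long queue alone.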

But all in all, it comes down to the definition of a “wedged” process. Is it “wedged” when it does not process its message queue at all? When the queue grows faster than it drains? When it is stuck in an infinite busy loop? Or when it is hanging in some sort of deadlock? It is easy to check all these conditions for a single selected process, but when you need to do that at scale (say, for all processes in the system), it gets really complex.