I’m looking for some insight into process “patterns” that can result in is_process_alive/1 (docs) being unexpectedly slow.
For context, I have a singleton gen_server that is responsible for monitoring many other processes across a distributed Erlang cluster. These processes track themselves with the server via a cast and only interact with it asynchronously, so the expectation was that slowdowns in this process would not be externally visible.
However, there’s a codepath that checks the liveness of the singleton gen_server using is_process_alive/1, and I observed that call being extremely slow on occasion. Most of the time it completes in microseconds, but every so often it takes upwards of a second. This was unexpected – I had always considered this check to be “free” – but looking at the docs, is_process_alive/1 additionally guarantees that the process is not exiting, which means the caller has to wait for every signal already sent to the checked process to be handled before it gets its answer.
This is where I’d like more information. What are the situations in which is_process_alive/1 might be slow? I found experimentally that just having messages in its queue isn’t an issue. Does GC of the process being checked cause it to block? Could off_heap message queues help? Are there other functions like is_process_alive/1 that can be unexpectedly slow and need to be watched out for?
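For reference, a minimal sketch of that message-queue experiment (all names are made up for illustration), runnable line by line in an erl shell:

```erlang
%% Fill a throwaway process's message queue, then time is_process_alive/1.
Target = spawn(fun() -> receive stop -> ok end end).
[Target ! {filler, N} || N <- lists:seq(1, 100000)].
%% Stays in the microsecond range despite the deep queue, so queued
%% messages alone are not the problem.
{Micros, true} = timer:tc(erlang, is_process_alive, [Target]).
io:format("is_process_alive/1 took ~p us~n", [Micros]).
Target ! stop.
```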
is_process_alive/1 sends a non-message signal to the target process, asking it to reply with whether it is alive.
When there are a lot of signals to handle, it takes time to get to and respond to that signal. Non-message signals in particular (links, monitors, the is_process_alive/1 request itself, etc.) matter a lot here, since each of them requires special handling, unlike ordinary messages, which can be handled in bulk.
Think of signal handling the way you would think of any gen_server: if it’s overloaded, it may take a long time until it gets around to handling your latest request.
Edit: GC, off_heap message queues, etc. are red herrings.
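To make the overload scenario concrete, here is a rough sketch (counts and names are arbitrary) that floods a process with monitor/demonitor signals and then probes it:

```erlang
%% Every monitor/2 and demonitor/2 call sends a non-message signal that
%% the target must handle one by one, ahead of the is_process_alive/1
%% request signal queued behind them.
Target = spawn(fun() -> receive stop -> ok end end).
Flood = fun() ->
            [begin
                 Ref = erlang:monitor(process, Target),
                 erlang:demonitor(Ref, [flush])
             end || _ <- lists:seq(1, 100000)]
        end.
%% Spawn many flooders so signals are still in flight when we probe.
[spawn(Flood) || _ <- lists:seq(1, 50)].
{Micros, true} = timer:tc(erlang, is_process_alive, [Target]).
io:format("is_process_alive/1 took ~p us under signal load~n", [Micros]).
Target ! stop.
```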
Why not ask that singleton gen_server to monitor/2 the processes instead?
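A minimal sketch of that approach, assuming a hypothetical tracker module where processes register themselves with a {track, Pid} cast:

```erlang
-module(tracker).
-behaviour(gen_server).
-export([start_link/0, track/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Tracked processes register asynchronously, as in the original design.
track(Pid) ->
    gen_server:cast(?MODULE, {track, Pid}).

init([]) ->
    {ok, #{}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

%% Monitor each process when it registers; the runtime delivers a
%% 'DOWN' message when it exits, so there is no need to poll liveness.
handle_cast({track, Pid}, State) ->
    Ref = erlang:monitor(process, Pid),
    {noreply, State#{Ref => Pid}}.

handle_info({'DOWN', Ref, process, _Pid, _Reason}, State) ->
    {noreply, maps:remove(Ref, State)}.
```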
Clarifying question: do these signals go in the same queue as “normal” messages? That is: if my gen_server:call() is taking a while, does that also block is_process_alive/1, etc.?
Yes and no: they go into the same queue, since messages are signals like any other, but signal handling is separate from receive and whatever else you do in your code. You can think of it as something that runs all the time in the background. I wrote a bit about this in a blog post some years back.
We do. Actually, we monitor both ways – the gen_server monitors each process, and each process monitors the gen_server (so that they can re-track themselves with a replacement gen_server if the first goes down). This means we have potentially many processes simultaneously monitoring the server, which, based on your explanation, means we potentially have a lot of these non-message signals that need to be handled before is_process_alive/1 can be.
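For illustration, the client side of that pattern might look roughly like this (module and message shapes are stand-ins, not our actual code):

```erlang
-module(tracked_proc).
-export([start/1]).

%% A real process would do its own work as well; this sketch only shows
%% the monitor-and-re-track part. Note that every monitor/2 call here
%% sends a non-message signal the server must handle.
start(Server) ->
    spawn(fun() -> track_loop(Server) end).

track_loop(Server) ->
    Ref = erlang:monitor(process, Server),
    gen_server:cast(Server, {track, self()}),
    receive
        {'DOWN', Ref, process, _Pid, _Reason} ->
            %% The server went down: back off briefly, then re-register
            %% with whatever replacement now holds the name.
            timer:sleep(1000),
            track_loop(Server)
    end.
```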
Ultimately, we don’t actually need the is_process_alive/1 check, and the fix for us was to just get rid of it. The goal of this post wasn’t necessarily to make it fast, but to understand the situations in which it can be slow. My understanding now is: non-message signals require special handling, so if a lot of them are being sent to one process, non-message signals that require a response (like is_process_alive/1) can result in the caller waiting. Does that sound correct?
all signals (including messages) are continuously received and handled behind the scenes
Are there specific things that can slow down or pause the continuous handling of these signals? Is there anything that can speed it up? (For context, we don’t see increased scheduler/CPU utilization on the node handling this particular gen_server, which makes me wonder whether there’s anything we could do to allocate more resources to this process.)
Also: Does process GC pause non-message signal handling for that process?
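For anyone observing something similar: one rough way to gauge the monitor-signal exposure of a server like this (a sketch; tracker is a stand-in name):

```erlang
%% message_queue_len only counts message signals, but monitored_by shows
%% how many processes currently hold monitors on the server - a proxy
%% for how much non-message signal traffic it may be facing.
Server = whereis(tracker).
{message_queue_len, QLen} = erlang:process_info(Server, message_queue_len).
{monitored_by, Monitors} = erlang:process_info(Server, monitored_by).
io:format("~p queued messages, monitored by ~p processes~n",
          [QLen, length(Monitors)]).
```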