Does a supervisor use a heartbeat behind the scenes to monitor workers?

Hello,

Although I bought Programming Erlang 15 years ago, I never really got a chance to do big Erlang projects.

So I’m now reading Joe Armstrong’s thesis to revisit some questions I have about distributed systems.

Assuming we have a multi-machine Erlang deployment, how exactly is a crash detected? I imagine it can only be a heartbeat mechanism behind the scenes.

I would assume that a process crashing on the local machine (local VM) is easily detected by the Erlang runtime itself, which then creates synthetic events to notify a supervisor.

But a process crashing on a remote machine could mean the whole machine crashed and there is no communication at all, in which case only a heartbeat can (statistically) detect that the remote process crashed.

Is this correct?

I see the Erlang runtime has heart (heart — kernel v10.2.6), but it needs a special command line flag, so I’m not sure this is the main mechanism.

Thank you,
–emi

Supervisors and children are linked, which ensures a supervisor receives an exit signal when a child terminates.
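To make that concrete, here is a minimal sketch of the primitive supervisors build on (not the supervisor implementation itself): a parent that traps exits and is linked to a child receives the child’s termination as an ordinary message.

    -module(link_demo).
    -export([run/0]).

    run() ->
        process_flag(trap_exit, true),               % turn exit signals into messages
        Child = spawn_link(fun() -> exit(boom) end), % linked child that crashes
        receive
            {'EXIT', Child, Reason} ->
                io:format("child ~p exited: ~p~n", [Child, Reason])
        after 1000 ->
            timeout
        end.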

Children should be running on the same node as the supervisor, for exactly the reason you recognized. I once considered implementing a supervisor that could handle remote children, but quickly realized that it made far less sense than having a supervisor on each node.

Erlang supports linking, monitoring, and forwarding signals over the distribution channel; the distribution protocol has dedicated control messages for each of these operations: Distribution Protocol — erts v15.2.6

Based on some Wireshark tracing and reading the docs, remote monitoring works something like this (a small sketch follows the list): suppose a process on node A monitors process P on node B.

  • If P exits but node B keeps running, B sends a control message to A, and the VM on A forwards it as a signal to the local process.
  • If the Erlang VM on B goes down but the host keeps running, B’s OS TCP/IP stack will automatically close all connections opened by the VM. A is therefore notified immediately via a TCP segment with the FIN flag set, can conclude that all processes on B are down, and sends signals to the local processes.
  • Finally, if the network between the nodes is completely dead and no packets can be exchanged, both TCP and the Erlang distribution protocol have keep-alive and timeout mechanisms that detect the dead connection after a while.
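Putting the three cases together, here is a minimal sketch of the monitoring side. The node name 'b@host' and the registered process name 'p' are hypothetical placeholders; the point is that all three failure modes above surface as the same 'DOWN' message, differing only in the reason.

    -module(mon_demo).
    -export([watch/0]).

    %% Monitor a registered process on a remote node and wait for it to go down.
    watch() ->
        Ref = erlang:monitor(process, {p, 'b@host'}),
        receive
            {'DOWN', Ref, process, {p, 'b@host'}, Reason} ->
                %% Reason is P's exit reason in the first case,
                %% and noconnection in the second and third cases.
                io:format("remote process down: ~p~n", [Reason])
        end.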

To answer the original question: no, supervisors don’t have any explicit heartbeat mechanism; they rely on signals instead, but the underlying remote linking mechanism does. I’ve never tried attaching a remote process to a supervisor, but it might actually work. Whether remote supervision is a good idea or not, I don’t know.

The -heart flag is completely unrelated to distributed Erlang. The heart process implements a watchdog mechanism that restarts the local Erlang VM should it become unresponsive due to overload or a VM lock-up.
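For completeness, a rough illustration of enabling heart; the restart script path is a hypothetical placeholder, and HEART_BEAT_TIMEOUT (default 60 seconds) is how long the VM may stay silent before heart kills and restarts it:

    $ erl -heart -env HEART_BEAT_TIMEOUT 30
    1> heart:set_cmd("/usr/local/bin/restart_node.sh").
    ok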


The more I read, the more it seems to me that the design is actually quite simple:

  • connection between nodes is most likely using a heartbeat, either directly or through a normal socket which has keep-alive (see the tick sketch after this list)
  • local processes are managed by the runtime, and an exit / crash is easily detected
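On the first point: distributed Erlang does run its own heartbeat, the net tick, on top of the TCP connection. It is controlled by the kernel parameter net_ticktime (60 seconds by default); a node that stays silent for roughly that long is considered down. A quick way to see it (the node name and the value 20 here are arbitrary):

    $ erl -name a@host -kernel net_ticktime 20
    1> net_kernel:get_net_ticktime().
    20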

So:

  • on a local node: when the connection to a remote node is lost, a synthetic exit signal with reason noconnection is generated for every local process that is linked to a process on that node (Processes — Erlang System Documentation v27.3.3)
  • otherwise, assuming a working connection, exit / crash signals are sent over the network between nodes (a minimal sketch of both paths follows)
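A minimal sketch of both delivery paths, again with a hypothetical node 'b@host' and registered process p: while the connection is up, the real exit reason travels over the wire; once the connection is lost, the local VM generates the noconnection exit signal itself.

    -module(remote_link_demo).
    -export([watch/0]).

    watch() ->
        process_flag(trap_exit, true),
        %% link/1 works across nodes through the distribution channel
        Pid = rpc:call('b@host', erlang, whereis, [p]),
        link(Pid),
        receive
            {'EXIT', Pid, Reason} ->
                %% Reason is the remote exit reason while connected,
                %% or noconnection once the connection to the node is lost.
                io:format("~p exited: ~p~n", [Pid, Reason])
        end.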