Monitors and links: what happens under the hood?

Hi everyone,
I’m working on my master’s thesis, and one of the topics I’m studying is fault tolerance in Erlang.
Specifically, I would like to dig deeper into the two mechanisms underlying fault tolerance in Erlang: monitors and links.
Looking around on the internet, I always find high-level descriptions of their behaviour, but I would be interested in finding out what happens under the hood, in particular how they have been implemented. For example, I would like to understand something like this:

If processes A and B are linked together and process B crashes without emitting any signal to A, how does A notice B’s death? I suppose there is some timeout, as is usually done in these cases, but how can I find out exactly what is going on in Erlang?

If anyone can help me or give me some resources that could help me find answers I would be very grateful.

Best regards, Gaetano D’Agostino

EDIT: Instead of “process B crashes without emitting any signal to A” I mean: B dies, but for some reason the EXIT signal doesn’t reach A.


Hello!

If processes A and B are linked together, and process B crashes without emitting any signal to A.

If processes A and B are linked and B crashes, then an EXIT signal will always be sent to A from B. The reception of the EXIT signal is how process A knows that B has terminated.

There is a lot of information available in the Processes chapter of the Erlang Reference Manual about how the different signals in Erlang work.
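For illustration, here is a minimal sketch of the monitor counterpart of this (the module name monitor_demo and the exit reason boom are made up): the runtime delivers B’s termination as a 'DOWN' message, which the observing process receives like any other message.

-module(monitor_demo).
-export([run/0]).

run() ->
    %% spawn_monitor/1 atomically spawns B and sets up a monitor on it.
    {B, MRef} = spawn_monitor(fun() -> exit(boom) end),
    %% When B terminates, the runtime turns the monitor into a 'DOWN'
    %% message in our mailbox; no polling or timeout is involved.
    receive
        {'DOWN', MRef, process, B, Reason} ->
            io:format("~p terminated with reason ~p~n", [B, Reason])
    end.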

If anything is unclear, feel free to ask more questions and I will do my best to answer.


Linking and monitoring are both implemented inside Erlang’s runtime system itself. In other words: outside of the ‘normal’ process-based logic you deal with in user code.
This means that Erlang is able to provide guarantees such as “an exit signal will always be sent”, even when a process crashes unexpectedly.

This is a big difference from other implementations of actors, which are usually implemented fully in user code. Because of that, they cannot give the same strong guarantees.


I apologize, but I realized that my description was not clear. I meant: B dies, but for some reason the EXIT signal doesn’t reach A. In that case, how does A notice B’s death?


I would be interested in going into more implementation details. This is a rather high-level answer; I would like to understand how this is guaranteed.


As @Qqwy said, the exit signal is guaranteed to arrive eventually, so there is no need to deal with the scenario when it does not arrive.

On a local node, it is guaranteed by using locks and other synchronisation mechanisms.

When distributed, it is guaranteed by using a reliable transport (TCP) and by considering the process dead if the TCP connection between the two nodes breaks.
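As a rough sketch of the distributed case (the module name and the node argument are made up, and it assumes two nodes that are already connected), monitoring a process on another node looks like this:

-module(remote_monitor_demo).
-export([watch/1]).

%% Watch a process spawned on another node, e.g. watch('b@myhost').
watch(Node) ->
    RemotePid = spawn(Node, fun() -> timer:sleep(infinity) end),
    MRef = erlang:monitor(process, RemotePid),
    receive
        {'DOWN', MRef, process, RemotePid, noconnection} ->
            %% The connection to Node broke, so the local runtime
            %% generated this 'DOWN' signal itself; RemotePid may in
            %% fact still be alive on the other node.
            connection_lost;
        {'DOWN', MRef, process, RemotePid, Reason} ->
            %% The remote process really terminated and its exit
            %% reason made it across the connection.
            {terminated, Reason}
    end.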


Processes in Erlang are executed by schedulers. These are OS-level threads; usually the Erlang runtime system starts one per CPU core.
An Erlang process contains its own stack, heap and current code location. This allows schedulers to run an Erlang process for a little bit, and then switch to another Erlang process and execute that one for a little bit.
‘Traditional’ synchronization primitives like mutex-based locks are used to make sure that a process is only ever running on a single scheduler at any given time, and also that messages can be put in a process’s mailbox at any time, from anywhere, without the possibility of a data race.
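You can see the scheduler count from an Erlang shell; for example (the exact numbers depend on the machine, and logical_processors_available can be unknown on some systems):

%% Number of scheduler threads the runtime started, and the number of
%% logical processors it detected; by default these usually match.
erlang:system_info(schedulers_online).
erlang:system_info(logical_processors_available).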

Whenever a problem is encountered in the code running in a process, it is raised as an exception. When the exception is not caught by any Erlang code (and also when, for instance, erlang:exit/2 is called), it triggers a special part of the scheduler’s process-execution code that shuts down the process (and garbage-collects it), as well as preparing an exit signal for each of the processes in the terminated process’s list of links and monitors.

For local processes, the exit signals are forwarded directly to each of the monitoring/linked processes. For remote processes, the exit signals are forwarded through TCP to the remote node. In that case, the runtime system on the remote node will take care to place the exit signals in the memory of the designated processes.
What happens when these signals are received depends on the configuration of the receiving process. For instance, a linked process will normally terminate immediately when it receives an exit signal, but if it is ‘trapping exits’, the signal is instead converted into a normal message that is placed in its mailbox.
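A minimal sketch of that last behaviour (the module name link_demo and the exit reason boom are made up): the process traps exits and links to a child, so the child’s exit signal arrives as an ordinary {'EXIT', Pid, Reason} message instead of terminating the parent.

-module(link_demo).
-export([run/0]).

run() ->
    %% Convert incoming exit signals into {'EXIT', Pid, Reason} messages
    %% instead of letting them terminate this process.
    process_flag(trap_exit, true),
    %% spawn_link/1 atomically spawns the child and links it to us.
    B = spawn_link(fun() -> exit(boom) end),
    receive
        {'EXIT', B, Reason} ->
            io:format("linked process ~p exited with reason ~p~n", [B, Reason])
    end.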

Erlang’s runtime system is programmed in C. If you want to go even deeper than this explanation, you’ll probably have to look at the C code itself. However, the code is quite complex and does not make for ‘light reading’ :slightly_smiling_face:.


I had nearly the same question. And it seems the answer is: all sorts of anomalies are possible when the network is not healthy, and everything works as described only when either the network is healthy or the network is not involved (the processes run on the same node). That’s quite typical for distributed software.

It looks like the best publicly available research is the paper Programming Distributed Erlang Applications: Pitfalls and Recipes. It clearly shows by experiment that signals can be lost, and this is confirmed by the Erlang spec as well. A quote from that paper:

From studying further the Erlang literature we can see that the phenomenon is actually acknowledged in Barklund and Virdings carefully written natural language semantics for Erlang [?]. We quote:

10.6.2. Order of signals
. . . It is guaranteed that if a process P1 dispatches two signals s1 and s2 to the same process P2, in that order, then signal s1 will never arrive after s2 at P2. It is ensured that whenever possible, a signal dispatched to a process should eventually arrive at it. There are situations when it is not reasonable to require that all signals arrive at their destination, in particular when a signal is sent to a process on a different node and communication between the nodes is temporarily lost.

Note that in this context a message is a signal instance. In other words, there are no promises regarding safe delivery of signals (except no reordering), especially during temporary communication failures.


Yes, signals can be lost when a connection between nodes goes down. If you monitor a remote process, you’ll get notified by a DOWN signal containing the exit reason noconnection when a connection goes down. The monitored process might still be alive and you might have lost signals from/to that process.

It works in a similar fashion with a link between processes on two different nodes. When the connection goes down, the link is broken at both ends and both processes will receive an EXIT signal with exit reason noconnection. The linked processes might still be alive and you might have lost other signals between the processes.

Assuming that the connection does not go down, no signals will be lost. That is, by monitoring the connection, monitoring another process over the connection, or linking over the connection, you can determine whether signals might have been lost.
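For example, a rough sketch of the link case (the module name and node argument are made up; it assumes the nodes are connected and uses trap_exit so the exit signal shows up as a message):

-module(remote_link_demo).
-export([run/1]).

%% Link to a process spawned on another node, e.g. run('b@myhost').
run(Node) ->
    process_flag(trap_exit, true),
    RemotePid = spawn_link(Node, fun() -> timer:sleep(infinity) end),
    receive
        {'EXIT', RemotePid, noconnection} ->
            %% The connection went down and the link was broken locally;
            %% the remote process may still be running.
            connection_lost;
        {'EXIT', RemotePid, Reason} ->
            {remote_process_exited, Reason}
    end.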
