I’m working on a master’s thesis, and one of the topics I’m studying is fault tolerance in Erlang.
Specifically, I would like to dig deeper into the two mechanisms that underlie fault tolerance in Erlang: monitors and links.
Searching the internet, I only ever find high-level descriptions of their behavior, but I am interested in what happens under the hood, in particular how they are implemented. For example, I would like to understand something like this:
Suppose processes A and B are linked together, and process B crashes without emitting any signal to A. How does A notice B’s death? I suppose there is some timeout, as is usually done in these cases, but how can I find out exactly what is going on in Erlang?
If anyone can help me or point me to resources that could help me find answers, I would be very grateful.
Best regards, Gaetano D’Agostino
EDIT: Instead of “process B crashes without emitting any signal to A”, I mean: B dies, but for some reason the EXIT signal doesn’t reach A.
If processes A and B are linked together, and process B crashes without emitting any signal to A.
If processes A and B are linked and B crashes, then an EXIT signal will always be sent to A from B. The reception of the EXIT signal is how process A knows that B has terminated.
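To make this concrete, here is a minimal sketch of A observing B’s termination through a link. The module name link_demo and the exit reason boom are made up for illustration; spawn_link/1, process_flag/2 and the {'EXIT', Pid, Reason} message shape are standard Erlang. A traps exits here only so that the exit signal shows up as a message it can inspect.

```erlang
-module(link_demo).
-export([run/0]).

%% The calling process plays the role of A; the spawned process is B.
run() ->
    %% Trap exits so B's exit signal is converted into a message
    %% instead of terminating A.
    OldFlag = process_flag(trap_exit, true),
    B = spawn_link(fun() -> exit(boom) end),
    Result =
        receive
            {'EXIT', B, Reason} -> {b_terminated, Reason}
        after 1000 ->
            %% Not expected; present only so the demo can never hang.
            timeout
        end,
    %% Restore the previous trap_exit setting of the caller.
    process_flag(trap_exit, OldFlag),
    Result.
```

Calling link_demo:run() should return {b_terminated, boom}.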
There is a lot of information available in the Processes chapter of the Erlang Reference Manual about how the different signals in Erlang work.
If anything is unclear, feel free to ask more questions and I will do my best to answer.
Linking and monitoring are both implemented inside Erlang’s runtime system itself. In other words: outside of the ‘normal’ process-based logic you deal with in user code.
This means that Erlang is able to guarantee, for example, that an exit signal will always be sent, even when a process crashes unexpectedly.
This is a big difference from other actor implementations, which are usually implemented entirely in user code. Because of that, they cannot give the same strong guarantees.
I apologize, I realized that my description was not clear. I meant that B dies but, for some reason, the EXIT signal doesn’t reach A. In this case, how does A notice B’s death?
I would be interested in going into more implementation details. This answer is too high level for me; I would like to understand how this is guaranteed.
As @Qqwy said, the exit signal is guaranteed to arrive eventually, so there is no need to deal with the scenario when it does not arrive.
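One way to see that the runtime does not rely on timeouts: even if you start monitoring a process after it has already died, you still get a notification straight away. The sketch below is only an illustration; the module name late_monitor_demo is made up, while spawn_monitor/1, erlang:monitor/2 and the 'DOWN' message shape are standard.

```erlang
-module(late_monitor_demo).
-export([run/0]).

run() ->
    {Pid, Ref1} = spawn_monitor(fun() -> ok end),
    %% Wait until the process has really terminated.
    receive {'DOWN', Ref1, process, Pid, normal} -> ok end,
    %% Monitor the already dead process. The runtime answers with a
    %% 'DOWN' message whose reason is noproc.
    Ref2 = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref2, process, Pid, Reason} -> Reason
    after 1000 ->
        %% Not expected; present only so the demo can never hang.
        timeout
    end.
```

Calling late_monitor_demo:run() should return noproc.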
On a local node, it is guaranteed by using locks and other synchronisation mechanisms.
When distributed, it is guaranteed by using a reliable transport (TCP) and by treating the remote process as dead if the TCP connection between the two nodes breaks. In that case, linked processes receive an exit signal with reason noconnection.
Processes in Erlang are executed by schedulers. These are OS-level threads; usually the Erlang runtime system starts one per CPU core.
An Erlang process contains its own stack, heap and current code location. This allows schedulers to run an Erlang process for a little bit, and then switch to another Erlang process and execute that one for a little bit.
‘Traditional’ synchronization primitives like mutex-based locks are used to make sure that a process is only ever running on a single scheduler at any given time, and that messages can be put in a process’s mailbox at any time from anywhere without the possibility of a data race.
Whenever a problem is encountered in the code running in a process, it is raised as an exception. When the exception is not caught by any Erlang code (and also when the process is terminated explicitly, for instance by calling erlang:exit/1 in the process itself or by an exit signal sent with erlang:exit/2), it triggers a special part of the scheduler’s process-execution code. That code shuts down the process and frees its memory, and it prepares an exit signal for each process in the terminated process’s list of links and a 'DOWN' message for each process in its list of monitors.
For local processes, these signals are forwarded directly to each of the linked or monitoring processes. For remote processes, they are forwarded through TCP to the remote node. In that case, the runtime system on the remote node takes care of delivering them to the designated processes.
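As a small illustration of what a monitoring process sees when the monitored process dies from an uncaught exception, here is a sketch. The module name monitor_demo and the error term oops are made up; spawn_monitor/1 and erlang:error/1 are standard. For an uncaught error, the exit reason is a tuple of the error term and a stack trace.

```erlang
-module(monitor_demo).
-export([run/0]).

run() ->
    %% Spawn a process and monitor it atomically; it fails immediately
    %% with an uncaught error.
    {Pid, Ref} = spawn_monitor(fun() -> erlang:error(oops) end),
    receive
        %% The runtime delivers this message on behalf of the dead process.
        {'DOWN', Ref, process, Pid, {oops, _Stacktrace}} -> {down, oops}
    after 1000 ->
        %% Not expected; present only so the demo can never hang.
        timeout
    end.
```

Calling monitor_demo:run() should return {down, oops}. The runtime will typically also log a crash report for the failed process.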
What happens when these signals are received depends on the configuration of the receiving process. For instance, a linked process will normally terminate immediately when it receives an exit signal with a reason other than normal, but if it is ‘trapping exits’, the signal is instead converted into an ordinary message that is placed in its mailbox.
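The default behaviour (no exit trapping) can be observed from the outside with a monitor. In this sketch, process A links to a child B that crashes; since A does not trap exits, it is terminated with the same reason. The module name exit_signal_demo and the reason crashed are made up for illustration.

```erlang
-module(exit_signal_demo).
-export([run/0]).

run() ->
    %% A does not trap exits. It links to B, which crashes at once,
    %% and then A simply waits forever.
    {A, Ref} = spawn_monitor(fun() ->
        spawn_link(fun() -> exit(crashed) end),
        receive after infinity -> ok end
    end),
    %% The caller only monitors A, so it is notified without being
    %% affected by the crash itself.
    receive
        {'DOWN', Ref, process, A, Reason} -> {a_terminated, Reason}
    after 1000 ->
        %% Not expected; present only so the demo can never hang.
        timeout
    end.
```

Calling exit_signal_demo:run() should return {a_terminated, crashed}, showing that B’s exit signal took A down with it.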
Erlang’s runtime system is written in C. If you want to go even deeper than this explanation, you’ll probably have to look at the C code itself. However, the code is quite complex and does not make for ‘light reading’.