In my development system (laptop, Ubuntu 23.04, OTP 26.0, incus/lxc containers) I noticed that resuming a working session after a power suspension would lose all the disterl connections.
On the console I can see the warning:
'global' at node <X@host> requested disconnect from node <Y@host> in order to prevent overlapping partitions
as expected.
I was wondering if there is a way to catch global messages in every node, and try to re-establish their connections.
It seems that the nodedown and nodeup messages are missed when suspending/resuming the computer, and all the distribution topology is lost.
Ok, to answer my own question: I did a bit of research and I am sharing the results in case someone else might be interested.
The initial reason for my post is that I usually develop distributed erlang applications on my laptop using a bunch of incus hypervisor-based containers.
And when commuting from/to office, I found that suspending and resuming my computer resulted in a loss of connections for nodes running on separate containers, but also for nodes running on the same container in the host itself.
So my idea was to watch for suspend/resume events from the operating system and make sure that all the nodes are reconnected before doing anything else.
On Ubuntu, to be notified of a system resuming after suspend (not 100% sure about hibernate though), the only practical way I found is by subscribing to signals from the DBUS interface of systemd-logind.
Quoting from org.freedesktop.login1:
"The PrepareForShutdown, PrepareForShutdownWithMetadata, and PrepareForSleep signals are sent right before (with the argument "true") or after (with the argument "false") the system goes down for reboot/poweroff and suspend/hibernate, respectively."
Although there is some support for DBUS from erlang with jeanparpaillon/erlang-dbus (an Erlang DBUS implementation, forked from the unmaintained erlang-dbus), this package requires some non-trivial work to integrate with rebar3 projects and OTP 26.
So, I built a proof of concept using a small python script to watch for suspend/resume events and then relay the events over a TCP connection.
The core of the python script looks like this:
    import dbus

    bus = dbus.SystemBus()
    bus.add_signal_receiver(
        relay_sleep_resume,                # the relay-to-erlang-node function
        'PrepareForSleep',                 # signal name
        'org.freedesktop.login1.Manager',  # interface
        'org.freedesktop.login1'           # bus name
    )
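For anyone who wants to try this, here is a minimal sketch of what the relay function could look like. It is not my exact script: the listener address, the `sleep_event` helper and the one-line text protocol ("suspend" / "resume" followed by a newline) are assumptions made up for this example. The only thing taken from the logind documentation is that the signal argument is true right before suspending and false right after resuming.

```python
import socket

# Hypothetical address of the TCP listener on the Erlang side (an assumption).
RELAY_ADDR = ("127.0.0.1", 4040)


def sleep_event(going_to_sleep):
    """Map the PrepareForSleep argument to an event name."""
    # logind sends true right before suspending and false right after resuming.
    return "suspend" if going_to_sleep else "resume"


def relay_sleep_resume(going_to_sleep, addr=RELAY_ADDR, timeout=2.0):
    """Send one newline-terminated event line to the TCP listener.

    Returns True if the event was sent, False if the listener was unreachable.
    """
    line = (sleep_event(bool(going_to_sleep)) + "\n").encode("ascii")
    try:
        with socket.create_connection(addr, timeout=timeout) as conn:
            conn.sendall(line)
        return True
    except OSError:
        # Right after resume the network may not be back yet; the caller can retry.
        return False
```

On the receiving side, a process reading lines from that socket could react to "resume" by pinging the known nodes again. The function only returns False on failure instead of raising, so that a missing listener does not take down the signal handler.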
I considered using tools like Pyrlang/Pyrlang (an Erlang node implemented in Python 3.5+, asyncio-based) or hdima/erlport (ErlPort, to connect Erlang to other languages) to just send a message to an erlang process, but it seems that neither package works with OTP 23 and later, and I chose not to invest too much time in fixing them for an admittedly non-production corner case.
After a few weeks of testing, the hybrid python-erlang solution seems to work pretty well, even after a few Ubuntu system upgrades, and I will likely continue to use it while doing nomadic development work.
A small correction on my first post: the loss of disterl connections doesn't happen when the host (computer) resumes with the same network configuration it had when suspended.
So, if you suspend and resume a laptop in the same network, the nodes reconnect automagically.
However, if you suspend your laptop and resume into a different network (like suspend at office, resume at a different location), then the issue is very likely (and understandable).