Node reconnection after computer suspend

In my development system (laptop, Ubuntu 23.04, OTP 26.0, incus/lxc containers) I noticed that resuming a working session after a power suspension would lose all the disterl connections.
On the console I can see the warning:
'global' at node <X@host> requested disconnect from node <Y@host> in order to prevent overlapping partitions
as expected.
I was wondering if there is a way to catch global messages in every node, and try to re-establish their connections.
It seems that the nodedown and nodeup messages are missed when suspending/resuming the computer, and the whole distribution topology is lost.


Ok, to answer my own question: I did a bit of research and I am sharing the results in case someone else is interested.

The initial reason for my post is that I usually develop distributed erlang applications on my laptop using a bunch of incus hypervisor-based containers.
When commuting to and from the office, I found that suspending and resuming my computer resulted in a loss of connections for nodes running in separate containers, but also for nodes running in the same container or on the host itself.

So my idea was to watch for suspend/resume events from the operating system and make sure that all the nodes are reconnected before doing anything else.

On Ubuntu, to be notified of a system resuming after suspend (not 100% sure about hibernate though), the only practical way I found is to subscribe to signals from the D-Bus interface of systemd-logind.

Quoting from org.freedesktop.login1:
"The PrepareForShutdown, PrepareForShutdownWithMetadata, and PrepareForSleep signals are sent right before (with the argument "true") or after (with the argument "false") the system goes down for reboot/poweroff and suspend/hibernate, respectively."

Although there is some D-Bus support for Erlang in jeanparpaillon/erlang-dbus on GitHub (an Erlang D-Bus implementation, forked from the unmaintained erlang-dbus), this package requires some non-trivial work to integrate with rebar3 projects and OTP 26.

So, I built a proof of concept using a small Python script that watches for suspend/resume events and relays them over a TCP connection.
The core of the Python script looks like this:

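import dbus
from dbus.mainloop.glib import DBusGMainLoop
from gi.repository import GLib

DBusGMainLoop(set_as_default=True)            # must be set before connecting to the bus
bus = dbus.SystemBus()
bus.add_signal_receiver(
    relay_sleep_resume,                       # the relay-to-erlang-node function, called with one boolean
    'PrepareForSleep',                        # signal name
    'org.freedesktop.login1.Manager',         # interface
    'org.freedesktop.login1'                  # bus name
)
GLib.MainLoop().run()                         # dispatch incoming signals until interrupted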
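
For anyone who wants to try something similar, here is a minimal sketch of what the relay function could look like. It is not the exact code from my script: the listener address, the port, the `encode_event` helper and the one-line text format are all assumptions made up for this example.

```python
import socket

# Assumptions for illustration only: the listener address and the wire format
# are hypothetical, not the ones used by the original script.
RELAY_HOST = "127.0.0.1"
RELAY_PORT = 4999


def encode_event(going_to_sleep):
    """Map the PrepareForSleep boolean to a one-line text message."""
    return b"suspend\n" if going_to_sleep else b"resume\n"


def relay_sleep_resume(going_to_sleep, host=RELAY_HOST, port=RELAY_PORT):
    """Send the event to the listening node; return False if it is unreachable."""
    try:
        with socket.create_connection((host, port), timeout=2) as conn:
            conn.sendall(encode_event(bool(going_to_sleep)))
        return True
    except OSError:
        # Right after resume the network may not be up yet; the caller can retry.
        return False
```

On the Erlang side, a small `gen_tcp` listener could react to the "resume" line, for example by calling `net_kernel:connect_node/1` for each node it knew about before the suspend.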

I considered using tools like Pyrlang (Pyrlang/Pyrlang on GitHub, an Erlang node implemented in Python) or ErlPort (hdima/erlport on GitHub, which connects Erlang to other languages) to just send a message to an Erlang process, but it seems that neither package works with OTP 23 and later, and I chose not to invest too much time in fixing them for an admittedly non-production corner case.

After a few weeks of testing, the hybrid Python-Erlang solution seems to work pretty well, even after a few Ubuntu system upgrades, and I will likely continue to use it while doing nomadic development work.

A small correction to my first post: the loss of disterl connections doesn't happen when the host (computer) resumes with the same network configuration it had when it was suspended.

So, if you suspend and resume a laptop in the same network, the nodes reconnect automagically.
However, if you suspend your laptop and resume in a different network (like suspend at the office, resume at a different location), then the issue is very likely to occur, which is understandable.
