Preventing overlapping partitions

Hi,

I have a case where I cannot get rid of a (non-systematic, yet very frequent) warning:

`'global' at node server@HSRV requested disconnect from node 'controller@HCLIENT' in order to prevent overlapping partitions.`

The settings are just:

  • two hosts run on a LAN: HSRV and HCLIENT
  • a priori: non-standard EPMD ports are used consistently, the firewalls block neither EPMD nor inter-VM traffic, short names are used, DNS is consistent, the network is quite reliable, and Erlang/OTP 27.2 is used on both sides
  • on HSRV, an Erlang server runs continuously
  • on HCLIENT, a monitor client node is run first; then, while it is still running, a controller client node is started, triggers an operation on the server, and stops
  • apparently, the “overlapping partitions” warning is triggered on the server when the controller node terminates
  • as expected, if only a single client node runs, no such “overlapping partitions” warning is issued
  • I believe that most potential race conditions (e.g. waiting for `global` to be in sync) are avoided by inserting comfortable delays (`timer:sleep/1`) at various points; neither killing EPMD instances first, calling `global:sync/0`, nor, on controller termination, calling `global:disconnect/0` or `init:stop/1` solved the issue
  • of course, with `-kernel prevent_overlapping_partitions false`, this warning vanishes
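For context, the nodes are launched along these lines (a sketch only: the EPMD port, cookie, node names and the `my_controller` module shown here are placeholders, not the actual values):

```shell
# On HSRV (4370 stands for the non-standard EPMD port used on both hosts):
ERL_EPMD_PORT=4370 erl -sname server -setcookie some_cookie

# On HCLIENT, first the long-running monitor node:
ERL_EPMD_PORT=4370 erl -sname monitor -setcookie some_cookie

# ...then, while it is still running, the short-lived controller node:
ERL_EPMD_PORT=4370 erl -sname controller -setcookie some_cookie \
    -run my_controller start
```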

Any thoughts on what could cause such a warning to be emitted?

Thanks in advance for any hint!

Best regards,

Olivier.


This sounds like the fix for overlapping partitions is doing exactly what it should be doing. While you don’t say exactly how the third node is stopped, you do seem to indicate that it is a brutal kill. Can you shed some more light on exactly how you are terminating said node? Or, have you tried gracefully shutting down this node versus slamming it on the floor (an assumption)?

Edit :

Additional question: is your intention to have a fully connected cluster in this setup? If not, you can surely avoid doing so and make this problem go away without worrying about graceful shutdown, etc.

Hi,

Thanks for your message.

The third node (the controller) is meant to appear, perform its short, one-off task and then shut down gracefully (so no brutal kill is involved at all); only the monitor node (on the same host) and the server node are supposed to remain afterwards.

For the controller node to stop without triggering the aforementioned warning on the server, I tried running, from the controller's main process (spawned thanks to `erl -run ...`), `init:stop(_StatusCode=0)`; then added, before it, `global:disconnect()`, `timer:sleep(500)` and `global:sync()`; I also tried `halt/1` - each time with no luck. The most puzzling element is that this warning is not systematic.
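For reference, the attempted termination path of the controller looks roughly like this (an illustrative sketch: the function name and the exact ordering/delays varied between attempts):

```erlang
%% Sketch of the controller's shutdown sequence; names and the 500 ms
%% delay are from my attempts, not a recommended recipe:
controller_shutdown() ->
    global:sync(),               % wait until 'global' is in sync
    timer:sleep(500),            % arbitrary settling delay
    global:disconnect(),         % leave the cluster the 'global'-aware way
    init:stop(_StatusCode = 0).  % then stop the node gracefully
```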

As for the intention: I just want each of the client nodes (monitor and controller) to be able to interact peacefully with the server. As a side effect, a fully-connected graph of nodes must be created by Erlang (both clients then being able to interact with each other) - yet that does not matter for the current use case.
The only two hosts involved do not seem to have any problem interacting (in both directions).

I was wondering whether there is a way to obtain more contextual runtime information about why such a warning is triggered, as I currently fail to see any network partition here.


Hmm, `global:disconnect/0` by itself should do the trick. It may be that you have a bit of latency between nodes, such that your server node gets the signal from your monitor node before it gets the signal from the controller node itself. You do mention a bit of firewalling; from what you described it doesn't sound like that should be the issue, but the firewall details are not entirely clear either (such as the note about inter-VM traffic), yet you are able to form a fully connected cluster.

In regard to DNS, does this mean you're using short names? DNS may be consistent but also consistently slow. These are all kind of shots in the dark. I'm interested in testing 27.2 myself, though I'm not sure there is a major difference from 26 to 27 here (I tested with 26 a minute ago, but with a cluster all on the same machine, so zero latency).

Your setup also sounds pretty basic, but it is not clear how you're using `global`: is it the bare minimum usage (i.e., just relying on it for the actual clustering), or are you doing something more with it?

When I’ve experienced problems with overlapping-partition disconnects, it was always in the context of network blips. Of course, enumerating all the nodes in the cluster in no particular order and disconnecting from each one will also trigger the behavior.
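That node-by-node disconnect case can be sketched as follows (a minimal illustration run from the node that is about to leave, not taken from the original poster's code):

```erlang
%% Disconnecting peers one by one looks, to the surviving nodes, like a
%% network partition and can trigger overlapping-partition disconnects:
lists:foreach(fun erlang:disconnect_node/1, nodes()).

%% In contrast, global:disconnect/0 (available since OTP 25) first informs
%% 'global' on all nodes, so the departure is not mistaken for a partition:
global:disconnect().
```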