Preventing overlapping partitions

Hi,

I have a case where I cannot get rid of a (non-systematic, yet very frequent) warning:

`'global' at node server@HSRV requested disconnect from node 'controller@HCLIENT' in order to prevent overlapping partitions.`

The settings are just:

  • two hosts run on a LAN: HSRV and HCLIENT
  • a priori: non-standard EPMD ports are used consistently, the firewalls block neither EPMD nor inter-VM traffic, short names are used, DNS is consistent, the network is quite reliable, and Erlang/OTP 27.2 is used on both sides
  • on HSRV, an Erlang server runs continuously
  • on HCLIENT, a monitor client node is run first; then, while it is still running, a controller client node is started, triggers an operation on the server, and stops
  • apparently, the “overlapping partitions” warning is triggered on the server when the controller node terminates
  • as expected, if only a single client node runs, no such “overlapping partitions” warning is issued
  • I believe that most potential race conditions (e.g. waiting for `global` to be in sync) are avoided by inserting comfortable delays (`timer:sleep/1`) at various points; neither killing EPMD instances first, calling `global:sync/0`, nor, on controller termination, calling `global:disconnect/0` or `init:stop/1` solved the issue
  • of course, with `-kernel prevent_overlapping_partitions false`, this warning vanishes
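For context, the nodes are launched along these lines (a sketch only: the EPMD port, cookie, node names and the `my_controller` module shown here are placeholders, not the actual values):

```shell
# On HSRV (4370 stands for the non-standard EPMD port used on both hosts):
ERL_EPMD_PORT=4370 erl -sname server -setcookie some_cookie

# On HCLIENT, first the long-running monitor node:
ERL_EPMD_PORT=4370 erl -sname monitor -setcookie some_cookie

# ...then, while it is still running, the short-lived controller node:
ERL_EPMD_PORT=4370 erl -sname controller -setcookie some_cookie \
    -run my_controller start
```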

Any thoughts on what could cause such a warning to be emitted?

Thanks in advance for any hint!

Best regards,

Olivier.


This sounds like the fix for overlapping partitions is doing exactly what it should be doing. While you don’t say exactly how the third node is stopped, you do seem to indicate that it is a brutal kill. Can you shed some more light on exactly how you are terminating said node? Or, have you tried gracefully shutting down this node versus slamming it on the floor (an assumption)?

Edit :

Additional question: is your intention to have a fully connected cluster in this setup? If not, you can surely avoid doing so and make this problem go away without worrying about graceful shutdown, etc.

Hi,

Thanks for your message.

The third node (the controller) is meant to appear, perform its short, one-off task and then shut down gracefully (so no brutal kill is involved at all); only the monitor node (on the same host) and the server node are supposed to remain afterwards.

For the controller node to stop without triggering the aforementioned warning on the server, I tried running, from the controller's main process (spawned thanks to `erl -run ...`), `init:stop(_StatusCode=0)`; then added, before it, `global:disconnect()`, `timer:sleep(500)` and `global:sync()`; I also tried `halt/1` - each time with no luck. The most puzzling element is that this warning is not systematic.
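For reference, the attempted termination path of the controller looks roughly like this (an illustrative sketch: the function name and the exact ordering/delays varied between attempts):

```erlang
%% Sketch of the controller's shutdown sequence; names and the 500 ms
%% delay are from my attempts, not a recommended recipe:
controller_shutdown() ->
    global:sync(),               % wait until 'global' is in sync
    timer:sleep(500),            % arbitrary settling delay
    global:disconnect(),         % leave the cluster the 'global'-aware way
    init:stop(_StatusCode = 0).  % then stop the node gracefully
```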

As for the intention: I just want each of the client nodes (monitor and controller) to be able to interact peacefully with the server. As a side effect, a fully-connected graph of nodes must be created by Erlang (both clients then being able to interact with each other) - yet that does not matter for the current use case.
The only two hosts involved do not seem to have any problem interacting (in both directions).

I was wondering whether there is a way to obtain more contextual runtime information about why such a warning is triggered, as I currently fail to see any network partition here.


Hmm, `global:disconnect/0` by itself should do the trick. It may be that you have a bit of latency between nodes, such that your server node gets the signal from your monitor node before it gets the signal from the controller node itself. You do mention a bit of firewalling; from what you described it doesn't sound like that should be the issue, but the firewall details are not entirely clear either (such as the note about inter-VM traffic), yet you are able to form a fully connected cluster.

In regard to DNS, does this mean you're using short names? DNS may be consistent but also consistently slow. These are all kind of shots in the dark. I'm interested in testing 27.2 myself, though I'm not sure there is a major difference from 26 to 27 here (I tested with 26 a minute ago, but with a cluster all on the same machine, so zero latency).

Your setup also sounds pretty basic, but it is not clear how you're using `global`: is it the bare minimum usage (i.e., just relying on it for the actual clustering), or are you doing something more with it?

When I’ve experienced problems with overlapping-partition disconnects, it was always in the context of network blips. Of course, enumerating all the nodes in the cluster in no particular order and disconnecting from each one will also trigger the behavior.
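That node-by-node disconnect case can be sketched as follows (a minimal illustration run from the node that is about to leave, not taken from the original poster's code):

```erlang
%% Disconnecting peers one by one looks, to the surviving nodes, like a
%% network partition and can trigger overlapping-partition disconnects:
lists:foreach(fun erlang:disconnect_node/1, nodes()).

%% In contrast, global:disconnect/0 (available since OTP 25) first informs
%% 'global' on all nodes, so the departure is not mistaken for a partition:
global:disconnect().
```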