Hi,
I have a case where I could not get rid of a (non-systematic, yet very frequent) warning: 'global' at node server@HSRV requested disconnect from node 'controller@HCLIENT' in order to prevent overlapping partitions.
The settings are just:
- two hosts run on a LAN:
HSRV
and HCLIENT
- a priori: non-standard EPMD ports are used consistently, firewalls block neither EPMD traffic nor inter-VM traffic, short names are used, DNS is consistent, the network is quite reliable, and Erlang 27.2 is used on both sides
- on HSRV, an Erlang server runs continuously
- on HCLIENT, a monitor client node is run first; then, while it is still running, a controller client node is started, triggers an operation on the server, and stops
- apparently, the "overlapping partitions" warning is triggered on the server when the controller node terminates
- as expected, if only a single client node runs, no such "overlapping partitions" warning is issued
- I suppose that most potential race conditions (e.g. for 'global' to be in sync) are avoided by inserting comfortable delays (timer:sleep/1) at various points; no amount of killing EPMD instances first, using global:sync/0, or, on controller termination, calling global:disconnect/0 or init:stop/1, seemed to solve the issue
- of course, with -kernel prevent_overlapping_partitions false this warning vanishes
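For reference, a minimal sketch of what that workaround looks like on the command line (the node name is a placeholder, not my actual one):

```shell
# Hypothetical launch line: disables the overlapping-partitions fix (OTP 25+);
# this silences the warning rather than addressing its cause.
erl -sname server -kernel prevent_overlapping_partitions false
```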
Any thought on what could cause such warning to be emitted?
Thanks in advance for any hint!
Best regards,
Olivier.
This sounds like the fix for overlapping partitions is doing exactly what it should be doing. While you don't say exactly how the third node is stopped, you do seem to indicate that it is a brutal kill. Can you shed some more light on exactly how you are terminating said node? Or have you tried gracefully shutting down this node vs slamming it on the floor (assumption)?
Edit :
Additional question, is your intention to have a fully connected cluster in this setup? If not, you can surely avoid doing so, and make this problem go away without worrying about gracefully shutting down, etc.
Hi,
Thanks for your message.
The third node (the controller one) is meant to appear, perform its short, one-off task and then shut down gracefully (so no brutal kill at all is involved here); only the monitor node (on the same host) and the server node are supposed to remain afterwards.
For the controller node to stop without triggering the aforementioned warning on the server, I tried to run, from the controller main process (spawned thanks to erl -run ...): init:stop(_StatusCode=0); then added, before it, global:disconnect(), timer:sleep(500), global:sync(); and also tried halt/1 - each time with no luck. The most puzzling element is that this warning is not systematic.
As for the intention, I just want each of the client nodes (monitor and controller) to be able to interact peacefully with the server. As a side effect, a fully-connected graph of nodes must be created by Erlang (both clients being then able to interact) - yet that does not matter for the current use case.
The only two hosts involved do not seem to have any problem to interact (in both ways).
I was wondering whether there would be a way of having more contextual runtime information regarding why such a warning is triggered, as currently I fail to see a network partition there.
Hmm, global:disconnect/0 by itself should do the trick. It may be because you have a bit of latency between nodes, such that your server node gets the signal from your monitor node before it gets the signal from the controller node itself. You do mention a bit of firewalling, and from what you described it doesn't sound like it should be the issue, but the firewall details are not clear either - such as no inter-VM traffic, yet you are able to form a fully connected cluster.
In regard to DNS, I suppose this means you're using short names? DNS may be consistent but also consistently slow. These are all kind of shots in the dark. I'm interested to test 27.2 myself to see what the major differences from 26 to 27 are (I tested with 26 a minute ago, but with a cluster all on the same machine, so zero latency).
Your setup also sounds pretty basic, but it is not clear how you're using global: is it the bare minimum usage of global (i.e., just using it for the actual clustering), or are you doing something more with it?
When I've experienced problems with overlapping-partition disconnects, it was always in the context of network blips. Of course, enumerating all the nodes in the cluster in no particular order and doing a node disconnect on each will also trigger the behavior.
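To illustrate that last point, here is a sketch of the kind of blind disconnect loop known to trip the overlapping-partitions check (not something to actually do with 'global' running):

```erlang
%% From the point of view of the remaining nodes, severing each connection
%% one by one looks like a partition forming, so 'global' reacts to it.
lists:foreach(fun erlang:disconnect_node/1, nodes()).
```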
Hi,
Thanks Bryan for your response, and sorry for the longer delay - due to an unrelated large refactoring of the code, I could not test it for some time.
I think I found the origin of this 'global' at node server@HSRV requested disconnect from node 'controller@HCLIENT' in order to prevent overlapping partitions. warning, even though I do not see why it should be triggered in that case.
Indeed, in the controller code, the PID of a service running on the server node is needed at some point. This service is registered only locally on the server node (rather than globally), so that the other nodes may have their own instance of this service if wanted (to avoid any clash in the global naming registry, as this service is not meant to be a singleton). The controller fetches that PID thanks to rpc:call(ServerNode, _Mod=erlang, _Fun=whereis, _Args=[RegisteredServiceName]). A remote PID (e.g. <11118.139.0>) is successfully returned, but apparently this triggers said warning.
Last piece of evidence: for this warning to be issued, the monitor node must be already connected to the server one, before the controller node does so; the monitor node happens to use the same local service on the server (whose PID was fetched by the same logic).
So, my question: am I somehow violating some expectations by doing so? Could fetching a local PID remotely that way, from multiple nodes, cause issues?
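In case it helps to compare, the same lookup can be written with erpc (OTP 23+), the documented successor to rpc for such calls; the node and service names below are placeholders for my actual ones:

```erlang
%% Sketch: fetching the PID of a locally registered process on a remote node.
ServerNode = 'server@HSRV',
ServicePid = erpc:call(ServerNode, erlang, whereis, [my_service]),
true = is_pid(ServicePid).
```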
Thanks in advance for any hint!
Best regards,
Olivier.
After some more testing, by adding timer:sleep/1 calls to pinpoint when the warning is actually issued (knowing that this controller node is short-lived): it happens non-systematically (often, but not always), and when it happens it is actually when the controller node terminates (thanks to init:stop(_StatusCode=0)). If it matters: this controller node was launched with erl -noshell -noinput -run [...], it uses short names, and it changed its cookie at start-up, before doing anything sensible.