In order to prevent overlapping partitions. Why?

Hello all!

I get an error when trying to connect to a remote Erlang node:

$ erl -name remote -setcookie 12345 -remsh test@host.test.domain
Erlang/OTP 25 [erts-13.2.2.10] [source] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns] [dtrace]

Eshell V13.2.2.10  (abort with ^G)
=WARNING REPORT==== 25-Oct-2024::11:55:50.825851 ===
'global' at node 'remote@workstation.host' requested disconnect from node 'test@host.test.domain' in order to prevent overlapping partitions
*** ERROR: Shell process terminated! (^G to start new job) ***
=WARNING REPORT==== 25-Oct-2024::11:55:50.827319 ===
'global' at node 'remote@workstation.host' disconnected node 'test@host.test.domain' in order to prevent overlapping partitions
(test@host.test.domain)1>

What does this error mean? I could not find any explanation of this kind of error.
It happens only when I connect remotely from my workstation; when I connect from the server itself, everything works.

Found some explanations of this error:

I recently ran into this as well. In my case it was because the systems couldn’t form a full mesh: not all hosts were reachable by all other hosts (hostnames didn’t resolve the same way everywhere). I also found the error message not really descriptive of the actual problem.
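
A rough way to check the reachability condition described above is to run a few calls in the Erlang shell of every node that should be in the mesh. This is only a sketch: the node names are taken from the log in the original post, and the list of nodes is an assumption you would replace with your own. The functions used (`inet:gethostbyname/1`, `net_adm:ping/1`, `nodes/1`) are standard OTP calls.

```erlang
%% Run in the shell of every node that should be part of the mesh.
%% The node list is an assumption; replace it with your own nodes.
Nodes = ['test@host.test.domain', 'remote@workstation.host'].

%% Pair each node with the host part of its name (the text after "@").
Hosts = [{N, lists:last(string:split(atom_to_list(N), "@"))} || N <- Nodes].

%% 1. Does every host name resolve from this machine?
%%    Each entry is {Node, {ok, Hostent}} or {Node, {error, Reason}}.
[{N, inet:gethostbyname(H)} || {N, H} <- Hosts].

%% 2. Is every other node reachable from this node? (pong = yes, pang = no)
[{N, net_adm:ping(N)} || N <- Nodes, N =/= node()].

%% 3. Which nodes is this node connected to right now, hidden ones included?
nodes(connected).
```

If a name resolves on one machine but not on another, or a ping succeeds in one direction only, that would match the situation described above, where the full mesh can never be completed.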

As I believe you have already found, this was a long-standing bug in OTP, fixed starting with OTP 26 (iirc). You can disable this protection, but I wouldn’t advise it. Instead, IMHO, the cluster should be designed so that it can handle a potential rolling disconnect (which is what happens in a fully connected mesh when an overlap is detected).

I can give you one example, as a TL;DR, of what can happen if you disable it: global can become inconsistent, such that one or more nodes appear to be connected but in fact are not, after an overlap occurs.
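
For reference, a minimal sketch of what disabling it could look like. This assumes the switch being discussed is the `kernel` application parameter `prevent_overlapping_partitions`; the post does not name it, so treat that as my assumption and check the `kernel` documentation for your OTP version. As a `sys.config` fragment:

```erlang
%% sys.config fragment (a sketch, not a recommendation).
%% Assumption: the behaviour is controlled by the kernel parameter
%% prevent_overlapping_partitions.
[
 {kernel, [
   {prevent_overlapping_partitions, false}
 ]}
].
```

As far as I know, the setting should be the same on every node in the cluster, and the inconsistency described above is the risk you accept by turning it off.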

Additionally, I think the best documentation for this is private, and it probably should be, but it’s documentation of the algorithm

That’s what was so surprising on my end when I encountered this recently. The nodes were never able to form a full mesh in the first place, so the error was not about disconnects at all, but about failing to establish connections.

Oh that’s interesting! So, you were trying to form a fully connected cluster but never could because of the error?

Yes, though not because of the error itself; the cause was hostname misconfiguration. This was a bit confusing at first, because my mental model associated the error with disconnections in an established mesh, not with a mesh failing to form.
