Nodes fail to join an Erlang cluster completely

We recently started seeing strange behaviour in our Erlang cluster (a fairly large cluster running a mix of OTP versions 24.0.2, 24.3.3 and 24.3.4.5). We noticed that after a node restarted, it didn’t automatically re-connect to all nodes in the cluster. Running net_adm:ping/1 towards the missing nodes would re-establish the connections without any apparent problems.
I have tested this by starting a fresh Erlang node (with the correct cookie and net_ticktime) and pinging one of the nodes in the cluster to make it join. I then see that it establishes connections to a number of nodes, but not to all of them. After net_startuptime has elapsed I get a shower of warnings from global (global: xxx failed to connect to yyy), one for each of the nodes that were not connected. Changing net_startuptime does not help; it only delays the warning messages.
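For reference, the test looks roughly like this (the node names, host and cookie are placeholders, not our real ones):

    $ erl -name test@myhost -setcookie our_cookie -kernel net_ticktime 60
    (test@myhost)1> net_adm:ping('n1@myhost').
    pong
    %% Wait past net_startuptime, then compare views of the cluster:
    (test@myhost)2> nodes().
    ['n1@myhost',...]    %% only a subset; the missing nodes trigger the global warnings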
Digging through the docs I found the big red warning in the global doc about prevent_overlapping_partitions. I figure this may be the root cause of our problems, but I was wondering about a couple of things in connection with this.

  • Is my reasoning about the cause of the problems correct? Can I verify that it is indeed the overlapping partitions problem?
  • When enabling this parameter, will it come into effect immediately or does it require a restart?
  • The info about it says that it needs to be enabled on all nodes in the network in order to work. Does this mean that the update must be done more or less simultaneously on all nodes?
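For reference, I can at least inspect the configured value on a live node (application:get_env/2 only shows what is configured; it does not answer whether a runtime change takes effect):

    (n1@myhost)1> application:get_env(kernel, prevent_overlapping_partitions).
    undefined            %% or {ok,false} / {ok,true}, depending on configuration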

This is an OTP 25 behavioural change (fix); you don’t mention running OTP 25. Also, if prevent_overlapping_partitions were part of your problem, you would definitely see warnings/errors in the log when an overlap is detected and a node reconnect ensues.

Probably a silly question, but have you tried verifying connectivity via the CLI (e.g., using telnet)?
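For example, the same thing a telnet to the epmd port (4369) would exercise can also be checked from an attached Erlang shell; the host and node names below are placeholders:

    (n1@myhost)1> net_adm:names("otherhost").
    {ok,[{"n2",45123}]}   %% remote epmd is reachable; node n2 is registered there
    (n1@myhost)2> net_adm:ping('n2@otherhost').
    pong                  %% the full distribution handshake also works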


As I understand it, the fix has also been released for OTP 24 but is disabled by default. I don’t think that our problems are due to the fix itself, but rather to the underlying problem that the fix addresses.
Connectivity is fine and the cluster is working fine. The problem arises when a node restarts (or a new node attempts to join the cluster): a restarting node will be connected to all nodes in its sync_nodes_optional list, but not to all nodes that these in turn are connected to.
A minimal example would be a cluster containing two nodes, N1 and N2, which are connected and working fine. A node N3 that attempts to join the cluster by connecting to N1 will succeed in connecting to N1, but not to N2, and a global: N3 failed to connect to N2 warning would appear after seven seconds.
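Spelled out as shell sessions, the hypothetical scenario would look like this (one terminal per node; the host name and cookie are placeholders):

    $ erl -sname n1 -setcookie c            # terminal 1
    $ erl -sname n2 -setcookie c            # terminal 2
    (n2@myhost)1> net_adm:ping(n1@myhost).
    pong                                    %% N1 and N2 now form the cluster
    $ erl -sname n3 -setcookie c            # terminal 3
    (n3@myhost)1> net_adm:ping(n1@myhost).
    pong
    (n3@myhost)2> nodes().
    [n1@myhost]    %% the failure mode: expected [n1@myhost,n2@myhost]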
/Håkan


global: N3 failed to connect to N2 is the warning that would be issued. Sorry for any confusion.


Right you are.

I tried to replicate your problem with a few different scenarios based on the information given, but was unsuccessful.

Question: are you using -connect_all false?
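(One quick way to check on a running node is init:get_argument/1, which reports whether a flag was passed on the command line:)

    (n1@myhost)1> init:get_argument(connect_all).
    error          %% flag not given; {ok,[["false"]]} if started with -connect_all false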


I haven’t been able to reproduce it either, but both our test and prod clusters are in this state. They have been running for many years, during which there have been hot and cold upgrades, OTP upgrades, OS upgrades, network problems, etc. I have no idea what has landed them in this state. It could even be that the OTP state is fine and there is something strange in the network configuration or elsewhere. The minimal example I gave with three nodes is only a hypothetical scenario.
That we noticed the problem only recently doesn’t mean that it hasn’t been there for a long time. We have an application running on all nodes that maintains the cluster connections by regularly pinging all nodes in sync_nodes_* (I don’t really know why; it predates all current members of the team). This covers up the problem, since explicit connects seem to work. We recently added a couple of nodes with fewer nodes in their sync config, which may have made the problem visible.
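For context, the pinger is essentially equivalent to this simplified sketch (not our actual code; the module name, interval and node-list handling are assumptions):

    %% Simplified sketch of the keep-alive pinger, not the real application.
    -module(cluster_pinger).
    -export([start/0]).

    start() ->
        spawn(fun loop/0).

    loop() ->
        %% Explicit pings re-establish any connections that have dropped.
        [net_adm:ping(N) || N <- nodes_to_ping()],
        timer:sleep(30000),                  %% interval is an assumption
        loop().

    nodes_to_ping() ->
        env_list(sync_nodes_mandatory) ++ env_list(sync_nodes_optional).

    env_list(Key) ->
        case application:get_env(kernel, Key) of
            {ok, Nodes} -> Nodes;
            undefined   -> []
        end.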
We don’t set connect_all.
It seems like some kind of deadlock is occurring. In the (hypothetical) three-node scenario, with nodes N1 and N2 running and connected, joining a new node N3 to the cluster by pinging N1 would after a while produce the global warning, as I described before. And as described, I can explicitly connect N3 and N2 using the exact same connect call that global uses (erlang:monitor_node(N2, true, [allow_passive_connect])), as well as with other calls. However, any attempt to connect N3 and N2 during the net_startuptime period between the N3-N1 connect and the appearance of the warnings will fail. So it seems that either N3 or N2 is blocked during this period. I have started to look into the global connect protocol but haven’t really made much progress.
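Concretely, the behaviour looks like this when probed from N3 (a sketch with the hypothetical node names, using net_kernel:connect_node/1 as a stand-in for the various connect calls I tried):

    %% On N3, immediately after pinging N1 (inside the net_startuptime window):
    (n3@myhost)1> net_kernel:connect_node(n2@myhost).
    false          %% the connection attempt fails during the window
    %% After the "global: N3 failed to connect to N2" warning has appeared:
    (n3@myhost)2> net_kernel:connect_node(n2@myhost).
    true           %% the very same call now succeeds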
/Håkan
