What could cause a node to take a long time to join an existing cluster?

Are there any reasons besides network instability that might cause a node to take a long time (~60s avg) to join an existing cluster?

What is the best way to debug this?

How do you detect when the node joined the cluster? Also, what OTP version are you running?

I’d start with the usual sequence: verify that the joining node can resolve the other node’s host name (through DNS or whatever other method you’re using to map host names to IP addresses). Then try telnet to make sure you can connect to the distribution port, which you should be able to get from epmd -names unless you have an epmd-less setup.
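
The same checks can also be run from an Erlang shell on the joining node, which has the advantage of using the resolver the VM itself uses. This is only a rough sketch: the module name dist_check and the helper names are made up for illustration, and the 5 second connect timeout is an arbitrary choice. Each step is timed, since a slow step is what we are looking for.

```erlang
-module(dist_check).
-export([check/2]).

%% Name is the short node name (the part before the @) as a string,
%% Host is the host part, e.g. "hostname.domain.name".
%% Returns [{Step, Milliseconds, Result}] for the three steps.
check(Name, Host) ->
    %% Step 1: can the VM resolve the host name, and how long does it take?
    Resolve = timed(resolve, fun() -> inet:gethostbyname(Host) end),
    %% Step 2: ask epmd on that host which nodes are registered there.
    Epmd = {_, _, Names} = timed(epmd_names, fun() -> erl_epmd:names(Host) end),
    %% Step 3: try a plain TCP connect to the node's distribution port.
    Dist = timed(dist_port, fun() -> try_dist_port(Name, Host, Names) end),
    [Resolve, Epmd, Dist].

timed(Step, Fun) ->
    {Micros, Result} = timer:tc(Fun),
    {Step, Micros div 1000, Result}.

try_dist_port(Name, Host, {ok, Registered}) ->
    case lists:keyfind(Name, 1, Registered) of
        {Name, Port} -> try_connect(Host, Port);
        false -> not_registered
    end;
try_dist_port(_Name, _Host, Error) ->
    %% epmd could not be reached, so there is no port to try.
    Error.

try_connect(Host, Port) ->
    case gen_tcp:connect(Host, Port, [binary, {active, false}], 5000) of
        {ok, Socket} -> gen_tcp:close(Socket), ok;
        {error, _} = Error -> Error
    end.
```

After compiling it, dist_check:check("nodename", "hostname.domain.name") gives one tuple per step, so a step that takes tens of seconds stands out right away.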

Max, thanks for your reply.

I’m running OTP 24 and trying to connect the nodes with net_adm:ping/1.

I’ve verified that the machines can connect to each other (with ping, and with telnet to the epmd port), so there shouldn’t be any DNS or networking issues at play here. That’s why I’m a bit stuck on how to debug this.

I started seeing this behavior after we migrated to a new deploy tool (we now use Ansible rather than makefiles). I’m not sure how this could have affected Erlang clustering, though. It is my understanding that all Erlang needs to cluster is 1) for the machines to be able to communicate with each other and 2) for the .erlang.cookie to be the same. I have verified both of these conditions to be true, yet for some reason the nodes are unable to cluster for a relatively long period of time.
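
For what it’s worth, both conditions can be checked from inside the running VM rather than from the OS, which rules out the node having started with a different cookie or name than the deploy tool intended. The sketch below is just an illustration: the module name join_probe is made up, and the node name in the usage line is a placeholder.

```erlang
-module(join_probe).
-export([probe/1]).

%% Reports what this node thinks its own name and cookie are, and how long
%% a single net_adm:ping/1 to Node takes. Run it on both nodes and compare
%% the cookie values. Note that the cookie is a secret, so avoid pasting
%% the output anywhere public.
probe(Node) when is_atom(Node) ->
    {Micros, Reply} = timer:tc(net_adm, ping, [Node]),
    #{self => node(),
      cookie => erlang:get_cookie(),
      ping => Reply,              % pong on success, pang on failure
      ping_ms => Micros div 1000}.
```

For example, join_probe:probe('nodename@hostname.domain.name') on the joining node shows whether the time is really spent inside the ping call.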

Could you try using net_kernel:connect_node/1?
Also, since it looks like you’re using the bundled epmd, could you try the name resolution this way:

erl_epmd:port_please('nodename', "hostname.domain.name", 7000).

I managed to solve my issue by rebooting my machine. I suspect changing the hostname of the machine without rebooting may have been causing these issues.
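
In case someone else suspects a stale hostname, here is a small sketch for comparing what the OS reports with what the node is actually using. The module name host_check is made up, and whether a mismatch here explains a slow join is only my guess, not something I have confirmed.

```erlang
-module(host_check).
-export([info/0]).

%% Compares the host name the OS reports with the host part of node(),
%% and checks whether the node's host part resolves from inside the VM.
info() ->
    {ok, OsHost} = inet:gethostname(),
    %% node() is always of the form name@host (nonode@nohost if not distributed).
    [_Name, NodeHost] = string:split(atom_to_list(node()), "@"),
    #{os_hostname => OsHost,
      node_host => NodeHost,
      node_host_resolves_to => inet:gethostbyname(NodeHost)}.
```

With long names the node host is fully qualified while inet:gethostname/0 usually returns the short name, so the two need to be compared by eye rather than for exact equality.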

However, I’m curious: what is the difference between net_adm:ping/1 and net_kernel:connect_node/1? Also, is it normal that erl_epmd:port_please/3 returns noport on clustered machines?

net_adm:ping/1 does a lot more than net_kernel:connect_node/1: for example, it makes an extra gen_server:call to the net_kernel gen_server on the remote node. It executes extra code paths, and may actually fail if implicit connections are disabled (see erl -kernel dist_auto_connect never).
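
To see the difference on a live system, the two calls can be timed back to back. This is a minimal sketch with a made-up module name (connect_compare). It connects explicitly first, so the ping that follows runs over an already established connection and mostly measures the extra call to the remote net_kernel.

```erlang
-module(connect_compare).
-export([run/1]).

%% Times an explicit connect followed by a ping to the same node, and
%% reports the dist_auto_connect setting (undefined unless it was set).
run(Node) when is_atom(Node) ->
    AutoConnect = application:get_env(kernel, dist_auto_connect),
    %% Returns true, false, or ignored (if the local node is not distributed).
    {ConnUs, ConnReply} = timer:tc(net_kernel, connect_node, [Node]),
    %% Returns pong or pang.
    {PingUs, PingReply} = timer:tc(net_adm, ping, [Node]),
    #{dist_auto_connect => AutoConnect,
      connect => ConnReply,
      connect_ms => ConnUs div 1000,
      ping => PingReply,
      ping_ms => PingUs div 1000}.
```

If connect_ms is small and ping_ms is large, the delay is more likely on the remote side than in connection setup.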
