Occasional nodedown/noconnection issues

We’re running a mildly loaded ejabberd instance (something like 50k Erlang processes); the OS load is usually below 1. The instance is queried once every few minutes by local clients using rpc/erpc (to perform simple monitoring/maintenance tasks). After upgrading from OTP 23.2.1 to 26.2.5, those clients occasionally fail with nodedown/noconnection errors (a few times per day). The only thing I’ve tried so far is setting the kernel parameter net_setuptime to 30, which made no difference.
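For context, a minimal sketch of what such a monitoring client might look like: an escript doing an erpc call against the local node. The node name, cookie, and queried function here are assumptions for illustration, not the actual setup; this needs a running distributed node to do anything.

```erlang
#!/usr/bin/env escript
%%! -hidden -sname monitor_client -setcookie mycookie -kernel net_setuptime 30
%% Hypothetical monitoring client. The %%! line passes emulator flags,
%% including the net_setuptime setting mentioned above.
main(_Args) ->
    Node = 'ejabberd@localhost',  %% assumed node name
    %% erpc:call/4 raises an error exception with reason {erpc, noconnection}
    %% if the connection to Node is lost.
    try erpc:call(Node, erlang, memory, [total]) of
        Bytes -> io:format("total memory: ~p bytes~n", [Bytes])
    catch
        error:{erpc, noconnection} ->
            io:format(standard_error, "nodedown/noconnection~n", []),
            halt(1)
    end.
```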

I guess I’ll get that issue tracked down eventually, but if anyone has any ideas, please go ahead!


Have you tried kernel’s {connect_all, false} for local clients?

It’s a single-node setup, so the local clients just talk to the local server on the loopback interface with no additional (remote) nodes involved. Which is why those connection issues seem so unexpected to me.

So I guess no difference is to be expected from that option? (But thank you for the idea!)

There are a few possibilities here, but since you mention going from 23 to 26, one that comes to mind is overlapping partition prevention. Say your clients are external to the single node you mention (e.g., escripts, or other nodes on the same server). Two clients could issue an erpc call (spawn request) around the same time: client A gets its response and happily disconnects, while client B sent its request at just the right moment. When client A disconnected, a rolling disconnect between all nodes ensued, and now client B gets a nodedown/noconnection.

If this were the case, you would see a log message on your main node about global disconnecting from other nodes to prevent overlapping partitions.

The above has a lot of assumptions in it, namely around what these clients are. If those assumptions are somewhat correct, then you can try disabling overlap prevention with the kernel parameter -kernel prevent_overlapping_partitions false (you should read about the caveats of this, though).
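For completeness, the flag would go on the server's command line or in its vm.args file; something like the following (check the caveats in the kernel application docs before using it):

```
# Appended to ejabberd's vm.args (or the erl command line):
-kernel prevent_overlapping_partitions false
```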

Now the shot in the dark is out of the way. Can you tell us more about the clients and the setup in general?


It’s all escripts using erpc to talk to the ejabberd node, or shell scripts that wrap our ejabberdctl tool, which in turn invokes erl -hidden and (for some reason) has it perform an explicit net_kernel:connect_node/1 call followed by rpc calls.
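If I understand the ejabberdctl path correctly, that explicit-connect flow is roughly the following. This is a sketch: the node name and the called module/function are illustrative assumptions, and it only runs inside a distributed (hidden) node.

```erlang
%% Run from within a node started as: erl -hidden -sname ctl -setcookie ...
%% Sketch of the explicit net_kernel:connect_node/1 + rpc flow.
Node = 'ejabberd@localhost',  %% assumed node name
case net_kernel:connect_node(Node) of
    true ->
        %% Unlike erpc, old-style rpc:call/4 returns {badrpc, nodedown}
        %% rather than raising when the connection is lost.
        case rpc:call(Node, ejabberd_admin, status, []) of
            {badrpc, Reason} -> io:format("rpc failed: ~p~n", [Reason]);
            Status -> io:format("~p~n", [Status])
        end;
    false ->
        io:format("could not connect to ~p~n", [Node])
end.
```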

These jobs are scheduled to run every few minutes by systemd timers. The interval between invocations should be large enough to ensure only a single client is running at any given time, but I can’t entirely rule out them getting in each other’s way, so I was thinking along those lines as well.
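As an aside, one way to rule out overlap at the systemd level is a timer driving a Type=oneshot service: systemd won't trigger the timer again while a previous activation is still running. Unit and script names below are hypothetical.

```ini
# /etc/systemd/system/ejabberd-monitor.timer  (hypothetical name)
[Unit]
Description=Periodic ejabberd monitoring

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target

# /etc/systemd/system/ejabberd-monitor.service
[Unit]
Description=ejabberd monitoring client

[Service]
Type=oneshot
ExecStart=/usr/local/bin/monitor.escript
```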

So your suggestion sounds like a great thing to try. However, the issue hasn’t shown up for 3+ days now (hence my late reply). No idea why; “I changed nothing.” If we do run into the problem again, I’ll look into playing with prevent_overlapping_partitions. Many thanks for that idea.


Yeah, that wouldn’t be an issue with hidden nodes AFAIK. Hope it doesn’t crop back up!