Distributed Erlang connected nodes debugging

tzumby · September 27, 2023, 11:33pm

Hi there,

I’m looking into ways of debugging socket connections in a disterl cluster. What’s the best way to inspect the reasons why nodes disconnect? I have been running a 30-node cluster and I noticed some flakiness when running erlang:length(erlang:nodes()). from one of the connected nodes.

Wondering what type of digging I can do to look into why that would happen. I read about :sys.get_state(:net_kernel) and I used that to see all the ports and connections opened. Would I be able to just monitor one of the ports to get more info about why it would disconnect? I first suspected that this may be caused by the net_ticktime / interval somehow, but I bumped that to get a MinT of more than 1 min and I still saw nodes going down before that minute runs out.

Running on Erlang/OTP 26 [erts-14.0.1]

Thanks in advance!

starbelly · September 28, 2023, 10:22am

Silly question, but do you have logging turned on? I ask because it sounds like you don’t observe an issue until you check the size of the cluster.

There’s already a lot of useful information logged via OTP. That said, have you looked at net_kernel:monitor_nodes/1 or net_kernel:monitor_nodes/2?
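As a rough illustration of that suggestion, the sketch below subscribes a process to node status messages and logs the reason reported with each nodedown. The module name node_watch is made up for this example. The nodedown_reason and node_type options are documented options of net_kernel:monitor_nodes/2, and documented reasons include net_tick_timeout and connection_closed.

```erlang
-module(node_watch).
-export([start/0]).

%% Spawn a process that subscribes to node status changes and logs them.
%% (node_watch is a hypothetical name used only for this sketch.)
start() ->
    spawn(fun() ->
        %% nodedown_reason adds {nodedown_reason, Reason} to nodedown messages;
        %% {node_type, all} also reports hidden nodes.
        ok = net_kernel:monitor_nodes(true, [nodedown_reason, {node_type, all}]),
        loop()
    end).

loop() ->
    receive
        {nodeup, Node, Info} ->
            logger:notice("nodeup ~p ~p", [Node, Info]),
            loop();
        {nodedown, Node, Info} ->
            Reason = proplists:get_value(nodedown_reason, Info),
            logger:warning("nodedown ~p reason=~p", [Node, Reason]),
            loop();
        stop ->
            ok
    end.
```

Calling node_watch:start(). on one of the flaky nodes should be enough to see whether disconnects are reported as tick timeouts or as closed connections.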

Other question, is this an experiment on a single machine or distributed across multiple machines and possibly in many different regions?

max-au · September 29, 2023, 3:54am

I’m not claiming it’s the best method, but I did use tcpdump for a similar investigation. Alternatively, if the machine you’re running this on has a GUI, you could try Wireshark.

starbelly · September 29, 2023, 4:03am

^ I’m a fan. I usually head there when the problem is still elusive. Which makes me think of something that would be terribly nice to have for Wireshark: there’s a dist erl plugin, but it is quite cryptic, and it would be nice to enhance it.

I also had one more question for @tzumby. This might be yet another silly question, but I wonder if there’s an expectation problem. I’m making an assumption here, so pardon me for that, but I wonder if the total number of nodes in what I assume (that’s twice) is a fully connected cluster is 30, but when you run erlang:nodes/0 you’re seeing 29?

If by chance that is true, it is to be expected: the list of nodes returned by erlang:nodes/0 does not include the node you’re operating on, so the expectation should be 29. Just a shot in the dark.
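A minimal sketch of the difference, using a made-up module name (cluster_size): nodes/0 lists only the other visible nodes, so the local node has to be added back to get the full cluster size.

```erlang
-module(cluster_size).
-export([visible/0, total/0]).

%% Number of other visible nodes this node is currently connected to.
visible() -> length(nodes()).

%% Cluster size as seen from here, counting the local node as well.
total() -> length([node() | nodes()]).
```

On a healthy 30-node mesh, visible/0 would be expected to return 29 and total/0 to return 30.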

tzumby · September 29, 2023, 11:17am

Thanks @starbelly and @max-au!

I was thinking that tcpdump-ing the traffic would be my last resort, but that’s a great idea. Before that, while digging in the Erlang docs, I found out about inet:i(). After seeing that, I found a better way: get all the ports from the :net_kernel state, loop over the open ports and trace each one with erlang:trace/3. So I’m sitting there running flush(). in the console.
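A sketch of one way to set up such a trace. It is not the exact code used here: it assumes the distribution connections are controlled by ports (the default TCP carrier) and finds them with erlang:system_info(dist_ctrl) rather than by digging through the :net_kernel state. The module name dist_trace is invented for the example.

```erlang
-module(dist_trace).
-export([trace_dist_ports/0]).

%% Enable tracing on every distribution connection that is controlled by a port.
%% Trace messages are delivered to the calling process, e.g. the shell,
%% where they can be inspected with flush().
%% Returns the list of {Node, Port} pairs that are now being traced.
trace_dist_ports() ->
    [{Node, Ctrl} || {Node, Ctrl} <- erlang:system_info(dist_ctrl),
                     is_port(Ctrl),
                     enable(Ctrl)].

enable(Port) ->
    try erlang:trace(Port, true, [send, 'receive', ports, timestamp]) of
        _ -> true
    catch
        error:badarg -> false   % the port was closed before tracing was enabled
    end.
```

The timestamp flag is what produces trace_ts tuples like the ones shown in this thread.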

I get some info I already knew though:

Shell got {trace_ts, #PID<0.1349.0>, receive, {tcp_closed, #Port<0.24>}, 1,
 {1695, 869472, 819890}}
Shell got {trace_ts, #PID<0.1349.0>, getting_unlinked, #Port<0.24>, 1,
 {1695, 869472, 819893}}
Shell got {trace_ts, #PID<0.1349.0>, exit, tcp_closed, 1, {1695, 869472, 819897}}
Shell got {trace_ts, #PID<0.1349.0>, out_exited, 0, 1, {1695, 869472, 819900}}

Just to test this out, if I connect just two nodes and close the console in one of them, I’m getting something different:

Shell got {trace_ts,#Port<0.9>,send,
                    {tcp_closed,#Port<0.9>},
                    <0.97.0>,4,
                    {1695,985570,733677}}
Shell got {trace_ts,#Port<0.9>,closed,normal,4,{1695,985570,733686}}

I did get this trace message though, which I found odd. I don’t know enough about Erlang to tell if this could be the problem:

Shell got {trace_ts, #PID<0.1343.0>, gc_major_start,
 [
  {wordsize, 6},
  {old_heap_block_size, 0},
  {heap_block_size, 610},
  {mbuf_size, 0},
  {recent_size, 205},
  {stack_size, 21},
  {old_heap_size, 0},
  {heap_size, 583},
  {bin_vheap_size, 53},
  {bin_vheap_block_size, 46422},
  {bin_old_vheap_size, 0},
  {bin_old_vheap_block_size, 46422}
 ], 1, {1695, 869451, 524642}}
Shell got {trace_ts, #PID<0.1343.0>, gc_major_end,
 [
   {wordsize, 553},
   {old_heap_block_size, 0},
   {heap_block_size, 233},
   {mbuf_size, 0},
   {recent_size, 30},
   {stack_size, 21},
   {old_heap_size, 0},
   {heap_size, 30},
   {bin_vheap_size, 53},
   {bin_vheap_block_size, 46422},
   {bin_old_vheap_size, 0},
   {bin_old_vheap_block_size, 46422}
 ], 1, {1695, 869451, 524649}}

Also worth mentioning: the size of the cluster fluctuates widely. Sometimes only 5 are connected, then a few seconds later I can see the number go up every time I run it, back to 29, then drop again. I can’t tell yet if it coincides with the garbage collection.

starbelly · September 29, 2023, 3:55pm

The gc is slightly interesting, but I think it’s only semi-related. A major gc here makes sense if there are a lot of messages being slung around, but it would indeed have to be quite a lot. There’s also not enough information about what your cluster is doing to gauge whether that’s abnormal or not.

FWIW, you can also do such traces using observer.

That said, I have a hunch, without knowing your setup. I have seen cases where overlap detection causes a rolling disconnect across a fully connected cluster (this is the point though), and if you have a misbehaving node (and perhaps nodes take turns misbehaving), this can explain the seemingly random disconnects, etc. If you had logging enabled, you would see this in the console; however, you have not stated whether logging is enabled or disabled. You could try disabling this feature, once again as a shot in the dark. You can start up your node(s) with the following param: -kernel prevent_overlapping_partitions false. This also makes an assumption about the major version of Erlang/OTP you are on, but without a lot of information to go on, assumptions are inevitable.
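For anyone who prefers a config file over command-line flags, the same kernel parameter can be set in sys.config. This is only a sketch of the equivalent setting, and it assumes an OTP release in which global's overlapping partition prevention is enabled by default (OTP 25 and later).

```erlang
%% sys.config (loaded with: erl -config sys)
[
 {kernel, [
   %% Turn off global's overlapping partition prevention.
   %% Intended here as a diagnostic step, not as a recommended permanent setting.
   {prevent_overlapping_partitions, false}
 ]}
].
```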

That shot in the dark aside, can you provide more information about your setup? Major version of OTP, how the nodes are distributed, and what your cluster is doing. That last point can tell us a lot in regard to your gc concern. If you have a cluster connected and not doing anything, then the gc event is indeed more interesting (assuming default gc settings are in place).

max-au · September 29, 2023, 4:45pm

GC messages are normal and expected. I second @starbelly’s suggestion to try turning global off (running with -connect_all false).

Now, for the net_ticktime, I would actually suggest lowering it to 30 seconds. That will increase the frequency with which tick messages are sent. The hypothesis I am trying to test is that you may have some firewall/network device that plays smart and drops your “idle” network connections. Adding some traffic to these connections may force the device to keep the connection open.
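Besides restarting with -kernel net_ticktime 30, the tick time can also be changed on a running cluster with net_kernel:set_net_ticktime/1, which has to be called on every node. The sketch below does that through erpc:multicall/5; the module name ticktime and the 5 second call timeout are arbitrary choices for the example.

```erlang
-module(ticktime).
-export([set_everywhere/1]).

%% Ask the local node and all currently connected nodes to switch
%% net_ticktime to Seconds. Returns [{Node, Result}], where Result is
%% erpc's per-node result, e.g. {ok, change_initiated}.
set_everywhere(Seconds) when is_integer(Seconds), Seconds > 0 ->
    Nodes = [node() | nodes()],
    Results = erpc:multicall(Nodes, net_kernel, set_net_ticktime, [Seconds], 5000),
    lists:zip(Nodes, Results).
```

Nodes that are disconnected at the time of the call are missed, so for a flaky cluster the startup flag is probably the more reliable way to run the experiment.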

tzumby · September 29, 2023, 5:07pm

I have run into the very issue @starbelly is describing with :global disconnecting nodes because it detected partitions initially.

I’m running this on Erlang/OTP 26 [erts-14.0.1] by the way. And indeed, I’m running without :global and starting the distribution manually. The net_ticktime you see here is actually something I was exploring, and maybe I made it worse :). I was thinking that I was flooding the TCP connection with too many ticks and that was what caused it. I’ll try lowering it to 30 as @max-au suggested. Should I leave the net_tickintensity at 4?

-kernel net_setuptime 30
-kernel net_ticktime 120
-kernel connect_all false
-kernel start_distribution false
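To double-check which values a running node actually picked up from flags like these, something along the following lines can be run on the node. The module name dist_settings is made up; the fallback values are the documented kernel defaults (net_ticktime 60, net_tickintensity 4, net_setuptime 7, connect_all true).

```erlang
-module(dist_settings).
-export([effective/0]).

%% Collect the distribution-related kernel parameters as seen by this node.
%% live_ticktime is what net_kernel is currently using; it is the atom
%% 'ignored' when the node is not distributed.
effective() ->
    #{net_ticktime      => application:get_env(kernel, net_ticktime, 60),
      net_tickintensity => application:get_env(kernel, net_tickintensity, 4),
      net_setuptime     => application:get_env(kernel, net_setuptime, 7),
      connect_all       => application:get_env(kernel, connect_all, true),
      live_ticktime     => net_kernel:get_net_ticktime()}.
```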

starbelly · September 29, 2023, 5:51pm

Yes, leave it as is. It’ll cause a flood, but that’s the point, and I agree with @max-au’s test here.

Edit:

If this holds true, then it’s time to break out tcpdump or Wireshark. But also, you say you’re starting the distribution manually; are you quite sure global isn’t at play here?

tzumby · September 30, 2023, 1:40am

It looks like the net_ticktime setting of 30, all else being the same, fixes the flakiness entirely. Thanks so much for your help!

@starbelly good point about global. I’m starting the distribution with net_kernel:start(app, #{name_domain => longnames}) and also using a very simple module I wrote in place of epmd that just responds with a static port (I’m running one app per EC2 instance). From what I understand, this doesn’t start global or rely on global to maintain the mesh, but I could be wrong?
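For readers unfamiliar with that approach, here is a hypothetical sketch of what an epmd replacement with a static port can look like. It is not the module used in this thread: the name static_epmd and port 9100 are invented, and the callback set follows the alternative node discovery interface described in the ERTS documentation.

```erlang
-module(static_epmd).
-export([start_link/0, register_node/2, register_node/3,
         port_please/2, port_please/3, listen_port_please/2, names/1]).

%% Hypothetical fixed distribution port shared by every node.
-define(DIST_PORT, 9100).

%% No helper process is needed, so tell the supervisor to skip this child.
start_link() -> ignore.

%% Nothing to register with; just hand back a creation value.
register_node(Name, Port) -> register_node(Name, Port, inet).
register_node(_Name, _Port, _Family) -> {ok, rand:uniform(3)}.

%% Every remote node is assumed to listen on the same static port.
port_please(_Name, _Host) -> {port, ?DIST_PORT, 5}.
port_please(Name, Host, _Timeout) -> port_please(Name, Host).

%% The local node listens on the static port too.
listen_port_please(_Name, _Host) -> {ok, ?DIST_PORT}.

%% There is no registry to list names from.
names(_Host) -> {error, address}.
```

A module like this is selected with -epmd_module static_epmd together with -start_epmd false, and it only works when each host runs a single node, as is the case with one app per instance.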

max-au · September 30, 2023, 3:20am

That sounds like a network device (or a specific TCP/IP stack implementation) that is not exactly RFC-compliant. Which is, indeed, a common issue, making this guess not too wild.
These ticks are tiny, and won’t create any visible network load. Even for 40.000 nodes in the cluster, it’s not too expensive, we’ve been running it that way (with 30 sec tick time).

starbelly · September 30, 2023, 7:32pm

Nope, that makes sense. If you really don’t want nodes to ever disconnect you might want to look at -kernel dist_auto_connect never-kernel (i.e., you don’t want a node to try to perform a call on another node and connect automatically).

Interesting about the net ticktime, because you had set it to 120 seconds with an intensity of 4, which means a tick every 30 seconds. That should have been sufficient since, iirc, AWS load balancers and what have you default to 60 seconds. Maybe an edge case with timing.

I myself would want to understand this more. Also, a question: what led you to bump the tick time up to 120?
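The arithmetic can be spelled out with a small helper. It follows the formulas in the kernel documentation: a tick is sent every TickTime / TickIntensity seconds on an otherwise silent connection, and an unresponsive node is detected somewhere between MinT and MaxT. The module name tick_math is made up for the example.

```erlang
-module(tick_math).
-export([window/2]).

%% TickTime is net_ticktime in seconds, Intensity is net_tickintensity.
%% Returns the tick interval and the detection window [MinT, MaxT].
window(TickTime, Intensity) ->
    Interval = TickTime / Intensity,
    #{tick_interval => Interval,
      min_t => TickTime - Interval,
      max_t => TickTime + Interval}.
```

tick_math:window(120, 4) gives a 30.0 second interval with a 90.0 to 150.0 second detection window, while tick_math:window(30, 4) gives a 7.5 second interval with a 22.5 to 37.5 second window.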

tzumby · October 1, 2023, 1:12pm

I was wondering if that ticktime still applies to super large clusters. Wouldn’t the overall increase in chatter solve the problem of idle connections being dropped?

I don’t have any experience running clusters this large and this is mostly an exploration for learning. But something @starbelly mentioned about never wanting nodes to disconnect made me think of the first question I should have asked: “Should the network be absolutely stable, with no nodes going in and out?” My assumption was that it should be, at least at idle, where the only traffic is the mesh network maintenance.

I’m also not sure how the network is maintained without global. I was reading about how, when it detects a node down, it broadcasts to the whole network and causes other nodes to disconnect from the dead node. Guessing that just doesn’t happen w/o global?

tzumby · October 1, 2023, 1:13pm

The 120 figure is purely random, I just doubled the default.

starbelly · October 1, 2023, 5:10pm

Firstly, I made a typo: with -kernel dist_auto_connect never, no automagic will happen if, say, you perform an rpc call to a remote node. You may also set this option to once instead of the default always. It all depends on the properties you need to solve your problems.

You should not have been experiencing the instability you were before and, as stated, IMHO I would look further into that. I like to understand problems when there’s time, though Maxim’s assessment is plausible: there may be connections detected as idle by a middlebox, and a RST or FIN/ACK is sent because it’s trying to “help” you.
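With dist_auto_connect set to never, connections have to be made explicitly, for example with net_kernel:connect_node/1. A minimal sketch, with a made-up module name (mesh_connect):

```erlang
-module(mesh_connect).
-export([connect_all/1]).

%% Try to connect to each peer explicitly.
%% Returns {Connected, Failed}; a peer counts as failed when connect_node/1
%% returns false, or ignored (which happens when the local node is not alive).
connect_all(Peers) ->
    lists:partition(
      fun(Peer) -> net_kernel:connect_node(Peer) =:= true end,
      Peers).
```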

That said, you can never count on anything network wise being stable, that’s distributed myth #1 - the network is reliable. And you are in the cloud, so you really have to expect and design for the unexpected.

Something else to think about is how much you plan to scale this cluster. Will it always stay around 30ish? Do you plan for hundreds? It’s still not clear to me whether you want a fully connected cluster or whether you are building a partially connected setup (i.e., one that can scale to any number of nodes). Fully connected clusters come at a price and have their limitations (i.e., a combinatorial increase in traffic with each node added, at least as far as ticks and such go).

FWIW a fully connected cluster can scale up to ~300 nodes (at least). If you’re aiming at thousands, this is not what you want.

Global will still monitor nodes if you go with -kernel connect_all false and even with -kernel dist_auto_connect never. Global subscribes to the events that are emitted by net_kernel (nodeup, nodedown, etc.) and will take action on those. For example, if you use both options above, and you manually start distribution, and you manually connect all three nodes to each other to form a fully connected cluster, and one of the nodes goes down, the other nodes will see this and global will perform a disconnect from it. You then need to monitor for nodeup events in another process and connect them all to the node that just came back up. In other words, responsibilities that global had are now your responsibility.
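To put numbers on the cost of a full mesh: every node holds a connection to every other node, so the total number of connections grows quadratically. The helper below (mesh_math is a made-up name) only counts connections; it says nothing about where the practical limit lies.

```erlang
-module(mesh_math).
-export([links/1, per_node/1]).

%% Total number of distribution connections in a fully connected cluster.
links(N) when is_integer(N), N >= 1 -> N * (N - 1) div 2.

%% Number of connections (and tick streams) each single node maintains.
per_node(N) when is_integer(N), N >= 1 -> N - 1.
```

For 30 nodes that is 435 connections in total and 29 per node; for 300 nodes it is 44850 in total and 299 per node.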
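One possible shape for taking over that responsibility is sketched below. It is a simplification of the idea rather than a tested recipe: instead of reacting to nodeup, each node keeps a static peer list and retries dropped peers on a timer. The module name mesh_keeper and the 5 second retry delay are arbitrary.

```erlang
-module(mesh_keeper).
-export([start/1]).

-define(RETRY_MS, 5000).

%% Start a process that keeps this node connected to a fixed list of peers.
start(Peers) ->
    spawn(fun() ->
        ok = net_kernel:monitor_nodes(true),
        %% Schedule an immediate connection attempt for every peer.
        [self() ! {retry, P} || P <- Peers, P =/= node()],
        loop(Peers)
    end).

loop(Peers) ->
    receive
        {nodedown, Node} ->
            %% A known peer dropped: schedule a reconnect attempt.
            case lists:member(Node, Peers) of
                true  -> erlang:send_after(?RETRY_MS, self(), {retry, Node});
                false -> ok
            end,
            loop(Peers);
        {retry, Node} ->
            case net_kernel:connect_node(Node) of
                true -> ok;
                _    -> erlang:send_after(?RETRY_MS, self(), {retry, Node})
            end,
            loop(Peers);
        {nodeup, _Node} ->
            loop(Peers);
        stop ->
            ok
    end.
```

A production version would more likely be a supervised gen_server with backoff, but the message flow would stay the same.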

Also note, and this is iirc, with those two options on you won’t be able to make a lot of use of global.

Pretty much, yes.

I think it’ll help us to understand what you need, but it sounds like you are experimenting, which is good, and so you’re not sure yet what properties you desire.

Still, if I were in your shoes, the first thing I would do is set things back to how they were and begin to understand why you were getting disconnects. The way it reads to me is that some ticks were not making it across regions. The other possibility is simply the default timeouts for some of these components in AWS, specifically the default timeout of 60. However, you are right, with 30 nodes you should not hit that case, so it sounds more to me like ticks didn’t make it through. Still, it would be interesting to revert your settings (120 net tick time), bump the default timeouts elsewhere in AWS, and see what you get. Interesting problems!

Edit:

I forgot to note, if you have -connect_all false, you won’t have to disable overlapping partition prevention, yet, it is something you will have to take into account and design for (the feature exists for great good!).