Gun gen_statem process stopped handling queued messages

I use gun as an HTTP client in my application and reuse opened connections to handle multiple requests.
It usually works fine, but recently my application stopped handling requests, and it turned out that the gun connection processes had unhandled messages piling up in their mailboxes, like this:

rp(process_info(<0.31175.48>, messages)).
{messages,[{'$gen_cast',{request,<0.11099.53>,
                                 #Ref<0.3234177338.3490709505.118665>,<<"POST">>,
                                 <<"/v1/topic/get">>,
                                  [{<<"content-type">>,<<"application/json">>}],
                                 <<"{}">>,infinity}},
           {'$gen_cast',{request,<0.12338.53>,
                                 #Ref<0.3234177338.3095658499.71403>,<<"POST">>,
                                 <<"/v1/topic/get">>,
                                  [{<<"content-type">>,<<"application/json">>}],
                                 <<"{}">>,infinity}}]}

And every new request just adds a new message to that mailbox.

I’ve looked into gun’s code and I can see that it should handle such messages without a problem in the ‘connected’ state; here is the relevant clause from gun:

connected(cast, {request, ReplyTo, StreamRef, Method, Path, Headers, Body, InitialFlow},
		State=#state{origin_host=Host, origin_port=Port,
			protocol=Protocol, protocol_state=ProtoState, cookie_store=CookieStore0,
			event_handler=EvHandler, event_handler_state=EvHandlerState0}) ->
	{Commands, CookieStore, EvHandlerState} = Protocol:request(ProtoState,
		dereference_stream_ref(StreamRef, State), ReplyTo,
		Method, Host, Port, Path, Headers, Body,
		InitialFlow, CookieStore0, EvHandler, EvHandlerState0),
	commands(Commands, State#state{cookie_store=CookieStore,
		event_handler_state=EvHandlerState});

I tried to check the status of that connection process (with a big timeout, just in case):
sys:get_status(<0.31175.48>, 600000).

But it didn’t work; I only found a new system message in the gun connection process’s mailbox:
{system,{<0.22180.130>,#Ref<0.3234177338.4042522626.87030>}, get_status}

So I can’t find out what state this gen_statem server is in, and I don’t know what caused it to switch out of the ‘connected’ state.
The process_info looks like this:

process_info(<0.31175.48>).          
[{current_function,{gen,do_call,4}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,8},
 {links,[<0.1197.0>]},
 {dictionary,[{'$initial_call',{gun,init,1}},
              {'$ancestors',[gun_conns_sup,gun_sup,<0.1195.0>]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.1194.0>},
 {total_heap_size,8370},
 {heap_size,4185},
 {stack_size,63},
 {reductions,1402917},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,6}]},
 {suspending,[]}]

The protocol used by the connection is HTTP/2, with TLS enabled.

So I’m stuck and can’t figure out why this could happen. I’m sure this process was in the ‘connected’ state, because I stored its pid in my pool of connections when the connection was established.
Has anybody experienced similar problems with gun (the connection process is alive but all requests through the connection fail), and does anyone have advice on how to diagnose and solve this?

As the current function is gen:do_call/4, it looks like this process is making a gen_server (or gen_statem) call and that’s where it’s blocked. Check the backtrace of the process to see where it’s making that call.

Thanks, what’s the right way to do that?
I tried to erl -remsh to my app and open trace port, but got error:

dbg:tracer(port, dbg:trace_port(ip, 50000)).
** exception error: undefined function dbg:trace_port/2

The erlang version on the server is 23.

Something like {_, BT} = process_info(pid(0,31175,48), backtrace), io:format("~s~n", [BT]).

Thanks, I did
rp(process_info(<0.3238.0>, backtrace)).

and got this:

{backtrace,<<"Program counter: 0x00007f3542681490 (gen:do_call/4 + 280)\n"
"y(0)     []\n"
"y(1)     infinity\n"
"y(2)     []\n"
"y(3)     #Ref<0.3234177338.3271557122.27248>\n"
"y(4)     <0.21289.50>\n"
"\n"
"0x00007f34ab0c1630 Return addr 0x00007f34c72fb808 (gen_statem:call_dirty/4 + 104)\n"
"y(0)     infinity\n"
"y(1)     {start,infinity}\n"
"y(2)     <0.21289.50>\n"
"y(3)     Catch 0x00007f34c72fb840 (gen_statem:call_dirty/4 + 160)\n"
"\n"
"0x00007f34ab0c1658 Return addr 0x00007f3542a7e1c8 (ssl_gen_statem:call/2 + 72)\n"
"y(0)     Catch 0x00007f3542a7e1d8 (ssl_gen_statem:call/2 + 88)\n"
"\n"
"0x00007f34ab0c1668 Return addr 0x00007f3542a751d0 (ssl_gen_statem:handshake/2 + 152)\n"
"y(0)     []\n"
"y(1)     {sslsocket,{gen_tcp,#Port<0.5296215>,tls_connection,undefined},[<0.21289.50>,<0.22818.50>]}\n"
"\n"
"0x00007f34ab0c1680 Return addr 0x00007f3542a94790 (tls_gen_connection:start_fsm/8 + 1240)\n"
"y(0)     []\n"
"y(1)     []\n"
"y(2)     []\n"
"y(3)     []\n"
"y(4)     []\n"
"y(5)     []\n"
"y(6)     []\n"
"y(7)     []\n"
"y(8)     []\n"
"y(9)     []\n"
"y(10)    Catch 0x00007f3542a947a0 (tls_gen_connection:start_fsm/8 + 1256)\n"
"\n"
"0x00007f34ab0c16e0 Return addr 0x00007f3542a74fc8 (ssl_gen_statem:connect/8 + 104)\n"
"y(0)     Catch 0x00007f3542a74fd8 (ssl_gen_statem:connect/8 + 120)\n"
"\n"
"0x00007f34ab0c16f0 Return addr 0x00007f34c5189890 (gun:normal_tls_handshake/4 + 384)\n"
"y(0)     []\n"
"y(1)     gun_default_event_h\n"
"y(2)     []\n"
"y(3)     [http2,http]\n"
"y(4)     #{socket=>#Port<0.5296215>,timeout=>infinity,tls_opts=>[{alpn_advertised_protocols,[<<\"http/1.1\">>,<<\"h2\">>]},{client_preferred_next_protocols,{client,[<<\"http/1.1\">>,<<\"h2\">>],<<\"http/1.1\">>}},{server_name_indication,\"management.me\"},{alpn_advertised_protocols,[<<\"http/1.1\">>,<<\"h2\">>]},{client_preferred_next_protocols,{client,[<<\"http/1.1\">>,<<\"h2\">>],<<\"http/1.1\">>}}]}\n"
"y(5)     {state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}\n"
"y(6)     undefined\n"
"\n"
"0x00007f34ab0c1730 Return addr 0x00007f34c5188518 (gun:initial_tls_handshake/3 + 416)\n"
"y(0)     []\n"
"y(1)     []\n"
"y(2)     2\n"
"y(3)     []\n"
"y(4)     []\n"
"\n"
"0x00007f34ab0c1760 Return addr 0x00007f34c72fea10 (gen_statem:loop_state_callback/11 + 872)\n"
"y(0)     []\n"
"y(1)     []\n"
"y(2)     true\n"
"y(3)     false\n"
"y(4)     []\n"
"y(5)     false\n"
"y(6)     []\n"
"y(7)     {initial_tls_handshake,{state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}}\n"
"y(8)     [{internal,{retries,2,#Port<0.5296215>}}]\n"
"y(9)     {state,{initial_tls_handshake,{state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}},[],#{},false}\n"
"y(10)    []\n"
"y(11)    {params,state_functions,false,<0.1197.0>,[gun],<0.3238.0>,infinity}\n"
"y(12)    Catch 0x00007f34c72fea80 (gen_statem:loop_state_callback/11 + 984)\n"
"\n"
"0x00007f34ab0c17d0 Return addr 0x00007f34c7c83c00 (proc_lib:init_p/5 + 264)\n"
"y(0)     []\n"
"y(1)     []\n"
"y(2)     []\n"
"y(3)     Catch 0x00007f34c7c83c10 (proc_lib:init_p/5 + 280)\n"
"\n"
"0x00007f34ab0c17f8 Return addr 0x000000000092aa38 (<terminate process normally>)\n"
"">>}

Your FSM is currently stuck in a call to another FSM. In this case it appears to be related to TLS connection handling, during the handshake. The pid of that FSM is <0.21289.50>.

Since that call has an infinity timeout, your own FSM is going to be stuck until the other one returns; if it’s hung (because it’s waiting on your FSM or on a broken external connection), then you’re also going to be stuck forever.

Your options are likely to be one of:

  • put a timeout on the call (and possibly handle the errors)
  • make the call asynchronous by moving it to a sub-process, so that even if it hangs it doesn’t block the FSM
  • investigate and resolve the issue with the TLS handshake to make sure it can’t happen (may not always be possible if the causes are outside of your control)
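For the first option, here is a minimal sketch of bounding connection setup with explicit timeouts, using gun’s gun:open/3 and gun:await_up/2. The function name open_conn/2 and the timeout values are illustrative, not from gun itself:

```erlang
%% Sketch: bound connection setup so a stuck TLS handshake can't
%% block callers forever.
open_conn(Host, Port) ->
    %% gun:open/3 returns immediately; the TCP connect and TLS
    %% handshake happen asynchronously inside the gun process.
    {ok, ConnPid} = gun:open(Host, Port,
                             #{transport => tls,
                               connect_timeout => 3000}),
    %% gun:await_up/2 limits how long we wait for the connection
    %% (including the handshake) to become usable.
    case gun:await_up(ConnPid, 5000) of
        {ok, Protocol} ->
            {ok, ConnPid, Protocol};
        {error, Reason} ->
            %% Never pool a connection that failed to come up.
            gun:close(ConnPid),
            {error, Reason}
    end.
```

If await_up times out, closing the connection and retrying is usually safer than pooling a pid whose state is unknown.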

@MononcQc, thanks for the reply!
I do have a timeout on my requests:

StreamRef = gun:get(Connection, Path, Headers),
gun:await(Connection, StreamRef, Timeout)

The problem is that I expected that if the Connection process is alive I can use it, and it appears that is not always the case.
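A process being alive only proves it hasn’t crashed, not that it’s serving requests. One cheap heuristic before reusing a pooled connection is to look at the process’s mailbox length, since a blocked gun process shows exactly the queue buildup seen above. The helper name and threshold here are made up for illustration:

```erlang
%% Sketch: treat a connection as unusable if its mailbox is piling up,
%% which is what happens when the gun process is blocked in a call.
usable(ConnPid, MaxQueue) ->
    case erlang:process_info(ConnPid, message_queue_len) of
        {message_queue_len, N} when N =< MaxQueue ->
            true;
        _ ->
            %% Either the queue is too long or the process is dead
            %% (process_info/2 returns undefined for dead processes).
            false
    end.
```

This won’t fix the underlying handshake hang, but it lets the pool evict and replace a wedged connection instead of queueing requests on it forever.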