ssl:close stuck in prim_inet:recv0/3 on OTP 28.1.1

On one or two OTP 28.1.1 servers we noticed several SSL connections stuck in a closing state. The caller called ssl:close and is stuck in gen:call. The ssl_gen_statem process called gen_tcp:recv/3 with a timeout, but it never received a response message from the port. The port is still alive. Calling gen_tcp:recv/3 on the same socket from the shell returns {error,enotconn}.

Backtrace of the calling process

> bt(<0.31332.0>).
Program counter: 0x0000f45fd675b210 (gen:do_call/4 + 536)
y(0)     []
y(1)     []
y(2)     []
y(3)     #Ref<0.3750388521.1181220866.236756>

0x0000f45f755d5850 Return addr 0x0000f45fd6908ca8 (gen_statem:call/3 + 656)
y(0)     {close,5000}
y(1)     <0.31328.0>
y(2)     Catch 0x0000f45fd6908cf4 (gen_statem:call/3 + 732)

0x0000f45f755d5870 Return addr 0x0000f45fd76bcff0 (ssl_gen_statem:call/2 + 104)
y(0)     Catch 0x0000f45fd76bd010 (ssl_gen_statem:call/2 + 136)

0x0000f45f755d5880 Return addr 0x0000f45fd76b4228 (ssl_gen_statem:close/2 + 72)

Stacktrace of the ssl_gen_statem process

> recon:info(<0.31328.0>).
[{meta,[{registered_name,[]},
        {dictionary,[{'$initial_call',{ssl_gen_statem,init,1}},
                     {'$ancestors',[<0.31326.0>,tls_connection_sup,tls_sup,
                                    ssl_connection_sup,ssl_sup,<0.91.0>]},
                     {tls_role,server},
                     {'$process_label',{tls,server,
...
        {status,waiting}]},
 {signals,[{links,[<0.31326.0>,#Port<0.255>]},
           {monitors,[{process,<0.31332.0>}]},
           {monitored_by,[<0.31332.0>]},
           {trap_exit,true}]},
 {location,[{initial_call,{proc_lib,init_p,5}},
            {current_stacktrace,[{prim_inet,recv0,3,[]},
                                 {tls_gen_connection,close,4,
                                                     [{file,"tls_gen_connection.erl"},{line,572}]},
                                 {ssl_gen_statem,handle_call,4,
                                                 [{file,"ssl_gen_statem.erl"},{line,764}]},
                                 {gen_statem,loop_state_callback,11,
                                             [{file,"gen_statem.erl"},{line,3748}]},
                                 {proc_lib,init_p_do_apply,3,
                                           [{file,"proc_lib.erl"},{line,333}]}]}]},
 {memory_used,[{memory,34664},
               {message_queue_len,0},

Backtrace of the same ssl_gen_statem process

> bt(<0.31328.0>).
Program counter: 0x0000f45fd660ad44 (prim_inet:recv0/3 + 196)
y(0)     0
y(1)     #Port<0.255>

0x0000f45f729616f8 Return addr 0x0000f45fd76a5554 (tls_gen_connection:close/4 + 260)

0x0000f45f72961700 Return addr 0x0000f45fd76b814c (ssl_gen_statem:handle_call/4 + 3916)
y(0)     {state,#Ref<0.3750388521.1165361153.253007>,

The port still exists

> recon:port_info(#Port<0.255>).
[{meta,[{id,2040},{name,"tcp_inet"},{os_pid,undefined}]},
 {signals,[{connected,<0.31328.0>},
           {links,[<0.31328.0>]},
           {monitors,[]}]},
 {io,[{input,0},{output,608216479}]},
 {memory_used,[{memory,48},{queue_size,0}]},
 {type,[{statistics,[{recv_oct,30602051},
                     {recv_cnt,658470},
                     {recv_max,314},
                     {recv_avg,46},
                     {recv_dvi,0},
                     {send_oct,587038959},
                     {send_cnt,756340},
                     {send_max,9943},
                     {send_avg,776},
                     {send_pend,0}]},
        {options,[{active,false},
                  {buffer,128},
                  {delay_send,false},
                  {exit_on_close,true},
                  {header,0},
                  {high_watermark,8192},
                  {low_watermark,4096},
                  {mode,binary},
                  {packet,0},
                  {packet_size,0},
                  {send_timeout,30000}]}]}]
		  
> gen_tcp:recv(#Port<0.255>, 0, 5000).
{error,enotconn}

Unfortunately, beam.smp is stripped of debug symbols. Is this a known issue? Why did the port not respond? Is there any way to debug this without debug symbols? Would an erl_crash.dump be of any use in this case?

(The node is still running in this state. It runs on Ubuntu 24.04 with kernel 6.14.0-1017-azure aarch64.)


What state does the kernel think the socket is in?

netstat -tan | grep portnumber

Or, since I'd hope it's in FIN_WAIT, CLOSING or LAST_ACK:

netstat -tan | grep -E 'FIN_WAIT|CLOSING|LAST_ACK'

This should narrow down whether the operating system is still dealing with an open socket, or whether the internals of the ssl library are somehow not reporting the state change to Erlang.
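A rough sketch of the two checks above as reusable shell functions, in case plain grep is too loose (grepping for 443 also matches 8443 or an ephemeral port like 44312). The function names closing_sockets and sockets_for_port are made up for illustration, and the column positions assume Linux net-tools `netstat -tan` output (local address in column 4, remote address in column 5, state in column 6). CLOSE_WAIT is added to the list as an extra guess, since it indicates the peer closed but the local side has not.

```shell
# Print only sockets in a close-related TCP state.
# Reads `netstat -tan`-style lines on stdin; assumes the state is column 6.
closing_sockets() {
    awk '$6 ~ /^(FIN_WAIT1|FIN_WAIT2|CLOSING|LAST_ACK|CLOSE_WAIT)$/'
}

# Print sockets whose local or remote port is exactly $1.
# The port is taken as the text after the last ":" of columns 4 and 5,
# so IPv6 forms such as :::443 are handled as well.
sockets_for_port() {
    awk -v port="$1" '{
        n = split($4, l, ":")
        m = split($5, r, ":")
        if (l[n] == port || r[m] == port) print
    }'
}
```

Usage would be something like `netstat -tan | sockets_for_port 443 | closing_sockets`, with 443 replaced by the actual port. Empty output means the kernel no longer tracks a socket on that port in one of those states.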

Disclaimer: not an expert in this area, just working the problem as I would think to.

Thanks for the response.

I forgot to mention that the operating system doesn’t know about this socket any more. netstat -tan | grep portnumber is empty. The OS socket is closed and gone.
