Gen_udp server stuck in send

gonzalobf · November 2, 2022, 2:52pm

We have a production system with a gen_udp server that forwards the messages it receives:

-module(udp_server).
...
sendq(Socket, Addr, Port) ->
    receive
        Data -> gen_udp:send(Socket, Addr, Port, Data)
    % yield in case of code reloading.
    after 1000 -> ok
    end,
    ?MODULE:sendq(Socket, Addr, Port).

We believe the the process was stuck and the queue was growing. We checked the queue multiple times and we only saw it growing. We couldn’t identify why it wasn’t progressing with the queue, but we were able to get some information and a core dump before restarting the system.

(prod@prod)1> erlang:process_info(<0.728.0>). 
[{current_function,{prim_inet,do_sendto,4}},
 {initial_call,{udp_server,sendq,3}},
 {status,running},
 {message_queue_len,1551728},
 {links,[<0.727.0>]},
 {dictionary,[]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.386.0>},
 {total_heap_size,16401542},
 {heap_size,8912793},
 {stack_size,12},
 {reductions,1355953309888},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,0}]},
 {suspending,[]}]

(prod@prod)3> rp(erlang:process_info(<0.728.0>, backtrace)).
{backtrace,<<"Program counter: 0x00007f91ab078830 (prim_inet:do_sendto/4 + 568)\ny(0)     []\ny(1)     []\ny(2)     []\ny(3)     #Port<0.95>\ny(4)     []\n\n0x00007f813f761cb8 Return addr 0x00007f813f761ce0 (unknown function)\n\n0x00007f813f761cc0 Return addr 0x00007f90e243fc44 (udp_server:sendq/3 + 196)\ny(0)     6050\ny(1)     {224,122,0,50}\ny(2)     #Port<0.95>\n\n0x00007f813f761ce0 Return addr 0x0000000000000000 (invalid)\n\n0x00007f813f761ce8 Return addr 0x00007f91ab518e38 (<terminate process normally>)\n">>}

I was trying to figure out if the OS was slow processing the messages or if we sent a big message and that was causing the process to appear stuck. I tried to inspect the MessageQueue of the process with gdb, but I might be doing something wrong:

(gdb) set $etp_pmem_proc = ((Process *) 0x7f90eb35f1c0)
(gdb) etp-process-info-x $etp_pmem_proc
  Pid: <0.728.0>
  State: running | active | prq-prio-normal | usr-prio-normal | act-prio-normal

  Flags: delay-gc heap-grow 
  Current function: unknown
  I: #Cp<prim_inet:do_sendto/4+0x268>
  Heap size: 8912793
  Old-heap size: 0
  Mbuf size: 12
  Msgq len: 2123698 (inner=2123572, outer=126)
  Parent: <0.727.0>
  Pointer: (Process*)0x7f90eb35f1c0
  Msgq Flags: on-heap 
  --- Inner signal queue (message queue) ---
    [#1:[#HeapBinary<0x6,(nil)>,#RefcBinary<0x582,0x7f813b7a29a0,0x7f82740de5e8,0x7f82740de600,(nil)>] @from= <0.729.0>,
     #2:[#HeapBinary<0x6,(nil)>,#RefcBinary<0x582,0x7f813b7a29d0,0x7f82740deb98,0x7f82740debb0,(nil)>] @from= <0.729.0>,
...
(gdb) etp-msgq (($etp_pmem_proc)->sig_qs)
Attempt to take address of value not located in memory.

Any suggestions about what I can check to identify the problem?

version: OTP25.1

Abdelghani · November 6, 2022, 11:12pm

Hello @gonzalobf, I think your code is a bit wrong it’s always a good practice to use atoms when receiving something because you can never know what you may catch in your receive, you can do better by :

loop(Socket, Addr, Port) ->
receive
       {data, Data} ->
                  gen_udp:send(Socket, Addr, Port, Data),
                  loop(Socket, Addr, Port)
end.

if that does not fix this bug you may try another version of OTP, good luck.

gonzalobf · November 7, 2022, 5:08pm

@Abdelghani I don’t think there is anything wrong with the code I pasted before but I think you are right about that is usually a good practice to pattern match in the receive.

I did little bit more of progress:

The port looked okey:

---
  Pix: 760
  Port: #Port<0.95>
  Name: udp_inet
  State: connected binary-io port-specific-lock
  Scheduler flags:
  Connected: <0.727.0>
  Pointer: (Port*)0x55ad515bc4d8
---

I found the scheduler that was running the process. It was running this function that is putting marks in the queue.

(gdb) thread 11
[Switching to thread 11 (LWP 2084)]
#0  0x000055ad4f8d94ce in erts_msgq_set_save_next (c_p=0x7f90eb35f1c0) at beam/erl_proc_sig_queue.h:2141
2141	    if (sigp && ERTS_SIG_IS_RECV_MARKER(sigp))

So I think my process was waiting for the response of erlang:port_command from here.

I inspected again the process and the inner queue had 2M messages:

(gdb) p ($etp_pmem_proc)->sig_qs
$46 = {first = 0x7f8d5e864eb0, last = 0x7f8ddb4325f8, save = 0x7f8c93c745f8, cont = 0x0, cont_last = 0x7f90eb35f320, nmsigs = {next = 0x0, last = 0x0}, recv_mrk_blk = 0x0, len = 2123572, flags = 2}

And the response was the first element in the outer queue. The queue only had 126 messages.

(gdb) p ($etp_pmem_proc)->sig_inq
$47 = {first = 0x7f8ef75ba630, last = 0x7f8ef755c578, len = 126, nmsigs = {next = 0x0, last = 0x0}}

(gdb) etp-sigq-int ($etp_pmem_proc)->sig_inq.first ($etp_pmem_proc)->sig_qs.save ($etp_pmem_proc)->sig_inq.last
    [#8816:{inet_reply,#Port<0.95>,ok} @token= undefined @from= #Port<0.95>,
     #8817:[#HeapBinary<0x6,(nil)>,#RefcBinary<0xbe,(nil),0x7f890c9aa010,0x7f890c9aa028,(nil)>] @from= <0.729.0>,
     #8818:[#HeapBinary<0x6,(nil)>,#RefcBinary<0x7e,(nil),0x7f867cf67860,0x7f867cf67878,(nil)>] @from= <0.729.0>,
     #8819:#HeapBinary<0x18,0x4,(nil),0x2f67d088e> @from= <0.730.0>,
....

I suspect I don’t have any other option than waiting to happen again, so I can connect to the beam and see what the process is really doing.

I’m wondering if doing gen_udp:send can be slowdown so much because the queue size of the caller to the point that it looks like it is stuck. The reduction number was increasing so it should be doing some kind of work.

Abdelghani · November 7, 2022, 7:41pm

@gonzalobf Iam sorry but why you did all that ? as my knowledge there is nothing wrong with OTP internal implementation in either TCP or UDP especially Network Drivers, I say that because I have already this situation and I have read tons of C code in the huge inet_drv.c and finally I found everything in highly performance.
I don’t know exatly what causes your problem but Iam sure that the best way is to edit your Erlang Code and keep things as simple as you can.
Otherwise, if you can patch OTP and recompile it, then you can surely debug your problem by insert some io:format and you will have exactly where the process was hunged, I hope that will helpful.

mmin · November 7, 2022, 9:43pm

Is there a possibility that all those messages were received at the very short time period or were they accumulated over longer period of time?

One possible fix that came to my mind is using {message_queue_data, off_heap} for the udp_server if it has to receive a very large amount of messages in short periods of time. There is a forum thread about it you may find helpful: {message_queue_data, off_heap} use case .

gonzalobf · November 9, 2022, 1:14pm

Thanks @mmin .{message_queue_data, off_heap} is a interesting configuration for this process and it would definitely reduce the gc the process does and maybe improve the senders performance. However, we didn’t see the process performing a lot of in gc so I didn’t think that could be the problem.

I wrote a benchmark trying to confirm if off_heap was quicker, but the benchmark suggests that gen_udp:send execution time increases linearly with the caller’s queue (for both cases). I decided to open an issue on github so the OTP team know about this behaviour and decide about it.

I would post here what they tell me about this.