We have a production system with a gen_udp server that forwards the messages it receives:
-module(udp_server).
...
sendq(Socket, Addr, Port) ->
    receive
        Data -> gen_udp:send(Socket, Addr, Port, Data)
    % yield in case of code reloading.
    after 1000 -> ok
    end,
    ?MODULE:sendq(Socket, Addr, Port).
We believe the process was stuck while its message queue kept growing. We checked the queue length multiple times and it only ever increased. We couldn't identify why the process wasn't working through the queue, but we were able to collect some information and a core dump before restarting the system.
(prod@prod)1> erlang:process_info(<0.728.0>).
[{current_function,{prim_inet,do_sendto,4}},
{initial_call,{udp_server,sendq,3}},
{status,running},
{message_queue_len,1551728},
{links,[<0.727.0>]},
{dictionary,[]},
{trap_exit,false},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.386.0>},
{total_heap_size,16401542},
{heap_size,8912793},
{stack_size,12},
{reductions,1355953309888},
{garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,0}]},
{suspending,[]}]
(prod@prod)3> rp(erlang:process_info(<0.728.0>, backtrace)).
{backtrace,<<"Program counter: 0x00007f91ab078830 (prim_inet:do_sendto/4 + 568)\ny(0) []\ny(1) []\ny(2) []\ny(3) #Port<0.95>\ny(4) []\n\n0x00007f813f761cb8 Return addr 0x00007f813f761ce0 (unknown function)\n\n0x00007f813f761cc0 Return addr 0x00007f90e243fc44 (udp_server:sendq/3 + 196)\ny(0) 6050\ny(1) {224,122,0,50}\ny(2) #Port<0.95>\n\n0x00007f813f761ce0 Return addr 0x0000000000000000 (invalid)\n\n0x00007f813f761ce8 Return addr 0x00007f91ab518e38 (<terminate process normally>)\n">>}
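For reference, the repeated queue checks described above can be sketched as a small helper. This is only a minimal sketch: the module name `queue_probe` and the function `sample/3` are hypothetical and not part of our system. It records `message_queue_len` together with `reductions`, because a reductions count that keeps rising while the queue grows would suggest the process is executing but not draining fast enough, rather than being blocked.

```erlang
-module(queue_probe).
-export([sample/3]).

%% Hypothetical helper (not part of udp_server): sample the message queue
%% length and reduction count of a local process N times, sleeping
%% IntervalMs milliseconds after each sample.
%% Returns a list of {QueueLen, Reductions} tuples, oldest first.
sample(Pid, N, IntervalMs) when is_pid(Pid), is_integer(N), N > 0 ->
    sample(Pid, N, IntervalMs, []).

sample(_Pid, 0, _IntervalMs, Acc) ->
    lists:reverse(Acc);
sample(Pid, N, IntervalMs, Acc) ->
    %% process_info/2 with an item list returns the tuples in the same
    %% order as the items, or 'undefined' if the process has exited.
    case erlang:process_info(Pid, [message_queue_len, reductions]) of
        undefined ->
            lists:reverse(Acc);
        [{message_queue_len, Len}, {reductions, Reds}] ->
            timer:sleep(IntervalMs),
            sample(Pid, N - 1, IntervalMs, [{Len, Reds} | Acc])
    end.
```

Calling something like `queue_probe:sample(Pid, 5, 1000)` on the node that owns the process would give five samples roughly one second apart. Only these two items are requested, so the messages themselves are never copied into the shell.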
I was trying to figure out whether the OS was slow to process the messages, or whether a single large message was making the process appear stuck. I tried to inspect the process's message queue with gdb, but I may be doing something wrong:
(gdb) set $etp_pmem_proc = ((Process *) 0x7f90eb35f1c0)
(gdb) etp-process-info-x $etp_pmem_proc
Pid: <0.728.0>
State: running | active | prq-prio-normal | usr-prio-normal | act-prio-normal
Flags: delay-gc heap-grow
Current function: unknown
I: #Cp<prim_inet:do_sendto/4+0x268>
Heap size: 8912793
Old-heap size: 0
Mbuf size: 12
Msgq len: 2123698 (inner=2123572, outer=126)
Parent: <0.727.0>
Pointer: (Process*)0x7f90eb35f1c0
Msgq Flags: on-heap
--- Inner signal queue (message queue) ---
[#1:[#HeapBinary<0x6,(nil)>,#RefcBinary<0x582,0x7f813b7a29a0,0x7f82740de5e8,0x7f82740de600,(nil)>] @from= <0.729.0>,
#2:[#HeapBinary<0x6,(nil)>,#RefcBinary<0x582,0x7f813b7a29d0,0x7f82740deb98,0x7f82740debb0,(nil)>] @from= <0.729.0>,
...
(gdb) etp-msgq (($etp_pmem_proc)->sig_qs)
Attempt to take address of value not located in memory.
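On the question of whether the OS was slow to accept the datagrams, one thing that could be checked on a live node is the socket's own send statistics. The following is a rough sketch, assuming the socket is a port-based `gen_udp` socket (the default inet backend); the module name `udp_probe` and the function `socket_stats/1` are hypothetical. It only uses `inet:getstat/2` and `erlang:port_info/2`.

```erlang
-module(udp_probe).
-export([socket_stats/1]).

%% Hypothetical helper: collect send-side counters for a gen_udp socket.
%% Assumes Socket is a port (default inet backend), not a 'socket' NIF handle.
socket_stats(Socket) when is_port(Socket) ->
    %% send_cnt: packets sent, send_oct: bytes sent,
    %% send_pend: bytes waiting to be sent.
    {ok, Stats} = inet:getstat(Socket, [send_cnt, send_oct, send_pend]),
    %% Bytes queued in the port driver; 'undefined' if the port is closed.
    PortQueue = erlang:port_info(Socket, queue_size),
    #{stats => Stats, port_queue => PortQueue}.
```

Calling this twice a few seconds apart and comparing `send_cnt` would show whether datagrams are still leaving the socket while the process appears stuck.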
Any suggestions about what I can check to identify the problem?
Version: OTP 25.1