Gen_tcp.recv randomly gets stuck

jjedele · February 4, 2025, 9:56am

Hey,

Erlang (or rather Elixir) noob here. I'm having some trouble using the gen_tcp module to implement a simple TCP server (project in question: GitHub - jjedele/dicom.ex: A DICOM library for Elixir).

Posting this here after reading the material I found about gen_tcp, but nothing seemed to describe my problem. Maybe an experienced Erlanger sees right away what’s wrong.

Problem: The transfer works most of the time, but sometimes it randomly gets stuck in the gen_tcp.recv call. (I can send the same data with the same tool multiple times; usually it works, but in a few cases it doesn't. That's why I assume there's something wrong with my network code.)

This is how I set it up:

Connection acceptor (dicom.ex/lib/dicom_net/endpoint.ex at main · jjedele/dicom.ex · GitHub):

{:ok, listen_socket} =
  :gen_tcp.listen(port, [:binary, active: false, reuseaddr: true])

....

case :gen_tcp.accept(listen_socket, 100) do
  {:ok, client_socket} ->
    {:ok, assoc_pid} =
      DicomNet.Association.start(%{socket: client_socket, event_listener: self()})

    :gen_tcp.controlling_process(client_socket, assoc_pid)

And the problematic receive code in the connection handler (dicom.ex/lib/dicom_net/association.ex at main · jjedele/dicom.ex · GitHub):

:gen_tcp.recv(socket, 0)

(I also tried adding a timeout, but in that case I just get repeated timeouts from recv.)

The closest thing to my problem I found is this old SO thread: erlang - How to use gen_tcp:recv correctly - Stack Overflow
But neither the original poster's update nor the answer makes sense to me.

LeonardB · February 4, 2025, 11:19am

I’d recommend reading: https://learnyousomeerlang.com/buckets-of-sockets

jjedele · February 4, 2025, 1:29pm

Hey,

thank you for the link, it’s a nice book. Is there a specific section/part you think is useful to check? I read that chapter, and while the project and general setup is a bit different than mine, I don’t really see what part I’m missing/misunderstanding which would lead to the problem I’m seeing.

As far as I understand, the whole part about active mode is not something I need right now, since I need fine-grained control over the connection state.

I’m already changing the ownership of my connection socket to the handler process so I can recv from there, and most of the time this also works exactly as expected.

The flush() command they are using seems to be related to the Erlang shell and not to socket code?

I tried explicitly setting {packet, raw} as an option, but it doesn't change anything. (Which would also have been surprising since, again, most of the time it works.)

It feels to me like I have some problem with the socket buffer, as if sometimes there isn't enough data for recv to return anything. But as I understand the gen_tcp.recv documentation, calling it with Length=0 should always return whatever is available, without any size constraint.

Or maybe I’m creating some problem because I don’t explicitly close the sockets on server-side? I was assuming it’s enough if the client closes the connection, but I’ll try to debug a bit more in this direction.

jstimps · February 4, 2025, 1:45pm

What is the size of your gen_server’s buffer (:buffer in your state map) when your program isn’t behaving as expected? Reasoning: recv may block when there is no data remaining, so perhaps you’ve just buffered it all.

Aside: once you do find the cause of your issue, or if you run out of things to debug, I do suggest you refactor to use active mode. It will allow you to simplify your gen_server and it will probably run faster.
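
For illustration only, here is a minimal sketch of what receiving in active mode could look like, using active: :once so the process still decides when it is ready for the next chunk. The module name ActiveOnceSketch and its loop/2 function are made up for this example and are not part of dicom.ex; a real gen_server would handle the same messages in handle_info/2 rather than in a bare receive.

```elixir
defmodule ActiveOnceSketch do
  # Hypothetical example module, not taken from dicom.ex.
  # Collects everything the peer sends until it closes the connection.
  # The calling process must be the socket's controlling process.
  def loop(socket, buffer \\ <<>>) do
    # Ask for exactly one {:tcp, socket, data} message; afterwards the
    # socket falls back to passive mode until we re-arm it.
    case :inet.setopts(socket, active: :once) do
      :ok ->
        receive do
          {:tcp, ^socket, data} -> loop(socket, buffer <> data)
          {:tcp_closed, ^socket} -> {:closed, buffer}
          {:tcp_error, ^socket, reason} -> {:error, reason, buffer}
        end

      {:error, reason} ->
        {:error, reason, buffer}
    end
  end
end
```

Because the socket is re-armed only after each message has been handled, the process keeps the back-pressure it has in passive mode, but it never sits blocked inside recv.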

jjedele · February 4, 2025, 7:54pm

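@jstimps Thank you - that was the right pointer! Seems pretty stupid in retrospect, but of course I cannot simply assume there's only one protocol unit per read call: a single recv can return several units at once, and after handling only the first one I was waiting on the socket for data that was already sitting in my buffer.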
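
For anyone landing here with the same problem, a rough sketch of the idea: drain every complete unit from the buffer before going back to recv. It assumes the DICOM Upper Layer framing (1-byte PDU type, 1 reserved byte, 4-byte big-endian length, then the payload). The module FramingSketch and its functions are hypothetical names for this post, not the actual code in dicom.ex.

```elixir
defmodule FramingSketch do
  # Hypothetical helper, not the dicom.ex implementation.
  # Splits a buffer into all complete PDUs plus the leftover bytes.
  # Assumed framing: <<type::8, reserved::8, length::32, payload::binary>>.
  def split_pdus(buffer, acc \\ [])

  def split_pdus(
        <<type::8, _reserved::8, len::32, payload::binary-size(len), rest::binary>>,
        acc
      ) do
    split_pdus(rest, [{type, payload} | acc])
  end

  # Fewer bytes than one full PDU remain: keep them for the next read.
  def split_pdus(rest, acc), do: {Enum.reverse(acc), rest}

  # One passive read, then split out everything that is now complete.
  # The caller should handle all returned PDUs before reading again.
  def recv_pdus(socket, buffer) do
    case :gen_tcp.recv(socket, 0) do
      {:ok, data} -> {:ok, split_pdus(buffer <> data)}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

The point is that split_pdus/2 may return zero, one, or several PDUs per read, and the leftover bytes are carried into the next call instead of being dropped or waited on.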

Thx also for the pointer with active mode, will look into that. Was also considering if this is a nice use case to try out gen_statem.