What does "stealing" resources mean in NIFs?

NAR · January 30, 2024, 9:31am

Hello!

We’re working on a NIF wrapper around a 3rd party C library and we’re currently using OTP 24. The C library opens a connection to a remote host and we send and receive messages over this connection. The library has functions to open/close/send data over the socket, but does not hide the socket completely: it is in a public member of a struct, and the example code from the library calls select(2) on the socket to wait for incoming data. For this reason our NIF uses enif_select on the same socket. We store the socket (actually the struct containing the socket) in a resource object and pass this resource around between the NIF calls. However, I’ve noticed this message in our logs sometimes:

driver_select(0x00007f11100d9128, 73, ERL_DRV_READ ERL_DRV_USE, 1) by tcp_inet driver #Port<0.7554> stealing control of fd=73 from resource our_nif_module:channel_resource with in-pid <0.1716.0>

That channel_resource is the resource, it’s a C struct with a single member (the struct from the 3rd party library containing the socket). That <0.1716.0> pid is the Erlang process that created and used the resource (I think). This message is logged from a different process (I don’t know what kind of process, there’s no trace of its pid anywhere else in the logs). I found that this message comes from erl_check_io.c and found a comment in the enif_select implementation that says “Changing resource is considered stealing. Changing process and/or ref is ok (I think?).” What does this mean and how can I avoid this message? Our Erlang code does not pass these resources between processes, a connection is handled by a single gen_statem.

Usually there’s no traffic over this socket outside business hours. This stealing message got logged around 01:30 and the messages did not restart in the morning (which was our problem), but I’m not sure these are related.

jhogberg · January 30, 2024, 9:59am

It means that the given port has selected the same file descriptor as the NIF, which is often indicative of a use-after-free. Do you always ERL_NIF_SELECT_STOP the socket before closing it?

Use ERL_NIF_SELECT_STOP as mode in order to safely close an event object that has been passed to enif_select. The stop callback of the resource obj will be called when it is safe to close the event object. This safe way of closing event objects must be used even if all notifications have been received (or cancelled) and no further calls to enif_select have been made. ERL_NIF_SELECT_STOP will first cancel any selected events before it calls or schedules the stop callback. Arguments pid and ref are ignored when ERL_NIF_SELECT_STOP is specified.
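To make the use-after-free concrete: POSIX hands out the lowest free descriptor number, so a socket that is closed behind the VM's back is typically followed by a new one with the same number. Below is a minimal sketch in plain C, outside the VM. The function name `fd_number_is_reused` is made up for illustration, and the sketch assumes nothing else in the process opens or closes descriptors in between.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if a freshly created socket gets the descriptor number of a
 * socket that was just closed, 0 if it gets a different number, and -1 if
 * a socket could not be created. (Illustrative helper, not an OTP API.) */
int fd_number_is_reused(void)
{
    int first = socket(AF_UNIX, SOCK_STREAM, 0);
    if (first < 0)
        return -1;

    /* Closing frees the number; anything still holding "first" now has a
     * stale handle. */
    close(first);

    /* The kernel allocates the lowest free descriptor number, which is the
     * one we just released. */
    int second = socket(AF_UNIX, SOCK_STREAM, 0);
    if (second < 0)
        return -1;

    int reused = (second == first);
    close(second);
    return reused;
}
```

In a single-threaded program this is expected to return 1. Inside the VM, the second socket would presumably be the one opened by the tcp_inet port, while the earlier enif_select registration still refers to the old number, which matches the situation the log message reports.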

NAR · January 30, 2024, 10:23am

Thanks! I think this could be the problem. Our code does close the socket (using a call from the 3rd party library) in some cases and the logs do not indicate that anything happened, but we’re not using ERL_NIF_SELECT_STOP. Maybe the remote end closed the connection and we didn’t notice it (that would also explain why we stopped getting messages). We definitely need to use ERL_NIF_SELECT_STOP, thanks again!

NAR · February 16, 2024, 4:55pm

I had some time to work on this. The code now sets up the stop callback and when the connection closes, it calls enif_select(env, socket, ERL_NIF_SELECT_STOP, res, NULL, ok_atom). This triggers the stop callback that calls into the 3rd party library to properly close the connection. Seems to work, I no longer get the “stealing resources” message.
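For readers following along, the setup described above could look roughly like the sketch below. It is not the actual code from this thread. The layout of `channel_resource` (a bare `int fd`), the NIF name `close_channel`, and the use of plain close(2) in place of the 3rd party library's close call are all assumptions made for illustration; only the erl_nif.h calls are real. It needs the OTP headers to compile and only does anything when loaded into the VM.

```c
#include <erl_nif.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for the resource; the real one wraps the
 * library's struct that contains the socket. */
typedef struct {
    int fd;
} channel_resource;

static ErlNifResourceType *channel_type;

/* Stop callback: the VM calls this once it is safe to close the fd. */
static void channel_stop(ErlNifEnv *caller_env, void *obj,
                         ErlNifEvent event, int is_direct_call)
{
    channel_resource *ch = obj;
    (void)caller_env;
    (void)is_direct_call;

    /* Assumption: the real code calls the library's close function here. */
    if (close((int)event) != 0)
        perror("channel close"); /* void callback, so logging is all we can do */
    ch->fd = -1;
}

static int load(ErlNifEnv *env, void **priv_data, ERL_NIF_TERM load_info)
{
    ErlNifResourceTypeInit init = { .dtor = NULL, .stop = channel_stop, .down = NULL };
    (void)priv_data;
    (void)load_info;

    channel_type = enif_open_resource_type_x(env, "channel_resource", &init,
                                             ERL_NIF_RT_CREATE, NULL);
    return channel_type == NULL ? -1 : 0;
}

/* Hypothetical NIF: ask the VM to stop selecting instead of closing directly. */
static ERL_NIF_TERM close_channel(ErlNifEnv *env, int argc,
                                  const ERL_NIF_TERM argv[])
{
    channel_resource *ch;
    (void)argc;

    if (!enif_get_resource(env, argv[0], channel_type, (void **)&ch))
        return enif_make_badarg(env);
    if (ch->fd < 0)
        return enif_make_atom(env, "ok"); /* already stopped and closed */

    /* pid and ref are ignored for ERL_NIF_SELECT_STOP. */
    int rc = enif_select(env, (ErlNifEvent)ch->fd, ERL_NIF_SELECT_STOP,
                         ch, NULL, enif_make_atom(env, "undefined"));
    return enif_make_atom(env, rc < 0 ? "error" : "ok");
}

static ErlNifFunc nif_funcs[] = {
    {"close_channel", 1, close_channel}
};

ERL_NIF_INIT(our_nif_module, nif_funcs, load, NULL, NULL, NULL)
```

A non-negative return value from enif_select also carries the ERL_NIF_SELECT_STOP_CALLED or ERL_NIF_SELECT_STOP_SCHEDULED bit, so the caller can tell whether the stop callback has already run or will run later.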

I was wondering - how to handle errors in the stop callback? The stop callback returns void. The “connection close” call into the 3rd party library might fail (at least it has a return value), but I’m not sure it can actually fail. Of course, can’t really do anything with this error (other than logging), so it’s not that important…

jhogberg · February 16, 2024, 5:17pm

The stop callback runs when it’s safe to close the connection, but you’re not forced to do so then and there: if closing your resource is super gnarly and/or possibly takes forever, then you can delegate that to something else (e.g. sending a message to a process that does it for you).