Potential ssl_gen_statem bug (pool timeouts from NimblePool)

Hello there,

I am Arnold, a fellow Elixir programmer. I have been working on multiple Elixir projects for approximately the last 10 years. I love the language, the BEAM, the community, and all the tooling around it.

Like almost any project, ours has some third-party integrations too, for which we now mostly use Finch for HTTP. Recently, we observed that one of these integrations (namely Tesseract, deployed in our Kubernetes cluster, external to the calling system) is causing us issues, more specifically pool timeouts from NimblePool, which Finch uses for connection pooling.

At first, I did what any programmer unaware of a deeper problem would do and increased the pool size. Then we put the Tesseract integration on a different Finch instance, with its own pool, etc. The occurrence of these errors decreased drastically, but they still appeared occasionally. Again, I did what many programmers would do: added more logging and more metadata.
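For context, the isolation step looked roughly like this (just a sketch; the Finch instance name, host, and pool sizes are placeholders rather than our real configuration):

# In the supervision tree: a Finch instance dedicated to the Tesseract
# integration, so that its pool problems cannot affect the other integrations.
children = [
  {Finch,
   name: TesseractFinch,
   pools: %{
     # one pool of 200 connections, used only for the Tesseract endpoint
     "https://tesseract.internal.example" => [size: 200, count: 1]
   }}
]

Supervisor.start_link(children, strategy: :one_for_one)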

Soon I was faced with a situation that made no sense: we were receiving these pool timeout errors, yet at the very same moment, querying the pool state showed that most connections were still available (usually 150+ out of a total of 200). This was the point where I started to think that something was fishy, so I went down the rabbit hole, wanting to really get to the bottom of it. I added tracing that was triggered by telemetry events produced by Finch.
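The tracing hook was along these lines (a rough sketch; the exact Finch telemetry event names depend on the Finch version, and the Rexbug trace patterns are only examples of what we captured):

defmodule PoolTimeoutTracer do
  # Attach to Finch's checkout/queue telemetry and start a short trace when a
  # checkout fails, so we capture what the pool is doing at that exact moment.
  # Event names may differ between Finch versions.
  def attach do
    :telemetry.attach(
      "pool-timeout-tracer",
      [:finch, :queue, :exception],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, _measurements, _metadata, _config) do
    # Trace the SSL close path and NimblePool for a short while
    # (in reality you would guard against starting Rexbug twice).
    Rexbug.start(
      [":ssl_gen_statem.terminate/3 :: return", "NimblePool :: return"],
      time: 60_000,
      msgs: 200
    )
  end
end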

First I traced NimblePool and the related Finch modules, and found that the problem lies in closing the connections:

# 03:51:24.508 #PID<0.5256.0> NimblePool.init/1
# #PID<0.278690.0> :ssl_gen_statem.init/1 <<< {:"$gen_call", {#PID<0.5256.0>, #Reference<0.3935724038.2842689537.162284>}, {:close, 5000}}
...
# 03:53:27.071 #PID<0.5256.0> NimblePool.init/1
# <<< {#Reference<0.3935724038.2842689537.162284>, :ok}

As we can see, closing this connection took 2 minutes, even though the specified timeout is 5 seconds! For these 2 minutes NimblePool is blocked waiting for the connection to close and cannot handle any of the messages it receives, hence the timeouts on connection checkout.
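To make the blocking concrete: the {:close, 5000} we see above is a synchronous $gen_call into the TLS connection process, so whichever process issues it (here the pool process) sits in that call until it returns. Measuring it by hand from a remote console could look something like this (the host is a placeholder; this only reproduces the timing of a close, not the underlying TCP problem):

# :ssl.close/2 is a synchronous call into the connection process,
# so the caller is blocked for however long the close takes.
{:ok, sock} =
  :ssl.connect(~c"some-tesseract-host.example", 443, [verify: :verify_none], 5_000)

{micros, :ok} = :timer.tc(fn -> :ssl.close(sock, 5_000) end)
IO.puts("close took #{div(micros, 1_000)} ms")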

Having figured out that the problem is not with NimblePool, I decided to go deeper and try to understand exactly what happens and how we could improve it (although at this point I had a feeling it was not something we could easily solve: the issue was pretty rare, happened only for this single integration, and was related to closing SSL connections):

# 12:12:10.683 #PID<0.250975.0> :ssl_gen_statem.init/1
# :ssl_gen_statem.terminate({:close, 5000}, :connection, ...)

...

# 12:12:39.090 #PID<0.250975.0> (:dead)
# :ssl_gen_statem.terminate/3 -> :ok

# 12:12:39.090 #PID<0.5565.0> NimblePool.init/1
# :ssl_gen_statem.close/2 -> :ok

Timestamps and durations differ between traces, as they were captured at different times, but the problem is the same.

Digging through the Erlang code, I found the function that terminates the connection (showing only the clause relevant to the trace above):

terminate(Reason, connection, #state{static_env = #static_env{
                                                     protocol_cb = Connection,
                                                     transport_cb = Transport,
                                                     socket = Socket},
                                     connection_states = ConnectionStates
                                    } = State) ->

    handle_trusted_certs_db(State),
    Alert = terminate_alert(Reason),
    %% Send the termination ALERT if possible
    catch Connection:send_alert_in_connection(Alert, State),
    Connection:close({timeout, ?DEFAULT_TIMEOUT}, Socket, Transport, ConnectionStates);

Looking through the code, I first thought that the parts that do take a timeout would be the problematic ones (?DEFAULT_TIMEOUT is 5 seconds). Analysing further, I found that Connection is tls_gen_connection and, counter-intuitively, tls_gen_connection:close/4 is what takes the most time:

# 20:38:33.425 #PID<0.563206.0> :ssl_gen_statem.init/1
# :ssl_gen_statem.terminate({:close, 5000}, :connection, ...)

# 20:38:33.430 #PID<0.563206.0> :ssl_gen_statem.init/1
# :tls_gen_connection.send_alert_in_connection({:alert, 1, 0, %{line: 1853, file: ~c"ssl_gen_statem.erl", mfa: {:ssl_gen_statem, :terminate_alert, 1}}, :undefined, :undefined}, ...)

... 5 seconds timeout ...

# 20:38:38.432 #PID<0.563206.0> :ssl_gen_statem.init/1
# :tls_gen_connection.send_alert_in_connection/2 -> {:throw, {:stop, {:shutdown, :own_alert}, {:state, {:static_env, :client, :gen_tcp, :tls_gen_connection, :tcp, :tcp_closed, :tcp_error, :tcp_passive, ~c"k8sgazo9v3lvvg.octoscreen.de", 443, #Port<0.45959>, #Reference<0.870037478.4226154500.13393>, #Reference<0.870037478.4226154500.13392>, :ssl_client_session_cache_db, {:ssl_crl_cache, ...}, {}, {}, 0}, %{}, {:protocol_buffers, {:undefined, {[], 0, []}}, "", []}, {[""], 0, []}}}}

# 20:38:38.432 #PID<0.563206.0> :ssl_gen_statem.init/1
# :tls_gen_connection.close({:timeout, 5000}, "<socket>", :gen_tcp, ...)

...

# 20:40:22.901 #PID<0.563206.0> (:dead)
# :tls_gen_connection.close/4 -> :ok

# 20:40:22.901 #PID<0.563206.0> (:dead)
# :ssl_gen_statem.terminate/3 -> :ok

Checking the function in tls_gen_connection (again showing only the relevant clauses):

%% User closes or recursive call!
close({close, Timeout}, Socket, Transport = gen_tcp, _) ->
    tls_socket:setopts(Transport, Socket, [{active, false}]),
    Transport:shutdown(Socket, write),
    _ = Transport:recv(Socket, 0, Timeout),
    ok;
...
%% Other
close(_, Socket, Transport, _) ->
    tls_socket:close(Transport, Socket).

As we can see from the trace, close in tls_gen_connection is called with {timeout, 5000}, ..., but there is no specific clause for this argument, so it falls through to the %% Other case. Is this intentional? Or should it maybe be called with {close, Timeout} instead of {timeout, Timeout}?


Do you have a simple reproducer that doesn’t involve third party libraries?

Hello, unfortunately I do not; as I said, it happens very rarely. I have traced the problem down to the inet level, but cannot go further unless we set up Wireshark or something similar, and we do not have the expertise or time for that; something gets stuck with the connection at the TCP level. Anyway, I am not looking to resolve the problem at that level; I was strictly asking why tls_gen_connection is called with a strange first argument ({timeout, 5000} instead of the expected {close, 5000}). If it is a bug, it may need to be adapted so that the timeout is applied; if it is intentional, I would prefer to know the reason why it is and should be this way (although if it is intentional, it could simply send ignored instead, which would be more intuitive).

Hmm, have you considered adjusting your gen_tcp options (more likely inet options)? If I trace the code myself, this should go straight to inet:close (using default settings). It’s more likely the behavior you’re seeing is a result of waiting to receive data or an acknowledgement from the remote side. You could also adjust the TIME_WAIT setting at the OS level; I would take a look at that and see what the value is. You could confirm this hypothesis by checking socket stats (e.g., netstat). I’m not sure what OS you are using exactly, surely some type of Linux; if so, check the tcp_fin_timeout setting, as it’s possibly set to 120 seconds in your case.
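From the BEAM side, one way to sanity-check the “waiting on the peer” hypothesis before reaching for netstat or Wireshark could be something like this, run against the underlying gen_tcp port of a stuck connection (the port variable is hypothetical, e.g. grabbed from a trace or from observer):

# If bytes are still pending in the send queue, the close/shutdown is
# most likely waiting for the peer to acknowledge data.
{:ok, stats} = :inet.getstat(port, [:recv_oct, :send_oct, :send_pend])
IO.inspect(stats, label: "socket stats")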

What version of the ssl application are you running?


These are all the SSL-related applications we have:

[
  {:ssl_verify_fun, ~c"SSL verification library", ~c"1.1.7"},
  {:ssl, ~c"Erlang/OTP SSL application", ~c"11.1.4"}
]

have you considered adjusting your gen_tcp options (more likely inet options)?

No, but this could be a good solution once we figure out what is causing it.

It’s more likely the behavior you’re seeing is a result of waiting to receive data or an acknowledgement from the remote side

It is possible, yes; I was thinking about something similar too.

You could also adjust the TIME_WAIT setting at the OS level; I would take a look at that and see what the value is. You could confirm this hypothesis by checking socket stats (e.g., netstat). I’m not sure what OS you are using exactly, surely some type of Linux; if so, check the tcp_fin_timeout setting, as it’s possibly set to 120 seconds in your case.

Thanks, I will consider that too.

All that said, my original questions still remain: why is the timeout NOT applied when calling close in tls_gen_connection, and why is {timeout, 5000} even sent if the parameter is ignored anyway?

You are totally right, I did not, and I did not say why, so let me correct that. The context here was not clear; I also assumed (rightly) that @ingela would jump in, because she is awesome like that, and what’s more, she has all the context. I could make assumptions about why, but I’d rather not and let the expert on this code answer that question.


I am a little preoccupied right now; I will look into it further when I have time. But details about what you are doing when it happens could help.

Sure, take your time. I feel the details of how I got to this code are not relevant to the questions, but even so, I think I have shared all the information I felt was relevant in the original post. In case you think something else is missing, I can probably share more if you point out the specific details that might be relevant.

The way I read it, it is not that no timeout is applied here, not in the sense of “wait forever”: if the catchall clause of close is hit, the underlying transport socket (gen_tcp in your case, judging from the traces) will be closed immediately.

It is not ignored, not in the sense of “nothing happens” or “won’t be handled”. It will hit the catchall clause of close and be handled like anything else that fails to hit a previous clause.

Now I don’t know if {timeout, 5000} is correct, or if {close, 5000} would be more appropriate. However, from a higher-up perspective, both modules are behaving correctly: the code calling tls_gen_connection:close specifies, in the first parameter, the reason why it should close. How tls_gen_connection:close handles that reason is no business of the calling code.

If the calling code was sending, as you suggest, something like ignore instead, it would be smarter than it should be, knowing internals about how close handles it.


The way I read it, it is not that no timeout is applied here, not in the sense of “wait forever”: if the catchall clause of close is hit, the underlying transport socket (gen_tcp in your case, judging from the traces) will be closed immediately.

It is not “wait forever”, but the specified timeout (5 seconds) is not respected either; I see a “system timeout” in prim_inet that may be hit (although I could not confirm it). All in all, the problem is that closing can take multiple minutes, and all HTTP communication using this pool of connections is paused in the meantime, which is not desirable.

It is not ignored, not in the sense of “nothing happens” or “won’t be handled”. It will hit the catchall clause of close and be handled like anything else that fails to hit a previous clause.

Correct; as I said, it happens pretty rarely (mostly a couple of times per day, with thousands of requests every hour, and only with this single service, which is not even the most used or the biggest one). Surely something bad happens with the connection due to some data specific to this service, but my point is about the timeout only. Should it be applied? If not, why not? Personally, I think there is probably a mistake and the caller intended the timeout to be used; at least, that is what sending {timeout, 5000} very clearly tells me.

Now I don’t know if {timeout, 5000} is correct, or if {close, 5000} would be more appropriate. However, from a higher-up perspective, both modules are behaving correctly: the code calling tls_gen_connection:close specifies, in the first parameter, the reason why it should close. How tls_gen_connection:close handles that reason is no business of the calling code.
If the calling code was sending, as you suggest, something like ignore instead, it would be smarter than it should be, knowing internals about how close handles it.

While I mostly agree with this, I also think {timeout, 5000} is not good: when I see tls_gen_connection:close({timeout, 5000}, ...), I think of something like “close the TLS connection using a timeout of 5000 milliseconds”. More than that, the interface does not specify this option, so it seems quite arbitrary. Given that ignored is not good for the reason you mentioned above, simply close or something similar would be a better fit.

Well… the catchall clause of tls_gen_connection:close calls tls_socket:close(Transport, Socket) with the given Transport and Socket. From the traces you posted, I see that in your case Transport is gen_tcp, and for that tls_socket:close just calls inet:close. This again should not block at all, unless the linger option of the socket is set to {true, Time} with Time > 0 (see the description of linger here) :woman_shrugging: No idea what the default is, though (the docs don’t say); I would assume {false, _}, or maybe something taken from the OS?
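For what it’s worth, checking (and overriding) the option on a live socket should be straightforward, something like this (a sketch; socket being the underlying gen_tcp socket):

# Read the current linger setting of the underlying TCP socket...
{:ok, [linger: linger]} = :inet.getopts(socket, [:linger])
IO.inspect(linger, label: "linger")

# ...and, if needed, force an immediate abort on close (sends a RST).
:ok = :inet.setopts(socket, linger: {true, 0})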

Well, I really don’t know what is intended; this is outside of my turf :sweat_smile: Let’s see what @ingela unearths when she has time :wink:


The linger socket option returns {false, 0}, so it is not related to that:

# 23:58:18.925 #PID<0.1529722.0> :ssl_gen_statem.init/1
# :prim_inet.getopt(#Port<0.98690>, :linger)

# 23:58:18.925 #PID<0.1529722.0> :ssl_gen_statem.init/1
# :prim_inet.getopt/2 -> {:ok, {false, 0}}

But something still blocks at the prim_inet level. I could not trace exactly where, because for some reason Rexbug (redbug) did not show all calls in prim_inet, but I could determine that prim_inet:close/1 is the call that blocks.

That is {false, 0}, but if you read the options:

    {false, _} - close/1 or shutdown/2 returns immediately, not waiting for data to be flushed, with closing happening in the background.

    {true, 0} - Aborts the connection when it is closed. Discards any data still remaining in the send buffers and sends RST to the peer.

    This avoids TCP's TIME_WAIT state, but leaves open the possibility that another "incarnation" of this connection being created.

    {true, Time} when Time > 0 - close/1 or shutdown/2 will not return until all queued messages for the socket have been successfully sent or the linger timeout (Time) has been reached.

This is what I was suggesting before. So, if you really want to abort immediately at both levels (OS and Erlang), you want {true, 0}; alternatively, you can adjust your OS TCP stack settings so that you simply don’t wait as long.
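If you want to try {true, 0} without touching the OS, it should also be possible to set it per pool when the connections are opened, e.g. via Finch’s connection options (a sketch I have not verified against your setup; names and sizes are placeholders):

{Finch,
 name: TesseractFinch,
 pools: %{
   "https://tesseract.internal.example" => [
     size: 200,
     count: 1,
     # passed through Mint down to the transport (gen_tcp / ssl)
     conn_opts: [transport_opts: [linger: {true, 0}]]
   ]
 }}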

Right, as I said, we will most probably try those, but the interesting thing is that the docs say that with {false, _}, close/1 or shutdown/2 returns immediately, which is not the case for us; or maybe that documentation does not apply to prim_inet.

Anyway, thanks for clarifying further what we can try. I am still waiting for an answer on the intended behaviour of ssl_gen_statem closing, as I want to get to the bottom of this in case it can cause rare bugs like the one we experienced.

Finally I had some time to look into this, and I think it is a bug that the handling of shutdown due to own alerts specifies {timeout, Timeout} instead of {close, Timeout}. That the timeout is as long as 2 minutes can be explained by the fact that we do not set any timeout in the “catch all” clause, and the TCP default is known to be 2 minutes on at least some systems. Looking at the code right now, I think this may have gone wrong in connection with making the default timeout used for own alerts configurable. I will make a PR to address this shortly.


ssl: Correct own alert handling by IngelaAndin · Pull Request #8901 · erlang/otp (github.com)
