Hello there,
I am Arnold, a fellow Elixir programmer. I have been working on multiple Elixir projects for approximately the last 10 years. I love the language, the BEAM, the community, and all the tooling around it.
Like almost any project, ours has some third-party integrations too, for which we now mostly use Finch for HTTP. Recently we observed that one of these integrations (namely Tesseract, deployed in our Kubernetes cluster, external to the calling system) was causing us issues, more specifically pool timeouts from NimblePool, which Finch uses for connection pooling.
At first, I did what any programmer unaware of a deeper problem would do and increased the pool size. Then we moved the Tesseract integration to a separate Finch instance with its own pool, and so on. The occurrence of these errors decreased drastically, but they still happened occasionally. Again, I did what many programmers would do: added more logging and more metadata.
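For context, the split into a separate Finch instance looked roughly like this (a sketch; the instance names, the URL, and the pool sizes are illustrative placeholders, not our exact values):

# In the application supervision tree: a dedicated Finch instance for the
# Tesseract integration, separate from the instance used by everything else.
children = [
  {Finch, name: AppFinch, pools: %{default: [size: 50]}},
  {Finch,
   name: TesseractFinch,
   pools: %{
     # a single pool of 200 connections towards the Tesseract service
     "https://tesseract.internal.example" => [size: 200, count: 1]
   }}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Callers then go through the dedicated instance:
# Finch.build(:post, "https://tesseract.internal.example/ocr", [], body)
# |> Finch.request(TesseractFinch)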
Soon I was faced with a situation that made no sense: we were receiving these pool timeout errors, yet when querying the pool state at the exact same moment we could see that most connections were still available (usually 150+ out of 200). This was the point where I started to think something was fishy, so I went down the rabbit hole, wanting to really get to the bottom of it. I added tracing that was triggered by the telemetry events produced by Finch.
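The hook itself was something along these lines (a simplified sketch; the handler module and what it starts when fired are placeholders, and the exact event names and metadata depend on the Finch version):

defmodule PoolTimeoutTracer do
  require Logger

  # Finch wraps pool checkout in [:finch, :queue, ...] telemetry events;
  # we listen for slow or failed checkouts and react to them.
  def attach do
    :telemetry.attach_many(
      "finch-queue-tracer",
      [[:finch, :queue, :stop], [:finch, :queue, :exception]],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(event, measurements, metadata, _config) do
    Logger.warning(
      "finch queue event=#{inspect(event)} " <>
        "measurements=#{inspect(measurements)} metadata=#{inspect(metadata)}"
    )

    # ...here we would enable lower-level tracing for a short window,
    # e.g. on NimblePool and the ssl processes...
  end
end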
First I traced NimblePool and the related Finch modules, and found out that the problem lies in closing the connections:
# 03:51:24.508 #PID<0.5256.0> NimblePool.init/1
# #PID<0.278690.0> :ssl_gen_statem.init/1 <<< {:"$gen_call", {#PID<0.5256.0>, #Reference<0.3935724038.2842689537.162284>}, {:close, 5000}}
...
# 03:53:27.071 #PID<0.5256.0> NimblePool.init/1
# <<< {#Reference<0.3935724038.2842689537.162284>, :ok}
As we can see, closing this connection took 2 minutes, even though the specified timeout is 5 seconds! During those 2 minutes NimblePool is blocked waiting for this connection to close and cannot handle any messages it receives, hence the timeout on connection checkout.
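To illustrate the mechanism (this is my own simplified sketch, not NimblePool's actual code): a pool-like GenServer that does a slow, synchronous cleanup inside its own process cannot answer checkout calls in the meantime, so callers time out even though plenty of connections are idle.

defmodule SlowPool do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Callers give the pool 1 second to hand out a connection.
  def checkout, do: GenServer.call(__MODULE__, :checkout, 1_000)

  @impl true
  def init(:ok), do: {:ok, %{available: 200}}

  @impl true
  def handle_call(:checkout, _from, state),
    do: {:reply, {:ok, state.available}, %{state | available: state.available - 1}}

  @impl true
  def handle_info(:close_connection, state) do
    # Stand-in for the connection close that should take at most 5 seconds
    # but in our case took ~2 minutes; the pool process is stuck here.
    Process.sleep(120_000)
    {:noreply, state}
  end
end

# {:ok, _} = SlowPool.start_link([])
# send(SlowPool, :close_connection)
# SlowPool.checkout()
# ** (exit) exited in: GenServer.call(SlowPool, :checkout, 1000)
#     ** (EXIT) time out   -- even though 200 connections are still "available"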
Since I figured the problem is not in NimblePool, I decided to go deeper and try to understand what exactly happens and how we could improve it (although at this point I had a feeling it was not something we could easily solve: the issue was pretty rare, happened only for this single integration, and was related to closing SSL connections):
# 12:12:10.683 #PID<0.250975.0> :ssl_gen_statem.init/1
# :ssl_gen_statem.terminate({:close, 5000}, :connection, ...)
...
# 12:12:39.090 #PID<0.250975.0> (:dead)
# :ssl_gen_statem.terminate/3 -> :ok
# 12:12:39.090 #PID<0.5565.0> NimblePool.init/1
# :ssl_gen_statem.close/2 -> :ok
Timestamps and durations differ between traces, as they were captured at different times, but the problem is the same.
Digging through the Erlang code, I found the function that terminates the connection (showing only the clause that is interesting for us, according to the trace above):
terminate(Reason, connection, #state{static_env = #static_env{
                                                     protocol_cb = Connection,
                                                     transport_cb = Transport,
                                                     socket = Socket},
                                     connection_states = ConnectionStates
                                    } = State) ->
    handle_trusted_certs_db(State),
    Alert = terminate_alert(Reason),
    %% Send the termination ALERT if possible
    catch Connection:send_alert_in_connection(Alert, State),
    Connection:close({timeout, ?DEFAULT_TIMEOUT}, Socket, Transport, ConnectionStates);
Looking through the code, at first I thought that the parts which actually take a timeout would be the problematic ones (?DEFAULT_TIMEOUT is 5 seconds). Analysing further, I found out that Connection is tls_gen_connection and, counter-intuitively, tls_gen_connection:close/4 is the one that takes most of the time:
# 20:38:33.425 #PID<0.563206.0> :ssl_gen_statem.init/1
# :ssl_gen_statem.terminate({:close, 5000}, :connection, ...)
# 20:38:33.430 #PID<0.563206.0> :ssl_gen_statem.init/1
# :tls_gen_connection.send_alert_in_connection({:alert, 1, 0, %{line: 1853, file: ~c"ssl_gen_statem.erl", mfa: {:ssl_gen_statem, :terminate_alert, 1}}, :undefined, :undefined}, ...)
... 5 seconds timeout ...
# 20:38:38.432 #PID<0.563206.0> :ssl_gen_statem.init/1
# :tls_gen_connection.send_alert_in_connection/2 -> {:throw, {:stop, {:shutdown, :own_alert}, {:state, {:static_env, :client, :gen_tcp, :tls_gen_connection, :tcp, :tcp_closed, :tcp_error, :tcp_passive, ~c"k8sgazo9v3lvvg.octoscreen.de", 443, #Port<0.45959>, #Reference<0.870037478.4226154500.13393>, #Reference<0.870037478.4226154500.13392>, :ssl_client_session_cache_db, {:ssl_crl_cache, ...}, {}, {}, 0}, %{}, {:protocol_buffers, {:undefined, {[], 0, []}}, "", []}, {[""], 0, []}}}}
# 20:38:38.432 #PID<0.563206.0> :ssl_gen_statem.init/1
# :tls_gen_connection.close({:timeout, 5000}, "<socket>", :gen_tcp, ...)
...
# 20:40:22.901 #PID<0.563206.0> (:dead)
# :tls_gen_connection.close/4 -> :ok
# 20:40:22.901 #PID<0.563206.0> (:dead)
# :ssl_gen_statem.terminate/3 -> :ok
Checking the function in tls_gen_connection (again, showing only the relevant clauses):
%% User closes or recursive call!
close({close, Timeout}, Socket, Transport = gen_tcp, _) ->
    tls_socket:setopts(Transport, Socket, [{active, false}]),
    Transport:shutdown(Socket, write),
    _ = Transport:recv(Socket, 0, Timeout),
    ok;
...
%% Other
close(_, Socket, Transport, _) ->
    tls_socket:close(Transport, Socket).
As we can see from the trace, close in tls_gen_connection is called with {timeout, 5000}, ..., but there is no specific clause matching this argument, so it falls through to the %% Other case.
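To make the dispatch explicit, here is a tiny Elixir sketch of my own (not OTP code) mirroring the two clauses above:

defmodule CloseDispatch do
  # mirrors close({close, Timeout}, Socket, gen_tcp, _): graceful shutdown,
  # then recv with the given timeout
  def close({:close, timeout}), do: {:graceful_shutdown, timeout}

  # mirrors the %% Other clause: plain socket close, the timeout is not used
  def close(_other), do: :plain_close
end

CloseDispatch.close({:close, 5000})   #=> {:graceful_shutdown, 5000}
CloseDispatch.close({:timeout, 5000}) #=> :plain_close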
Is this intentional? Or should it maybe be called with {close, Timeout} instead of {timeout, Timeout}?