After upgrading to OTP 25, SSL calls to services time out during the SSL handshake

Hi,

This is a cross-post from the Elixir forum, since both forums have experts.

Let me first preface this by saying I don’t think the problem lies in Erlang/Elixir.

Here’s our setup:

Production: Deployed on AWS as a cluster of k8s managed Docker containers, each running a single Erlang VM.

Development: Single Docker container (same specification as production) or simply running natively on a Mac.

The application periodically makes SSL client REST calls to a couple of servers.

Calls are made via HTTPoison, which uses hackney; certificates are managed by certifi.

We were running OTP 22 and Elixir 1.11.x; we recently decided to upgrade to OTP 25 and Elixir 1.14. This also involved upgrading most of the dependencies.

Here is the problem: SSL calls to the services all time out during the SSL handshake. Here’s what is odd:

  1. I can make SSL calls to new websites
  2. Everything works in the development environment
  3. DNS resolution and TCP connectivity work

I simplified things a little with this snippet:

  # The match_fun below accepts any certificate name, so hostname checking is effectively off.
  opts = [
    {:log_level, :debug},
    {:verify, :verify_peer},
    {:customize_hostname_check, [match_fun: fn _ip, _x -> true end]},
    {:cacerts, :certifi.cacerts()}
  ]

  with {:ok, port} <- :gen_tcp.connect(%{addr: {104, 18, 128, 69}, port: 443, family: :inet}, [], 5000),
       _ <- IO.puts("Connected on 443"),
       _ <- :inet.peername(port) |> IO.inspect(label: :peer_info),
       {:ok, ssl_port} <- :ssl.connect(port, opts, 5000) do
    IO.inspect(:ssl.getstat(ssl_port), label: :success)
    # Close through :ssl so the peer receives a TLS close_notify.
    :ssl.close(ssl_port)
  else
    error -> IO.inspect(error, label: :error)
  end

The :gen_tcp.connect works, the :ssl.connect fails. I have tried a number of option combinations.

I enabled ssl debugging and got this:

>>> TLS 1.3 Handshake, ClientHello
[{client_version,{3,3}},
 {random,
     <<134,152,230,87,40,72,129,119,150,175,252,27,246,202,8,211,103,12,48,
       226,219,167,183,191,234,12,174,147,127,25,155,53>>},
 {session_id,<<>>},
 {cookie,undefined},
 {cipher_suites,
     ["TLS_EMPTY_RENEGOTIATION_INFO_SCSV","TLS_AES_256_GCM_SHA384",
      "TLS_AES_128_GCM_SHA256","TLS_CHACHA20_POLY1305_SHA256",
      "TLS_AES_128_CCM_SHA256","TLS_AES_128_CCM_8_SHA256",
      "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_ECDSA_WITH_AES_256_CCM","TLS_ECDHE_ECDSA_WITH_AES_256_CCM_8",
      "TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384",
      "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384",
      "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
      "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256",
      "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_ECDSA_WITH_AES_128_CCM","TLS_ECDHE_ECDSA_WITH_AES_128_CCM_8",
      "TLS_ECDH_ECDSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDH_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA384",
      "TLS_ECDH_RSA_WITH_AES_256_CBC_SHA384",
      "TLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDH_RSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256",
      "TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA256",
      "TLS_ECDH_RSA_WITH_AES_128_CBC_SHA256",
      "TLS_DHE_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_DHE_DSS_WITH_AES_256_GCM_SHA384",
      "TLS_DHE_RSA_WITH_AES_256_CBC_SHA256",
      "TLS_DHE_DSS_WITH_AES_256_CBC_SHA256",
      "TLS_DHE_RSA_WITH_AES_128_GCM_SHA256",
      "TLS_DHE_DSS_WITH_AES_128_GCM_SHA256",
      "TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256",
      "TLS_DHE_RSA_WITH_AES_128_CBC_SHA256",
      "TLS_DHE_DSS_WITH_AES_128_CBC_SHA256",
      "TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA",
      "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA",
      "TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA",
      "TLS_ECDH_RSA_WITH_AES_256_CBC_SHA",
      "TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA",
      "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA",
      "TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA",
      "TLS_ECDH_RSA_WITH_AES_128_CBC_SHA","TLS_DHE_RSA_WITH_AES_256_CBC_SHA",
      "TLS_DHE_DSS_WITH_AES_256_CBC_SHA","TLS_DHE_RSA_WITH_AES_128_CBC_SHA",
      "TLS_DHE_DSS_WITH_AES_128_CBC_SHA"]},
 {compression_methods,[0]},
 {extensions,
     #{alpn => undefined,
       client_hello_versions => {client_hello_versions,[{3,4},{3,3}]},
       ec_point_formats => {ec_point_formats,[0]},
       elliptic_curves => {supported_groups,[x25519,x448,secp256r1,secp384r1]},
       key_share =>
           {key_share_client_hello,
               [{key_share_entry,x25519,
                    <<43,88,142,125,18,179,171,46,174,221,187,47,152,87,31,
                      192,187,126,240,39,122,23,222,102,173,223,129,197,121,
                      230,203,75>>}]},
       max_frag_enum => undefined,next_protocol_negotiation => undefined,
       renegotiation_info => {renegotiation_info,undefined},
       signature_algs =>
           {signature_algorithms,
               [eddsa_ed25519,eddsa_ed448,ecdsa_secp521r1_sha512,
                ecdsa_secp384r1_sha384,ecdsa_secp256r1_sha256,
                rsa_pss_pss_sha512,rsa_pss_pss_sha384,rsa_pss_pss_sha256,
                rsa_pss_rsae_sha512,rsa_pss_rsae_sha384,rsa_pss_rsae_sha256,
                {sha512,ecdsa},
                {sha512,rsa},
                {sha384,ecdsa},
                {sha384,rsa},
                {sha256,ecdsa},
                {sha256,rsa},
                {sha224,ecdsa},
                {sha224,rsa},
                {sha,ecdsa},
                {sha,rsa},
                {sha,dsa}]},
       signature_algs_cert => undefined,sni => undefined,srp => undefined}}]
writing (269 bytes) TLS 1.2 Record Protocol, handshake
0000 - 16 03 03 01 08 01 00 01  04 03 03 86 98 e6 57 28    ..............W(
0010 - 48 81 77 96 af fc 1b f6  ca 08 d3 67 0c 30 e2 db    H.w........g.0..
0020 - a7 b7 bf ea 0c ae 93 7f  19 9b 35 00 00 62 00 ff    ..........5..b..
0030 - 13 02 13 01 13 03 13 04  13 05 c0 2c c0 30 c0 ad    ...........,.0..
0040 - c0 af c0 24 c0 28 cc a9  cc a8 c0 2b c0 2f c0 ac    ...$.(.....+./..
0050 - c0 ae c0 2e c0 32 c0 26  c0 2a c0 2d c0 31 c0 23    .....2.&.*.-.1.#
0060 - c0 27 c0 25 c0 29 00 9f  00 a3 00 6b 00 6a 00 9e    .'.%.).....k.j..
0070 - 00 a2 cc aa 00 67 00 40  c0 0a c0 14 c0 05 c0 0f    .....g.@........
0080 - c0 09 c0 13 c0 04 c0 0e  00 39 00 38 00 33 00 32    .........9.8.3.2
0090 - 01 00 00 79 00 0d 00 2e  00 2c 08 07 08 08 06 03    ...y.....,......
00a0 - 05 03 04 03 08 0b 08 0a  08 09 08 06 08 05 08 04    ................
00b0 - 06 03 06 01 05 03 05 01  04 03 04 01 03 03 03 01    ................
00c0 - 02 03 02 01 02 02 00 33  00 26 00 24 00 1d 00 20    .......3.&.$...
00d0 - 2b 58 8e 7d 12 b3 ab 2e  ae dd bb 2f 98 57 1f c0    +X.}......./.W..
00e0 - bb 7e f0 27 7a 17 de 66  ad df 81 c5 79 e6 cb 4b    .~.'z..f....y..K
00f0 - 00 0a 00 0a 00 08 00 1d  00 1e 00 17 00 18 00 0b    ................
0100 - 00 02 01 00 00 2b 00 05  04 03 04 03 03             .....+.......

error: {:error, :timeout}

We simply never get a response to the ClientHello, the first message of the TLS handshake.

I’m thinking that there is some k8s or other caching going on, and I’ll ask our k8s expert tomorrow.

I was wondering if anyone else has hit this issue?

Thanks

We just upgraded to OTP 25.1 and Elixir 1.14. Parts of our application do use hackney, IIRC. Regardless, if there were an underlying SSL issue in OTP itself, I would expect TLS problems across the board. We haven’t experienced any issues of any kind.

A cheap experiment you could try is to use an alternative HTTP client to rule out hackney. The simplest would be httpc, but there is also Mint (if you don’t need a pool) or Finch (if you need a pool to carry out experiments).
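
As a rough sketch of that experiment, the following uses httpc with peer verification and no hackney in the path. The module name HttpcProbe and any URL you pass to it are placeholders, and the default trust store comes from :public_key.cacerts_get/0, which assumes OTP 25 or later; pass :certifi.cacerts() instead if you prefer the bundle you already ship.

```elixir
defmodule HttpcProbe do
  # Hypothetical helper module, not part of any library.

  # TLS options for httpc. The default argument reads the OS trust store
  # (OTP 25+); pass your own list of DER certificates to override it.
  def ssl_opts(cacerts \\ :public_key.cacerts_get()) do
    [
      verify: :verify_peer,
      cacerts: cacerts,
      # Standard HTTPS hostname matching, including wildcard certificates.
      customize_hostname_check: [
        match_fun: :public_key.pkix_verify_hostname_match_fun(:https)
      ]
    ]
  end

  # Issues a GET with 5 second connect and request timeouts.
  def get(url) when is_binary(url) do
    {:ok, _} = Application.ensure_all_started(:inets)
    {:ok, _} = Application.ensure_all_started(:ssl)

    http_opts = [ssl: ssl_opts(), timeout: 5_000, connect_timeout: 5_000]
    :httpc.request(:get, {String.to_charlist(url), []}, http_opts, [])
  end
end
```

If HttpcProbe.get/1 against one of the failing services also returns a timeout error, hackney is probably not the culprit.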

Edit:

Important to note: we do not run on k8s; we’re on bare metal.

I tried just using gen_tcp and ssl directly. Also some sites do work. My suspicion is that it is k8s related. I just need to prove it.

Have you tried forcing TLS v1.2?
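
For reference, a minimal sketch of forcing TLS 1.2 from Elixir with :ssl directly. The module name Tls12Probe is made up for illustration, the host is whatever service you are testing, and :public_key.cacerts_get/0 assumes OTP 25 or later.

```elixir
defmodule Tls12Probe do
  # Hypothetical helper module, not part of any library.

  # Verified connection options that offer TLS 1.2 only.
  def opts(cacerts) do
    [
      verify: :verify_peer,
      cacerts: cacerts,
      versions: [:"tlsv1.2"],
      customize_hostname_check: [
        match_fun: :public_key.pkix_verify_hostname_match_fun(:https)
      ]
    ]
  end

  # Connects by hostname, so SNI is sent, with a 5 second timeout.
  def connect(host, port \\ 443) when is_binary(host) do
    {:ok, _} = Application.ensure_all_started(:ssl)
    :ssl.connect(String.to_charlist(host), port, opts(:public_key.cacerts_get()), 5_000)
  end
end
```

With log_level: :debug added to the options, the ClientHello should then list only {3,3} under client_hello_versions, if that extension is sent at all.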

I also suggest tcpdump and/or WireShark to spy on communications and see whether there is any unexpected packet flying in or out.

Yes, I tried forcing TLS 1.2; it made no difference.

Yeah, capturing traffic is difficult. It’s a pretty high-security service. Pretty much every port except things like epmd is shut down.

One thing I have noticed is that session_id is an empty binary. Not sure if that counts for anything?

OpenSSL is a good tool for troubleshooting SSL certificate issues. I tried the command below against your endpoint and got a successful response:

openssl s_client -connect 104.18.128.69:443

You can also pass -CAfile to point to your CA, plus a host of other options.

For my own future reference: openssl s_server allows you to start a TLS server with a key and certificate, and you can connect to it using openssl s_client.

Actually, that was helpful. The “bad” websites don’t even work just using openssl. However, the OpenSSL library we are using is pretty old, although it worked with OTP 22. My thought is that there may be some compatibility issues. I’m going to upgrade it.

So, a quick update. Apparently there is a Squid proxy in the way, which no one told us engineers about, and it objected to the newer ciphers in OTP 25.

Which leads me to another question. Can the list of ciphers we support be specified on startup?

Unfortunately, no. The parameters that can be configured globally are documented here. The list includes TLS and DTLS protocol versions, but not ciphers…
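
The ciphers can still be restricted per connection through the ciphers option of :ssl. A minimal sketch, assuming the proxy accepts ECDHE with AES-GCM over TLS 1.2; that allow-list and the module name CipherProbe are illustrative guesses, not something confirmed for the Squid proxy in this thread:

```elixir
defmodule CipherProbe do
  # Hypothetical helper module, not part of any library.

  # Illustrative allow-list; replace with what the proxy is known to accept.
  @allowed_ciphers [:aes_128_gcm, :aes_256_gcm]
  @allowed_key_exchanges [:ecdhe_rsa, :ecdhe_ecdsa]

  # Default TLS 1.2 suites, reduced to the allow-lists above.
  def suites do
    :ssl.cipher_suites(:default, :"tlsv1.2")
    |> :ssl.filter_cipher_suites(
      key_exchange: &(&1 in @allowed_key_exchanges),
      cipher: &(&1 in @allowed_ciphers)
    )
  end

  # Options for :ssl.connect that offer only the filtered suites.
  def opts(cacerts) do
    [verify: :verify_peer, cacerts: cacerts, versions: [:"tlsv1.2"], ciphers: suites()]
  end
end
```

The resulting list can be passed to :ssl.connect directly, or, as far as I know, through the ssl request option of HTTPoison. Keeping the list in one function at least gives a single place to change it, even though it is not a global setting.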

We had a similar problem with the SSL handshake hanging and timing out. It happened with OTP 25 and some versions of OTP 24.x, but not 24.0 or 23.0. It happened after some back and forth between the client and server during the TLS negotiations, just before the part where the server would normally provide its certificate.

We resolved the issue by forcing the connection to use TLS v1.2.

ssl:start().
ssl:connect("localhost", 8443, [{log_level, debug}, {versions, ['tlsv1.2']}]).