Why does Erlang's SSH client fail if there's an old 'EXIT' message in the caller's mailbox?

Hi,

If there’s an {‘EXIT’, _, normal} message in a process’ mailbox when calling ssh:connect(), the SSH library behaves differently to when the mailbox is empty. This is unexpected.

Workaround: don’t start SSH from a process that’s trapping exits, and make sure there aren’t any old trapped exits hanging around.

Test case below. In OTP 23, both fails() and works() behave the same. That’s the expected behaviour. In OTP 26, fails() behaves differently to works().

Matthias


%% Throwaway to debug SSH problems
%%
%% The Erlang SSH library doesn't seem to like old 'EXIT' messages left
%% lying around in the message queue. The mssage gets matched in
%% ssh_connection_handler:handshake/3; it looks like this behaviour was
%% added by 1963f9501
%%
%% I can reproduce the problem in otp_src_26.0.2
%% I cannot reproduce the problem in otp_src_23.1
%%
%%
-module(test_ssh_crash).

-export([works/0, fails/0]).

works() ->
    crypto:start(),
    ssh:start(),
    {ok, _SSH} = ssh:connect("localhost", 22,
			    [{user, "bogus"}, {password, "bogus"},
			     {silently_accept_hosts, true}]).

fails() ->
    self() ! {'EXIT', self(), normal},
    works().

What exactly do you mean by “behave differently”? :thinking:

Maybe you should raise an issue with OTP for this matter?

1 Like

Oh, sorry. Forgot that not everyone has a working Erlang system to try it on. Here’s what happens when I run the code I posted above:

Erlang/OTP 26 [erts-14.0.2] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit:ns]

Eshell V14.0.2 (press Ctrl+G to abort, type help(). for help)
1> c(test_ssh_crash).
{ok,test_ssh_crash}
2> test_ssh_crash:works().
=INFO REPORT==== 4-Mar-2024::14:59:02.255336 ===
Erlang SSH client version: 5.0.1 (OpenSSL 3.0.11 19 Sep 2023).
Address: 127.0.0.1:39566
Peer server version: "SSH-2.0-OpenSSH_9.2p1 Debian-2+deb12u2"
Peer address: 127.0.0.1:22
Disconnects with code = 14 [RFC4253 11.1]: Unable to connect using the available authentication methods
State = {userauth,client}
Module = ssh_fsm_userauth_client, Line = 96.
Details:
  User auth failed for: "bogus"

** exception error: no match of right hand side value {error,
                                                       "Unable to connect using the available authentication methods"}
     in function  test_ssh_crash:works/0 (test_ssh_crash.erl, line 19)
3> test_ssh_crash:fails().
** exception error: no match of right hand side value {error,{exit,normal}}
     in function  test_ssh_crash:works/0 (test_ssh_crash.erl, line 19)
4>

So, in the works() case, the Erlang SSH implementation correctly figures out that you can’t log in with a bogus username/password.

In the fails() case, the Erlang SSH implementation just crashes without properly attempting a log in. That’s not what I expect it to do.

Can someone from OTP comment on whether you prefer this type of thing as a forum question first, or if you’d rather I go straight to reporting a bug?

Why do you even have EXIT messages waiting in your mbox? Can you share a bit more on the use case? What’s the purpose of keeping exit messages in your mailbox?

Without knowing anything about our ssh implementation this do look like a bug,
and those are easier to keep track on in github issues.

If we don’t think it is a bug we will just close it :slight_smile:

2 Likes

(comment to mmin)

Any process that traps exits, i.e.

erlang:process_flag(trap_exit, true)

is going to get ‘EXIT’ messages, and you’re not going to know exactly when they arrive, in general.

Wildcard-matching them, as ‘ssh_connection_handler’ does, is a surprising thing for a library to do.

I think the problem is that there is already an EXIT message in the processes message queue so when it checks what has arrived, an ok or error, then it finds the EXIT message. There is no way a process can find out who sent the message as all it sees is just the message itself. And there is no way around this.

It is something you have to keep track of and make sure there are no “leftover” messages in the message queue.

1 Like

Are you confusing ssh_connection_handler.erl doing the wrong thing with my test code doing the wrong thing? My code does the wrong thing because… that’s the point of a test case. It’s supposed to demonstrate the crash in a way where it’s obvious what’s triggering the problem.

ssh_connection_handler.erl should not care if a process unrelated to SSH exits normally.

Or, running it in reverse, let’s try your proposed solution, which is “make sure there are no leftover messages in the message queue”. We’ll do that with a flush:

robert() ->
    crypto:start(),
    ssh:start(),
    erlang:process_flag(trap_exit, true),
    spawn_link(fun() -> timer:sleep(40) end),
    timer:sleep(40),
    flush(),
    {ok, _SSH} = ssh:connect("localhost", 22,
			    [{user, "bogus"}, {password, "bogus"},
			     {silently_accept_hosts, true}]).

flush() ->
    receive
        _ -> flush()
    after 0 ->
            done
    end.

Here’s what it does:

test_ssh_crash:robert().
** exception error: no match of right hand side value {error,{exit,normal}}
     in function  test_ssh_crash:robert/0 (test_ssh_crash.erl, line 34)

That’s bad. It should have complained that the login credentials were wrong (assuming we’re running on a machine with an SSH server that has password auth enabled).

Now, you could point out that my test case is unnecessarily complicated and racy, and simplify it:

robert() ->
    crypto:start(),
    ssh:start(),
    erlang:process_flag(trap_exit, true),
    spawn_link(fun() -> ok end),
    {ok, _SSH} = ssh:connect("localhost", 22,
			    [{user, "bogus"}, {password, "bogus"},
			     {silently_accept_hosts, true}]).

And now we’re pretty much back to what I posted in the first place. Sure, test case looks contrived. Such is life for minimal test cases.

But: it’s not reasonable for an SSH client to (quietly!) require that the calling code isn’t trapping exits and doesn’t have any linked processes that might exit during the handshake phase of the SSH connection.


I can think of a few possible fixes:

  1. Documentation fix. Put a warning in the docs that the SSH client can’t handle being run in an environment where exits are trapped and a linked process might exit, even normally. Easy but it’s going to catch someone out.

  2. Precondition check. The SSH client checks there are no ‘EXIT’ messages in the queue at startup, plus checks that trap_exit is false. Better, but hacky.

  3. Wrap the SSH client in another process, i.e. make sure the precondition is true. Also hacky.

  4. Fix ssh_connection_handler so it doesn’t go matching ‘EXIT’ messages for processes that are none of its business. A bit fiddly to implement and then difficult to test well enough to be confident that it doesn’t break something else.

I didn’t submit a patch because… I’m lazy.

This is indeed a bug, and the only feasible option is to fix it along the lines of (4). (3) can be used as a temporary workaround until then.

For the record, this is the issue that was opened at OTP for this matter: In OTP 26, Erlang's SSH client fails to connect if it receives an {'EXIT', Pid, normal} message from an unrelated process. · Issue #8223 · erlang/otp · GitHub.

1 Like

… and this is my attempt at a fix: `ssh`: Fix stealing of `'EXIT'` messages if a client is trapping exits by Maria-12648430 · Pull Request #8226 · erlang/otp · GitHub

1 Like

Sorry, I think you read way to much in to my comment. I was not being specific here but just trying to point out a common error which can be difficult to realise what is happening. That is that you can have left over messages in your message-queue which don’t go a way by themselves and can match instead of the messages you really want.

2 Likes