`open_port` and zombie processes

Elixir has a disclaimer about ports and zombie OS processes (Port — Elixir v1.15.7): unless the spawned external program checks whether its stdio has been closed, it is not automatically terminated. I’m sure there are situations where this is desirable, but there are a ton of programs out there that cannot easily be interacted with through ports because of this problem (i.e. not without things like wrapper shell scripts).
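A minimal reproduction in the Erlang shell, for the curious (python3 -m http.server just stands in for any program that never looks at its stdin):

P = open_port({spawn, "python3 -m http.server"}, []).
port_close(P).
%% The port is gone, but the Python process lives on: it never notices that
%% its stdin/stdout were closed, so it even survives the VM itself exiting.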

I wonder if being able to opt-in to such automatic termination was ever discussed?

Something like this perhaps?

- P = open_port({spawn, "python3 -m http.server"}, []).
+ P = open_port({spawn, "python3 -m http.server"}, [kill]).
9 Likes

@wojtekmach just use the battle-tested erlexec. You’ll be able to control stdin/stdout and many other things. No more zombies…
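Roughly like this (a sketch from memory; double-check the option names against the erlexec docs for the version you install):

exec:start().
{ok, Pid, OsPid} = exec:run("python3 -m http.server",
                            [monitor, stdout, stderr, kill_group]).
%% erlexec runs the command under its own middleman port program; kill_group
%% makes it terminate the child's whole process group on shutdown, and the
%% middleman cleans up if the VM itself goes away.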

1 Like

Hey @wojtekmach, is this just for the case where the VM crashes (and specifically crashes in a way where there is still a moment for recourse)?

2 Likes

I think we have talked about it before, but it was a long time ago and I can’t recall if there was a technical reason not to do it, or just lack of priority/time.

That being said, I think that now that we have erl_child_setup on Unix, we could employ a method similar to the one erlexec uses to make sure port programs terminate properly. On Windows it seems you can join processes into the same job object, and then they will all be terminated when the emulator terminates.

7 Likes

I wrote draft PR #9453, which solves this issue on some platforms (those with prctl) by always killing all spawned child processes, by process group, whenever the VM terminates. In my case the child was a long-running rsync that should only keep running while it is managed under the BEAM. I started with a one-off port service wrapper in my downstream project, but there are smells, such as the emphatic documentation in Elixir, that point in the direction of a language enhancement.

My naive assumption is that in most cases the children should be cleaned up. If that is desirable, then the default could be to kill them, and the few exceptions would be easy enough to handle by isolating the child or grandchild in a new process group. I don’t have enough BEAM experience to judge whether this is true, or to reason about how the migration to such a default might be accomplished, i.e. whether existing applications that hit this edge case (unplanned BEAM destruction) are effectively broken anyway.
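As a hedged sketch of such an opt-out, assuming the util-linux setsid binary is available: launching the child through setsid puts it in its own session and process group, out of reach of any group-wide kill.

P = open_port({spawn_executable, os:find_executable("setsid")},
              [{args, ["python3", "-m", "http.server"]}]).
%% setsid detaches the child into a new session/process group, so a default
%% "kill my process group at VM exit" would deliberately leave it running.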

+1 that the intermediate erl_child_setup already provides much of the machinery needed to make this possible in a straightforward way!

It’s not clear to me whether erlexec can forward every case of abnormal VM termination; it seems to trap several specific signals and mostly wait for pipe failures. But the library has a lot of wisdom to offer, so perhaps it is already showing the most portable approach. Its exec:run supports a kill_group flag similar to wojtekmach’s suggestion, which hints at such an option being useful.

5 Likes

After looking at the erlexec library, I realized that erl_child_setup already includes sufficient code to detect parent VM termination: simply reading from the beam command pipe and detecting an error or close event. I’ve updated my patch to rely on that existing code rather than on prctl; it still works and is much more portable.

2 Likes

The PR is ready to go, and after a bit of discussion the scope was limited to a small step forward: at VM exit (clean or hard), Erlang (specifically, the forker subprocess) will always try to kill all spawned children with SIGTERM on Unix.

The Windows implementation will take some more effort.

A kill or kill_group (entire child process group) flag to open_port would be an interesting possibility, but making this the default behavior of port_close seemed a bit too disruptive for now. Existing tests suggest that application developers may be relying on the semi-synchronous, polite behavior of Port ! {self(), close}, and making that immediately terminate the child could break existing usages.
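For context, the polite protocol in question (the message-based port protocol documented in the erlang module) is the owner asking the port to close and waiting for the acknowledgement:

Port ! {self(), close},
receive
    {Port, closed} -> ok
end.
%% Today this flushes and closes the port, but the external program keeps
%% running unless it reacts to its stdio closing; whether this should also
%% terminate the OS process is exactly the behavior change being debated.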

There was also some general grumbling that the external process API deserves a broader overhaul, which can happen in follow-up work.

1 Like

Several people spoke in favor of giving port_close (and, at the low level, terminate_port) a default behavior of killing the spawned process, and I’m also leaning in that direction. In case others are thinking about this problem, I wanted to share an excellent example of why it could be risky: closing is sometimes used for its side effect of no longer receiving messages from the port-connected external program. In peer.erl, when the connection option is not standard_io, the peer node is launched with the -detached flag and further communication relies on distribution. The peer’s port is closed, and any message subsequently received from it is treated as an error condition.

In this simple case, I believe port_close can just as easily be replaced by unlink. But maybe there are cases where the port will still receive miscellaneous messages?

Here’s a diff which lets all tests pass, for the curious:

--- a/lib/stdlib/src/peer.erl
+++ b/lib/stdlib/src/peer.erl
@@ -581,10 +581,8 @@ init([Notify, Options]) ->
             _ ->
                 Port = open_port({spawn_executable, Exec},
                                  [{args, FinalArgs}, {env, Env}, hide, binary]),
-                %% peer can close the port before we get here which will cause
-                %%  port_close to throw. Catch this and ignore.
-                catch erlang:port_close(Port),
-                receive {'EXIT', Port, _} -> undefined end
+                unlink(Port),
+                receive {'EXIT', Port, _} -> undefined after 0 -> undefined end
         end,
 
     %% Remove the default 'halt' shutdown option if present; the default is