EPMD becomes zombie when starting my release inside Docker

Hi guys,

It all started when I saw this message upon login on my Ubuntu server:
There are 2 zombie processes.

I tried to locate these zombies:

# ps axo stat,ppid,pid,comm | grep -w defunct
Zs     58904   59064 epmd <defunct>
Z      58904   59065 epmd <defunct>

The ppid=58904 points to my Erlang release running inside the Docker container:

# ps auxwww | grep 58904
ubuntu     58904  8.8  0.2 2362552 182348 ?      Ssl  16:03   0:28 /opt/taurus/bin/taurus -Bd -Bi -- -root /opt/taurus -bindir /opt/taurus/erts-14.1.1/bin -progname opt/taurus/bin/taurus -- -home /home/ubuntu -- -noshell -noinput -boot /opt/taurus/releases/latest/start -mode embedded -boot_var SYSTEM_LIB_DIR /opt/taurus/lib -config /opt/taurus/config/docker/app.config -sname taurus -setcookie awesome_cookie -- -- foreground --
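
A minimal sketch for anyone retracing these steps: matching on the process state column (a leading Z) avoids depending on the <defunct> label text. The function names filter_zombies and list_zombies are made up for illustration and are not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Print "ppid pid comm" for every process whose state starts with Z (zombie).
# Expects `ps axo stat,ppid,pid,comm` style output on stdin.
filter_zombies() {
    awk '$1 ~ /^Z/ { print $2, $3, $4 }'
}

# Live usage: list zombies on this machine together with their parent PID.
list_zombies() {
    ps axo stat,ppid,pid,comm | filter_zombies
}
```

Running list_zombies on the host and then looking up each reported ppid with ps gives the same information as the two commands above.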

I’m using nothing special in my vm.args:

-sname taurus
-setcookie awesome_cookie

I narrowed it down: the zombies appear early, less than 5 seconds after the container starts.

Inside the container, I see this:

$ ps auxwww | grep epmd | grep -v grep
ubuntu        78  0.0  0.0   3740   100 ?        S    16:03   0:00 /opt/taurus/erts-14.1.1/bin/epmd -daemon
ubuntu       132  0.0  0.0      0     0 ?        Zs   16:03   0:00 [epmd] <defunct>
ubuntu       133  0.0  0.0      0     0 ?        Z    16:03   0:00 [epmd] <defunct>

In addition to the 2 defunct epmd processes, a new one is happily running.

This might be related to this RabbitMQ issue. But honestly, I’ve no clue.

Environment:

Ubuntu 22.04 LTS
Erlang 26.1.2

Help appreciated as I’m new to using Erlang within Docker.


I also tried the suggestion here, but I still get these epmd zombies:

-kernel inet_dist_use_interface 127.0.0.1

You don’t need epmd: Running Erlang Releases without EPMD on OTP 23.1+ · Erlware Blog


@tsloughter Thanks for the pointer. What about the following settings in my vm.args?

-env ERL_DIST_PORT 4369
-erl_epmd_port 4369
-start_epmd false

Your solution worked perfectly.

My container is totally isolated: besides me manually connecting to it with remote_console (relx), it doesn’t connect to any other node.

Would you recommend the above in PROD?

@tsloughter When trying to connect to my node inside the container, I get this error:

$ echo $ERL_DIST_PORT
4369
$ bash -x /opt/taurus/bin/taurus remote_console
[...]
+ erl_rpc erlang is_alive
+ echo 'Node is not running!'

What did I do wrong?

I’ve built my release using this rebar3 version:

$ rebar3 --version
rebar 3.22.0 on Erlang/OTP 26 Erts 14.1.1

Looks right. I’d simplify the vm.args just in case, though. You don’t need any of that: the taurus script generated by rebar3/relx will handle everything if you set ERL_DIST_PORT.

But note you can’t do this with -env ERL_DIST_PORT 4369 in vm.args, since that is an argument to the Erlang VM and thus won’t be picked up by the taurus shell script. Since you are using Docker, you’ll want to set it in the Dockerfile, with -e on docker run, or under environment if using docker compose.
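
A small guard along these lines, run inside the container before the release starts, can confirm that ERL_DIST_PORT really is in the process environment rather than only in vm.args. This is a sketch, not something relx generates, and the function name require_dist_port is hypothetical.

```shell
#!/bin/sh
# Succeed only if ERL_DIST_PORT is set in the environment and is numeric.
# A value given via "-env" in vm.args never shows up here, because that
# flag is read by the Erlang VM, not by the shell.
require_dist_port() {
    case "${ERL_DIST_PORT:-}" in
        '' | *[!0-9]*)
            echo "ERL_DIST_PORT must be set to a port number" >&2
            return 1
            ;;
        *)
            echo "ERL_DIST_PORT=$ERL_DIST_PORT"
            ;;
    esac
}
```

In an entrypoint script this could be chained as require_dist_port && exec /opt/taurus/bin/taurus foreground, so the container fails fast when the variable is missing.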

@tsloughter That’s correct. I’ve set ERL_DIST_PORT in my Dockerfile, as described in your post. But I’m still unable to connect to my node using remote_console :frowning:

Hm, weird. It is a tough one to debug without being able to play with it. Is it public by chance, or can you reproduce it in a public repo?

@tsloughter not publicly accessible unfortunately. Just for me to understand: in your case, you disabled epmd and were able to remote_console into the node inside Docker?
If yes, I must be doing something wrong.

Yeah. And setting ERL_DIST_PORT will automatically add -start_epmd false to the args, so there is no need to disable epmd manually.

@tsloughter it seems that erl_call can’t connect to my node.

$  echo $ERL_DIST_PORT
4369
$ /bin/sh -x /opt/taurus/bin/taurus remote_console
[...]
/opt/taurus/erts-14.1.1/bin/erl_call -R -c awesome_cookie -address 4369 -timeout 60 -a erlang is_alive
erl_call: failed to connect to node with address ":4369"
+ result=
+ code=1
+ [ 1 -eq 0 ]
+ return 1
+ echo Node is not running!
Node is not running!

$ ss -antp | grep 4369
$

This is strange. If I deploy my app outside the container, the above steps work as expected and I can remote_console. But inside the Docker container, my release isn’t listening on port 4369.
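
To check for a listener more precisely than grepping for 4369 (which would also match, say, port 43690 or an established connection), something like the following could be used. It is a sketch that assumes the usual column layout of ss -tan (state in column 1, local address:port in column 4); port_is_listening is a hypothetical helper name.

```shell
#!/bin/sh
# Exit 0 if stdin (output of `ss -tan`) contains a LISTEN socket whose
# local address ends in ":<port>", otherwise exit 1.
# $1 = port number to look for.
port_is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

Inside the container it would be used as ss -tan | port_is_listening 4369 && echo listening.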

@tsloughter found it :slight_smile:
The only thing that needs to be set is the ERL_DIST_PORT environment variable.

Thanks a lot @tsloughter. You made my day.
