How to start_link a gen_server on an arbitrary node

Hi there,

I’d like to start_link a gen_server like this:

rpc:call(Node, gen_server, start_link, [my_server, Args, Options])

However, this fails because the server gets linked to a temporary process spawned by rpc:call on the target node, not to the process that makes the call. On the Elixir Forum someone suggested using spawn_request, but it seems that doesn't solve the problem either, as it links to an intermediary process instead of the target one. Using gen_server:start and then link/1 would work most of the time, but it is not atomic. Spawning an intermediary process with spawn_link/2 and starting the server from there, to create a chain of links, is also an option, but its behaviour differs a bit and it requires extra logic to handle two PIDs.
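
For illustration, the non-atomic variant I mean would look roughly like this (assuming my_server takes Args and Opts):

%% Not atomic: between start and link the server runs unlinked. If the caller
%% dies in that window the server is left behind, and if the server dies first,
%% link/1 only yields a noproc exit instead of the real reason.
{ok, Pid} = rpc:call(Node, gen_server, start, [my_server, Args, Opts]),
link(Pid).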

What is the correct way to solve this?


Feel the pain, have been there many times…

This should be an alternative to gen_server:start + link:
erlang:spawn_link(Node, gen_server, start, [my_server, Args, Opts]).

From the doc: “A link is created between the calling process and the new process, atomically.”


It doesn’t work like that, unfortunately. It just spawns a new process on the given node and executes the passed function inside it. So in my case it basically works like the rpc:call I mentioned.


You’re right, my answer is wrong…

One solution that came to my mind is to transform the transitive links into a direct one.
So you transform caller_proc <---> erpc_proc <---> gen_server_proc into caller_proc <---> gen_server_proc. This can be achieved by passing the caller’s PID to my_server:init/1, so that during init the server can call link(Caller). After gen_server:start_link returns, you unlink the erpc process from the gen_server, and you unlink the caller process from the erpc process. This way, at any time there is at least one link to the spawned gen_server. I don’t know exactly which signal-delivery guarantees hold here, but the exit signal should not be lost; it may, however, be delivered twice. It’s not a clean solution because you have to modify your server’s code, but it is a possible one.

So, we have nodes ‘a’ and ‘b’ started and connected. On node ‘a’ you execute:

1> RemoteGen = fun(C) ->
                  {ok, P} = gen_server:start_link({global, my_server}, my_server, [{caller, C}], []),
                  unlink(C),  %% drop the caller <---> intermediary link
                  unlink(P),  %% drop the intermediary <---> gen_server link
                  {ok, P}
               end.
2> S = self().
3> F = fun() -> RemoteGen(S) end.
4> spawn_link(NodeB, F).  %% NodeB is bound to the name of node 'b'

Don’t forget to call link(Caller) in my_server:init/1. I tried it locally and it seems to work. It should be better than gen_server:start + link, because it handles an exit signal sent before the link is set up, but, as I already mentioned, the downside is that you need to modify the server’s code. I wonder whether you can do this without editing the server’s code…
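
For completeness, a minimal sketch of what the server side could look like; the module name and the {caller, C} argument shape match the snippet above, the rest is illustrative:

-module(my_server).
-behaviour(gen_server).
-export([init/1, handle_call/3, handle_cast/2]).

init([{caller, Caller}]) ->
    link(Caller),  %% turn the transitive link into a direct one
    {ok, #{caller => Caller}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.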

Another downside is that you need your own mechanism to return the server’s PID to the caller process…


This appears to be an attempt to create a “remote supervisor”, where the supervisor process is located on a different node. I don’t recall this being officially supported or recommended, but OTP is so transparent that one can easily achieve this behaviour using proc_lib directly:

proc_lib:spawn_link(WorkerNode, gen, init_it, [gen_server, self(), self(), my_server, [], []]).
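
For context, the undocumented gen:init_it/6 arguments are GenMod, Starter, Parent, Mod, Args and Options, which is why self() appears twice. If you also want the {ok, Pid} reply that gen_server:start_link would give you, you can receive the proc_lib ack yourself — again an internal message format that may change between OTP releases:

Pid = proc_lib:spawn_link(WorkerNode, gen, init_it,
                          [gen_server, self(), self(), my_server, [], []]),
receive
    {ack, Pid, Result} -> Result  %% Result is {ok, Pid} on success
end.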

Either way, it does not look nice and relies on undocumented internals. I wonder what use-case you’re after: why do you need a gen_server supervised remotely (linked to a process on another node)? Wouldn’t it work better if you had a local supervision tree?


I found this answer on stackoverflow to be helpful:

Short form: have some application on the remote node and delegate the process creation and supervision to that remote application.
This way the coupling is looser, which in general should be more robust.
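
As a sketch of that delegation, assuming the remote node runs an application with a simple_one_for_one supervisor registered as worker_sup (worker_sup and JobArgs are made up for illustration):

%% Ask the remote supervisor to start and own the worker,
%% then watch it with a monitor instead of a link.
{ok, Pid} = erpc:call(Node, supervisor, start_child, [worker_sup, [JobArgs]]),
MRef = erlang:monitor(process, Pid).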


Thanks for the answer, though if it “does not look nice and relies on undocumented internals”, then I doubt that I want to go for it :stuck_out_tongue:

I have a ‘job manager’ process that receives jobs to execute and spawns workers on demand. If a worker dies, the manager cancels all work related to the given job and notifies the process that requested it, so it’s a very basic ‘supervision’. Workers handling the same job can talk to each other via message passing. If the current node is busy, the manager can spawn a worker on another node - this is where I need the remote start_link. The nodes are well connected in a local cluster, so there’s no need to worry about netsplits and so on (at least for now).

I imagine that having a separate supervision tree on each node would require a significant refactor and introduce a lot more complexity to the system.


It is indeed basic supervision. If it weren’t for the requirement to let the workers talk to each other, I’d say this would be a classic spawn_request use-case.
Do you need your workers to be gen_servers? If yes, what you could do is leverage proc_lib:spawn_link to start a process on the remote node, and use gen_server:enter_loop in the worker code instead of gen_server:start_link. E.g.

-module(worker).
-export([start_remote/0]).
%% init/1 and the other gen_server callbacks are defined as usual.
start_remote() ->
    {ok, State} = init([]),
    gen_server:enter_loop(?MODULE, [], State).
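
Spawning it from the manager would then look roughly like this (WorkerNode being the chosen remote node):

Pid = proc_lib:spawn_link(WorkerNode, worker, start_remote, []).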

Wow, that’s really cool! However, it requires me to re-implement the gen_server’s startup logic: executing init/1, waiting in the parent process until it finishes, handling the values it can return (like {continue, ...}), etc. Or am I missing something?


I believe so. However, you may cut some corners and handle only the {ok, State} = init([]) case, relying on #letitcrash for everything else.

I guess it might be possible to change proc_lib to make start_link support remote (distributed) spawn, similar to what spawn_link/4 does. I’m not sure why this hasn’t been done in the first place.
