Using links/monitors to return the result of a short lived process/function

I don’t see this as violating the purpose of monitors. Both the message signal and the 'DOWN' signal are signals with payloads that can be converted into messages that you can match on in a receive expression. In both cases you need to tag the data, so that you can distinguish between different scenarios. The exit reason is also there for propagating a value. What kind of value to propagate depends on what you do.

In general I’d refrain from using links for this, even though links and 'EXIT' signals are very similar to monitors and 'DOWN' signals. This is because a link can be operated on from both ends, and you may need to modify the trap_exit state on the “client”, which alters how it reacts to other incoming exit signals. In some special cases it might still be useful to use a link, though.

The erpc module, introduced in OTP 23, is based on the monitor approach. In the erpc:call() case we know very little about the code that is going to execute, but in order to distinguish between different scenarios (the process was terminated by an exit signal, the process was terminated due to an uncaught exception, or an ordinary return value was produced), the only thing you need to do is tag the exit reason with a termination type (return/throw/exit/error) and a reference created by the “client”.
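A minimal sketch of that tagging idea (this is not erpc's actual implementation; the module name tagged_call and its exact shape are made up for illustration):

```erlang
%% Hypothetical sketch: run Fun in a fresh process and use a tagged
%% exit reason to distinguish a normal return from an uncaught
%% exception, and both from a foreign exit signal.
-module(tagged_call).
-export([call/1]).

call(Fun) ->
    Tag = make_ref(),  %% client-created reference, captured by the closure
    {Pid, Mon} =
        spawn_monitor(
          fun () ->
                  Result =
                      try
                          {Tag, return, Fun()}
                      catch
                          Class:Reason ->
                              {Tag, Class, Reason}
                      end,
                  exit(Result)
          end),
    receive
        {'DOWN', Mon, process, Pid, {Tag, return, Value}} ->
            {ok, Value};
        {'DOWN', Mon, process, Pid, {Tag, Class, Reason}} ->
            {exception, Class, Reason};
        {'DOWN', Mon, process, Pid, Other} ->
            %% Terminated by an exit signal from elsewhere
            {exit, Other}
    end.
```

The tag makes it impossible to confuse a worker that happens to exit with a reason shaped like ours with a genuine tagged result.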

The monitor approach has also been used in OTP itself for much longer than that. One of the reasons spawn_monitor() was introduced was due to a bug that we found when timing changed at the introduction of the SMP runtime system. The code essentially did:

Pid = erlang:spawn(fun () -> exit(do_something_and_produce_a_result()) end),
%% Race: Pid may already have terminated before the monitor is set up
Mon = erlang:monitor(process, Pid),
receive
    {'DOWN', Mon, process, Pid, Result} ->
       Result
end

Due to the changed timing, the process identified by Pid every now and then had terminated before the monitor was set up, which resulted in noproc being delivered as the result instead of the actual result. This could happen in the non-SMP runtime as well, but we had never seen it, since for it to happen you needed to run out of reductions exactly after the call to spawn(). It can be noted that this was actually one of the very few bugs in the Erlang code of OTP that was exposed by the changed timing of the SMP runtime system. We had expected more issues to appear.
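For reference, the fixed version uses spawn_monitor(), which sets up the monitor atomically with the spawn, so the process cannot terminate before the monitor exists:

```erlang
%% Monitor is created atomically with the process, closing the race.
%% do_something_and_produce_a_result/0 is the placeholder from the
%% original snippet.
{Pid, Mon} = erlang:spawn_monitor(
                 fun () -> exit(do_something_and_produce_a_result()) end),
receive
    {'DOWN', Mon, process, Pid, Result} ->
        Result
end
```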

The spawn_request() BIF was partly introduced in order to make it possible to do things like this efficiently also in the distributed case (both the asynchronous part of it and the monitor part of it). It is a building block of erpc, since implementing erpc with the synchronous spawn BIFs would make multicall very inefficient, and timeouts would never be able to trigger before the response from the synchronous spawn arrived. That is, however, not the only purpose of spawn_request(): it is for use in any scenario where you want to spawn processes, and it is a much better spawn primitive than the old synchronous spawn BIFs, since you may need to spawn asynchronously in other scenarios as well. You can implement the old spawn primitives using spawn_request(), but you cannot do it the other way around, at least not without ending up with a very inefficient spawn_request(). This is also more or less how it is implemented in the distributed case: there is only one distributed asynchronous spawn primitive on which all distributed spawn operations are based.
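As an illustration of the "one way around" claim, here is a rough sketch of how a synchronous spawn could be expressed on top of spawn_request() (the name my_spawn is made up, and error handling is kept minimal):

```erlang
%% Hypothetical wrapper: a synchronous spawn built on the asynchronous
%% spawn_request() BIF. The reply arrives as an ordinary message tagged
%% with the request identifier returned by spawn_request().
my_spawn(Fun) ->
    ReqId = erlang:spawn_request(Fun),
    receive
        {spawn_reply, ReqId, ok, Pid} ->
            Pid;
        {spawn_reply, ReqId, error, Reason} ->
            error(Reason)
    end.
```

Between the spawn_request() call and the receive, the caller is free to do other work, which is exactly what the synchronous BIFs cannot offer.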

5 Likes

I half-typed the reply “we are using it for improving performance, because sending a normal message and then terminating the process is less efficient than just terminating the process with a reason containing the result”.

Then I deleted my reply, because, funnily enough, I had never challenged that statement, and can’t really tell whether exiting with a reason is indeed more performant. I don’t have a machine set up correctly to run this test, but as soon as I get one, I’ll try.

3 Likes

Minor correction: these aren’t messages, these are signals, and their handling is somewhat different from normal messaging. To begin with, non-message signals have priority over message signals. All signals still follow the primary ordering rule (“two signals sent from process A to process B arrive in the same order”), but general messaging order is violated (an EXIT signal may be delivered before a preceding normal message).

This difference may be important when implementing something relying on message/signal delivery.

2 Likes

All signals, including messages, adhere to the signal ordering guarantee of the language. Messages do not violate this and do not have less priority than other signals. Note that a message is not delivered/received when you take it out of the message queue using the receive expression, but when it is put into the message queue. I guess a better name for the receive expression could have been something like fetch_message_from_queue in order not to confuse the actual reception of the message with fetching it from the queue.
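A tiny illustration of the delivered-vs-fetched distinction (assuming an otherwise empty message queue): the message is already in the queue, i.e. delivered, before any receive expression runs:

```erlang
%% A signal a process sends to itself is delivered before execution
%% continues, so the message sits in the queue immediately.
self() ! hello,
{message_queue_len, 1} = erlang:process_info(self(), message_queue_len),
%% Only now is the message *fetched* from the queue:
hello = receive Msg -> Msg end.
```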

4 Likes

What I meant is, from the user perspective, non-message signal processing has priority over messages. To give an example, a delivered EXIT signal (terminating a linked process) will be processed before messages in the queue. This has specific implications for linked processes: imagine a process A that spawn_links a process B. Then, process B evaluates something, sends a message to process A, and immediately terminates with an abnormal reason. The linked process A may (but also may not) be terminated before the normal message is fetched (with a receive expression) from the queue.
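That scenario can be sketched as follows (module and function names are made up). With trap_exit enabled the outcome is deterministic, which makes the ordering observable; without trapping, A may or may not be terminated before it fetches the message:

```erlang
%% Sketch of the A/B scenario with A trapping exits. B's message signal
%% is delivered before its exit signal (same sender, same receiver), so
%% the converted 'EXIT' message ends up behind the normal message.
-module(ab_demo).
-export([run/0]).

run() ->
    process_flag(trap_exit, true),
    Self = self(),
    B = spawn_link(fun () ->
                           Self ! {result, from_b},
                           exit(abnormal)
                   end),
    First = receive Msg1 -> Msg1 end,
    Second = receive Msg2 -> Msg2 end,
    {B, First, Second}.
```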

This part of the signal ordering guarantee is not exactly intuitive, specifically the distinction between “message delivery” and “message fetching”. In my experience, it is easier to explain with a notion of priority: non-message signals are processed before message signals. Given that the implementation does exactly that (it attempts to fetch and process all non-message signals before doing other work), it is a more intuitive explanation.

3 Likes

I think even from a user perspective this way of seeing things is inaccurate and may lead to wrong conclusions.

Rather than one kind of signal having priority over other kinds of signals, I picture it as different signals being processed differently. I’m not saying that this is the way things really work, but I believe it is a more accurate picture of how things happen.

I think of a process as having a front desk with a clerk that handles the signals:

  • If a message signal arrives, it is simply put on a pile on the desk of the process, where it might eventually be picked up by the actual process (or not)
  • If an exit signal arrives, it depends on whether the process is trapping exits:
    • if no, the process is stopped no matter what it is currently doing, the office is closed, and maybe notifications are sent out to other (linked or monitoring) offices (ie, processes with their own clerks)
    • if yes, the exit signal is converted to a message, which ends up on the pile on the desk like all the other messages
  • If a monitor signal arrives, it is converted to a message, which goes to the pile on the desk
  • etc…

I think that in this view, there is nothing counter-intuitive. The signal ordering is consistent and guaranteed, and IMO intuitive.

Using your example of processes A and B, A’s clerk gets the message signal from B and puts it on the desk. Then it gets the exit signal from B and shuts down the office of A.

So, it may be that A picks the message up and does something accordingly before the exit signal from B arrives at the front desk and A’s clerk shuts down the office; but it may also be the other way round and A’s clerk receives the exit signal from B and shuts down the office before A can even pick up the message, or act upon it.

5 Likes

Yes. And the implications of using a monitor as opposed to using a link: If the spawning process dies, the initialization the spawned process is doing (judging from the initial example) has no point any more, and should be stopped, rather than being orphaned. It’s simply a matter of cleaning up properly.

If the spawned process dies, the initial example code will stop anyway, so what’s the difference?

Let’s assume I’m willing to take that liberty.

This may be true for your and other specific use cases. But it can’t be generalized.

2 Likes

Keep me informed :hugs:

1 Like

For some, the main process may consist of a scheduling or retry component that spawns a worker to avoid blocking the main event loop.

If you nuke the main process with the worker, it will also burn the message queue and any outstanding requests with it.

The request queue could be reworked, but why bother when retrying is considered ‘normal’ and the recovery is “respawn the worker”.

1 Like

This is literally my point.

My argument only differs in that I do not think it is fair to throw a generalised solution at people asking for advice.

Doing so makes things worse and helps no one.

I understand that different users have different models of how things work, which is perfectly fine. However, I don’t find this model good, since it made you state that “general messaging order is violated”, which isn’t true.

A number of releases ago, our documentation seriously lacked regarding how signaling worked, but I don’t think that this is the case anymore. I think it explains this quite well. In light of this documentation, I think it is quite intuitive when looking at it for what it actually is:

  • The receive expression only operates on the message queue. It fetches a message from it or waits for a message to fetch.
  • A message can only enter the message queue when a signal is delivered.
  • When a signal is delivered, an action is taken. For some signals, this action is to convert the signal into a message and move it into the end of the message queue. For other signals, other actions are taken.
  • Signals are delivered independently of what the receiving process is doing.
5 Likes

Hm, fair enough. OTOH, the OP asked a rather general question, I think. And I don’t think answering such a question with too specialized a use case is all that helpful, either. I mean, @RoadRunnr is probably not trying to build a DNS server. (Actually, I don’t know… it would be pretty embarrassing if in fact he was :sweat_smile:)

That aside, I don’t think we can give specialized advice on a forum like this. It largely depends. The general solution is what works for ~90% of the problems. And I think @Maria-12648430 (repeatedly) said that there may be reasons for doing things differently, in which case it is justified.

Look, let’s not fight, ok? Can we somehow agree that neither too generalized nor too specialized advice is the pinnacle of helpfulness? :smile:

2 Likes

YES!!! That IS (part of) what I was saying all the time :sweat_smile:

2 Likes

Hush, I’m trying to sort out your mess here :stuck_out_tongue_winking_eye:

1 Like

Thanks, I guess :stuck_out_tongue_closed_eyes:

1 Like

Oh-kay. As I suspected, there may or may not be a visible performance difference, depending on the size of the returned term.

Code:

-module(links).

-export([message/0, monitor/0]).

-export([message_fun/1, monitor_fun/0]).

create_term() ->
    atom. %% fastest term possible
    %% alternative expensive-to-copy term: lists:seq(1, 1000).

use(Term) ->
    is_atom(Term).

message() ->
    _Pid = spawn(?MODULE, message_fun, [self()]),
    receive
        {ok, Reply} ->
            use(Reply)
    end.

message_fun(ReplyTo) ->
    Reply = create_term(),
    ReplyTo ! {ok, Reply}.

monitor() ->
    _PidRef = erlang:spawn_monitor(?MODULE, monitor_fun, []),
    receive
        {'DOWN', _Mon, process, _Pid, {ok, Reply}} ->
            use(Reply)
    end.

monitor_fun() ->
    Reply = create_term(),
    exit({ok, Reply}).

Result:

./erlperf 'links:message().' 'links:monitor().' -s 15 -w 2
Code                 ||   Samples       Avg   StdDev    Median      P99  Iteration    Rel
links:message().      1        15    952 Ki    0.15%    952 Ki   955 Ki    1050 ns   100%
links:monitor().      1        15    729 Ki    0.51%    730 Ki   735 Ki    1372 ns    77%

I tried various settings (e.g. bumping concurrency), and the results are reproducible: sending a reply message is consistently more performant. perf explains the difference: when no monitors are created, it’s a bit cheaper to spawn and terminate the process. Plus, the monitor tuple ({'DOWN', ...}) is more expensive to build than just an atom.

But of course, as soon as the return value turns into a more complex term (e.g. a list of 1000 elements, see the commented-out portion), the overhead of adding a monitor is no longer noticeable:

./erlperf 'links:message().' 'links:monitor().' -s 15 -w 1
Code                 ||   Samples       Avg   StdDev    Median      P99  Iteration    Rel
links:message().      1        15     55932    0.34%     55860    56360   17880 ns   100%
links:monitor().      1        15     55543    0.12%     55563    55619   18005 ns    99%

In other words, I would completely disregard any performance considerations when deciding whether to use links, monitors, or just message passing.

3 Likes

I think the premise was that if you had to use a monitor anyway, it’s cheaper to send the reply inside the monitor message instead of in an additional message before the monitor signal.