The Many-to-One Parallel Signal Sending Optimization (blog post)

I have just posted a blog post about a new signal sending optimization (that will most likely be included in Erlang/OTP 25). The optimization makes sending signals from multiple processes to a single process much more scalable on multicore systems. Comments are very welcome!


I just saw that post, looks fascinating!

I’m curious what caused that issue to be looked at so this fix came about? Is there a backstory?


Excellent question. Thank you! The optimization was not triggered by an actual application where we found that the signal queue was a scalability bottleneck. I don’t know if such applications exist, but I would not be surprised if they do. One type of application that might benefit from the optimization is logging systems, where one process aggregates log entries coming from multiple processes. It would be interesting to hear about any real-world application where the signal queue might be a scalability bottleneck. As I also mention in the blog post, people may have avoided writing applications in a certain way to avoid scalability problems with the signal queue. Now, with the new optimization, it is possible to write applications differently in ways that might be more efficient. Also, even if the signal queue is not a scalability problem today, it might become one in the future as machines keep getting more cores.

I am interested in synchronization algorithms and have a background in doing academic research about such algorithms. Therefore, I was thinking about which places in the Erlang VM could have contended hot-spots. After some discussion with @rickard, I realized that there was a potential to make signal sending scale much better with the number of processes that send signals to a single process. This is how the optimization came to be :slight_smile:


There was a presentation about hitting the limits of the standard Elixir logger last year:

Adventures in High Performance Logging

I’d be curious how much of a difference this makes! The benchmarks are impressive. Really appreciate the blog post detailing how this all works too.


I did some preliminary design work earlier this year (sadly the project was subsequently descoped) for an Erlang MME component based around an in-memory database and stateless microservices. The interfaces would have had a many-to-one relationship with the database and this was indeed a concern.


Nice, thanks for the follow-up! The speedup is extremely impressive I have to say, and I’m looking forward to trying it out in the future.


What do you think this could mean for message passing between nodes? The article focuses on processes within a single node, but one of my immediate thoughts was a knock on effect. Do you have any thoughts you can share on that?


I haven’t tried it yet, but would it also work for messages sent over Erlang distribution? This use-case (a single gen_server acting as dispatcher and sending work items to other processes) is a very standard implementation of a worker pool.


^^ that’s the knock on effect I was getting at :grinning:


No, currently, the optimization only works for signals coming from processes on the same node. All other signals are mapped to the same slot in the buffer array. My thinking was that the optimization would be less relevant for other kinds of signals as (I think) they are rarer. I would be happy if someone who knows more about how the Erlang distribution works could write about whether the optimization could be made relevant for signals coming from other nodes or if other bottlenecks are more severe for such signals. Otherwise, I will look into this a little closer and come back with an answer when I know more.


Not my area of expertise, but wanting to learn more I talked with the team.

The same optimization should be possible to do (and would probably not be hard to do) for messages coming from other nodes. Whether it is worth it is another thing, because the receiving process must also decode the received message from the external term format, and it could very well be that the time for doing that would dominate the costs for taking locks.


That’s indeed what we observed! So we don’t send the message over dist. Instead, we use term_to_binary on the sender side, and the dispatcher process does not do the decoding. It routes the binary to the worker process, which then does binary_to_term. Binaries are (usually) ref-counted, so there is no visible degradation.
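The pattern described above can be sketched in a few lines. This is a hedged illustration, not code from the thread: the module name `binary_dispatch` and the `{job, Bin}` message shape are invented for the example.

```erlang
%% Sketch: sender encodes once with term_to_binary/1, the dispatcher
%% routes the opaque binary without decoding, and only the worker
%% calls binary_to_term/1.
-module(binary_dispatch).
-export([demo/0, dispatcher/1, worker/0]).

demo() ->
    Worker = spawn(?MODULE, worker, []),
    Dispatcher = spawn(?MODULE, dispatcher, [Worker]),
    %% Sender side: encode once. The resulting binary is (usually)
    %% ref-counted, so forwarding it does not copy the payload.
    Dispatcher ! {job, term_to_binary(#{task => compute, n => 42})},
    ok.

dispatcher(Worker) ->
    receive
        {job, Bin} when is_binary(Bin) ->
            Worker ! {job, Bin},          %% route without decoding
            dispatcher(Worker)
    end.

worker() ->
    receive
        {job, Bin} ->
            Term = binary_to_term(Bin),   %% decode only here
            io:format("worker got ~p~n", [Term]),
            worker()
    end.
```

With this split, the dispatcher only touches a small tuple and a ref-counted binary per job, keeping the single routing process cheap.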

With OTP 23 and remote spawn I found another way to achieve the desired behaviour. I reversed the flow: the sender uses remote spawn to start a new worker process, which in turn checks out the necessary resources from the pool (e.g. a database connection or file handle).
However, this approach has a drawback: there is no built-in queuing/backpressure/load-shedding mechanism on the receiver side. I have a bolt-on solution (currently pending corporate scrutiny before it can be published), introducing extra processes on both the sender and receiver side.
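The reversed flow might look roughly like the sketch below. `erlang:spawn_request/3` is the real OTP 23+ primitive; the module name `reverse_flow` and the `run/1` worker body (including the pool checkout, which is elided) are placeholders for illustration.

```erlang
%% Sketch: instead of sending work to a remote dispatcher, the sender
%% spawns a worker on the target node with erlang:spawn_request/3
%% (available since OTP 23) and waits for the spawn reply.
-module(reverse_flow).
-export([submit/2]).

submit(Node, Job) ->
    ReqId = erlang:spawn_request(Node, fun() -> run(Job) end, []),
    receive
        {spawn_reply, ReqId, ok, Pid} ->
            {ok, Pid};
        {spawn_reply, ReqId, error, Reason} ->
            {error, Reason}
    end.

run(Job) ->
    %% A real worker would check out a resource here (database
    %% connection, file handle, ...) from a pool; elided in this sketch.
    do_work(Job).

do_work(_Job) -> ok.
```

As noted above, nothing here provides queuing or backpressure: every submit unconditionally starts a new process on the target node.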


Am I reading correctly that this will increase the throughput of gen_servers in case of multiple processes doing gen_server calls?

If yes, an application type to likely benefit seems to be message brokers, where messages from different session contexts can arrive at a single gen_server doing stuff like routing, storing messages, logging. Depends on the respective architecture, of course, but there are surely things like this in VerneMQ.


The current gen_server implementation uses a single process (under the hood) to handle requests that arrive as normal messages, so I think this optimization could benefit gen_servers that are used by several processes. However, the optimization is only activated for processes with the {message_queue_data, off_heap} setting, so whether the current gen_server implementation takes advantage of the optimization depends on how its server process is configured. Perhaps someone who knows the gen_server implementation better than I do can answer whether {message_queue_data, off_heap} is activated by default for gen_servers, or whether it is possible to configure a gen_server to use {message_queue_data, off_heap} (@rickard @garazdawi).


It is not activated by default, but it can be passed in the spawn_opt option when starting a gen_server, or it can be set at run time using process_flag.
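Both ways can be sketched as follows. This is an illustrative example (the module name `mqd_server` is made up), showing the start-time spawn_opt route and, commented out, the run-time process_flag route:

```erlang
%% Sketch: two ways to get {message_queue_data, off_heap} on a
%% gen_server process.
-module(mqd_server).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2]).

start_link() ->
    %% Option 1: set message_queue_data when the server process is
    %% spawned, via the spawn_opt start option.
    gen_server:start_link({local, ?MODULE}, ?MODULE, [],
                          [{spawn_opt, [{message_queue_data, off_heap}]}]).

init([]) ->
    %% Option 2: set it at run time instead, e.g. from init/1:
    %% process_flag(message_queue_data, off_heap),
    {ok, #{}}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
```

You can verify the setting took effect with `erlang:process_info(Pid, message_queue_data)`.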


I have 1 gen_server receiving messages from 100 processes and broadcasting every message to 1,000 processes. Each of the 100 processes sends 100 messages per second to the gen_server.
Will using off_heap on the gen_server help with receiving but slow down sending to all other processes?
Should I set {message_queue_data, off_heap} on the gen_server?
Should I set it on the 100 sender processes or 1,000 receiver processes?


off_heap could make receiving messages more efficient (how much more efficient compared to on_heap depends on the number of cores the machine has, how often messages are sent in parallel, etc.). My guess is that off_heap will be beneficial for the gen_server in the scenario you describe, but the only way to know for sure is to measure. No, off_heap should not affect the performance of sending to other processes. Note that the optimization described in the blog post is only available in Erlang/OTP 25.
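Since the only way to know for sure is to measure, a micro-benchmark along the following lines could be used. This is a hedged sketch (module name and message shapes invented here), timing how long it takes N senders to push M messages each into one receiver, once per message_queue_data setting:

```erlang
%% Sketch: measure many-to-one send throughput for a given
%% message_queue_data setting (on_heap | off_heap). Returns the
%% elapsed time in microseconds.
-module(mqd_bench).
-export([run/3]).

run(MQD, NSenders, MsgsPerSender) ->
    Total = NSenders * MsgsPerSender,
    Self = self(),
    %% Spawn the receiver with the requested message_queue_data setting.
    Receiver = spawn_opt(fun() -> drain(Total, Self) end,
                         [{message_queue_data, MQD}]),
    {Micros, ok} = timer:tc(fun() ->
        [spawn(fun() -> send_n(Receiver, MsgsPerSender) end)
         || _ <- lists:seq(1, NSenders)],
        receive done -> ok end
    end),
    Micros.

send_n(_Pid, 0) -> ok;
send_n(Pid, N) -> Pid ! msg, send_n(Pid, N - 1).

drain(0, Parent) -> Parent ! done;
drain(N, Parent) -> receive msg -> drain(N - 1, Parent) end.
```

Comparing e.g. `mqd_bench:run(on_heap, 100, 1000)` against `mqd_bench:run(off_heap, 100, 1000)` on a multicore machine running OTP 25 should show whether the optimization pays off for a given workload; on earlier releases the off_heap run will not benefit from the parallel signal queue.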