Process pooling libraries in Erlang similar to Java Thread pool or Akka Router?

mohan · October 13, 2022, 3:56pm

I am looking for a Process pool in Erlang similar to Java Thread pool or Akka Router. As I do not find them in the standard Erlang language, I am seeking community help to suggest any libraries to achieve the same.

I am looking for the following capabilities in a Process pool.

Either spawning a fixed size or variable size pool which grows as load increases.
Pool Supervising so that a new process is spawned in case an existing process dies.
Different Message dispatch approaches such as the round-robin algorithm, broadcast to all processes, send the message to the least loaded process, etc

max-au · October 13, 2022, 5:05pm

Akka (and similar frameworks) needs a “process pool” construct to work around a VM limitation. In Java (discounting Project Loom for now), a thread cannot yield anywhere except when waiting to receive a message. The Erlang VM has no such limitation, so you can create as many processes as needed: 100k, a million, 4 million. It works perfectly well.

In Java, it is expensive to start a new thread and cheaper to reuse existing ones, hence the need for a thread pool. In Erlang, spawning a new process is very cheap, so you can just spawn one every time you need to perform something asynchronously.

One can argue that “process pool” can be used to limit concurrency. That is a valid use-case, but implementation of such a primitive is trivial.
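As a rough illustration of how small such a concurrency limiter can be, here is a minimal sketch. The module name limited_spawn and the function run/2 are made up for this example, and it assumes the calling process has no other monitors active (any 'DOWN' message is counted as a finished job).

```erlang
-module(limited_spawn).
-export([run/2]).

%% Run each zero-arity fun in Jobs in its own process, keeping at most
%% Max of them alive at any time. Returns ok once every job has exited.
run(Jobs, Max) when is_list(Jobs), is_integer(Max), Max > 0 ->
    loop(Jobs, Max, 0).

%% Nothing left to start and nothing still running: done.
loop([], _Max, 0) ->
    ok;
%% Capacity available: start the next job under a monitor.
loop([Job | Rest], Max, Running) when Running < Max ->
    {_Pid, _Ref} = spawn_monitor(Job),
    loop(Rest, Max, Running + 1);
%% At capacity, or only waiting for stragglers: wait for one job to exit.
loop(Jobs, Max, Running) ->
    receive
        {'DOWN', _Ref, process, _Pid, _Reason} ->
            loop(Jobs, Max, Running - 1)
    end.
```

For example, limited_spawn:run([fun() -> timer:sleep(100) end || _ <- lists:seq(1, 20)], 4) runs twenty jobs with no more than four processes alive at once. Each job still gets a fresh process; only the number of concurrent ones is capped.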

Another alternative is to start a simple_one_for_one supervisor and make its workers join a pg group. Then you can trivially send a job to a random worker in that pool (the pool can be distributed between multiple physical servers, so you’ll get a worldwide pool of workers).
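A minimal sketch of that supervisor-plus-pg idea follows, assuming OTP 23 or later (where the pg module is available). The module name pg_pool, the group name pg_pool_workers and the {job, Fun} message shape are all invented for this example. The workers are plain proc_lib processes to keep everything in one module; real code would more likely use gen_server workers and start the pg scope from its own supervision tree.

```erlang
-module(pg_pool).
-behaviour(supervisor).

-export([start_link/1, send_job/1]).
-export([init/1, start_worker/0, worker_init/0]).

-define(GROUP, pg_pool_workers).   % hypothetical group name

%% Start the supervisor and Size workers under it.
start_link(Size) when is_integer(Size), Size > 0 ->
    ok = ensure_pg(),
    {ok, Sup} = supervisor:start_link({local, ?MODULE}, ?MODULE, []),
    lists:foreach(fun(_) -> {ok, _} = supervisor:start_child(Sup, []) end,
                  lists:seq(1, Size)),
    {ok, Sup}.

%% The default pg scope has to be running before anyone joins a group.
ensure_pg() ->
    case pg:start_link() of
        {ok, _Pid} -> ok;
        {error, {already_started, _Pid}} -> ok
    end.

%% Send a zero-arity fun to a randomly chosen member of the group.
send_job(Fun) when is_function(Fun, 0) ->
    case pg:get_members(?GROUP) of
        [] ->
            {error, no_workers};
        Members ->
            Pid = lists:nth(rand:uniform(length(Members)), Members),
            Pid ! {job, Fun},
            ok
    end.

%% Supervisor callback: every child is started via start_worker/0
%% and restarted if it dies.
init([]) ->
    SupFlags = #{strategy => simple_one_for_one, intensity => 5, period => 10},
    Child = #{id => worker,
              start => {?MODULE, start_worker, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

start_worker() ->
    proc_lib:start_link(?MODULE, worker_init, []).

%% Join the group first, then tell the supervisor the worker is ready.
worker_init() ->
    ok = pg:join(?GROUP, self()),
    proc_lib:init_ack({ok, self()}),
    worker_loop().

worker_loop() ->
    receive
        {job, Fun} when is_function(Fun, 0) ->
            Fun(),
            worker_loop();
        _Other ->
            worker_loop()
    end.
```

If a job crashes its worker, pg drops the dead process from the group and the supervisor starts a replacement, which joins the group again in worker_init/0. On connected nodes running the same pg scope, pg:get_members/1 also returns remote members, which is what makes the distributed variant possible.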

And of course you could use poolboy, pooler or a similar library from the Awesome Erlang list. I would however recommend explaining your use-case a bit more, as Erlang may provide better abstractions than constructs like “thread pool” that come from languages not built around concurrency.

elbrujohalcon · October 14, 2022, 6:57am

I think Worker Pool satisfies all your requirements.

mohan · October 14, 2022, 4:40pm

Thanks Brujo. Let me check it.

mohan · October 14, 2022, 5:03pm

Thanks Max. Appreciate your response.

When you say VM limitation with reference to Akka, do you mean that Akka Actor maps to the underlying Java thread for execution and hence, there is a limitation on the number of Actors that could be created?
What is meant by “simple_one_for_one supervisor”?

My use-cases are as follows. Please suggest an appropriate Erlang library to implement them.
Use-case #1 - “Dispatching to the same worker”: My application gets events from a large number of sources. After processing each event, my application stores it in a database and displays it on the GUI. I want to preserve the order of arrival while processing events, so that they are displayed in the GUI in arrival order. To preserve that order, I plan to have all events from a specific source processed by the same worker/process. Essentially, my dispatcher should dispatch events from a specific source (say, based on IP address) to the same worker.

Use-case #2 - “Batch Processing”: Dispatch tasks to one of the available workers/processes based on some algorithm such as least loaded process, round-robin, etc.

NelsonVides · October 14, 2022, 5:43pm

For both use cases I use (and have also contributed to) the worker_pool library Brujo shared before. For #1 you can use hash_worker, which lets the producer select a worker by some key. For example, whatever ID identifies the producer, the hashing is consistent and always chooses the same worker. For #2, the library also has other strategies to select a worker, like next_worker for round-robin and available_worker for least-load.

max-au · October 14, 2022, 8:45pm

Akka maps to Java threads (which are in turn expensive OS threads), so you cannot create millions of these. In Erlang you can have as many as you need.

Use-case #1 looks like you want to create an Erlang process per event stream. Every event stream is separate from all others. Unless you have millions of separate event streams, I don’t see why you would want a single worker to process and store unrelated events. We use this design successfully. When a new event stream is registered, we basically spawn a new process which handles all events related to that stream. We keep a mapping between stream ID and process ID in an ETS table, so we’re always able to find the process.
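The process-per-stream design with an ETS lookup table could be sketched roughly as below. This is not the code from the production system described above; the module name stream_router, the table name stream_workers and the {event, Event} message shape are assumptions for illustration. It relies on Erlang's guarantee that messages sent from one process to another arrive in the order they were sent.

```erlang
-module(stream_router).
-export([init/0, dispatch/2]).

-define(TAB, stream_workers).   % hypothetical table name

%% Create the StreamId -> Pid table. The calling process owns the table,
%% so in a real system a long-lived process should call this.
init() ->
    ?TAB = ets:new(?TAB, [named_table, public, set, {read_concurrency, true}]),
    ok.

%% Send Event to the process that owns StreamId, starting it on first use.
dispatch(StreamId, Event) ->
    Pid = case ets:lookup(?TAB, StreamId) of
              [{_, Existing}] -> Existing;
              [] -> start_stream(StreamId)
          end,
    Pid ! {event, Event},
    ok.

%% insert_new/2 decides the winner if two callers race to create
%% the same stream; the loser's process is discarded.
start_stream(StreamId) ->
    Pid = spawn(fun() -> stream_loop(StreamId) end),
    case ets:insert_new(?TAB, {StreamId, Pid}) of
        true ->
            Pid;
        false ->
            exit(Pid, kill),
            [{_, Winner}] = ets:lookup(?TAB, StreamId),
            Winner
    end.

%% One process per stream handles that stream's events one at a time,
%% in mailbox order.
stream_loop(StreamId) ->
    receive
        {event, Event} ->
            handle_event(StreamId, Event),
            stream_loop(StreamId)
    end.

%% Placeholder for "store in the database and update the GUI".
handle_event(StreamId, Event) ->
    io:format("stream ~p: ~p~n", [StreamId, Event]).
```

The sketch leaves out supervision and cleanup: a stream process that dies stays in the table until something removes its entry, so a fuller version would monitor the processes or run them under a supervisor.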

Use-case #2: just spawn a new process. It’s cheaper than figuring out which process is “least loaded”. In most cases reusing an Erlang process is more expensive than starting a new one. That’s why I usually warn against the “worker pool” concept in Erlang.

mohan · October 15, 2022, 2:50am

Thanks Max.

The number of event sources/streams would vary from a few tens of thousands to a maximum of one million.

I propose to use one Erlang process per event stream in order to process the events in the order of arrival so that events will be displayed in the GUI in the same order. One type of event is alarm events which indicate some problem with the event source. Such an alarm event will be sent when the problem arises and another event will be sent when the problem is resolved. So, the order of arrival is important while processing the events.

mohan · October 15, 2022, 3:16am

In the Java world, each thread consumes some memory (stack space), and more threads result in more context switching. Hence, there is a limit on the number of threads per core.

You mentioned that I can have as many Erlang processes as I need. How come we do not have the Java limitation mentioned above, in Erlang? Please clarify.

max-au · October 15, 2022, 4:14am

Erlang processes are “green threads” (or “fibers”); they are very lightweight. If a process is idle most of the time, it can hibernate (see erlang:hibernate/3 or the corresponding hibernate return value of a gen_server callback), discarding the stack.
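For the gen_server variant, hibernation is requested by adding the atom hibernate to a callback's return tuple. A minimal sketch, with a made-up module idle_worker that just keeps a running sum:

```erlang
-module(idle_worker).
-behaviour(gen_server).

-export([start_link/0, add/2, total/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

add(Pid, N) when is_integer(N) ->
    gen_server:cast(Pid, {add, N}).

total(Pid) ->
    gen_server:call(Pid, total).

%% The trailing 'hibernate' asks gen_server to hibernate the process
%% until the next message arrives.
init([]) ->
    {ok, 0, hibernate}.

handle_call(total, _From, Sum) ->
    {reply, Sum, Sum, hibernate}.

handle_cast({add, N}, Sum) ->
    {noreply, Sum + N, hibernate}.
```

Hibernating after every message, as this sketch does, trades CPU time for memory because the process is garbage-collected each time. It tends to pay off only for processes that really do sit idle for long periods.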

One can think of Erlang processes as “runnable objects” (actors). Having 2-3 million processes per node is not a problem at all; it’s in the VM design. We have run production nodes with over 2M processes.

mohan · October 17, 2022, 1:04pm

Thanks Max. The term “green thread” made me look up the deprecated green threads in Java and the upcoming virtual threads in Java. The article State Of Java Loom Project explains in detail how Java virtual threads are made lighter. The same should apply to Erlang processes as well.

I agree with the explanation that I can create any number of Erlang processes and do not need a process pool for my case, where the number of processes varies from a few thousand to one million.

Appreciate your responses. They helped me to understand Erlang better.

With this, I would like to close the discussion.

Maria-12648430 · October 18, 2022, 12:13pm

Well, to be precise, not any number of Erlang processes, but many more than you would be comfortable with in a Java environment (sans Project Loom).

From https://www.erlang.org/doc/efficiency_guide/advanced.html#system-limits:

The maximum number of simultaneously alive Erlang processes is by default 262,144. This limit can be configured at startup. For more information, see the +P command-line flag in the erl(1) manual page in ERTS.

And from Erlang -- erl:

+P Number

Sets the maximum number of simultaneously existing processes for this system if a Number is passed as value. Valid range for Number is [1024-134217727].

NOTE: The actual maximum chosen may be much larger than the Number passed. Currently the runtime system often, but not always, chooses a value that is a power of 2. This might, however, be changed in the future. The actual value chosen can be checked by calling erlang:system_info(process_limit).

The default value is 262144

(AFAIK this number refers to the number of processes per runtime instance.)
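To see the limit and current usage on a running node, something like the following sketch can be used. The module name proc_limit is made up; erlang:system_info(process_limit) is the call mentioned in the quoted documentation, and erlang:system_info(process_count) reports the number of processes currently alive.

```erlang
-module(proc_limit).
-export([report/0]).

%% Print and return {CurrentProcessCount, ProcessLimit} for this node.
%% The limit is fixed at startup, for example with: erl +P 2000000
report() ->
    Limit = erlang:system_info(process_limit),
    Count = erlang:system_info(process_count),
    io:format("processes: ~p alive, limit ~p~n", [Count, Limit]),
    {Count, Limit}.
```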

juhlig · October 18, 2022, 12:33pm

I would recommend neither pooler nor poolboy. Both of them seem to be abandoned and have flaws.

This is much better

Anyway, pooling doesn’t make much sense in Erlang, unless there is a limited external resource involved.

mohan · October 18, 2022, 12:54pm

Thanks Maria. I agree that it cannot be “any number”, as it finally boils down to the underlying hardware capacity. I meant a “large number” compared to Java, where I am coming from.

The default of 262,144 should be good enough for me for now. What is the minimum hardware spec needed to support this default number?

mohan · October 18, 2022, 12:56pm

Thanks Jan. I have decided not to use any pooler for now.

Maria-12648430 · October 18, 2022, 1:35pm

Frankly, I don’t know if there is any requirement; I never had to bother with that. Any reasonably up-to-date hardware should be up to it. Things just run slower in general on weak hardware and faster on strong hardware. This talk, though a bit old and actually on a slightly different topic, could give you some deeper insights and answer some of your questions.
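One way to get a feel for the memory side of that question is to measure it on the target machine. The sketch below spawns N idle processes and reports the approximate growth in process memory per process, in bytes. The module name proc_cost is invented for this example, and the figure is only an estimate: it includes some overhead in the calling process and says nothing about what the processes will need once they hold real state.

```erlang
-module(proc_cost).
-export([measure/1]).

%% Spawn N idle processes and return the approximate number of bytes
%% of process memory each one added, then kill them again.
measure(N) when is_integer(N), N > 0 ->
    Before = erlang:memory(processes),
    Pids = [spawn(fun idle/0) || _ <- lists:seq(1, N)],
    After = erlang:memory(processes),
    lists:foreach(fun(Pid) -> exit(Pid, kill) end, Pids),
    (After - Before) div N.

%% An idle process: blocks in receive until told to stop.
idle() ->
    receive
        stop -> ok
    end.
```

Multiplying the result of proc_cost:measure(100000) by the expected number of processes gives a rough lower bound on the RAM needed for the processes themselves.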

Maria-12648430 · October 18, 2022, 1:43pm

Out of interest, what constitutes a “large number” in the Java world of today?

mohan · October 19, 2022, 7:25am

Hi Maria, the maximum number of threads is calculated from parameters such as total memory, stack size, etc. For example, on my 11 GB RAM Ubuntu VM, the system calculates the maximum thread count as 95k.