I am looking for a process pool in Erlang similar to a Java thread pool or an Akka Router. As I did not find one in standard Erlang, I am asking the community to suggest libraries that achieve the same.
I am looking for the following capabilities in a Process pool.
- Either spawning a fixed size or variable size pool which grows as load increases.
- Pool Supervising so that a new process is spawned in case an existing process dies.
- Different message dispatch approaches, such as round-robin, broadcast to all processes, sending the message to the least loaded process, etc.
Akka (and similar frameworks) need a “process pool” construct to work around a VM limitation. In Java (discounting Project Loom for now), a thread cannot yield anywhere except when waiting for a message receive. The Erlang VM has no such limitation, so you can create as many processes as needed - 100k, a million, 4 million. It works perfectly well.
In Java, it is expensive to start a new thread, and it’s cheaper to reuse threads - hence the concept of a thread pool is necessary. In Erlang, spawning a new process is very cheap - so you can just spawn one every time you need to perform something asynchronously.
One can argue that “process pool” can be used to limit concurrency. That is a valid use-case, but implementation of such a primitive is trivial.
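To illustrate how small such a concurrency limiter can be, here is a minimal sketch of a bounded parallel map. The module name bounded_pmap and its map/3 function are hypothetical names chosen for this example, not part of OTP; it assumes OTP 21 or later for the is_map_key/2 guard.

```erlang
-module(bounded_pmap).
-export([map/3]).

%% Apply Fun to every element of List, running at most Limit worker
%% processes at any time. Results are returned in input order as
%% {ok, Value} or {error, Reason}.
map(Fun, List, Limit) when is_function(Fun, 1), is_integer(Limit), Limit > 0 ->
    Indexed = lists:zip(lists:seq(1, length(List)), List),
    Done = loop(Fun, Indexed, Limit, #{}, #{}),
    [maps:get(I, Done) || {I, _} <- Indexed].

%% Running maps a monitor reference to the index of the item in progress.
loop(_Fun, [], _Limit, Running, Done) when map_size(Running) =:= 0 ->
    Done;
loop(Fun, [{I, X} | Rest], Limit, Running, Done) when map_size(Running) < Limit ->
    %% A free slot exists: spawn a fresh process for the next item.
    %% The result travels back to us as the exit reason.
    {_Pid, Ref} = spawn_monitor(fun() -> exit({result, Fun(X)}) end),
    loop(Fun, Rest, Limit, Running#{Ref => I}, Done);
loop(Fun, Pending, Limit, Running, Done) ->
    %% All slots are busy (or nothing is left to start): wait for one to finish.
    receive
        {'DOWN', Ref, process, _Pid, Reason} when is_map_key(Ref, Running) ->
            {I, Running1} = maps:take(Ref, Running),
            Result = case Reason of
                         {result, Value} -> {ok, Value};
                         Other -> {error, Other}
                     end,
            loop(Fun, Pending, Limit, Running1, Done#{I => Result})
    end.
```

For example, bounded_pmap:map(fun(X) -> X * X end, [1, 2, 3], 2) never has more than two workers alive at once. Note that each item still gets its own short-lived process; only the number running concurrently is capped.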
Another alternative is to start a simple_one_for_one supervisor and make its workers register in a pg group; then you can trivially send a job to a random worker in that pool (this pool can be distributed between multiple physical servers, so you’ll get a worldwide pool of workers).
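A rough sketch of that supervisor-plus-pg arrangement might look like the following. The module name pg_pool_sup, the group name job_workers, and the {job, Fun} message shape are all assumptions made for this example. It also assumes OTP 23 or later (where pg is available) and that the default pg scope is already running, for instance via pg:start_link().

```erlang
-module(pg_pool_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/0, dispatch/1]).
-export([init/1, worker_start_link/0, worker_init/1]).

-define(GROUP, job_workers).  % hypothetical group name

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Add one more worker to the pool.
start_worker() ->
    supervisor:start_child(?MODULE, []).

%% Send a zero-arity fun to a randomly chosen member of the group.
dispatch(Job) when is_function(Job, 0) ->
    case pg:get_members(?GROUP) of
        [] ->
            {error, no_workers};
        Members ->
            Pid = lists:nth(rand:uniform(length(Members)), Members),
            Pid ! {job, Job},
            {ok, Pid}
    end.

init([]) ->
    SupFlags = #{strategy => simple_one_for_one, intensity => 5, period => 10},
    Child = #{id => worker,
              start => {?MODULE, worker_start_link, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

worker_start_link() ->
    proc_lib:start_link(?MODULE, worker_init, [self()]).

worker_init(Parent) ->
    %% Join the group before acknowledging, so the worker is reachable
    %% as soon as start_worker/0 returns.
    ok = pg:join(?GROUP, self()),
    proc_lib:init_ack(Parent, {ok, self()}),
    worker_loop().

worker_loop() ->
    receive
        {job, Job} ->
            Job(),
            worker_loop()
    end.
```

When a worker dies, pg drops it from the group and the supervisor starts a replacement that joins again, so dispatch/1 only ever sees live members of the group as pg currently knows them.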
And of course you could use pooler or a similar library from the awesome Erlang list. I would, however, recommend explaining your use-case a bit more, as Erlang may provide better abstractions compared to non-concurrent-language constructs like “thread pool”.
I think Worker Pool satisfies all your requirements.
Thanks Brujo. Let me check it
Thanks Max. Appreciate your response.
When you say VM limitation with reference to Akka, do you mean that Akka Actor maps to the underlying Java thread for execution and hence, there is a limitation on the number of Actors that could be created?
What is meant by “simple_one_for_one supervisor”?
My use-cases are as follows. Please suggest an appropriate Erlang library to implement them.
Use-case #1 - “Dispatching to the same worker”: My application gets events from a large number of sources. After processing each event, my application stores it in a database and displays it on the GUI. I want to maintain the order of arrival while processing the events, so that events will be displayed in the GUI in arrival order. To maintain the order of arrival, I plan to have events from a specific source processed by the same worker/process. Essentially, my dispatcher should dispatch events from a specific source (say, based on IP address) to the same worker.
Use-case #2 - “Batch Processing”: Dispatch the tasks to one of the available workers/processes based on some algorithm such as least loaded process, round-robin, etc.
For both use cases I use (and have also contributed to) the worker_pool library Brujo shared before. For #1 you can use hash_worker, which lets the producer select a worker from some key. For example, whichever ID identifies the producer, the hashing is consistent and always chooses the same worker. For #2, the library also has other strategies to select a worker, like next_worker for round-robin and available_worker for least-load.
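The idea behind key-based selection can be shown without any library. The sketch below is not worker_pool's implementation or API; the module name hash_dispatch and its functions are hypothetical. It only demonstrates that hashing a key with erlang:phash2/2 always lands on the same worker, which is what preserves per-source ordering.

```erlang
-module(hash_dispatch).
-export([start/1, send/3]).

%% Start N workers and return them as a tuple for constant-time lookup.
start(N) when is_integer(N), N > 0 ->
    list_to_tuple([spawn(fun worker_loop/0) || _ <- lists:seq(1, N)]).

%% Route Event to the worker chosen by hashing Key (for example a source IP).
%% The same Key always maps to the same worker, so events from one source
%% are handled in the order they were sent.
send(Workers, Key, Event) ->
    Index = erlang:phash2(Key, tuple_size(Workers)) + 1,
    Pid = element(Index, Workers),
    Pid ! {event, Key, Event},
    Pid.

worker_loop() ->
    receive
        {event, _Key, _Event} ->
            %% Placeholder: process and store the event here.
            worker_loop()
    end.
```

Calling hash_dispatch:send(Workers, {10,0,0,1}, Event) twice with the same key returns the same pid both times. This sketch has no supervision; a dead worker would need to be replaced in the tuple by whoever owns it.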
Akka maps to Java threads (which are in turn expensive OS threads), so you cannot create millions of these. In Erlang you can have as many as you need.
Use-case #1 looks like you want to create an Erlang process per event stream. Every event stream is separated from all others. Unless you have millions of separate event streams, I don’t see why you would want a single worker to process and store unrelated events. We use this design successfully. When a new event stream is registered, we basically spawn a new process which handles all events related to that stream. We keep the mapping between stream ID and process ID in an ETS table, so we’re always able to find a process.
Use-case #2: just spawn a new process. It’s cheaper than figuring out which process is “least loaded”. In most cases reusing an Erlang process is more expensive than starting a new one. That’s why I usually warn against the “worker pool” concept in Erlang.
The number of event sources/streams would vary between a few tens of thousands and a maximum of one million.
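The process-per-stream design with an ETS lookup could be sketched roughly as follows. The module name stream_registry, the table name, and the {event, Event} message shape are assumptions for illustration. The sketch also assumes that dispatch/2 is only called from the single process that called init/0, which avoids races when two callers see the same unknown stream at once.

```erlang
-module(stream_registry).
-export([init/0, dispatch/2]).

-define(TABLE, stream_registry).  % hypothetical table name

%% Create the StreamId -> Pid table. The calling process owns it.
init() ->
    ?TABLE = ets:new(?TABLE, [named_table, set]),
    ok.

%% Deliver Event to the process handling StreamId, spawning that
%% process on first use (or if the previous one has died).
dispatch(StreamId, Event) ->
    Pid = case ets:lookup(?TABLE, StreamId) of
              [{StreamId, Existing}] ->
                  case is_process_alive(Existing) of
                      true -> Existing;
                      false -> start_stream(StreamId)
                  end;
              [] ->
                  start_stream(StreamId)
          end,
    Pid ! {event, Event},
    Pid.

start_stream(StreamId) ->
    Pid = spawn(fun() -> stream_loop(StreamId) end),
    true = ets:insert(?TABLE, {StreamId, Pid}),
    Pid.

%% One mailbox per stream: events are handled strictly in arrival order.
stream_loop(StreamId) ->
    receive
        {event, Event} ->
            handle_event(StreamId, Event),
            stream_loop(StreamId)
    end.

%% Placeholder for storing the event and updating the GUI.
handle_event(_StreamId, _Event) ->
    ok.
```

Because each stream has its own process and mailbox, an alarm-raised event and the later alarm-cleared event from the same source cannot overtake each other, while unrelated streams proceed in parallel.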
I propose to use one Erlang process per event stream in order to process the events in the order of arrival so that events will be displayed in the GUI in the same order. One type of event is alarm events which indicate some problem with the event source. Such an alarm event will be sent when the problem arises and another event will be sent when the problem is resolved. So, the order of arrival is important while processing the events.
In the Java world, each thread consumes some memory (stack space), more threads result in more context switching and hence, there is a limitation on the number of threads per core.
You mentioned that I can have as many Erlang processes as I need. How come we do not have the Java limitation mentioned above, in Erlang? Please clarify.
Erlang processes are “green threads” (or “fibers”); they are very lightweight. If a process is idle most of the time, it can hibernate (see erlang:hibernate/3 or the corresponding hibernate return value of a gen_server callback), discarding the stack.
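As a small illustration of the gen_server route, a callback can ask for hibernation by putting the atom hibernate in its return tuple. The module name idle_worker and the bump/1 call are made up for this sketch.

```erlang
-module(idle_worker).
-behaviour(gen_server).
-export([start_link/0, bump/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Increment the counter held by the server and return the new value.
bump(Pid) ->
    gen_server:call(Pid, bump).

init([]) ->
    %% Hibernate straight away; the process wakes on its first message.
    {ok, 0, hibernate}.

handle_call(bump, _From, Count) ->
    %% Reply, then go back to hibernation until the next message arrives.
    {reply, Count + 1, Count + 1, hibernate}.

handle_cast(_Msg, Count) ->
    {noreply, Count, hibernate}.
```

Hibernating after every message costs some CPU on each wake-up, so it tends to suit processes that really are idle most of the time.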
One can think of Erlang processes as “runnable objects” (actors). Having 2-3 million processes per node is not a problem at all; it’s in the VM design. We were running production nodes with over 2M processes.
Thanks Max. The term “green thread” made me check the deprecated Green threads in Java and the upcoming Virtual threads in Java. The article State Of Java Loom Project explains in detail how Java Virtual threads are made lighter. The same should apply to Erlang processes as well.
I agree with the explanation that I can create a very large number of Erlang processes and do not need a process pool for my case, where the number of processes varies from a few thousand to one million.
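A quick way to get a feel for how light processes are on a given node is to spawn a batch of idle ones and compare the VM's process memory before and after. This is only a rough measurement sketch; the module name proc_cost is hypothetical, and the figure it returns also includes some growth of the caller's own heap.

```erlang
-module(proc_cost).
-export([bytes_per_process/1]).

%% Spawn N idle processes and return the approximate number of bytes
%% of process memory each one added, then stop them again.
bytes_per_process(N) when is_integer(N), N > 0 ->
    Before = erlang:memory(processes),
    Pids = [spawn(fun() -> receive stop -> ok end end) || _ <- lists:seq(1, N)],
    After = erlang:memory(processes),
    [Pid ! stop || Pid <- Pids],
    (After - Before) div N.
```

Running proc_cost:bytes_per_process(100000) in a shell gives a per-process estimate for that particular VM build and configuration.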
Appreciate your responses. They helped me to understand Erlang better.
With this, I would like to close the discussion.
Well, to be precise, not any number of Erlang processes, but much more than you would be comfortable with in a Java environment (sans Project Loom).
From Erlang -- Advanced
The maximum number of simultaneously alive Erlang processes is by default 262,144. This limit can be configured at startup. For more information, see the +P command-line flag in the erl(1) manual page in ERTS.
And from Erlang -- erl
Sets the maximum number of simultaneously existing processes for this system if a Number is passed as value. Valid range for Number is [1024-134217727]
NOTE: The actual maximum chosen may be much larger than the Number passed. Currently the runtime system often, but not always, chooses a value that is a power of 2. This might, however, be changed in the future. The actual value chosen can be checked by calling erlang:system_info(process_limit).
The default value is 262144
(AFAIK this number refers to the number of processes per runtime instance.)
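The limit actually chosen by a running node, and how close the node is to it, can be read with erlang:system_info/1. The wrapper module proc_limits below is just a hypothetical convenience for this example.

```erlang
-module(proc_limits).
-export([report/0]).

%% Return the configured process limit and the number of processes
%% currently alive on this node.
report() ->
    #{limit => erlang:system_info(process_limit),
      count => erlang:system_info(process_count)}.
```

Starting the node with, for example, erl +P 2000000 and calling proc_limits:report() shows the value the runtime settled on, which per the note above may be larger than the number passed.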
I would recommend neither
poolboy. Both of them seem to be abandoned and have flaws.
This is much better
Anyway, pooling doesn’t make much sense in Erlang, unless there is a limited external resource involved.
Thanks Maria. I agree that it cannot be “any number”, as it finally boils down to the underlying hardware capacity. I meant a “large number” compared to Java, which is where I am coming from.
The default number 262,144 should be good for now for me. What is the minimum hardware spec needed to support this default number?
Thanks Jan. I have decided not to use any pooler for now.
Frankly, I don’t know if there is any requirement, I never had to bother with that. Any reasonably up-to-date hardware should be up to it. Things just run slower in general on weak and faster on strong hardware. This talk, though a bit old and actually on a slightly different topic, could give you some deeper insights and answer some of your questions
Out of interest, what constitutes a “large number” in the Java world of today?
Hi Maria, the max number of threads is calculated based on parameters such as total memory, stack size, etc. For example, on my 11 GB RAM Ubuntu VM, the system calculates the max threads as 95k.