When to use long-lived vs. ephemeral processes?

Erlang REST API Design

(newbie questions)

This is a follow-up to a question I posted a few weeks ago.
So I’m trying to wrap my head around Erlang and how to use it for problems I’m
familiar with, such as REST micro-services. I’ll lay out the problem domain
that’s in my head, and then ask for comments as to whether I’m on the right
track with how to approach this in Erlang…

My apologies for this being such a long question…

Main Question

When is it advantageous to have a long-lived process vs. just spinning up an
ephemeral process?

Problem in traditional micro-services architecture

Before Erlang, I would have designed something like this:

# load balancer exposes public micro-services S1-S(N)
LB -> S1, S2, S3

A micro-service would be composed of an API plane, a Control Plane (CP), and
some state:

S1 = {
    CP -> control plane endpoints (e.g. info(), etc)
    API -> application endpoints (e.g. get_resource_foo(), etc.)
}

Each micro-service owns some sort of state, be it in a Redis Cache, an S3
Bucket, or a database.

S1 -> postgres database
S2 -> redis cache
S3 -> s3 bucket

When a client request arrives, it is serviced by the micro-service. The
micro-service calls other micro-services for things like authorization, and to
get various resources:

client -> [request] -> 
    LB -> S1.get_resource_foo() -> S2.is_client_authorized()
                                   S3.get_resource_bar()
                                   calculate foo
                                   return foo

Erlang Approaches

Single Erlang Node

What I’ve been struggling with is how this sort of thing maps to Erlang. Here’s
my thought process on that.

  • consolidate all the micro-services into a single Erlang node
  • replace Redis cache with ETS
  • use gproc to register micro-service endpoints as well known long-lived
    processes

This looks like:

LB -> Erlang Node -> (gproc) -> P1, P2, P3
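On the Redis-to-ETS point, here’s a minimal sketch of ETS as an in-node cache (the table name and keys are made up for illustration). One caveat worth noting: plain ETS is node-local, so it is not replicated across a cluster by itself.

```erlang
%% Minimal ETS-as-cache sketch. `resource_cache` and `foo` are
%% illustrative names. `public` lets any local process read/write;
%% the table dies with the process that created it.
cache_demo() ->
    _Tab = ets:new(resource_cache, [set, public, named_table]),
    true = ets:insert(resource_cache, {foo, 42}),
    case ets:lookup(resource_cache, foo) of
        [{foo, Value}] -> {ok, Value};
        []             -> not_found
    end.
```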

Now when a client request arrives, we send a message to the well known
long-lived process:

client -> [request] -> LB -> Erlang Node -> 
            webserver -> 
                P1.get_resource_foo() -> 
                    gproc -> P2 ! is_client_authorized()
                    gproc -> P3 ! get_resource_bar()
                    calculate foo
                    return foo

In turn, the web-server and P1-P3 would just spawn ephemeral processes (E) to
handle their requests.

... Erlang Node -> webserver spawn E1 -> P1 ! get_resource_foo()
               P1 spawn E2 -> P2 ! is_client_authorized()
               P1 spawn E3 -> P3 ! get_resource_bar()
               P1 spawn E4 -> calculate foo
               return foo 

This seems great, but where are the bottlenecks?
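To make the long-lived half of that concrete, here’s a rough sketch of one well-known process that offloads each request to an ephemeral process. I’m using plain `register/2` for simplicity — gproc generalizes this to richer, queryable names — and all process names and messages below are made up:

```erlang
%% Sketch of a long-lived, well-known process. The registered name
%% `p1` and the message protocol are illustrative only.
start_p1() ->
    Pid = spawn(fun() -> p1_loop() end),
    register(p1, Pid),
    Pid.

p1_loop() ->
    receive
        {get_resource_foo, From, Ref} ->
            %% hand the actual work to an ephemeral process so the
            %% well-known process stays responsive
            spawn(fun() -> From ! {Ref, calculate_foo()} end),
            p1_loop();
        stop ->
            ok
    end.

%% stand-in for the real computation
calculate_foo() -> foo.
```

A caller would send `p1 ! {get_resource_foo, self(), Ref}` and then wait for `{Ref, Foo}` in a `receive`.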

Multiple Erlang Nodes

If I have one Erlang node running the web-server with API and CP endpoints, and
several other nodes for processing:

  • Node1 : web-server
  • Node2 : worker
  • Node3 : worker

Now say I have shared state in ETS across this small cluster. I could
distribute processes in various ways:

  • Option1: node1 contains web-server endpoints AND well known long-lived
    processes P1-P3, and ephemeral worker processes get created on node2-node3
  • Option2: node1 contains web-server and P1-P3 and all ephemeral processes are
    created on node2-3
  • Option3: same as Option2 but spin up ephemeral processes randomly on nodes
    2-3
  • Option4: same as Option3 but spin up ephemeral processes on nodes 1-3 that
    have the least load

This seems like I’m overthinking it. A simpler option would be to just have
all the nodes the same and not spin up processes on remote nodes, just the
local node:

  • Option5: All nodes the same, with web-server, P1-3, and Ephemeral processes
    on local node

No Long Lived Processes

This brings me to one of my core questions… why have long lived processes at
all? In this case, the problem could be simplified by just invoking the MFA
directly:

... Erlang Node -> webserver spawn E1 -> 
                spawn E2 (Module1:get_resource_foo:arguments)
                spawn E3 (Module2:is_client_authorized:arguments)
                spawn E4 (Module3:get_resource_bar:arguments)
                spawn E5 (Module4:calculate_foo:arguments)
                return foo

Now the problem looks a lot simpler:

  • Nodes 1-3 : web-server, code for Modules 1-4, sharing ETS
client -> [request] -> 
            LB -> Node(1-3) -> 
                webserver ->
                    webserver spawn E1 -> 
                    spawn E2 (Module1:get_resource_foo:arguments)
                    spawn E3 (Module2:is_client_authorized:arguments)
                    spawn E4 (Module3:get_resource_bar:arguments)
                    spawn E5 (Module4:calculate_foo:arguments)
                    return foo
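The spawn-per-step idea above can be sketched with a tiny async/await helper built on `spawn` and `make_ref` — all module/function names below are hypothetical stand-ins for the real micro-service logic:

```erlang
%% Sketch: no long-lived processes, just ephemeral workers per
%% request step. Each call runs concurrently; results are matched
%% back to callers by unique reference.
handle_request(Client) ->
    Auth = async(fun() -> is_client_authorized(Client) end),
    Bar  = async(fun() -> get_resource_bar(Client) end),
    true = await(Auth),
    calculate_foo(await(Bar)).

async(Fun) ->
    Parent = self(),
    Ref = make_ref(),
    spawn(fun() -> Parent ! {Ref, Fun()} end),
    Ref.

await(Ref) ->
    receive
        {Ref, Result} -> Result
    after 5000 ->
        error(timeout)
    end.

%% stand-ins for the real work
is_client_authorized(_Client) -> true.
get_resource_bar(_Client) -> bar.
calculate_foo(Bar) -> {foo, Bar}.
```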

So my primary questions are:

  • Am I composing this problem in Erlang correctly?
  • When do you use long-lived well known (e.g. registered with gproc) processes
    vs. ephemeral processes?
  • How do you compose your nodes and distribute work to them? In this last case,
    I’m using the LB to spread the load randomly across the nodes.

I have other questions that I’ll keep on the back-burner, but mention here in case anyone knows:

  • how well does ETS handle network partitions?
  • if you join a node to an already running cluster, is that node available
    before ETS has finished hydrating in the new node? what happens if you query
    ETS before it’s been populated/hydrated?

Thank you in advance!

In your case, I would start much simpler, especially since you don’t have much experience with Erlang yet. After an initial setup and some performance measurements, you can adjust your design and decide whether you want long-lived processes and multiple nodes.

Start with a simple REST API in Cowboy. Cowboy creates a process for each request, and with the REST handler it’s a piece of cake. This finger exercise gives you a better sense of how Erlang works and is a first step towards adjusting your architecture with the help of performance tests.
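As a starting point, a minimal `cowboy_rest` handler might look something like this (Cowboy 2.x API; the route, module name, and JSON payload are made up):

```erlang
-module(foo_handler).
-export([start/0, init/2, content_types_provided/2, to_json/2]).

%% Start an HTTP listener routing /foo to this module.
start() ->
    Dispatch = cowboy_router:compile([{'_', [{"/foo", foo_handler, []}]}]),
    {ok, _} = cowboy:start_clear(http_listener, [{port, 8080}],
                                 #{env => #{dispatch => Dispatch}}).

%% Switch this handler into cowboy_rest mode; Cowboy then drives the
%% REST state machine via the callbacks below.
init(Req, State) ->
    {cowboy_rest, Req, State}.

%% Content negotiation: GET requests for JSON are served by to_json/2.
content_types_provided(Req, State) ->
    {[{<<"application/json">>, to_json}], Req, State}.

to_json(Req, State) ->
    {<<"{\"foo\": 42}">>, Req, State}.
```

Each incoming request runs in its own Cowboy-spawned process, so your handler code stays plain and sequential.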

I feel things can be simplified even more by using webmachine (GitHub - webmachine/webmachine), a REST-based system for building web applications.

Cowboy already has a built-in webmachine-like implementation with its cowboy_rest handler.


schnef, g4v, eproxus -

Thank you, each of you.

Yes, my example was much too complicated. This is the sort of thing I’ll build once I fully understand all the Erlang concepts… but for learning, I agree. Smaller test cases will make the concepts clear.

I’m reading “Erlang Programming” by Cesarini and Thompson and they talk about processes a bit. Here’s my summary so far:

  1. processes wrap state
  2. long-lived processes (registered processes) wrap long-lived state
  3. ephemeral processes (not registered, live short time) wrap short-lived state

Examples

So if it’s a webserver, like webmachine, etc., then the webserver process is a long-lived process because it’s wrapping the state of a long-lived socket. When a request comes in, the long-lived (and probably registered) webserver process spawns an ephemeral (not registered) process to wrap the state of the request (request + socket and new port).

From the book, they talk about registering a file-system process to serialize reads & writes to the filesystem. Another example was registering a database process to handle similar contention on db reads & writes. And a third example was processing packets from a socket. In that example, each packet was encapsulated by an ephemeral process.
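The file-system example can be sketched like this: one registered process owns the file, so concurrent callers are serialized through its mailbox. The names and the append-only protocol here are illustrative, not from the book:

```erlang
%% One named process owns the file; all writes funnel through its
%% mailbox, so they never interleave.
start_fs(Path) ->
    register(fs_server, spawn(fun() -> fs_loop(Path) end)).

fs_loop(Path) ->
    receive
        {append, From, Ref, Data} ->
            ok = file:write_file(Path, Data, [append]),
            From ! {Ref, ok},
            fs_loop(Path);
        stop ->
            ok
    end.

%% Client-side helper: synchronous append via the registered name.
append(Data) ->
    Ref = make_ref(),
    fs_server ! {append, self(), Ref, Data},
    receive {Ref, ok} -> ok after 5000 -> error(timeout) end.
```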

So far, this is the pattern I’m seeing.

  1. processes wrap state
  2. long-lived processes wrap long-lived state
  3. ephemeral processes wrap short-lived state

Again, thank you for your thoughts on this.

Maybe it helps to understand how Cowboy (and most other Erlang web servers, or protocol servers in general) work internally:

They usually have one to many acceptor processes (commonly called a pool) that wait for new connections on a listen socket (e.g. TCP or UDP). Once a connection comes in, they immediately spawn off a worker process and give that process a handle to the socket.

It’s only in that fresh process that any data receiving actually starts. That means the acceptor processes immediately go back and wait for another connection (leading to very low latency for accepting new connections).

The new worker process will receive data and decode the HTTP request (in this example), run some internal parsing on it and route it to a configured handler module which is then called via some agreed upon callback API (which would be your application code).

All this happens in the context of that new worker process. This has several benefits:

  • Very little data is copied between the acceptor and the worker (just a socket handle)
  • Inside the worker all data received (or otherwise created) is local on the heap of that process
  • You can structure your handler code in a logical procedural fashion without having to think about the above mentioned details (usually)
  • When the process is done and dies, Erlang can very quickly and efficiently free up the memory used by the process
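A stripped-down version of that acceptor/worker split, using raw gen_tcp (Cowboy’s real acceptor pool is more elaborate, but the shape is the same):

```erlang
%% Sketch of an acceptor loop that hands each connection to a fresh
%% worker process. Only the socket handle crosses process boundaries.
start(Port) ->
    {ok, Listen} = gen_tcp:listen(Port, [binary, {active, false},
                                         {reuseaddr, true}]),
    acceptor(Listen).

acceptor(Listen) ->
    {ok, Sock} = gen_tcp:accept(Listen),
    %% transfer socket ownership to the worker, then go straight
    %% back to accepting new connections
    Worker = spawn(fun() -> worker(Sock) end),
    ok = gen_tcp:controlling_process(Sock, Worker),
    Worker ! go,
    acceptor(Listen).

worker(Sock) ->
    receive go -> ok end,            %% wait until we own the socket
    {ok, Data} = gen_tcp:recv(Sock, 0),
    %% ...parse Data and dispatch to a handler module here...
    ok = gen_tcp:send(Sock, Data),   %% echo, for demonstration
    gen_tcp:close(Sock).
```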

Piggybacking on the already spawned worker process is usually enough for 95% of the cases in my experience. You might access some shared data like an ETS table or a database to synchronize the system state, but you normally don’t need to spawn more processes.

In the cases where you need to spawn more processes you have to figure out what is really concurrent in your system in addition to the processes already spawned by your web server.

Two examples:

  • You’re serving requests that use a database as data store. No additional processes are needed. The synchronization happens in the database and the processes are “stateless” (in a permanence sort of way)
  • You are implementing a chat server. In order to have chat rooms you want to send new messages to all participants in a room. Then it might make sense to spawn a process for the room itself and have the worker processes send messages to and receive messages from it (e.g. using the pg module in OTP).
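The chat-room case maps to pg (in OTP since 23) roughly like this: each connection process joins a per-room group, and broadcasting is just sending to every group member. The room name is illustrative, and the default pg scope must already be running (e.g. via `pg:start_link()`):

```erlang
%% Per-room broadcast with OTP's pg process groups.
join_room(Room) ->
    ok = pg:join(Room, self()).

broadcast(Room, Msg) ->
    [Pid ! {chat, Room, Msg} || Pid <- pg:get_members(Room)],
    ok.
```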

Would add to @eproxus’s ace post that another really good alternative to pg for that sort of scenario is syn, as it handles lots of procs joining and leaving, and publishing/subscribing to lots of groups, very well.


@eproxus and @schnef,

Your two explanations really helped, thank you! Your examples using webservers make sense to me. I have a follow-up question about HTTP clients, but I’ll raise it in a separate thread.

Thanks again!


@igorclark - syn looks very cool. I’ll look into that.
