I have a latency-sensitive system where a set of processes sit in
tight receive loops waiting for messages. The messages arrive from
outside the BEAM (via enif_send from a C NIF), and I’ve observed
variable delay between when the message lands in the mailbox and
when the process actually wakes up to match it.
I’m looking for ways to minimize this scheduling gap. What I’ve
explored so far:
process_flag(priority, high): helps, but it's a permanent
setting, not per-event
process_flag(message_queue_data, off_heap): already using this
+sbwt none / +sfwi 500: tuned, marginal effect
+K true with minimal IO threads
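For context, a sketch of how the per-process pieces above combine (module and function names are illustrative; the emulator flags go on the command line):

```erlang
-module(worker_sketch).
-export([start/0]).

%% Emulator flags from the list above go on the command line, e.g.:
%%   erl +sbwt none +sfwi 500
%% The per-process settings can be applied at spawn time:
start() ->
    spawn_opt(fun loop/0,
              [{priority, high},                 %% permanent, not per-event
               {message_queue_data, off_heap}]).

loop() ->
    receive
        _Msg -> loop()
    end.
```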
I’m aware of OTP 28’s priority messages (EEP-76), which look very
promising. The ability to have {ready, ...} notifications skip
ahead of any accumulated system messages would be ideal. However,
as far as I can tell, enif_send/4 only accepts an ErlNifPid
and doesn’t support sending to priority aliases or passing the [priority] option.
My questions:
Is there a way to use priority messages from NIF code in
OTP 28? Does enif_send support aliases, or is there a new
NIF API for this?
Failing that, has anyone found an effective pattern for
minimizing the delay between enif_send and the target
process’s receive? Any undocumented scheduler hints or
tricks?
Would an Erlang-side relay (C sends to pid, Erlang process
forwards to its own priority alias) be worth the extra message
hop, or would the overhead negate the benefit?
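For the third question, a sketch of what the relay could look like, assuming the EEP-76 API in OTP 28 (a priority alias created by the target via alias([priority]), and erlang:send/3 with the priority option); names are illustrative, and whether the extra hop pays off is exactly the open question:

```erlang
-module(prio_relay).
-export([start/1]).

%% The NIF enif_send()s ordinary messages to the relay pid; the relay
%% forwards each one to the target's priority alias so it skips ahead
%% of any ordinary messages already queued at the target.
start(TargetAlias) ->
    spawn_link(fun() -> relay(TargetAlias) end).

relay(TargetAlias) ->
    receive
        Msg ->
            erlang:send(TargetAlias, Msg, [priority]),
            relay(TargetAlias)
    end.
```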
Any pointers appreciated, even undocumented or internal mechanisms
are interesting to know about.
What does your receive loop look like / what do these messages look like?
Are these processes doing other work or receiving other messages? You might be measuring the delay between mailbox insertion and the message being received while the process was running and doing other things, in which case it isn't a wake-up delay at all.
If you’re doing a receive that matches the inbound message, the process should become runnable as soon as a message is delivered, and it should be put into a scheduler’s run queue immediately. If your run queues are large, there will be some delay (process priority would help with that). You might try +scl false? If BEAM is migrating all the load onto fewer, busier schedulers, newly runnable processes might get pushed into a busy scheduler’s run queue and have to wait. I haven’t looked at scheduler details in a long while, and there are dragons down there.
AFAIK, priority messaging gets you to the front of the queue but doesn’t accelerate processing. If you’re seeing a delay between delivery and wake-up, it shouldn’t matter where the message is inserted into the queue?
If you’re experiencing issues with delays waking up your existing process, I don’t see why you wouldn’t also have delays waking up your proxy process.
If you can craft a minimal example, this sounds like an interesting thing to dig into.
Thanks for the pointer to your SQL adapter, interesting pattern.
Good to know about aliases being slower than a plain pid for
sending; that confirms my suspicion that the relay approach would
add overhead rather than help.
My receive loop is hot with a few patterns, something like:
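(The original snippet did not survive; a representative sketch, with illustrative message shapes and stubbed-out handlers:)

```erlang
-module(worker_loop).
-export([loop/1]).

%% Dedicated worker: empty mailbox most of the time, few patterns.
loop(State) ->
    receive
        {ready, Ref, Payload} ->
            notify_nif(Ref, Payload),   %% fires back into the NIF (round-trip)
            loop(State);
        {data, Bin} ->
            loop(handle_data(Bin, State));
        stop ->
            ok
    end.

%% Stubs; the real versions call into the NIF.
notify_nif(_Ref, _Payload) -> ok.
handle_data(_Bin, State) -> State.
```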
Thanks. The processes are dedicated workers; they do nothing else
between receives. Empty mailbox, few patterns to match, tight
loop. So it’s not a “busy doing other work” situation. The
delay I’m measuring is wall time from enif_send on the C side
to when the NIF callback fires back from Erlang (round-trip).
+scl false is already set, good suggestion though.
Your point about priority messages is well taken. If the
mailbox is empty (which it should be most of the time), the
message gets matched immediately and priority ordering doesn’t
help. The bottleneck would be the scheduler noticing the
process became runnable. I’ll try to isolate that component
from the actual processing time.
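One way to isolate that component (a sketch; it assumes the C side stamps enif_monotonic_time(ERL_NIF_NSEC) into each message, which is on the same Erlang monotonic clock as erlang:monotonic_time/1):

```erlang
-module(wakeup_probe).
-export([loop/0]).

loop() ->
    receive
        {msg, T0, _Payload} ->
            %% T0 was stamped by the NIF just before enif_send, via
            %% enif_monotonic_time(ERL_NIF_NSEC). The delta below is
            %% mailbox insertion -> dispatch, excluding processing.
            T1 = erlang:monotonic_time(nanosecond),
            io:format("wakeup+dispatch: ~p ns~n", [T1 - T0]),
            loop()
    end.
```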
I’ll put together a minimal reproducer and share it if I find something interesting. Thanks again.
Just to confirm, how are the NIFs that do enif_send running? Are they running on a regular scheduler thread and returning relatively quickly (possibly as yielding NIFs), on a dirty scheduler, or on a separate (non-BEAM) thread? (I don’t have any more advice at the moment.)
Random shot in the dark: how large are these messages? Is copying them into the message queue (or off-heap) what’s costing the time?
I don’t know of a way to optimise this – you’ve got to get the messages from the NIF into Erlang at some point, right? – but it might be an interesting data point.
You’re right, I misspoke: +sbwt none was from an earlier test. I’ve since been running with +sbwt very_long +swt very_low, which did help. The idea is that schedulers spin-wait longer before going to sleep, so one of them picks the message up faster after enif_send.
The latency I’m chasing is the gap between enif_send completing on the C thread and the Erlang process’s receive actually matching. With +sbwt very_long it’s noticeably better but still measurable, around 1-3us on a 22-core Ubuntu 24.04 LTS with schedulers pinned (+sbt db).
Is there anything else on the runtime side that affects how quickly a sleeping scheduler notices a process became runnable?
Few patterns, message_queue_data set to off_heap. Copying cost and pattern-matching cost are both negligible. The latency I’m seeing is purely scheduler wakeup: how long between enif_send making the process runnable and the scheduler actually dispatching it.
This is not really that high for a thread-to-thread message in a general-purpose system. There’s room for improvement, but it’s likely to require dark arts. It’s worth considering whether your NIF is fast enough that it could just return the data directly to the caller, instead of being separately threaded and using enif_send?
As a way to foreshadow the dark arts, can you run a test where you limit your Erlang + NIF to the same single CPU thread and see what your timings look like? On Linux, I think it’s something like taskset 4 erl to limit BEAM to only running on the 3rd CPU thread; you could use taskset 1 to run on the first CPU thread, but that often gets a lot of misc OS stuff scheduled. You’ll likely need to reduce concurrency, and also to retune the scheduler wait time options a bit. But you might see a decrease in observed latency?

If so, you may want to try to align things so that the NIF is bound to a specific CPU, and is sending to a scheduler also bound to the same CPU; erlang:process_flag/2 takes a scheduler option, which is very undocumented, but binds your process to a specific scheduler.

OTP has lots of production-useful things that are documented and say not to use in production because it’s a bad idea; process_flag(scheduler, X) is not even documented, so that tells you it’s very much a bad idea to use in production. But sometimes the dark arts can be useful, even when they’re a bad idea.
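A sketch of that binding experiment (the scheduler flag is undocumented, so treat this as experimentation only; the scheduler id and CPU numbers are illustrative):

```erlang
-module(pin_sketch).
-export([start/0]).

%% BEAM itself started pinned and with bound schedulers, e.g.:
%%   taskset 4 erl +sbt db
start() ->
    spawn(fun() ->
        %% Undocumented, per the post above: bind this process to
        %% scheduler 1, which +sbt db maps to a fixed CPU thread.
        process_flag(scheduler, 1),
        loop()
    end).

loop() ->
    receive _Msg -> loop() end.
```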
@russor Thanks for these suggestions, they were genuinely helpful.
I spent two days testing everything: single CPU with taskset, process_flag(scheduler, X) to bind the Erlang worker to the same core as the C NIF thread, various +sbwt and +swt combinations. The scheduler pinning does make the latency more stable and predictable, which was a valuable finding.
But the core issue turned out to be different from what I expected. Under high load, the C NIF side is fast enough to overwhelm the Erlang processes (i.e., it produces input data faster than the Erlang processes can consume it).
So the answer to my original question: the receive latency was never the real problem. The real problem is architectural. I was pushing too much work through the C-to-Erlang boundary.
I’m going to rethink the overall design. Thank you for taking the time to share these insights. That kind of knowledge is hard to find anywhere else.
Two things to try: use a supercarrier to reduce allocation overhead (fewer syscalls; EDIT: idk, small messages might be landing in multiblock carriers anyway), and increase the min_heap_size value and/or fullsweep_after to avoid garbage collection runs?
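A sketch of the heap-tuning half of that suggestion (the values are illustrative, not recommendations; the supercarrier is an emulator flag, shown in a comment):

```erlang
-module(gc_tuning).
-export([start/0]).

%% Supercarrier is an emulator flag, e.g.:  erl +MMscs 1024   (size in MB)
%% Per-process GC tuning can be applied at spawn time:
start() ->
    spawn_opt(fun loop/0,
              [{min_heap_size, 4096},       %% in words; avoids early minor GCs
               {fullsweep_after, 65535}]).  %% effectively defers fullsweeps

loop() ->
    receive _Msg -> loop() end.
```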