Forcing immediate mailbox check / reducing receive latency for latency-sensitive processes?

Hi all,

I have a latency-sensitive system where a set of processes sit in
tight receive loops waiting for messages. The messages arrive from
outside the BEAM (via enif_send from a C NIF), and I’ve observed
variable delay between when the message lands in the mailbox and
when the process actually wakes up to match it.

I’m looking for ways to minimize this scheduling gap. What I’ve
explored so far:

  • process_flag(priority, high): helps but is a permanent
    setting, not per-event
  • process_flag(message_queue_data, off_heap): already using this
  • +sbwt none / +sfwi 500: tuned, marginal effect
  • +K true with minimal IO threads
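
For reference, the process-side settings above look roughly like this (a sketch; loop/1 stands in for my actual worker loop, and the +sbwt/+sfwi/+K settings are VM flags on the erl command line):

```erlang
%% Process-side tuning sketch (loop/1 is a placeholder).
init() ->
    process_flag(priority, high),                % permanent for this process
    process_flag(message_queue_data, off_heap),  % mailbox off the process heap
    loop(#{}).
```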

I’m aware of OTP 28’s priority messages (EEP-76), which look very
promising. The ability to have {ready, ...} notifications skip
ahead of any accumulated system messages would be ideal. However,
as far as I can tell, enif_send/4 only accepts an ErlNifPid
and doesn’t support sending to priority aliases or passing the
[priority] option.

My questions:

  1. Is there a way to use priority messages from NIF code in
    OTP 28? Does enif_send support aliases, or is there a new
    NIF API for this?

  2. Failing that, has anyone found an effective pattern for
    minimizing the delay between enif_send and the target
    process’s receive? Any undocumented scheduler hints or
    tricks?

  3. Would an Erlang-side relay (C sends to pid, Erlang process
    forwards to its own priority alias) be worth the extra message
    hop, or would the overhead negate the benefit?
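
For question 3, the relay I have in mind would be something like this (a sketch, assuming OTP 28’s alias([priority]) from EEP-76; the names are made up):

```erlang
%% Relay sketch (OTP 28 / EEP-76; names are hypothetical).
%% The C side does enif_send to the relay pid; the relay forwards to the
%% worker's priority alias so the message skips queued ordinary messages.
start_relay(WorkerPrioAlias) ->
    spawn_link(fun() -> relay_loop(WorkerPrioAlias) end).

relay_loop(WorkerPrioAlias) ->
    receive
        Msg ->
            WorkerPrioAlias ! Msg,   % delivered as a priority message
            relay_loop(WorkerPrioAlias)
    end.

%% The worker would create the alias with: PrioAlias = alias([priority])
```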

Any pointers appreciated, even undocumented or internal mechanisms
are interesting to know about.

Thanks.

1 Like

Hmm, are you able to reproduce this in a benchmark? I’m doing something similar, although my receive loop is hot with a 5ns pause IIRC.

I get pretty good latency regardless (elixir-dbvisor/sql on GitHub, e.g. sql/lib/adapters/postgres.ex at b34c050be433efed5d35ef1a689571ba0c54ba43), but I don’t know if I’m doing something wrong.

Also, labels and aliases are slower than a pid when sending messages.

1 Like

What does your receive loop look like / what do these messages look like?

Are these processes doing other work / receiving other messages? If so, you might be measuring the delay between mailbox insertion and the receive while the process was busy doing other things, and it’s not a wake-up delay at all.

If you’re doing a receive that matches the inbound message, the process should become runnable once a message is delivered, and it should be put into a scheduler’s run queue immediately. If your run queues are large, there will be some delay (process priority would help with that). You might try +scl false? If BEAM is migrating all the load onto fewer, busier schedulers, newly runnable processes might get pushed into a busy scheduler’s run queue and have to wait. I haven’t looked at scheduler details in a long while and there are dragons down there. :slight_smile:

AFAIK, priority messaging gets you to the front of the queue, but doesn’t accelerate processing. If you’re seeing a delay between delivery and wake up, it shouldn’t matter where the message is inserted into the queue?

If you’re experiencing issues with delays waking up your existing process, I don’t see why you wouldn’t also have delays waking up your proxy process.

If you can craft a minimal example, this sounds like an interesting thing to dig into.

1 Like

Thanks for the pointer to your SQL adapter, interesting pattern.
Good to know about aliases being slower than pid for sending,
that confirms my suspicion that the relay approach would add
overhead rather than help.

My receive loop is hot with a few patterns, something like:

loop(State) ->
    receive
        {data, Ref, Payload} -> handle_data(Ref, Payload, State);
        {flush, Ref}         -> handle_flush(Ref, State);
        {stop, Ref}          -> handle_stop(Ref, State)
    end.
%% (each handle_* tail-calls loop/1 with the updated state)

I’m seeing consistent ~40us per batch on ARM (macOS M1) and
~125us on x86_64 Linux. Not terrible but I want to push it lower. Will keep digging.

1 Like

Thanks. The processes are dedicated workers, they do nothing else
between receives. Empty mailbox, few patterns match, tight
loop. So it’s not a “busy doing other work” situation. The
delay I’m measuring is wall time from enif_send on the C side
to when the NIF callback fires back from Erlang (round-trip).

+scl false is already set, good suggestion though.

Your point about priority messages is well taken. If the
mailbox is empty (which it should be most of the time), the
message gets matched immediately and priority ordering doesn’t
help. The bottleneck would be the scheduler noticing the
process became runnable. I’ll try to isolate that component
from the actual processing time.

I’ll put together a minimal reproducer and share it if I find something interesting. Thanks again.

Also make sure you leverage the reference optimization; I’m not sure you do with the code written like that.

Also, you can end up starving schedulers with a hot receive loop, which can make it slower. It’s good practice to have an after clause :slight_smile:
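
To illustrate both points (not your code, just the shape the compiler needs): the receive-reference optimization only applies when the reference is created in the same function, visibly before the receive, and every clause matches on it. A Ref carried around in loop state doesn’t qualify. A sketch:

```erlang
%% Shape the compiler can optimize: Ref is created right before the
%% receive, and every clause matches on it, so messages queued before
%% make_ref/0 ran are skipped without being pattern-matched.
call(Server, Request) ->
    Ref = make_ref(),
    Server ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Result} ->
            Result
    after 5000 ->               % the `after` clause mentioned above
        erlang:error(timeout)
    end.
```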

1 Like

It sounds like you need to be binding the scheduler threads with +sbt db on the erl command line, on bare metal, to avoid VM delays.

1 Like

Did it already :roll_eyes:

Just to confirm, how are your NIFs that do enif_send running? Are they running on a regular scheduler thread and returning relatively quickly (yielding NIFs for the longer operations), on a dirty scheduler, or on a separate (non-BEAM) thread? (I don’t have any more advice at the moment.)

I would expect this to increase latency, not decrease it. Have you tried “very_high” (IIRC) instead?

2 Likes

Random shot in the dark: how large are these messages? Is copying them into the message queue (or off-heap) what’s costing the time?

I don’t know of a way to optimise this – you’ve got to get the messages from the NIF into Erlang at some point, right? – but it might be an interesting data point.

2 Likes

Hi @jhogberg

You’re right, I misspoke. +sbwt none was from an earlier test. I’ve since been running with +sbwt very_long +swt very_low, which did help. The idea is that schedulers spin-wait longer before sleeping, so they pick up the message faster after enif_send.

The latency I’m chasing is the gap between enif_send completing on the C thread and the Erlang process’s receive actually matching. With +sbwt very_long it’s noticeably better but still measurable, around 1-3us, on a 22-core Ubuntu 24.04 LTS machine with schedulers pinned (+sbt db).
Is there anything else on the runtime side that affects how quickly a sleeping scheduler notices a process became runnable?

Hi @rlipscombe

The messages are small. The Payload in {data, Ref, Payload} is something like {ready, Count, BufIdx}, just integers:

    receive
        {data, Ref, Payload} -> handle_data(Ref, Payload, State);
        {flush, Ref}         -> handle_flush(Ref, State);
        {stop, Ref}          -> handle_stop(Ref, State)
    end

Few patterns, message_queue_data set to off_heap. Copying cost and pattern matching cost are both negligible. The latency I’m seeing is purely the scheduler wakeup, how long between enif_send making the process runnable and the scheduler actually dispatching it.

OTP 28.4.1

This is not really that high for a thread-to-thread message in a general-purpose system. There’s room for improvement, but it’s likely to require dark arts. It’s worth considering whether your NIF is fast enough that it could just return the data directly to the caller, instead of being separately threaded and using enif_send?

As a way to foreshadow the dark arts, can you run a test where you limit your erlang + nif to the same single CPU thread and see what your timings look like? On Linux, I think it’s something like taskset 4 erl to limit beam to only running on the 3rd cpu thread; you could use taskset 1 to run on the first cpu thread, but that often gets a lot of misc OS stuff scheduled. You’ll likely need to reduce concurrency, and also to retune the scheduler wait time options a bit. But you might see a decrease in observed latency.

If so, you may want to try to align things so that the NIF is bound to a specific cpu, and is sending to a scheduler also bound to the same cpu; erlang:process_flag/2 takes a scheduler option, which is very undocumented, but binds your process to a specific scheduler. OTP has lots of production-useful things that are documented and say not to use in production because it’s a bad idea; process_flag(scheduler, X) is not even documented, so that tells you it’s very much a bad idea to use in production. But sometimes the dark arts can be useful, even when they’re a bad idea.
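
A sketch of the experiment described above (undocumented behavior, not for production; the exact flag combination is an assumption you’ll want to tune):

```erlang
%% Dark arts sketch: bind the calling process to a specific scheduler
%% (1-based). Undocumented; the VM makes no promises about this.
%% Start the VM pinned to one CPU thread with something like:
%%   taskset 4 erl +S 1 +sbt db +sbwt very_long +swt very_low
bind_to_scheduler(SchedulerId) ->
    process_flag(scheduler, SchedulerId).
```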

3 Likes

@russor Thanks for these suggestions, they were genuinely helpful.

I spent two days testing everything: single CPU with taskset, process_flag(scheduler, X) to bind the Erlang worker to the same core as the C NIF thread, various +sbwt and +swt combinations. The scheduler pinning does make the latency more stable and predictable, which was a valuable finding.

But the core issue turned out to be different from what I expected. Under high load, the C NIF side is fast enough to overwhelm the Erlang processes (i.e. it produces input data faster than the Erlang processes can consume it).

So the answer to my original question: the receive latency was never the real problem. The real problem is architectural. I was pushing too much work through the C-to-Erlang boundary.

I’m going to rethink the overall design. Thank you for taking the time to share these insights. That kind of knowledge is hard to find anywhere else.

2 Likes

FWIW I don’t believe you can bind processes to a scheduler. The VM moves them as it sees fit.

Two things to try: using a supercarrier to reduce allocation overhead (fewer syscalls; EDIT: idk, small messages might be landing in multiblock carriers anyway), and increasing the min_heap_size value and/or fullsweep_after to keep garbage collection from running.
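
A sketch of the second suggestion, with illustrative values (loop/0 is a placeholder; min_heap_size is in words). The supercarrier part is an erts_alloc VM flag (+MMscs, size in MB, if I remember the option right):

```erlang
%% Sketch: pre-size the worker's heap and delay fullsweep GC so the hot
%% loop rarely pauses for collection. Values are illustrative only.
start_worker() ->
    spawn_opt(fun() -> loop() end,
              [{min_heap_size, 4096},      % in words, not bytes
               {fullsweep_after, 65535},   % effectively disable fullsweeps
               {message_queue_data, off_heap}]).
```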