Pathological scheduler behavior on specific AWS instance types

We observed undesirable performance of the Erlang BEAM on specific AWS instance types, and not on others. The poorly performing instance types are c7i.24xlarge and c7i.metal-48xl. On these systems (perhaps the same underlying hardware?), with the default number of schedulers online (96 and 192 respectively), a workload that does heavy message-passing between Erlang processes is untenable.

We observed this on our production workload and created a benchmark that reproduces the problem. In this post we’ll compare the performance across some different instance types. We’re seeking help on where to go next.

System info:
All systems under test have 96 logical processors, except for c7i.metal-48xl, which has 192.

Linux hostname 5.15.0-1084-aws #91~20.04.1-Ubuntu SMP Fri May 2 06:59:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Erlang info:

Erlang/OTP 28 [erts-16.0] [source] [64-bit] [smp:96:96] [ds:96:96:10] [async-threads:1] [jit:ns]

The same for all systems, except for c7i.metal-48xl, which had 192 schedulers online.

Poorly performing AWS Instance Types under test:

  • c7i.24xlarge
  • c7i.metal-48xl

Highly performing AWS Instance Types under test:

  • m5.24xlarge
  • c7i.metal-24xl

Benchmark

Repo: message_load_benchmark_public

The benchmark is simple. First it starts a set of Erlang processes, all running the same event loop. Each process randomly selects a target process and sends it a message. It then inspects its own message queue and flushes any messages that are currently queued. Finally, it waits at most 10 ms for the receipt of a new message. When such a message is received, the time difference between the send and the receive is calculated. Then, the process loops.

Our intention is to approximate the real amount of time between the sender and the receiver, without including message queue time. Any flushed message is not included in the throughput or latency metrics.
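For readers who want to follow along without opening the repo, the loop is roughly the sketch below. The message shape and the record_latency/1 and record_flush/0 helpers are our shorthand here, not the repo’s actual code.

%% Hypothetical sketch of the per-process event loop (Peers = all benchmark pids).
loop(Peers) ->
    %% Pick a random peer and send it a timestamped message.
    Target = lists:nth(rand:uniform(length(Peers)), Peers),
    Target ! {msg, self(), erlang:monotonic_time(microsecond)},
    %% Discard anything already queued; these count toward flushput only.
    flush(),
    %% Wait at most 10 ms for a fresh message and record its latency.
    receive
        {msg, _From, SentAt} ->
            record_latency(erlang:monotonic_time(microsecond) - SentAt)
    after 10 ->
        ok
    end,
    loop(Peers).

flush() ->
    receive
        {msg, _From, _SentAt} ->
            record_flush(),
            flush()
    after 0 ->
        ok
    end.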

The benchmark is not expected to be a saturating workload for these systems. We only start a modest number of Erlang processes.

Metrics to compare across instance types:

  • Throughput (msg/sec): The rate of messages measured at the receivers. Higher is better.
  • Latency (usec): Average number of microseconds between the send and the receive, computed only over messages counted in throughput (flushed messages are excluded). Lower is better.
  • Flushput (msg/sec): The rate of messages that have been discarded from measurement due to queueing. Lower is better.

With 96 Erlang processes

Our takeaway: c7i.24xlarge enters a pathological state of some kind, but the workload is small enough that c7i.metal-48xl is still ok.

instance_type    logical processors  Erlang procs  avg latency (us)  throughput (/s)  flushput (/s)  total (/s)
m5.24xlarge      96                  96            10                39695            2320           42015
c7i.metal-24xl   96                  96            11                40109            1201           41310
c7i.24xlarge     96                  96            772               5620             4074           9694
c7i.metal-48xl   192                 96            12                40225            1026           41251

With 192 Erlang processes

Our takeaway: c7i.24xlarge is struggling heavily, and the workload has grown large enough that c7i.metal-48xl slows down as well.

instance_type    logical processors  Erlang procs  avg latency (us)  throughput (/s)  flushput (/s)  total (/s)
m5.24xlarge      96                  192           8                 92782            1297           94073
c7i.metal-24xl   96                  192           8                 92404            1349           93753
c7i.24xlarge     96                  192           3962              1987             4081           6068
c7i.metal-48xl   192                 192           957               6779             3456           10245

Other Observations

The pathological state isn’t immediate. It takes about 5 minutes to reliably get the system into the state described above. The numbers for our benchmarks were measured after this 5-minute warm-up period.

Scheduler Utilization during poor performance

With c7i.24xlarge as the poorest performer, we collected scheduler utilization data (scheduler:utilization/1) during the 96-proc benchmark run.

A small number of schedulers (roughly the first 24) are pegged at 100% while the rest are idle.

Interestingly, we tried changing schedulers_online from 96 to 24; doing so causes all 24 schedulers to become busy again. This reminded us of the old issue of so-called “scheduler collapse”, but otherwise we don’t have evidence that it’s the same problem.
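For reference, the data below was gathered roughly like this (the 10-second sampling window is our choice; scheduler:utilization/1 lives in runtime_tools):

%% Sample scheduler utilization over a 10-second window.
rp(scheduler:utilization(10)).

%% Reduce the number of online schedulers at runtime; with 24 online,
%% all 24 schedulers became busy again in our tests.
erlang:system_flag(schedulers_online, 24).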

{total,0.10934912048525097,"10.9%"}
{weighted,0.21869824097050194,"21.9%"}
{normal,1,1.0,"100.0%"}
{normal,2,1.0,"100.0%"}
{normal,3,5.58303566227387e-6,"0.0%"}
{normal,4,1.0,"100.0%"}
{normal,5,1.0,"100.0%"}
{normal,6,1.0,"100.0%"}
{normal,7,5.126323292650903e-6,"0.0%"}
{normal,8,1.0,"100.0%"}
{normal,9,1.0,"100.0%"}
{normal,10,1.0,"100.0%"}
{normal,11,1.0,"100.0%"}
{normal,12,4.906183199608674e-6,"0.0%"}
{normal,13,1.0,"100.0%"}
{normal,14,1.0,"100.0%"}
{normal,15,1.0,"100.0%"}
{normal,16,1.0,"100.0%"}
{normal,17,1.0,"100.0%"}
{normal,18,1.0,"100.0%"}
{normal,19,1.0,"100.0%"}
{normal,20,1.0,"100.0%"}
{normal,21,1.0,"100.0%"}
{normal,22,1.0,"100.0%"}
{normal,23,1.0,"100.0%"}
{normal,24,1.0,"100.0%"}
{normal,25,6.259780657377271e-6,"0.0%"}
{normal,26,5.496938458024259e-6,"0.0%"}
{normal,27,6.305404208768346e-6,"0.0%"}
{normal,28,5.621689947452118e-6,"0.0%"}
{normal,29,5.165565996041186e-6,"0.0%"}
{normal,30,5.305773543597241e-6,"0.0%"}
{normal,31,4.920218743583807e-6,"0.0%"}
{normal,32,5.758529109259419e-6,"0.0%"}
{normal,33,5.58164412163277e-6,"0.0%"}
{normal,34,4.950861226300747e-6,"0.0%"}
{normal,35,5.877247079485125e-6,"0.0%"}
{normal,36,5.97810106243979e-6,"0.0%"}
{normal,37,5.891971392599641e-6,"0.0%"}
{normal,38,5.447215499276138e-6,"0.0%"}
{normal,39,4.617056319070007e-6,"0.0%"}
{normal,40,7.2914062498871724e-6,"0.0%"}
{normal,41,5.1784460943365516e-6,"0.0%"}
{normal,42,5.787004545484994e-6,"0.0%"}
{normal,43,5.180002114471294e-6,"0.0%"}
{normal,44,4.616060171962071e-6,"0.0%"}
{normal,45,4.417586414659817e-6,"0.0%"}
{normal,46,5.8317519058503196e-6,"0.0%"}
{normal,47,5.504460186548103e-6,"0.0%"}
{normal,48,5.309437124675028e-6,"0.0%"}
{normal,49,5.429387892760399e-6,"0.0%"}
{normal,50,4.985876345064144e-6,"0.0%"}
{normal,51,6.384960497472964e-6,"0.0%"}
{normal,52,4.487253079293929e-6,"0.0%"}
{normal,53,2.0572776595553175e-6,"0.0%"}
{normal,54,6.006224830088852e-6,"0.0%"}
{normal,55,5.195931865992904e-6,"0.0%"}
{normal,56,5.5647373711127905e-6,"0.0%"}
{normal,57,5.201120156378307e-6,"0.0%"}
{normal,58,5.3847504734739995e-6,"0.0%"}
{normal,59,6.198964288353352e-6,"0.0%"}
{normal,60,4.3640235625754655e-6,"0.0%"}
{normal,61,4.531318243637775e-6,"0.0%"}
{normal,62,6.4350917892342455e-6,"0.0%"}
{normal,63,5.305698076441054e-6,"0.0%"}
{normal,64,5.858108511519475e-6,"0.0%"}
{normal,65,5.444306540754987e-6,"0.0%"}
{normal,66,4.346515118146357e-6,"0.0%"}
{normal,67,6.384108996950193e-6,"0.0%"}
{normal,68,4.779536226699929e-6,"0.0%"}
{normal,69,4.193538528757371e-6,"0.0%"}
{normal,70,7.48615386777243e-6,"0.0%"}
{normal,71,5.434803709470685e-6,"0.0%"}
{normal,72,4.86929483832563e-6,"0.0%"}
{normal,73,2.899459099867281e-6,"0.0%"}
{normal,74,4.63330867595908e-6,"0.0%"}
{normal,75,5.115959533735618e-6,"0.0%"}
{normal,76,4.9098615492426465e-6,"0.0%"}
{normal,77,5.233361507930516e-6,"0.0%"}
{normal,78,5.502109724074743e-6,"0.0%"}
{normal,79,5.84897662924456e-6,"0.0%"}
{normal,80,4.41919119358856e-6,"0.0%"}
{normal,81,5.630379932187433e-6,"0.0%"}
{normal,82,5.280271742549084e-6,"0.0%"}
{normal,83,4.759883832735486e-6,"0.0%"}
{normal,84,5.567368124376901e-6,"0.0%"}
{normal,85,2.76068218559842e-6,"0.0%"}
{normal,86,4.585415769074829e-6,"0.0%"}
{normal,87,3.8648374777872e-6,"0.0%"}
{normal,88,2.1733507378973854e-6,"0.0%"}
{normal,89,3.853097110405978e-6,"0.0%"}
{normal,90,5.812516759883333e-6,"0.0%"}
{normal,91,5.2112435821372965e-6,"0.0%"}
{normal,92,4.980230689222788e-6,"0.0%"}
{normal,93,5.253219365755038e-6,"0.0%"}
{normal,94,5.201969216855129e-6,"0.0%"}
{normal,95,4.707100637002602e-6,"0.0%"}
{normal,96,4.8749385197686555e-6,"0.0%"}
{cpu,97,0.0,"0.0%"}
{cpu,98,0.0,"0.0%"}
{cpu,99,0.0,"0.0%"}
{cpu,100,0.0,"0.0%"}
{cpu,101,0.0,"0.0%"}
{cpu,102,0.0,"0.0%"}
{cpu,103,0.0,"0.0%"}
{cpu,104,0.0,"0.0%"}
{cpu,105,0.0,"0.0%"}
{cpu,106,0.0,"0.0%"}
{cpu,107,0.0,"0.0%"}
{cpu,108,0.0,"0.0%"}
{cpu,109,0.0,"0.0%"}
{cpu,110,0.0,"0.0%"}
{cpu,111,0.0,"0.0%"}
{cpu,112,0.0,"0.0%"}
{cpu,113,0.0,"0.0%"}
{cpu,114,0.0,"0.0%"}
{cpu,115,0.0,"0.0%"}
{cpu,116,0.0,"0.0%"}
{cpu,117,0.0,"0.0%"}
{cpu,118,0.0,"0.0%"}
{cpu,119,0.0,"0.0%"}
{cpu,120,0.0,"0.0%"}
{cpu,121,0.0,"0.0%"}
{cpu,122,0.0,"0.0%"}
{cpu,123,0.0,"0.0%"}
{cpu,124,0.0,"0.0%"}
{cpu,125,0.0,"0.0%"}
{cpu,126,0.0,"0.0%"}
{cpu,127,0.0,"0.0%"}
{cpu,128,0.0,"0.0%"}
{cpu,129,0.0,"0.0%"}
{cpu,130,0.0,"0.0%"}
{cpu,131,0.0,"0.0%"}
{cpu,132,0.0,"0.0%"}
{cpu,133,0.0,"0.0%"}
{cpu,134,0.0,"0.0%"}
{cpu,135,0.0,"0.0%"}
{cpu,136,0.0,"0.0%"}
{cpu,137,0.0,"0.0%"}
{cpu,138,0.0,"0.0%"}
{cpu,139,0.0,"0.0%"}
{cpu,140,0.0,"0.0%"}
{cpu,141,0.0,"0.0%"}
{cpu,142,0.0,"0.0%"}
{cpu,143,0.0,"0.0%"}
{cpu,144,0.0,"0.0%"}
{cpu,145,0.0,"0.0%"}
{cpu,146,0.0,"0.0%"}
{cpu,147,0.0,"0.0%"}
{cpu,148,0.0,"0.0%"}
{cpu,149,0.0,"0.0%"}
{cpu,150,0.0,"0.0%"}
{cpu,151,0.0,"0.0%"}
{cpu,152,0.0,"0.0%"}
{cpu,153,0.0,"0.0%"}
{cpu,154,0.0,"0.0%"}
{cpu,155,0.0,"0.0%"}
{cpu,156,0.0,"0.0%"}
{cpu,157,0.0,"0.0%"}
{cpu,158,0.0,"0.0%"}
{cpu,159,0.0,"0.0%"}
{cpu,160,0.0,"0.0%"}
{cpu,161,0.0,"0.0%"}
{cpu,162,0.0,"0.0%"}
{cpu,163,0.0,"0.0%"}
{cpu,164,0.0,"0.0%"}
{cpu,165,0.0,"0.0%"}
{cpu,166,0.0,"0.0%"}
{cpu,167,0.0,"0.0%"}
{cpu,168,0.0,"0.0%"}
{cpu,169,0.0,"0.0%"}
{cpu,170,0.0,"0.0%"}
{cpu,171,0.0,"0.0%"}
{cpu,172,0.0,"0.0%"}
{cpu,173,0.0,"0.0%"}
{cpu,174,0.0,"0.0%"}
{cpu,175,0.0,"0.0%"}
{cpu,176,0.0,"0.0%"}
{cpu,177,0.0,"0.0%"}
{cpu,178,0.0,"0.0%"}
{cpu,179,0.0,"0.0%"}
{cpu,180,0.0,"0.0%"}
{cpu,181,0.0,"0.0%"}
{cpu,182,0.0,"0.0%"}
{cpu,183,0.0,"0.0%"}
{cpu,184,0.0,"0.0%"}
{cpu,185,0.0,"0.0%"}
{cpu,186,0.0,"0.0%"}
{cpu,187,0.0,"0.0%"}
{cpu,188,0.0,"0.0%"}
{cpu,189,0.0,"0.0%"}
{cpu,190,0.0,"0.0%"}
{cpu,191,0.0,"0.0%"}
{cpu,192,0.0,"0.0%"}

What we’ve tried so far

  • We’ve tried different settings of the erl flags +sbt, +sbwt, and +scl to no avail; the same behavior is present.
  • We’ve inspected the Linux system outside of the BEAM and didn’t see any evidence that the system as a whole is affected.
  • We are attempting to confirm with AWS that the underlying hardware is configured with optimal settings. Nothing to report at this time.

Help

Where do we go next? Any hints on debugging tools in Erlang or Linux that would reveal more information? Is an OTP GitHub issue prudent?

Thanks!


Update: Since this post, we gathered more information from Erlang microstate accounting (thank you @weaver) that suggests the problem actually lives in the ets data tracking (the measurement) rather than the benchmark workload itself. We’ve removed the ets calls from our benchmark, and in all testing the system is behaving well.
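For anyone who wants to repeat the measurement, microstate accounting can be collected roughly like this (msacc is part of runtime_tools; the 10-second window is arbitrary):

%% Reset counters, run microstate accounting for 10 s, then print a summary.
msacc:start(10000),
msacc:print().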

The blog post Decentralized ETS Counters for Better Scalability - Erlang/OTP may fully describe the poor performance we are seeing, and we are continuing to test under this hypothesis.


With an eye on ets, the benchmark becomes much simpler: we simply call ets:update_counter from a set of Erlang processes, all targeting the same key. Clearly there is strong lock contention in this benchmark, but the performance on c7i.metal-48xl and c7i.24xlarge is so bad that I can only conclude something is wrong/misconfigured on that hardware.

ets_load_benchmark_public | GitHub

In this output the first number is the number of Erlang processes updating the single ets key, and the measurements are taken from the resulting counter value on that key. The execution time of the test increases because the processes are not getting scheduled reliably.
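The core of the benchmark is roughly the sketch below (the table name, key, and 5-second window are ours; the real code is in the repo):

%% Spawn NumProcs processes that all hammer one ets key with update_counter/3,
%% then read the counter after ~5 s to derive an update rate.
run(NumProcs) ->
    Tab = ets:new(bench, [public, set]),
    ets:insert(Tab, {counter, 0}),
    Pids = [spawn(fun() -> spin(Tab) end) || _ <- lists:seq(1, NumProcs)],
    timer:sleep(5000),
    Count = ets:lookup_element(Tab, counter, 2),
    [exit(P, kill) || P <- Pids],
    Count / 5.0.

spin(Tab) ->
    ets:update_counter(Tab, counter, 1),
    spin(Tab).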

m5.24xlarge
This looks pretty good considering how much we’re asking of the ets locking.

2 => 1737107.4495706013 /s (8697697 in 5.007 s)
4 => 534051.9169329073 /s (2674532 in 5.008 s)
8 => 366312.7113586632 /s (1841454 in 5.027 s)
16 => 219697.70660340055 /s (1111231 in 5.058 s)
32 => 202757.12634822805 /s (1052715 in 5.192 s)
64 => 191872.39902080785 /s (1254078 in 6.536 s)
128 => 169693.27168755324 /s (2789418 in 16.438 s)
256 => 163980.70872155187 /s (10735817 in 65.47 s)

c7i.24xlarge
The system is crushed with even a modest number of processes (16).

2 => 4802641.2869704245 /s (24032417 in 5.004 s)
4 => 1059262.9896083134 /s (5300552 in 5.004 s)
8 => 737230.5849470951 /s (3692788 in 5.009 s)
16 => 28791.825972313778 /s (218386 in 7.585 s)
32 => 12475.239955831139 /s (146871 in 11.773 s)
64 => 15266.53054813884 /s (97202 in 6.367 s)
128 => 8570.778470099538 /s (1749673 in 204.144 s)
256 => 8488.675595899067 /s (8408916 in 990.604 s)

Addressing the lock contention with ets sharding, write_concurrency, etc. does help work around the problem, but the question remains: what is different about this hardware?
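For completeness, the workaround amounts to table options and/or key sharding along the lines of the sketch below (these are real ets options, but availability of decentralized_counters depends on OTP version; the per-scheduler sharding scheme is just one illustration):

%% Spread the internal locking across more locks.
Tab = ets:new(bench, [public, set,
                      {write_concurrency, true},
                      {decentralized_counters, true}]).

%% A single hot key still serializes updates, so shard it, e.g. per scheduler,
%% and sum the shards when reading.
Key = {counter, erlang:system_info(scheduler_id)}.
ets:update_counter(Tab, Key, 1, {Key, 0}).
lists:sum([N || {_, N} <- ets:tab2list(Tab)]).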

Any ideas from the core team or community?


The M5 is “Intel Xeon Platinum 8175” while the C7i is “Intel Xeon Scalable (Sapphire Rapids)” according to vantage.sh. Given your findings I’d suspect NUMA effects, cache line sizes (look out for false sharing), or possibly differences in cache coherency protocols to be the culprit.


Do you understand the physical layout of CPU cores in these systems?

There are groups of efficient high bandwidth cores, called chiplets or tiles. These groups connect via comparatively slow interconnects.

If a thread migrates between cores within a chiplet, the on-CPU cache is lost and must be refetched from the next level.

If the thread migrates between chiplets, the same thing happens, but the performance hit is much worse and more cache is lost.

Because of this, we want the BEAM not to migrate between CPU cores.

Generally, on these types of systems you pin processes (including network cards, OS processes, and BEAM processes) to specific cores to avoid this massive penalty.

You trade off a lot of flexibility, though.

You may well get better performance by pinning BEAM schedulers and running multiple BEAMs too, isolating each one within a chiplet, and thus avoiding crossing.

The design of these systems usually targets hyperscaler cloud vendors, who use VMs to subdivide them onto tiles.

Consider trying the larger arm64 CPUs for comparison; their internal architecture is less susceptible to this.

But mostly try pinning all BEAM and OS processes to specific cores to avoid crossing the chiplet chasm.

Also retry specifically with hyperthreading disabled and see if that helps.

There are some serious tools on Intel platforms to examine this and see what the CPUs are doing, but they’re not lightweight. Look for talks, books, and content by Brendan Gregg; he’s the legend in this space.

https://erlangforums.com/t/beam-on-128-core-linux-box-seeming-not-to-use-many-of-available-cores/ this thread has a lot of relevant info too

And it might also be that the CPU just sucks: Bug Forces Intel to Halt Some Xeon Sapphire Rapids Shipments | Tom's Hardware


Thank you for this information @dch, I learnt a lot here!


Great info, thank you. We didn’t see any benefit from the various scheduler options such as +sbt, +sbwt, and +scl with this system and this workload.

I do agree that running several beam.smp processes with partitioned scheduler counts would obviate the issue, as it creates boundaries between the chiplets that data would not cross, which makes cache coherency easier for the system. I suppose a conclusion here is that this hardware was designed to be subdivided, and that is doubly true for any custom configuration done for and by AWS on this platform.

I appreciate the advice about ARM and the links to further information. It’s tempting to think we’ve been subjected to a hardware bug but of course I have no evidence of that.

For the time being we are forced into the conclusion that this hardware is not out-of-the-box compatible with a single beam.smp with default settings. (YMMV)

I don’t have appropriate hardware here to try it, but roughly this is what I would do:

  • figure out the actual CPU/NUMA topology; the AWS docs may help
  • for whatever Linux you have, ensure NICs & OS processes are on a separate chiplet from the ones used for the BEAM, to reduce contention
  • compare the NUMA topology to that reported by erlang:system_info(cpu_topology).
  • look up +sct and +sbt
  • try with & without +scl false, which may reduce scheduler migration & compaction (at the cost of wasted CPU)

By binding the BEAM to free CPUs, ensuring the logical processor layout matches the physical layout, and reducing scheduler migration, you should get a lot better performance out of this hardware.
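To compare what the VM sees against the host layout, something like this in the Erlang shell is a starting point (alongside lscpu --extended and numactl --hardware on the OS side):

%% Topology as detected (or as overridden with +sct):
erlang:system_info(cpu_topology).
erlang:system_info({cpu_topology, detected}).

%% Current bind type and per-scheduler bindings; unbound by default,
%% set at startup with e.g. +sbt db (and a custom map via +sct if needed).
erlang:system_info(scheduler_bind_type).
erlang:system_info(scheduler_bindings).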

Very interested to hear about experiments in this area; I just don’t have the load to experiment further in this space.
