Do you understand the physical layout of CPU cores in these systems?
The cores are grouped into chiplets (or tiles); within a group the cores share fast, high-bandwidth links and caches, but the groups themselves are connected by comparatively slow interconnects.
If a thread migrates between cores within a chiplet, the core-local cache contents are lost and must be refetched from the next cache level.
If the thread migrates between chiplets, the same thing happens, but more cache is lost and the performance hit is far worse.
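You can see the layout on Linux with standard tools (lstopo comes from the hwloc package; the exact columns and grouping shown will vary by CPU and distro):

```sh
# Which logical CPU belongs to which core, socket, NUMA node, and cache group
lscpu --extended=CPU,CORE,SOCKET,NODE,CACHE
# Text/graphical map of the whole package, including per-chiplet L3 domains
lstopo --no-io
```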
Because of this, we want the BEAM's schedulers not to migrate between CPU cores.
Generally, on these types of systems you pin processes (including network card interrupts, OS processes, and BEAM processes) to specific cores to avoid this massive penalty.
You trade off a lot of flexibility, though.
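As a rough sketch (the core numbers are placeholders for whatever your topology actually looks like): pin the whole OS process with taskset and ask the VM to bind its schedulers. How +sbt interacts with an external affinity mask can vary by ERTS version, so verify the result with erlang:system_info(scheduler_bindings).

```sh
# Hypothetical example: confine the BEAM to cores 0-7 (one chiplet),
# start 8 schedulers, and bind them to cores (+sbt db = default bind).
taskset -c 0-7 erl +S 8:8 +sbt db
```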
You may well get better performance by pinning BEAM schedulers and running multiple BEAM instances, isolating each one within a chiplet and thus avoiding the crossing entirely.
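Something like this, purely illustrative and assuming two 8-core chiplets, with one node confined to each:

```sh
# One BEAM node per chiplet, each restricted to its own core range
taskset -c 0-7  erl -name node1@127.0.0.1 +S 8:8 +sbt db -detached
taskset -c 8-15 erl -name node2@127.0.0.1 +S 8:8 +sbt db -detached
```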
The design of these systems is usually targeted at hyperscaler cloud vendors, who use VMs to subdivide the chip along chiplet/tile boundaries.
Consider trying the larger arm64 CPUs for comparison; their internal architecture is less susceptible to this.
But mostly try pinning all BEAM and OS processes to specific cores to avoid crossing the chiplet chasm.
Also retry specifically with hyperthreading disabled and see if that helps.
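On reasonably recent Linux kernels you can toggle SMT at runtime, assuming the sysfs control is exposed (otherwise disable it in the BIOS):

```sh
# Check whether SMT is currently active, then turn it off for a test run
cat /sys/devices/system/cpu/smt/active
echo off | sudo tee /sys/devices/system/cpu/smt/control
```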
There are some serious tools on Intel platforms for examining this and seeing what the CPUs are actually doing, but they're not lightweight. Look for talks, books, and other content by Brendan Gregg; he's the legend in this space.
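perf is a decent first pass before the heavier tooling; a counting run like this (replace <beam_pid> with your actual BEAM OS pid) shows how often the OS is moving the process around and how hard the caches are being missed:

```sh
# Count migrations, context switches, and cache misses for a running BEAM
# over a 30-second window
sudo perf stat -e cpu-migrations,context-switches,cache-misses,LLC-load-misses \
  -p <beam_pid> -- sleep 30
```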
https://erlangforums.com/t/beam-on-128-core-linux-box-seeming-not-to-use-many-of-available-cores/ this thread has a lot of relevant info too
And it might also be that the CPU just sucks: "Bug Forces Intel to Halt Some Xeon Sapphire Rapids Shipments" (Tom's Hardware).