Large number of page faults while reading files

door · December 26, 2024, 12:11pm

Hi,

We compared Nginx and Erlang in roughly the same role as edge servers: delivering HLS video with caching from a DVR server.
The load was approximately 2 Gbps on a host with 24 CPU cores.

When comparing the results of perf stat, a significant difference in the number of page faults becomes evident: 10/s for Nginx versus 4K/s for Erlang.

This is problematic. The flamegraph shows that a lot of time is spent inside readv and exc_page_fault.
Running strace -f -e trace=memory revealed an unexpectedly high number of mmap/munmap calls, which seems strange: why free memory only to allocate it again immediately?

The issue can be partially mitigated by enabling the super carrier, but as traffic increases, it may become insufficient, causing page faults to rise again.

Memory settings used: "+MMscs 32000 +MMmcs 30 +MBas aoffcaobf +MBsbct 4096 +MBmmsbc 100000000 +MBmmmbc 100000000 +MBsmbcs 16000 +MBlmbcs 16000 +MBacul 10 +Mdai 16"

starbelly · December 26, 2024, 12:29pm

Can you share why you ended up with the memory settings you have above?

door · December 26, 2024, 12:53pm

Gradually, by adapting the server to different loads. Interestingly, the latest change was aimed at making memory allocation more efficient. Earlier settings were different:
+MMmcs 30 +MBas aoffcaobf +MBsbct 8192 +MBsmbcs 64000 +MBlmbcs 128000 +MBmbcgs 3 +MBacul 10 +Mdai 16
I probably need to check this.
However, the result without any tuning was seemingly the same.

starbelly · December 26, 2024, 2:26pm

I see. You say you noticed a high number of mmap/munmap calls. Have you ruled out NIFs and/or sys allocations? Relatedly, have you looked at +Musac true?

maxlapshin · December 27, 2024, 9:45am

No, we are a bit wary of +Musac, because mmap by itself should be absolutely fine.

The main problem is the immediate munmap after mmap.

mpope · December 27, 2024, 7:35pm

Are you using raw mode when reading the files from disk?

maxlapshin · January 6, 2025, 10:42am

No.

After all the searching, it turned out that the problem was in our code: it called malloc instead of erts_alloc.

With that fixed and the super carrier enabled, page faults have dropped a lot, along with the number of mmap/munmap calls.
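
For anyone who wants to see in isolation why an immediate munmap after mmap is expensive, here is a minimal sketch in C, assuming Linux. It is not the server code discussed in this thread: the function names faults_with_churn and faults_with_reuse are made up for illustration, and it calls mmap directly rather than going through malloc or the Erlang allocators. It counts minor page faults via getrusage for two patterns: mapping a fresh region every round versus mapping once and reusing it.

```c
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page faults taken by this process so far. */
static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

/* Write one byte per page so the kernel has to back every page. */
static void touch_pages(unsigned char *buf, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < len; off += page)
        buf[off] = 1;
}

/* Anonymous private mapping; returns NULL on failure. */
static void *map_anon(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Pattern A (hypothetical name): map, touch and unmap a fresh region
 * every round. Returns the minor faults taken, or -1 if mmap fails. */
long faults_with_churn(size_t len, int rounds)
{
    assert(len > 0 && rounds > 0);
    long before = minor_faults();
    for (int i = 0; i < rounds; i++) {
        unsigned char *buf = map_anon(len);
        if (buf == NULL)
            return -1;
        touch_pages(buf, len);
        munmap(buf, len);          /* pages go back to the kernel here */
    }
    return minor_faults() - before;
}

/* Pattern B (hypothetical name): map once and reuse the same region
 * every round. Returns the minor faults taken, or -1 if mmap fails. */
long faults_with_reuse(size_t len, int rounds)
{
    assert(len > 0 && rounds > 0);
    long before = minor_faults();
    unsigned char *buf = map_anon(len);
    if (buf == NULL)
        return -1;
    for (int i = 0; i < rounds; i++)
        touch_pages(buf, len);     /* only the first pass should fault */
    munmap(buf, len);
    return minor_faults() - before;
}
```

Calling both with the same arguments, for example faults_with_churn(1 << 20, 50) and faults_with_reuse(1 << 20, 50), should on a typical Linux host show the first count growing with the number of rounds while the second stays at roughly one pass over the region. That matches what keeping memory inside long-lived carriers is expected to buy, as opposed to returning it to the OS after every use.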