I’m compiling Erlang/OTP (28+) from source on Linux using these GCC flags: -O2 -march=native -fomit-frame-pointer -funroll-loops
Looking for community guidance on:
Which additional optimization flags are safe to use for maximum performance? I’ve avoided -O3 due to concerns about potential instability, but wondering if that’s overly cautious for Erlang.
Are any of my current flags considered problematic for production Erlang systems?
For Docker deployments where portability matters, what’s the recommended approach instead of -march=native while still maintaining good performance?
Would appreciate any insights from your production experiences with optimized Erlang builds.
Even the Linux kernel, one of the most performance-critical codebases in
existence, explicitly uses -O2 as its default optimization level rather than -O3. This can be seen directly in the official Linux kernel Makefile
where CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE sets KBUILD_CFLAGS += -O2.
Linus (Torvalds) has explicitly rejected attempts to use -O3 in the kernel,
citing concerns about compiler bugs and lack of performance benefits. As one
kernel documentation source notes: “This is the default optimization level for
the kernel, building with the -O2 compiler flag for best performance and
most helpful compile-time warnings.”
Note that the Erlang/OTP build itself enables -O3 for select files, in particular the BEAM emulator beam_emu.c. (I haven’t checked if that’s changed with JIT.) -O3 has historically been problematic because it enabled auto-vectorization, which has had a number of bugs. However, some of that is enabled even at -O2 these days.
-march=native is problematic if build and run hosts aren’t exactly the same. You may even run into problems on big.LITTLE systems unless the two sets of cores have identical feature sets.
We use rpmbuild’s defaults, currently on AL2023 and previously on AL2 and several generations of CentOS.
If -march=native is too restrictive, you can usually find a combination of -march and -mtune that targets processors newer than those from the 90s, so the build can still take advantage of newer CPU features.
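As a rough sketch of what such a combination might look like on x86-64: the x86-64-v2 level (needs GCC 11 or later) is an assumption picked for illustration, not something recommended in this thread, and PORTABLE_CFLAGS is just a hypothetical variable name. It also assumes OTP's configure picks up CFLAGS from the environment, as autoconf-generated scripts normally do.

```shell
# Hypothetical portable baseline (an assumption, adjust for your fleet):
# x86-64-v2 roughly means "x86-64 CPU with SSE4.2 and POPCNT".
PORTABLE_CFLAGS="-O2 -march=x86-64-v2 -mtune=generic"

# For comparison, show what -march=native resolves to on this build host.
# Skipped quietly if gcc is not installed or does not support the option.
if command -v gcc >/dev/null 2>&1; then
    gcc -march=native -Q --help=target 2>/dev/null \
        | grep -E '^[[:space:]]*-m(arch|tune)=' || true
fi

# Print the configure invocation rather than running it.
echo "CFLAGS=\"$PORTABLE_CFLAGS\" ./configure"
```

If the -march value printed for the build host is newer than the oldest CPU the image will run on, that is the case where -march=native would bite.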
-fdata-sections and -ffunction-sections combined with the linker flag -Wl,--gc-sections can shrink the size of the binary quite a bit by removing unused stuff.
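To see the effect on something small before trying it on a full OTP build, here is a minimal sketch using a throwaway C file. The file name, the unused_helper function, and the SECTION_CFLAGS / SECTION_LDFLAGS variable names are all made up for the demo; only the three flags come from the post above.

```shell
# Compiler and linker flags from the post above.
SECTION_CFLAGS="-O2 -ffunction-sections -fdata-sections"
SECTION_LDFLAGS="-Wl,--gc-sections"

# Throwaway program with one function that nothing calls.
workdir=$(mktemp -d)
cat > "$workdir/demo.c" <<'EOF'
#include <stdio.h>
int unused_helper(int x) { return x * 42; }
int main(void) { puts("ok"); return 0; }
EOF

# Build and check whether the unused symbol survived linking.
# Skipped if gcc or nm is not available.
if command -v gcc >/dev/null 2>&1 && command -v nm >/dev/null 2>&1; then
    if gcc $SECTION_CFLAGS $SECTION_LDFLAGS -o "$workdir/demo" "$workdir/demo.c"; then
        if nm "$workdir/demo" | grep -q 'unused_helper'; then
            echo "unused_helper kept"
        else
            echo "unused_helper removed"
        fi
    fi
fi

rm -rf "$workdir"
```

Dropping SECTION_LDFLAGS from the gcc line and rerunning should show the difference, since without --gc-sections the linker has no reason to discard the unreferenced section.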
Finally, there are various flavours of LTO, PGO, and BOLT that, combined, can give really nice wins, but they usually involve a more complex setup.