For example, if I call gc_bif
on X registers only and I call it on Y registers only, will there be any difference in performance?
It depends.
On the interpreted BEAM (non-JIT), some instructions are slightly more efficient when operating on X registers, and some instructions have shorter forms when operating on {x,0}.
On the x86_64 JIT, both X and Y registers are stored in memory, so there shouldn’t be much difference, except for how well the CPU will cache the memory locations.
On AArch64/ARM64, the first five X registers are kept in CPU registers, so X registers should be faster there.
Note, however, that X registers are used for temporary, short-lived values, while Y registers are used for values that need to live across function calls. Therefore, there is not really a choice between using X and Y registers. When a value in an X register has just been copied into a Y register, the compiler will prefer to use the X register to access the value.
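To make that concrete, here is a minimal sketch (with hypothetical module and function names) of the one situation where a Y register is required: a value that must survive a call.

```erlang
f(A) ->
    m:g(),    % the call may clobber every X register,
              % so the compiler saves A in {y,0} beforehand
    A + 1.    % after the call, A is read back from the stack slot
```

If A were not used after the call, no stack slot would be allocated and A would stay in an X register for its whole lifetime.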
Thank you for this answer!
In the compiler I am writing, I am facing an issue where I “push” X registers to Y registers before an external function call. After the call, I have a choice: use the Y registers directly with the pushed values, or “pop” from Y back to X and then use the X registers.
For example in this code
m:f(a),
a + 1.
I can generate (erlang compiler does this)
{allocate, 1, 1}
{move, {x, 0}, {y, 0}}
{call_ext, 1, {extfunc, m, f, 1}}
{gc_bif, '+', 1, [{y, 0}, {integer, 1}], {x, 0}}
{deallocate, 1}
Or I can generate
{allocate, 1, 1}
{move, {x, 0}, {y, 0}}
{call_ext, 1, {extfunc, m, f, 1}}
{move, {y, 0}, {x, 0}}
{deallocate, 1}
{gc_bif, '+', 1, [{x, 0}, {integer, 1}], {x, 0}}
The latter version contains one extra move, but if the ARM64 JIT keeps X registers in CPU registers, then both versions should have the same performance (since the first version will generate a similar move during JIT compilation anyway).
But for code like m:f(a), a + a * a - 1, I feel that the latter approach could even be faster on ARM64.
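For instance, the two choices for that expression might look roughly like this (a sketch in the same simplified notation as above, with fail labels omitted and live counts illustrative):

```erlang
%% Option 1: read {y,0} directly -- three stack reads
{gc_bif, '*', 1, [{y, 0}, {y, 0}], {x, 0}}        % a * a
{gc_bif, '+', 1, [{y, 0}, {x, 0}], {x, 0}}        % a + a * a
{gc_bif, '-', 1, [{x, 0}, {integer, 1}], {x, 0}}  % ... - 1
{deallocate, 1}

%% Option 2: pop once to {x,0} -- one stack read, then X-only arithmetic
{move, {y, 0}, {x, 0}}
{deallocate, 1}
{gc_bif, '*', 1, [{x, 0}, {x, 0}], {x, 1}}        % a * a
{gc_bif, '+', 2, [{x, 0}, {x, 1}], {x, 0}}        % a + a * a
{gc_bif, '-', 1, [{x, 0}, {integer, 1}], {x, 0}}  % ... - 1
```

In option 2, each operand read hits an X register instead of the stack, at the cost of the extra move.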
And about different JIT on different architectures, is it supposed to be this way or are these temporary differences related to different implementations?
Also, if I want to generate the most efficient BEAM code, which runtime should I optimize for: x86_64, the interpreter, or maybe ARM64? When it comes to generating efficient BEAM code, which runtime does the Erlang compiler target?
Your version might have the same performance on ARM64, but it will certainly be slower on other platforms, so why would you want to do that?
I say might because the loader and the JIT do optimizations based on knowledge of the kind of code our compiler emits. If you emit a different style of code, some of those optimizations may not be applied.
x86_64 has only 16 general-purpose registers, while ARM64 has 31 general-purpose registers. It might be possible to store a few X registers in CPU registers on x86_64, but probably only with tradeoffs that might ultimately not be worth it.
We avoid creating BEAM code that only works well on one kind of runtime system. Regarding register usage, the compiler assumes that accessing X registers will never be slower than accessing Y registers. That means that if the compiler prefers an X register over a Y register when there is a choice, performance will be good on the interpreter without harming the JIT systems.
Got it, thanks for detailed answer!
Would be curious to learn more about the compiler you’re making and your experience targeting the BEAM.
– Ben Scherrey