Should "x" registers be part of local (process) or global context?

When checking the AtomVM header files I noticed that the VM’s “x” registers are part of the local context (src/libAtomVM/context.h), meaning that every Erlang process has its own set of registers (if I understand it right).
Shouldn’t the registers be part of the GlobalContext in globalcontext.h, as they are part of the “global” VM? In that case it might be possible to map some registers to real CPU ones, as is done in OTP’s beam_emu.c.
Or maybe this is a tricky thing to change?

1 Like

Hi @karlsson,

You are correct that each context does have its own set of x-registers (an array of 16 terms). Not being the author or an expert on BEAM internals, I can’t really say what the genesis of the design decision is.
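
To make that concrete, the layout is roughly the following. This is an illustrative sketch only, not the actual declarations from context.h and globalcontext.h; the type name term and the field names are simplified stand-ins:

#include <stdint.h>

typedef uintptr_t term;            /* illustrative; AtomVM defines its own term type */
struct GlobalContext;              /* VM-wide state: process table, loaded modules, ... */

struct Context                     /* per-process state, heavily simplified */
{
    term x[16];                    /* each process carries its own x-register array */
    struct GlobalContext *global;  /* back-pointer to the shared, VM-wide state */
    /* heap, stack, mailbox, ... */
};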

However, I wonder about two things:

  1. The compiler generates instructions for reading from and writing to registers. If the space of registers is shared, how does the compiler know which registers are safe to read or write?

  2. I can imagine there would be some lock contention for shared memory in an SMP environment. How does BEAM manage concurrent access from multiple cores for reads and writes over a single block of memory? Or does the BEAM keep a set of registers per core/scheduler?

Sorry if my questions are naive.

2 Likes

I know nothing about AtomVM, but in the BEAM the X registers are local to the process being executed. The X registers are “global” in the sense that they aren’t affected by recursive function calls. The Y registers, OTOH, are really just aliases for words in the currently executing function’s stack frame, so they change as recursive calls are made and returned from.

2 Likes

I think there may be a misunderstanding here. The BEAM threaded emulator (the one linked to, in beam_emu.c) doesn’t keep the X registers in CPU registers; rather, it tries to force the variables used inside the emulator loop, including the stack and heap pointers, into CPU registers.

Interestingly, REG_xregs is defined but not used any more; it was used in R13B, though, where it was a pointer to the X registers.

BEAM also has an optimization related to x0, which it can have because it rewrites opcodes before they are processed by the loop, and I believe it rewrites opcodes involving x0 differently. AtomVM doesn’t rewrite opcodes because doing so is more complicated and would be expensive memory-wise, as code is currently mmap’d on the ESP32.

Yet the AtomVM emulation loop could be optimized further. We introduced gcc goto labels for traps (a trick also used in BEAM) and could indeed benchmark further optimizations. Please feel free to submit PRs.
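
For readers unfamiliar with the gcc “labels as values” trick: it lets each opcode handler jump directly to the next handler instead of going back through a switch. The toy dispatch loop below is a generic illustration of the technique, not AtomVM or BEAM code; the opcodes and names are made up:

#include <stdint.h>
#include <stdio.h>

enum { OP_INC, OP_PRINT, OP_HALT };

static void run(const uint8_t *code, long acc)
{
    /* One entry per opcode; &&label takes the address of a label (GCC/Clang extension). */
    static void *dispatch[] = { &&op_inc, &&op_print, &&op_halt };
    const uint8_t *ip = code;

#define DISPATCH() goto *dispatch[*ip++]

    DISPATCH();

op_inc:
    acc++;              /* "increment the accumulator" */
    DISPATCH();
op_print:
    printf("%ld\n", acc);
    DISPATCH();
op_halt:
    return;
#undef DISPATCH
}

int main(void)
{
    const uint8_t prog[] = { OP_INC, OP_INC, OP_PRINT, OP_HALT };
    run(prog, 0);       /* prints 2 */
    return 0;
}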

Still, the registers cannot be in the GlobalContext because they are intrinsically local to each process, and several processes can run at the same time (now that AtomVM supports SMP).

4 Likes

Yes,
Thanks for the replies, everyone.
I think my question was based on some misunderstanding on my side. As mentioned, once SMP is introduced this is not an option, and the registers from beam_emu.c that I had in mind were actually these (now excluding REG_xregs):

#if defined(__GNUC__) && defined(sparc) && !defined(DEBUG)
#  define REG_xregs asm("%l1")
#  define REG_htop asm("%l2")
#  define REG_stop asm("%l3")
#  define REG_I asm("%l4")
#  define REG_fcalls asm("%l5")
#elif....
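
For anyone unfamiliar with these macros: they rely on GCC’s explicit register variable extension. The sketch below shows roughly how such macros get used inside the emulator loop; it is illustrative only, not a verbatim quote from beam_emu.c, and it only compiles for the SPARC branch quoted above since the register names are SPARC-specific:

/* Illustrative sketch, not actual beam_emu.c code. */
typedef unsigned long Eterm;     /* stand-ins for the real ERTS typedefs */
typedef unsigned long BeamInstr;
typedef long Sint;

#define REG_htop   asm("%l2")    /* as in the SPARC branch quoted above */
#define REG_stop   asm("%l3")
#define REG_I      asm("%l4")
#define REG_fcalls asm("%l5")

void emulator_loop_sketch(void)
{
    register Eterm *HTOP REG_htop = 0;       /* heap top pointer    -> %l2 */
    register Eterm *E REG_stop = 0;          /* stack (E) pointer   -> %l3 */
    register BeamInstr *I REG_I = 0;         /* instruction pointer -> %l4 */
    register Sint FCALLS REG_fcalls = 0;     /* reduction counter   -> %l5 */

    (void) HTOP; (void) E; (void) I; (void) FCALLS;
    /* ... the instruction dispatch loop would live here ... */
}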

This has a bearing on an old question on the mailing list, which @mikpe answered 5 years ago, about optimizing for the RISC-V architecture. RISC-V is a register-based machine, as is the Erlang VM.
Since AtomVM is implemented for the ESP32, and the ESP32-C3 version is a RISC-V architecture (RV32IMC), I thought it would be interesting to see if there could be room for any optimizations in that case.
My guess is that in the long run there will be many devices based on FreeRTOS and RISC-V, so AtomVM might be quite interesting for this more “general” case too.

1 Like

FWIW, we do have some ESP32-C3 (RISC-V) images coming out of our AtomVM CI builds. Not that they are optimized for that architecture, but we do at least run on them.

2 Likes

BEAM originally had two optimizations for the handling of x0.

The first optimization was to store the contents of x0 in a CPU register. That made sense for platforms with many CPU registers, such as Sparc and PowerPC. It complicated the code and did not work for platforms with fewer CPU registers, so we removed that optimization in OTP 19. The comment about X registers that you quoted is no longer correct. I have created a pull request to correct that comment and remove the other vestiges of the optimization.

The other optimization, which is still used, is to not explicitly encode an x0 operand for certain commonly used instructions. That is slightly faster and often makes the instruction one word shorter. As an example, consider the following two functions:

i0(A) -> A + 42.
i1(_, A) -> A + 42.

We can disassemble the loaded code like so:

1> c(t).
{ok,t}
2> erts_debug:df(t).
ok

In the file t.dis we find the code for the two functions:

0000000143DFF590: i_func_info_IaaI 0 `t` `i0` 1 
0000000143DFF5B8: i_increment_rWd r(0) 42 x(0) 
0000000143DFF5D0: return 

0000000143DFF5D8: i_func_info_IaaI 0 `t` `i1` 2 
0000000143DFF600: i_increment_xWd x(1) 42 x(0) 
0000000143DFF620: return 

The BEAM loader has rewritten the addition operator to the specialized i_increment instruction.

The i_increment instruction in the i0/1 function uses the r(0) operand. An r(0) operand is not explicitly encoded. As can be seen by subtracting the addresses of the instructions, that instruction occupies 3 words (24 bytes).

The i_increment instruction in the i1/1 function uses an x(1) operand, which needs to be explicitly encoded. Therefore, that instruction occupies 4 words (32 bytes).
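
Working it out from the addresses in the listing:

0x143DFF5D0 - 0x143DFF5B8 = 0x18 = 24 bytes = 3 words (8-byte words on a 64-bit system)
0x143DFF620 - 0x143DFF600 = 0x20 = 32 bytes = 4 words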

4 Likes

If I recall correctly, the Quintus implementation of the WAM for Prolog kept X1-X4 in CPU registers on all machines with “enough” registers, where “enough” was defined as 16. This included the M68010, VAX, and S/370. It’s a bit shocking to realise that we’d have been thrilled to our very socks with the RP2040 (dual-core ARM Cortex-M0+) or the ESP32 (dual-core Xtensa LX6). Given that even PCs today have 16 general-purpose registers, might it be worth resurrecting the old X0-in-register optimisation?

3 Likes

I think adopting the same approach as the JIT (xregs, fregs, bitstring construction state, etc. in the same variable/register) would be worthwhile too. I didn’t do it back then because I wanted minimal changes to the interpreter, but I think the time is ripe now. It ought to give the compiler enough leeway to do the right thing without these annotations.

The ARM JIT does this to great effect (x0-x5), and I recently experimented with doing the same for x86 (x0-x3), where it unfortunately didn’t give much at all. :-/
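
As a rough illustration of that idea in the interpreter (hypothetical names and sizes, not actual BEAM or AtomVM code): keep the hottest X registers in C locals inside the emulator loop so the compiler can allocate them to host registers, and spill them back to the per-process array only when something needs the full array (a BIF call, GC, a trap):

/* Hypothetical sketch only. */
typedef unsigned long Eterm;                 /* illustrative term type   */
struct Process { Eterm xregs[1024]; };       /* illustrative X-reg array */

static void emulator_loop(struct Process *p)
{
    Eterm x0 = p->xregs[0];                  /* cached copies -> likely host registers */
    Eterm x1 = p->xregs[1];

    /* ... the dispatch loop reads and writes x0/x1 directly ... */
    x1 = x0;                                 /* e.g. a hypothetical "move x0, x1" */

    p->xregs[0] = x0;                        /* spill back before GC, BIF calls, */
    p->xregs[1] = x1;                        /* traps, or a process switch       */
}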

4 Likes

Just adding a note here that the RISC-V ISA has 31 general-purpose registers, except for the embedded (RV32E) variant, which has 15.
Although the registers are general purpose, the ISA suggests a calling convention, as described in the old post mentioned before.
Maybe that is enough for the Erlang Core team to play with… :slightly_smiling_face:

1 Like