Is it possible for the BeamAsm to use vectorized instructions?

mpope · December 22, 2024, 8:19pm

This might be a silly question, and it is a question not a feature request by any means, but could it be possible to generate vectorized instructions through the BeamJit? I see that it is supported by asmjit. For example could the addition of 4 element tuples of homogeneous small integer types be translated to AVX-512

[{1,2,3,4},{2,3,4,5}] + [{2,3,4,5},{3,4,5,6}]

Looking through the x86/instr_arith.cpp it seems like an emit_i_vec_plus would be pretty challenging as there seems to be quite a bit of logic around integer types in BeamModuleAssembler::emit_i_plus And the + BIF is not aware of addition on lists…

vkatsuba · December 22, 2024, 8:50pm

Not a silly question at all - it’s a fascinating idea! While BeamJIT leverages asmjit for JIT compilation, the BEAM itself is designed with a focus on generic, dynamic typing and process isolation rather than low-level, type-specific optimizations like SIMD (e.g., AVX-512).

The challenge here is that BEAM instructions like + are built to handle a wide range of types and not specifically optimized for operations on homogeneously typed data structures like tuples of integers. Adding support for vectorized instructions would require significant changes, including:

Adding new specialized instructions to the BEAM specifically for vectorized operations, like an emit_i_vec_plus as you mentioned. This would bypass the current generic logic in emit_i_plus.
Extending the compiler and runtime to recognize patterns or data types that could benefit from vectorization. For example, tuples of small integers would need to be flagged as candidates for SIMD optimizations.
Handling edge cases gracefully, like non-uniform tuple sizes, mixed types, or operations that fall outside what SIMD can handle efficiently.

While asmjit technically supports generating such instructions, implementing this on the BEAM would require non-trivial changes to both the JIT and the interpreter to introduce these optimized paths. It’s definitely possible in theory, but the trade-off is the complexity and potential performance cost for general cases versus the benefits in specific scenarios like your example.

It’s an intriguing direction though! Perhaps in the future, as more workloads demand these kinds of optimizations, we might see extensions or experimentation along these lines. For now, handling such cases outside the BEAM with NIFs or ports optimized for vectorized operations is usually the go-to approach.

mpope · December 22, 2024, 9:41pm

Yes, right now I’m pushing (attempting to) some instructions off into LLVM IR JIT through Rustler, and it occurred to me that with the BeamAsm JIT maybe it’d be possible to not have do that

rvirding · December 23, 2024, 4:08pm

Later versions of the compiler do try to work out the type of a variable and if possible can use maybe optimised instructions with them. NOTE, I mean really NOTE, the compiler ignores any type specs in your code, the truth is in the code it the code itself! You can quite happily give a type spec which says the function returns an atom when the it actually returns a tuple.

THE TRUTH IS IN THE CODE!

mpope · December 24, 2024, 11:59pm

I was not really referring to specs. emit_i_plus does runtime inspection, for example. But I see your point that it is useless to rely on them for optimizations or correctness, it is a good callout.

An alternative question is: is it possible to pass instructions to the asmjit? Seems like the answer may be no, given that it doesn’t seem to expose some kind of IR.

filmor · December 26, 2024, 1:10pm

“wide range of types” is a bit of an overstatement. IIRC, Erlang’s + currently badargs on everything that is not a number.

This whole answer looks like it is completely AI generated.

vkatsuba · December 26, 2024, 2:11pm

Well, you’re absolutely correct that Erlang’s + only operates on numbers and will throw a badarg error for other types. My initial phrasing - referring to a “wide range of types” - was misleading in this context. What I intended to highlight is that the BEAM, as a virtual machine, is designed for dynamic typing and general-purpose operations, rather than specialized numerical workloads like SIMD. While Erlang’s + itself is type-specific, the complexity arises when introducing new constructs or behaviors like vectorized operations. This complexity isn’t due to + directly but rather the BEAM’s dynamic and general-purpose design. For use cases requiring heavy numerical computation or SIMD-style processing, a more practical approach might be offloading to NIFs or ports, where you can leverage libraries or languages that are optimized for this kind of workload.

This whole answer looks like it is completely AI generated.

Haha, you got me! It could be AI-generated, but if it were, I’d ask for a refund because clearly, the “wide range of types” line needs a patch update. Jokes aside, this response is hand-crafted with love, sweat, and too much coffee - no bots were harmed in its creation (though one might’ve been consulted for inspiration).

jhogberg · December 26, 2024, 2:18pm

There are more types under the hood than are visible to the Erlang programmer.

For example, 0 and 12345678910111213141516171819202122232425 are distinct types internally (and we may add a similar split for floats). Any SIMD-implementation would need to handle this somehow, which would probably be horribly inefficient unless the compiler could somehow infer that the operations would neither fail nor act on or produce values outside a certain range.

On top of that, the scatter/gather operations required to operate on anything but tuples would probably dwarf the small compute win anyhow.

mpope · December 26, 2024, 3:36pm

Yeah loading the memory would be an issue. Could it be possible to do this on two binaries?

jhogberg · December 27, 2024, 8:06am

While I’m sure it could work out reasonably well for byte-aligned binaries, unaligned binaries throw a massive wrench into it; the compiler has no way of telling aligned and unaligned binaries apart (since it’s invisible to the BEAM layer, only ERTS knows the details), so we’d always need to emit code for handling them. Code which slows down the happy path considerably. :-/

mpope · December 27, 2024, 5:01pm

Okay that makes sense. Say a new operator was added for the sake of discussing the general possibility of this (I’m not making a EEP, I see how well that goes with the generators proposals ). This deviates from the usual compiler taking care of optimizing addition loops into vectorized / unrolled operations and the final horizontal sum.


<<0,0,0,36/integer>> = <<1,2,3,4>> +++ <<5,6,7,8>>.

I guess this is interesting in my eyes. As a new operator wouldn’t slowdown other existing hot paths, it has a very explicit intent, isolated to binaries. The runtime byte alignment checking could be relegated to those JIT emitters. It is also explicit in its intent, this should throw a runtime exception for unaligned binaries.

Again I’m curious of the possibility of this because access to vectorized instructions is at the current time not existent, and to get them one has to drop down to C/Rust/LLVM IR. I went to a talk by Pinterest back in 2018 and they created a columnar query engine using NIFs, so this isn’t a completely unique problem.

But again this isn’t a feature request, but was more curious if this was ever considered \ or if there would be known challenges.