What are some possible solutions to make the BEAM faster for CPU calculations?

(Not sure what topic category to put this under)
Hello!

I’m really interested in the BEAM. So far I’ve been writing all of my personal projects in Rust because I really like the CPU performance Rust gives. However, I want to give the BEAM a shot, but I understand that right now if I want some good performance, I’ll have to write a NIF in C or Rust. Are there upcoming proposals/efforts to make the BEAM faster? Something similar to how the JVM has Project Valhalla adding Value Types/Classes. And, if any of those projects exist, is there a way I can help? I don’t think the BEAM could get up to Rust/C performance, but C#/Go level performance is acceptable to me at least.

2 Likes

Performance at what? Rust and Erlang are designed for different applications. Erlang is intended for high-reliability distributed applications. Rust, as Cloudflare found to their dismay, is not. Rust is intended for situations where a program is frozen for so long that it makes sense to invest a lot of programmer effort into persuading an enormously complex compiler that the program is completely trustworthy, like the Unix utilities. Erlang is intended for a world in which frequent maintenance occurs, downtime is undesirable, no component is perfect, and systems have to keep going anyway.

Erlang now has a type system and a JIT compiler, which are steps toward higher performance. These aren’t going to get the performance of Sycl for number crunching, but how good are Sycl and Rust at supervision trees?

Perhaps one useful thing would be to reproduce the Programming Language Shootout numbers using current versions of Erlang, Sycl, and Rust, and try to analyse the causes of the differences. Adding some tests that play to Erlang’s strengths would be good.

6 Likes

The BEAM is a beast of a project and it ain’t going to be easy to delve into it, but if you ask for it…

Type optimisations: a lot of work has gone into giving the compiler and the runtime information about parameter types, so that a lot of type checks can be skipped at runtime. That is, a function add(A, B) -> A + B normally needs to check at runtime whether A is a number and whether B is a number, then add, and raise if either is not a number. But if the compiler can ensure that the function is never called with anything that is not a number, it can generate assembly that just does the addition, without checking the types. Even better, if it knows that the numbers are small integers (that is, they fit in a machine word), it can skip the bignum logic and use a plain CPU add instruction. This gets insanely complicated across all the bigger types and the complexity of control-flow analysis, but if you’re up to the challenge, then please, contribute! :smiley:
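As a small sketch of what this looks like in source (module and function names made up for illustration), the guard information below is the kind of thing that lets the compiler and JIT drop runtime type checks inside the bodies:

```erlang
%% Illustrative module: the guards guarantee integer arguments, so
%% inside the bodies the '+' needs no "is this a number?" check.
-module(adder).
-export([add/2, sum3/3]).

-spec add(integer(), integer()) -> integer().
add(A, B) when is_integer(A), is_integer(B) ->
    A + B.

%% sum3/3 only ever calls add/2 with values already known to be
%% integers, so those calls can skip the checks too.
-spec sum3(integer(), integer(), integer()) -> integer().
sum3(A, B, C) when is_integer(A), is_integer(B), is_integer(C) ->
    add(add(A, B), C).
```

Whether the generated code really collapses to a machine add also depends on the compiler proving the integers are small, as described above.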

Other topics… non-mutation means a lot of copying, so if the compiler and runtime can ensure a term is not referenced anywhere else, they can modify it in place instead of making a copy. Again, control-flow analysis; it’s really hard, and if you get it wrong you trigger segfaults. It has happened to me :winking_face_with_tongue:
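A tiny illustration of the copying cost (hypothetical module; the in-place reuse described above is an optimisation the compiler may apply only when it can prove the intermediate term is dead):

```erlang
%% Conceptually, each setelement/3 builds a fresh tuple.
%% If the compiler can prove T1 is referenced nowhere else after the
%% second call, it is free to reuse T1's memory instead of copying.
-module(counters_demo).
-export([bump_both/1]).

bump_both(T) when tuple_size(T) =:= 2 ->
    A = element(1, T),
    B = element(2, T),
    T1 = setelement(1, T, A + 1),   %% first (conceptual) copy
    setelement(2, T1, B + 1).       %% T1 is dead after this line
```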

What else… the JIT compiler. I have no idea of how insanely complex that one is, but if you’re familiar with GitHub - asmjit/asmjit: Low-latency machine code generation then again, go rock it!

Hmm… message passing and process scheduling are the core of what makes the BEAM the BEAM, but I never found anything I could optimise there; they’ve been insanely well tuned for over three decades, but sometimes I wish they were faster.

If I get any other idea I’ll share too, that’s all I have in my head right now :slight_smile:

6 Likes

To be clear, I understand that Rust and Erlang have very different purposes. I’m not saying “give Erlang a borrow checker” or any of that; I’m interested in the question “what are some ways to make Erlang faster/more performant?”

1 Like

I wasn’t aware Erlang had gained a type system. I knew that Gleam has a type system, and that Elixir is in the process of getting one, but I didn’t know whether that had filtered down into the BEAM bytecode itself.

Edit:
Also, on the topic of the JIT, one of the things that originally inspired this question was that I loved reading the blog posts Shopify’s team did on the new Ruby JIT (YJIT and ZJIT).

I understand Erlang already has a JIT, but are there any ideas worth stealing from that effort?

1 Like

Erlang has always been a strictly typed language, just dynamically rather than statically typed. Every term has a type and type-casting is disallowed; what Elixir and Gleam are introducing are static types. The recent progress has been on using some static type information to let the JIT generate faster code, so that it doesn’t need to check types at runtime because it already knows them before running. It is at this second stage where maybe there’s even more type information the compiler could pass to the JIT that it isn’t passing already, and there might be more optimisations possible. Which ones, I don’t know; you’d need to check the compiler code, the SSA intermediate representation, and the JIT. And if you find an opportunity for improvement, please share! Even if it is too complicated for you to implement, just share the idea; somebody might be able to contribute :smiley:

I’m not familiar with Ruby, but after a quick browse, it seems to me like the ZJIT effort is somewhat close to the Erlang’s JIT :thinking:

3 Likes

I think you have a misunderstanding here. NIFs are not commonly used for performance. There’s a cost to the FFI barrier, and it limits the amount of optimising that the compiler and the JIT can do. For example, the fastest Erlang JSON module is not the one that uses a C library with a NIF, it’s the one written in well-optimised Erlang.
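As a concrete example of the pure-Erlang approach, the `json` module that ships with Erlang/OTP (assuming OTP 27 or later) is written in well-optimised Erlang, not as a NIF wrapper around a C library:

```erlang
%% Round-trip through the pure-Erlang 'json' module (OTP 27+).
%% json:encode/1 returns iodata, so flatten it before decoding.
Encoded = iolist_to_binary(json:encode(#{<<"n">> => 1})),
#{<<"n">> := 1} = json:decode(Encoded).
```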

It depends what you’re doing, but for some tasks you already have this. What specifically do you have in mind? It’s hard to give opinions as the question is somewhat vague.

For generally improving BEAM language performance, there are a few improvements that could be made to the runtime. A lot of the technology inside the BEAM is no longer state of the art; for example, the locking C code could be upgraded. There are also plenty more improvements that can be made to the JIT.

5 Likes

Yeah for context I read this forum thread recently: Anyone pushed the BEAM for real-time simulation workloads? I built a multiplayer game server framework pushing 10,000 Entities @ 30Hz - Chat / Discussions - Elixir Programming Language Forum

and they talked about how they needed a sidecar written in Zig/Rust in order to get the performance they needed

1 Like

I suspect that’s not so much Erlang specific, and other languages with GC such as Go and dotnet (as you mentioned) would struggle also. As the post mentions at the start, the conventional wisdom is to use a language with manual memory management, such as C++.

The game industry makes heavy use of performance optimisations and architectures that are unavailable in most languages, such as data-oriented-design and arenas.

2 Likes

I’m really curious to learn more about the Erlang compiler including the JIT and IR! Do you have any good resources on how the BEAM is architected?

That makes sense, though from my experience there are lots of ways for C# to match C++ performance (usually by using unsafe C# or structs, though, which I’m not sure would work in Erlang…)

The following blog post is a good starting point. It contains links to other blog posts about the JIT and the Erlang compiler.

There is also this presentation:

6 Likes

Oh cool! Has there been any notable updates to the information presented in that blog post since then?

And you just got an answer about the BEAM architecture from the architect of the BEAM, the B in BEAM is actually from his name :smiley:

Very good materials, both the blogpost and the presentation, I’d say they’re still very up-to-date so don’t be shy to read them carefully, they’re gold. Thank you @bjorng for the great content!

4 Likes

Sorry, I probably should have posted a link to the type of stuff I was talking about lol

(I don’t think such a thing would work in Erlang though (unless you did some weird stuff with ETS? I dunno))

Yeah no, it’s crazy that the developers of Erlang and the other BEAM languages just hang out on this forum. Thanks for all the answers!

2 Likes

Perhaps it would be a good idea to read the Erlang documentation before trying to improve Erlang.
(Please imagine that read in a very gentle loving voice with ABSOLUTELY NO SNARK IN IT.)
See Type Specifications and Erlang | Learn You Some Erlang for Great Good! for a gentle introduction.
See Types and Function Specifications — Erlang System Documentation v28.3.1 for official documentation.
This is checked by ‘dialyzer’.
The type system in the reference documentation is not the first type system ever developed for
Erlang, nor is it the last, see GitHub - erszcz/erlang-type-checker-comparison: Comparison of Erlang type checkers: Dialyzer, ETC, and Gradualizer
It is however the one with tools shipped with Erlang.
Read the documentation for ‘typer’ and ‘dialyzer’.
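A minimal example of the kind of spec and type annotations those documents describe (‘typer’ can infer such specs, and ‘dialyzer’ checks them; the module and type names here are illustrative):

```erlang
%% Illustrative module with a user-defined type and function specs,
%% of the sort dialyzer checks and typer can infer.
-module(shapes).
-export([area/1]).

-type shape() :: {square, number()} | {circle, number()}.

-spec area(shape()) -> number().
area({square, Side}) -> Side * Side;
area({circle, R})    -> math:pi() * R * R.
```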

Back in the HiPE days there was work done on doing unboxed floating-point arithmetic.
IIRCAIPD[%] This needed the occasional hint in the code (use of is_float/1).
Performance was said to be close to other strict functional languages using strict
static types to get unboxed floating-point arithmetic.
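The is_float/1 hint mentioned above looked like this in practice (illustrative module): with the guard in place, the compiler knows both arguments are floats and can keep the arithmetic inside the body unboxed.

```erlang
%% The is_float/1 guards are the "occasional hint": inside the body,
%% X and Y are known floats, so the multiplications and addition need
%% no boxing/unboxing between them.
-module(fmath).
-export([hypot/2]).

-spec hypot(float(), float()) -> float().
hypot(X, Y) when is_float(X), is_float(Y) ->
    math:sqrt(X * X + Y * Y).
```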

I want to emphasise what I wrote before. START by finding benchmarks where Erlang’s performance
is surprisingly bad and try to figure out why.

It is also important to consider the importance of SCALE.
Here’s an anecdote that I’ve bored people with before.
Way back in the 80s I worked at Quintus, and we were very pleased with the
performance of Quintus Prolog. A rival company boasted that they had much
better performance because they compiled to native code. So we got an
evaluation copy of their system, and ran some benchmarks.
1 page program: their system was 4 times faster than ours.
4 page program: the two systems were the same speed.
40 page program: our system was 4 times faster than theirs.
Why? Cache. Our emulator fitted into instruction cache.
(Even today, the laptop I’m using has only 128kB of L1 cache per core,
so having under 64kB of native code would still pay off.)
Native code was much bulkier than emulated code, so for large programs
we were humming along in L1 and they were mostly running from L2 or RAM.

Do not dismiss the effect of STYLE. Performance problems mainly exist
between keyboard and screen. Another anecdote from the Quintus days.
There were major implementation changes between QP2 and QP3, which meant
that some undocumented behaviour ceased to work. One of our customers
complained that this broke a program of theirs. On getting the code,
I rewrote it so that it ran in both QP2 and QP3. And it was 700
times faster. That’s not a percentage. Of course we lost that customer.
(Joe Armstrong: “nobody likes the bearer of good news.”)
I’ve spent a lot of time looking at the RosettaCode site.
I regularly find that examples in a language I’m interested in can
be rewritten to be several times faster. Presumably this is because
I was seriously trying to write performant code while the author(s)
of the examples were trying to optimise something else. The point is
that “reasonable” code in one implementation of one language can vary
substantially in performance. There are two XML parsers I use.
Both are written in C. The one I wrote is about 13 times faster than
the other. The other one is able to do much more, it’s just that I don’t need it to.

So it is really really important to get evidence that there is a problem
AND that it needs to be addressed in the Erlang implementation rather than
by adopting a different coding style or better libraries or whatever.

As one possible benchmark,
What would it take to reimplement Wings 3D (Wings 3D) in Rust?
How well would the Rust version perform?
Where does the most time go in the Erlang version and
what would it take to fix that?

[%] IIRCAIPD = If I Recall Correctly And I Probably Don’t.

6 Likes

Yeah I get what you mean. For context, before I made this thread, I was talking to some people about using Erlang/Elixir for game dev (both within the BEAM community, and out), and everyone said that those languages are just not the right tool for the job. However, I think after reading everyone’s responses, it seems less that “Erlang and friends can’t do high performance workloads” and more that there isn’t a big ecosystem around Erlang for game dev compared to say C# or Python because of other reasons not relating to performance itself.

2 Likes

Let me introduce you to a 42-year-old software architecture.
“A New Architecture for Large Scientific Simulations”,
P. F. Dubois, 1984, LLNL report 197756.
(A PDF is or was available on-line but the holding site is not up at the moment.)

The architecture consists of a large simulation system (think of an
application like a galaxy formation simulator or a general circulation
model or a computational fluid dynamics model of some sort) with the
basic physics and aspects common to many simulations of this kind coded
in a programming language designed for high numeric calculation efficiency
(such as Fortran, C, C++, probably vectorised, possibly with GPU code,
probably running on multiple processors concurrently),
together with a script configuring a simulation run, selecting inputs,
directing the processing of outputs, coded in a STEERING LANGUAGE
designed to be “lightweight” and easily modified without massive
recompilation and (this is important!) without any risk of tampering
with the tightly interconnected computational kernel, which is
dangerously vulnerable to slips.

This concept was devised, implemented in the LLNL “Basis” system, and
written up before Python, Perl, Lua, or Matlab were twinkles in their
inventors’ eyes. TCL was consciously intended for steering programs
(such as customising an editor – I still have fond memories of Alpha).
Emacs Lisp is a steering language for Emacs. Python is a steering
language for numeric calculations using SciPy and Pandas. I have a
copy of an e-book about programming astronomical simulations in Python.
You would have to be certifiably insane to do this in “native” Python.
Python is S L O W. No, slower than that. But in these simulations,
it’s setting parameters, configuring stuff, and letting the calculations
rip in library code implemented some other way. The metaphor of
steering is meant to call to mind a mere human being at the wheel of a
supertanker: a small weak thing is steering a huge strong thing.
(This Basis system is not to be confused with the 2001 LLNL BASIS
system, which was developed into BioWatch.)

So this says that it’s perfectly reasonable to have a game physics engine
coded in whatever and to steer it (configure a world, model plays, etc)
in Erlang. A web search for “Open source game physics engines” will turn
up a number of 2D and 3D engines it would be silly to rewrite when you
can just use them. (I still prefer using C nodes to NIFs; NIFs compromise
Erlang’s safety, and if the overhead of a local socket for communication
is too high, you need a better design.)
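A hedged sketch of what that steering split can look like from the Erlang side, using a port to an external engine. The executable path and the command protocol here are entirely made up; the point is that Erlang only sends commands and reads results, so a crash in the engine cannot bring down the VM, unlike a NIF.

```erlang
%% Erlang as steering language: drive a hypothetical external engine
%% over a port. A crash in the engine surfaces as an exit_status
%% message rather than taking the VM down with it.
-module(steer).
-export([start_engine/0, step/2]).

start_engine() ->
    open_port({spawn_executable, "/usr/local/bin/physics_engine"},
              [binary, {packet, 4}, exit_status]).

step(Port, Dt) ->
    port_command(Port, term_to_binary({step, Dt})),
    receive
        {Port, {data, Bin}}        -> binary_to_term(Bin);
        {Port, {exit_status, S}}   -> {engine_down, S}
    after 5000 ->
        timeout
    end.
```

A C node would look much the same from this side; the difference is the transport (Erlang distribution protocol instead of a pipe).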

Actually, this is one way that Erlang was used at Ericsson. The AXE-10
exchange was programmed in Eri-Pascal, I think, and had more than a
million lines of code, and had reached the point where any time you
fixed one bug you introduced another. One of the Erlang people had the
bright idea of adding a little ‘external control’ interface and writing
new features in Erlang. This is very much a scripting/steering approach.
[This is IIRCAIPD because it’s old memories of what someone else told me.]

Now for some numbers to illustrate a point I was making earlier.
I wrote a micro-benchmark stressing list processing and integer
arithmetic in 5 different programming languages.

A 56
B 0.97
C 1.00
D 28
E 9.6
F 82
G 40

These are run times normalised to the run time of version C,
which oddly enough was in C. Cruelly hacked C. C where the
allocation of a list cell was done by freelist++ and GC by
freelist = &pool[0]. And of course no tag checking or
dynamic dispatch.

D was the program I started with, written in a beautiful
programming language designed for list processing. The
language has no static type system and is, except for
concurrency, semantically very close to Erlang. The
compiler I used compiles via C, which makes life for the
compiler rather hard.

E was of course the Erlang version. It has no -type or -spec declarations.

F and G are written in a dynamically typed object-oriented language which
really doesn’t want to do lists. It loves arrays and hash tables. F is
an idiomatic version using an existing rich List class. G is a stripped
to the bone version using just what it needed. Like D, this compiles via C.
And that has a really painful effect. Allocating a list cell requires
three (count 'em, three!) function calls in C. The equivalent of
[H|T] = L in Erlang costs three (dynamically dispatched!) function calls
in the generated C. Now there’s nothing absolutely fundamental about this.
If lists were idiomatic in this language, we could compile allocation into
a single C function call and dismantling could be inlined, without
changing anything about the language. List cells would still have to take
3 words instead of 2.

The really interesting thing is A and B. They are written in a strict
functional programming language with immutable data structures. The
most important differences here are that it has polymorphic static typing
and a wide range of fixed size signed and unsigned integers (as well as
bignums). So the integer operations could be compiled to single
instructions. The really interesting thing about A and B is that they are
the SAME program, just compiled by different compilers. One supports
incremental compilation with a REPL. The other does whole-program compilation,
discards everything from the library that’s not used, and optimises the heck
out of what’s left. And despite being safe clean code, it manages to beat C.

So, why did Erlang not do brilliantly in this micro-benchmark?

  • With types, distinguishing [X|Xs] from [] only requires testing for [].
    The compiler used for B can map cons = [ptr|00]–>{head,tail} and
    nil = [0|01] so that testing is a cheap comparison and accessing the
    fields of a list cell requires no untagging.
  • with fixed size integers, integer arithmetic can be done by single
    instructions with at most a trap-on-overflow instruction overhead.
    There is NO tagging or untagging for integers.
  • with static whole-program compilation, function calls can go direct
    via instructions that make no provision for hot loading.
  • as it happens, the compiler used for B does support concurrency,
    but since concurrency isn’t used in this benchmark, there’s no cost
    for supporting it.
  • the A, B, and E compilers use different tagging systems, for good
    reason.

I can imagine a subset of Erlang which could be compiled as Erlang or
could be compiled by (a hacked version of) the B compiler. I’m not sure
how hard sharing an address space and a tagging system would be, but it
is doable in principle.

2 Likes

the BEAM book…

3 Likes