While specifying the hardware design for GRiSP-nano, another evaluation board that will explore how low-power we can run a full-blown Erlang VM directly on hardware (it will be powered by temperature-gradient energy harvesting, providing only milliwatts), we encountered an issue with the CPU:
While it has a floating-point acceleration unit, that unit only supports single-precision floats (which are sufficient for many applications).
Erlang uses double precision by default; single precision only shows up when encoding/decoding binaries. It would be cool if, for this hardware, floats could default to single precision: using doubles means falling back to slow software float emulation, while single precision would be hardware accelerated. Soft floats are also not the most energy-efficient option, so we probably need to avoid floats altogether.
Floats in Erlang are specified to be double precision, so if you change that you will create a different language. I don’t know if anything in stdlib/kernel relies on this fact, but you are bound to run into issues in various places in erts.
I think that the best approach would probably be to avoid floats altogether and use the double-float emulation where needed.
My guess is he's doing something that could be AI/neural-network related. A lot of AI systems do quite fine even with something as low as 4-bit floats, so the bandwidth of double floats is just an outrageous waste. I suspect this kind of request may become more frequent, as developers must be able to use their accelerated hardware to be competitive. Naturally such a BEAM would not currently be compatible with a standard one when the two talk to each other (for now). I think the introduction of a shortfloat or something like it might be worth considering if Erlang wants to be viable in this space. As an old Forth writer, I abhor all floats (we did everything in fixed point), but, alas, that's where the silicon creators focused their energies. I see some companies now creating hardware optimized for AI apps that uses more appropriate types, but I don't know how long it will be until/if that becomes standard. Until then, small floats are the rule in this space.
No neural network; these are very small IoT nodes which will run on a power budget of 5-10 mW. If I actually need to do some computation for signal processing, I can always make arrays of single floats in binaries and use a linear algebra lib (BTW, someone recently did a complete port of BLAS and LAPACKE to Erlang, funded by the EEF); that's what you would want to use there.
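For illustration, here is a minimal sketch of that representation (the function names are mine, not from the ported library): a flat binary of 32-bit floats is the usual memory layout a BLAS-style backend operates on.

```erlang
%% Pack a list of numbers into a flat binary of 32-bit floats.
pack32(Xs) ->
    << <<X:32/float>> || X <- Xs >>.

%% Read it back into regular (double-precision) Erlang floats.
unpack32(Bin) ->
    [X || <<X:32/float>> <= Bin].

%% Naive pure-Erlang dot product over two such binaries.
dot32(A, B) ->
    lists:sum([X * Y || {X, Y} <- lists:zip(unpack32(A), unpack32(B))]).
```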
As for encoding/decoding floats in binaries, Erlang has always supported 32-bit floats in the bit syntax, and 16-bit floats were recently added too.
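For example (a shell sketch; 16-bit float segments need a reasonably recent OTP release, OTP 24 or later I believe):

```erlang
%% The same value encoded at three widths.
Pi  = math:pi(),
B64 = <<Pi:64/float>>,   % 8 bytes (64 is the default size)
B32 = <<Pi:32/float>>,   % 4 bytes
B16 = <<Pi:16/float>>,   % 2 bytes
%% Decoding always gives back an ordinary double-precision Erlang
%% float, but the narrower encodings have already rounded the value:
<<Pi32:32/float>> = B32,
<<Pi16:16/float>> = B16,
{Pi32 =:= Pi, Pi16 =:= Pi}.   % => {false, false}
```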
This question was not about such use cases; Erlang and Elixir are actually quite well placed in the NN and linear algebra space. With the recent ports we are in the same place as Python + NumPy, except that we can do concurrency much better, have failure tolerance, etc.
That's what I thought would be the case, but one never knows whether such a weird thing as a single-precision Erlang was built at some time in the past.
Avoiding floats and relying on the software emulation is exactly what I planned. And I can always build some higher-performance single-float NIFs that work on small vectors if I ever need that.
Ah, I see. Thanks for the backgrounder - it was usefully informative (as always)! So your issue is that while Erlang supports short & 16-bit floats for encoding/decoding, internally it always widens to full doubles? Isn't the ultimate solution still to introduce an explicit short-float type?
The issue as I understand it is:

(1) On most machines these days, single precision has no speed or power advantage over double precision. I don't know whether the current VM supports unboxed doubles, but there's the example of VisualWorks Smalltalk (thanks to Andres Valloud) to show that it can be done well, so on a 64-bit machine doubles need not in practice suffer any space penalty compared with singles. So there is no utility for singles on x86-64, SPARC, or ARMv8.

(2) On chips based on the ARM Cortex-M0, with no hardware support for any kind of FP (the RP2040 doesn't even have integer division), single-precision floats are emulated 2-4 times faster than double precision, but given the rest of what a program has to do, this probably isn't worth bothering about. Even 32-bit floats need boxing, so the space saving is not as great as you might expect. So there is no advantage to supporting singles as well as doubles, and only a small advantage to singles instead of doubles. (Why did C originally do all FP arithmetic in double precision? To keep the run-time library small.)

(3) On ESP32 processors, or others with hardware support for single-precision but not double-precision floats, emulated double-precision arithmetic is 20-120 times slower than single precision, with a corresponding power increase. For some applications, such as audio or neural nets, this may be enough to matter, and the otherwise deeply unsatisfactory numeric performance may be adequate.

(4) An Erlang VM designed for ultra-low-power IoT applications probably wouldn't look like the BEAM. In particular, whole-program compilation without hot loading might be a good idea. (For the applications I'd like to use Erlang for on the RPi Pico, the last thing I want is for programs in the field to be alterable without serious authentication and a switchover from complete version X to complete version Y.)
The simplest solution seems to be to just lie: take the BEAM operations that talk about doubles and make them work on singles, and develop different tests for "real double" and "pretend double". You're not going to want every part of the OTP (and other) libraries anyway.
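As a hedged illustration of what such a test might look like (my own sketch, not an existing OTP function): a VM that secretly computes in single precision has a 24-bit significand, so a small increment that a real double preserves gets rounded away.

```erlang
%% Returns true on a "real double" VM, false if arithmetic is
%% secretly single precision. Built from variables so the sum is
%% computed at run time rather than constant-folded by the compiler.
is_real_double() ->
    One = float(1),
    Eps = One / float(1 bsl 30),   % 2^-30: kept by doubles (53-bit
                                   % significand), lost by singles (24-bit)
    One + Eps =/= One.
```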
It would be good to see some careful analysis of what's going on there. (Native) instruction-level profiling would be nice.

Suppose the amount of overhead per flop is N instructions, and that the cost of emulating a flop is M instructions. Then we're comparing N+1 with N+M, not 1 with M. For example, I just benchmarked floating-point addition in Erlang, and the number came out at about 4.2 nanoseconds, on a machine where a similar C benchmark made it 1.0 nanoseconds. In this benchmark practically nothing BUT floating-point addition was happening (on the surface). I'm actually impressed that Erlang did so well.
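For reference, a minimal sketch of that kind of micro-benchmark (my own reconstruction, not the code actually used):

```erlang
%% Time N float additions in a tight loop; timer:tc/1 returns
%% microseconds, so multiply by 1000 for nanoseconds per iteration.
%% The figure still includes the loop's own overhead.
bench(N) ->
    {Us, _} = timer:tc(fun() -> loop(N, 0.0) end),
    Us * 1000 / N.

loop(0, Acc) -> Acc;
loop(N, Acc) -> loop(N - 1, Acc + 1.0).
```

Plugging in purely illustrative numbers: the 4.2 ns measurement suggests roughly N ≈ 3.2 ns of per-flop overhead on that machine, so if emulation cost, say, M = 20 ns per flop, the observed slowdown would be (3.2 + 20) / 4.2 ≈ 5.5x rather than 20x.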
Is this running on bare metal or under FreeRTOS? My information is probably out of date, but I thought that on ESP32 FreeRTOS hopes you won't use the FPU (so it does not have to save and restore FPU state when context switching), so the first FPU use in a FreeRTOS task traps out to a handler that initialises the FPU and (on ESP32) pins the task to one of the two cores. I don't know what effect this pinning would have on the BEAM emulator.
You might consider using the binary type and writing your own NIF that converts Erlang floats to a different precision.
You can see an example of that in our open-source project: link to the encode/decode functions' NIF implementation
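For anyone unfamiliar with the pattern, the Erlang side of such a NIF usually looks something like this (the module name, function name, and library path here are hypothetical, and the C implementation compiled against erl_nif.h is not shown):

```erlang
-module(f32_nif).
-export([double_to_single/1]).
-on_load(init/0).

init() ->
    %% Loads the compiled C library; the path is a placeholder.
    erlang:load_nif("./f32_nif", 0).

%% Stub that the loaded NIF replaces; called only if loading failed.
double_to_single(_F) ->
    erlang:nif_error(nif_not_loaded).
```

For one-off conversions the pure bit-syntax round trip `<<F32:32/float>> = <<F:32/float>>` does the same job without any C, so a NIF mainly pays off when converting whole buffers at once.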
I’ve been experimenting with various game development things in Elixir (utilizing the wx and gl modules from OTP). It’s typical to use single precision floats for rendering since the extra precision is usually not necessary. For sending data to the GPU, it’s not a problem since I can just specify the size with <<data::float-native-size(32)>>, but it feels slightly bad knowing that I’m doing all of my math with twice the memory that I actually need.
I don’t know if the ERTS having special behavior for single precision floats would actually improve much, especially on 64-bit architectures, but it’s something I’ve been curious about.
I’m not an expert in anything, and this is not a typical case for Erlang, but some of the recent improvements to in-place tuple and record updates seem like they could, potentially, utilize SIMD operations.
I am only vaguely aware of how some of this stuff works, and I’m just throwing various problems at the ERTS. All of them can be solved with NIFs, but it’s been interesting to consider them from a runtime perspective.