Mandelbrot - Erlang vs Mojo

Here is a Mandelbrot program I wrote - my 1st Erlang program.
I used the rebar3 escript template.

I’m interested in benchmarking Erlang against Mojo and/or other languages.

On my MacBook Air M1 it takes ~6.9s (single threaded) to calculate

  • a 60x30 Mandelbrot rendered as ASCII to stdout
  • a 1000x750 Mandelbrot rendered as a grayscale PNG file.

Is this good Erlang code - can it be optimised to run faster?

When Modular recently released Mojo, they did a vectorised Mandelbrot implementation with SIMD instructions. I don’t think we can take advantage of that in Erlang - we would have to ‘cheat’ and do it via C (NIFs)?

SIMD schMIMD. If you want top speed, send the work to the GPU.

The only thing I’ve looked at so far is complex.erl.
The point of a module with an interface that hides the internal representation is that on the INSIDE of the module the representation is NOT hidden.

There is nothing to be gained, and much to be lost, by writing

divide(Z1, Z2) ->
    new((real(Z1) * real(Z2) + imaginary(Z1) * imaginary(Z2))
        / (real(Z2) * real(Z2) + imaginary(Z2) * imaginary(Z2)),
        (imaginary(Z1) * real(Z2) - real(Z1) * imaginary(Z2))
        / (real(Z2) * real(Z2) + imaginary(Z2) * imaginary(Z2))).

when you could have written

divide({R1,I1}, {R2,I2}) ->
    D = R2*R2 + I2*I2,
    {(R1*R2 + I1*I2)/D, (R2*I1 - R1*I2)/D}.

Which one is easier to read?
Which one is easier to check?
Which one is easier to see the numerical problems in?

In fact complex division is a notoriously tricky problem to get right. If you are working with exact rational numbers, no worries. If you are working with floating-point numbers, you have to worry about overflow, underflow, and cancellation.

In my Smalltalk library, complex division uses (a minor variant of) the Smith algorithm, taking about two dozen lines of terse code (single-letter variables). It also remarks that the code should be replaced by the more robust Baudin+Smith algorithm.
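For reference, here is a minimal Erlang sketch of Smith’s algorithm on the bare tuple representation (my own transcription, not lifted from that library): scale by the larger component of the divisor so the intermediate squares cannot overflow prematurely.

%% Smith's algorithm, sketched: pick the branch where the divisor's
%% dominant component serves as the scaling factor.
divide({R1,I1}, {R2,I2}) when abs(R2) >= abs(I2) ->
    T = I2 / R2,
    D = R2 + I2 * T,
    {(R1 + I1 * T) / D, (I1 - R1 * T) / D};
divide({R1,I1}, {R2,I2}) ->
    T = R2 / I2,
    D = R2 * T + I2,
    {(R1 * T + I1) / D, (I1 * T - R1) / D}.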

There are also problems with complex multiplication.
Technical Report OUCS-2017-07,
“Complex Arithmetic is Complex”,
Otago University Computer Science,
goes into some but not all of the details. The multiplication algorithm given there works for IEEE arithmetic but, infuriatingly, not in Erlang (Erlang has no IEEE infinities or NaNs; an operation that would produce one raises badarith instead).

But the thing that really vexed me was

equal(Z1, Z2) ->
    erlang:abs(real(Z1) - real(Z2)) < 0.005
    andalso erlang:abs(imaginary(Z1) - imaginary(Z2)) < 0.005.

This is not equality. It isn’t remotely equality. So why call it equal/2? The magic number 0.005 has no place in a module called ‘complex’ that is not documented as tied to a specific application. What you want here is

l_infinity_distance({R1,I1}, {R2,I2}) ->
    max(abs(R1-R2), abs(I1-I2)).

then in the caller, instead of

equal(Z1, Z2)

use

l_infinity_distance(Z1, Z2) < 0.005

because it is only in the caller that we can tell what value to use for the bound.
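A hypothetical sketch of that split (the names and the macro are mine): l_infinity_distance/2 stays generic, and the bound lives with the application code that knows what it is for.

%% Hypothetical caller: the tolerance is an application decision,
%% not something hard-wired into the 'complex' module.
-define(TOLERANCE, 0.005).

converged(Z1, Z2) ->
    l_infinity_distance(Z1, Z2) < ?TOLERANCE.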

Now let’s stop and think about something else.
Memory.
Suppose we are dealing with floating-point numbers.
32-bit machine: 3 words (FLONUM tag, 2 data words)
64-bit machine: 2 words
A 2-tuple costs 3 words either way (going by the BEAM book).
So {1.2,3.4} costs 9 words on a 32-bit machine, 7 on a 64-bit one.
Compare this with 4 words (32-bit) or 2 words (64-bit) using C or Ada.
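If you want to check those numbers on your own machine, erts_debug:flat_size/1 reports the heap size of a term in words; on a 64-bit BEAM it should agree with the figures above (7 for the tuple, 2 for a bare float):

1> erts_debug:flat_size({1.2, 3.4}).
7
2> erts_debug:flat_size(1.2).
2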

Sticking with floats for now, the only way to reduce the space overhead (and thus the time overhead) is to eliminate the tuples.

-define(ADD(Rr,Ir, R1,I1, R2,I2), begin Rr = R1+R2, Ir = I1+I2 end).
-define(SQUARE(Rr,Ir, R,I), begin Rr = R*R - I*I, Ir = 2*R*I end).
-define(NORMSQ(R,I), (R*R + I*I)).

and instead of

escape(_, _, Limit, It) when It >= Limit ->
    It;
escape(Z, C, Limit, It) ->
    case complex:norm_sqr(Z) < 2.0 of
        false ->
            It;
        true ->
            escape(complex:add(complex:mul(Z, Z), C),
                   C,
                   Limit,
                   It + 1)
    end.

write

escape(R, I, Rc, Ic, Limit, It) ->
    if It >= Limit -> It
     ; ?NORMSQ(R, I) >= 2.0 -> It
     ; true -> ?SQUARE(Rs,Is, R,I),
               ?ADD(Rx,Ix, Rs,Is, Rc,Ic),
               escape(Rx, Ix, Rc, Ic, Limit, It+1)
    end.
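(A small point on those macros: the comma sequences need the begin ... end wrappers shown above to expand to legal Erlang, the bindings Rs, Is and Rx, Ix then escape into the surrounding clause, and ?NORMSQ expands to a plain guard expression.)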

Oh look, you don’t actually NEED complex division or general complex multiplication. Good.

The next place I’d look to be saving memory is that you are calculating the whole picture and then rendering it. That means building up a giant data structure in memory that you have no actual use for. Switching to calculating-as-you-render is easy because the pixel calculations are independent. (This is why the task shines using SIMD and GPU support. It’s trivially parallelisable.)
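A hypothetical sketch of that shape, where pixel/2 and the already-opened Device stand in for whatever the real renderer uses:

%% Stream each row out as soon as it is computed, so at most one
%% row of pixels is alive on the heap at any time.
render(Device, Width, Height) ->
    lists:foreach(
      fun(Y) ->
              Row = << <<(pixel(X, Y))>> || X <- lists:seq(0, Width - 1) >>,
              ok = file:write(Device, Row)
      end,
      lists:seq(0, Height - 1)).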

Finally, I’d ask whether you really need to use floats.

1 Like

Thanks - good point about using macros - I guess I was hoping the compiler would look through the complex tuples and optimise them away, rather than relying on “C-style” macros.

Worth it here though - that little change made the whole thing 3x faster.

Have not read the BEAM book, but will take a look - nice that it is free. Reminds me of some of the Rust documentation.

You seem a little old school if you don’t mind me saying - a stalwart from the old mailing list. Did you know that if you put 3 backticks (`) before and after your code blocks they get formatted like code rather than free-flowing text? Like so:

  .                                                       . 
    M                                                       
             .                                              
         .                                                  
                         .                                 .
            .           M                                  a
             _         .                            .. ...MM
          .  _M2..                                     .MMMM
         _ .aMMMM _ . ..a               .  .          aaMMMM
             MMMaMMMMM.Ma   .             _.        .  .MMMM
             WMMMMMMMMMM_M .           ._ ..           .MMMM
     .     _2MMMMMMMMMMMMMM.              _M2M  ._. M   .MMM
           MMMMMMMMMMMMMMM2         ..  aMMMMM__MMMMMMMMaMMM
       a aMMMMMMMMMMMMMMMMM_       _M. M _MMMMMMMMMMMMMMMMMM
   .     _MMMMMMMMMMMMMMMMMM    2.a.MMMMMMMMMMMMMMMMMMMMMMMM
          MMMMMMMMMMMMMMMMM.  2_ MMMMMMMMMMMMMMMMMMMMMMMMMMM
          .2MMMMMMMMMMMMMMM  MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
          _MMMMMMMMMMMMMMM.MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
           ._MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
           . ..MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
                  .MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
                .MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
            ._MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

(1) One of the core features of Erlang, since day 0, is “hot loading”.
That is, the ability, in a safe and controlled way, to replace a module
without shutting the application down. This comes from the telecoms
world. I’ve seen a demonstration of boards being changed in a running processor, literally pulled out of the cage, and the program keeps running. It’s the same idea but in software. This is fantastic for keeping systems up and continuing to provide vital services, BUT it comes at a price.

The price is that the compiler can’t do cross-module inlining. C compilers can do that (some of them); linkers can do that (some of them); GHC does that for Haskell. But Erlang cannot do it, because once you’ve inlined code from another module, it’s very very hard to replace that module.

Not impossible. There has been work over the years on dynamically undoing compiler optimisations. (I think it goes back to SPITBOL, but I might be wrong.) Some of that has been rediscovered for Javascript – compile under the assumption that you know everything you need to, then undo certain optimisations when you find you’re wrong. There is another approach, which I proposed for Erlang many years ago, and shamelessly swiped from InterLISP. That’s “block compilation”, which InterLISP applied to functions and I proposed applying to modules. A block of modules compiled together would have been able to exploit cross-module inlining (and of course data flow analysis regardless of inlining) BUT would have to be reloaded as a group.

In the meantime, people have to fall back on hacks like including files of macros. And THAT has a massive problem. You get the down side of block compilation (a dependency between two source files) without the up side (the system hasn’t the least idea what’s going on). Suppose header H describes a critical data structure and provides macros and/or inlined local functions. Suppose modules A and B use it. Everything is up to date. Compile A and B. They do not know about each other, but they are coupled to each other’s internals via H. Load A and B. Days later, find a bug. Change H. Recompile/reload A. Now we have a version skew problem. B needed to be recompiled/reloaded because of the change to A. BUT THE SYSTEM DOES NOT KNOW THAT.

Oh, makefiles can know that A.beam depends on A.erl and H.hrl, and that B.beam depends on B.erl and H.hrl, and Erlang’s make: can figure this out. That’s fine. What nothing in the system knows is that A and B have to be reloaded together. If you load A then B, or B then A, you may run into trouble.

(2) I’m accessing this misbegotten whatever-it-is through e-mail. When I send a message like this, I get two replies. One is what I sent, and the other is insanely mangled. I ASKED FOR HELP about what to do. As it happens, I have tried the three backticks thing, and it did not work at all.

I have spent several years avoiding Markdown on the grounds that

  • if I wanted weird complex markup, I knew where to find TeX
  • if I wanted to struggle with incompatible versions of the same
    document production tool, I knew where to find Microsoft Word.

I came to suspect that this replacement for the Erlang mailing list (which I sorely miss) used some flavour of Markdown. But I don’t know which. Markdown may be the most perfect illustration of https://xkcd.com/927/ around. I had hopes when CommonMark came out that it might resolve some of the compatibility headaches, but no, just one more competing standard. One of my colleagues has a rule: he refuses to use anything whose manual is more than 1.2cm thick. CommonMark violates that rule. (Sorry, I am beginning to go on a Markdown rant.)

Can someone please, pretty please with knobs on, direct me to the document (a single file) that defines the exact markup used by this communication channel? And exactly what you have to do in the e-mail interface to make it work properly?

2 Likes

According to this thread, this forum is built using the Discourse software, which uses CommonMark. I’m not aware of the exact version, but the last version of CommonMark was released more than 2 years ago, so it should be v0.30.0.

Makes sense that hot loading prevents inlining - I hadn’t seen that consequence described before; good to know.

Regarding Markdown

There is a cheat sheet here


Symbols dance, tease me,
Markdown, a fickle partner,
Format errors weep.

I think most of the basics are there - not sure if they get stripped off in email…

CommonMark 0.30, then. Sigh. Another 125 pages to read…

Many thanks.

1 Like

Ah, no, I am actually familiar with the basics.
It was an arcane detail I fell foul of, having to do with whether a blank line was required or not. At least, that’s what I think happened.

Here are the basics:

Multiline Code Block

Three backticks on a separate line above/below the block of code.

```

(paste code block here)

```

Inline Code

`code`

Bold

**bold text**

Italic

*italicized text*

Headings

# H1
## H2
### H3

Blockquote

> blockquote

Ordered List

1. First item
2. Second item
3. Third item

Unordered List

- First item
- Second item
- Third item

Horizontal Rule

---

Links

Or just paste a URL

[title](https://www.example.com)

Please also disable hard-wrapping when posting to the forum (hard-wrapping makes it difficult to read your posts on mobile devices or for those who need to use non-standard font sizes).

1 Like

I added command line parsing with argparse (OTP). I was expecting it to respond to --help automatically - it does, but with an error message:

error: mandelbrot_erl: unknown argument: --help
Usage:
  mandelbrot_erl [-p] [-h <height>] [-w <width>]

Optional arguments:
  -h height (int, 750)
  -w width (int, 1000)
  -p parallel (false)

The GitHub version of argparse has an example where this works - but I guess the two may be out of sync?

Also did some benchmarking. A 5000x5000 pixel Mandelbrot takes ~61 seconds on my laptop (Mac M1, 8 cores). Not bad - but a lot slower than a similar program in Rust (also sequential, no SIMD). Jim Blandy’s mandelbrot takes ~8 seconds to do the same image.

Also, I added an option to spawn multiple processes. One process is spawned per row in the image. This is mainly because I was interested to see the overhead of having more processes than CPU cores. No surprise that the best speedup (5x) comes when the number of processes does not significantly exceed the number of cores.

Width     Height   Parallel   Time (sec)
5000      5000     false      61
5000      5000     true       27
50000     500      false      61
50000     500      true       17
500000    50       false      72
500000    50       true       19
5000000   5        false      135
5000000   5        true       27

Well, Erlang just isn’t that good (fast) at math. “Erlang was not built for math”, I read somewhere. “C is about 10 times faster for math”, I read elsewhere. I think that if you want to get faster on that front, you will probably have to resort to implementing NIFs for the math.

1 Like

That would be in line with Rust above - it’s a lot.

I don’t think NIFs and C-nodes are attractive in general - it’s the old two-language problem, the same way Python handles it.

Of course it’s good to have the option - and I guess that’s how they made the Elixir Nx/Axon libraries fast.

1 Like

No idea about Elixir/Nx. In general, my work doesn’t require any serious number crunching, and as such I rarely have to resort to NIFs. But AFAIK, with my limited, partially hearsay knowledge, using NIFs has downsides that should be considered. NIFs are non-interruptible; they have to “run through” from front to rear end. And if you have all of your (8?) cores using pretty much each and every available CPU cycle doing math, there isn’t much left for Erlang to do Erlang-ish stuff on the side. NIFs also have the potential to crash the entire Erlang node they run on/in, so they are a risky business and should only be used in small code fragments, i.e. if you can pretty much guarantee that nothing can go seriously wrong.

I couldn’t resist the temptation to optimise your code.

I used OTP 26.1. Before any optimisations, producing a 5000 x 5000 pixel image using the parallel option needed slightly less than 23 seconds on my computer (the original M1 MacBook Pro with the touch bar).

% /usr/bin/time _build/default/bin/mandelbrot_erl -w 5000 -h 5000 -p
   .
   .
   .
"Saving to mandelbrot.png"
       22.45 real       117.57 user         3.31 sys

My initial optimisation was to create a binary for each row instead of a list:

pcalc_pixels({LLx, LLy}, {URx, URy}, {WIDTH, HEIGHT}) ->
    R = [LLx + X * (URx - LLx) / WIDTH || X <- lists:seq(0, WIDTH - 1)],
    C = [LLy + Y * (URy - LLy) / HEIGHT || Y <- lists:seq(HEIGHT - 1, 0, -1)],
    ROWS = [[{X, Y} || X <- R] || Y <- C],
    pmap(fun(Row) ->
                 << <<(255 - escape({0.0, 0.0}, C, 255, 0))>> ||
                     C <- Row>>
         end, ROWS).

(That change also requires minor changes in the rendering code.)

That reduced the runtime to slightly less than 13 seconds.

My second optimisation was the floating point calculations in the escape/4 function. BEAM actually does some optimisation of floating point operations by keeping known floats in special registers during a calculation instead of storing them on the process heap. The trick for leveraging those optimisations is to avoid mixing floating point operations with other operations, because that would force the floats out of the floating point registers onto the process heap.

Here is a diff for a change that shaved off about a second on my computer:

 escape(_, _, Limit, It) when It >= Limit ->
     It;
 escape({Zr, Zi}, {Cr, Ci}, Limit, It) ->
+    Zn = {Zr * Zr - Zi * Zi + Cr, 2 * Zi * Zr + Ci}, % Z*Z + C
     case Zr * Zr + Zi * Zi < 2.0 of
         false ->
             It;
         true ->
-            Zn = {Zr * Zr - Zi * Zi + Cr, 2 * Zi * Zr + Ci}, % Z*Z + C
             escape(Zn, {Cr, Ci}, Limit, It + 1)
     end.

More optimisations are possible. I eliminated some of the intermediary lists and tuples, which shaved off an additional three seconds. It should be possible to eliminate all of the intermediary lists. As your own benchmarks showed, spawning too many processes is not efficient. You could modify pmap/2 to find out the number of schedulers using erlang:system_info(schedulers) and only spawn that number of processes at once.
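One way to do that (a sketch; pmap/2 is a guess at the shape of the original, and chunk/2 is a helper I made up):

%% Split the rows into one chunk per scheduler instead of spawning
%% one process per row, so the cores are never oversubscribed.
pmap(F, List) ->
    N = erlang:system_info(schedulers),
    Parent = self(),
    Chunks = chunk(List, max(1, length(List) div N)),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, lists:map(F, Chunk)} end),
                Ref
            end || Chunk <- Chunks],
    %% Selective receive on each Ref keeps the results in row order.
    lists:append([receive {Ref, Rows} -> Rows end || Ref <- Refs]).

chunk([], _) -> [];
chunk(List, Size) when length(List) =< Size -> [List];
chunk(List, Size) ->
    {Chunk, Rest} = lists:split(Size, List),
    [Chunk | chunk(Rest, Size)].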

2 Likes

Thanks Björn - good insight from that.

erlang:system_info(schedulers)

[35,115,99,104,101,100,117,108,101,114,115,58|8]

Seems I have 12 schedulers & 8 cores? Numbers 101 & 115 are duplicated…

I can see why storing the pixel values in a binary would make it faster - in my benchmark above, doing a 5,000,000 x 5 Mandelbrot (the same number of pixels) took 2x longer than a 5000x5000 - the list processing should mostly account for that.

It seems that you wrote the following expression:

1> "#schedulers:" ++ erlang:system_info(schedulers).
[35,115,99,104,101,100,117,108,101,114,115,58|8]

erlang:system_info(schedulers) returns the number of schedulers, not information about each scheduler. The number of schedulers is the 8 at the tail of the list (after |). You probably want to write like this:

2> "#schedulers:" ++ integer_to_list(erlang:system_info(schedulers)).
"#schedulers:8"

Right :upside_down_face:

1 Like

I have compared Erlang, Python and Rust on this task - calculate a 5000x5000 Mandelbrot sequentially and using a pool of workers that takes advantage of multiple CPU cores - no other optimisations.

Erlang is about 7x faster than Python on this task - better than I expected. Rust is 25x faster than Python - and achieves a slightly better speedup than the other two in parallel mode (3.3, 3.5 and 4.5 respectively).
Erlang is about 7x faster than python on this task - better than I expected. Rust is 25x faster than python - and achieves slightly better speedup than the other two in parallel mode (3.3, 3.5 & 4.5 respectively).