Ring benchmark: Erlang vs Go

ijt · November 28, 2021, 5:42am

Hi all, I’ve been reading Joe Armstrong’s book on Erlang: Software For a Concurrent World. In Chapter 12, the third exercise is to write a benchmark that creates N processes in a ring and sends a message around the ring M times so that a total of N*M messages get sent. I did this for Erlang and Go and compared the result. For M = 100 and N = 100,000, the Go version took an average of 9.4s and the Erlang version took an average of 8.5s on my MacBook. Here’s the Erlang version:

-module(ring_benchmark).

-export([run/2, proc/1]).

run(M, N) ->
  % Spawn N processes and connect them as a ring.
  Launcher = self(),
  Pids = [spawn(fun() -> proc(Launcher) end) || _ <- lists:seq(1, N)],
  [HPid | TPids] = Pids,
  RPids = TPids ++ [HPid],
  [Pid ! {next, Next} || {Pid, Next} <- lists:zip(Pids, RPids)],

  % Time sending messages around the ring M times.
  T1 = now(),
  HPid ! M * N,
  receive
    done -> ok
  end,
  T2 = now(),
  Dt = timer:now_diff(T2, T1),

  % Kill all the processes.
  [Pid ! quit || Pid <- Pids],

  Dt.

proc(Launcher) ->
  receive
    quit -> void;
    {next, Pid} ->
      put(next, Pid),
      proc(Launcher);
    0 ->
      Launcher ! done,
      proc(Launcher);
    J ->
      get(next) ! J - 1,
      proc(Launcher)
  end.

Here’s the Go version:

package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	m := flag.Int("m", 10, "number of loops")
	n := flag.Int("n", 10, "number of goroutines")
	flag.Parse()

	chs := make([]chan int, *n)
	for i := range chs {
		chs[i] = make(chan int)
	}
	done := make(chan struct{})
	for i := 0; i < *n; i++ {
		go f(chs[i], chs[(i+1) % *n], done)
	}

	t0 := time.Now()
	chs[0] <- *m * *n
	<-done
	t1 := time.Now()
	fmt.Printf("%v\n", t1.Sub(t0))
}

func f(in, out chan int, done chan struct{}) {
	for {
		select {
			case k := <- in:
				if k <= 0 {
					close(done)
					return
				}
				out <- k-1
			case <-done:
				return
		}
	}
}

vkatsuba · November 28, 2021, 10:05am

Hi @ijt and welcome to the forum. Could you please remove in Go version printing of results and just put - return of value as it done in Erlang and try run Go version again?

Neiby · November 28, 2021, 4:19pm

I might be misunderstanding what you’re asking, @vkatsuba, but in the Go version, that is the main function, so I think there is nothing to return to. It’s been a long time since I’ve used Go, though.

Neiby · November 28, 2021, 4:25pm

This is really cool! I ran both on my machine to compare results. Now I’m going to look closer at your Erlang code to see exactly what you were doing. I’ll probably look at the Go code, too, since I spent a few weeks with Go a few years ago. It would be nice to get a little refresher.

ijt · November 28, 2021, 9:40pm

@vkatsuba, thanks for the warm welcome.

If you have another look, you may notice that the timing in the Go code does not include the printing out of the time. It seems logically impossible that it could.

vkatsuba · November 28, 2021, 9:50pm

Oh, now I see. Thanks for explanation.

OvermindDL1 · November 29, 2021, 6:01pm

Oooo, this looks fun, let’s try!

Here’s the erlang results here:

❯ erlc ring_benchmark.erl
ring_benchmark.erl:14:8: Warning: erlang:now/0 is deprecated; see the "Time and Time Correction in Erlang" chapter of the ERTS User's Guide for more information
%   14|   T1 = now(),
%     |        ^

ring_benchmark.erl:19:8: Warning: erlang:now/0 is deprecated; see the "Time and Time Correction in Erlang" chapter of the ERTS User's Guide for more information
%   19|   T2 = now(),
%     |        ^

❯ erl
Erlang/OTP 24 [erts-12.1.5] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [jit]

Eshell V12.1.5  (abort with ^G)
1> ring_benchmark:run(100, 100000).
6180679

And here’s the golang results here:

❯ go build ring-go.go

❯ ./ring-go -m=100 -n=100000
6.907550383s

So as expected, erlang is faster for this (go’s channels and greenthreads have a lot of overhead, hence why the ‘fastest’ go programs don’t use them).

(Note: My rust times here are wrong because I left a bug in because there wasn’t actually any data for me to consume (it was a dataless operation… ^.^;) so it all got optimized-skipped out, see later post for accurate times.)

Hmm, let’s try this in rust, a very generic channel processor using native threads:

use clap::Parser;
use crossbeam::{channel, scope};
use std::time::Instant;

#[derive(Parser)]
struct CliArgs {
	#[clap(short, long, default_value = "10")]
	m: usize,
	#[clap(short, long, default_value = "10")]
	n: usize,
}


fn main() {
	let CliArgs { m, n } = CliArgs::parse();
	let chs: Vec<_> = (0..n).map(|_| channel::bounded(0)).collect();
	let done = channel::bounded(n);
	scope(|s| {
		for i in 0..n {
			let channel = &chs[i].1;
			let next = &chs[(i+1) % n].0;
			let done = &done.1;
			s.spawn(move |_| {
				loop {
					channel::select! {
						recv(channel) -> k => {
							let k = k.unwrap();
							if k > 0 {
								next.send(k - 1).unwrap();
							} else {
								return;
							}
						}
						recv(done) -> _ => return,
					};
				}
			});
		}
		let start = Instant::now();
		chs[0].0.send(m * n).unwrap();
		(0..n).for_each(|_| done.0.send(()).unwrap());
		let end = Instant::now();
		println!("{:?}", end - start);
	}).unwrap();
}

Now obviously allocating 100000 native threads is extreme, so swapping the m and n to make 100 threads with 100000 processing:

❯ cargo run --release --bin ring_rs_basic -- -m100000 -n100
   Compiling ring v0.1.0 (/home/overminddl1/tmp/ring)
    Finished release [optimized] target(s) in 12.23s
     Running `target/release/ring_rs_basic -m100000 -n100`
9.81679ms

Though let’s try it with n=16 so one thread per core at least:

❯ cargo run --release --bin ring_rs_basic -- -m100000 -n16ring_rs_greenthreads -- -m100 -n100000
    Finished release [optimized] target(s) in 0.03s
     Running `target/release/ring_rs_basic -m100000 -n16`
55.886µs

Too fast! Bumping m to 1 million:

❯ cargo run --release --bin ring_rs_basic -- -m1000000 -n16
    Finished release [optimized] target(s) in 0.03s
     Running `target/release/ring_rs_basic -m1000000 -n16`
962.916µs

Still fast but whatever, almost 1 millisecond to send 16 million messages around 16 threads.

Now let’s try it with Rust’s greenthreads/async as this would be most close to what erlang and go are doing, I’m expecting slower than the native threads of course as these will obviously run a lot more code instead of simple hardware synchronization, the code:

use clap::Parser;
use std::time::Instant;
use tokio::{select, sync::{mpsc, watch}, task};

#[derive(Parser)]
struct CliArgs {
	#[clap(short, long, default_value = "10")]
	m: usize,
	#[clap(short, long, default_value = "10")]
	n: usize,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
	let CliArgs { m, n } = CliArgs::parse();
	let mut chs: Vec<_> = (0..n).map(|_| {
		let (tx, rx) = mpsc::channel::<usize>(1);
		(tx, Some(rx))
	}).collect();
	let (done_tx, done) = watch::channel(false);
	for i in 0..n {
		let mut channel = chs[i].1.take().unwrap();
		let next = chs[(i+1) % n].0.clone();
		let mut done = done.clone();
		task::spawn(async move {
			loop {
				select! {
					k = channel.recv() => {
						if let Some(k) = k {
							if k > 0 {
								if let Err(e) = next.send(k - 1).await {
									// 'next''task has closed
									eprintln!("Task closed early: {:?}", e);
									return;
								}
							} else {
								return;
							}
						} else {
							// channel closed
							return;
						}
					}
					_ = done.changed() => if *done.borrow() { return; },
				};
			}
		});
	}
	let start = Instant::now();
	chs[0].0.send(m * n).await?;
	done_tx.send(true)?;
	let end = Instant::now();
	println!("{:?}", end - start);

	Ok(())
}

And running it:

❯ cargo run --release --bin ring_rs_greenthreads -- -m100 -n100000
   Compiling ring v0.1.0 (/home/overminddl1/tmp/ring)
    Finished release [optimized] target(s) in 15.26s
     Running `target/release/ring_rs_greenthreads -m100 -n100000`
68.088705ms

Yep a lot slower, as expected, but still a lot faster than go and erlang, which again took (running again, and of course times vary):

❯ erl
Erlang/OTP 24 [erts-12.1.5] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [jit]

Eshell V12.1.5  (abort with ^G)
1> ring_benchmark:run(100, 100000).
6064000

❯ ./ring-go -m=100 -n=100000
7.098047836s

❯ ./target/release/ring_rs_greenthreads -m100 -n100000
89.482385ms

And yes, I confirmed that rust is indeed executing 100*100000 many times (the times were so fast that something didn’t seem right, and who knows, something about my code might not be right, please confirm above, but it is indeed counting down from 10000000).

EDIT: Don’t forget that Rust plays VERY nicely with erlang, it’s easy to make NIF’s. ^.^

unix1 · December 1, 2021, 4:08pm

I’m not too familiar with rust and tokio. Do they also allocate and copy with every send/receive?

AstonJ · December 1, 2021, 5:33pm

^^Does Rust/Tokio also have the same kind of failsafes/fault tolerance that Erlang/OTP does?

I think Rust is great, but a lot of it seems experimental (and articles like this are always concerning) - that’s not necessarily a bad thing - experimentation often leads to innovative stuff, but would I opt to use it over Erlang for the things Erlang is proven in? Probably not… at least not right now - 5, 10 years maybe? (By which time Erlang will have progressed in all sorts of ways itself).

OvermindDL1 · December 1, 2021, 7:53pm

Not on every send (well, ‘some’ kind of things do), but initial allocation always happens with tokio upon the creation of a new task.

(EDIT1: In more detail, when a task is created then it’s the size of the entire max possible callstack size of the task, which can be large at times, however you can always conditionally box parts of it as well and more too to keep the size down if it’s an issue, but most of the time it’s not an issue, like at all (think of this as the stack space it will use up), and instead it knows precisely, down to the last byte, of how much ‘stack’ space it needs (with all lifetime knowledge and all, reusing memory when possible, etc… etc…) and will allocate only that much for it).

(EDIT2: And for note, tokio allocation on new tasks is just because it is fully generic, if you have more specialized runners (which yes you can mix with the tokio runtime) or so, then it might not have allocations depending on the context and how it’s all set up.)

Rust propagates errors up until they must be handled or they reach program top. Worst case is something can call panic! and that will kill the program outright (after whatever logging or so is setup).

However, it’s not a mesh distributed system, it’s definitely single node, though definitely could build multi-node on it (there’s a few libraries that do). Right now, Erlang’s mesh capabilities is its unmatched highlight, that’s not something most languages are ever going to be able to pull off, especially with its hotswapping capabilities. Thankfully Rust and Erlang work very well together in their respective strengths though.

derekzhou · December 2, 2021, 2:15am

You did not wait until all the messages to travel back. So you just send the first one, let it runs, then send a done that shuts down everything. In this case, if you change the tokio scheduler to current_thread (single thread) it will finish in microseconds because nothing is done.
If I modify your program to add a channel back to main, and wait for this channel in main, I have

3.52163712s

Rust is fast but not that fast.
The modified program:

use clap::Parser;
use std::time::Instant;
use tokio::{select, sync::{mpsc, watch}, task};

#[derive(Parser)]
struct CliArgs {
        #[clap(short, long, default_value = "10")]
        m: usize,
        #[clap(short, long, default_value = "10")]
        n: usize,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
        let CliArgs { m, n } = CliArgs::parse();
        let mut chs: Vec<_> = (0..n).map(|_| {
                let (tx, rx) = mpsc::channel::<usize>(1);
                (tx, Some(rx))
        }).collect();
    let (main_tx, mut main_rx) = mpsc::channel::<usize>(1);
        let (done_tx, done) = watch::channel(false);
        for i in 0..n {
                let mut channel = chs[i].1.take().unwrap();
                let next = chs[(i+1) % n].0.clone();
                let mut done = done.clone();
                let main_tx = main_tx.clone();
                task::spawn(async move {
                        loop {
                                select! {
                                        k = channel.recv() => {
                                                if let Some(k) = k {
                                                        if k > 0 {
                                                                if let Err(e) = next.send(k - 1).await {
                                                                        // 'next''task has closed   
                                                                        eprintln!("Task closed early: {:?}", e);
                                                                        return;
                                                                }
                                                        } else {
                                                            main_tx.send(0).await.unwrap();
                                                                return;
                                                        }
                                                } else {
                                                        // channel closed                           
                                                        return;
                                                }
                                        }
                                        _ = done.changed() => if *done.borrow() { return; },
                                };
                        }
                });
        }
        let start = Instant::now();
        chs[0].0.send(m * n).await?;
        main_rx.recv().await;
        done_tx.send(true)?;
        let end = Instant::now();
        println!("{:?}", end - start);

        Ok(())
}

OvermindDL1 · December 2, 2021, 5:05pm

Ah yes true!

EDIT: And with that fixed (whoops!), the multi-threaded scheduler here I get:

✦ ❯ cargo run --release --bin ring_rs_greenthreads -- -m100 -n100000
   Compiling ring v0.1.0 (/home/overminddl1/tmp/ring)
    Finished release [optimized] target(s) in 16.58s
     Running `target/release/ring_rs_greenthreads -m100 -n100000`
2.823803715s
[src/bin/ring_rs_greenthreads.rs:59] m = 100
[src/bin/ring_rs_greenthreads.rs:59] n = 100000
[src/bin/ring_rs_greenthreads.rs:59] m * n = 10000000

And with the single threaded scheduler I get:

✦ ❯ cargo run --release --bin ring_rs_greenthreads -- -m100 -n100000
   Compiling ring v0.1.0 (/home/overminddl1/tmp/ring)
    Finished release [optimized] target(s) in 16.17s
     Running `target/release/ring_rs_greenthreads -m100 -n100000`
2.278915002s
[src/bin/ring_rs_greenthreads.rs:59] m = 100
[src/bin/ring_rs_greenthreads.rs:59] n = 100000
[src/bin/ring_rs_greenthreads.rs:59] m * n = 10000000

And running the go one again:

❯ ./ring-go -m=100 -n=100000
7.076336311s

And the erlang one again:

2> ring_benchmark:run(100, 100000).
6113590

Five attempts on both the go and erlang ones and took the best results of each.

derekzhou · December 2, 2021, 6:41pm

I am a little disappointed with the Go performance. In this kind of benchmark, the cooperative async/await style threading shall be more efficient than the preemptive multi-threading in Erlang. Also with the native code advantage, I would have thought Go should finish somewhere in between Erlang and Rust.

OvermindDL1 · December 2, 2021, 10:12pm

It’s the way go context switches, it’s a bit heavy (similar to how the libdill/libmill libraries do in C but not quite ‘that’ bad), hence why basically every go program that must be as fast as possible doesn’t use go’s greenthreads. In comparison, Rust’s async compiles the tree of function calls of an async program into an enum state machine, a simple/trivial example is like:

async fn blah(something: usize) > Result<String, Box<Error>> {
    let blep = an_async_func(something).async?;
    let vwoop = a_normal_function(something, blep);
    another_async_func(vwoop)
}

This essentially gets compiled into this (taking some liberty with some syntax for brevity, for example it normally does more condensed tests so poll doesn’t have to be called again on async functions that immediately are ready, but adding all that here would make it much much longer):

enum BlahFuture {
    Init { something: usize },
    State1 { something: usize, blep: Pin<AnAsyncFuncFuture> },
    State2 { another: Pin<AnotherAsyncFuncFuture> },
    Done,
}

impl std::future::Future for BlahFuture {
    type Output = Result<String, Box<Error>>;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Pool<Self::Output> {
        match *self {
            BlahFuture::Init { something } => {
                let blep = AnAsyncFuncFuture::new(something);
                *self = Self::State1 { something, blep };
                Poll::Pending
            }
            BlahFuture::State1 { something, blep } => {
                match blep.poll(cx) {
                    Poll::Pending => Poll::Pending,
                    Poll::Ready(blep_ret) => {
                        let vwoop = a_normal_function(something, blep_ret);
                        *self = Self::State2 { another: AnotherAsyncFuncFuture::new(vwoop) };
                        Poll::Pending
                    }
                }
            }
            BlahFuture::State2 { another } => {
                match another.poll(cx) {
                    Poll::Pending => Poll::Pending,
                    Poll::Ready(another_ret) => {
                        *self = Self::Done;
                        Poll::Ready(another_ret)
                    }
                }
            }
            BlahFuture::Done => panic!("polled done future"),
        }
    }
}

But as you can see the program basically got split at each .await call, each split chunk becomes it’s own head in the enum variant.

A trivial scheduler (not tokio, or, well, really any of them) will not just call poll repeatedly over time, as that would be horrendously inefficient, instead the Context there has a Waker type that you can store, so if you, well, for example, make a Timeout future then you’d store off that waker somewhere and when its time finally elapses then you will call that waker, which will put that future back onto the runtime polling queue (might even call it instantly) to let it continue to do more work.

So in essence an async function in rust is really just an enum that gets poll called on it via the std::future::Future trait on occasion to let it continue to do work by waiting for the waker to be called in the cx, whether that’s a timer timeout or some network traffic that came in or whatever, you can even build up your own custom ones like for a game engine you might want to process animations in what looks like a linear function but instead it’s a future that runs at the appropriate times like:

async fn run_game_animation_for(engine: &GameEngineRT, model: ModelRef) {
  model.play_animation("sit").until_finished().await;
  model.play_animation("talking").repeat();
  engine.teletype_text("Sit down and listen...").await;
  // Keep playing the "talking" animation for another 450 seconds after text is fully teletyped on
  engine.wait(Duration::from_milliseconds(450)).await;
  model.play_animation("waiting-sat");
  // etc...
}

You could absolutely make such an API that runs within a game engine, rust’s async is generic so you can build about anything on it that you want.

And yes, you can literally just implement std::future::Future yourself if you want (usually with the base most ‘futures’ it’s required, but that’s rarely if ever anything you’d implement in your own program).

But because of this there’s no context switch happening, it’s about as light as it can get (even allocationless if using a specialized scheduler for your specific future types, though of course an allocation for general ones like tokio, but that’s still very cheap enough).

derekzhou · December 2, 2021, 10:48pm

Interesting. So in Rust, you don’t even need to have a different stack per tokio thread? What is the catch? Can’t do async virtual function calls?(If there is such a thing; I am not sufficiently familiar with Rust to know)

OvermindDL1 · December 2, 2021, 10:58pm

A tokio thread is just a processor to churn through poll calls of tasks. They are extremely simple over all (other than handling work stealing on the concurrent queue and scaling up/down, etc…)

Sure, you can have a Box<dyn Future> just fine (and in fact if one of your future’s gets too big that’s a convenient way of making it small again in exchange for an allocation).

dc0d · January 3, 2022, 11:03am

Happy new year!

Changed the Go code a bit and the result is now between Erlang and Rust. First, let’s take a look at the results on my machine (more details are provided below).

Three executions of the original Go code (with -m=100 -n=100000):

6.990975714s
6.947119707s
7.079608331s

Three executions of the Erlang code:

5785368
8367351
5809842

And, three executions of the modified Go code:

4.624983411s
4.639532605s
4.742938768s

Environment: Mac Core i5, Go 1.17.5, Erlang 24 (got via erl -eval 'erlang:display(erlang:system_info(otp_release)), halt().' -noshell!)

In the Go code, a *sync.WaitGroup is used for being informed about the end of the computation - instead of a channel. After removing the done channel, it was possible to remove the select statement:

package main

import (
	"flag"
	"fmt"
	"sync"
	"time"
)

func main() {
	m := flag.Int("m", 10, "number of loops")
	n := flag.Int("n", 10, "number of goroutines")
	flag.Parse()

	chs := make([]chan int, *n)
	for i := range chs {
		chs[i] = make(chan int)
	}
	finishedOnes := new(sync.WaitGroup)
	for i := 0; i < *n; i++ {
		finishedOnes.Add(1)
		go f(chs[i], chs[(i+1)%*n], finishedOnes)
	}

	t0 := time.Now()
	chs[0] <- *m * *n
	finishedOnes.Wait()
	t1 := time.Now()
	fmt.Printf("%v\n", t1.Sub(t0))
}

func f(in, out chan int, finishedOnes *sync.WaitGroup) {
	defer finishedOnes.Done()
	defer close(out)

	for k := range in {
		if k <= 0 {
			return
		}
		out <- k - 1
	}
}

If we want to draw a relatively more precise analogy between a Go goroutine and an Erlang process, we should use a single channel as the input for the goroutine. Of course, the mailbox of an Erlang process is a priority queue. In comparison, a Go channel is just a normal bounded queue, with some additional semantics for participating in a select statement.

Go scheduler since 1.14 is not fully cooperative and Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection (source).

OvermindDL1 · January 3, 2022, 3:59pm

Ooo cool, let’s try the new version of the go code by @dc0d along with the all the other prior ones I have sitting here:

Rust async/greenthreads:

❯ cargo run --release --bin ring_rs_greenthreads -- -m100 -n100000
    Finished release [optimized] target(s) in 0.05s
     Running `target/release/ring_rs_greenthreads -m100 -n100000`
2.569606111s
[src/bin/ring_rs_greenthreads.rs:60] m = 100
[src/bin/ring_rs_greenthreads.rs:60] n = 100000
[src/bin/ring_rs_greenthreads.rs:60] m * n = 10000000

Erlang:

❯ erl                    
Erlang/OTP 24 [erts-12.2] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [jit]

Eshell V12.2  (abort with ^G)
1> ring_benchmark:run(100, 100000).
5976398

Original Go code:

❯ ./ring-go -m=100 -n=100000 # Original go version
6.930653128s

@dc0d new Go code:

❯ ./ring-go2 -m=100 -n=100000 # New go version by @dc0d
4.706953452s

Much faster than before!

All compilers were updated to their latest version available before running the code and all programs were ran many times with the best time taken as usual (though with basically all of them the first runs were a bit slower then all the rest were pretty consistent).

danilagamma · January 3, 2022, 5:56pm

I managed to shave a bit of time in erlang benchmark, probably still not enough to beat Go, but still:

1> ring_benchmark:run(100, 100000). 
6305893
2> ring_benchmark:run(100, 100000). 
6663063
3> ring_benchmark2:run(100, 100000).
5912472
4> ring_benchmark2:run(100, 100000).
5694360

ring_benchmark2.erl

-module(ring_benchmark2).

-export([run/2]).

run(M, N) ->
    % Spawn N processes and connect them as a ring.
    Launcher = self(),
    Pids = [ spawn(fun() -> bootstrap(Launcher) end) || _ <- lists:seq(1, N) ],
    [HPid | TPids] = Pids,
    RPids = TPids ++ [HPid],
    [ Pid ! {next, Next} || {Pid, Next} <- lists:zip(Pids, RPids) ],

    % Time sending messages around the ring M times.
    T1 = erlang:monotonic_time(),
    HPid ! M * N,
    receive
        done -> ok
    end,
    T2 = erlang:monotonic_time(),
    Dt = erlang:convert_time_unit(T2 - T1, native, microsecond),

    % Kill all the processes.
    [ exit(Pid, quit) || Pid <- Pids ],

    Dt.

bootstrap(Launcher) ->
    receive
        {next, NextPid} ->
            process(Launcher, NextPid)
    end.

process(Launcher, NextPid) ->
    receive
        0 -> Launcher ! done;
        J -> NextPid  ! J - 1
    end,
    process(Launcher, NextPid).

marcellanz · January 4, 2022, 2:45pm

Interesting thread, thanks!

I’d like to throw in a plug to hyperfine, which is a nice tool to run “stuff” timed with some stats: