Looking for a faster RNG

Yes, at least in most of the tests I have run, there is between 5-35% performance improvement using erlang:random_integer/1 compared with the erlang:phash2(erlang:unique_integer(), Range) trick and 60-80% performance improvement compared with rand:uniform/1.

I also neglected to share the performance results from erlang:random_integer/0 in my initial comment.

Single process test with erlang:random_integer/0:

erlperf -c 1 'erlang:random_integer().' 'erlang:unique_integer().' 'rand:uniform().'
Code                             ||        QPS     Rel
erlang:random_integer().          1   40138 Ki    100%
erlang:unique_integer().          1   36524 Ki     91%
rand:uniform().                   1   10316 Ki     26%

64 process test with erlang:random_integer/0 (matches the number of online schedulers):

erlperf -c 64 'erlang:random_integer().' 'erlang:unique_integer().' 'rand:uniform().'
Code                             ||        QPS     Rel
erlang:random_integer().         64   42728 Ki    100%
erlang:unique_integer().         64   36211 Ki     84%
rand:uniform().                  64   16375 Ki     38%

Yeah, I agree: this would not be a full replacement for rand and would primarily be useful in very high QPS use-cases (for example: random routing/load balancing).

That makes sense, but would this mostly be for API complete-ness or do you have any specific use-cases that something like erlang:random_binary/1 would be a perfect fit for that crypto:strong_rand_bytes/1 or rand:bytes/1 are sub-optimal for today?

Also makes sense for 32-bit. I figured we would have to address the 32-bit issue before this could be accepted. Right now I’m just blindly casting to Uint (all integers are just 64-bit everywhere, anyway, right? :slight_smile:).

Right now: I’m mimicking the rand:uniform/1 behavior that requires Range >= 1 while also enforcing Range =< (1 bsl 58) - 1. I’m using v != 0 to terminate the loop, which accomplishes something similar for the “V in the truncated top range” case in the ?uniform_range macro in rand.erl.

One (probably very rare) case that isn’t currently covered is if, for whatever reason, the scheduler’s state gets set to two zeros, any call to erlang:random_integer/1 will always return 1 forever. The same “no recovery from zero state” behavior can be simulated with rand using rand:uniform_s(1000, {element(1, rand:seed(exrop)), [0|0]}).. Regardless of the input N, the resulting output is always 1 and the state never changes.

Another issue is the initial value for splitmix64_seed, which I assumed we could either use some low-grade “entropy” (like system time, number of schedulers, etc; similar to rand:seed/1) that could be overridden with something like erl +randseed 123456.

Context for my following questions: I don’t know a ton about PRNG and the details of the pros/cons of the various algorithms. So I’m legitimately curious.

Why Xoshiro256++ versus Xoroshiro116+? If I’m understanding correctly, I think the cycles/B is roughly double, unless we were able to get the AVX2/NEON vectorization version to work correctly. If we split Xoshiro256++ for integers and Xoshiro256+ for floats, would the idea be that separate state would be kept on each scheduler for each type of output? If so, I think that would be 32-bytes (or 256-bytes for the SIMD version) stored twice, so either 64-bytes or 512-bytes per scheduler.

5 Likes

Interesting speed figures!

Now, a NIF call overhead should not be much larger (if even) than
a BIF call overhead. So if this could well enough be implemented
in a NIF - that would be very much preferred.

Before getting deeper into the discussion of whether this is an
extension of the VM that we want we must explore implementing this
as a NIF.

I have tried once implementing one of the Xoroshiro generators
with a NIF, and did not get much performance improvement, but that
attempt was adhering to the framework of rand, meaning passing the state
back and forth between Erlang and the NIF.

Storing the state in thread specific data should have much less overhead.

:

That makes sense, but would this mostly be for API complete-ness or do you have any specific use-cases that something like erlang:random_binary/1 would be a perfect fit for that crypto:strong_rand_bytes/1 or rand:bytes/1 are sub-optimal for today?

That suggestion I guess was for API completeness.
I have no good use cases - often, I guess, there would be other
requirements so generating binaries would be a too narrow problem.

But I think there are use cases for fast random floats.

Also makes sense for 32-bit. I figured we would have to address the 32-bit issue before this could be accepted. Right now I’m just blindly casting to Uint (all integers are just 64-bit everywhere, anyway, right? :slight_smile:).

Let’s postpone this problem for later.

:
:

One (probably very rare) case that isn’t currently covered is if, for whatever reason, the scheduler’s state gets set to two zeros, any call to erlang:random_integer/1 will always return 1 forever. The same “no recovery from zero state” behavior can be simulated with rand using rand:uniform_s(1000, {element(1, rand:seed(exrop)), [0|0]}).. Regardless of the input N, the resulting output is always 1 and the state never changes.

That should never happen unless you seed it to [0|0], which must be avoided.
[0|0] is not part of the generator’s sequence, or rather; this class of
generators have two sequences; one is the singular [0|0] and the one used
is a full sequence of all other state values.

Another issue is the initial value for splitmix64_seed, which I assumed we could either use some low-grade “entropy” (like system time, number of schedulers, etc; similar to rand:seed/1) that could be overridden with something like erl +randseed 123456.

Possibly, but auto seed only would be prefered.

Also, the Splitmix64 state updates need to be thread safe, right?

Context for my following questions: I don’t know a ton about PRNG and the details of the pros/cons of the various algorithms. So I’m legitimately curious.

Why Xoshiro256++ versus Xoroshiro116+? If I’m understanding correctly, I think the cycles/B is roughly double, unless we were able to get the AVX2/NEON vectorization version to work correctly. If we split Xoshiro256++ for integers and Xoshiro256+, would the idea be that separate state would be kept on each scheduler for each type of output? If so, I think that would be 32-bytes (or 256-bytes for the SIMD version) stored twice, so either 64-bytes or 512-bytes per scheduler.

It is just that Xoshiro256 is recommended by the creator(s) as an
all-purpose generator. Its state is 32 bytes which I do not think
is much to store per scheduler. The jump distance is long: 2^128
compared to 2^64 for Xoroshiro128. If you run a VM very long it is
maybe even possible to use 2^64 random numbers in a scheduler.

The difference between Xhoshiro256++ and Xoshiro256+ is in
the pre state update output scrambling (++ vs.+) so my idea was to use
the same state update and scramble differently between integers and floats
since the + scrambler is slightly faster and its small disadvantage
vs. the ++ scrambler can be hidden when generating floats.

I also suspect that getting down from about 1 ns generation time
per 64-bit integer to a fraction of it will be buried in BIF or NIF
call overhead so there is not much point in looking at e.g. vectorization.
The same reasoning applies between Xoshiro256 vs. Xoroshiro128.

Cheers

3 Likes

I threw together a NIF here: GitHub - potatosalad/erlang-fastrandom

I also started down the path of storing the state in a reference(), but found that the performance was nearly identical to pure-Erlang rand when it was written as a rand “plugin”.

The NIF has support for Xoroshiro116+, Xoshiro256+, Xoshiro256+X8, Xoshiro256++, and Xoshiro256++X8. SplitMix64 is used along with enif_tsd_get to store thread specific data similar to how the BIF functions.

However, in its current form, the BIF implementation is still roughly 30-35% more efficient compared to the NIF implementation.

Single process performance:

Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                           1   34123 Ki    100%
erlang:phash2(erlang:unique_integer(), 1000).          1   22562 Ki     66%
fastrandom_nif:xoroshiro116p_next(1000).               1   17359 Ki     51%
fastrandom_nif:xoshiro256p_next(1000).                 1   17319 Ki     51%
fastrandom_nif:xoshiro256px8_next(1000).               1   16886 Ki     50%
fastrandom_nif:xoshiro256ppx8_next(1000).              1   16861 Ki     49%
fastrandom_nif:xoshiro256pp_next(1000).                1   16662 Ki     48%
rand:uniform(1000).                                    1   10465 Ki     30%

64 process performance:

Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                          64   36578 Ki    100%
fastrandom_nif:xoroshiro116p_next(1000).              64   26174 Ki     71%
fastrandom_nif:xoshiro256p_next(1000).                64   20460 Ki     56%
erlang:phash2(erlang:unique_integer(), 1000).         64   18774 Ki     51%
fastrandom_nif:xoshiro256pp_next(1000).               64   15208 Ki     41%
fastrandom_nif:xoshiro256ppx8_next(1000).             64   15018 Ki     41%
fastrandom_nif:xoshiro256px8_next(1000).              64   13834 Ki     37%
rand:uniform(1000).                                   64    9105 Ki     24%

That makes sense, thanks for the explanation :slight_smile:

Normally: yes. However, at first glance, it looked like erts_init_scheduling calls init_scheduler_data in a for loop, which would result in erts_sched_bif_unique_init being called sequentially on a single thread. So…I think SplitMix64 is only ever called on a single thread. However, if I am misinterpreting the code there, then yes: some sort of lock would need to be added.

That makes sense, thank you for the explanation.

What are your thoughts on the direction of the experiments so far? I think given the fairly small performance improvement between the phash2/unique integer trick and the NIF, we’ll likely stick with the former for the time being. However, the BIF implementation could potentially be useful to us for very high QPS random-based load balancing workloads.

4 Likes

I threw together a NIF here: GitHub - potatosalad/erlang-fastrandom

I also started down the path of storing the state in a reference(), but found that the performance was nearly identical to pure-Erlang rand when it was written as a rand “plugin”.

I have once followed that path, and arrived to the same conclusion…

The NIF has support for Xoroshiro116+, Xoshiro256+, Xoshiro256+X8, Xoshiro256++, and Xoshiro256++X8. SplitMix64 is used along with enif_tsd_get to store thread specific data similar to how the BIF functions.

However, in its current form, the BIF implementation is still roughly 30-35% more efficient compared to the NIF implementation.

Single process performance:

Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                           1   34123 Ki    100%
erlang:phash2(erlang:unique_integer(), 1000).          1   22562 Ki     66%
fastrandom_nif:xoroshiro116p_next(1000).               1   17359 Ki     51%
fastrandom_nif:xoshiro256p_next(1000).                 1   17319 Ki     51%
fastrandom_nif:xoshiro256px8_next(1000).               1   16886 Ki     50%
fastrandom_nif:xoshiro256ppx8_next(1000).              1   16861 Ki     49%
fastrandom_nif:xoshiro256pp_next(1000).                1   16662 Ki     48%
rand:uniform(1000).                                    1   10465 Ki     30%

64 process performance:

Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                          64   36578 Ki    100%
fastrandom_nif:xoroshiro116p_next(1000).              64   26174 Ki     71%
fastrandom_nif:xoshiro256p_next(1000).                64   20460 Ki     56%
erlang:phash2(erlang:unique_integer(), 1000).         64   18774 Ki     51%
fastrandom_nif:xoshiro256pp_next(1000).               64   15208 Ki     41%
fastrandom_nif:xoshiro256ppx8_next(1000).             64   15018 Ki     41%
fastrandom_nif:xoshiro256px8_next(1000).              64   13834 Ki     37%
rand:uniform(1000).                                   64    9105 Ki     24%

:

What are your thoughts on the direction of the experiments so far? I think given the fairly small performance improvement between the phash2/unique integer trick and the NIF, we’ll likely stick with the former for the time being. However, the BIF implementation could potentially be useful to us for very high QPS random-based load balancing workloads.

About the benchmark: can we get an estimate of the benchmark overhead
by also benchmarking the ‘ok’ expression or ‘self()’ (that I IIRC is a VM
instruction) to see if it might be that the BIF implementation is
even more efficient when accounting for overhead.

Also, ‘rand:uniform(1000)’ stores the state in the process dictionary,
which might be noticeably slower than keeping it in a state variable.
But I can not find any way to iterate over a state in the erlperf framework,
so I will measure in the rand_SUITE:measure to see how much process dictionary
state vs state variable differs.

As it stands now, fastrandom_nif:exrop has about 1.4-2.0 the execution time of
erlang:random_integer, and rand:uniform has about 1.7-3.0 the execution
time of fastrandom_nif:exrop.

When accounting for overhead, erlang:random_integer might be better than that,
and when measuring without process dictionary fastrandom_nif might be not as good.

These might be arguments in favour for a BIF solution…

About the fastrandom_nif code:

I sat down to do some moonshine programming, when I saw that you
posted very much the code I intended to write - thank you very much! :slight_smile:

I can not see any way to optimize the code…

  • The ending in all *_next_0() functions is output %= UINT58MASK; but should
    be output &= UINT58MASK;.

  • All *_next_1() functions check the range and throw a badarg “Bad
    argument: ‘Seed’” but the erroneous argument is the range.

  • The same functions check the max range as bad if n > UINT58MASK
    but that should be n > TWO_POW58 (n > 1<<58).

  • The same functions have a range check-reject-retry loop that returns
    a value in the range [1, n] for a range n, but the *_next_0() functions
    return a value in the range [0, (2^58 - 1)]. This might be exactly
    what we want, or do we want the range [0, (n-1)] for the *_next_1()
    functions?

  • There is a tsd_get, alloc, ctx_jump, tsd_set section in all next()
    functions that I would have made a helper function of.

  • fastrandom_nif_global_seed should not be set to 0 - we should find some
    entropy; future improvement I guess… We could take it from
    load_info to make it an Erlang problem.

  • The 116 bit generator should not be seeded with 64 bit seed words,
    and although the Splitmix64 generator will not output 2 zeros in a row,
    after masking down to 58 bit you might get just that; so a loop that
    avoids two zeros (or just zeros) is needed whan seeding (with masking
    to 58 bits) this generator.

Cheers!

3 Likes

My understanding/finding during my experimentation on SFMT and TinyMT on Erlang from 2010 to 2012 was that the overhead of function callings themselves became the fundamental overhead of generating random numbers as a sequence with state manipulation. See my TinyMT paper Section 4.3 for the details. Quote:

While TinyMT is inherently simpler than SFMT, that does not necessarily mean TinyMT is faster for Erlang. We think this is due to the execution overhead time of each function in BEAM.

If you really need a blazingly fast random number generator, you need to consider a batch generation of 100s or 1000s of random number sequence in NIFs as a list and simply get the head element out of the list pool.

4 Likes

There is indeed a visible overhead for benchmarking itself:

erlperf 'ok.' 'rand:uniform(100).'
Code                       ||        QPS     Rel
ok.                         1   31397 Ki    100%
rand:uniform(100).          1   10029 Ki     31%

Mainly because benchmark calls atomics:add. One can’t avoid it with timer:tc(because of the looping overhead). I plan to address this by using less expensive benchmark approach in the next version.

3 Likes

Also, a loop over a state is a missing feature in this case.
(Shameless feature request :wink:

3 Likes

Now I have tried that.

I took Andrew Bennet’s fastrandom_nif and changed it to have a fixed array of 32 generated numbers, to fill that array and return a list of numbers from that array. Array size 100 gave a hardly noicable improvement while 10 was noticably worse. I chose to measure on Xoshiro256++, and compare to (also Andrew Bennet’s) erlang:random_integer/0,1 that implements Xoroshiro116+ which is a few percent faster algorithm according to Andrew’s numbers.

Then I have measured by modifyng the rand_SUITE:measure/1 test case.

Numbers from my Dell Latitude 7300 laptop, per generated number, amortized, about 10 ns is spent on measure/1 overhead, the figures below are compensated for that:

Both cached fastrandom:xoshiro256pp_next/0 and erlang:random_integer/0 (no range limiting) use 6 ns

For generation with range 10000 both cached fastrandom:xoshiro256pp_next/1 and erlang:random_integer/1 use 12 ns.

For the awkward case half range + 1 (about 50% probability to discard and retry) cached fastrandom:xoshiro256pp_next/1 uses 12 ns while erlang:random_integer/1 uses 19 ns.

The rand framework has got another 20 ns overhead besides measure/1 overhead.
The default exsss algorithm in rand uses 30 ns besides overhead.
For the same algorithm small range 10000 limiting instead uses 50 ns besides overhead, and the awkward half range + 1 range limiting uses 75 ns besides overhead.

So, using a NIF that generates and converts 32 numbers to a list per invocation, and picking cached numbers from the head of the list reaches the same or slightly better performance than the erlang:random_integer/0,1 BIF implementation generating one number at the time.

Flipping the coin: the erlang:random_integer/0,1 BIF implementation reaches almost the same performance as generating in batches and picking numbers from the head of a list, with much less memory footpring and without the hassle of keeping cached numbers in yer state, having to decide the range for all numbers when generating the cache, etc…

With these low figures of number generation time; what you actually do with the numbers gets very significant, it might be so that generating a number in 6 ns vs. in 95 ns is irrelevant compared to application processing, and in other applications it might be an order of a magnitude difference. For some applications it might be no problem to pre-generate and handle a cache with known range numbers, and in others it may be very awkward.

A complete solution for cached PRNG numbers would need the ability to control how many numbers that are generated per cache refill, and being able to generate floating point numbers.

Both cached NIF generated PRNG numbers and erlang:random_integer/0,1 numbers can be generated in about 1/8 … 1/5 of the time required by the default rand generator i.e they are about 5 to 8 times faster / leaner. (but lack e.g repeatability and jumping)

Cheers!

2 Likes

I have done some more benchmarking and it turns out that bjorng was right from the start…

A simple Linear Congruential Generator is very fast, and can be an alternative if its substandard statistical qualities are ok for your application. To be fast it has to avoid bignum operations and fancy theoreticaly correct ways to implement a random number range.

I also noted that the remainder for integer division operator ‘rem’ is expensive - a bit mask is much faster.

After compensating for benchmark overhead, on my machine, (amortized) generation time per number (integer):

Generator	Full Range (ns)		Range 10000 (ns)
---------	---------------		----------------
LCG35		4			14
BIF  		6			12
NIF + cache	6			12
NIF  		18			25
unique + phash 18			25
rand		50			60

BIF is Andrew Bennet’s erlang:random_integer/0,1.

NIF is Andrew Bennet’s fastrandom, Xoshiro256++.

NIF + cache is the same’s Xoshiro256+ with a 32 number list cache.

unique + phash is, simply erlang:phash2(erlang:unique_integer()[, Range]).

rand is the default algorithm in the ‘rand’ module (116 bit), not using the process dictionary.

LCG35 is the following generator:

lcg35(R) ->
    (15319397 * R + 15366142135) band ((1 bsl 35) - 1).

The range 10000 operation is done with rem 10000 which is imperfect but mostly good enough for such a small range. All other generators produce the range in a theretically correct way with reject and retry.

An LCG generator has got bad statistical properties for the low bits - see the Wikipedia article. I have also tried a smaller Multiplicative Congruential Generator that should not have that particular flaw, but, it is smaller and have other pecularities. Other than that it is roughly equally fast thanks to being based on a Mersenne prime which facilitates an optimization of the ‘rem’ operation:

-define(MCG31_M, 163490618).
mcg31(R0) ->
    X = ?MCG31_M * R0,
    %%
    %% X rem ?MCG31_M optimized for Mersenne prime
    R1 = (X band ((1 bsl 31) - 1) + (X bsr 31),
    if
	?MCG31_M =< R1 -> R1 - ?MCG31_M;
	true           -> R1
    end.

I have also looked at a <35 bit MCG generator that does not use a Mersenne prime so it was harder to optimize, and it turned out to be only slightly faster than NIF and unique + phash. Therefore I think it is probably not very interesting, but I can publish it if requested.

Generator constants from “Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure” by Pierre L’Ecuyer (1999).

Cheers!
/ Raimo Niskanen, Erlang/OTP

5 Likes

And now I created a PR with some improvements in this area:

https://github.com/erlang/otp/pull/5836
8 Likes

Sure thing!

I have rewritten erlperf in attempt to reduce benchmarking overhead to a bare minimum (now it is essentially a single local function call, estimated at 1-2 ns). And of course added this missing feature. I can confirm that new niche RNGs are indeed faster (up to 7x compared to exsss):

./erlperf 'run(_, S) -> rand:mcg35(S).' --init_runner "1." 'run(_, S) -> {_, S1} = rand:uniform_s(S), S1.' --init_runner 'rand:seed(exsss).' 'run(_, S) -> rand:lcg35(S).' --init_runner "1."
Code                                                  ||        QPS       Time     Rel
run(_, S) -> rand:lcg35(S).                            1     129 Mi       7 ns    100%
run(_, S) -> rand:mcg35(S).                            1     122 Mi       8 ns     94%
run(_, S) -> {_, S1} = rand:uniform_s(S), S1.          1   20079 Ki      49 ns     15%

Decrypting the incantation above:

  1. Run rand:mcg35(S) and use the result as the input for the next iteration (the feature you requested)
  2. Same but for uniform_s. Unfortunately it returns a bit more complicated structure (random number and a new state).
  3. Same, with lcg35.

In general, results reported by erlperf 2.0 seem to match numbers posted in your PR (except for crypto:strong_rand_bytes that takes only ~800 ns on my machine (AMD 3.2 GHz desktop CPU).

Now the only issue is they all require state saved somewhere. Which means we’ll continue using a wrapper… rand:lcg35(erlang:unique_integer()).. It is still blazingly fast, 2x faster compared to erlang:phash2(erlang:unique_integer()).
But I just don’t like wrappers… especially this one, that is -define(RNG(), rand:lcg35(erlang:unique_integer())).

4 Likes

Sure thing!

I have rewritten erlperf in attempt to reduce benchmarking overhead to a bare minimum (now it is essentially a single local function call, estimated at 1-2 ns). And of course added this missing feature. I can confirm that new niche RNGs are indeed faster (up to 7x compared to exsss):

./erlperf 'run(_, S) -> rand:mcg35(S).' --init_runner "1." 'run(_, S) -> {_, S1} = rand:uniform_s(S), S1.' --init_runner 'rand:seed(exsss).' 'run(_, S) -> rand:lcg35(S).' --init_runner "1."
Code                                                  ||        QPS       Time     Rel
run(_, S) -> rand:lcg35(S).                            1     129 Mi       7 ns    100%
run(_, S) -> rand:mcg35(S).                            1     122 Mi       8 ns     94%
run(_, S) -> {_, S1} = rand:uniform_s(S), S1.          1   20079 Ki      49 ns     15%

Decrypting the incantation above:

  1. Run rand:mcg35(S) and use the result as the input for the next iteration (the feature you requested)
  2. Same but for uniform_s. Unfortunately it returns a bit more complicated structure (random number and a new state).
  3. Same, with lcg35.

In general, results reported by erlperf 2.0 seem to match numbers posted in your PR (except for crypto:strong_rand_bytes that takes only ~800 ns on my machine (AMD 3.2 GHz desktop CPU).

Great to get my results confirmed!

Now the only issue is they all require state saved somewhere. Which means we’ll continue using a wrapper… rand:lcg35(erlang:unique_integer()).. It is still blazingly fast, 2x faster compared to erlang:phash2(erlang:unique_integer()).
But I just don’t like wrappers… especially this one, that is -define(RNG(), rand:lcg35(erlang:unique_integer())).

That wrapper looks strange to me. rand:lcg35/1 is not a hash function. Yes, you may use erlang:unique_integer/0 as seed to rand:lcg35/1, but you need to maintain the PRNG state.

If you keep the state in the process dictionary you may write a wrapper for put(lcg35_state, rand:lcg35(get(lcg35_state))), and such a wrapper runs slower, a factor 3 or about 10 ns on my machine.

The fastest is to keep the state in a loop variable, but as soon as you store the state on the heap (or in the process dictionary), updating it it will cost performance. To return {Value, State} instead of just State costs about the same as updating the state in the process dictionary.

So these simple algorithms can generate a new state very fast, but the inherent overhead in a real application that stores its state on the heap will in many cases bury the difference between e.g rand:lcg35/1 and rand:exsp_next/1…

Nevertheless, these light-weight algorithms are the most light-weight I have seen, and in special cases that can make a difference.

Cheers!
/ Raimo Niskanen, Erlang/OTP, Ericsson AB

3 Likes

That’s exactly what I cannot really afford, having that state. For most cases, processes I start are very short-lived, doing RNG once-twice in its lifecycle, but the moment where I need the state is really unpredictable. That’s why keeping the state somewhere in a scheduler info is really that tempting.

2 Likes

This sounds like PRNG seeding is a bigger problem. The new small PRNG:s are useful because they are fast when iterated over their state, after seeding, which is not at all your use case.

It sure seems erlang:phash2(erlang:unique_integer(), Range) is a good fit here. Something like that or more heavy-weight would anyway be needed to seed a state for e.g rand:lcg35/1.

Cheers

2 Likes

Yeah I would just opt for a simple hash or noise function as well over something fairly unique like the time or unique_integer in erlangs case.

1 Like