Looking for a faster RNG

There are many applications where a random number generator does not have to be exactly uniform, or very robust, but it does need to be fast.
It appears that the built-in rand:uniform isn’t exactly fast:

erlperf 'rand:uniform(1000).' 'erlang:phash2(erlang:unique_integer(), 1000).'
Code                                                  ||        QPS     Rel
erlang:phash2(erlang:unique_integer(), 1000).          1   15137 Ki    100%
rand:uniform(1000).                                    1    6282 Ki     41%

I wonder if there are other implementations of not-very-uniform RNGs that look less scary than taking a phash2 of a unique number.

6 Likes

A linear congruential generator is probably faster than `rand` and should be good enough.

Also, the type-based optimizations in the upcoming OTP 25 result in smaller code generated for the rand module (fewer type tests and overflow tests). I have not benchmarked to see whether the better code also results in measurably better performance compared to rand in OTP 24.

4 Likes

An LCG using the process dictionary to save the seed is indeed ~2x faster than exsss, but it is still slower than the two BIF calls:

erlperf 'lcg:rand(1000).' 'erlang:phash2(erlang:unique_integer(), 1000).'
Code                                                  ||        QPS     Rel
erlang:phash2(erlang:unique_integer(), 1000).          1   14604 Ki    100%
lcg:rand(1000).                                        1   12539 Ki     85%

Naive implementation:

-module(lcg).
-export([rand/1]).

%% Naive LCG keeping its state in the process dictionary.
%% The state is kept at full 32-bit width; storing the state reduced
%% modulo Max would collapse the period to at most Max values.
rand(Max) ->
    Seed = case get(seed) of
        undefined -> erlang:unique_integer();
        S -> S
    end,
    Next = (Seed * 134775813 + 1) band ((1 bsl 32) - 1),
    put(seed, Next),
    Next rem Max.

Measurements were taken on the master branch, which I think includes the type-based optimisations.

An inlined implementation is of course faster,

erlperf 'runner(Seed) ->  (Seed * 134775813 + 1) rem 1000.' --init '1.' 'erlang:phash2(erlang:unique_integer(), 1000).'
Code                                                      ||        QPS     Rel
runner(Seed) ->  (Seed * 134775813 + 1) rem 1000.          1   21073 Ki    100%
erlang:phash2(erlang:unique_integer(), 1000).              1   14532 Ki     68%

But it only works when an existing Seed is kept as part of the process state, while phash2(unique_integer()) has stable performance across all processes.

One thing that makes me a bit nervous is the distribution of phash2(unique_integer()) when there are hundreds of thousands of processes calling this.
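A quick, non-rigorous way to eyeball that would be to sample the value from many short-lived processes and count hits per bucket; a sketch (module and function names are mine):

-module(phash2_dist).
-export([sample/2]).

%% Spawn Procs processes, each contributing one
%% phash2(unique_integer(), Range) sample; report how many buckets
%% were hit and the min/max count over the buckets that were hit.
sample(Procs, Range) ->
    Parent = self(),
    [spawn(fun() ->
                   Parent ! erlang:phash2(erlang:unique_integer(), Range)
           end) || _ <- lists:seq(1, Procs)],
    Counts = collect(Procs, #{}),
    {maps:size(Counts),
     lists:min(maps:values(Counts)),
     lists:max(maps:values(Counts))}.

collect(0, Acc) ->
    Acc;
collect(N, Acc) ->
    receive
        V ->
            collect(N - 1, maps:update_with(V, fun(C) -> C + 1 end, 1, Acc))
    end.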

4 Likes

A linear congruential generator is probably faster than `rand` and should be good enough.

It could be an idea to add a simple LCG to the rand algorithms,
for those who prefer speed over quality.

3 Likes

Based on my (not very scientific) benchmarking attempt, regardless of the algorithm used, the rand module is slow. There are multiple BIF calls to fetch the seed and store it back.
I wonder how opposed the OTP team would be to a PR that implements (any agreed upon) RNG as a BIF/NIF.

3 Likes

You have functional alternatives, which don’t store the seed.

I don’t think we should have an RNG BIF/NIF. We have had a lot of problems
with hash, phash and phash2 over the years, and we have also changed algorithms in
rand; this stuff evolves and should in my opinion not be a BIF/NIF.

But I’m all for a faster algorithm in the rand module; speed is more important than distribution
for many things.

3 Likes

I have tried to implement one of the existing algorithms as a NIF,
but the NIF overhead was so large that in total it did not
beat the Erlang implementation.

A BIF could be faster, but like Dan G I do not think we should
use BIFs for something with such diffuse requirements. Sure, we can
get fast, but hitting the sweet spot of fast vs. random-looking
that suits more than just a few use cases would be impossible.

3 Likes

For fun I tried to implement an MCG and an LCG with parameters
recommended in Pierre L’Ecuyer’s classic paper from 1999,
“Tables of Linear Congruential Generators of Different Sizes
and Good Lattice Structure”.

MCG: X1 = (163490618 * X0) rem ((1 bsl 31) - 1)
LCG: X1 = (32684613 * X0 + 1) band ((1 bsl 33) - 1)

The parameters are chosen to avoid bignum operations.
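A direct Erlang transcription could look like this (a minimal sketch; the function names are mine). Both products stay well below 2^59, so the arithmetic never leaves smallnums on a 64-bit machine:

%% Minimal sketch of the two recurrences above; names are mine.
mcg31_next(X0) ->                         % X0 in 1 .. 2^31 - 2, never 0
    (163490618 * X0) rem ((1 bsl 31) - 1).

lcg33_next(X0) ->                         % full period of 2^33
    (32684613 * X0 + 1) band ((1 bsl 33) - 1).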

The LCG is a bit faster, but both are only about 10…20% faster
than the default in ‘rand’: exsss. I also tried to short-circuit
the measurement framework in rand_SUITE:measure/1, which only
gave a few percent more.

Given that the statistical properties (period, distribution, etc.)
of these dead simple old generators, especially the LCG (the faster
one), are so much worse than those of the current default,
I do not see much point in implementing them.

But if someone could show that this is wrong in some way,
I may become interested again…

2 Likes

Yes, that is exactly what I was talking about. It’s not the RNG speed per se, but the way rand is implemented: keeping a lot of state in the process dictionary, creating quite a lot of garbage, and having high built-in implementation overhead.

3 Likes

I would personally not oppose a fast no-fancy-features random number generator implemented as a BIF. That is, all it would do is generate random integers. The random number seed would be kept in the scheduler data for each scheduler. (That way, there is no per-process memory cost.) There would be no way to inspect or change the seed, and no way to change algorithms.

But to consider implementing that, we would have to know that there are compelling use cases where the rand module is not fast enough. What are you doing with the random numbers and would eliminating the overhead of rand make a significant difference?

2 Likes

As Dan G stated before, the rand_SUITE:measure/1 benchmark uses the
functional API in rand, such as rand:uniform_s/2, which does not
store anything in the process dictionary, so that should not be
an issue here. It is up to the user of rand to choose the
functional API or the process dictionary API.
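For example (a minimal sketch, using the default exsss algorithm):

%% Functional API: the caller threads the state; nothing is stored.
S0 = rand:seed_s(exsss),
{X1, S1} = rand:uniform_s(1000, S0),
{X2, _S2} = rand:uniform_s(1000, S1),

%% Process dictionary API: rand stores the state for you.
_ = rand:seed(exsss),
Y = rand:uniform(1000).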

Furthermore, for every state update, the garbage overhead of rand
is a wrapping 2-tuple around the state {AlgHandler,AlgState},
where the AlgHandler is static metadata. So the garbage from
a state update should be the AlgState, which is what
the algorithm dictates, plus the wrapping 2-tuple.

AlgHandler is a metadata map that might be much larger
than AlgState, but it is static, so it does not contribute
to garbage during state updates. The same applies when using
the process dictionary API, since the process dictionary only
contains a key that refers to content on the heap; for
a state update through the process dictionary, all that is updated
is the 2-tuple (and AlgState) that the rand key refers to.

rand has a plug-in framework. The overhead is a few indirect
fun() calls via the AlgHandler, plus the wrapping 2-tuple garbage.
So the overhead, I’d say, is not excessive; but of course,
compared to not using a plug-in framework, which has zero overhead,
it is infinite.
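To make the framework concrete, here is a hedged sketch of what a minimal plug-in could look like, based on my reading of the {AlgHandler, AlgState} convention; the exact AlgHandler keys (bits, next) should be checked against the rand documentation before relying on them:

%% Hypothetical minimal plug-in: the lcg33 generator from above.
%% AlgHandler is the static metadata map; AlgState is a bare integer.
lcg33_seed(Seed) ->
    {#{type => lcg33,
       bits => 30,                        % output width rand can rely on
       next => fun(R0) ->                 % returns {Output, NewAlgState}
                       R1 = (32684613 * R0 + 1) band ((1 bsl 33) - 1),
                       {R1 bsr 3, R1}
               end},
     Seed band ((1 bsl 33) - 1)}.

%% Usage (if the keys are right):
%% {V, S1} = rand:uniform_s(1000, lcg33_seed(12345)).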

I might do some measurements to get some numbers on the overhead…

2 Likes

So far so good, but the next, and I think hardest, problem is
selecting this no-fancy-features algorithm, as in deciding how
“random” the generated numbers should be.

There is a whole research field debating the merits of different PRNG
algorithms: speed versus statistical properties and predictability,
stretching into security and possibly cryptography.

And if we should go for a “best in class” cryptographic PRNG:
what performance guarantees should we have for all our supported
platforms, given that some of them might not have hardware support?
Would this be another nail in the coffin for 32-bit Erlang?

2 Likes

Now I have some numbers.

I used timer:tc/3 on:

lcg33_bare(Range, Seed, N) ->
    lcg33_bare(Range, Seed, N, 0).

lcg33_bare(_Range, _R, 0, V) ->
    V;
lcg33_bare(Range, R, N, V) ->
    {V1, R1} = lcg33_uniform(Range, R),
    lcg33_bare(Range, R1, N-1, V bxor V1).

vs.

rand(Alg, Range, Seed, N) ->
    S = rand:seed_s(Alg, Seed),
    rand_loop(Range, S, N, 0).

rand_loop(_Range, _S, 0, V) ->
    V;
rand_loop(Range, S, N, V) ->
    {V1, S1} = rand:uniform_s(Range, S),
    rand_loop(Range, S1, N-1, V bxor V1).

Where lcg33_uniform(Range, R) is both inlined and implemented
in rand as a plug-in.
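The thread does not show lcg33_uniform/2; a plausible sketch, assuming the lcg33 recurrence above with the weak low bits dropped and rejection sampling over the remaining 30 bits, could be:

%% Hypothetical sketch of lcg33_uniform/2 (not the measured code):
%% step the 33-bit LCG, keep the 30 high bits, and reject the partial
%% block at the top so the result is uniform over 1..Range.
lcg33_uniform(Range, R0) when 0 < Range, Range =< (1 bsl 30) ->
    R1 = (32684613 * R0 + 1) band ((1 bsl 33) - 1),
    V = R1 bsr 3,
    Limit = (1 bsl 30) - ((1 bsl 30) rem Range),
    if
        V < Limit -> {V rem Range + 1, R1};    % 1-based, like rand:uniform_s/2
        true      -> lcg33_uniform(Range, R1)  % reject and retry
    end.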

A loop generating 10000000 numbers in the range 1…2^30 uses
about 40 ns per number for the inlined generator and 50 ns for the
one in rand. So the overhead is about 10 ns, or 20%.

The same for the mlcg31 generator, which is noticeably slower,
gives an overhead of about 15 ns, or 25%.

For a range that is not a power of 2, the difference between
the LCG and the MLCG generator is smaller, and the overhead
for the LCG generator increases a bit, but all in all it
seems to be around 10…15 ns, or 20…25%. I guess this
is due to increased rejection sampling to satisfy the
range. This is probably also the reason the MLCG generator
has a higher overhead for the 2^30 range, since it
almost always has to do rejection sampling.

I would say that the overhead is worth the flexibility…

2 Likes

For OTP 24.x or master?

Some of the optimizations we have done in the compiler and JIT have hit the rand module. Not sure whether it results in a noticeable performance improvement.

2 Likes

It would probably be possible to eliminate some of that overhead by optimizing get/1 calls in the JIT similar to how map lookups are optimized.

3 Likes

On a master from 2022-03-01 15:10:34 (6340a42fd).

4 Likes

I realize the BIF option may not be preferred, but out of curiosity I put together a proof of concept, similar to unique_integer, where the random state is stored on each scheduler’s ErtsSchedulerData. There’s also a global state which jumps for each scheduler’s initial state, so each scheduler has its own non-overlapping subsequence.

It uses the Xoroshiro116+ code in rand.erl, as well as SplitMix64 for establishing the initial global state on boot.

Here’s the code (based on master at 86fc3db): WIP - erlang:random_integer/0,1 BIF · potatosalad/otp@e3427a8 · GitHub
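For reference, SplitMix64 is small enough to sketch directly in Erlang (my transliteration of the reference constants, not the code from the branch):

%% SplitMix64 step: returns {Output, NextState}. All arithmetic is
%% masked to 64 bits, so intermediate bignums are immediately cut down.
splitmix64_next(S0) ->
    M = (1 bsl 64) - 1,
    S1 = (S0 + 16#9E3779B97F4A7C15) band M,
    Z1 = ((S1 bxor (S1 bsr 30)) * 16#BF58476D1CE4E5B9) band M,
    Z2 = ((Z1 bxor (Z1 bsr 27)) * 16#94D049BB133111EB) band M,
    {Z2 bxor (Z2 bsr 31), S1}.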

Single process test:

erlperf -c 1 'erlang:random_integer(1000).' 'erlang:phash2(erlang:unique_integer(), 1000).' 'rand:uniform(1000).'
Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                           1   33956 Ki    100%
erlang:phash2(erlang:unique_integer(), 1000).          1   22099 Ki     65%
rand:uniform(1000).                                    1   10102 Ki     29%

64 process test (matches the number of online schedulers):

erlperf -c 64 'erlang:random_integer(1000).' 'erlang:phash2(erlang:unique_integer(), 1000).' 'rand:uniform(1000).'
Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                          64   37147 Ki    100%
erlang:phash2(erlang:unique_integer(), 1000).         64   27644 Ki     74%
rand:uniform(1000).                                   64   14981 Ki     40%

5 Likes

So about 2…3 times faster, it seems. Right?

I think the general structure of your suggestion is appealing,
i.e. one state per scheduler, using the jump function to separate
the schedulers’ sequences.

This would be a no-hard-promises PRNG that could change at any time,
with “good enough” statistical properties and no sequence
repeatability, where the seeding is also automatic and unspecified.

I think we need erlang:random_integer/0, erlang:random_integer/1,
erlang:random_float/0, erlang:random_binary/1, and some system
info item to get the integer size (58 bits today) of the
erlang:random_integer/0 function.

erlang:random_integer/0 would return a non-bignum on 64-bit machines
(58 bits). I think we should not go lower than that on 32-bit
machines and instead take the speed penalty.

erlang:random_integer/1 would take a Range =< 58 bits and do
rejection sampling to satisfy the range. I think we should
not support bignum ranges, or should we? The returned value should be
in the range 0 =< V < Range, i.e. 0-based, never reaching the upper bound.

Internally I think we should start by choosing Xoshiro256++,
i.e. not mess around with 58-bit words, and just shift down
the output to a non-bignum (58 bits). (Maybe Xoshiro256+
for erlang:random_float/0.) Using the machine word size
should improve performance, and Xoroshiro128 is kind of small
today, and not very much faster.
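For illustration, one Xoshiro256++ step written in Erlang with 64-bit masking could look like this (my transliteration of the published algorithm, shifting the output down to 58 bits as suggested above):

-define(M64, ((1 bsl 64) - 1)).

rotl64(X, K) ->
    ((X bsl K) bor (X bsr (64 - K))) band ?M64.

%% One Xoshiro256++ step: returns {Output58, NextState}.
xoshiro256pp_next({S0, S1, S2, S3}) ->
    Out = (rotl64((S0 + S3) band ?M64, 23) + S0) band ?M64,
    T   = (S1 bsl 17) band ?M64,
    S2a = S2 bxor S0,
    S3a = S3 bxor S1,
    S1a = S1 bxor S2a,
    S0a = S0 bxor S3a,
    S2b = S2a bxor T,
    S3b = rotl64(S3a, 45),
    {Out bsr 6, {S0a, S1a, S2b, S3b}}.    % keep the 58 high bits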

3 Likes

We could get 59 bits. 58 bits were used to ensure that the sum of two such numbers (at most 59 bits) would still be a small integer.

2 Likes

That would be possible, since it will be a system limit that may change.
But we should make a guarantee about the smallest possible max range.

To specify the full range, erlang:random_integer(1 bsl 59) would then
have a bignum range, right? That would complicate the code, since this
one range requires a bignum comparison in the BIF to validate the range,
and it would probably make erlang:random_integer(1 bsl 59) noticeably slower
than erlang:random_integer(), although producing exactly the same range.

I think corner cases like this were why we selected a 58-bit word size
for our rand generators.

3 Likes