# Looking for a faster RNG

So far so good, but the next, and I think, hardest problem, is to select
this no-fancy-features algorithm as in how “random” the generated
numbers should be.

There is a whole research field debating the merits of different PRNG
algorithms’ properties and speed versus statistical properties,
predictability stretching in to security and possibly cryptography.

And, if we should go for a “best in class” cryptographical PRNG;
what performance guarantees should we have for all our supported
platforms given that some of them might not have hardware support?
Would this be another nail in the coffin for 32-bit Erlang?

2 Likes

Now I have some numbers.

I used `timer:tc/3 on:

lcg33_bare(Range, Seed, N) →
lcg33_bare(Range, Seed, N, 0).

lcg33_bare(_Range, _R, 0, V) →
V;
lcg33_bare(Range, R, N, V) →
{V1, R1} = lcg33_uniform(Range, R),
lcg33_bare(Range, R1, N-1, V bxor V1).

vs.

rand(Alg, Range, Seed, N) →
S = rand:seed_s(Alg, Seed),
rand_loop(Range, S, N, 0).

rand_loop(_Range, _S, 0, V) →
V;
rand_loop(Range, S, N, V) →
{V1, S1} = rand:uniform_s(Range, S),
rand_loop(Range, S1, N-1, V bxor V1).

Where `lcg33_uniform(Range, R)` is both inlined and implemented
in `rand` as a plug-in.

A loop generating 10000000 numbers in the range 1…2^30 uses
about 40 ns for the inlined generator and 50 ns for the one
in `rand`. So the overhead is about 10 ns or 20%.

The same for the mlcg31 generator which is noticabely slower
gives an overhead of about 15 ns or 25%.

For a range that is not a power of 2 the difference between
the LCG and the MLCG generater is smaller and the overhead
for the LCG generator increases a bit, but all in all
seems to be around 10…15 ns or 20…25%. I guess this
is due to increased rejection sampling to satisfy the
range. This is probably also the reason the MLCG generator
has got a higher overhead for the 2^30 range since it
almost always has to do rejection sampling.

I would say that the overhead is worth the flexibility…

2 Likes

For OTP 24.x or `master`?

Some of the optimizations we have done in the compiler and JIT have hit the `rand` module. Not sure whether it results in a noticeable performance improvement.

2 Likes

It would probably be possible to eliminate some of that overhead by optimizing `get/1` calls in the JIT similar to how map lookups are optimized.

3 Likes

On a master from 2022-03-01 15:10:34 (6340a42fd).

4 Likes

I realize the BIF option may not be preferred, but out of curiosity I put together a proof-of-concept similar to `unique_integer` where the random state is stored on each scheduler’s `ErtsSchedulerData`. There’s also a global state which jumps for each scheduler’s initial state so each scheduler has its own non-overlapping subsequence.

It uses the Xoroshiro116+ code in `rand.erl` as well as SplitMix64 for establishing the initial global state on boot.

Here’s the code (based on master at 86fc3db): WIP - erlang:random_integer/0,1 BIF · potatosalad/otp@e3427a8 · GitHub

Single process test:

``````erlperf -c 1 'erlang:random_integer(1000).' 'erlang:phash2(erlang:unique_integer(), 1000).' 'rand:uniform(1000).'
Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                           1   33956 Ki    100%
erlang:phash2(erlang:unique_integer(), 1000).          1   22099 Ki     65%
rand:uniform(1000).                                    1   10102 Ki     29%
``````

64 process test (matches the number of online schedulers):

``````erlperf -c 64 'erlang:random_integer(1000).' 'erlang:phash2(erlang:unique_integer(), 1000).' 'rand:uniform(1000).'
Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                          64   37147 Ki    100%
erlang:phash2(erlang:unique_integer(), 1000).         64   27644 Ki     74%
rand:uniform(1000).                                   64   14981 Ki     40%
``````
5 Likes

So about 2…3 times faster, it seems. Right?

I think the general structure of your suggestion is appealing,
i.e one state per scheduler, use the jump function to separate
the scheduler’s sequences.

This would be a no hard promises PRNG that could change any time
with “good enough” statistical properties and no sequence
repeatability where also the seeding is automatic and unspecified.

I think we need erlang:random_integer/0, erlang:random_integer/1,
erlang:random_float/0, erlang:random_binary/1 and some system
info to get the integer size (58 bits today) from the
erlang:random_integer/0 function.

erlang:random_integer/0 would return a non-bignum on 64 bit machines
(58 bits). I think we should not go lower than that on 32 bit
machines and instead take the speed penalty.

erlang:random_integer/1 would take a Range =< 58 bits and do
rejection sampling to satisfy the range. I think we should
not support bignum ranges, or? The return a value should be
in the range 0 =< V < Range i.e 0 based, never the upper bound.

Internally I think we should start by choosing Xoshiro256++,
i.e not mess around with 58 bit words, and just shift down
the output to a non-bignum (58 bits). (Maybe Xoshiro256+
for erlang:random_float/0) Using the machine word size
should improve performance and Xoroshiro128 is kind of small,
today, and not very much faster.

3 Likes

We could get 59 bits. 58 bits were used to ensure that addition of two numbers would still be a small.

2 Likes

That would be possible, since it will be a system limit that may change.
But we should make a guarantee about the smallest possible max range.

To specify the full range, `erlang:random_integer(1 bsl 59)` would then
have a bignum range, right? That would complicate the code since this
one range requires a bignum comparision in the BIF to validate the range,
and probably make `erlang:random_integer(1 bsl 59)` noticably slower than
`erlang:random_integer()` although producing exactly the same range.

I think corner cases like this was why we selected 58 bit word size
for our `rand` generators.

3 Likes

Yes, at least in most of the tests I have run, there is between 5-35% performance improvement using `erlang:random_integer/1` compared with the `erlang:phash2(erlang:unique_integer(), Range)` trick and 60-80% performance improvement compared with `rand:uniform/1`.

I also neglected to share the performance results from `erlang:random_integer/0` in my initial comment.

Single process test with `erlang:random_integer/0`:

``````erlperf -c 1 'erlang:random_integer().' 'erlang:unique_integer().' 'rand:uniform().'
Code                             ||        QPS     Rel
erlang:random_integer().          1   40138 Ki    100%
erlang:unique_integer().          1   36524 Ki     91%
rand:uniform().                   1   10316 Ki     26%
``````

64 process test with `erlang:random_integer/0` (matches the number of online schedulers):

``````erlperf -c 64 'erlang:random_integer().' 'erlang:unique_integer().' 'rand:uniform().'
Code                             ||        QPS     Rel
erlang:random_integer().         64   42728 Ki    100%
erlang:unique_integer().         64   36211 Ki     84%
rand:uniform().                  64   16375 Ki     38%
``````

Yeah, I agree: this would not be a full replacement for `rand` and would primarily be useful in very high QPS use-cases (for example: random routing/load balancing).

That makes sense, but would this mostly be for API complete-ness or do you have any specific use-cases that something like `erlang:random_binary/1` would be a perfect fit for that `crypto:strong_rand_bytes/1` or `rand:bytes/1` are sub-optimal for today?

Also makes sense for 32-bit. I figured we would have to address the 32-bit issue before this could be accepted. Right now I’m just blindly casting to `Uint` (all integers are just 64-bit everywhere, anyway, right? ).

Right now: I’m mimicking the `rand:uniform/1` behavior that requires `Range >= 1` while also enforcing `Range =< (1 bsl 58) - 1`. I’m using `v != 0` to terminate the loop, which accomplishes something similar for the “V in the truncated top range” case in the `?uniform_range` macro in `rand.erl`.

One (probably very rare) case that isn’t currently covered is if, for whatever reason, the scheduler’s state gets set to two zeros, any call to `erlang:random_integer/1` will always return `1` forever. The same “no recovery from zero state” behavior can be simulated with `rand` using `rand:uniform_s(1000, {element(1, rand:seed(exrop)), [0|0]}).`. Regardless of the input `N`, the resulting output is always 1 and the state never changes.

Another issue is the initial value for `splitmix64_seed`, which I assumed we could either use some low-grade “entropy” (like system time, number of schedulers, etc; similar to `rand:seed/1`) that could be overridden with something like `erl +randseed 123456`.

Context for my following questions: I don’t know a ton about PRNG and the details of the pros/cons of the various algorithms. So I’m legitimately curious.

Why Xoshiro256++ versus Xoroshiro116+? If I’m understanding correctly, I think the cycles/B is roughly double, unless we were able to get the AVX2/NEON vectorization version to work correctly. If we split Xoshiro256++ for integers and Xoshiro256+ for floats, would the idea be that separate state would be kept on each scheduler for each type of output? If so, I think that would be 32-bytes (or 256-bytes for the SIMD version) stored twice, so either 64-bytes or 512-bytes per scheduler.

5 Likes

Interesting speed figures!

Now, a NIF call overhead should not be much larger (if even) than
a BIF call overhead. So if this could well enough be implemented
in a NIF - that would be very much preferred.

Before getting deeper into the discussion of whether this is an
extension of the VM that we want we must explore implementing this
as a NIF.

I have tried once implementing one of the Xoroshiro generators
with a NIF, and did not get much performance improvement, but that
attempt was adhering to the framework of `rand`, meaning passing the state
back and forth between Erlang and the NIF.

Storing the state in thread specific data should have much less overhead.

:

That makes sense, but would this mostly be for API complete-ness or do you have any specific use-cases that something like `erlang:random_binary/1` would be a perfect fit for that `crypto:strong_rand_bytes/1` or `rand:bytes/1` are sub-optimal for today?

That suggestion I guess was for API completeness.
I have no good use cases - often, I guess, there would be other
requirements so generating binaries would be a too narrow problem.

But I think there are use cases for fast random floats.

Also makes sense for 32-bit. I figured we would have to address the 32-bit issue before this could be accepted. Right now I’m just blindly casting to `Uint` (all integers are just 64-bit everywhere, anyway, right? ).

Let’s postpone this problem for later.

:
:

One (probably very rare) case that isn’t currently covered is if, for whatever reason, the scheduler’s state gets set to two zeros, any call to `erlang:random_integer/1` will always return `1` forever. The same “no recovery from zero state” behavior can be simulated with `rand` using `rand:uniform_s(1000, {element(1, rand:seed(exrop)), [0|0]}).`. Regardless of the input `N`, the resulting output is always 1 and the state never changes.

That should never happen unless you seed it to [0|0], which must be avoided.
[0|0] is not part of the generator’s sequence, or rather; this class of
generators have two sequences; one is the singular [0|0] and the one used
is a full sequence of all other state values.

Another issue is the initial value for `splitmix64_seed`, which I assumed we could either use some low-grade “entropy” (like system time, number of schedulers, etc; similar to `rand:seed/1`) that could be overridden with something like `erl +randseed 123456`.

Possibly, but auto seed only would be prefered.

Also, the Splitmix64 state updates need to be thread safe, right?

Context for my following questions: I don’t know a ton about PRNG and the details of the pros/cons of the various algorithms. So I’m legitimately curious.

Why Xoshiro256++ versus Xoroshiro116+? If I’m understanding correctly, I think the cycles/B is roughly double, unless we were able to get the AVX2/NEON vectorization version to work correctly. If we split Xoshiro256++ for integers and Xoshiro256+, would the idea be that separate state would be kept on each scheduler for each type of output? If so, I think that would be 32-bytes (or 256-bytes for the SIMD version) stored twice, so either 64-bytes or 512-bytes per scheduler.

It is just that Xoshiro256 is recommended by the creator(s) as an
all-purpose generator. Its state is 32 bytes which I do not think
is much to store per scheduler. The jump distance is long: 2^128
compared to 2^64 for Xoroshiro128. If you run a VM very long it is
maybe even possible to use 2^64 random numbers in a scheduler.

The difference between Xhoshiro256++ and Xoshiro256+ is in
the pre state update output scrambling (++ vs.+) so my idea was to use
the same state update and scramble differently between integers and floats
since the + scrambler is slightly faster and its small disadvantage
vs. the ++ scrambler can be hidden when generating floats.

I also suspect that getting down from about 1 ns generation time
per 64-bit integer to a fraction of it will be buried in BIF or NIF
call overhead so there is not much point in looking at e.g. vectorization.
The same reasoning applies between Xoshiro256 vs. Xoroshiro128.

Cheers

3 Likes

I threw together a NIF here: GitHub - potatosalad/erlang-fastrandom

I also started down the path of storing the state in a `reference()`, but found that the performance was nearly identical to pure-Erlang `rand` when it was written as a `rand` “plugin”.

The NIF has support for Xoroshiro116+, Xoshiro256+, Xoshiro256+X8, Xoshiro256++, and Xoshiro256++X8. SplitMix64 is used along with `enif_tsd_get` to store thread specific data similar to how the BIF functions.

However, in its current form, the BIF implementation is still roughly 30-35% more efficient compared to the NIF implementation.

Single process performance:

``````Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                           1   34123 Ki    100%
erlang:phash2(erlang:unique_integer(), 1000).          1   22562 Ki     66%
fastrandom_nif:xoroshiro116p_next(1000).               1   17359 Ki     51%
fastrandom_nif:xoshiro256p_next(1000).                 1   17319 Ki     51%
fastrandom_nif:xoshiro256px8_next(1000).               1   16886 Ki     50%
fastrandom_nif:xoshiro256ppx8_next(1000).              1   16861 Ki     49%
fastrandom_nif:xoshiro256pp_next(1000).                1   16662 Ki     48%
rand:uniform(1000).                                    1   10465 Ki     30%
``````

64 process performance:

``````Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                          64   36578 Ki    100%
fastrandom_nif:xoroshiro116p_next(1000).              64   26174 Ki     71%
fastrandom_nif:xoshiro256p_next(1000).                64   20460 Ki     56%
erlang:phash2(erlang:unique_integer(), 1000).         64   18774 Ki     51%
fastrandom_nif:xoshiro256pp_next(1000).               64   15208 Ki     41%
fastrandom_nif:xoshiro256ppx8_next(1000).             64   15018 Ki     41%
fastrandom_nif:xoshiro256px8_next(1000).              64   13834 Ki     37%
rand:uniform(1000).                                   64    9105 Ki     24%
``````

That makes sense, thanks for the explanation

Normally: yes. However, at first glance, it looked like `erts_init_scheduling` calls `init_scheduler_data` in a `for` loop, which would result in `erts_sched_bif_unique_init` being called sequentially on a single thread. So…I think `SplitMix64` is only ever called on a single thread. However, if I am misinterpreting the code there, then yes: some sort of lock would need to be added.

That makes sense, thank you for the explanation.

What are your thoughts on the direction of the experiments so far? I think given the fairly small performance improvement between the phash2/unique integer trick and the NIF, we’ll likely stick with the former for the time being. However, the BIF implementation could potentially be useful to us for very high QPS random-based load balancing workloads.

4 Likes

I threw together a NIF here: GitHub - potatosalad/erlang-fastrandom

I also started down the path of storing the state in a `reference()`, but found that the performance was nearly identical to pure-Erlang `rand` when it was written as a `rand` “plugin”.

I have once followed that path, and arrived to the same conclusion…

The NIF has support for Xoroshiro116+, Xoshiro256+, Xoshiro256+X8, Xoshiro256++, and Xoshiro256++X8. SplitMix64 is used along with `enif_tsd_get` to store thread specific data similar to how the BIF functions.

However, in its current form, the BIF implementation is still roughly 30-35% more efficient compared to the NIF implementation.

Single process performance:

``````Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                           1   34123 Ki    100%
erlang:phash2(erlang:unique_integer(), 1000).          1   22562 Ki     66%
fastrandom_nif:xoroshiro116p_next(1000).               1   17359 Ki     51%
fastrandom_nif:xoshiro256p_next(1000).                 1   17319 Ki     51%
fastrandom_nif:xoshiro256px8_next(1000).               1   16886 Ki     50%
fastrandom_nif:xoshiro256ppx8_next(1000).              1   16861 Ki     49%
fastrandom_nif:xoshiro256pp_next(1000).                1   16662 Ki     48%
rand:uniform(1000).                                    1   10465 Ki     30%
``````

64 process performance:

``````Code                                                  ||        QPS     Rel
erlang:random_integer(1000).                          64   36578 Ki    100%
fastrandom_nif:xoroshiro116p_next(1000).              64   26174 Ki     71%
fastrandom_nif:xoshiro256p_next(1000).                64   20460 Ki     56%
erlang:phash2(erlang:unique_integer(), 1000).         64   18774 Ki     51%
fastrandom_nif:xoshiro256pp_next(1000).               64   15208 Ki     41%
fastrandom_nif:xoshiro256ppx8_next(1000).             64   15018 Ki     41%
fastrandom_nif:xoshiro256px8_next(1000).              64   13834 Ki     37%
rand:uniform(1000).                                   64    9105 Ki     24%
``````

:

What are your thoughts on the direction of the experiments so far? I think given the fairly small performance improvement between the phash2/unique integer trick and the NIF, we’ll likely stick with the former for the time being. However, the BIF implementation could potentially be useful to us for very high QPS random-based load balancing workloads.

About the benchmark: can we get an estimate of the benchmark overhead
by also benchmarking the ‘ok’ expression or ‘self()’ (that I IIRC is a VM
instruction) to see if it might be that the BIF implementation is
even more efficient when accounting for overhead.

Also, ‘rand:uniform(1000)’ stores the state in the process dictionary,
which might be noticeably slower than keeping it in a state variable.
But I can not find any way to iterate over a state in the erlperf framework,
so I will measure in the rand_SUITE:measure to see how much process dictionary
state vs state variable differs.

As it stands now, fastrandom_nif:exrop has about 1.4-2.0 the execution time of
erlang:random_integer, and rand:uniform has about 1.7-3.0 the execution
time of fastrandom_nif:exrop.

When accounting for overhead, erlang:random_integer might be better than that,
and when measuring without process dictionary fastrandom_nif might be not as good.

These might be arguments in favour for a BIF solution…

About the fastrandom_nif code:

I sat down to do some moonshine programming, when I saw that you
posted very much the code I intended to write - thank you very much!

I can not see any way to optimize the code…

• The ending in all *_next_0() functions is `output %= UINT58MASK;` but should
be `output &= UINT58MASK;`.

• All *_next_1() functions check the range and throw a badarg “Bad
argument: ‘Seed’” but the erroneous argument is the range.

• The same functions check the max range as bad if n > UINT58MASK
but that should be n > TWO_POW58 (n > 1<<58).

• The same functions have a range check-reject-retry loop that returns
a value in the range [1, n] for a range n, but the *_next_0() functions
return a value in the range [0, (2^58 - 1)]. This might be exactly
what we want, or do we want the range [0, (n-1)] for the *_next_1()
functions?

• There is a tsd_get, alloc, ctx_jump, tsd_set section in all next()
functions that I would have made a helper function of.

• fastrandom_nif_global_seed should not be set to 0 - we should find some
entropy; future improvement I guess… We could take it from
load_info to make it an Erlang problem.

• The 116 bit generator should not be seeded with 64 bit seed words,
and although the Splitmix64 generator will not output 2 zeros in a row,
after masking down to 58 bit you might get just that; so a loop that
avoids two zeros (or just zeros) is needed whan seeding (with masking
to 58 bits) this generator.

Cheers!

3 Likes

My understanding/finding during my experimentation on SFMT and TinyMT on Erlang from 2010 to 2012 was that the overhead of function callings themselves became the fundamental overhead of generating random numbers as a sequence with state manipulation. See my TinyMT paper Section 4.3 for the details. Quote:

While TinyMT is inherently simpler than SFMT, that does not necessarily mean TinyMT is faster for Erlang. We think this is due to the execution overhead time of each function in BEAM.

If you really need a blazingly fast random number generator, you need to consider a batch generation of 100s or 1000s of random number sequence in NIFs as a list and simply get the head element out of the list pool.

4 Likes

There is indeed a visible overhead for benchmarking itself:

``````erlperf 'ok.' 'rand:uniform(100).'
Code                       ||        QPS     Rel
ok.                         1   31397 Ki    100%
rand:uniform(100).          1   10029 Ki     31%
``````

Mainly because benchmark calls `atomics:add`. One can’t avoid it with `timer:tc`(because of the looping overhead). I plan to address this by using less expensive benchmark approach in the next version.

3 Likes

Also, a loop over a state is a missing feature in this case.
(Shameless feature request

3 Likes

Now I have tried that.

I took Andrew Bennet’s fastrandom_nif and changed it to have a fixed array of 32 generated numbers, to fill that array and return a list of numbers from that array. Array size 100 gave a hardly noicable improvement while 10 was noticably worse. I chose to measure on Xoshiro256++, and compare to (also Andrew Bennet’s) erlang:random_integer/0,1 that implements Xoroshiro116+ which is a few percent faster algorithm according to Andrew’s numbers.

Then I have measured by modifyng the rand_SUITE:measure/1 test case.

Numbers from my Dell Latitude 7300 laptop, per generated number, amortized, about 10 ns is spent on measure/1 overhead, the figures below are compensated for that:

Both cached fastrandom:xoshiro256pp_next/0 and erlang:random_integer/0 (no range limiting) use 6 ns

For generation with range 10000 both cached fastrandom:xoshiro256pp_next/1 and erlang:random_integer/1 use 12 ns.

For the awkward case half range + 1 (about 50% probability to discard and retry) cached fastrandom:xoshiro256pp_next/1 uses 12 ns while erlang:random_integer/1 uses 19 ns.

The `rand` framework has got another 20 ns overhead besides measure/1 overhead.
The default `exsss` algorithm in `rand` uses 30 ns besides overhead.
For the same algorithm small range 10000 limiting instead uses 50 ns besides overhead, and the awkward half range + 1 range limiting uses 75 ns besides overhead.

So, using a NIF that generates and converts 32 numbers to a list per invocation, and picking cached numbers from the head of the list reaches the same or slightly better performance than the erlang:random_integer/0,1 BIF implementation generating one number at the time.

Flipping the coin: the erlang:random_integer/0,1 BIF implementation reaches almost the same performance as generating in batches and picking numbers from the head of a list, with much less memory footpring and without the hassle of keeping cached numbers in yer state, having to decide the range for all numbers when generating the cache, etc…

With these low figures of number generation time; what you actually do with the numbers gets very significant, it might be so that generating a number in 6 ns vs. in 95 ns is irrelevant compared to application processing, and in other applications it might be an order of a magnitude difference. For some applications it might be no problem to pre-generate and handle a cache with known range numbers, and in others it may be very awkward.

A complete solution for cached PRNG numbers would need the ability to control how many numbers that are generated per cache refill, and being able to generate floating point numbers.

Both cached NIF generated PRNG numbers and erlang:random_integer/0,1 numbers can be generated in about 1/8 … 1/5 of the time required by the default `rand` generator i.e they are about 5 to 8 times faster / leaner. (but lack e.g repeatability and jumping)

Cheers!

2 Likes

I have done some more benchmarking and it turns out that bjorng was right from the start…

A simple Linear Congruential Generator is very fast, and can be an alternative if its substandard statistical qualities are ok for your application. To be fast it has to avoid bignum operations and fancy theoreticaly correct ways to implement a random number range.

I also noted that the remainder for integer division operator ‘rem’ is expensive - a bit mask is much faster.

After compensating for benchmark overhead, on my machine, (amortized) generation time per number (integer):

``````Generator	Full Range (ns)		Range 10000 (ns)
---------	---------------		----------------
LCG35		4			14
BIF  		6			12
NIF + cache	6			12
NIF  		18			25
unique + phash 18			25
rand		50			60
``````

BIF is Andrew Bennet’s erlang:random_integer/0,1.

NIF is Andrew Bennet’s fastrandom, Xoshiro256++.

NIF + cache is the same’s Xoshiro256+ with a 32 number list cache.

unique + phash is, simply erlang:phash2(erlang:unique_integer()[, Range]).

rand is the default algorithm in the ‘rand’ module (116 bit), not using the process dictionary.

LCG35 is the following generator:

``````lcg35(R) ->
(15319397 * R + 15366142135) band ((1 bsl 35) - 1).
``````

The range 10000 operation is done with `rem 10000` which is imperfect but mostly good enough for such a small range. All other generators produce the range in a theretically correct way with reject and retry.

An LCG generator has got bad statistical properties for the low bits - see the Wikipedia article. I have also tried a smaller Multiplicative Congruential Generator that should not have that particular flaw, but, it is smaller and have other pecularities. Other than that it is roughly equally fast thanks to being based on a Mersenne prime which facilitates an optimization of the ‘rem’ operation:

``````-define(MCG31_M, 163490618).
mcg31(R0) ->
X = ?MCG31_M * R0,
%%
%% X rem ?MCG31_M optimized for Mersenne prime
R1 = (X band ((1 bsl 31) - 1) + (X bsr 31),
if
?MCG31_M =< R1 -> R1 - ?MCG31_M;
true           -> R1
end.
``````

I have also looked at a <35 bit MCG generator that does not use a Mersenne prime so it was harder to optimize, and it turned out to be only slightly faster than NIF and unique + phash. Therefore I think it is probably not very interesting, but I can publish it if requested.

Generator constants from “Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure” by Pierre L’Ecuyer (1999).

Cheers!
/ Raimo Niskanen, Erlang/OTP

5 Likes

And now I created a PR with some improvements in this area:

``https://github.com/erlang/otp/pull/5836``
8 Likes

Sure thing!

I have rewritten `erlperf` in attempt to reduce benchmarking overhead to a bare minimum (now it is essentially a single local function call, estimated at 1-2 ns). And of course added this missing feature. I can confirm that new niche RNGs are indeed faster (up to 7x compared to exsss):

``````./erlperf 'run(_, S) -> rand:mcg35(S).' --init_runner "1." 'run(_, S) -> {_, S1} = rand:uniform_s(S), S1.' --init_runner 'rand:seed(exsss).' 'run(_, S) -> rand:lcg35(S).' --init_runner "1."
Code                                                  ||        QPS       Time     Rel
run(_, S) -> rand:lcg35(S).                            1     129 Mi       7 ns    100%
run(_, S) -> rand:mcg35(S).                            1     122 Mi       8 ns     94%
run(_, S) -> {_, S1} = rand:uniform_s(S), S1.          1   20079 Ki      49 ns     15%
``````

Decrypting the incantation above:

1. Run `rand:mcg35(S)` and use the result as the input for the next iteration (the feature you requested)
2. Same but for `uniform_s`. Unfortunately it returns a bit more complicated structure (random number and a new state).
3. Same, with `lcg35`.

In general, results reported by erlperf 2.0 seem to match numbers posted in your PR (except for `crypto:strong_rand_bytes` that takes only ~800 ns on my machine (AMD 3.2 GHz desktop CPU).

Now the only issue is they all require state saved somewhere. Which means we’ll continue using a wrapper… `rand:lcg35(erlang:unique_integer()).`. It is still blazingly fast, 2x faster compared to `erlang:phash2(erlang:unique_integer()).`
But I just don’t like wrappers… especially this one, that is `-define(RNG(), rand:lcg35(erlang:unique_integer())).`

4 Likes