Yes, I thought about that, and since there was no huge difference I worried about the peak memory consumption, which often does not show up in benchmarks. Hence I stayed with shrinking the map while building the result list.
But I guess that many of the other algorithms suggested here happily do maps:to_list/1, which also has both the list and the map alive at one point.
Premature optimization for memory consumption, I guess…
Erlang’s maps are not an ordered data structure, so here you will get the elements in the map’s internal order, not in key order, since there is no key order.
OK, I see what they are doing there. It’s the same argument used to show that you can’t make uniformly random permutations using randomly flipping switches, because log(n!) is not a power of 2. It’s mathematically watertight. The question is whether it has any practical significance, and if it does, how hard it is to overcome.
There are M^N possible random labellings, which have to be distributed among N! permutations, and M^N mod factorial(N) is not in general 0.
Now the bias here is bounded by factorial(N)/M^N, and to a first approximation factorial(N) is sqrt(2*pi*N) * (N/e)^N, so we’re looking at the odds of various permutations differing by about sqrt(2*pi*N) * (N/(e*M))^N.
This is a VERY SMALL NUMBER. Less than “one proton out of the entire universe” small for nontrivial N.
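Since the bound itself underflows any float, it helps to evaluate the Stirling approximation above in log space. A sketch (module and function names are mine):

```erlang
-module(bias_bound).
-export([log2_bound/2]).

%% log2 of sqrt(2*pi*N) * (N/(e*M))^N, the Stirling approximation of
%% factorial(N)/M^N, computed in log space because the value itself
%% underflows double precision for any interesting N.
log2_bound(N, M) ->
    Log2E = math:log2(math:exp(1)),
    0.5 * math:log2(2 * math:pi() * N)
        + N * (math:log2(N) - Log2E - math:log2(M)).
```

For N = 100 and M = 2^53 this comes out near -4800, i.e. a bias around 2^-4800, which is indeed far below “one proton in the universe” (roughly 2^-300).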
The argument is more general. Suppose you want to generate random permutations of N things, and suppose you have a procedure that generates a finite number f(N) of random numbers of size M. Then your state space is of size M^f(N), and the distribution will not be perfectly uniform unless N! divides M^f(N). In our case M is 2**53, so M^f(N) is a power of 2, but N! is NOT a power of 2, so we’re never going to get a perfectly uniform distribution.

Since the Fisher-Yates shuffle generates f(N) = N-1 random numbers, it is subject to the same stricture. The usual argument for why the F-Y shuffle works relies on the assumption that at each step we draw a perfectly uniform integer over N, N-1, …, 2 alternatives, but if you start with random floats and compute floor(K*uniform()) at step K, you’re starting with 53 bits and don’t get a perfectly uniform distribution over the integers 0..K-1. To get this right, you have to do something like “do T = uniform_integer() while (T >= g(K)); return T % K”. I’m pretty sure Knuth discusses this somewhere in AoCP vol 2. Swift gets this right in its Random.swift generator for integers.

My partitioning algorithm has the same kind of issue as the Fisher-Yates shuffle: if you just do N*uniform() < K it’s going to be a wee bit biased; you need uniform(N) < K with uniform(N) unbiased thanks to a rejection cycle.
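The rejection cycle can be sketched like this (my helper names; rand:uniform_s/2 in OTP already does the equivalent internally):

```erlang
-module(unbiased).
-export([uniform_int/2]).

%% Draw an unbiased integer in 0..K-1 given a raw source of integers
%% in 0..M-1: reject any draw at or above the largest multiple of K
%% that fits in M, then reduce mod K. Without the rejection loop the
%% low residues would be very slightly more likely than the high ones.
uniform_int(K, M) when 0 < K, K =< M ->
    Limit = M - M rem K,                %% largest multiple of K =< M
    T = rand:uniform(M) - 1,            %% raw draw in 0..M-1
    if
        T < Limit -> T rem K;           %% accepted: exactly uniform
        true      -> uniform_int(K, M)  %% rejected: try again
    end.
```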
Also, how will the bias show? If one just ignores the duplicates, it should be that some elements appear in the same order as in the original list. But I have not seen any text that examines what happens if such sublists of colliding numbers are randomly permuted themselves, kind of like your divide-and-permute suggestion. Would that make a perfect distribution, or are there still mathematical anomalies…?
Now let’s consider duplicates.
Suppose we have a source of [0,M) random numbers. Think for example of the UNIX lrand48() function with M = 2**31.
To get a sample of N items with no duplicates, we have M choices, then M-1, …, then M-N+1.
There are M to the falling N ways to do this out of M to the N ways to pick N numbers without worrying about duplicates.
Now computing this probability requires serious care to work within the limits of floating-point arithmetic (overflow, catastrophic cancellation, roundoff), but for values of N that are much smaller than M, we can for practical purposes forget about duplicates. If M is small, as in the case of lrand48(), we have to worry even for N = 1000. If your RNG is giving you 53 bits per float, N = 1000000 is not a problem.
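As a rough check on those thresholds, the standard birthday approximation 1 - exp(-N(N-1)/(2M)) can be evaluated directly (a sketch with my naming; it sidesteps the exact falling-factorial product and its overflow problems):

```erlang
-module(birthday).
-export([collision_prob/2]).

%% Approximate probability that N draws from [0, M) contain at least
%% one duplicate, using the birthday approximation
%% 1 - exp(-N*(N-1)/(2*M)). Erlang's / is always float division, so
%% the big integers N*(N-1) and 2*M never overflow on the way in.
collision_prob(N, M) when is_integer(N), is_integer(M), 0 < N, 0 < M ->
    1.0 - math:exp(-(N * (N - 1)) / (2 * M)).
```

For N = 1000 and M = 2**31 (lrand48) this gives about 2e-4 per shuffle, so duplicates do occur in practice; for N = 1000000 and M = 2**53 it is down around 6e-5.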
An obvious way to increase M is to use two random numbers as a tag.
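A sketch of that tagging (hypothetical helper; Erlang tuples compare element by element, so a pair drawn from [1, M] twice behaves like one draw from a range of size M^2):

```erlang
-module(tagged).
-export([tag_pair/3]).

%% Tag element X with two random numbers. When the tagged terms are
%% sorted, {A, B, _} tuples compare on A first and then B, so the
%% pair acts as a single random key from a range of size M*M,
%% squaring the effective M and pushing collisions much further out.
tag_pair(X, M, State0) ->
    {A, State1} = rand:uniform_s(M, State0),
    {B, State2} = rand:uniform_s(M, State1),
    {{A, B, X}, State2}.
```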
@SisMaker is right that maps:values/1 should work.
When we insert n values with integer keys into a map, the internal order is always the same, regardless of the order in which we insert the values. That means that maps:values/1 will do an additional permutation of the already shuffled values, which should not make the list any more ordered or introduce any bias. That is, the first clause can be changed to:
shuffle_s([], State, M) ->
    {maps:values(M), State};
But the list is not uniformly shuffled until the sort by random key has been done.
Without that we have just the map’s internal order permutation, which is indeed based on an internal hash function, but it is not designed to be a uniformly distributed random permutation from a user-controlled seed. It is just an unknown permutation.
Edit, added
Wait… Now I realize what you are getting at…
The terms don’t have to be sorted; it is enough that they are ordered based on the random key. The map’s term order will do just fine; any order will do.
And then I realized another caveat: just inserting terms into a map, with possibly duplicate keys, will make the insert compare the values, not just the keys, when there is a duplicate key.
This is a theoretical problem. Theoretical problems are often impossible to notice until they hit you in the face…
It may be that I want to shuffle terms that are not at all suitable to compare, maybe so big that comparing them would topple the VM… So an ideal algorithm should never look at the values, just like keysort (after we cleaned that up).
No, a map is never a bag. There is only ever one value associated with a key, and there is never any need to do any comparisons of the values put into a map.
The pseudo-random number generators in the rand module are all based on integer arithmetic, mostly xorshift and similar. Floating-point arithmetic is only used when creating a floating-point return value. The functions that return an integer on a range have been implemented with rejection sampling, so they are bias-free by design.
The generators have at least 56 bits as their raw output value.
The rule of thumb for getting away with collisions being unnoticeable is to stay below half the generator width.
So when I shuffle a list of 2^28 elements, about 268_000_000 elements, it should require 4 GB just to store the list, and more to shuffle it; it takes some 20 minutes on my laptop. I would need about 2^28 (or was it 2^56?) samples of shuffling to detect as small a bias as we are talking about here.
Nevertheless, one can see that it might become possible to detect this bias on very powerful computers in a few decades. Therefore it would be nice to have a proof that the algorithm we choose is bias-free and doesn’t crumble for lists with lengths above 2^28.
Using two random numbers would be unwanted overhead for small lists, but I guess calling length/1 on the list to decide the number of random numbers would probably not be noticed.
Sorry that I have a hard time letting this go, but I tried to implement this.
Here is the result:
%% Random-split-and-shuffle algorithm suggested by Richard A. O'Keefe
%% on ErlangForums, as I interpreted it...
%%
%% Randomly split the list in two lists,
%% recursively shuffle the two smaller lists,
%% randomize the order between the lists according to their size.
shuffle5_r([], State, _Length, Acc) ->
    {Acc, State};
shuffle5_r([X], State, _Length, Acc) ->
    {[X | Acc], State};
shuffle5_r([X, Y], State0, _Length, Acc) ->
    {V, State1} = uniform_s(2, State0),
    case V of
        1 ->
            {[Y, X | Acc], State1};
        2 ->
            {[X, Y | Acc], State1}
    end;
shuffle5_r([X, Y, Z], State0, _Length, Acc) ->
    {V, State1} = uniform_s(6, State0),
    case V of
        1 ->
            {[Z, Y, X | Acc], State1};
        2 ->
            {[Y, Z, X | Acc], State1};
        3 ->
            {[Z, X, Y | Acc], State1};
        4 ->
            {[X, Z, Y | Acc], State1};
        5 ->
            {[Y, X, Z | Acc], State1};
        6 ->
            {[X, Y, Z | Acc], State1}
    end;
shuffle5_r(List, State0, Length, Acc0) when is_integer(Length) ->
    {Left, LeftLength, Right, State1} = shuffle5_split(List, State0),
    {V, State2} = uniform_s(Length, State1),
    RightLength = Length - LeftLength,
    if
        V =< LeftLength ->
            {Acc1, State3} =
                shuffle5_r(Left, State2, LeftLength, Acc0),
            shuffle5_r(Right, State3, RightLength, Acc1);
        true ->
            {Acc1, State3} =
                shuffle5_r(Right, State2, RightLength, Acc0),
            shuffle5_r(Left, State3, LeftLength, Acc1)
    end.
shuffle5_split(L, State) ->
    shuffle5_split(L, State, 1, [], 0, []).
%%
shuffle5_split([], State, _P, Left, N, Right) when is_integer(N) ->
    {Left, N, Right, State};
shuffle5_split(L, State0, 1, Left, N, Right) when is_integer(N) ->
    M = 1 bsl 56,
    case rand:uniform_s(M, State0) of
        {V, State1} when is_integer(V) ->
            shuffle5_split(L, State1, (V - 1) + M, Left, N, Right)
    end;
shuffle5_split([X | L], State, P, Left, N, Right)
  when is_integer(P), is_integer(N) ->
    case P band 1 of
        0 ->
            shuffle5_split(L, State, P bsr 1, [X | Left], N + 1, Right);
        1 ->
            shuffle5_split(L, State, P bsr 1, Left, N, [X | Right])
    end.
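To help with the proofreading, here is a self-contained copy of shuffle5_split (guards trimmed) together with a small invariant check I added myself: the two halves must partition the input, and the reported left length must be exact.

```erlang
-module(shuffle5_demo).
-export([demo/0]).

%% Self-contained copy of shuffle5_split/2 from the post above.
shuffle5_split(L, State) ->
    shuffle5_split(L, State, 1, [], 0, []).

shuffle5_split([], State, _P, Left, N, Right) ->
    {Left, N, Right, State};
shuffle5_split(L, State0, 1, Left, N, Right) ->
    %% Random bits exhausted: refill with 56 fresh bits plus a
    %% sentinel top bit, so P band 1 / P bsr 1 consume them one by one.
    M = 1 bsl 56,
    {V, State1} = rand:uniform_s(M, State0),
    shuffle5_split(L, State1, (V - 1) + M, Left, N, Right);
shuffle5_split([X | L], State, P, Left, N, Right) ->
    case P band 1 of
        0 -> shuffle5_split(L, State, P bsr 1, [X | Left], N + 1, Right);
        1 -> shuffle5_split(L, State, P bsr 1, Left, N, [X | Right])
    end.

demo() ->
    State0 = rand:seed_s(exsss, {1, 2, 3}),
    In = lists:seq(1, 100),
    {Left, N, Right, _State1} = shuffle5_split(In, State0),
    N = length(Left),                %% length bookkeeping holds
    In = lists:sort(Left ++ Right),  %% no element lost or duplicated
    ok.
```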
Since it is by far the fastest I have tried, I would very much appreciate it if @nzok and others, the more the better, could proofread it. It executes in about 2/3 of the time for 2^20 elements of the one currently suggested in Random shuffle of lists by RaimoNiskanen · Pull Request #10281 · erlang/otp · GitHub, and I think it has better time complexity, since it finishes 2^28 elements in under 8 minutes on my laptop.
So the only question is if it is correct, and unbiased…
I have a suggestion (nitpick) for improving the shuffle5_split/6 function. As currently written, the three clauses will be attempted sequentially in order. Internally in the compiler, the function will be rewritten to something similar to this:
shuffle5_split(List, State0, P, Left, N, Right) when is_integer(P), is_integer(N) ->
    case List of
        [] ->
            {Left, N, Right, State0};
        _ ->
            case P of
                1 ->
                    M = 1 bsl 56,
                    case rand:uniform_s(M, State0) of
                        {V, State1} when is_integer(V) ->
                            shuffle5_split(List, State1, (V - 1) + M, Left, N, Right)
                    end;
                _ ->
                    case List of
                        [X | L] ->
                            case P band 1 of
                                0 ->
                                    shuffle5_split(L, State0, P bsr 1, [X | Left], N + 1, Right);
                                1 ->
                                    shuffle5_split(L, State0, P bsr 1, Left, N, [X | Right])
                            end
                    end
            end
    end.
Adding a pattern to the second clause like so:
shuffle5_split([], State, _P, Left, N, Right) when is_integer(N) ->
    {Left, N, Right, State};
shuffle5_split([_ | _] = L, State0, 1, Left, N, Right) when is_integer(N) ->
    M = 1 bsl 56,
    case rand:uniform_s(M, State0) of
        {V, State1} when is_integer(V) ->
            shuffle5_split(L, State1, (V - 1) + M, Left, N, Right)
    end;
shuffle5_split([X | L], State, P, Left, N, Right)
  when is_integer(P), is_integer(N) ->
    case P band 1 of
        0 ->
            shuffle5_split(L, State, P bsr 1, [X | Left], N + 1, Right);
        1 ->
            shuffle5_split(L, State, P bsr 1, Left, N, [X | Right])
    end.
will result in better code. Here is how the compiler will rewrite the code:
shuffle5_split(List, State0, P, Left, N, Right) when is_integer(P), is_integer(N) ->
    case List of
        [X | L] ->
            case P of
                1 ->
                    M = 1 bsl 56,
                    case rand:uniform_s(M, State0) of
                        {V, State1} when is_integer(V) ->
                            %% Note: the whole list is passed on here,
                            %% not the tail, since this step only
                            %% refills the random bits.
                            shuffle5_split(List, State1, (V - 1) + M, Left, N, Right)
                    end;
                _ ->
                    case P band 1 of
                        0 ->
                            shuffle5_split(L, State0, P bsr 1, [X | Left], N + 1, Right);
                        1 ->
                            shuffle5_split(L, State0, P bsr 1, Left, N, [X | Right])
                    end
            end;
        [] ->
            {Left, N, Right, State0}
    end.
Thinking a little bit more, I don’t see why there is any need to randomize the order in which the left and right partitions are concatenated. The randomization done in the partitioning should be enough.
That is, there should not be any need to keep track of the lengths, and the last clause of shuffle5_r() could be simplified to:
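For concreteness, here is my reading of that simplification as a self-contained sketch (the names and the small-list handling are mine, and the explicit 2- and 3-element permutation clauses are dropped along with the length tracking; whether this stays unbiased is exactly the question):

```erlang
-module(shuffle5_simple).
-export([shuffle/2]).

%% Simplified random-split-and-shuffle: recursively split the list at
%% random, shuffle each half, and always append left before right; no
%% length tracking and no coin flip for the concatenation order.
shuffle(List, State0) ->
    shuffle_r(List, State0, []).

shuffle_r([], State, Acc) ->
    {Acc, State};
shuffle_r([X], State, Acc) ->
    {[X | Acc], State};
shuffle_r(List, State0, Acc0) ->
    {Left, Right, State1} = split(List, State0),
    {Acc1, State2} = shuffle_r(Left, State1, Acc0),
    shuffle_r(Right, State2, Acc1).

split(L, State) ->
    split(L, State, 1, [], []).

split([], State, _P, Left, Right) ->
    {Left, Right, State};
split(L, State0, 1, Left, Right) ->
    %% Refill: 56 fresh random bits plus a sentinel top bit.
    M = 1 bsl 56,
    {V, State1} = rand:uniform_s(M, State0),
    split(L, State1, (V - 1) + M, Left, Right);
split([X | L], State, P, Left, Right) ->
    case P band 1 of
        0 -> split(L, State, P bsr 1, [X | Left], Right);
        1 -> split(L, State, P bsr 1, Left, [X | Right])
    end.
```

Termination for a 2-element list relies on the split eventually separating the pair, which happens with probability 1.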
I have come to the conclusion that you are perfectly right, and I wrote an explanation of how I think the algorithm actually works in the PR I have already linked to a few times in this thread.