Benchmarking of maps and the proposed lists:uniq/1

bjorng · February 24, 2022, 7:20am

UPDATE: This post has been edited because the uniq_all_at_once/1 is broken. See below.

I wanted to share the results of some benchmarking I’ve done as part of evaluating the proposed lists:uniq/1 function.

First I compared two functions that create a map in two different ways. This one creates a map by adding one element at a time to an initially empty map:

one_by_one(Data) ->
    one_by_one(Data, #{}).

one_by_one([Key|T], Map) ->
    one_by_one(T, Map#{Key => {Key, Key+1}});
one_by_one([], Map) ->
    Map.

The other one creates a list of {Key, Value} pairs and then turns it into a map using maps:from_list/1:

all_at_once(Data) ->
    all_at_once(Data, []).

all_at_once([Key|T], Acc) ->
    all_at_once(T, [{Key, Key+1}|Acc]);
all_at_once([], Acc) ->
    maps:from_list(Acc).

So which one is fastest? Here is the result from the benchmark run (using Erlang/OTP 24.2):

1> mbench:run().
generated 2000000 data items
one_by_one/1        : 1.825983
all_at_once/1       : 0.504215
   .
   .
   .
ok

Before all of you start rewriting of all map updating code , note that the map in the benchmark had two million elements. Unless your code regularly have to handle that many elements or if part of a frequently used library or the Erlang compiler, you probably don’t have to worry about the performance difference.

I then turned to benchmarking two possible implementations of lists:uniq/1. Here is the first, similar to the one proposed by @cons in the pull request:

uniq_one_by_one(L) ->
    uniq_one_by_one(L, #{}).

uniq_one_by_one([X|Xs], M) ->
    case is_map_key(X, M) of
        true ->
            uniq_one_by_one(Xs, M);
        false ->
            [X|uniq_one_by_one(Xs, M#{X => true})]
    end;
uniq_one_by_one([], _) ->
    [].

The other one constructs the map by using maps:from_keys/1 in one go (UPDATED: but it is wrong):

uniq_all_at_once(L) ->
    M = maps:from_keys(L, true),
    uniq_all_at_once(L, M).

uniq_all_at_once([X|Xs], M) ->
    case is_map_key(X, M) of
        true ->
            uniq_all_at_once(Xs, M);
        false ->
            [X|uniq_all_at_once(Xs, M)]
    end;
uniq_all_at_once([], _) ->
    [].

(UPDATED: is_map_key/1 will always return true, so the result will always be [].)

Based on the previous benchmarking of map creation, it is no surprise that uniq_all_at_once/1 is almost three times as fast (UPDATE: an incorrect program can run arbitrarily fast):

uniq_one_by_one/1   : 1.942674
uniq_all_at_once/1  : 0.736359

However, those results were produced with an input list where all items are unique. If I change the input list to two million identical elements, uniq_one_by_one/1 is the winner:

uniq_one_by_one/1   : 0.007935
uniq_all_at_once/1  : 0.085007

Therefore, in my opinion, if lists:uniq/1 is to be included in the lists module, the implementation should be based on uniq_one_by_one/1.

Finally, I wanted to know how those implementations compared to lists:usort/1, which can be used if you don’t care about the order of the elements. So I added this function to the benchmark:

uniq_by_usort(L) ->
    lists:usort(L).

Here are the results:

uniq_one_by_one/1   : 1.942674
uniq_by_usort/1     : 0.030334

Seeing that sorting can make the unique operation an order of magnitude faster, I then tried to implement uniq/1 by sorting while still keeping the order:

uniq_by_sofs(L0) ->
    L = number(L0, 0),
    S0 = sofs:relation(L),
    S1 = sofs:relation_to_family(S0),
    S2 = sofs:converse(S1),
    S3 = sofs:to_external(S2),
    [E || {_,E} <- S3].

number([H|T], N) ->
    [{H,N}|number(T, N+1)];
number([], _) ->
    [].

When all elements in the input list are unique, uniq_by_sofs/1 turned out to be the fastest function:

uniq_one_by_one/1   : 1.934827
uniq_by_sofs/1      : 0.534989

When all elements are identical, uniq_one_by_one/1 is again the winner:

uniq_one_by_one/1   : 0.007935
uniq_by_sofs/1      : 0.349756

Here is my benchmark module:

gist.github.com

https://gist.github.com/bjorng/a93d0e9d703f0543c583870c5529eeb1

mbench.erl

-module(mbench).
%%-compile(export_all).
-export([run/0]).

run() ->
    Limit = 2_000_000,
    Data = gen_data(0, Limit, []),
    io:format("generated ~p data items\n", [Limit]),
    %%io:format("~P\n", [Data,10]),
    Tests = [{"one_by_one/1",fun one_by_one/1},

This file has been truncated. show original

elbrujohalcon · February 24, 2022, 7:31am

Hummm… I don’t think uniq_all_at_once/1 will work as expected.
I bet it will always return [] since is_map_key(X, M) will always return true.

bjorng · February 24, 2022, 7:37am

Yes, you are right, of course.

Note to self: Always add test cases for the functions I benchmark. If it isn’t tested, it doesn’t work.

josevalim · February 24, 2022, 10:17am

I wonder if this approach could work:

Pair all elements in the list with their index. So [a, b, c] becomes [[a | 0], [b | 1], [c | 2]]
Perform a usort on the head of the list
Perform a sort on the tail of the list

Now traverse the original list matching indexes on the sorted list. Something like:

 keep_indexed([H | T], I, [[_ | I] | Is]) ->
   [H | keep_indexed(T, I + 1, Is)];
 keep_indexed([_ | T], I, Is) ->
   keep_indexed(T, I + 1, Is);
 keep_indexed([], _I, []) ->
   [].

Perhaps this is what sofs is doing?

Step 1 is going to add additional processing but it could be handled by a usort_index implementation that adds the indexes as you traverse, instead of requiring a pre-pass.

Another approach is to have a slightly modified usort_indexes that returns the indexes of the removed elements. The benefit is that you can do a quick Removed =:= [] to check if you should return the original list as is.

josevalim · February 24, 2022, 11:48am

Here is a naive implementation:

uniq_indexes(L0) ->
    L1 = add_index(L0, 0),
    L2 = lists:usort(fun([A|_], [B|_]) -> A =< B end, L1),
    L3 = lists:sort(fun([_|A], [_|B]) -> A =< B end, L2),
    take_indexed(L0, 0, L3).

add_index([H | T], I) -> [[H | I] | add_index(T, I + 1)];
add_index([], _I) -> [].

take_indexed([H | T], I, [[_ | I] | Is]) ->
    [H | take_indexed(T, I + 1, Is)];
take_indexed([_ | T], I, Is) ->
    take_indexed(T, I + 1, Is);
take_indexed([], _I, []) ->
    [].

Here are the results:

benchmarking 2000000 data items
uniq_indexes/1      : 0.629356
uniq_one_by_one/1   : 2.724498
uniq_by_sofs/1      : 0.740294
uniq_by_usort/1     : 0.042347
uniq_by_usort/2     : 0.077485

uniq_by_usort/2 is calling usort/2 cause I wanted to measure the cost of the anonymous function.

If all elements are equal, then it behaves poorly:

benchmarking 2000000 data items
uniq_indexes/1      : 0.449931
uniq_one_by_one/1   : 0.009655
uniq_by_sofs/1      : 0.410562
uniq_by_usort/1     : 0.010609
uniq_by_usort/2     : 0.026661

It even performs worse than sofs but it is much better than sofs for small collections.

I have also added a benchmark where half of the elements are different and half is unique:

uniq_indexes/1      : 0.871175
uniq_one_by_one/1   : 2.727809
uniq_by_sofs/1      : 2.753117
uniq_by_usort/1     : 0.042967
uniq_by_usort/2     : 0.084409

Some basic profiling shows that 1/6 of the time is spent on add_index, 2/6 on take_indexes, and the rest in the sort operations. Part of the 1/6 can likely be squeezed by reimplementing the initial steps of usort to sort and index, and then delegating to the rest to ufmergel and rufmergel.

bjorng · February 24, 2022, 12:01pm

Interesting. In the “half/half” case uniq_indexes is much faster than the other two. I guess that is because it throws away so many elements in the first sorting pass, so the second sorting pass has much less work to do.

josevalim · February 24, 2022, 12:25pm

Here is the merging of add_indexed with the first pass of usort in uniq_indexes:

benchmarking 2000000 data items
uniq_indexes/1      : 0.420352
uniq_one_by_one/1   : 2.808513
uniq_by_sofs/1      : 0.752575
uniq_by_usort/1     : 0.034249
uniq_by_usort/2     : 0.07497

All equal:

benchmarking 2000000 data items
uniq_indexes/1      : 0.009649
uniq_one_by_one/1   : 0.00919
uniq_by_sofs/1      : 0.501969
uniq_by_usort/1     : 0.011112
uniq_by_usort/2     : 0.035831

Implementation here (to be added into the lists module as it uses private functions).

%% uniq/1

uniq([X, Y | L] = L0) when X < Y ->
    case L of
        [] ->
            L0;
        [Z] when X == Z; Y == Z ->
            [X, Y];
        [Z] ->
            [X, Y, Z];
        _ ->
          uniq_indexed(L0, uisplit_1([X|0], [Y|1], L, [], [], 2))
    end;
uniq([X, Y | L] = L0) when X > Y ->
    case L of
        [] ->
            [X, Y];
        [Z] when X == Z; Y == Z ->
            [X, Y];
        [Z] ->
            [X, Y, Z];
        _ ->
            uniq_indexed(L0, uisplit_2([X|0], [Y|1], L, [], [], 2))
    end;
uniq([X, _Y | L] = L0) ->
    uniq_indexed(L0, uisort_1(X, L, 2));
uniq([X]) ->
    [X];
uniq([]) ->
    [].

uniq_indexed(L0, Indexed) ->
  Sorted = sort(fun([_|A], [_|B]) -> A =< B end, Indexed),
  take_indexed(L0, 0, Sorted).

take_indexed([H | T], I, [[_ | I] | Is]) ->
    [H | take_indexed(T, I + 1, Is)];
take_indexed([_ | T], I, Is) ->
    take_indexed(T, I + 1, Is);
take_indexed([], _I, []) ->
    [].

uisort_1(X, [Y | L], I) when X == Y ->
    uisort_1(X, L, I + 1);
uisort_1(X, [Y | L], I) when X < Y ->
    uisplit_1([X|0], [Y|I], L, [], [], I + 1);
uisort_1(X, [Y | L], I) ->
    uisplit_2([X|0], [Y|I], L, [], [], I + 1);
uisort_1(X, [], _I) ->
    [[X|0]].

%% This is a version of usort that splits and indexes at once.

uisplit_1(X, Y, [Z | L], R, Rs, I) when Z > hd(Y) ->
    uisplit_1(Y, [Z | I], L, [X | R], Rs, I + 1);
uisplit_1(X, Y, [Z | L], R, Rs, I) when Z == hd(Y) ->
    uisplit_1(X, Y, L, R, Rs, I + 1);
uisplit_1(X, Y, [Z | L], R, Rs, I) when Z > hd(X) ->
    uisplit_1([Z | I], Y, L, [X | R], Rs, I + 1);
uisplit_1(X, Y, [Z | L], R, Rs, I) when Z == hd(X) ->
    uisplit_1(X, Y, L, R, Rs, I + 1);
uisplit_1(X, Y, [Z | L], [], Rs, I) ->
    uisplit_1(X, Y, L, [[Z | I]], Rs, I + 1);
uisplit_1(X, Y, [Z | L], R, Rs, I) ->
    uisplit_1_1(X, Y, L, R, Rs, [Z | I], I + 1);
uisplit_1(X, Y, [], R, Rs, _I) ->
    rufmergel([[Y, X | R] | Rs], [], fun([A|_], [B|_]) -> A =< B end).

uisplit_1_1(X, Y, [Z | L], R, Rs, S, I) when Z > hd(Y) ->
    uisplit_1_1(Y, [Z | I], L, [X | R], Rs, S, I + 1);
uisplit_1_1(X, Y, [Z | L], R, Rs, S, I) when Z == hd(Y) ->
    uisplit_1_1(X, Y, L, R, Rs, S, I + 1);
uisplit_1_1(X, Y, [Z | L], R, Rs, S, I) when Z > hd(X) ->
    uisplit_1_1([Z | I], Y, L, [X | R], Rs, S, I + 1);
uisplit_1_1(X, Y, [Z | L], R, Rs, S, I) when Z == hd(X) ->
    uisplit_1_1(X, Y, L, R, Rs, S, I + 1);
uisplit_1_1(X, Y, [Z | L], R, Rs, S, I) when Z > hd(S) ->
    uisplit_1(S, [Z | I], L, [], [[Y, X | R] | Rs], I + 1);
uisplit_1_1(X, Y, [Z | L], R, Rs, S, I) when Z == hd(S) ->
    uisplit_1_1(X, Y, L, R, Rs, S, I + 1);
uisplit_1_1(X, Y, [Z | L], R, Rs, S, I) ->
    uisplit_1([Z | I], S, L, [], [[Y, X | R] | Rs], I + 1);
uisplit_1_1(X, Y, [], R, Rs, S, _I) ->
    rufmergel([[S], [Y, X | R] | Rs], [], fun([A|_], [B|_]) -> A =< B end).

%% Descending.
uisplit_2(X, Y, [Z | L], R, Rs, I) when Z < hd(Y) ->
    uisplit_2(Y, [Z | I], L, [X | R], Rs, I + 1);
uisplit_2(X, Y, [Z | L], R, Rs, I) when Z == hd(Y) ->
    uisplit_2(X, Y, L, R, Rs, I + 1);
uisplit_2(X, Y, [Z | L], R, Rs, I) when Z < hd(X) ->
    uisplit_2([Z | I], Y, L, [X | R], Rs, I + 1);
uisplit_2(X, Y, [Z | L], R, Rs, I) when Z == hd(X) ->
    uisplit_2(X, Y, L, R, Rs, I + 1);
uisplit_2(X, Y, [Z | L], [], Rs, I) ->
    uisplit_2(X, Y, L, [[Z | I]], Rs, I + 1);
uisplit_2(X, Y, [Z | L], R, Rs, I) ->
    uisplit_2_1(X, Y, L, R, Rs, [Z | I], I + 1);
uisplit_2(X, Y, [], R, Rs, _I) ->
    ufmergel([[Y, X | R] | Rs], [], fun([A|_], [B|_]) -> A =< B end).

uisplit_2_1(X, Y, [Z | L], R, Rs, S, I) when Z < hd(Y) ->
    uisplit_2_1(Y, [Z | I], L, [X | R], Rs, S, I + 1);
uisplit_2_1(X, Y, [Z | L], R, Rs, S, I) when Z == hd(Y) ->
    uisplit_2_1(X, Y, L, R, Rs, S, I + 1);
uisplit_2_1(X, Y, [Z | L], R, Rs, S, I) when Z < hd(X) ->
    uisplit_2_1([Z | I], Y, L, [X | R], Rs, S, I + 1);
uisplit_2_1(X, Y, [Z | L], R, Rs, S, I) when Z == hd(X) ->
    uisplit_2_1(X, Y, L, R, Rs, S, I + 1);
uisplit_2_1(X, Y, [Z | L], R, Rs, S, I) when Z < hd(S) ->
    uisplit_2(S, [Z | I], L, [], [[Y, X | R] | Rs], I + 1);
uisplit_2_1(X, Y, [Z | L], R, Rs, S, I) when Z == hd(S) ->
    uisplit_2_1(X, Y, L, R, Rs, S, I + 1);
uisplit_2_1(X, Y, [Z | L], R, Rs, S, I) ->
    uisplit_2([Z | I], S, L, [], [[Y, X | R] | Rs], I + 1);
uisplit_2_1(X, Y, [], R, Rs, S, _I) ->
    ufmergel([[S], [Y, X | R] | Rs], [], fun([A|_], [B|_]) -> A =< B end).

Disclaimer: I have only ran very limited tests! There are probably more optimizations but duplicating all of usort is likely too much.

bjorng · February 25, 2022, 3:55am

Very nice speed up of the sort-based uniq, but I am not sure that we would to add that much extra code the lists module. I couldn’t resist thinking about whether some of the pathological cases could be eliminated using less code and came up with the following (now added to my mbench module):

uniq_indexes(L0) ->
    L1 = add_index(L0, 0, []),
    L2 = lists:usort(fun([_|A], [_|B]) -> A =< B end, L1),
    L3 = lists:sort(L2),
    take_indexed(L0, 0, L3).

add_index([H1 | T], I, [[_ | H2] | _]=Acc) when H1 == H2 ->
    add_index(T, I + 1, Acc);
add_index([H | T], I, Acc) ->
    add_index(T, I + 1, [[I | H] | Acc]);
add_index([], _I, Acc) ->
    Acc.

take_indexed([H | T], I, [[I | _] | Is]) ->
    [H | take_indexed(T, I + 1, Is)];
take_indexed([_ | T], I, [_ | _]=Is) ->
    take_indexed(T, I + 1, Is);
take_indexed([_ | _], _I, []) ->
    [];
take_indexed([], _I, []) ->
    [].

I did the following modifications:

I modified add_index to remove consecutive duplicate elements.
I placed the index in the head so that the second sort could be done without using a sorting fun.
I modified take_indexed to stop immeditely when the second list becomes empty.

Here are the results when all elements are equal:

uniq_one_by_one/1   : 0.007664
uniq_by_sofs/1      : 0.37169
uniq_indexes/1      : 0.00515
uniq_by_usort/1     : 0.007006

Unfortunately, my benchmark was broken in a way that favored the sort-based uniq. See my next post.

bjorng · February 25, 2022, 4:19am

Unfortunately, my benchmark module generated a list sorted in descending order. That favored the sort-based implementations because the sorting became much cheaper (O(N) instead of O(N * log N)).

I have now updated the data generation to increase the randomness:

gen_data(Limit, Limit, Acc) ->
    Acc;
gen_data(N, Limit, Acc) ->
    Item = erlang:phash2(N),
    gen_data(N+1, Limit, [Item|Acc]).

Here are the new results:

uniq_one_by_one/1   : 1.970418
uniq_by_sofs/1      : 4.147654
uniq_indexes/1      : 3.922919
uniq_by_usort/1     : 0.462431

That is, the map-based implementation is about twice as fast as the sort-based implementations.

I also added @josevalim’s implementation to the lists module (on the master branch):

generated 2000000 data items
one_by_one/1        : 1.878218
all_at_once/1       : 0.582533
uniq_one_by_one/1   : 2.009027
uniq_by_sofs/1      : 3.837489
uniq_indexes/1      : 3.702631
lists:uniq/1        : 3.429415
uniq_by_usort/1     : 0.453432

Maria-12648430 · February 25, 2022, 3:08pm

@bjorng I posted a suggestion for an augmentation of the uniq function in the use cases thread: Uses cases? for proposed new lists functions - #10 by Maria-12648430. I think something like this is necessary to make uniq more than just superficially useful. At the same time, I fear that it will take a lot of the steam out of the pre-filled map approaches discussed here, or make them impossible even (sorry ).

bjorng · February 26, 2022, 5:34am

Nothing to be sorry for. After many twists and turns, the simple map-based solution (uniq_one_by_one/1 in the first post) turned out to be fastest implementation in general and that implementation can easily be extended to take a fun.