Benchmarking of sofs:relation_to_family/1 and a variant of the proposed lists:groupby

bjorng · February 24, 2022, 9:40am

Some questions about the proposed lists:groupby family of functions came up. For example:

What is performance compared to sofs:relation_to_family assuming that your problem doesn’t require that order and duplicates are kept within the groups?
What are the performance implications if we don’t want a map but a list as a result? (For further processing or perhaps for turning it into an array.)
Since it returns a map, should it be in the maps module?

I have tried to answer some of those questions with a benchmark. In the benchmark, I have added a new variant of the groupby family that takes no funs:

groupby(List) ->
    groupby_0(lists:reverse(List), #{}).

groupby_0([{K,V} | Tail], Acc) ->
    NewAcc = case Acc of
                 #{K := Vs} -> Acc#{K := [V | Vs]};
                 #{} -> Acc#{K => [V]}
             end,
    groupby_0(Tail, NewAcc);
groupby_0([], Acc) ->
    Acc.

The corresponding function using sofs looks like this:

sofs_groupby(List) ->
    S0 = sofs:relation(List),
    S1 = sofs:relation_to_family(S0),
    S2 = sofs:to_external(S1),
    maps:from_list(S2).

I also created two functions that return lists instead of maps:

groupby_as_list(List) ->
    maps:to_list(groupby(List)).

sofs_groupby_as_list(Data) ->
    S0 = sofs:relation(Data),
    S1 = sofs:relation_to_family(S0),
    sofs:to_external(S1).

(My benchmark module also include the actually proposed groupby functions and their corresponding functions implemented using sofs.)

Here are some examples to show how they work:

1> gbench:groupby([{a,99},{b,2},{a,1},{a,99},{b,1},{c,1}]).
#{a => [99,1,99],b => [2,1],c => [1]}
2> gbench:groupby_as_list([{a,99},{b,2},{a,1},{a,99},{b,1},{c,1}]).
[{a,[99,1,99]},{b,[2,1]},{c,[1]}]
3> gbench:sofs_groupby([{a,99},{b,2},{a,1},{a,99},{b,1},{c,1}]).        
#{a => [1,99],b => [1,2],c => [1]}
4> gbench:sofs_groupby_as_list([{a,99},{b,2},{a,1},{a,99},{b,1},{c,1}]). 
[{a,[1,99]},{b,[1,2]},{c,[1]}]

As you can see, the sofs_groupby functions sorts the values within the groups and remove duplicates.

Here are the results when the groups are relatively few but very large:

generated 2000000 two-tuples with 1% unique keys
groupby/1           : 0.327446
sofs_groupby/1      : 1.068666
groupby_as_list/1   : 0.333919
sofs_groupby_as_list: 1.051423

Here the map-based implementation is clearly superior to the sofs versions.

Testing with half as many groups as elements in the list the performance of the different implementations are more similar:

generated 2000000 two-tuples with 50% unique keys
groupby/1           : 1.111285
sofs_groupby/1      : 1.398401
groupby_as_list/1   : 1.114957
sofs_groupby_as_list: 1.150092

In the other extreme case where each group contains one element, the sofs based implementations are the fastest:

generated 2000000 two-tuples with 100% unique keys
groupby/1           : 1.748166
sofs_groupby/1      : 0.636978
groupby_as_list/1   : 1.825952
sofs_groupby_as_list: 0.193927

Here is my benchmark module:

gist.github.com

https://gist.github.com/bjorng/4089cb0a289ad5518abd8aaf59b0a86b

gbench.erl

-module(gbench).
-export([run/0]).
%%-compile(export_all).

run() ->
    Limit = 2_000_000,
    Data = gen_data(0, Limit, []),
    UniqKeys = length(sofs_groupby_as_list(Data)),
    PercentUniq = round(UniqKeys * 100 / Limit),
    io:format("generated ~p two-tuples with ~p% unique keys\n",

This file has been truncated. show original

bjorng · February 25, 2022, 3:20am

I just realized that the way I generated the test data favored the sofs-based implementations. My function for generating data looked like this:

gen_data(Limit, Limit, Acc) ->
    Acc;
gen_data(N, Limit, Acc) ->
    Item = {N div 100,rand:uniform(Limit)},
    gen_data(N+1, Limit, [Item|Acc]).

(for 1 percent unique keys)

The problem is that the generated tuples are already sorted (in reverse order). The sorting that sofs will need to do will be much faster compared to sorting a list with random elements.

So I changed my data generation to generate keys in a more random way:

gen_data(Limit, Limit, Acc) ->
    Acc;
gen_data(N, Limit, Acc) ->
    Item = {erlang:phash2(N div 100),rand:uniform(Limit)},
    gen_data(N+1, Limit, [Item|Acc]).

Here are the new results:

generated 2000000 two-tuples with 1% unique keys
groupby/1           : 0.318817
sofs_groupby/1      : 1.329506
groupby_as_list/1   : 0.351737
sofs_groupby_as_list: 1.306519

generated 2000000 two-tuples with 50% unique keys
groupby/1           : 1.111122
sofs_groupby/1      : 1.967342
groupby_as_list/1   : 1.116259
sofs_groupby_as_list: 1.782708

generated 2000000 two-tuples with 99% unique keys
groupby/1           : 1.616968
sofs_groupby/1      : 2.390517
groupby_as_list/1   : 1.762403
sofs_groupby_as_list: 1.971433

That is, the map-based groupby functions are consistently faster than the sofs-based functions for unordered input data.