Some questions about the proposed lists:groupby family of functions came up. For example:
-
What is performance compared to sofs:relation_to_family assuming that your problem doesn’t require that order and duplicates are kept within the groups?
-
What are the performance implications if we don’t want a map but a list as a result? (For further processing or perhaps for turning it into an array.)
-
Since it returns a map, should it be in the
maps
module?
I have tried to answer some of those questions with a benchmark. In the benchmark, I have added a new variant of the groupby
family that takes no funs:
groupby(List) ->
groupby_0(lists:reverse(List), #{}).
groupby_0([{K,V} | Tail], Acc) ->
NewAcc = case Acc of
#{K := Vs} -> Acc#{K := [V | Vs]};
#{} -> Acc#{K => [V]}
end,
groupby_0(Tail, NewAcc);
groupby_0([], Acc) ->
Acc.
The corresponding function using sofs
looks like this:
sofs_groupby(List) ->
S0 = sofs:relation(List),
S1 = sofs:relation_to_family(S0),
S2 = sofs:to_external(S1),
maps:from_list(S2).
I also created two functions that return lists instead of maps:
groupby_as_list(List) ->
maps:to_list(groupby(List)).
sofs_groupby_as_list(Data) ->
S0 = sofs:relation(Data),
S1 = sofs:relation_to_family(S0),
sofs:to_external(S1).
(My benchmark module also include the actually proposed groupby
functions and their corresponding functions implemented using sofs
.)
Here are some examples to show how they work:
1> gbench:groupby([{a,99},{b,2},{a,1},{a,99},{b,1},{c,1}]).
#{a => [99,1,99],b => [2,1],c => [1]}
2> gbench:groupby_as_list([{a,99},{b,2},{a,1},{a,99},{b,1},{c,1}]).
[{a,[99,1,99]},{b,[2,1]},{c,[1]}]
3> gbench:sofs_groupby([{a,99},{b,2},{a,1},{a,99},{b,1},{c,1}]).
#{a => [1,99],b => [1,2],c => [1]}
4> gbench:sofs_groupby_as_list([{a,99},{b,2},{a,1},{a,99},{b,1},{c,1}]).
[{a,[1,99]},{b,[1,2]},{c,[1]}]
As you can see, the sofs_groupby
functions sorts the values within the groups and remove duplicates.
Here are the results when the groups are relatively few but very large:
generated 2000000 two-tuples with 1% unique keys
groupby/1 : 0.327446
sofs_groupby/1 : 1.068666
groupby_as_list/1 : 0.333919
sofs_groupby_as_list: 1.051423
Here the map-based implementation is clearly superior to the sofs
versions.
Testing with half as many groups as elements in the list the performance of the different implementations are more similar:
generated 2000000 two-tuples with 50% unique keys
groupby/1 : 1.111285
sofs_groupby/1 : 1.398401
groupby_as_list/1 : 1.114957
sofs_groupby_as_list: 1.150092
In the other extreme case where each group contains one element, the sofs
based implementations are the fastest:
generated 2000000 two-tuples with 100% unique keys
groupby/1 : 1.748166
sofs_groupby/1 : 0.636978
groupby_as_list/1 : 1.825952
sofs_groupby_as_list: 0.193927
Here is my benchmark module: