I just realized that the way I generated the test data favored the sofs
-based implementations. My function for generating data looked like this:
gen_data(Limit, Limit, Acc) ->
Acc;
gen_data(N, Limit, Acc) ->
Item = {N div 100,rand:uniform(Limit)},
gen_data(N+1, Limit, [Item|Acc]).
(for 1 percent unique keys)
The problem is that the generated tuples are already sorted (in reverse order). The sorting that sofs
will need to do will be much faster compared to sorting a list with random elements.
So I changed my data generation to generate keys in a more random way:
gen_data(Limit, Limit, Acc) ->
Acc;
gen_data(N, Limit, Acc) ->
Item = {erlang:phash2(N div 100),rand:uniform(Limit)},
gen_data(N+1, Limit, [Item|Acc]).
Here are the new results:
generated 2000000 two-tuples with 1% unique keys
groupby/1 : 0.318817
sofs_groupby/1 : 1.329506
groupby_as_list/1 : 0.351737
sofs_groupby_as_list: 1.306519
generated 2000000 two-tuples with 50% unique keys
groupby/1 : 1.111122
sofs_groupby/1 : 1.967342
groupby_as_list/1 : 1.116259
sofs_groupby_as_list: 1.782708
generated 2000000 two-tuples with 99% unique keys
groupby/1 : 1.616968
sofs_groupby/1 : 2.390517
groupby_as_list/1 : 1.762403
sofs_groupby_as_list: 1.971433
That is, the map-based groupby
functions are consistently faster than the sofs
-based functions for unordered input data.