Efficient frequency counting of elements

Hi guys

I’ve an unordered list of terms which I’d like to compress. Let me explain.

The following list

[ 1, 2, 1, 1, 15, 3, 2 ]

has to be compressed like this:

[ {1,3}, {2,2}, {15,1}, {3,1} ]

or:
[ {3,1}, {2,2}, {1,3}, {15,1} ]

1 appeared 3 times in the original list. Similarly, 2 appeared 2 times, etc. The ordering of elements in the compressed list isn’t important here. As long as the list is compressed, I’m OK with it.

While an algorithm to achieve this seems more or less obvious (I tried maps, dict, the process dictionary, and an ETS table), I couldn’t come up with a solution which scales well for a list of ~10 million elements.

> crypto:start().
> L = [crypto:rand_uniform(1, 100000) || _ <- lists:seq(1, 10000000)]. %% not sure if this is the best way to generate random numbers

> mylist:compress(L). ???? anyone

Help appreciated.

Hmm… no algorithm will do better than O(n), as you’ll need to traverse the list at least once. Hopefully, only once. So perhaps what you’d want to optimise for is the intermediate garbage generated: if using maps or dicts, for example, every update essentially creates a new map, which can mean a lot of copying and garbage-collecting during the list pass. ets seems to me like the best option: you only need to know how many times each element appeared in the list, so the ets table can just contain the element and a counter, and you can use the ets:update_counter/N helpers to keep track of each count :thinking:

4 Likes

@NelsonVides indeed. Easy enough with ETS, but it doesn’t scale well.

> Tid = ets:new(counts, []). %% table name is arbitrary
> [ ets:update_counter(Tid, Elem, {2, 1}, {Elem, 0}) || Elem <- L ].
> ets:tab2list(Tid).

So perhaps what you’d want to optimise for is the intermediate garbage generated:…

How?

When I said optimising for intermediate garbage, I meant that an algorithm based on maps, for example, would copy the map for every single element of the list, as it needs to update it and data structures in Erlang are immutable. Your [ ets:update_counter(Tid, Elem, {2, 1}, {Elem, 0}) || Elem <- L ] comprehension looks exactly like what I had in mind, and AFAIK this would not generate any garbage during the list traversal, so it should be the fastest.
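
For comparison, a maps-based version might look something like the sketch below; maps:update_with/4 bumps the count, and every update returns a new map, which is exactly where the intermediate garbage comes from:

compress_map(List) ->
    Counts = lists:foldl(
               fun(Elem, Acc) -> maps:update_with(Elem, fun(N) -> N + 1 end, 1, Acc) end,
               #{}, List),
    maps:to_list(Counts).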

An important thing to keep in mind: testing this in the shell means running interpreted code instead of compiled code. Try writing your counting algorithm in a module

-module(mylist).
-export([compress/1]).

compress(List) ->
    Tid = ets:new(counts, []),
    [ ets:update_counter(Tid, Elem, {2, 1}, {Elem, 0}) || Elem <- List ],
    Result = ets:tab2list(Tid),
    ets:delete(Tid),
    Result.

And call it from the shell. But you probably knew that already :slight_smile:
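
Something like this, for instance (assuming mylist.erl is in the shell’s current directory):

> c(mylist).
> mylist:compress(L).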

Now, as for something faster, well, I’m not sure there is any other option here :man_shrugging: Values in the BEAM are immutable, and message passing does a full copy, so dividing the list in two and having two processes do half of the job each would actually be a lot slower, as the list gets copied.

So: either a NIF, or perhaps consider how the list was obtained to begin with and how you could divide the work before constructing the whole list, or even do the counting while constructing the list… Otherwise, this problem is pretty much sequential, and Erlang’s biggest strength is easy concurrency, not fast sequential code. I don’t know if anyone else has a better idea :man_shrugging:

3 Likes

The classic algorithm for this in pure functional languages is

  1. sort the list, O(n log n)
  2. group the sorted elements.

The two steps can be fused. Thrown together in a hurry with minimal testing and NO performance tuning:

count(Xs) ->
    {R, _} = count(Xs, length(Xs)),
    R.

count(Xs, N) when N > 1 ->
    M = N bsr 1,
    {A, Xs1} = count(Xs, M),
    {B, Xs2} = count(Xs1, N-M),
    {merge(A, B), Xs2};
count([X|Xs], 1) -> {[{X,1}], Xs};
count(Xs, 0) -> {[], Xs}.

merge([Xp={X,_}|Xr], Ys=[{Y,_}|_]) when X < Y -> [Xp|merge(Xr, Ys)];
merge(Xs=[{X,_}|_], [Yp={Y,_}|Yr]) when Y < X -> [Yp|merge(Xs, Yr)];
merge([{X,Nx}|Xr], [{X,Ny}|Yr]) -> [{X,Nx+Ny}|merge(Xr, Yr)];
merge(Xs, []) -> Xs;
merge([], Ys) -> Ys.
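
With the list from the original post, count([1, 2, 1, 1, 15, 3, 2]) should give (hand-checked only; the result comes out sorted by element):

[{1,3},{2,2},{3,1},{15,1}]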

Yes, this is not linear. However, log n does not grow very fast: log2(1000) is about 10, and log2(10000000) is about 23. Whether this is fast enough depends on how fast you need it to be. There are some obvious hacks, like sampling + splitting before sorting.

2 Likes

If the value set is integers within a limited range (1..100000 should work), you can use a counters or atomics array, with the value as the index, to count the frequencies.
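
A rough sketch of that idea, assuming the values really are integers in 1..Max, using the counters module (atomics would look almost the same):

count_fixed_range(List, Max) ->
    Ref = counters:new(Max, [atomics]),
    lists:foreach(fun(V) -> counters:add(Ref, V, 1) end, List),
    [{I, counters:get(Ref, I)} || I <- lists:seq(1, Max),
                                  counters:get(Ref, I) > 0].

The final comprehension walks the whole array, so this pays off when the value range is not much larger than the number of distinct values.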

Regarding maps: when the map gets large, the implementation uses a tree (a HAMT), so replacing a value should rebuild just a part (subtree) of the map. It is better than one might suspect.

3 Likes

I thought of atomics, but I was concerned by the fact that we don’t know a priori what elements we will find. We’d need to guess the number of different elements and create a big atomics array beforehand, and there are always the extreme deviations where, for example, you create an array of 100k counters and then the entire list turns out to contain just 3 different values, which feels like a waste :thinking:

Maps, true, HAMT. I’m just not familiar enough with the internals to guess what would happen; I need to read about it :slight_smile:

1 Like

What exactly are you doing, anyway?

  • What are the elements of the collection?

  • How big can they get?

  • If the collection can have 1e7 elements, why not 1e8 or 1e9?

  • Why might you have a single large collection to process instead of many small ones?

  • Why do you want to count them?

  • If they are generated by some sort of combinatorial process, why can’t you generate unique elements together with counts?

1 Like

We tried that and were surprised to find out that with smaller cardinalities it was much faster to use the process dictionary. Bumping an atomic counter was slower than doing the same in the pdict.
Maps were slower than that, and ETS was slower as well.
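
Roughly along these lines (a sketch; best run in a freshly spawned process so the dictionary only ever holds these counters):

compress_pdict(List) ->
    lists:foreach(
      fun(Elem) ->
              case get(Elem) of
                  undefined -> put(Elem, 1);
                  N -> put(Elem, N + 1)
              end
      end, List),
    erase(). %% returns (and clears) all {Element, Count} pairs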

Sorting, with lists:sort, might work, and in practice be faster than any other approach. But the only way to figure that out is through benchmarking.
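
For example, something like this (a sketch; the fold groups adjacent equal elements of the sorted list):

compress_sorted(List) ->
    lists:foldr(
      fun(X, [{X, N} | Acc]) -> [{X, N + 1} | Acc];
         (X, Acc)            -> [{X, 1} | Acc]
      end, [], lists:sort(List)).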

1 Like

Smells of AdTech and audience measurement…

Have a look at a data structure known as a Count-Min Sketch.

You can do a union operation on it too, which can prove useful.

If you are actually after cardinality and not frequency, look at HyperLogLog too.

Importantly, both work on streams of data, so you do not need to do a ghetto ‘gather’ operation prior to processing.
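
As a rough illustration of the idea, a toy Count-Min Sketch can be put together on top of the counters module, using erlang:phash2/2 as a cheap hash family (the sizes here are arbitrary; a real implementation derives width and depth from the desired error bounds):

-define(WIDTH, 1024).
-define(DEPTH, 4).

cms_new() ->
    counters:new(?WIDTH * ?DEPTH, [atomics]).

cms_add(CMS, Elem) ->
    [counters:add(CMS, Row * ?WIDTH + erlang:phash2({Row, Elem}, ?WIDTH) + 1, 1)
     || Row <- lists:seq(0, ?DEPTH - 1)],
    ok.

%% Never under-estimates: take the minimum count over the rows.
cms_count(CMS, Elem) ->
    lists:min([counters:get(CMS, Row * ?WIDTH + erlang:phash2({Row, Elem}, ?WIDTH) + 1)
               || Row <- lists:seq(0, ?DEPTH - 1)]).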

If you work in a team, you may find that compromising, going ghetto and introducing Redis into your solution helps with accessibility, as it supports probabilistic data structures, including both of these.

2 Likes

Sorry for the late answer. Big thanks to all of you for this valuable feedback.

1 Like