Performance of small vs large maps

I’ve been wondering what impact the type and size of a key in a map have on performance.

My primitive micro-benchmark (Erlang maps benchmark · GitHub; run with something like t7:test(<<"foobar">>, 10_000_000).) shows that using lists or binaries as keys instead of atoms has a noticeable impact (about 10x in most cases).
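For reference, the shape of the measurement is roughly the sketch below (this is not the linked gist; the module name, function name and integer filler keys are made up for illustration). It times repeated lookups of one key in a map of a given size:

```erlang
%% Minimal sketch, not the linked gist: time Iterations lookups of one
%% key in a map with roughly MapSize entries.
-module(map_bench).
-export([lookup/3]).

%% lookup(Key, MapSize, Iterations) -> microseconds for the whole loop.
%% Filler keys are plain integers here; the key under test is whatever
%% term you pass in (atom, binary, list, ...).
lookup(Key, MapSize, Iterations) ->
    Filler = maps:from_list([{K, K} || K <- lists:seq(1, MapSize)]),
    Map = maps:put(Key, value, Filler),
    {Micros, ok} = timer:tc(fun() -> loop(Key, Map, Iterations) end),
    Micros.

loop(_Key, _Map, 0) -> ok;
loop(Key, Map, N) ->
    _ = maps:get(Key, Map),
    loop(Key, Map, N - 1).
```

Something like map_bench:lookup(<<"foobar">>, 8, 10_000_000) vs map_bench:lookup(foobar, 8, 10_000_000) then compares a binary key against an atom key at the same map size.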

However, what is more interesting is that the switch from small maps to large maps after 32 entries seems to happen too late. For binaries and lists as keys, switching to large maps would already pay off after 4 or 8 entries, and even for atoms as keys, switching at that point wouldn’t have much impact either way.
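One rough way to see the representation switch (assuming erts_debug:flat_size/1 behaves the way I expect here) is to compare a 32-entry map with a 33-entry one:

```erlang
%% maps:from_list/1 with up to 32 pairs should build a flat ("small") map,
%% and 33 pairs a HAMT-based ("large") map; the extra internal structure
%% should show up as a jump in flat_size well beyond one extra entry's worth.
Small = maps:from_list([{K, K} || K <- lists:seq(1, 32)]).
Large = maps:from_list([{K, K} || K <- lists:seq(1, 33)]).
erts_debug:flat_size(Small).
erts_debug:flat_size(Large).
```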

I know I could compile my own VM with a different threshold for large maps, but maybe I’m missing something. Are there other workloads where keeping small maps for more entries is beneficial?

EDIT: Things become more interesting when adding updates to the mix. With atoms as keys, switching to large maps beyond 32 entries is the best option, and I guess that is what maps are optimized for. For binaries and lists as keys, however, switching beyond 16 entries (for lists, even earlier) would be better.
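The update measurement is the same idea as the lookup sketch above, just with maps:put/3 in the hot loop (shell version; the key, map size and iteration count are only placeholders):

```erlang
%% Repeatedly write one key into an existing map and time the whole loop.
UpdateLoop = fun UL(_Key, _Map, 0) -> ok;
                 UL(Key, Map, N) ->
                     _ = maps:put(Key, N, Map),
                     UL(Key, Map, N - 1)
             end.
Map = maps:from_list([{K, K} || K <- lists:seq(1, 16)]).
timer:tc(fun() -> UpdateLoop(<<"foobar">>, Map, 10_000_000) end).
```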


I wonder if a viable approach here is to change the threshold based on the key being added. If you are adding a non-immediate term, maybe we upgrade it at 16, regardless of the other keys, as we can assume the other keys would be mixed.
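Purely as an illustration of the rule I mean (the real decision would live in the C map implementation inside the runtime, and this classification is oversimplified, e.g. bignums are boxed too):

```erlang
%% Pick the upgrade threshold based on the key being inserted:
%% cheap, immediate-style keys keep the current limit, while boxed keys
%% (binaries, lists, ...) trigger the switch to a large map earlier.
UpgradeThreshold = fun(Key) ->
    case is_atom(Key) orelse is_integer(Key) of
        true  -> 32;
        false -> 16
    end
end.
```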

It is worth noting, though, that reducing the threshold for atom keys can be problematic during compilation. Small maps can share their key space during updates and in the module literal area; large maps cannot. So this can impact memory usage.
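For example, a value-only update on a small map should let the two maps share the keys tuple, which you can see by comparing the shared and the flat size of a term holding both (assuming erts_debug:size/1 and erts_debug:flat_size/1 behave as I expect here):

```erlang
M1 = #{a => 1, b => 2, c => 3}.
M2 = M1#{b := 20}.                %% value-only update on a small map
erts_debug:size([M1 | M2]).       %% counts the shared keys tuple once
erts_debug:flat_size([M1 | M2]).  %% counts it as if nothing were shared
```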


We did a bunch of benchmarks when maps were introduced 10 years ago, and IIRC different thresholds were good at different things, but 32 was a good compromise between the various scenarios at the time. The JIT has probably pushed the boundaries around a bit, so it is not very surprising that the sweet spot might have moved, though I’m not sure it has moved enough to warrant updating the default. If someone wants to try a different value on their production nodes, it is a very simple change to make.

Regarding a dynamic threshold: there is a bunch of code inside erts that depends on there being a fixed threshold for when maps become large. The most obvious example is equality, i.e. a large map can never be equal to a small map. I don’t think it would be hard to change this, but it would be rather tedious and error-prone to find all the places where this assumption is made.
