On garbage-collecting atoms

garazdawi · May 24, 2022, 7:06pm

In ERTS today an atom is internally a number, starting at 0 for the first atom and then counting up for each new atom created. That number is the only thing stored in messages, ETS tables, everywhere within a node. If you need to know the string representation of that number, you need to look up the integer in a hash table. If you need to send it outside the node (to disk or via distribution), you need to convert it to a string.
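
To make the scheme above concrete, here is a toy model in Python. It is only a sketch: `AtomTable`, `intern` and `to_string` are names made up for this example, not ERTS internals, and it uses a plain list for the index-to-text direction where the real implementation uses its own tables.

```python
class AtomTable:
    """Toy model of an atom table; all names here are hypothetical."""

    def __init__(self):
        self._index_by_name = {}   # hash table: text -> index
        self._name_by_index = []   # index -> text

    def intern(self, name):
        """Return the existing index for name, or assign the next free one."""
        index = self._index_by_name.get(name)
        if index is None:
            index = len(self._name_by_index)  # first atom gets 0, then 1, 2, ...
            self._index_by_name[name] = index
            self._name_by_index.append(name)
        return index

    def to_string(self, index):
        """Only needed for printing or for leaving the node."""
        return self._name_by_index[index]


table = AtomTable()
ok = table.intern("ok")
error = table.intern("error")

# Inside the node only the integer travels with the term.
message = (ok, 42)

# Leaving the node (disk, distribution) means converting back to text.
external = (table.to_string(message[0]), message[1])
```

Note that nothing in this model ever removes an entry, which is why the table can only grow.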

GC:ing atoms as they are implemented today would require a global (probably stop-the-world) GC, which is not something that we want to introduce to Erlang. So the alternative is to change the way that atoms are represented inside ERTS (for instance as described in EEP-20, linked by @asabil).
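
A rough sketch of why the collection has to be global, again as a toy model in Python rather than anything taken from ERTS. The function names and the idea of modelling each heap or table as a list of atom indices are assumptions made for illustration. Because an atom is just an index that can sit anywhere in the node, an index can only be freed after every possible holder has been checked.

```python
def live_atoms(roots):
    """Collect every atom index referenced from any root in the node."""
    live = set()
    for root in roots:       # every process heap, ETS table, message queue, ...
        live.update(root)
    return live


def collectable(atom_count, roots):
    """Indices that no root refers to; only these would be safe to free."""
    return set(range(atom_count)) - live_atoms(roots)


# Hypothetical node with 5 atoms (indices 0..4) and three places holding terms.
process_a_heap = [0, 2]
process_b_heap = [2]
ets_table = [3]

unused = collectable(5, [process_a_heap, process_b_heap, ets_table])

# Looking at only one process gives the wrong answer: index 3 looks unused
# even though the ETS table still refers to it.
wrongly_unused = collectable(5, [process_a_heap])
```

In this model the answer is only correct if all roots are scanned against a consistent snapshot, which is what makes a stop-the-world pass the obvious (and unwanted) way to do it.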

So why haven’t we done that yet? Well, the datatype proposed for atoms in EEP-20 uses a lot more memory for each atom, and checking whether two atoms are equal or not equal becomes more costly. We have not wanted to make that tradeoff yet and instead recommend that users not generate atoms dynamically but use binaries where a dynamic token is needed. Not all applications follow this guideline (notably our own xmerl does not), so the discussion about removing the limit on the number of atoms (by either making the GC smarter or changing how atoms are represented) pops up from time to time. There definitely exist legitimate use cases where it would be great to have unlimited atoms, but not enough for us to accept worse performance for all atoms in the VM.

If anyone has a great idea about how to solve this problem I would be more than happy to discuss it and implement it if it turns out to work for us.