On garbage-collecting atoms

Here is my question up front: Why are atoms not garbage collected? Could they be, theoretically?

I have some guesses and odd memory fragments, but it would be nice to have it explained by an expert.
I assume it is not just a matter of nobody wanting to bother cleaning up after themselves, and that it doesn’t actually “just need to be implemented” :stuck_out_tongue_winking_eye:


The back story to my question is that I have lately been working on providing property-based tests for some modules in stdlib. When I got to the lists module recently, I unexpectedly reached a point where two worlds are at odds.

PropEr (and other similar frameworks are probably no different) generates input data at random, which is its job. Good.
This includes atoms, however, which are limited in number and never garbage collected. So with sufficiently large and/or many test suites, the respective generators unwittingly burn through the available atom space until it is exhausted. Period.
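
To make the failure mode concrete, here is roughly the kind of property I mean (a sketch, assuming PropEr is available; the property itself is made up):

-module(atom_hungry_prop).
-include_lib("proper/include/proper.hrl").

%% Every run draws fresh random atoms via proper_types:atom(), and since
%% atoms are never reclaimed, large or long-running suites eat into the
%% atom table until it overflows.
prop_reverse_involutive() ->
    ?FORALL(L, proper_types:list(proper_types:atom()),
            lists:reverse(lists:reverse(L)) =:= L).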

Admittedly, I’m using the generators rather YOLO-ishly right now, and can probably cut back and tune them to be less atom-greedy. This makes the tests more artificial, though. And if we think of something like property-testing all of OTP, or any other sufficiently large Erlang-based system, we see that we can’t :woman_shrugging:

6 Likes

Maybe you could generate a random set of atoms limited to the maximum number of allowed atoms (a bit less, to leave room for the ones created by code), put it into an ETS table or persistent_term, and on any subsequent call to the generator fetch a random one out of the set. This maintains randomness while giving more control over how many atoms are created.
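
A rough sketch of what I mean (the module and function names are made up, and the pool size is arbitrary):

-module(atom_pool).
-export([init/1, random_atom/0]).

%% Pre-generate a bounded pool of atoms once and stash it in persistent_term.
init(Size) ->
    Pool = list_to_tuple([list_to_atom("pool_atom_" ++ integer_to_list(N))
                          || N <- lists:seq(1, Size)]),
    persistent_term:put({?MODULE, pool}, Pool).

%% Generators then draw from the fixed pool instead of minting new atoms.
random_atom() ->
    Pool = persistent_term:get({?MODULE, pool}),
    element(rand:uniform(tuple_size(Pool)), Pool).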

6 Likes

There is an EEP (EEP 0020 - Erlang/OTP) that outlines a potential solution to the atom garbage collection problem.

4 Likes

In ERTS today an atom is internally a number, starting at 0 for the first atom and counting up for each new atom created. That number is the only thing stored in messages, ETS tables, everywhere within a node. If you need to know the string representation of that number, you look the integer up in a hash table. If you need to send it outside the node (to disk or via distribution), you need to convert it to a string.
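
As an aside, you can watch the table fill up from the shell (the numbers below are just illustrative; the default limit can be raised with the +t emulator flag):

1> erlang:system_info(atom_count).
13460
2> erlang:system_info(atom_limit).
1048576
3> _ = list_to_atom("some_brand_new_atom"), erlang:system_info(atom_count).
13461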

GC'ing atoms as they are implemented today would require a global (probably stop-the-world) GC, which is not something that we want to introduce to Erlang. So the alternative is to change the way that atoms are represented inside ERTS (for instance as described in EEP-20, linked by @asabil).

So why haven’t we done that yet? Well, the datatype proposed for atoms in EEP-20 uses a lot more memory for each atom, and checking whether two atoms are equal or unequal becomes more costly. We have not wanted to make that tradeoff yet and instead recommend that users not generate dynamic atoms but use binaries where a dynamic token is needed. Not all applications follow this guideline (notably our own xmerl does not), so the discussion about removing the limit on the number of atoms (by either making the GC smarter or changing how atoms are represented) pops up from time to time. There definitely exist legitimate use cases where it would be great to have unlimited atoms, but not enough of them for us to accept worse performance for all atoms in the VM.
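
For the record, the guideline usually boils down to something like this (a sketch; the function names are made up):

%% Risky: every unknown token from the outside world becomes a new atom
%% that is never reclaimed.
token_to_term_unsafe(Token) when is_binary(Token) ->
    binary_to_atom(Token, utf8).

%% Safer: keep the dynamic token as a binary and only produce an atom when
%% it matches one that already exists (e.g. because loaded code uses it).
token_to_term(Token) when is_binary(Token) ->
    try
        binary_to_existing_atom(Token, utf8)
    catch
        error:badarg -> Token   % unknown tokens stay binaries
    end.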

If anyone has a great idea about how to solve this problem I would be more than happy to discuss it and implement it if it turns out to work for us.

16 Likes

Yes, something like this is probably what it will boil down to. I was thinking of having a named process to manage the generation and dealing out of atoms. The requested size needs to be taken into account as well. I haven’t thought it through fully :sweat_smile:

2 Likes

That is probably not necessary. For testing the lists module, the actual text of the atoms doesn’t matter (with the possible exception of the tests for lists:concat/1).

Choosing atoms randomly from a list of fixed atoms is probably good enough.

That is the approach that the proper_erlang_abstract_code module in proper itself uses (see default_atom/0 in that module).
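
In PropEr terms it can be as simple as something like this (a sketch in the spirit of default_atom/0; the particular atoms are arbitrary):

%% A generator that only ever hands out atoms from a small fixed set.
fixed_atom() ->
    proper_types:elements([a, b, c, foo, bar, baz, undefined, ok]).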

3 Likes

Thanks for the explanation @garazdawi, I guessed something like this was the case :smile: I also think that use cases which run a risk of exhausting the atom table are rare enough not to justify large-scale refactorings and performance penalties in Erlang to accommodate them.


That said, I have no brilliant idea to contribute to the matter, either. The best I can think of (and I’m pretty sure the idea has countless holes in it) is this below:

Reading EEP 20, it sounds like it proposes storing global (permanent, i.e. what we have now) as well as local (reclaimable) atoms in the same (global) lookup table. When a process exits, the local atoms it owned disappear, and pointers to them are replaced with an equivalent representation so that other processes referring to them don’t end up with “broken” atoms. I may be mistaken here, though…

The EEP aims at making the global/local distinction fully opaque, i.e. a user has to do nothing (in fact, can’t do anything) and does not notice any difference, in semantics at least, at the cost of a performance and memory impact.

But maybe the proposed solution is trying to be too smart in that regard?

What if we had the global table as we do now, and each process had its own local table?

By default, atoms are created globally, the way it is now, meaning any existing code would work (or not) the same as before. But iff a process creates an atom locally by somehow explicitly saying so (I don’t know how that could look in code), the atom is stored in the process’ atom table, unless that atom already exists globally: in that case, it would make no sense to store it a second time locally. (Maybe we could also think of some pruning mechanism, such that when a global atom is created, local atoms of the same name are invalidated and removed.)

As long as a local atom is used only within the process itself, there is no real difference between local and global atoms. But if a local atom is sent to another process, it has to be sent as a string, the same way as if it were sent to another node; global atoms, on the other hand, can be sent the same way as they are today, since they are known globally.

A process receiving a local atom from another process checks if it already has such an atom itself, or creates and stores it locally. Receiving a global atom is the same as it is now.

If a process exits, all of its local atoms just vanish, without affecting anything else. The global atoms stay, as usual.

(Glossing over details now…)
By separating things that way, a real performance penalty only occurs when a process sends a local atom to another process. This can be documented. The memory increase is controllable, since a user has to take explicit action to create an atom locally, instead of it happening automatically as the EEP suggests.
Put differently, users can control whether and to what degree they want reclaimable atoms and suffer the performance/memory penalty, or the other way round.

What do you think? (And please don’t hurt me :stuck_out_tongue_winking_eye:)

5 Likes

I was actually thinking beyond lists :wink: But for now, I guess that it will (have to) suffice :slight_smile:

4 Likes

One problem is what happens when a new global atom is introduced [1]. There could exist local atoms with the same value as that global atom. So now, whenever we compare any atoms, we need to check whether the string representation of the global atom happens to be equal to that of the local atom.

[1]: For example, when a module is loaded that contains an atom which previously existed only as a local atom.

4 Likes

As far as I understand, local atoms should not be stored in the global table, but on the heap of the process. When the process dies, all local atoms on its heap simply disappear. The rest of your suggestion of how to do it is actually similar to what the EEP suggests, if I haven’t misunderstood something.

The EEP suggests that the only way to create global atoms is by loading code (or by registering a name for a process with a specific atom). Local atoms are created dynamically using BIFs. Quoting from the EEP:

Atoms returned by list_to_existing_atom/1 are always global atoms. Atoms returned by list_to_atom/1 or binary_to_term/1 are global atoms if and only if they are already existing global atoms, otherwise they are local atoms.

The garbage collector could probably replace local atoms with global atoms.

Another thing that will become more expensive is comparing atoms. Today, an equality comparison of two atoms is just a comparison of two words. Introducing local atoms makes that comparison more complicated (even if we prune local atoms that have subsequently been defined as global atoms by loading more code, that pruning cannot occur immediately, and it must be assumed that an atom can exist as a local and a global atom at the same time). The compiler could possibly mitigate some of that extra cost if it could figure out that two atoms being compared must be of the same type.

3 Likes

@garazdawi @bjorng see, I told you my idea is full of holes :laughing:

3 Likes

As the author of EEP 20, I completely failed to recognise
the description of it in this thread. The mechanism
described in EEP 20

  • NEVER adds local atoms to the global atom table
  • NEVER does any locking that the present scheme
    would not do
  • goes to some trouble to keep atom comparison fast
  • makes local atoms comparable in size to binaries
  • was proven in practice several decades ago.

The central defect of EEP 20 is that it does not entirely
replace the existing atom machinery, so the vulnerability
deep in Erlang remains. However, simply by making all
atoms have the form of local atoms (as EEP 20 hints), we
could fix that vulnerability.

For what it’s worth, Smalltalk Symbols have been garbage
collected for the last 40 years and java.lang.String.intern()
strings have been collected since at least Java 9 (that’s
as far back as I can check).

Way back in the mid-1980s I had the opportunity to try out
a fairly decent Prolog system on a Macintosh classic.
My program created too many atoms, the Prolog system
(which was only fairly decent) crashed, and erased the
entire floppy it was on. Lessons like that stick with you.
(a) Make backups first.
(b) Don’t overflow the atom table.
(c) Don’t make atom tables that can overflow.
(d) Something will always overflow; make that survivable.

The problem with “don’t overflow the atom table” is that it
isn’t actually under the programmer’s control. No matter
what my code is like, I can never be confident that some
library module won’t be revised and start creating lots of
atoms. This is one of many reasons why running multiple
Erlang nodes on a single host is a good idea, because as
long as there are shared resources with small bounds,
processes running in the same node aren’t really isolated.

6 Likes

Thanks for chiming in @nzok :blush:

Maybe EEP 20 can be reviewed again and dusted off? I mean, almost 14 years have passed since then… It certainly seems worthwhile. @garazdawi @bjorng? :wink:

2 Likes

In my opinion, the last paragraph in the comment that @bjorng made is the real showstopper for atoms in the way EEP-20 describes them.

Atoms are compared a lot in Erlang, and I think that the added overhead of comparing atoms will be more than we would accept. Though without a prototype it is not possible to know for sure.

Maybe a way to fix part of that problem would be to draw a line in the sand after all applications have started, i.e. all atoms created during system start are global atoms and any atoms created after that point are local atoms. For embedded mode, where all modules are loaded at system start, this should suffice for any use case that does not load new code later. I’m not sure how good it would be for interactive mode.

4 Likes

Like most people, I would love to see atoms being garbage collected, but I understand only too well why they’re not. I can’t help but feel that the problem is that people abuse atoms by putting “text” in them, or are forced to convert other types to atoms for whatever reason (*). Atoms are effectively named symbolic constants, albeit all with unique values you don’t know, and that’s how they should be used, not as unbounded sets of random values.

(*) Some of the 3rd party APIs we use for metrics unfortunately force us to convert some unbounded types (like numbers) to atoms, and that really sucks.

4 Likes

I repeat, the scheme described in EEP 20 is based on
that used in the “Logix” implementation of Flat Concurrent
Prolog, which is a massively concurrent programming
language that compares atoms at least as often as Erlang
does.

The whole point of retaining global atoms in EEP 20 (which
is something Logix did not do) is to retain fast comparison
of atoms. If two terms are both ‘global atoms’ they are
equal if and only if they are identical. NO SLOWDOWN!

So the intent was that things that could safely be done
with atoms at the time could STILL be safely done with
atoms AT THE SAME SPEED, while things that would be done
with binaries could be done even faster with local atoms.
Some care has to be taken in the implementation, obviously,
but the representation permits this.

I’m not seeing where this show-stopping overhead is.
It’s possible that I’m reading EEP 20 through the memory
of what I meant to mean rather than what I did write.

I agree that atoms are often abused, and I can think of one
Prolog system that went out of its way to encourage that.
But I am typing on a very cheap 2nd-hand laptop with 8GB of
memory and 1TB of disc and a million atoms is just not very
large in such a world. I could buy an 8TB drive for USD 170.
There’s a data set I would like to play with that needs over
400 million node labels…

4 Likes

The problem is not checking whether two global atoms are equal, but rather checking whether a global atom and an arbitrary term are unequal. For example, in this code:

foo(a) ->
  hit;
foo(b) ->
  hit;
foo(_) ->
  miss.

With the way atoms are implemented today, this translates to basically the following in the JIT:

foo(arg1) {
  if (arg1 == 'a') return 'hit';
  if (arg1 == 'b') return 'hit';
  return 'miss';
}

while if we split atoms into local and global, the same code would have to look something like this:

foo(arg1) {
  // I've simplified local atom comparison here
  if (arg1 == 'a' || strcmp(atom_to_string(arg1), "a") == 0) return 'hit';
  if (arg1 == 'b' || strcmp(atom_to_string(arg1), "b") == 0) return 'hit';
  return 'miss';
}

because we cannot know whether arg1 is a local or a global representation of the atom a or b, as it could have been made global during a code upgrade. If this function is mostly called with the atom c, it will be a lot slower than it used to be. This is the added slowness I’m talking about.

I don’t know for sure if this will be a performance problem in real applications, but I think that it will.

7 Likes

Oh my, what have I started :sweat_smile:

Anyway, for the moment I’m out of bright(-looking) ideas, I guess… I think we could live with local atoms introducing a somewhat higher memory overhead, but the performance issue is another story, all the more so if it affects the system as a whole.

3 Likes

Is there a reason why xmerl does that, or did it “just happen”?

1 Like

Sigh. Evidently you didn’t read EEP 0020
carefully. To start with the most obvious thing
first, IT ISN’T strcmp(). It’s quite clear in the
text of the EEP that the design tries hard to avoid
looking at the names of atoms.
FIRST, check for global vs global.
Running existing code, there AREN’T any local atoms,
and this is already part of the tag code, so the
code that is executed does precisely what it does now.

SECOND check the stored hash codes to see if they
are identical. In experiments where for simplicity
I used a rather poor hash function, this check
almost always failed right away. That’s ONE word-
size comparison. Oh the humanity. Two whole extra
loads and a compare and a branch FOR A CASE THAT
CANNOT NOW OCCUR.

Third check the stored sizes to see if they are
identical. In the few cases where the low quality
hash function didn’t help, this almost always
distinguished the two atoms.

Fourth, check the characters a whole word at a time.
For unequal atoms, this almost always fails on the
first iteration.
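
To spell out that ordering, it could be modelled in plain Erlang roughly
like this (purely illustrative; a “local atom” is faked here as a tuple
carrying a precomputed hash, size and name, which is of course not how the
VM would represent it, and phash2 merely stands in for whatever hash the
implementation would cache):

%% Global atoms are ordinary atoms; a fake local atom is
%% {local_atom, Hash, Size, NameBin} with Hash and Size derived from NameBin.
make_local(NameBin) when is_binary(NameBin) ->
    {local_atom, erlang:phash2(NameBin), byte_size(NameBin), NameBin}.

%% Equality runs the cheapest test first.
atoms_equal(A, B) when is_atom(A), is_atom(B) ->
    A =:= B;                                        % global vs global: identity
atoms_equal({local_atom, H1, S1, N1}, {local_atom, H2, S2, N2}) ->
    H1 =:= H2 andalso S1 =:= S2 andalso N1 =:= N2;  % hash, then size, then name
atoms_equal({local_atom, H, S, N}, G) when is_atom(G) ->
    GName = atom_to_binary(G, utf8),
    %% only compare the characters when hash and size already agree
    H =:= erlang:phash2(GName) andalso S =:= byte_size(GName) andalso N =:= GName;
atoms_equal(G, L = {local_atom, _, _, _}) when is_atom(G) ->
    atoms_equal(L, G);
atoms_equal(_, _) ->
    false.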

So yes, there’s overhead when comparing a local atom
and a global atom that have the same spelling (so
should count as equal). The bit about “union-find”
is to make sure THAT NEVER HAPPENS AGAIN for that
particular pair. In any case, the measured cost on
this elderly and somewhat unreliable laptop was 20 nsec.

HOWEVER, the rest of the text explains that things
that would be global atoms now remain global atoms
under EEP 0020. To labour the obvious, this means NO
overhead for existing code.

What EEP 0020 does is to allow something that Erlang code
thinks is an atom to be used safely where you would now
have to use binaries.

I’d be the last to claim that EEP 0020 is perfect,
because after all, it isn’t actually my idea.
It’s something that was used successfully in another
concurrent programming language with a Prolog heritage,
as (mis)remembered by me.

But please, if you want to criticise it, read it
CAREFULLY beforehand and don’t make stuff up about strcmp.

(The union-find stuff relies on the garbage collector to
clean up. Fortunately this is something that can be done
per process, preserving genetic order.)

2 Likes