Per-process atom limit

Since I’m validating input from users (internal users who will get fired if they do something dumb) I was just going to do something like this (yes, I know it’s Elixir):

defmodule SafeAtom do

  @max_new_atoms 10

  def validate!(args, create? \\ false) do
    atoms = get_all_atoms(0, [])

    new_atoms =
      # A bit naive, needs thought. The capture group keeps only the name,
      # dropping the colon and any whitespace that follows it.
      Regex.scan(~r/:\s*([^,;\[\]\s]+)/, args, capture: :all_but_first)
      |> List.flatten()

    Enum.reduce(new_atoms, 0, &check_atom(&1, &2, atoms))

    if create? do
      Enum.map(new_atoms, &String.to_atom/1)
    else
      :ok
    end
  end

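  # Atoms that already exist never count against the limit; raise only when
  # an unknown atom would push the count past @max_new_atoms.
  defp check_atom(atom_string, created, atoms) do
    cond do
      Enum.member?(atoms, atom_string) -> created
      created >= @max_new_atoms -> raise "New atom count exceeded"
      true -> created + 1
    end
  end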

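  # Walks the atom table by index. This leans on an undocumented external-term
  # tag (75), so it may break between OTP releases. Decoding raises once
  # `count` passes the last atom, which ends the recursion.
  defp get_all_atoms(count, atoms) do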
    try do
      atom = :erlang.binary_to_term(<<131, 75, count::size(24)>>)
      get_all_atoms(count + 1, [Atom.to_string(atom) | atoms])
    rescue
      _ -> atoms
    end
  end
end

Plain counting doesn’t suffice, but if system_monitor could report creation of new atoms (conditions TBD) similar to e.g. long_gc and such, we should be able to identify the hot spots.

People in that thread were alleging that EEP 20 would create serious overheads in cases where it couldn’t possibly create any. If that’s not spreading FUD, what is?

Obviously there are new ideas. The first time I held an iPad in my hands, I had the idea of using the built-in camera to measure pulse and blood oxygen. That was about 12 years ago. Nobody could have had that idea until they first had the idea of a hand-held device with a built-in camera to have it about. This year I keep getting ads on my phone for a phone app which does exactly what I thought of 12 years ago. Once the context for an idea exists, people keep on having that idea.

And so until the atom table overflow problem is actually FIXED, the BEAM community will keep on seeing discussions about workarounds.

If you have to, I would use something that lives outside of the process. Given that these processes have names and there is a finite number of them (or at least a small number), persistent_term would be a good fit here: your key could be simple, and the value an immediate term (i.e., an integer). I’m not sure I see a need to track which atoms are created. You could basically try to_existing_atom/1; if this fails, you know you must create one, and then you can increment the counter.
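A minimal sketch of that counter idea, in Elixir to match the module earlier in the thread. The module name AtomBudget, the `owner` key, and the limit of 10 are assumptions for illustration, not anything proposed in the posts above:

```elixir
defmodule AtomBudget do
  # Hypothetical limit, mirroring @max_new_atoms in the SafeAtom module.
  @max_new_atoms 10

  # `owner` is a hypothetical key naming the process, e.g. its registered name.
  # Only that process should call this for its own key: a get followed by a
  # put is not atomic, so concurrent callers sharing a key could lose counts.
  def to_atom!(owner, string) when is_binary(string) do
    try do
      # Succeeds without touching the counter if the atom already exists.
      String.to_existing_atom(string)
    rescue
      ArgumentError -> create_atom!(owner, string)
    end
  end

  defp create_atom!(owner, string) do
    key = {__MODULE__, owner}
    created = :persistent_term.get(key, 0)

    if created >= @max_new_atoms do
      raise "New atom count exceeded for #{inspect(owner)}"
    end

    # Create the atom first so a failed creation does not use up budget.
    atom = String.to_atom(string)
    :persistent_term.put(key, created + 1)
    atom
  end
end
```

Calling `AtomBudget.to_atom!(:my_worker, "ok")` returns `:ok` without spending any budget, since that atom already exists. Whether persistent_term is cheap enough for a counter that changes this often is worth measuring before relying on it.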

Also agreed: I would not use a regex here; I would prefer binary pattern matching. IMHO, regex is always a last option, since we have nice things in Erlang :smile:
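For what it’s worth, here is a rough sketch of what the binary pattern matching version could look like, again in Elixir to match the earlier module. AtomScan and its function names are made up for illustration, and the delimiter set is simply copied from the regex above, so it is just as naive about what counts as an atom:

```elixir
defmodule AtomScan do
  # Characters that end a candidate name (same set the regex excluded).
  @delimiters [?,, ?;, ?[, ?], ?\s, ?\t, ?\n, ?\r]
  @spaces [?\s, ?\t, ?\n, ?\r]

  # Returns the candidate atom names found in `input`, in order of appearance.
  def candidates(input) when is_binary(input), do: scan(input, [])

  defp scan(<<>>, acc), do: Enum.reverse(acc)

  # A colon starts a candidate: skip whitespace, then collect the name.
  defp scan(<<?:, rest::binary>>, acc) do
    {name, rest} = rest |> skip_spaces() |> take_name(<<>>)
    if name == "", do: scan(rest, acc), else: scan(rest, [name | acc])
  end

  # Any other byte is skipped.
  defp scan(<<_, rest::binary>>, acc), do: scan(rest, acc)

  defp skip_spaces(<<c, rest::binary>>) when c in @spaces, do: skip_spaces(rest)
  defp skip_spaces(rest), do: rest

  # Stop at a delimiter (leaving it in `rest`) or at the end of the input.
  defp take_name(<<c, _::binary>> = rest, name) when c in @delimiters, do: {name, rest}
  defp take_name(<<c, rest::binary>>, name), do: take_name(rest, <<name::binary, c>>)
  defp take_name(<<>>, name), do: {name, <<>>}
end
```

With this, `AtomScan.candidates("[:foo, :bar ; : baz]")` gives `["foo", "bar", "baz"]`.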

Atom checks are a single machine instruction at the moment; literally anything will have “serious overhead” relative to that.

If I recall correctly, your previous suggestion relied on branches which would never be taken after a certain point (a sufficient time after a local atom became global?). However, the extra test would still linger in the code, occupying valuable instruction cache, the branches could mispredict when evicted, and so on.

This may seem negligible but expressed over a lookup table with a few hundred atoms it starts to become significant, especially when there are complications preventing the check from compiling to a select_val.

While we can debate what constitutes “serious overhead,” calling it FUD feels unwarranted to me. There are legitimate technical arguments and I’d rather they were argued against than dismissed as “FUD.”


What “atom checks” are you talking about?
Code like f(a, X) -> …, case E of a -> …, or a = X
would, under EEP 20, do EXACTLY what it does now.
ZERO overhead. It’s why EEP 20 proposes splitting the space
of atoms rather than completely replacing what we have now.

is_atom(X) would take longer, but are such tests really such
a dominant component of Erlang execution that the overhead
(for local atoms and non-atoms) would be detectable in practice?

If I create the local atom “flurb” and store it somewhere, then load a module that makes “flurb” global, and pass the local “flurb” to a function in said module, what happens?

f(flurb) -> … should match the local one too, no?