erlang:atom_prefix/2 - zero-allocation atom prefix matching BIF

Checking whether an atom starts with a given prefix is a common pattern, but today it requires converting to a list/binary first:

lists:prefix(“erl_”, atom_to_list(Module))

This allocates on every call and creates GC pressure unnecessarily — the atom’s bytes are already sitting in the atom table.

I have a working BIF implementation of erlang:atom_prefix/2 that compares directly against the atom table’s internal UTF-8 bytes. Zero allocation, zero GC.

erlang:atom_prefix(hello_world, <<“hello”>>). %% true
erlang:atom_prefix(hello_world, “world”). %% false

The pattern already appears in at least 5 places in OTP (diameter_info, diameter_service, ct_netconfc, dialyzer tests) and 5 places in Elixir (code formatter, Mix help, type checker, autocomplete) via Atom.to_string/1 |> String.starts_with?/2.

Would this be a useful addition?

4 Likes

In Erlang 28, I find 3 occurrences in diameter and one in common test.
In my own programming, testing for a suffix is about half as common as
testing for a prefix.

I suggest lower level building blocks.
(1) size(<<“foo”>>) is already defined. It’s 3.
Extend size/1 to work on atoms, so that size(‘’) = 0, size(foo) = 3,
size(vexatious_kibitzing) = 19.
(2) atom_compare(Atom1, Drop1, Take1, Atom2, Drop2, Take2)
compare portions of two atoms.

With these primitives,
atom_prefix(Prefix, Whole)
when is_atom(Prefix), is_atom(Whole) →
Prefix_Size = size(Prefix),
Whole_Size = size(Whole),
Whole_Size >= Prefix_Size and
atom_compare(Prefix, 0, Prefix_Size, Whole, 0, Prefix_Size) =:= 0.

atom_suffix(Suffix, Whole)
when is_atom(Suffix), is_atom(Whole) →
Suffix_Size = size(Suffix),
Whole_Size = size(Whole),
Whole_Size >= Suffix_Size and
atom_compare(Prefix, 0, Prefix_Size,
Whole, Whole_Size - Prefix_Size, Prefix_Size) =:= 0.

and several other useful functions are definable.
(1) just removes an arbitrary restriction.
(2) isn’t that much more complicated than the BIF you already want.

1 Like
has_my_prefix(Atom) ->
    Atom >= my_prefix andalso Atom < my_prefiy.

:grinning_face:

10 Likes

You win. Of course, that assumes you know what the successor to prefix is without creating it, which the examples I’ve seen don’t.

Eh, that’s trivial fix:

has_prefix(Atom, Prefix) ->
    Atom >= Prefix andalso Atom < Prefix.

You just created the Prefix atom(), and now it lives forever in the atom table.

1 Like

Since the atom table contains all the binaries, would be cool if we could match binaries with atoms and even have a contains function.

1 Like

It looks like a problem XY. In 15 years, this has never been needed. Please describe the full problem, because it looks like an architectural error.

2 Likes

Since a few releases ago, atom_to_binary/1 will not produce any garbage. Each atom in the atom table is a binary, and atom_to_binary/1 will return that binary directly.

Unfortunately, I can’t think of a way to test whether a binary starts with a certain sequence of characters without creating garbage. Doing binary matching will create a match context/sub binary:

has_prefix(Atom, Prefix) ->
    case atom_to_binary(Atom) of
        <<Prefix:(byte_size(Prefix))/binary, _/binary>> -> true;
        _ -> false
    end.

Other tricks using the binary module that I can think of also produce garbage. For example:

has_prefix(Atom, Prefix) ->
    binary:longest_common_prefix([Prefix, atom_to_binary(Atom)]) =:=
        byte_size(Prefix).

However, testing whether an atom starts with a certain character without producing garbage is possible using binary:first/1.

2 Likes

I strongly suspect this too. If you want to peek at the contents then atoms are likely the wrong tool for the job, their very name means indivisibility. They’re meant to be used as constants in code and nothing more.

Why aren’t bitstrings good enough?

1 Like

On Kubernetes, the module name autocomplete in Elixir was killing the pods for me when the allowed memory was set too low. I was looking into optimizing this. One of the patterns I spotted was converting Atom.to_string and then running String.starts_with? on it. After looking into it, this pattern is used in multiple places in the standard library — probably even more so in tools that work with source code.

That sounds like a problem with how Elixir is doing things, and it’s probably good to look into that separately, I’ll talk to Jose about it. Thanks! :slight_smile:

1 Like

I guess this is my bat signal!

This is not an Elixir specific issue, as it does happen in Erlang as well, under the same circumstances. When you are autocompleting module names in the Erlang shell, you invoke this code:

Which calls `code:all_available/0`:

In interactive mode, `all_available` will return modules from all load paths by traversing the disk, plus all loaded code. In `embedded` mode, it runs only the branch above, which we can focus on right now.

The code above calls `all_loaded` and then returns `{ModString, Path, true}` lists. So when autocompleting modules in embedded mode, you can see the code above can be quite wasteful:

  • It loads `{Module, Path}` data from ETS (including modules and paths) and sends it across processes
  • Then it traverses all pairs, converts atoms into lists, and returns triplets
  • Then `edlin` converts all lists back into atoms
  • Then the `match` function converts the atoms back into lists

There is probably a lot that could be simplified in here! You could probably change `edlin` expand to check if the mode is embedded and, if so, bypass `code:all_available/0` altogether and settle on this core loop:

[M || {M, _} <- code:all_loaded(), lists:prefix(Hint, list_to_atom(M))]

(or use `erlang:all_loaded/0` directly, which should be fine in embedded mode). And this shows exactly the problem statement here: you potentially have thousands of modules, you are allocating now thousands of lists, only to find 30 that starts with `my_app_foo_`. A solution similar to the one here would allow you to bypass this allocation altogether:

[M || {M, _} <- code:all_loaded(), erlang:atom_prefix(Hint, M)]

TL;DR: I’d say core problem happens in Erlang but, because the conversions are hidden through a few different modules, it is not obvious that Erlang also is converting between atom and lists/binaries when there isn’t a need to. I hope this makes sense!

2 Likes

Looking at it, the entire thing could be reduced to this step by refactoring the code server a bit. No need to even load data into process memory, just pure ETS operations.

3 Likes

Expanding on this after talking to @jhogberg and sleeping on it.

Option 1: add prefix option to code:all_loaded/1

  • John’s idea is to have the code server store the binary name of the module in its ets table, so we could ordered_set and do the prefix search as suggested by @sverker. This the whole selection and filtering happens on ets and efficiently due to the ordered set. I believe this requires the binary module name to be the key in ets (I am not sure if this could impact overall insertion/lookup performance)

  • Assuming the above is acceptable, we could then add code:all_loaded([{prefix, ~"my_app_"}]) and code:all_available([{prefix, "my_app_"}]) APIs to be used by Erlang/Elixir autocoplete

  • The above still has the pitfall that code:all_loaded/0 return module names and paths, and the paths will be allocated and unused by edlin_expand. But perhaps this is fine, since the module names will be filtered anyway to only the relevant ones

Option 2: add erlang:all_loaded([{prefix, "my_app"}]) instead

As mentioned above, code:all_loaded/0 return module names and paths and it requires changing the code server. An alternative solution would be to add erlang:loaded([{prefix, "my_app"}]), which does the filtering on a NIF. Then we can change edlin to have its own version of code:all_available that only lists/filters modules, reducing the amount of indirection and back and forth between modules. Roughly this:

all_modules(Prefix) ->
    case code:get_mode() of
        interactive -> all_modules(code:get_path(), Prefix, []);
        embedded -> all_modules([], Prefix, [])
    end.

all_modules([Path|Tail], Prefix, Acc) ->
    case erl_prim_loader:list_dir(Path) of
        {ok, Files} ->
            %% IIRC the ++ after a comprehension is optimized by the compiler
            %% to avoid building list only to concatenate it
            NewAcc = [Module || File <- Files,
                                filename:extension(File) == ".beam",
                                Module = filename:rootname(File),
                                lists:prefix(Prefix, Module)] ++ Acc,
            all_modules(Tail, Prefix, NewAcc);
        _Error ->
            all_modules(Tail, Prefix, Acc)
    end;
all_modules([], Prefix, Acc) ->
    AllLoaded = [atom_to_list(M) || M <- erlang:loaded([{prefix, Prefix}])] ++ Acc,
    lists:usort(AllLoaded).

I’d probably take option 2, because having edlin compute its own modules will be clearer than depending on code APIs.

1 Like

Another variant is to add a function that does exactly what edlin_expand wants to do (be it for option 1 or 2). We’re not stuck with current APIs.