Adding atoms to the definition of IO data?

eproxus · September 5, 2024, 10:24am

I often want to render IO data where the source data contains atoms. It would be super nice if atoms were automatically converted to binaries when used in e.g. iolist_to_binary/1 and other IO functions.

I’m thinking of cases like these:

callback_url(Type) when is_atom(Type) ->
    iolist_to_binary([config(callback_root), $/, atom_to_binary(Type)]).
    % Which could become:
    iolist_to_binary([config(callback_root), $/, Type]).

(Ignore the fact that this might not be the cleanest way to join URI components; that is beside the point.)

I don’t immediately see a problem with it, since it would only happen at “encoding” time (and not at “decoding” time, which is another can of worms). It would just be a convenience for things that already look like text to developers, e.g. iolist_to_binary([foo, $., bar, $., baz]) would generate <<"foo.bar.baz">>. Then one could use iodata() structures as a poor man’s templating system as far as atoms are concerned.

What are your opinions? Any reasons why this would be a bad idea? Are there cases where it would cause problems or not be feasible?

williamthome · September 5, 2024, 10:53am

I’m not sure why, but lists:concat/1 seems to have a similar behavior:

1> lists:concat([foo, ".", bar, ".", baz]).
"foo.bar.baz"

But by using char it does not return the expected result:

2> lists:concat([foo, $., bar, $., baz]).
"foo46bar46baz"

eproxus · September 5, 2024, 10:58am

Interesting find. It even supports integers and floats too.

IO data is kind of ideal, because in the end one does not have to concatenate or flatten anything, so supporting atoms there would be the most performant choice.

It would be impossible to support integers since those are already used for code points, but personally I don’t want that anyway, because there usually isn’t a good 1:1 mapping between an integer/float and the desired string representation: it varies (precision, thousands separator, etc.). For atoms, though, the default string is obvious, since they are their own string representation (there’s just no automatic conversion for it, which is what I wish for with this post).

LeonardB · September 5, 2024, 1:06pm

This is because $. is a representation of 46, not a list, which
would be [$.] or ".".

1> lists:concat([foo, [$.], bar, [$.], baz]).
"foo.bar.baz"
2> 46 =:= $..
true

The implementation is:

integer() -> list().
float() -> list().
atom() -> list().
list() -> list().
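
A minimal sketch of that behaviour, written from the summary above rather than copied from OTP (the module name concat_sketch and the helper thing_to_list/1 are illustrative assumptions):

```erlang
-module(concat_sketch).
-export([concat/1]).

%% Convert each element to a string, then append the results in order.
concat(List) ->
    lists:flatmap(fun thing_to_list/1, List).

%% Integers are formatted as decimal numbers, which is why $. (46)
%% comes out as "46" rather than ".".
thing_to_list(X) when is_integer(X) -> integer_to_list(X);
thing_to_list(X) when is_float(X)   -> float_to_list(X);
thing_to_list(X) when is_atom(X)    -> atom_to_list(X);
thing_to_list(X) when is_list(X)    -> X.  % assumed to already be a string
```

With this sketch, concat_sketch:concat([foo, $., bar, $., baz]) gives "foo46bar46baz", matching the shell output earlier in the thread.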

I think there’s a typo/bug in the documentation.

%% concat(L) concatenate the list representation of the elements
%%  in L - the elements in L can be atoms, numbers of strings.
%%  Returns a list of characters.

It should probably be “numbers or strings”.

eproxus · September 5, 2024, 1:49pm

From the documentation it also looks like lists:concat/1 only handles flat lists so it’s not a replacement for deeply nested IO list creation, which is very efficient.
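
A small check of that difference, as a sketch (the module name concat_nested is made up for illustration):

```erlang
-module(concat_nested).
-export([demo/0]).

demo() ->
    Nested = [foo, ["a", ["b"]]],
    %% lists:concat/1 appends only one level, so the inner lists survive
    %% and the result is not a flat string.
    [$f, $o, $o, "a", ["b"]] = lists:concat(Nested),
    %% iolist_to_binary/1 flattens any depth, but it accepts no atoms,
    %% so foo has to be written as "foo" here.
    <<"fooab">> = iolist_to_binary(["foo", ["a", ["b"]]]),
    ok.
```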

garazdawi · September 6, 2024, 7:25am

An iolist is a collection of bytes, not a collection of characters, so I don’t think that it would make sense to include something that would need to be converted to characters in there.

If we were to add it anywhere, then I would add it to unicode:chardata/0, as that is a collection of characters where we have a known encoding to convert to. You could then do unicode:characters_to_list(["hello ", atom]) and get what you expect. I don’t think that we should do this, as dealing with chardata is complex enough in places like the string module.
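
To make the bytes-versus-characters distinction concrete, here is a small sketch using only existing functions (the module name bytes_vs_chars is an assumption for illustration):

```erlang
-module(bytes_vs_chars).
-export([demo/0]).

demo() ->
    %% In iodata, the integer 228 is simply the byte 16#E4.
    <<228>> = iolist_to_binary([228]),
    %% In chardata, 228 is the code point U+00E4, which UTF-8 encodes
    %% as two bytes.
    <<195, 164>> = unicode:characters_to_binary([228]),
    %% atom_to_binary/1 defaults to UTF-8, so an atom containing that
    %% character also becomes two bytes.
    <<195, 164>> = atom_to_binary(list_to_atom([228])),
    ok.
```

An atom placed in an iolist would therefore need an encoding step that the surrounding integers and binaries never get.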

As a small aside, the file:name/0 type is already almost what you describe that you want.

eproxus · September 7, 2024, 1:09pm

Right, so with that explanation, I realize that what I’m really asking for is a built-in, single-pass, recursive way to encode (some) terms to UTF-8… I suppose.

That is, something like foobar_to_binary/1,2 where ..._to_binary has the same meaning as in atom_to_binary/1,2 (since it is encoding aware) and foobar_to_... has the same recursive meaning and flexibility as iolist_to_....

If I understand you correctly what you are saying is that iolist_to_binary is purely a way to efficiently traverse BEAM memory (from various sources) and pipe it into an IO sink (socket, binary, file, whatever) and that due to complexity of encoding it will not get any support for exporting such data to the configured encoding of the VM (in the way atom_to_binary does).

The other alternatives are less interesting because they remove the most interesting property of iolist_to_binary (or whatever underlying mechanism is used that works with sockets, files, etc.): that it can be done in a single pass by the VM itself. The multiple-pass way is what any Erlang project already uses today, either explicitly (a homegrown recursive algorithm before the IO sink) or implicitly (via some library that does it wholly or partly, e.g. the new json module in OTP).
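
For reference, the explicit multiple-pass approach might look roughly like this. It is only a sketch: the module atom_iodata and its functions are hypothetical names, and it assumes every atom should be rendered as UTF-8.

```erlang
-module(atom_iodata).
-export([to_iodata/1, to_binary/1]).

%% Extra pass: walk a (possibly nested or improper) list and replace
%% every atom with its UTF-8 binary. Note that [] is not an atom.
to_iodata(A) when is_atom(A) -> atom_to_binary(A, utf8);
to_iodata([H | T])           -> [to_iodata(H) | to_iodata(T)];
to_iodata(Other)             -> Other.  % [], bytes and binaries pass through

%% Second pass: the flattening done by the VM.
to_binary(Data) ->
    iolist_to_binary(to_iodata(Data)).
```

Here atom_iodata:to_binary([foo, $., bar, $., baz]) returns <<"foo.bar.baz">>, but only after rebuilding the whole structure once, which is exactly the pass I would like to avoid.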

How are atom text representations stored internally in the BEAM? As UTF-8 binaries somewhere?

garazdawi · September 9, 2024, 8:17am

What you want is some simpler way to convert atoms to strings without having to litter the code with s(Atom) or adding some extra pass. This is very similar to the problem that EEP-62 tries to solve, only that it does it for more datatypes than just atoms. If EEP-62 were to be introduced, would adding atoms to iodata() still be useful? Or are they two different solutions to the same problem?

“How are atom text representations stored internally in the BEAM? As UTF-8 binaries somewhere?”

Yes.

eproxus · September 9, 2024, 9:02am

The property I would like to keep is to be able to recursively create an IO data structure without flattening it more than once (which happens for free when you pass an IO list to some output sink). Interpolating it myself, with some library or with EEP-62 doesn’t really matter that much because it is still one additional (unnecessary) pass over the structure. If the EEP-62 syntax would be compiled to something that retains the IO data properties, then that of course would solve the problem.

Right, so in my naive dream world I’d like for something like this to work: iolist_to_binary(["foo", <<"bar">>, baz], utf8) where the atoms would just be piped/encoded automagically (and where the default option would of course be utf8 as with atom_to_binary/1).

nzok · September 10, 2024, 1:28am

It’s difficult to see how EEP-62 would help.
Suppose you want to include an atom in an IO list.
There are two cases.
(1) You have a specific atom in mind.
In this case, you just turn 'X!&Z' into "X!&Z" and there’s no problem.
(2) The atom will not be known until run-time.
In this case, EEP-62 is a rather heavyweight alternative to atom_to_list(Atom),
which could be given a local abbreviation if used often.

This turns our focus to “where are the atoms coming from, and why are they atoms?”

I am actually very sympathetic to the idea of allowing atoms in I/O lists.
I have a library for another programming language in which its analogues of
I/O lists (called chartrees and bytetrees) DO allow atoms. If Erlang had
happened to allow this way back when, I doubt that anyone would have argued for
removing that feature when making Erlang Unicode-friendly.

Since the feature isn’t present in Erlang (yet), a useful response is to say
“what does the code look like where this would be helpful?”

Back in the day there used to be a joke about New Zealand.
When the USA came up with a new educational idea,
we’d wait 10 years until it was proven to be a very bad idea.
And then we would adopt it.
EEP-62 reminds me forcibly of that joke.

maxlapshin · September 11, 2024, 7:15am

This is a nice idea, if you don’t consider that undefined is also an atom.

It may be really convenient to write some atoms like key names to binary, but undefined will become a pain I suppose.

However, an iolist written with atoms may be a bit easier to read.

mmin · September 11, 2024, 8:42am

Any reason why undefined is so special in this case? I don’t see why undefined should be written as "undefined" by default. If you want to change that behavior you need to manually handle it anyways.

maxlapshin · September 13, 2024, 8:03am

undefined is special because it is our NULL.

When I write something like io:write(Fd, ['{',user,':', escape(UserId)]) it is rather clear that I assume that the atom user will be written as the 4-byte binary <<"user">>. But undefined usually means that it is not what I wanted to write.

jhogberg · September 13, 2024, 8:57am

I think calling undefined our NULL is stretching it: sure, record fields that are left unset happen to be set to undefined, but practically everywhere else we let absence signal absence, and in the rare cases where sentinel values are used, they’re pretty inconsistent.

jhogberg · September 13, 2024, 9:14am

Here’s a quick and dirty implementation in case anyone wants to play around with it. As supporting Latin-1 would be annoying in the face of yielding, the encoding is always UTF-8.

(I don’t have a strong opinion on this, but am leaning a bit towards not having it since I think it’s more likely to hide bugs than to help)

mmin · September 13, 2024, 9:18am

So could be 'null', 'nil' or 'none'

I’d argue that, but no matter what undefined means, currently you still have to write something like:

maybe_write(undefined) ->
    <<>>;
maybe_write(X) ->
    X.

if you want to output nothing on undefined. If atoms get into iodata() then you can still do it, BUT you don’t need to manually convert every atom to its binary representation. I don’t see undefined causing a single problem here.

wojtekmach · September 13, 2024, 9:43am

I sometimes wished atoms would be handled like proposed here but speaking of undefined, I think the following would definitely trip up Elixir developers. In Elixir, nil is handled in many places, including interpolations, here’s a contrived example:

iex> "foo#{nil}#{true}#{:bar}"
"footruebar"

I’d sometimes change interpolation to iolists to avoid copies but now I’d get a different result:

iex> ["foo", nil, true, :bar] |> IO.iodata_to_binary()
"fooniltruebar"

eproxus · September 13, 2024, 1:18pm

For generating strings from atoms, undefined has no special meaning in general (just as true and false don’t). The principle I want applied here is to automatically generate the text representation of atoms so I don’t have to do it manually everywhere recursively.

If an atom has a special meaning in some particular context, one has to use the right formatting or library (which would be the json module in this example, as it looks like you want to encode JSON).

williamthome · September 13, 2024, 5:11pm

For comparison with JavaScript (and maybe JavaScript is not a good reference point here), this is what it returns for undefined or null:

> `foo${undefined}`
'fooundefined'
> `foo${null}`
'foonull'

There is no special treatment for them.

nzok · September 16, 2024, 11:19pm

I find the idea of "foo#{nil}#{true}#{:bar}" becoming "footruebar"
rather shocking.
If I used Elixir, and were sufficiently unwary as to use string
interpolation – which is very nearly as bad an idea as the
“billion-dollar mistake” – I can well imagine falling foul of that.

If atoms are ever allowed in I/O lists, there should be no
exceptions. Every atom should be mapped to its characters.
Special cases are the bane of programming.