Adding atoms to the definition of IO data?

elbrujohalcon · September 17, 2024, 7:38am

I believe that the issue with undefined is more of a Let it Crash! thing than a problem with how it should be encoded (as other comments point out).

Let me see if I can show an example:

iolist_to_binary([<<"user ">>, "name: ", db:get_name(UserId)]).

Currently, that expression will either…

return <<"user name: ", TheUserName/binary>> if db:get_name/1 returns a string; or…
crash if db:get_name/1 returns undefined, not_found, or any other thing, instead of the name.

I’m not promoting writing code like that (certainly not), but… I imagine that someone may use that in this context:

try iolist_to_binary([<<"user ">>, "name: ", db:get_name(UserId)])
catch
    _:_ -> <<"user not found">>
end.

If we start allowing atoms in iolists, that code will return <<"user name: undefined">> (or not_found, or whatever)… which will be… unexpected.

I think that might be a backwards compatibility issue.
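
A minimal sketch of that scenario, for anyone who wants to try it. The module name and the local get_name/1 stub are hypothetical stand-ins for db:get_name/1, and the catch clause is narrowed to error:badarg to make the dependency explicit:

```erlang
-module(name_lookup_sketch).
-export([describe/1]).

%% Hypothetical stand-in for db:get_name/1: returns a string for a known
%% user and the atom undefined for everything else.
get_name(1) -> "alice";
get_name(_) -> undefined.

describe(UserId) ->
    try
        iolist_to_binary([<<"user ">>, "name: ", get_name(UserId)])
    catch
        %% Today an atom inside the list raises badarg, so we end up here.
        error:badarg -> <<"user not found">>
    end.
```

With the current behavior, describe(2) takes the catch branch. If atoms were accepted in iolists, the same call would presumably stop raising, and the fallback would never run.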

mmin · September 17, 2024, 8:15am

I agree that backward incompatibility is a valid argument against this feature, but I don’t agree that undefined is a special case - someone may use any atom with this intention and someone may use undefined as a valid value.

eproxus · September 17, 2024, 11:55am

elbrujohalcon:

I’m not promoting writing code like that (certainly not), but… I imagine that someone may use that in this context:
try iolist_to_binary([<<"user ">>, "name: ", db:get_name(UserId)])
catch
    _:_ -> <<"user not found">>
end.
If we start allowing atoms in iolists, that code will return <<"user name: undefined">> (or not_found, or whatever)… which will be… unexpected.

I think that might be a backwards compatibility issue.

I thought about the backwards compatibility issue, but personally came to the conclusion that catering to code that relies on badargs from iolist_to_binary/1 to determine if you got “bad” data (for some arbitrary project-specific definition of “bad”) is not realistic. In my opinion, it is not really breaking backwards compatibility because it was never part of the official API in the first place.

Taken to its extreme, the argument can be made that we can’t add any new functions anywhere, because people might rely on undef errors as part of their official application logic.

Well said. If this change is made, this is exactly how I think it should work. The contract fulfilled here should be “take a nested list of Things and produce a concatenation of their string representations.” The atom undefined has a string representation already:

1> atom_to_binary(undefined, utf8).
<<"undefined">>

No need to overload it with meaning that will differ between applications anyway.
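
For illustration, that contract can already be approximated in user code. This is only a sketch under my own assumptions (the module and function names are made up, and UTF-8 is hard-coded as the atom encoding), not what a built-in implementation would look like:

```erlang
-module(atom_iodata_sketch).
-export([to_binary/1]).

%% Replace every atom with its UTF-8 text, then flatten as usual.
to_binary(Data) ->
    iolist_to_binary(expand(Data)).

%% Walk the (possibly improper) list structure of the input.
expand([H | T]) -> [expand(H) | expand(T)];
expand([]) -> [];
%% Every atom is mapped to its characters, with no special cases.
expand(A) when is_atom(A) -> atom_to_binary(A, utf8);
%% Binaries and integers are left for iolist_to_binary/1 to handle.
expand(Other) -> Other.
```

Note that integers are still passed through untouched, so they keep their meaning as raw bytes.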

jhogberg · September 17, 2024, 1:14pm

eproxus:

nzok:

If atoms are ever allowed in I/O lists, there should be no
exceptions. Every atom should be mapped to its characters.
Special cases are the bane of programming.

Well said. If this change is made, this is exactly how I think it should work. The contract fulfilled here should be “take a nested list of Things and produce a concatenation of their string representations.” The atom undefined has a string representation already:
1> atom_to_binary(undefined, utf8).
<<"undefined">>
No need to overload it with meaning that will differ between applications anyway.

On the other hand, integers are already a special case. I find it kind of odd that atoms are to be included as text in a certain encoding, but that integers do not get the same treatment, nor is the final result guaranteed to conform to that encoding.

One could just as easily argue that iolist_to_binary(X, utf8) should treat all integers as Unicode code points (in a warped sense, they are already treated as Latin-1 code points), or that integer_to_binary/1,2 should be used for encoding them rather than interpreting them as bytes.

The more I think about it the more I’m with @garazdawi: if we’re going to do this, it makes more sense to expand unicode:chardata/0 instead. Let text be text and bits be bits.

Edit: On second thought, unicode:chardata/0 is not a great place to have it either, since the string module and friends use it and would become much more complicated and difficult to optimize. I’m leaning towards a no. :-\

mmin · September 17, 2024, 5:28pm

But wouldn’t it be great if (in a parallel universe) iolist_to_binary([45]) returned <<"45">>? Because of how strings are implemented we can’t make that work, and we live without it. But if we could, would we add integers to iolists?

jhogberg · September 18, 2024, 7:48am

I’d say no, iolists are supposed to contain raw “bytes,” so its current behavior makes sense. It’s when we start interpreting them as containing text that things become rather arbitrary.
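
A small sketch of that current behavior, using only standard BIFs (the module name is made up for the example):

```erlang
-module(iolist_integers_sketch).
-export([demo/0]).

demo() ->
    %% An integer in an iolist is one raw byte: 45 is the byte for "-".
    <<45>> = iolist_to_binary([45]),
    %% Rendering the number as decimal text is a separate, explicit step.
    <<"45">> = integer_to_binary(45),
    %% Values above 255 are not bytes, so iolist_to_binary/1 rejects them.
    badarg = try iolist_to_binary([960]) catch error:badarg -> badarg end,
    %% The unicode module instead treats integers as code points.
    <<207,128>> = unicode:characters_to_binary([960]),
    ok.
```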

eproxus · September 19, 2024, 5:18pm

I don’t think it’s so odd, because integers don’t have a single 1:1 text representation (let’s not even talk about floats). How integers should be rendered into text is highly dependent on the application. Should 123456789 be rendered as ”123456789”, ”123 456 789”, ”123,456,789.00” or something completely different?

Atoms already have a natural text representation: themselves! An atom is its own value, and that value is text (from a language point of view we don’t know that they are integers in memory under the hood; that’s an implementation detail that’s not really exposed in the VM).

jhogberg · September 19, 2024, 7:28pm

Which is sort of my point: atoms as text makes sense if we’re dealing with text, but as soon as we say that we’re dealing with text, what we do with integers becomes completely arbitrary.

eproxus · September 19, 2024, 9:51pm

If I understand you correctly, you are saying that IO data is considered ”bytes” not ”text” and in that context integers mean byte values. And that allowing something new that only has ”text” meaning (but not strictly ”bytes” meaning) muddles that distinction?

I guess I can see the point there, but at the same time the IO data syntax already has a lot of allowances and shorthands for generating text. In my experience, that use case is a lot more common than for generating ”pure” byte data. Using the binary syntax is a lot more convenient for that.

Where would one draw the line here? To play devil’s advocate a bit, would you also consider it a ”mistake” to already allow syntax like "some text, not bytes", <<"obviously text, not bytes">> and $a (which is a letter, not a byte) in IO lists?

If we already allow [<<"a"/utf8>>], "a" and [$a] which are all shorthands for the raw byte sequence 97, what would be so different with allowing [a]?

jhogberg · September 20, 2024, 6:14am

More or less, but particularly that it affects the entire iolist and implies that the entire thing is text. What should iolist_to_binary(["björk", mört], utf8) mean?

Simple: you reason about "some text, not bytes" as a list of bytes that just so happen to be ASCII code points, not as a string.

Ambiguity. How should [å, ä, ö] be encoded? [<<"å"/utf8, "ä", "ö"/utf8>>] explicitly specifies the encoding, leaving no ambiguity. $ä is the byte 228, leaving no ambiguity either.

eproxus · September 20, 2024, 7:33am

Hmm, I think I’m beginning to see your point. Let’s take some other examples (from the Erlang documentation):

"√π"
~"√π"
'√π'
[$√, $π]

Of these, only #2 is allowed in IO lists (even though it is not very “explicit”, but I guess ~"..." implies UTF-8, so one could argue the user “asked for it”). #1 and #4 are not allowed because the byte values are higher than 255, and #3 is not allowed because it is neither a list, a binary, nor a byte.
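
A quick way to check this, sketched with code points instead of literals so it doesn’t depend on the source file encoding. The module and helper names are made up; <<.../utf8>> stands in for the ~"..." sigil, and list_to_atom/1 with code points above 255 assumes OTP 20 or later:

```erlang
-module(sqrt_pi_sketch).
-export([accepted/1, demo/0]).

%% Returns true if iolist_to_binary/1 accepts the term, false on badarg.
accepted(Term) ->
    try iolist_to_binary(Term) of
        _ -> true
    catch
        error:badarg -> false
    end.

demo() ->
    Sqrt = 16#221A,  % code point of the square root sign
    Pi = 16#03C0,    % code point of the Greek small letter pi
    [accepted([Sqrt, Pi]),                 % #1 and #4: the same list of code points
     accepted(<<Sqrt/utf8, Pi/utf8>>),     % #2: UTF-8 encoded binary
     accepted(list_to_atom([Sqrt, Pi]))].  % #3: atom
```

Calling demo() should give [false, true, false].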

The inefficiency (and implementation burden) I was hoping to avoid in the current state is that the syntax ~"√π" (or the more explicit <<"√π"/utf8>>) hard-codes the desired encoding at compile time, whereas atom_to_binary('√π', utf8) has to be executed at runtime, with extra overhead in performance and in code complexity.

For binaries on the shared heap vs. atoms in the atom table, the minimal difference at runtime is traversing the IO data and going to one area of memory or another to fetch basically identical data (as can be seen in your prototype).

eproxus · September 20, 2024, 7:41am

But you coouuld argue that ~"√π" and '√π' both state a desired intention to use UTF-8 which is equally implicit/explicit.

jhogberg · September 20, 2024, 7:43am

Until Latin-1 enters the picture, at which point it gets annoying because the stored data is UTF-8 and we need to convert it during traversal.

That’s just an annoyance, though, and we could live with that if this was something we wanted, but I don’t think that we do: let bits be bits.

Is 'ö' Latin-1 or UTF-8?
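
To make the question concrete, here is a sketch of the two possible answers (the module name is made up, and the atom is built from its code point to sidestep source file encoding):

```erlang
-module(atom_encoding_sketch).
-export([demo/0]).

demo() ->
    Atom = list_to_atom([246]),  % the atom 'ö', written as a code point
    %% The same atom yields different bytes depending on the encoding
    %% requested: one byte as Latin-1, two bytes as UTF-8.
    {atom_to_binary(Atom, latin1), atom_to_binary(Atom, utf8)}.
```

demo() returns {<<246>>, <<195,182>>}, so an atom inside an iolist would need some rule for picking one of them.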

eproxus · September 20, 2024, 8:00am

Right, it’s not only about the desired output format but what was intended with the input as well.

Okay, let’s shift the discussion a bit. Is there any other realistic direction to take this idea? Extending unicode:chardata/0 was mentioned, but would lead to other (arguably greater) complexity. So close, but so far away.

maxlapshin · September 20, 2024, 10:25am

and it may be more convenient to have:

iex> ["foo", nil, true, :bar] |> IO.iodata_to_binary()
"footruebar"

nzok · September 20, 2024, 12:08pm

You know, the Smalltalk system I’ve been working on for way too long REALLY dodged a grenade.
I made one decision early on: encodings are an issue ONLY at external interfaces.
Inside a running program, characters is characters is characters.
Unicode is painful enough to deal with without having to think about encodings within your program.