Problem with non-Latin symbols in Nitrogen element names

sisrahtak · November 29, 2022, 4:54pm

Hello!

I am reading the “Build It With Nitrogen” book right now. I am on page 82 and I have a problem with associates’ names. If I use only English letters for associates’ names, everything is fine.
But if I use a non-Latin first or last name for an associate at a previous stage, Cyrillic “Фырг” for example, then at http://localhost:8000/iam I see this:

⚠ There was an error processing this page ⚠

error:badarg
--------------------------
[{erlang,list_to_binary,
         [[1060,1099,1088,1075]],
         [{error_info,#{module => erl_erts_errors}}]},
 {wf_convert,to_binary,1,[{file,"src/lib/wf_convert.erl"},{line,102}]},
 {element_dropdown,is_selected,2,
                   [{file,"src/elements/forms/element_dropdown.erl"},
                    {line,138}]},
 {element_dropdown,create_option_full,3,
                   [{file,"src/elements/forms/element_dropdown.erl"},
                    {line,115}]},
 {element_dropdown,create_options,3,
                   [{file,"src/elements/forms/element_dropdown.erl"},
                    {line,88}]},
 {element_dropdown,create_options,3,
                   [{file,"src/elements/forms/element_dropdown.erl"},
                    {line,88}]},
 {element_dropdown,render_element,1,
                   [{file,"src/elements/forms/element_dropdown.erl"},
                    {line,27}]},
 {wf_render_elements,call_element_render,3,
                     [{file,"src/lib/wf_render_elements.erl"},{line,158}]}]

I tried to verify the source of the problem:

1> <<1060,1099,1088,1075>>.
<<"$K@3">>
2> <<"Фырг">>.
<<"$K@3">>

Is this a problem with Unicode support in Nitrogen? Should I choose another framework, or is there some way to deal with this kind of problem?

dmsnell · November 30, 2022, 3:51am

@sisrahtak it does look like this is a defect in Nitrogen, and a bummer that it had to be discovered by crashing when entering someone’s name.

The core issue is that erlang:list_to_binary/1 is meant to operate on bytes (integers from 0 to 255), while strings-as-lists hold a sequence of code points: whole numbers which may be too large to represent with a single byte.

<<1060,1099,1088,1075>>.

This illustrates the problem: binaries are byte sequences, not a higher-level text abstraction the way lists are. If you were wondering how it turned Фырг into $K@3, we can inspect how it converts whole numbers (as unsigned integers) into a byte stream. (Your Erlang shell makes some of this harder to see because it displays things that could be text as text.)

1> [<<C:16>> || C <- "Фырг"].
[<<4,36>>,<<4,75>>,<<4,64>>,<<4,51>>]

So each of those input code points takes at least two bytes to represent as an unsigned integer. When we pass such a value to a binary constructor without a size, the segment defaults to 8 bits, so only the lowest-order byte is kept and the higher bits are silently discarded. What are the low-order bytes (the second byte of each pair above) for those four characters?

2> [36, 75, 64, 51].
"$K@3"

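The truncation described above can be reproduced without the shell's string printing getting in the way. This is a minimal sketch; the module name truncate_demo and both function names are made up for illustration and are not part of Nitrogen or OTP.

```erlang
-module(truncate_demo).
-export([low_bytes/1, as_binary/1]).

%% Keep only the low-order 8 bits of each code point. This mirrors what an
%% integer segment of the default size (8 bits) does inside <<...>>.
low_bytes(CodePoints) ->
    [C band 16#FF || C <- CodePoints].

%% Build a binary with one default-sized integer segment per code point.
%% Values above 255 are truncated silently rather than raising an error.
as_binary(CodePoints) ->
    << <<C>> || C <- CodePoints >>.
```

With the code points from the stack trace, truncate_demo:low_bytes([1060,1099,1088,1075]) gives [36,75,64,51], and truncate_demo:as_binary/1 on the same list gives <<36,75,64,51>>, which the shell prints as <<"$K@3">>.
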
If instead we tell the binary to keep all of the bytes from the original values with /binary, then it breaks.

3> <<1060/binary>>. % 1060 is a number, not a sequence of bytes
** exception error: bad argument
     in function  bit_size/1
        called as bit_size(1060)
        *** argument 1: not a bitstring
        …

The solution is to convert between the abstract notion of Unicode code points and the concrete realization of a byte stream which encodes those code points in some standard way; in our case, UTF-8. This is done through unicode:characters_to_binary/1.

3> ActualEncodedUTF8Bytes = unicode:characters_to_binary("Фырг").
<<"Фырг"/utf8>>

4> erlang:display(ActualEncodedUTF8Bytes).
<<208,164,209,139,209,128,208,179>>
true
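
As a rough sketch of how a conversion helper could avoid the crash, the module below wraps unicode:characters_to_binary/1 and handles its error returns. The module and function names (utf8_demo, to_utf8/1, encode_each/1) are hypothetical, and this is not the code from Nitrogen's wf_convert or from the PR. It also assumes that any binary passed in is already UTF-8.

```erlang
-module(utf8_demo).
-export([to_utf8/1, encode_each/1]).

%% Hypothetical helper: turn a string-as-list of code points into UTF-8 bytes.
%% Assumption: a binary argument is already UTF-8 and is returned unchanged.
to_utf8(Bin) when is_binary(Bin) ->
    Bin;
to_utf8(Chars) when is_list(Chars) ->
    case unicode:characters_to_binary(Chars) of
        Utf8 when is_binary(Utf8) -> Utf8;
        %% Invalid or truncated input is reported as a tuple, not an exception.
        {error, _Encoded, _Rest} -> error(badarg);
        {incomplete, _Encoded, _Rest} -> error(badarg)
    end.

%% The same encoding for a flat list of valid code points, using the
%% utf8 segment type of the bit syntax.
encode_each(CodePoints) ->
    << <<C/utf8>> || C <- CodePoints >>.
```

Calling utf8_demo:to_utf8([1060,1099,1088,1075]) yields the same eight bytes shown by erlang:display/1 above, and unicode:characters_to_list/1 turns them back into the original code points.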

Sorry for writing such a lengthy post; human text is hard and almost everyone everywhere gets it wrong.

I’ve filed PR #151 with the project though I’m not sure if it’s the right fix. We’ll see if @gumm can share some guidance.

In the future it may be more helpful to start with an issue/bug report for Nitrogen, or for whichever project is crashing, as projects are probably more likely to offer support for their own code than the general Erlang forums.

Update: Clarified some of the more specific language around code points and unsigned integers when illustrating how we went from the input string to $K@3.

gumm · November 30, 2022, 3:55am

Hi @sisrahtak

You are completely right, this is an oversight in the #dropdown element.

Conveniently, @dmsnell saw this post and has submitted the PR to fix this. I’ll likely be merging this into mainline tomorrow and will post here when it’s done.

Thanks again for reporting this!

sisrahtak · December 1, 2022, 11:28pm

Awesome, thanks for the clarification!

“it may be more helpful to start with an issue/bug report for Nitrogen” – got it.