Iterating through a string and other syntax issue

chrisdel101 · May 16, 2024, 11:24pm

I’m a total beginner to Erlang so I’m getting stuck with some very basic syntax stuff. Apologies for my ineptitude.

I have a string input and I need to get each character. To start with, though, I am not able to log anything: it crashes on the io:format call. This is obviously not a valid approach.


   [H|T] = "UDDDUDUU",
   io:format(H). % sigils didn't seem to help either
   counter(0, [H|T])
{"init terminating in do_boot",{badarg,[{io,format,"U",[{file,"io.erl"},{line,99},{error_info,#{cause=>{device,put_chars},module=>erl_stdlib_errors}}]},{init,start_em,1,[]},{init,do_boot,3,[]}]}}
init terminating in do_boot ({badarg,[{io,format,U,[{_},{_},{_}]},{init,start_em,1,[]},{init,do_boot,3,[]}]})

Further investigation shows this. In learnyousomeerlang it talks about this issue but doesn’t clearly show how to handle it (not clearly to me). I’m not sure if this is the reason it is crashing, the fact that it’s an int and not a char, but it probably doesn’t help.

2> [H|_] = "UDDDUDUU".
"UDDDUDUU"
3> H.
85

So how would I do this? How would I iterate through a string, or convert it to a list of separate values? I’m looking for the conventional way to do this.

The overall gist is something like this, but it doesn’t work. Any attempt to log Head crashes it.

counter(Count, []) ->
    Count;
counter(Count, [Head|Tail]) ->
   % this comparison does not work - how would I do this?
   case Head =:= "D" of
      true -> 
         io:format("D"),
          counter(Count - 1, Tail); 
      false ->  
         io:format("U"),
         counter(Count + 1, Tail)
   end.


start() ->
   [H|T] = "UDDDUDUU",
   counter(0, [H|T]).

jrfondren · May 16, 2024, 11:49pm

The problem is that you’re trying to print a single character where the function requires a string. The error obscures this slightly, as it renders the list of arguments (the length-1 list containing the character you’re trying to print) as a string. The error is also slightly worse than it could be, as you seem to be running this from the command line or a script rather than interactively.

1> [H|T] = "UDDDU".
"UDDDU"
2> H.
85
3> [H].
"U"
4> io:fwrite(85).
** exception error: bad argument
     in function  io:fwrite/1
        called as io:fwrite(85)
        *** argument 1: not a valid format string
5> catch io:fwrite(85).
{'EXIT',{badarg,[{io,fwrite,"U",
                     [{file,"io.erl"},
                      {line,203},
                      {error_info,#{cause => {device,format},
...
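
As a minimal sketch of ways to print that single character: either pass it as an argument to a format string, or wrap it in a list so that it is itself a string. The module name print_char and the function demo/0 are made up for illustration.

```erlang
-module(print_char).
-export([demo/0]).

%% Three ways to print the first character of a string.
demo() ->
    [H|_] = "UDDDUDUU",      % H is the integer 85, which is the same as $U
    io:format("~c~n", [H]),  % ~c prints the character with that code: U
    io:format("~w~n", [H]),  % ~w prints the raw term: 85
    io:format([H, $\n]),     % a list of characters is a valid format string: U
    ok.
```

Note that the last form only works as long as the character is not ~, which io:format would treat as the start of a control sequence.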

Your counter/2 has a similar problem: a single character will never match "D".
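
A small sketch of why that comparison fails, and which comparisons do succeed. The module name char_compare is hypothetical; every match below is expected to hold.

```erlang
-module(char_compare).
-export([demo/0]).

demo() ->
    [Head|_] = "DUD",
    false = (Head =:= "D"),    % integer 68 compared with the one-element list [68]
    true  = (Head =:= $D),     % integer 68 compared with integer 68
    true  = ([Head] =:= "D"),  % wrapping the character in a list also matches
    true  = ($D =:= 68),       % $D is just another way to write 68
    ok.
```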

There’s no separate ‘char’ type that you’re missing: 85 is it. There are some different things you could do that would work:

-module(stairs).
-compile(export_all).
-include_lib("eunit/include/eunit.hrl").

counter_test() ->
    L = "UDDDU",
    -1 = counter(0, L),
    -1 = counter2(0, L),
    -1 = counter3(0, L),
    -1 = counter4(0, L).

counter(Count, []) ->
    Count;
counter(Count, [Head|Tail]) ->
    case Head of
        $D ->
            io:format("D"),
            counter(Count - 1, Tail);
        $U ->
            io:format("U"),
            counter(Count + 1, Tail)
    end. % let this crash if something other than $D or $U comes along

counter2(Count, []) ->
    Count;
counter2(Count, [$D|Tail]) ->
    io:format("D"),
    counter2(Count - 1, Tail);
counter2(Count, [$U|Tail]) ->
    io:format("U"),
    counter2(Count + 1, Tail).

counter3(Count, []) ->
    Count;
counter3(Count, "D"++Tail) ->
    io:format("D"),
    counter3(Count - 1, Tail);
counter3(Count, "U"++Tail) ->
    io:format("U"),
    counter3(Count + 1, Tail).

counter4(Count, List) ->
    Dirs = #{$U => 1, $D => -1},
    lists:sum([Count|lists:map(fun (C) -> maps:get(C, Dirs) end, List)]).

tested:

$ erlc stairs.erl && erl -s stairs test -s init stop -noshell
stairs.erl:2:2: Warning: export_all flag enabled - all functions will be exported
%    2| -compile(export_all).
%     |  ^

  Test passed.
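
One more variation in the same spirit, sketched with lists:foldl/3 instead of explicit recursion. The module name stairs_fold and the helper step/2 are assumptions for illustration, not part of the module above.

```erlang
-module(stairs_fold).
-export([counter/1]).

%% Net position after a walk: each $U adds 1, each $D subtracts 1.
counter(Steps) ->
    lists:foldl(fun step/2, 0, Steps).

%% Crashes with function_clause on anything other than $U or $D.
step($U, Count) -> Count + 1;
step($D, Count) -> Count - 1.
```

With this, stairs_fold:counter("UDDDU") gives -1, the same result the test above expects from the other versions.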

nzok · May 17, 2024, 7:43am

Possibly the biggest question is “what do you mean,
‘iterate over a string’?” In Unicode, this is NOT
a well-defined operation. It’s not just that a
“character” can be a base character followed by
any number of floating diacriticals, it’s worse.

::=

U+303E – Ideographic Variation Indicator [%]
U+2FF0 – Ideographic Description Character (I.D.C.) Left to Right
U+2FF1 – I.D.C. Above to Below
U+2FF2 – I.D.C. Left to Middle and Right
U+2FF3 – I.D.C. Above to Middle and Below
U+2FF4 – I.D.C. Full Surround
U+2FF5 – I.D.C. Surround from Above
U+2FF6 – I.D.C. Surround from Below
U+2FF7 – I.D.C. Surround from Left
U+2FF8 – I.D.C. Surround from Upper Left
U+2FF9 – I.D.C. Surround from Upper Right
U+2FFA – I.D.C. Surround from Lower Left
U+2FFB – I.D.C. Overlaid

is part of it.

Off-hand I can think of four different things you might mean,
and I have a nasty feeling that I’ve forgotten two or three
more. Then there are several variations within each. Do you
want to iterate in display order or phonetic order?

I’ve been planning for years to write a book called “Strings
Made Difficult”, and that was without considering the
manifold weirdness of Unicode.

case Head =:= "D" % code-point =:= list is bound to be false

almost certainly should be

case Head =:= $D % code-point =:= code-point might be true

Why is this not

count(Counter, []) ->
    Counter;
count(Counter, [$D|Tail]) ->
    xxx,
    count(Counter+1, Tail);
count(Counter, [$U|Tail]) ->
    xxx,
    count(Counter+1, Tail).

It seems as if you do not want to iterate over “a(ny) string”
but simply over a sequence of $D and $U characters. This is
not problematic, even in Unicode. For that matter, you could
split the processing into separate “count” and “print” facets
and
Count = length(Us_And_Ds),
io:format("~s", [Us_And_Ds])
noting that io:format/2 wants its second argument to be a list
of things to write.
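
A rough sketch of that split, assuming the goal is the net count (+1 per $U, -1 per $D) rather than the plain length. The module and function names (count_and_print, net/1, delta/1, report/1) are invented for the example.

```erlang
-module(count_and_print).
-export([net/1, report/1]).

%% Pure part: net count, +1 for each $U and -1 for each $D.
net(Steps) ->
    lists:sum([delta(C) || C <- Steps]).

%% Crashes on any character other than $U or $D.
delta($U) -> 1;
delta($D) -> -1.

%% Printing part: io:format/2 takes a format string and a list of arguments,
%% one argument per control sequence.
report(Steps) ->
    io:format("~s has ~w steps, net ~w~n", [Steps, length(Steps), net(Steps)]).
```

Keeping net/1 free of printing makes it easy to test on its own.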

chrisdel101 · May 17, 2024, 3:14pm

By “iterate over a string” I mean get each character individually, in the order in which they appear. What I really want is a list, but I don’t control the input, and I am given a string.

I don’t understand much of what you wrote at this point in my development yet, such as the unicode docs.

case Head =:= "D" % code-point =:= list
What is a code-point?
Strangely, one of your inputs is showing up as a square, ha. Guessing it should be [].

50% of my issues were with not understanding how to use io:format. So thanks for the suggestions.

chrisdel101 · May 17, 2024, 3:26pm

I’ve gotten past the last hurdle. It was mainly an issue with io:format. I have this and it’s running but doesn’t work. The Count does not persist across to the base case call counter(Count, [])


counter(Count, []) ->
   io:format("~w end ~n", [Count]),
    Count;
counter(Count, [Head|Tail]) ->
    case Head =:= 85 of
      true -> 
         io:format("~w head count ~n", [Count]),
          counter(Count - 1, Tail); 
      false ->  
         io:format("~w tail count ~n", [Count]),
         counter(Count + 1, Tail)
     end.
start() ->
   [H|T] = "UDDDUDUU",
   counter(0, [H|T]).

This outputs


0 head count 
-1 tail count 
0 tail count 
1 tail count 
2 head count 
1 tail count 
2 head count 
1 head count 
0 end

0 end should be 1 end. Why is Count being “reset” back to 0, the initial input? By “reset” I mean that I would expect the last value of 1 to be the input. The same as the list input is [] and not “reset” to "UDDDUDUU". I hope this makes sense.

Note: I’m running in replit. That’s what the start function is.

LeonardB · May 17, 2024, 4:02pm

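It’s not resetting anything. You’re printing Count before the update is applied: each line shows the value coming into the clause, and the new value only exists as the argument to the recursive call.

I.e., the Count value coming into the final step is 1; you then subtract 1, which results in 0 being received in the counter(Count, []) case.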

If you modified your io:formats to

io:format("~w head count ~n", [Count - 1])
...
io:format("~w tail count ~n", [Count + 1]),

you’d see:

-1 head count
0 tail count
1 tail count
2 tail count
1 head count
2 tail count
1 head count
0 head count
0 end

chrisdel101 · May 17, 2024, 4:47pm

Doh! You’re right. But how about this? ~~It should be 1, no?~~ ~~Like the actual solution is 1.~~

start() ->
   X =  counter(0,"UDDDUDUU"),
   io:format("~w counterrr", [X]).


0 head count 
-1 tail count 
0 tail count 
1 tail count 
2 head count 
1 tail count 
2 head count 
1 head count 
0 end 
0 counterrr

EDIT: Wait no, disregard that. It should be 1 coming into counter(Count, []) ->. I miswrote. I guess I don’t get it, actually. I must not be understanding how the flow works.
Maybe I need to make sure my logic isn’t wrong, Erlang or not.
EDIT2: So I was wrong. This is correct and is working, I think.

jrfondren · May 17, 2024, 4:54pm

“What I really want is a list, but I don’t control the input, and I am given a string.”

I have good news then: strings in Erlang are lists of characters. You had what you wanted this whole time! Iterating over them is also something you’ve already figured out. I also showed above that you can pass them to other list functions just fine, like lists:map, and that characters like 85 can be written more readably as $U.
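
To make that concrete, here is a short sketch in which every match is expected to succeed. The module name string_is_list is hypothetical.

```erlang
-module(string_is_list).
-export([demo/0]).

demo() ->
    S = "UDD",
    true = is_list(S),                  % a string literal is an ordinary list
    true = (S =:= [$U, $D, $D]),        % ... of characters
    true = (S =:= [85, 68, 68]),        % ... which are just integers
    3 = length(S),
    "DDU" = lists:reverse(S),           % list functions work on it directly
    [1, -1, -1] = [case C of $U -> 1; $D -> -1 end || C <- S],
    ok.
```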

Unicode is a popular topic due to trauma. You can save it for later, and come back if you encounter issues like

I printed a string and the output was completely wrong?
My string literals can have weird characters, but my binary literals are wrong?
I can read some, but not all of the files in this directory?
I have a function that just prints the strings it gets, and it usually works so I know I’m using io:format correctly, but some strings cause it to error? I finally figured out that an emoji of a horse was the problem??

You can be forced to learn about Unicode with many languages. Erlang is only uniquely hurt by its feature of heuristically printing lists of numbers as text:

% started with 'erl'
1> [1090,1077].
[1090,1077]

% started with 'erl +pc unicode'
1> [1090,1077].
"те"

LeonardB · May 17, 2024, 5:41pm

The more idiomatic way would be

counter(Count, []) ->
  io:format("final count: ~w~n", [Count]),
  Count;
counter(Count, [$U | Tail]) ->
  io:format("U :: In ~.2w Out ~.2w~n", [Count, Count - 1]),
  counter(Count - 1, Tail);
counter(Count, [$D | Tail]) ->
  io:format("D :: In ~.2w Out ~.2w~n", [Count, Count + 1]),
  counter(Count + 1, Tail).

Results

ctest:counter(0, "UDDDUDUU").
U :: In  0 Out -1
D :: In -1 Out  0
D :: In  0 Out  1
D :: In  1 Out  2
U :: In  2 Out  1
D :: In  1 Out  2
U :: In  2 Out  1
U :: In  1 Out  0
final count: 0
0

chrisdel101 · May 17, 2024, 5:46pm

So this is working. I was thinking it was doing something it wasn’t this whole time. Apologies for that!
But it’s working now. Here is the final version, and I learned a bunch of new stuff from all these replies!


counter(Count, []) ->
    Count;
counter(Count, [Head|Tail]) ->
   case Head =:= 85 of % this is "U", thought it was D
      true -> 
         io:format("~w Up Count ~n", [Count]),
          counter(Count +1, Tail); 
      false ->  
         io:format("~w Down Count ~n", [Count]),
         counter(Count - 1, Tail)
   end.

Thanks for all the tips

nzok · May 20, 2024, 1:13am

“The Unicode Standard does not define what is and is not a text element in different processes; instead, it defines elements called encoded characters. An encoded character is represented by a number from 0 to 10FFFF₁₆, called a code point. A text element, in turn, is represented by a sequence of one or more encoded characters.” – Unicode 15, chapter 1.

A code-point is a number which stands for a character.
Erlang does not have a character data type.
In a binary, Unicode text is represented using UTF-8,
which represents a sequence of code-points as a sequence
of bytes.
In a string, Unicode text is represented as a list of
code-points.
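
The two representations can be compared with a small sketch like the following. It uses the standard unicode module; the module name codepoints is made up, and the string is written as explicit integers so that the source file encoding does not matter.

```erlang
-module(codepoints).
-export([demo/0]).

demo() ->
    S = [1090, 1077],                     % two code points (Cyrillic "те")
    B = unicode:characters_to_binary(S),  % the same text as a UTF-8 binary
    2 = length(S),                        % two code points in the list
    4 = byte_size(B),                     % four bytes in the binary
    <<209,130,208,181>> = B,              % each code point takes two bytes here
    S = unicode:characters_to_list(B),    % decoding gives the code points back
    ok.
```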

“What the user thinks of as a single character—which may or may not be represented by a single glyph—may be represented in the Unicode Standard as multiple code points” Unicode 15, chapter 2.

And this is where it gets complicated.

I could tell you “a string is a list of character codes,
traverse it just like any other list, using list comprehensions,
higher-order functions, or simple recursive loops.”

But if I did that I would be lying to you.
Everyone who wants to process “Unicode text” in any programming language needs to read the first two chapters of the Unicode standard.
Over and over and over again until it sinks in.

The concept “character” is EXTREMELY fuzzy, even downright ambiguous. One thing that the user thinks of as a “character” might be represented by a sequence of code-points. One code-point that Unicode thinks of as an “encoded character” may correspond to a group of things that the user thinks of as “characters”. The same code-point might be thought of as one thing by some users and two things by other users. There are even more than a few Unicode “encoded characters” that do not correspond to anything that users would think of as characters at all. “Variation selectors” are just one example (which on Unicode principles, arguably should not exist).

So iterating over the code-points in an Erlang “string” is pretty easy; iterating over characters in Unicode is somewhere between ill-defined and rather difficult.
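
As one illustration of the gap between code-points and what a user sees as characters, here is a sketch using the grapheme-cluster functions of the string module (available from OTP 20 onwards, as far as I know). The module name graphemes is an assumption, and grapheme clusters are only one of the possible meanings of “character” discussed above.

```erlang
-module(graphemes).
-export([demo/0]).

demo() ->
    S = [$e, 16#0301, $x],   % "e", a combining acute accent, then "x"
    3 = length(S),           % three code points in the list
    2 = string:length(S),    % two grapheme clusters
    %% A multi-code-point cluster comes back as a nested list.
    [[$e, 16#0301], $x] = string:to_graphemes(S),
    ok.
```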

LostKobrakai · May 20, 2024, 7:10am

“The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)” at tonsky.me: I found this to be a good primer on how Unicode works from the perspective of someone needing to deal with it.

nzok · May 20, 2024, 9:11am

That’s a very nice link, but there’s one questionable thing in it: the complaint that Cyrillic or CJK characters that render differently in different languages are unified.

In original Unicode, unification of scripts was a big deal because it was the only way that CJK characters could possibly be fitted into 16 bits. (Since there are over 97 000 CJK characters in Unicode 15.1, this attempt obviously ultimately failed.)

In January 1999, RFC 2482 https://datatracker.ietf.org/doc/html/rfc2482 introduced an in-line tagging feature, adding a copy of ASCII in plane 14. Basically, you could tag a word as Maori by writing
BEGIN-LANGUAGE-TAG m’ i’ -’ N’ Z’ k u p u BEGIN-LANGUAGE-TAG CANCEL-TAG
where the space-separated things are Unicode encoded characters (so there are 12 “characters” but only 4 “characters”) and the primed characters indicate that they’re actually a copy of printing ASCII in plane 14. Unicode officially adopted this feature.

In November 2010, RFC 6082 deprecated the feature, and Unicode 5.0 followed suit. But wait …

In Unicode 8.0, the feature was undeprecated again, but this time for emoji! You’re not supposed to use tag sequences for language tagging, but who is going to stop you? Especially when you have to handle the tag characters to do emoji right!

Above all, what if you have legacy data encoded during the nearly 11 years when language tags were part of Unicode, there to use if you (thought you) needed them?

So if you are processing Unicode “character by character”, and assuming you are using a library which interprets this as “extended grapheme cluster by extended grapheme cluster”, where do language tags go? Is my example above [mi-NZ tag] [k] [u] [p] [u] [end tag] or [k] [u] [p] [u] or what?

Emoji are not so much a pain in the backside as a rectal cancer on Unicode… Variation selectors to distinguish between black-and-white petrol pumps and coloured ones? Feh!