Is a recursive scanner a bad thing?

raimo · September 25, 2023, 2:37pm

In a recent suggestion: feature: String interpolation by TD5 · Pull Request #7343 · erlang/otp · GitHub, the implementation made erl_scan more or less recursive by implementing an interpolation context stack in the scanner, so whenever an interpolation sequence in a string was encountered, the scanner context was pushed on the interpolation stack and the scanner went back to the toplevel loop.

Interpolation state was popped when an interpolation end character was encountered.

My question is, mainly to computer language guys like @rvirding and @nzok; my first reaction to this is that it is very esoteric and just too weird. But I have no knowledge to back this up.

What say ye? Is this just too weird, or just the way to do it?

I note that allowing only variables in a string interpolation would avoid making the scanner recursive, since variables can be expanded by the parser without the need to re-scan whole expressions…

rvirding · September 25, 2023, 2:59pm

I definitely find baking this into the language weird. IIRC Elixir does it with a special call (IO.inspect ?). How would variables in the interpolation be handled? That would mean that the scanner would have to expand the string to an expression which is evaluated at runtime or handled by the compiler.

I will admit that I am very conservative when adding new stuff to the language so I do really wonder what this gives us.

TD5 · September 25, 2023, 4:21pm

It’s tokenised using a modal scanner to allow arbitrary expressions in a string interpolation. The complexity/overhead just being stacking up the state every time you nest a string interpolation, so it’s cheap to do. The scanner then outputs a stream of tokens which includes markers for the parser to determine where a particular interpolation begins and ends. Notably, if I recall correctly, the scanner proceeds linearly - it doesn’t call itself - but rather, it keeps a bit of context as to the level of nesting in the expression (in my mind, this is the distinction between the scanner itself being modal vs. recursive). In this way, I think it’s similar to how people often implement whitespace sensitivity.

The abstract syntax corresponding to the interpolation is desugared in an early compiler step, rewriting the interpolated string syntax into an expression of nested function calls which build the resulting string.

Here’s someone else’s take on the same topic: When Are Lexer Modes Useful?

starbelly · September 26, 2023, 4:46am

In the case of a simple interpolation, elixir will create a case expression. If it is binary, this is returned from the case expression, other wise delegates to String.Chars.to_string/1.

-module(foo).
-export([bar/1]).

bar(Baz) ->
  <<"This -> ",
      case Baz of
          _@1 when is_binary(_@1) ->
             _@1;
          _@1 ->
            'Elixir.String.Chars':to_string(_@1)
        end/binary,
        " -> eh?">>.

1> foo:bar(<<"that">>).
<<"This -> that -> eh?">>

In asm :

function, bar, 1, 11}.
  {label,10}.
    {line,[{location,"lib/foo.ex",3}]}.
    {func_info,{atom,'Elixir.Foo'},{atom,bar},1}.
  {label,11}.
    {allocate,1,1}.
    {move,{x,0},{y,0}}.
    {test,is_binary,{f,12},[{x,0}]}.
    {move,{y,0},{x,0}}.
    {jump,{f,13}}.
  {label,12}.
    {init_yregs,{list,[{y,0}]}}.
    {line,[{location,"lib/foo.ex",4}]}.
    {call_ext,1,{extfunc,'Elixir.String.Chars',to_string,1}}.
  {label,13}.
    {bs_create_bin,{f,0},
                   0,1,8,
                   {x,0},
                   {list,[{atom,string},
                          0,8,nil,
                          {string,<<"This -> ">>},
                          {integer,8},
                          {atom,binary},
                          2,8,nil,
                          {x,0},
                          {atom,all},
                          {atom,string},
                          0,8,nil,
                          {string,<<" -> eh?">>},
                          {integer,7}]}}.
    {deallocate,1}.
    return.

raimo · September 26, 2023, 1:21pm

I was after opinions on whether a recursive scanner (or modal scanner, that seems to be a variant of the idea that is not explicitly recursive) is something we want or something we should avoid in Erlang.

nzok · September 27, 2023, 10:43am

There’s an old joke about someone boasting
“This feature fills a much-needed gap.”
to which the reply was
“We REALLY needed that gap.”

To my mind, string interpolation is just such an example of
filling a much-needed gap. Consider

Language X: “foo($X) = bar($Y)”
Language Y: <| ps(‘foo(’); pr(X); ps(‘) = bar(’); pr(Y); ps(‘)’) |>

It’s obvious! Language X is better than language Y!

But wait.
Language Y: <| applisti(Map, lambda(K,V)

ps(‘foo(’); pr(X); ps(‘) = bar(’; pr(Y); ps(‘)\n’)
end) |>
Language X: stunned incomprehension.

PROBLEM 1: SMALL FIXED NUMBER.
String interpolation lets you interpolate a small fixed
number of strings (or things printed like strings). It
does not handle options well and it does not handle
iteration well.

String interpolation does not scale with structure.

PROBLEM 2: MISDIRECTION.
String interpolation seduces you into building a string.
I’m reminded of something that happened at Quintus. The
emulator was written in a high level language not entirely
unlike PL/360 in spirit, called Progol. There was a
program that translated Progol into assembly code for, oh
seven different architectures. We needed it to be much
easier to extend to the next five architectures (the x86
was by far the biggest pain) and we needed it to be slower.
The new guy who was given the task was SERIOUSLY in love
with strings. I mean SERIOUSLY. The result was that the
new Progol->assembler translator was bigger, harder to
extend, and markedly slower.

The issue was that there are data structures for processing,
and trees are just BRILLIANT for that. And there are data
structures for communicating with the rest of the world, and
they are called STREAMS. Strings are OK for talking to data-
bases.

Erlang gets a lot of mileage out of saying "hey, you don’t
actually want to construct the string X++Y++X, you just
want to send the characters of X followed by the characters
of Y followed by the characters of Z to the same stream;
why don’t you write [X,Y,Z] and I’ll take care of the rest?

This observation goes back to Smalltalk, which got brilliantly
right something that Java got abjectly wrong. If you want to
be able to convert data structures to character sequences,
what are the building blocks you want?
Smalltalk:
printOn:
where could discard everything, count the
characters, send them to a file, broadcast them to a dozen
other streams, do any of these while logging to another
stream, or build a string.
Java:
.toString()
and if you want this multi-megabyte string sent to a stream,
you have to first build the string, then write it, then clean up.

I’d done benchmarks comparing writing data structures in Smalltalk
and Java, and for all Java has had huge effort put into compiling
it, any decent Smalltalk leaves Java in the dust.

So Erlang HAS a perfectly good, indeed far more powerful and
efficient, alternative to string interpolation, and that’s
iolists.

String interpolation does not scale well with size,
in either time taken or storage turned over.

PROBLEM 3: LITTLE BOBBY TABLES.
Looking at the two libraries I maintain, I see support for

INI files
IBM stanza files
GNUstep / Apple property lists (textual, not XML or binary yet)
JSON
CSV
XML
generating (but not parsing) shell scripts

plus some more specialised things.
For the first two, simply pasting in the characters of the
variable or embedded expression works as well as anything
else. Everything else needs some sort of quoting/escaping,
and they all need DIFFERENT quoting/escaping.

Are you an H.P.Lovecraft or Clark Ashton Smith fan?
Then savour the eldritch horror of generating a PHP file
containing an XSLT script, by hand, then open the dread
portal of interdimensional horror on your quivering brain
by trying to imagine doing it with string interpolation.
Correctly!

And this, of course, is why you can’t stop at just
interpolating variables. Either you extend string
interpolation with a fixed collection of quoting/
escaping rules, e.g., “where X = $(X:SQL)” and
“

$(Body:XmlPCDATA)

”
OR you allow expressions like
“

$(xml:pcdata(Body))

And really,
“

” ++
xml:pcdata(Body) ++ “

”
doesn’t look so bad any more and
[”

“,
xml:pcdata(Body),”

"]
looks downright sexy.

The documentation on my Smalltalk system and library includes
a section on why string interpolation isn’t provided.
“But what happens when you have many kinds of data and
many ways to convert them to text?”
One particularly nasty on is the difference between numbers,
which you want interpolated by using the usual ‘print’
feature of your language, and strings, which you DON’T want
interpolated that way.

String interpolation does not scale to multiple
kinds of escaping and tends to produce insecure code.

Seriously, what NUL-termination is to C strings,
string interpolation is to Ruby and all the other
languages that have it. It reminds me of a Perl
book I had that argued that Perl was a great language
for writing specialised software engineering tools.
It did everything it could with strings and regular
expressions, and in a couple of hundred pages, I do
not believe there was a single program that worked.

There is a Tesla massive power storage battery being
installed in a little town in Queensland. It has 40
container-sized units, each I believe holding the
equivalent of 40 EV batteries. Right now, one of those
units is on fire, sending highly toxic fumes around the
community. The Fire Service and Tesla agree on how you
deal with such fires. Sit and wait for them to go out.

This is a parable. Massive power storage batteries are
a step in the renewable electricity dream. They are
cool tech. They are seductive, powerful. And we do not
know how to make them safe. (Although everything we DO
know about how to make bomb storage sites safe appears
to have been ignored in this case.)

String interpolation is cool tech. It’s seductive,
powerful, and dangerously toxic if misused, which it
very easily is.

ANY complexity added to a tokeniser to support string
interpolation is like effort put into the gun you are
planning to shoot your foot off with.

Having said that, once you allow string interpolation
into your programming language, you are pretty much
forced to allow at least function calls so that addicts
aren’t FORCED to make Little Bobby Tables mistakes.

nzok · September 27, 2023, 10:46am

“we needed it to be slower” should of course have read
“we needed it to be FASTER”. Slower is what we got,
not what we needed.

For the last 30 or so years I’ve been intending to write
a book called “Strings are Wrong.” Then when Unicode came
along the working title became “Strings Made Difficult”.
Whatever it ends up being called, string interpolation needs
its own chapter.

nzok · September 27, 2023, 11:01am

No, you don’t want a modal or recursive scanner.

There’s a language I’ve been working with, and I
wondered if it was feasible to provide the entire
library in the executable. So I wrote a compressor.
Overall, using a language-specific compressor with
bzip2 as followup, there’s a compression factor of
12, and yes, on today’s machines, it’s surprisingly
practical to include quite a large library in the
executable, and – what I was wondering about – it
would even make sense on a Raspberry Pi Pico.

How’s that relevant? Had the language in question
had string interpolation with nested expressions
so that string → expression → string → expression …
was possible, I just would not have bothered
conducting the experiment. “Too hard”, I’d have said.

String interpolation adds nothing of value to Erlang.
Erlang isn’t Ruby or Python and shouldn’t try to be.

rvirding · September 27, 2023, 8:35pm

Sorry I got carried away there.
I think I would prefer not to have a recursive scanner. But that could be just because I am not used to one.
How would it work with scanner generators like leex?
Could you generate a regexp for things like this where you are actually including code which needs to be parsed.
When does the checking of the expression occur? When scanning or when parsing?

nzok · September 28, 2023, 4:32am

Scanner generators like leex are based on regular expressions.
Regular expressions cannot count.
Given something like “xxx$(yyy)zzz” or whatever the exact
syntax is, (yyy) has to have matching brackets, and regular
expressions cannot do that. (Oh the FUN you can have when
there are syntax errors in the embedded expressions! Almost
as much fun as nailing your hand to the wall.)

TD5 · September 29, 2023, 11:23am

I think modal scanners are essentially the standard way to make string interpolation possible to parse. Most languages that come to mind do it like that (Java, C#, etc.), although since they’re OO languages, they written in a more recursive style (they spin up a new lexer within each interpolated expression). I think this is more a stylistic concern than anything else, since it’s stacking up state like we are here, just doing so implicitly in the call-stack.

How would it work with scanner generators like leex ?

I think the solution for lex-like scanners is to either enable re-entrancy (i.e. a recursive scanner), or use buffers (i.e. have modal scanner).

When does the checking of the expression occur? When scanning or when parsing?

None of this changes: The structure of the expression is checked in the parser, with the scanner only validating low-level, character-by-character issues, i.e. where the next k characters don’t correspond to any token in the language. Arguably, this is why this approach is attractive: The scanner still just efficiently streams through the characters using a small window of size k (N.B. k = 3 - that hasn’t changed either) and the parser handles interpreting the structure from that.

TD5 · September 29, 2023, 11:32am

On this point, in the way it’s done in the reference implementation, there’s nothing corresponding to string → expression → string → expression. It’s all just expressions in expressions as it already is. Just as you can have the sub-expressions 2 and 3 in the expression 2 + 3, you can have sub-expressions bf"Two plus two equals ~, 2 + 2, ~!" in the expression bf"Two plus two equals ~2 + 2~!". Notably, it is not modelled as a string which is later divided up into some constant bits of strings and some expressions, but rather the various components are modelled as first-class tokens and, later, (sub-)expressions like anything else.

nzok · September 30, 2023, 6:27am

Concerning string → expression → string → expression,
the following is a legal Ruby expression of EXACTLY that form:
“a #{if “#{2}” == “3” then 1 else 2 end} c”.

What on earth is the point of saying that
“there’s nothing corresponding to string → expression → string → expression”
when such things manifestly exist and are legal IF you allow
expressions to be interpolated into strings?

I am not talking about how any particular parser processes such
a construct. I’m talking about how a human being thinks about it
when writing such a thing and how another human being has to
struggle with it when trying to read it.

You can have a syntactic construct defined by precisely arranged
spaces and tags and the parser may be thrilled to the very socks
to process something so simple, but it’s not good for people.

String interpolation is something that ANY programming language is
better without. As Alan Perlis put it, “The string is a stark data
structure and everywhere it is passed there is much duplication of
process. It is a perfect vehicle for hiding information.”

He was talking about information you didn’t want hidden.

Why does anyone want string interpolation in Erlang?
What are the use cases?
Can we see some examples, and discuss what the advantages and
disadvantages are?

raimo · September 30, 2023, 6:42pm

I can stand on the interpolation side of the fence for a while…

The best use cases I have seen for string interpolation are, like @nzok already mentioned creating SQL queries and other “modern” text base protocols a’la Rest ones.

Today, since we have I/O lists, let’s say we create some useful stringifying/utf8-binaryifying helper function(s) s/1.

We can for example create a query with io_lib:format/2:

Q = io_lib:format("SOME PROTOCOL QUERY(~s, ~s);", [s(Id), s(Param)]),

That looks nice and readable, but when the number of arguments gets bigger and maybe has varying format characters, the variable names gets very far from the format specifier in the string, so it gets cumbersome to keep the format string in sync with the value list.

Using I/O lists we can do like this:

Q = ["SOME PROTOCOL QUERY(",s(Id),", ",s(Param),");"],

For large query strings with many variables that actually gets easier to maintain, simply because the variables are in their right position in the query string.

Enter string interpolation:

Q = ~i"SOME PROTOCOL QUERY(~s(Id)~, ~s(Param)~);",

The reason to want expression interpolation I guess is to be able to do simple things like that call to a formatting function, and for stuff like this, say for logging:

log("That failed because H: ~round(H + 0.5)~ > 10"),

I’d say that in most cases io_lib:format/2 can do the job, in other cases I/O lists and helper functions are enough, but for the combination of large strings, and many maybe awkward parameters, then string interpolation looks like an attractive feature.

But also in this case, I question if expression string interpolation is what we really want. The little brother variable string interpolation should get us far enough without going down the rabbit hole of having to understand a recursive expression string interpolation that only a parser can love.

What’s wrong with e.g Bash style variable expansion?

Q = ~i"SOME PROTOCOL QUERY($Id, $Param);",

I ignored the formatting helper, that can be fixed:

Q = ~i"SOME PROTOCOL QUERY(${Id%s}, ${Param%s});",

assuming that we allow simply a formatting helper function name after %. We can of course choose other interpolation characters and delimiters. We can have a default stringifying function that handles the values as suggested in the EEP for string interpolation, then the need for specifying a formatting helper would be small.

It is also possible to throw in io:format/2’s format character sequences in this mix, but my point is that we could get far enough with string interpolation of variable names (+ formatting function name).

And that can be done in the parser or in a parse transform, without a recursive or modal scanner. But having a Sigil to mark which strings that shall be interpolated would be a welcome scanner feature.

williamthome · October 1, 2023, 2:08am

Introducing the sigil in Erlang sounds a bit weird to me, but it also can bring some new possibilities to the language.
What I see as an alternative is to just expand strings inside a backtick to the io_lib:format/2 function. For example, ~s[Foo] would be expanded to io_lib:format("~s", [Foo]). The arguments are defined next to the control sequence inside brackets as a list and then these arguments are moved to the io_lib:format/2 arguments. Maybe the backtick string can always return a UTF-8 binary as suggested in the triple-quoted string.

A dirty example:

`Lorem ipsum ~p[Term] "dolor" sit 'amet' ~Kp[reversed, Map], consectetur adipiscing ~*.*.0f[9, 5, Float]` = io_lib:format("Lorem ipsum ~p \\\"dolor\\\" sit 'amet' ~Kp, consectetur adipiscing ~*.*.0f", [Term, reversed, Map, 9, 5, Float]).

What the scanner should do is keep the string without the brackets, collect the arguments inside these brackets, and use the raw string as the first argument and the flattened arguments list as the second argument of the io_lib:format/2 function.

Another example:

% If instead of this
X = 1,
Y = 1,
iolist_to_binary(io_lib:format("The sum of ~p and ~p is ~p.", [X, Y, X+Y])).

% We could do this:
X = 1,
Y = 1,
`The sum of ~p[X] and ~p[Y] is ~p[X+Y].`.

I believe this can help to make the code more readable and implement the interpolation in a simple way, also we continue having all the control sequences available.

Maybe what I said does not make too much sense, but maybe can bring some good discussion or other alternatives

eproxus · October 1, 2023, 8:53am

Would ~i"SOME PROTOCOL QUERY(~s(Id)~, ~s(Param)~);" be as fast as ["SOME PROTOCOL QUERY(",s(Id),", ",s(Param),");"] or would building IO lists still be faster for bigger queries?

starbelly · October 1, 2023, 5:16pm

I’m feeling this, I’m not against the other, but I’d nee to do a fair bit of research myself, thus staying out of it

That said, I think what I hear you saying is similar to how rust does string interpolation, or rather how it doesn’t. I do believe they use macros such that when writing code it looks like string interpolation, but it all just comes down to printf if you will.

let (apples, bananas) = (4, 3);
// println! captures the identifiers when formatting: the string itself isn't interpolated by Rust.
println!("There are {apples} apples and {bananas} bananas.");

I also had a strange idea for making format more pleasant to work with with the above aside. I’ve always found the syntax for formatting not the easiest thing to remember, so there’s that (no, I don’t know what something better would like right now). Also, what if we had support for named parameters?

io_lib:format("~s[:this], ~s[:that]", %{this => <<"eh?">>, that => "42"}).

Haven’t given enough though to as to how funky this may also make things Strange and random ideas are sometimes useful though!

williamthome · October 1, 2023, 6:46pm

I created a lib that does exactly what I said: GitHub - williamthome/interpolate: An Erlang library to interpolate strings.
I haven’t tested it a lot and don’t know how useful (or weird) it can be, but was worth it for fun.

A VSCode extension is also available to the syntax-highlighting: Erlang Interpolate - Visual Studio Marketplace

Some examples from the test module:

-module(interpolate_test).

-include("interpolate.hrl").
-include_lib("eunit/include/eunit.hrl").

interpolate_test() ->
    X = 1,
    Y = 1,

    Map = #{a => a, b => b},

    Float = 3.14159265,

    Result0 = ?f("The sum of ~p[X] and ~p[Y] is ~p[X+Y]"),
    Result1 = ?f("~Kp[reversed, Map]"),
    Result2 = ?f("~p[[1,2,3]]"),
    Result3 = ?f("¥£€$¢₡₢₣₤₥₦₧₨₩₪₫₭₮₯₹"),
    Result4 = ?f("😊"),
    Result5 = ?f("~*.*.*f[9, 5, $*, Float]"),

    [
        ?assertEqual(<<"The sum of 1 and 1 is 2"/utf8>>, Result0),
        ?assertEqual(<<"#{b => b,a => a}"/utf8>>, Result1),
        ?assertEqual(<<"[1,2,3]"/utf8>>, Result2),
        ?assertEqual(<<"¥£€$¢₡₢₣₤₥₦₧₨₩₪₫₭₮₯₹"/utf8>>, Result3),
        ?assertEqual(<<"😊"/utf8>>, Result4),
        ?assertEqual(<<"**3.14159"/utf8>>, Result5)
    ].

Sorry if my comments are off-topic

raimo · October 2, 2023, 11:24am

~i"SOME PROTOCOL QUERY(~s(Id)~, ~s(Param)~);" could be defined to be transformed into an I/O list by the parser.

We could also add new formatting functions to io_lib, helpers that create small binaries, to be used in I/O lists. We could also add a function a’la io_lib:bformat/2 that creates a binary, or an I/O llst instead of a deep char list.

These are optimization details, somewhat unrelated to the question about if and if so how to implement string interpolation.

raimo · October 2, 2023, 11:37am

The problem with `The sum of ~p[X] and ~p[Y] is ~p[X+Y]` is with the third interpolation, what the tokenizer sees as the string segment "X + Y". To figure out what expression the string "X + Y" is, it has to be scanned and the tokens from that scan have to be inserted in the token sequence. So the tokenization becomes recursive according to some definition.