Feature: Heredocs / Triple-quoted text

Unfortunately, Erlang has, as far as I can tell, always had string pasting without mandatory white space, so it did not follow the Prolog syntax on this point.

This also means that it has never interpreted a doubled double-quote character ("") inside a string as a quote character, but instead as the end of one string and the start of another, which then get pasted together.

The argument that the scanner is eager (that I also have used myself) is not really applicable for strings, but more for tokens like an operator or an unquoted atom.

For a string or a quoted atom, instead, as soon as the end delimiter is found, the token is emitted, and scanning continues on the next character.
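To make that scanning rule concrete, here is a minimal C sketch (not the actual Erlang scanner; `scan_string` and `count_adjacent_strings` are names made up for illustration). It ends each token at the first unescaped closing quote and resumes on the very next character, which is why "foo""bar" comes out as two string tokens that then get pasted.

```c
#include <stddef.h>

/* Scans one "..." token starting at s[0] == '"'.
   Returns the offset just past the closing quote, or 0 if unterminated.
   Illustrative only: escapes are skipped, not interpreted. */
size_t scan_string(const char *s) {
    size_t i = 1;
    while (s[i] != '\0') {
        if (s[i] == '\\' && s[i + 1] != '\0') {
            i += 2;                 /* skip the escaped character */
        } else if (s[i] == '"') {
            return i + 1;           /* end delimiter found: token ends here */
        } else {
            i++;
        }
    }
    return 0;
}

/* Counts string tokens that follow each other with no white space.
   Returns -1 if a string is unterminated. */
int count_adjacent_strings(const char *s) {
    int tokens = 0;
    while (*s == '"') {
        size_t len = scan_string(s);
        if (len == 0) return -1;
        tokens++;
        s += len;                   /* scanning continues on the next character */
    }
    return tokens;
}
```

Under this rule, `"""foobar"` is an empty string followed by `"foobar"`, so it also counts as two tokens.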

A triple-quoted string has a flexible start delimiter in that it can be any number of " characters (terminated by white space), and when the string content is scanned it is known exactly how many " characters constitute the end delimiter. As soon as it is found, the token is emitted and scanning continues on the next character.
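A rough C sketch of that delimiter rule, under loud assumptions: it only counts the opening run of " characters and looks for the first later run of at least that many, and it deliberately ignores the layout rules (line breaks, indentation) of the actual triple-quoted string proposal. `quote_run` and `triple_quoted_end` are hypothetical names.

```c
#include <stddef.h>

/* Length of the run of '"' characters at the start of s. */
size_t quote_run(const char *s) {
    size_t n = 0;
    while (s[n] == '"') n++;
    return n;
}

/* If s starts with 3 or more '"' followed by white space, returns the
   offset just past the matching end delimiter; otherwise returns 0.
   Simplification: any run of at least n quotes ends the token, and
   only the first n of them belong to the delimiter. */
size_t triple_quoted_end(const char *s) {
    size_t n = quote_run(s);
    size_t i;
    if (n < 3) return 0;
    if (s[n] != ' ' && s[n] != '\t' && s[n] != '\n' && s[n] != '\r') return 0;
    i = n;
    while (s[i] != '\0') {
        size_t run = quote_run(s + i);
        if (run >= n) return i + n;   /* token ends; scanning resumes at i + n */
        i += run > 0 ? run : 1;       /* shorter runs are ordinary content */
    }
    return 0;                         /* no end delimiter found */
}
```

A run of three quotes inside a string opened with four is therefore just content, because the scanner already knows the end delimiter must be four quotes long.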

Thus, strings and triple-quoted strings, according to the above, as implemented on the ‘master’ branch for OTP-27.0, behave consistently.

Although I think that strings pasted without white space are not useful and hard to read, unfortunately, when I implemented a warning for that, it immediately blew up in our test suites. If we have done it (by accident or whatnot) then others have too.

The warning for """ starting a string that is now present in OTP-26.1 is essential because that will change meaning in OTP-27.0 when triple-quoted strings are introduced. I have not heard of any problems due to this, so apparently pasting an empty string to another string without intervening white space is not a common pattern.

If """ ends a string today, that will have the same meaning in OTP-27.0, therefore a warning for this is not essential.

Although I would love to make string pasting without white space an error (warning in OTP-26.2, error in 27.0), it seems to be too backwards incompatible even for our own test suites’ code.

Pasting the empty string to another string seems to be a less common pattern, but if we have to allow "" to end+start+paste two strings, it would be weird not to allow that for the special case of the empty string, as in """ ending a string.

All in all, it seems to me that the best we can do is to keep the behaviour for triple-quoted strings (and regular strings) that is currently on ‘master’, to be released in OTP-27.0: end the string at the end delimiter and continue scanning on the next character. This allows pasting any string (including the empty string) to the end of a string, without white space.


Hmm. A quick trawl through /usr for *.erl files found 1427 files.
The only occurrences of “” I could find therein were
“”
<<“”>>
“…"”
I didn’t find anything that a ‘“”’-in-“string” checker would have
complained about. Why would this be common in your test suite?

I don’t think it is common. I turned it into a compilation error (on master), and that failed our daily tests. So there was at least one occurrence. It was some code that concatenated a file name à la "name"".ext". I cannot find it in the code now; it might be done through a macro.

The """ warning I have never seen trigger, but the concatenation error did, so the latter should be more common. And I got scared when it triggered in our test suites’ code.

Edit:
It’s in prim_file_SUITE.erl, file_info_basic_file/1: "_basic_test"".fil"

I dug through some other files,
and I believe that reporting adjacent double quotes
would FIND ERRORS in existing code bases.
Here’s the line that convinced me:
w(" {-1, 0, ""}~n };~n",),
The clear intention of this was to generate
{-1, 0, ""}
not
{-1, 0, }

In this code body, there were other occurrences of "" in
a string that made no sense to me. For example,
join(…, Foo ++ "base"".ext")
If this was originally
join(…, Foo ++ "base" ++ ".ext")
then why bother changing it, and why not convert to "base.ext"?
This is something that SHOULD be checked in a code review, and
it clearly wasn’t because nothing reported it.

Rather than wait for consensus to be reached on this,
I’ve written a little C program to check and run a few thousand
Erlang source files through it. Enjoy.

(Attachment dq.c is missing)


There are a number of problems with triple-quoted strings,
or here-docs, or anything else whose objective is to embed
substantial amounts of text in notation X into a source
file written in notation Y.

They all boil down to “notation X can’t be expected to follow
ANY of the rules of notation Y, notation Y can’t be expected
to know ANY of the rules of notation X, and (with honourable
exceptions) the notation Y processor doesn’t even know what
notation X is.”

It has been 44 years since I last used a keypunch.
It has been 38 years since I last used a compiler that produced listings.
It has been 30 years since I last used an editor that couldn’t
do “take me to the file whose name the cursor is pointing to.”
The idea of putting foreign text in a file is SO last century.

So here’s the idea.
"\C<relative file name>"
stands for the characters of <relative file name>
and
"\B<relative file name>"
stands for the bytes of <relative file name>,
allowing <<"\Bfoobar.bin">>.

Notation X goes in a notation X file, where it can be

  • displayed by tools aware of notation X syntax
  • checked by tools aware of notation X semantics
  • maintained independently of the ‘containing’ file
  • a single source for inclusion by multiple files
    if wanted
  • visited easily by a single editor command
  • of any length that is allowed in a string literal,

leaving notation Y

  • easy to process using regular expressions (as N > 2
    opening quotes … N closing quotes literals are NOT)
  • uncluttered.

I see “attachment dq.c is missing”.
How do I provide dq.c for others to use?

My abject apologies. I am reading and writing this stuff in gmail.
Each message I send comes back to me in two copies; one of them as
I wrote it and the other most offensively mangled. The latest
botchup by the insane mangler took
double quote, backslash, capital C, less than, r,e,l,a,t,i,v,e,
,f,i,l,e, ,n,a,m,e, greater than, double quote
and erased “relative file name” and the surrounding angle brackets.
The whole point of the idea is that you have

  • a string literal
  • whose first two characters are a backslash and a capital C for
    characters or a capital B for bytes
  • followed by a relative file name (using slash, not backslash)
  • and not containing anything else.

Since it is a string literal, it follows the same lexical rules
as other string literals and can easily be processed by tools written
in a wide range of programming languages.
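As a rough illustration of how little a tool would need, here is a C sketch that recognises such a literal from the raw text between the quotes. The helper name `include_kind` and the choice to reject absolute paths are assumptions made for illustration, not part of any existing implementation.

```c
#include <stddef.h>
#include <string.h>

/* body is the raw source text between the double quotes, escapes not
   yet processed. Returns 'C' (characters) or 'B' (bytes) and stores a
   pointer to the relative file name in *path if body has the form
   \C<relative file name> or \B<relative file name>; returns 0 otherwise.
   Hypothetical helper, for illustration only. */
char include_kind(const char *body, const char **path) {
    if (body[0] != '\\') return 0;
    if (body[1] != 'C' && body[1] != 'B') return 0;
    if (body[2] == '\0' || body[2] == '/') return 0;  /* empty or absolute */
    if (strchr(body + 2, '\\') != NULL) return 0;     /* slash, not backslash */
    *path = body + 2;
    return body[1];
}
```

For example, the body of "\Bfoobar.bin" yields 'B' with the path foobar.bin, while an ordinary string body yields 0 and is left alone.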

Rather like the way I can include a file name like
dq.c
in one of these messages, and it gets through, but
if I tried to include the contents of dq.c, it would be
mangled in weird and to me unpredictable ways.

Are the mangling rules written down anywhere?
Is there a way to share a file without the sharer having
to run a web site?

Hmm. Maybe the mangler uses smackdown. Let’s try that.

/* File : dq.c
Author : Richard A. O'Keefe
Updated: 2023/10/31
Purpose: Find "...""..." and '...''...' in Erlang source code.
Usage : find $directory -name '*.[ehy]rl' -exec dq {} +
Usage : dq <foobar.erl 2>foobar.log

Assumptions:
(1) Files are encoded in UTF-8 or 8-bit extensions of ASCII,
so that we only need to check for ' " $ % \n as ASCII characters.
(2) Files may have LF or CR+LF line terminators, but not CR ones.
(3) Files may contain macro definitions and uses so that there is
no point in doing bracket matching checks as well.
(4) Files do NOT use the proposed Python-envy triple-quoting scheme.
This is a tool for preparing a code-base for that.

Erlang allows C-envy string pasting. This means that
"""foobar", "foo""bar", and "foobar""" are all read as "foobar".
There is never any reason to do that. Pasting an empty string
is the same as not pasting anything. If you really really want
to do it for some reason, use
"" "foobar", "foo" "bar", or "foobar" "" with a space between.

However, sometimes adjacent double quotes arise from a mistake.
Imagine
io:put_chars(" x = "";\n") % generating AWK
io:put_chars("<tag att="">foo</tag>") % generating XML
io:put_chars("{"" : "reply", "x":2}") % generating JSON
all of which will be quietly accepted by Erlang, and all of which
will subsequently misbehave. Since such mistakes DO occur in
existing Erlang code, it's useful to check for them.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SQ '\''
#define DQ '"'

static void check(char const *filename) {
    long L = 1;
    register int c;
    FILE *in = 0 == strcmp(filename, "-") ? stdin
             : fopen(filename, "r");

    if (in == (FILE *)0) {
        perror(filename);
        exit(EXIT_FAILURE);
    }
    while ((c = getc_unlocked(in)) >= 0) {
        if (c == '\n') {
            L++;
        } else
        if (c == '%') {
            do c = getc_unlocked(in); while (c >= 0 && c != '\n');
            if (c < 0) {
                if (ferror(in) == 0) {
                    fprintf(stderr, "%s:%ld: unterminated %% comment\n",
                        filename, L);
                }
                break;
            }
            L++;
        } else
        if (c == '$') {
            c = getc_unlocked(in);
            if (c == '\\') c = getc_unlocked(in);
            if (c < 0) break;
        } else
        if (c == DQ) {
            for (;;) {
                c = getc_unlocked(in);
                if (c < 0) break;
                if (c == '\\') {
                    c = getc_unlocked(in);
                    if (c < 0) break;
                } else
                if (c == '\n') {
                    L++;
                } else
                if (c == DQ) {
                    c = getc_unlocked(in);
                    if (c == DQ) {
                        fprintf(stderr, "%s:%ld: adjacent double quotes\n",
                            filename, L);
                        if (fclose(in) != 0) perror(filename);
                        return;
                    }
                    (void)ungetc(c, in);
                    c = DQ;
                    break;
                }
            }
            if (c < 0) {
                if (ferror(in) == 0) {
                    fprintf(stderr, "%s:%ld: unterminated \"string\"\n",
                        filename, L);
                }
                break;
            }
        } else
        if (c == SQ) {
            for (;;) {
                c = getc_unlocked(in);
                if (c < 0) break;
                if (c == '\\') {
                    c = getc_unlocked(in);
                    if (c < 0) break;
                } else
                if (c == '\n') {
                    L++;
                } else
                if (c == SQ) {
                    c = getc_unlocked(in);
                    if (c == SQ) {
                        fprintf(stderr, "%s:%ld: adjacent single quotes\n",
                            filename, L);
                        if (fclose(in) != 0) perror(filename);
                        return;
                    }
                    (void)ungetc(c, in);
                    c = SQ;
                    break;
                }
            }
            if (c < 0) {
                if (ferror(in) == 0) {
                    fprintf(stderr, "%s:%ld: unterminated 'atom'\n",
                        filename, L);
                }
                break;
            }
        }
    }
    if (ferror(in) != 0 || fclose(in) != 0) {
        perror(filename);
        exit(EXIT_FAILURE);
    }
}

int main(int argc, char **argv) {
    if (argc <= 1) {
        check("-");
    } else {
        int i;
        for (i = 1; i < argc; i++) check(argv[i]);
    }
    return 0;
}

I am going for a warning in OTP-26.2 (‘maint’) and an error in OTP-27.0-rc1 (‘master’). We’ll see if there are any strong objections.

That is: I will disallow string concatenation without intervening white space.

That looks like it could fit the proposed Sigil syntax https://github.com/erlang/eep/pull/53, for example:
~i"./foobar.bin"b to include a binary from a file, or ~i"./foobar.bin"s_utf8 for a string read with utf8 encoding.

Having the content in a separate file sure solves editor mode and syntax highlighting issues…


As mentioned above, code producing a compile time warning for adjacent strings without intervening white space has been merged to ‘maint’ to be released in OTP-26.2, and code producing a compile time error for the same thing (in conjunction with triple-quoted strings) has been merged to ‘master’ to be released in OTP-27.0, and all pre-releases.
