Asn1ct support for non-ASCII files? Compiling some TCAP-MAP ASN.1 files results in an error

Benoit · July 18, 2022, 1:20pm

Hi.
I am brand new to Erlang and wanted to test the ASN.1 environment provided by the asn1ct compiler. I am trying to compile some files from the telecom TCAP-MAP ASN.1 specification, but I immediately run into errors that look like file encoding issues.

In short, here is an example of an error I am facing:

Erlang/OTP 25 [erts-13.0.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit:ns]

Eshell V13.0.2  (abort with ^G)
1> asn1ct:compile("MAP-CommonDataTypes.asn", [ber, verbose]).
Erlang ASN.1 compiler 5.0.19
Compiling: "MAP-CommonDataTypes.asn"
Options: [ber,verbose,{i,"."}]
MAP-CommonDataTypes.asn:300: syntax error before: 'Â'
{error,[{structured_error,{"MAP-CommonDataTypes.asn",300},
                          asn1ct_parser2,
                          {syntax_error,'Â'}}]}

Those ASN.1 files are available here: https://github.com/P1sec/pycrate/tree/master/pycrate_asn1dir/Pycrate_TCAP_MAP/. There is nothing special about them: they are standard ASN.1 files that may contain some non-ASCII (UTF-8) characters in comments, which I would not expect to cause any issue for the compiler.

From the error reported by Erlang, I cannot tell whether 300 is a line number or a character offset, and I cannot spot any Â character in the file, either from my terminal or in any standard editor. Moreover, the corresponding UTF-8 byte sequence 0xC3 0x82 does not appear in this file. I am using a recent Ubuntu system with a default terminal and environment configuration (LANG=en_US.UTF-8).

Any help or feedback would be much appreciated.
Thanks.

vances · July 19, 2022, 12:45am

For reference you could compare with this working project: github.com/sigscale/map

kennethL · July 19, 2022, 9:09am

The ASN.1 compiler does not handle UTF-8 encoded files. Since UTF-8 and ISO Latin-1 are identical for ASCII characters, the problem does not show up until line 300, which contains the bytes 0xC2 and 0xA0, the UTF-8 encoding of U+00A0 (non-breaking space). These bytes are converted directly to tokens because they are not interpreted as whitespace, which they would have been if the scanner handled UTF-8. Read as Latin-1, the byte 0xC2 is 'Â', which is the token reported in the error.
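The mismatch can be reproduced outside the compiler. Here is a minimal Python sketch, assuming the scanner effectively treats each input byte as one Latin-1 character (this is inferred from the error message, not taken from the asn1ct source):

```python
# The two bytes found on line 300: the UTF-8 encoding of U+00A0.
raw = b"\xc2\xa0"

# Decoded as UTF-8, the two bytes form a single whitespace character.
as_utf8 = raw.decode("utf-8")
print(len(as_utf8), as_utf8.isspace())

# Decoded byte by byte as Latin-1, they form two characters.
# The first one is U+00C2, the 'Â' shown in the syntax error.
as_latin1 = raw.decode("latin-1")
print(len(as_latin1), repr(as_latin1[0]))
```

This would also explain why no Â is visible in a UTF-8 terminal or editor: there the same two bytes are rendered as one blank character.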

On line 300 you have
alertingLevel-0 AlertingPattern ::= '00000000'B

The C2 A0 bytes appear just before ::= '00000000'B, so they are not in a comment.
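To locate such bytes without relying on how an editor renders them, the file can be scanned at the byte level. Here is a minimal Python sketch; `find_non_ascii` is a hypothetical helper name, and the sample data is a made-up two-line stand-in for the real file:

```python
def find_non_ascii(data: bytes):
    """Yield (line_number, byte_position_in_line, byte_value) for bytes >= 0x80."""
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for pos, byte in enumerate(line, start=1):
            if byte >= 0x80:
                yield line_no, pos, byte

# Stand-in for the file contents: a comment line, then a line with C2 A0 before "::=".
sample = b"-- comment\nalertingLevel-0 AlertingPattern\xc2\xa0::= '00000000'B\n"

hits = list(find_non_ascii(sample))
for line_no, pos, byte in hits:
    print(f"line {line_no}, byte {pos}: 0x{byte:02X}")
```

For a real file, the same function can be fed `open(path, "rb").read()`. The line numbers it reports are directly comparable with the number after the file name in the asn1ct error, which is a line number.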

It might be a good idea to improve the handling of UTF-8 here, but it has not been necessary so far.
The standard MAP-CommonDataTypes and friends that we have tested with do not contain UTF-8 in this way.

This is the first time I have noticed a problem like this, so apparently all the ASN.1 specifications we have tested until this one have stayed away from UTF-8.

Benoit · July 19, 2022, 10:33am

Thanks a lot for your analysis. Those ASN.1 files are extracted from 3GPP docx documents converted to txt. I guess those Unicode-specific characters come from formatting oddities in Microsoft Word.
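One possible workaround, as long as the scanner does not handle UTF-8, is to normalize such characters before passing the files to asn1ct. Here is a minimal Python sketch; `to_ascii_asn1` is a hypothetical helper name, and the replacement table is an assumption about what a Word-to-text conversion might introduce, not an exhaustive list:

```python
# Assumed replacements for characters a Word export may introduce.
REPLACEMENTS = {
    "\u00a0": " ",   # no-break space -> plain space
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}

def to_ascii_asn1(text: str) -> str:
    """Replace the characters listed in REPLACEMENTS with ASCII equivalents."""
    return text.translate(str.maketrans(REPLACEMENTS))

# Made-up sample line containing a no-break space and a curly quote.
sample = "alertingLevel-0 AlertingPattern\u00a0::= '00000000\u2019B"
cleaned = to_ascii_asn1(sample)
print(cleaned)

# Anything still outside ASCII needs a manual decision.
leftover = sorted({ch for ch in cleaned if ord(ch) > 127})
print(leftover)
```

For a real file, the text would be read with `encoding="utf-8"` and the result written back with `encoding="ascii"`, so that any character not covered by the table raises an error instead of silently reaching the compiler.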