Good examples to explain how bit syntax expressions works on non-byte-aligned situations

wallace · September 26, 2024, 7:25am

Many examples that explain how bit syntax works only demonstrate byte-aligned situations,
I wrote some simple examples about the non-byte-aligned situations, which helped me a lot.

I hope these examples can help people who are not familiar with bit syntax or people who are teaching Erlang to others.

Example 1

<<7:9, 0:23>>.
%> <<3,128,0,0>>

BYTE 0     BYTE 1     BYTE 2     BYTE 3
 M      L   M      L   M      L   M      L
+--------+ +--------+ +--------+ +--------+ 
|********| |*       | |        | |        | 
+--------+ +--------+ +--------+ +--------+ 
 0           0000000   00000000   00000000
  0000011   1
  ^---------^

Example 2

<<7:9/little, 0:23>>.
%> <<7,0,0,0>>

BYTE 0     BYTE 1     BYTE 2     BYTE 3
 M      L   M      L   M      L   M      L
+--------+ +--------+ +--------+ +--------+ 
|********| |*       | |        | |        | 
+--------+ +--------+ +--------+ +--------+ 
            00000000   00000000   00000000
 00000111
 ^------^

Example 3

<<7:9, 7:9, 0: 14>>.
%> <<3,129,192,0>>

BYTE 0     BYTE 1     BYTE 2     BYTE 3
 M      L   M      L   M      L   M      L
+--------+ +--------+ +--------+ +--------+ 
|        | | *******| |**      | |        | 
+--------+ +--------+ +--------+ +--------+ 
 00000011   10           000000   00000000
              000001   11
              ^---------^

Example 4

<<7:9, 7:9/little, 0: 14>>.
%> <<3,131,128,0>>

BYTE 0     BYTE 1     BYTE 2     BYTE 3
 M      L   M      L   M      L   M      L
+--------+ +--------+ +--------+ +--------+ 
|        | | *******| |**      | |        | 
+--------+ +--------+ +--------+ +--------+ 
 00000011   1           0000000   00000000
             0000011   1
             ^---------^

i-am-linja · September 26, 2024, 9:27pm

Are you certain of this? I’m not intimately familiar with binaries myself but I can’t help but notice that your bytes are joined LSB-to-MSB, even in the little-endian cases.

wallace · September 27, 2024, 2:41am

Good question ! And yes, I am certain of this. Endian can only decide byte order.

There are 2 phases for bit syntax expressions. Let’s keep using <<7:9, 0:23>> as the example.

PHASE 1: Get the Size (which is 9 in this example), mark the 9 bits we will use. Endian doesn’t matter in this phase.

BYTE 0     BYTE 1     BYTE 2     BYTE 3
 M      L   M      L   M      L   M      L
+--------+ +--------+ +--------+ +--------+ 
|********| |*       | |        | |        | 
+--------+ +--------+ +--------+ +--------+

PHASE 2: Now we can only see the marked bits, we re-organize these bits into virtual bytes.

If it’s big-endian, we start from right, otherwise we start from left. Inside the virtual byte, it’s always MSB-LSB.

BYTE 0     BYTE 1     BYTE 2     BYTE 3
 M      L   M      L   M      L   M      L
+--------+ +--------+ +--------+ +--------+ 
|********| |*       | |        | |        | 
+--------+ +--------+ +--------+ +--------+ 
 *0000011   1
  ^---------^

i-am-linja · September 27, 2024, 5:44am

Endianness does matter in phase 1. Say we’re on a little-endian machine, and we have a 64-bit pointer, from which we wish to extract the page offset, 12 bits. Under your model, we cannot do this. X marks the bits we want:

 M   0  L M   1  L M   2  L M   3  L M   4  L M   5  L M   6  L M   7  L
+--------+--------+--------+--------+--------+--------+--------+--------+
|XXXXXXXX|    XXXX|        |        |        |        |        |        |
+--------+--------+--------+--------+--------+--------+--------+--------+

And there is no contiguous range that captures them.

wallace · September 27, 2024, 6:00am

What I was describing is how bit syntax expressions in Erlang works.

For a little-endian machine, you can not fetch the low 12 bits from an address in one go.

<<LowBits:12/little, _:52>> = <<16#0000000012345678:64/little>>.

This is what you will get:

io:format("~.16B~n", [LowBits]).
%> 578

But you can do it like this:

<<LowBits1:8, _:4, LowBits2:4, _:48>> = <<16#0000000012345678:64/little>>.

io:format("~.16B~n", [LowBits1 + LowBits2*256]).
%> 678

i-am-linja · September 27, 2024, 8:00am

Well. I just tried it myself in the shell and it appears you are correct. That really sucks.

I know the likely answer, but is there any way to get little-endian bit-matching in one go? Perhaps declaring the entire binary as /little, somehow? Some parse transform or something?

wallace · September 27, 2024, 12:59pm

Maybe. But I would like to keep bit syntax expression as what it is now. Supporting both directions could make the rules too complex to implement or to explain to users.

Since most (if not all) network protocols are big-endian style, I think what bit syntax expression does now is a good fit for what it is designed for.

wallace · September 27, 2024, 4:36pm

Just a short comment here. The * is not necessary. To get the lowest 12 bits from a little-endian number, the easier and cleaner way is fetching out the whole number and using bitwise operator band.

<<A:64/little>> = <<16#0000000012345678:64/little>>.

io:format("~.16B~n", [A band 16#FFF]).
%> 678

i-am-linja · September 27, 2024, 11:33pm

Supporting both directions with per-field rules alone would be horrendous, I agree. However, it could be done simply and cleanly by marking the entire binary, such as <<Off:12, _:52>>/little (obviously with different syntax). In this way, bits within bytes would be counted from least to most significant.

wallace · September 28, 2024, 12:55am

I feel it’s still too complex. For example:

<<1:1, <<7:12>>/bits>>.

In current rule, it’s

BYTE 0     BYTE 1
 M      L   M      L
 +--------+ +--------+
 | *******| |*****XXX|
 +--------+ +--------+
  10000           
       000   00111
       ^---------^

Where XXX is the next free space to use. There is no hole, it’s simple to understand. All we have to do is mark continous bits then organize to virtual bytes.

But if we want to support from least to most style in inner bits, we are actually expecting this:

BYTE 0     BYTE 1
 M      L   M      L
 +--------+ +--------+
 | *******| |XXX*****|
 +--------+ +--------+
  10000         
       000      00111
       ^---------^

When we mixing 2 styles, there will be holes.

Iet’s use another bit-string <<1:1, A/bits-little, 7:10>>, how would the outter binary know where it should start and where it should skip?

On the implementation side, there will be a lot of inner data and complex logic to handle them. And that’s not the worst thing.

The worst thing will be on the user side, we can not use simple models or simple graphs to explain how it works.

The current rules for bit syntax expressions are already complex to some extent, that’s why so many people don’t understand how complex bitstring works. But with the right mental model (like what I described on the top of this topic), it’s still easy to manage even for beginners.

i-am-linja · September 28, 2024, 2:05am

Nested binaries get horrific like this, pretty much inevitably. Personally, I think Erlang’s decision to declare endianness per-field was a gigantic mistake. I think the best model would have been to declare it once, outside any binary, and disallow re-declaring it within a binary.

wallace · September 28, 2024, 2:50am

I don’t think that’s a good solution. Consider this situation:

-spec my_fn(bitstring()) -> bitstring().
my_fn(SomeBitString) ->
        Prefix = get_customized_prefix(),
        <<Prefix/bits, SomeBitString/bits>>.

How to make a limitation on endianness?

Forcing SomeBitString and Prefix to be of the same endianess and throwing an exception when they are not is not a good idea.

Dropping the feature of declaring endianness per-field is easy, but disallowing people from combining bits of different endianess doesn’t make sense.

bjorng · September 28, 2024, 7:09am

If performance and/or code size is important, it’s better to fetch only part of the number. Consider:

extract_1(<<A:64/little>>) ->
    A band 16#fff.

extract_2(<<A:16/little,_:48>>) ->
    A band 16#fff.

In extract_1/1, it is possible that the value of A will not fit in a small integer (60 bits signed), so the generated native code will have to handle that case. Here is the part of the native code that extracts a 64-bit segment:

# extract integer 64
    mov x4, 15
# reverse endian for a little-endian segment
    rev64 x9, x7
    lsr x9, x9, 0
# test whether it fits in a small
    lsr x10, x9, 59
    cbnz x10, L32
    orr x27, x4, x9, 4
    b L33
L32:
# store extracted integer as a bignum
    add x27, x23, 2
    mov x10, 72
    stp x10, x9, [x23], 16
L33:

The band operator will also need to handle a large integer that does not fit in a small:

# i_band_jIssd
    mov x2, 65535
    and x0, x27, x2
# simplified test for small operands since other types are boxed
    tbnz x0, 0, L34
    mov x1, x27
    bl L36
L34:
    mov x25, x0

The code that extracts a 16-bit segment in extract_2/1 is simpler because the result is always a small integer:

# extract integer 16
    mov x4, 15
    ubfx x9, x7, 48, 16
# reverse endian for a little-endian segment
    rev16 x9, x9
    orr x27, x4, x9, 4

The band operation is also much simpler:

# i_band_jIssd
# skipped test for small operands since they are always small
    and x25, x27, 65535

i-am-linja · September 28, 2024, 8:23am

Ok, how about this. This is an idea that allows proper little-endian field extraction, in a backwards-compatible way.

The idea is either a new endian specifier or reworking the existing /little specifier, that does affect “PHASE 1” above. When specified, it will take bits from the earliest available byte, from the least significant end:

 01234567  Byte 01234567
+--------+     +--------+
|08      | MSB |7       |
|19      |     |6       |
|2A      |     |5       |
|3B      |     |4       |
|4       |     |3B      |
|5       |     |2A      |
|6       |     |19      |
|7       | LSB |08      |
+--------+     +--------+
Left  : <<off:12     , _:52>>
Right : <<off:12/smol, _:52>>

It gets a bit wonky with mixed endiannesses and nested binaries, but in my opinion the current system gets a lot wonkier. This system is never ambiguous, it never leaves a non-contiguous range of bits in a byte, and it never takes bits from a later byte before using up all earlier. I think it would be a great addition.

wallace · September 28, 2024, 9:13am

Got it, Thank you !