Can tuple iteration ever be faster than list iteration?

Tuples are contiguous in memory, while lists are not. This should mean that iterating over a tuple can be faster than iterating over a list, but I have been unable to actually achieve this. I’ve put together the following Benchee benchmark to illustrate:

Name                          ips        average  deviation         median         99th %
tuple_comp                53.02 K       18.86 μs    ±31.14%       17.62 μs       38.55 μs
list_iter_recursive       11.79 K       84.80 μs   ±122.94%       76.53 μs      204.74 μs
list_comp                  9.28 K      107.81 μs    ±19.39%      102.72 μs      186.49 μs
tuple_iter_no_check        7.69 K      130.08 μs    ±25.14%      121.61 μs      244.62 μs
tuple_iter                 6.23 K      160.46 μs    ±22.88%      147.60 μs      301.26 μs
list_iter                  2.13 K      469.78 μs    ±10.40%      447.09 μs      649.97 μs

tuple_comp                53.02 K
list_iter_recursive       11.79 K - 4.50x slower +65.94 μs
list_comp                  9.28 K - 5.72x slower +88.95 μs
tuple_iter_no_check        7.69 K - 6.90x slower +111.22 μs
tuple_iter                 6.23 K - 8.51x slower +141.60 μs
list_iter                  2.13 K - 24.91x slower +450.92 μs

Memory usage statistics:

Name                   Memory usage
tuple_comp                      0 B
list_iter_recursive             0 B - 1.00x memory usage +0 B
list_comp                       0 B - 1.00x memory usage +0 B
tuple_iter_no_check          1816 B - ∞ x memory usage +1816 B
tuple_iter                      0 B - 1.00x memory usage +0 B
list_iter                       0 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

These results show tuple comparison being the fastest, followed by recursive list iteration, then list comparison, then two forms of tuple iteration, and finally list iteration using an Elixir standard-library function.

And the code for each of the test cases:

list = Enum.to_list(1..50_000)
list2 = Enum.to_list(1..50_000)
tuple = List.to_tuple(list)
tuple2 = List.to_tuple(list)
tuple_size = tuple_size(tuple)
iter = Enum.to_list(1..1000)

defmodule BenchMod do
  # Function to iterate over a tuple with a bounds check.
  def tuple_iter(_, tuple_size, tuple_size), do: :ok

  def tuple_iter(tuple, tuple_size, idx) do
    _ = elem(tuple, idx)
    tuple_iter(tuple, tuple_size, idx + 1)
  end

  # Function to iterate over a tuple without a bounds check; the caller
  # rescues the ArgumentError raised when idx runs past the end.
  def tuple_iter_no_check(tuple, idx) do
    _ = elem(tuple, idx)
    tuple_iter_no_check(tuple, idx + 1)
  end

  # Function to iterate over a list.
  def list_iter([]), do: :ok
  def list_iter([_h | t]), do: list_iter(t)
end

list_iter = fn ->
  Enum.each(list, fn _y -> :ok end)
end

list_iter_recursive = fn ->
  BenchMod.list_iter(list)
end

tuple_iter = fn ->
  BenchMod.tuple_iter(tuple, tuple_size, 0)
end

tuple_iter_no_check = fn ->
  try do
    BenchMod.tuple_iter_no_check(tuple, 0)
  rescue
    _e -> :ok
  end
end

list_comp = fn ->
  list == list2
end

tuple_comp = fn ->
  tuple == tuple2
end

Benchee.run(
  %{
    "list_iter" => list_iter,
    "list_iter_recursive" => list_iter_recursive,
    "tuple_iter" => tuple_iter,
    "tuple_iter_no_check" => tuple_iter_no_check,
    "list_comp" => list_comp,
    "tuple_comp" => tuple_comp
  },
  time: 20,
  memory_time: 2
)

I think it makes sense that tuple iteration, the way I have implemented it, is slower: the BenchMod.tuple_iter/3 function has a size check. BenchMod.tuple_iter_no_check/2 removes that check and becomes much faster, using a try/rescue to terminate iteration as a quick cheat. The Enum and recursive list variants aren’t completely comparable either: the recursive list iteration discards the head of the list, while the Enum version forces an access to each element and, I assume, uses an extra reduction to do so.
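To check the extra-reduction guess, one could compare the process's reduction counter before and after a pass. The `ReductionCheck` helper below is hypothetical (not part of the benchmark) and only gives rough numbers, since the closure calls themselves also cost reductions:

```elixir
defmodule ReductionCheck do
  # Runs `fun` and returns roughly how many reductions it consumed.
  def count(fun) do
    {:reductions, before} = Process.info(self(), :reductions)
    fun.()
    {:reductions, after_count} = Process.info(self(), :reductions)
    after_count - before
  end
end

list = Enum.to_list(1..50_000)
IO.inspect(ReductionCheck.count(fn -> Enum.each(list, fn _ -> :ok end) end))
```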

Beyond these initial thoughts is there any way to make tuple iteration as fast or faster than lists?


There’s an extension to Erlang patterns that makes sense,
and I think it could be used to make tuple iteration faster.
Not that it’s clear to me that tuple iteration needs to be
faster, but …

tuple pattern : ‘{’ [ (at pattern | normal pattern)*‘,’ ] ‘}’
at pattern : (variable | integer) ‘:=’ normal pattern

The idea is that there is a cursor which starts at the first element of the tuple. An ‘at’ pattern with n:= or N:= fails if N is not an integer, or if the index is < 1 or > the size of the tuple; otherwise it does a normal match against that element. This would allow

tuple_to_list(Tuple) -> aux(tuple_size(Tuple), Tuple, []).

aux(I, T={I := X}, L) -> aux(I-1, T, [X|L]);
aux(0, _, L) -> L.

When an at-pattern contains no Var:=, it’s still possible to
determine the exact size of the tuple to be matched in advance.
Otherwise, the compiler can only determine a minimum size.

This has other uses. Indeed,

element(N, {N := X}) -> X.

would no longer need to be primitive.


Actually, the list of small integers in your benchmark will be pretty contiguous, too. I expect that every other word on the heap will be the integers in the list, interleaved by tagged pointers.
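For what it’s worth, `:erts_debug.flat_size/1` (an internal, undocumented helper, so treat this as an illustration only) reports a term's heap size in words and shows the difference between the two layouts:

```elixir
list = Enum.to_list(1..50_000)
tuple = List.to_tuple(list)

# Each cons cell is two words (the head value and the tail pointer),
# so a list of 50_000 small integers occupies 100_000 heap words;
# the tuple occupies one header word plus one word per element.
IO.inspect(:erts_debug.flat_size(list))   # => 100000
IO.inspect(:erts_debug.flat_size(tuple))  # => 50001
```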

Yes, by helping the compiler and JIT apply more optimizations. I added the following functions to the BenchMod module:

  def tuple_iter_guarded(tuple) do
    tuple_iter_guarded(tuple, tuple_size(tuple), 0)
  end

  defp tuple_iter_guarded(_, tuple_size, tuple_size), do: :ok

  defp tuple_iter_guarded(tuple, tuple_size, idx) do
    _ = elem(tuple, idx)
    tuple_iter_guarded(tuple, tuple_size, idx + 1)
  end

Here are results on my computer (an M1 MacBook Pro):

Name                          ips        average  deviation         median         99th %
list_iter_recursive       13.93 K       71.77 μs     ±2.30%       71.54 μs       79.25 μs
tuple_iter_guarded        10.44 K       95.78 μs     ±1.71%       95.63 μs      101.75 μs
tuple_iter_no_check       10.34 K       96.76 μs     ±8.20%       96.33 μs      107.83 μs
tuple_iter                 8.96 K      111.61 μs     ±2.24%      111.29 μs      124.42 μs
list_iter                  5.69 K      175.70 μs     ±9.73%      188.50 μs      199.33 μs

list_iter_recursive       13.93 K
tuple_iter_guarded        10.44 K - 1.33x slower +24.02 μs
tuple_iter_no_check       10.34 K - 1.35x slower +24.99 μs
tuple_iter                 8.96 K - 1.56x slower +39.85 μs
list_iter                  5.69 K - 2.45x slower +103.93 μs

(I removed the tuple_comp and list_comp tests, because they mostly measure the performance of C code in the runtime system, not compiled Erlang code.)

List iteration is still faster. One reason could be that the list iteration function does not read the head of the list from memory. A fairer benchmark would be to, for example, use functions that calculate the sum of all elements.
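As a sketch of what I mean (a hypothetical `SumBench` module, not something benchmarked in this thread), both variants would then have to read every element:

```elixir
defmodule SumBench do
  # Sum a list recursively; reads the head of every cell.
  def list_sum([], acc), do: acc
  def list_sum([h | t], acc), do: list_sum(t, acc + h)

  # Sum a tuple by index; reads every element via elem/2.
  def tuple_sum(tuple), do: tuple_sum(tuple, tuple_size(tuple), 0, 0)

  defp tuple_sum(_tuple, size, size, acc), do: acc

  defp tuple_sum(tuple, size, idx, acc) do
    tuple_sum(tuple, size, idx + 1, acc + elem(tuple, idx))
  end
end

SumBench.list_sum(Enum.to_list(1..5), 0)   # => 15
SumBench.tuple_sum({1, 2, 3, 4, 5})        # => 15
```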

To explain why the new iteration function is faster, I will translate the example to Erlang:

-module(bench).
-export([tuple_iter/3, tuple_iter_guarded/1]).

tuple_iter(_Tuple, Size, Size) ->
    ok;
tuple_iter(Tuple, Size, Index) ->
    _ = element(Index + 1, Tuple),
    tuple_iter(Tuple, Size, Index + 1).

tuple_iter_guarded(Tuple) ->
    tuple_iter_guarded(Tuple, tuple_size(Tuple), 0).

tuple_iter_guarded(_Tuple, Size, Size) ->
    ok;
tuple_iter_guarded(Tuple, Size, Index) ->
    _ = element(Index + 1, Tuple),
    tuple_iter_guarded(Tuple, Size, Index + 1).

These functions are my own translation of the Elixir code. I haven’t checked, but I expect the Erlang code generated by the Elixir compiler to be very similar or identical to this. Note that these functions are not idiomatic Erlang code.

By running erlperf, we can confirm that tuple_iter_guarded/1 is faster than tuple_iter/3:

erlperf --init_runner_all 'list_to_tuple(lists:seq(1, 50000)).' 'r(T) -> bench:tuple_iter(T, tuple_size(T), 0).' 'r(T) -> bench:tuple_iter_guarded(T).'
Code                                                   ||        QPS       Time   Rel
r(T) -> bench:tuple_iter_guarded(T).                    1      10035   99681 ns  100%
r(T) -> bench:tuple_iter(T, tuple_size(T), 0).          1       8615     116 us   86%

Now let’s dig into why tuple_iter/3 is slower. In the first clause, the second and third arguments are compared. Since tuple_iter/3 is an exported function, the compiler can know nothing about the types of the arguments, and will therefore not emit any type information for them. Here is the BEAM code:

{test,is_eq_exact,{f,3},[{x,1},{x,2}]}.

Without type information, the JIT will emit code that can handle all possible terms:

# is_eq_exact_fss
    cmp x27, x26
    b.eq L20
    orr x14, x27, x26
    and x14, x14, 3
    cmp x14, 3
    b.eq @label_3-0
    mov x0, x27
    mov x1, x26
    stp x15, x16, [x19, 160]
    bl L23
    ldp x15, x16, [x19, 160]
    cbz w0, @label_3-0

This code is for AArch64 (ARM64). The important thing here is not to understand what these instructions do, but how many instructions there are.

The corresponding BEAM instruction for comparing the second and third arguments in the first clause of tuple_iter_guarded/3 looks like this:

{test,is_eq_exact,{f,8},
      [{tr,{x,1},{t_integer,{0,16777215}}},
       {tr,{x,2},{t_integer,{0,16777215}}}]}.

tuple_iter_guarded/3 is a local function, which is only called by tuple_iter_guarded/1 and itself, so the compiler does a much better job of figuring out the types and ranges of the arguments.

Knowing that both arguments are small integers the JIT will emit a much shorter code sequence for the comparison:

# is_eq_exact_fss
# simplified check since one argument is an immediate
    cmp x27, x26
    b.ne @label_8-2

See the blog post Type-Based Optimizations in the JIT for more information.

Now let’s look at the second clause of tuple_iter/3. The compiler eliminates the common subexpression Index + 1, essentially rewriting the second clause to:

tuple_iter(Tuple, Size, Index) ->
    IncrementedIndex = Index + 1,
    _ = element(IncrementedIndex, Tuple),
    tuple_iter(Tuple, Size, IncrementedIndex).

Again, since the function is exported, the compiler knows nothing about the type of the Index variable and emits a BEAM instruction for the addition without any type information:

{gc_bif,'+',{f,0},3,[{x,2},{integer,1}],{x,2}}.

The JIT has to emit code that works with any possible type for Index:

# i_plus_jIssd
    mov x2, 31
    adds x0, x27, 16
    and x8, x27, 15
# test for not overflow and small operands
    ccmp x8, 15, 0, 9
    b.eq L26
    mov x1, x27
    bl L28
    mov x27, x0

The corresponding BEAM instruction in the local function tuple_iter_guarded/3 has type information:

{gc_bif,'+',{f,0},3,
        [{tr,{x,2},{t_integer,{0,16777215}}},{integer,1}],
        {x,2}}.

Taking advantage of the type information, the JIT emits a single native instruction for the addition:

# i_plus_jIssd
# add small constant without overflow check
    add x27, x27, 16