Variable shadowing constraint of match spec generated using ms_transform

erlangys · June 1, 2023, 7:20am

Looks like that is a variable shadowing constraint of match spec that is generated using ms_transform. For example,

get_match_spec(Id) ->
    ets:fun2ms(fun({Id, Content} = Item) -> {Id, [mark | Content]} end).
Warning: variable 'Id' shadowed in ms_transform fun head

get_match_spec(123).
Output: [{{'$1','$2'},[],[{{'$1',[mark|'$2']}}]}]

Thus ets:select_replace can’t efficiently be used to select_replace an item with the given Id. One way is to use guard

get_match_spec(Id) ->
    ets:fun2ms(fun({Id1, Content} = Item) when Id1 == Id -> {Id, [mark | Content]} end).

test_fun2ms:get_match_spec(123).
Output: [{{'$1','$2'},
  [{'==','$1',{const,123}}],
  [{{{const,123},[mark|'$2']}}]}]
But then it needs to scan the whole ets table because the match of the Id is not in the MatchHead.

Is there any way to solve this variable shadowing issue? Otherwise this constraint makes match spec select_replace of little usage.

LeonardB · June 1, 2023, 10:13am

Why use ets:fun2ms at all?

I’ve always viewed fun2ms as a useful tool for the shell when developing, but never ever used it in production code.

jimdigriz · June 1, 2023, 11:02am

Because if you include:

-include_lib("stdlib/include/ms_transform.hrl").

It converts your very readable fun into the matchspec for you at compile time for free, right?

jimdigriz · June 1, 2023, 11:07am

I think the problem is the Id in the function, you need to use something (untested) more like:

get_match_spec(Id) ->
  ets:fun2ms(fun({I, Content} = Item) when I == Id -> {I, [mark | Content]} end).

Updated: …now tested:

17> c(test).                 
{ok,test}
18> test:get_match_spec(moo).
[{{'$1','$2'},
  [{'==','$1',{const,moo}}],
  [{{'$1',[mark|'$2']}}]}]

Contents of test.erl is:

-module(test).

-include_lib("stdlib/include/ms_transform.hrl").

-export([get_match_spec/1]).

get_match_spec(Id) ->
  ets:fun2ms(fun({I, Content} = _Item) when I == Id -> {I, [mark | Content]} end).

sverker · June 1, 2023, 2:26pm

This is a general language quirk with funs. You cannot refer and match already bound variables in the fun’s function head.

Workaround: As @LeonardB hinted, paste the matchspec from fun2ms in your code and substitute '$1' with your Id variable. Maybe keep the fun syntax in a comment for improved readability.

LeonardB · June 1, 2023, 2:46pm

Just a small example to show the performance difference between a $X bound match and a value match in the match spec.

Eshell V13.0  (abort with ^G)
1> ets:new(test, [named_table, ordered_set, {keypos, 1}]).
test
2> [ets:insert(test, {X,X}) || X <- lists:seq(1,1_000_000)].
[true,true,true,true,true,true,true,true,true,true,true,
 true,true,true,true,true,true,true,true,true,true,true,true,
 true,true,true,true,true,true|...]
3> ets:fun2ms(fun({I, Content}) when I == 1_000_000 -> {I, [mark | Content]} end).
[{{'$1','$2'},[{'==','$1',1000000}],[{{'$1',[mark|'$2']}}]}]
4> ManualMs = [{{1_000_000,'$1'},[],[{{1_000_000,[mark|'$1']}}]}].
[{{1000000,'$1'},[],[{{1000000,[mark|'$1']}}]}]
5> Ms = [{{'$1','$2'},[{'==','$1',1000000}],[{{'$1',[mark|'$2']}}]}].
[{{'$1','$2'},[{'==','$1',1000000}],[{{'$1',[mark|'$2']}}]}]
6> [io:format("~p~n", [timer:tc(fun() -> ets:select(test, Ms) end)]) || _ <- lists:seq(1,10)].
{132734,[{1000000,[mark|1000000]}]}
{126449,[{1000000,[mark|1000000]}]}
{121694,[{1000000,[mark|1000000]}]}
{122800,[{1000000,[mark|1000000]}]}
{120502,[{1000000,[mark|1000000]}]}
{115237,[{1000000,[mark|1000000]}]}
{117500,[{1000000,[mark|1000000]}]}
{114709,[{1000000,[mark|1000000]}]}
{125850,[{1000000,[mark|1000000]}]}
{128768,[{1000000,[mark|1000000]}]}
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok]
7> [io:format("~p~n", [timer:tc(fun() -> ets:select(test, ManualMs) end)]) || _ <- lists:seq(1,10)].
{19,[{1000000,[mark|1000000]}]}
{4,[{1000000,[mark|1000000]}]}
{3,[{1000000,[mark|1000000]}]}
{3,[{1000000,[mark|1000000]}]}
{4,[{1000000,[mark|1000000]}]}
{3,[{1000000,[mark|1000000]}]}
{3,[{1000000,[mark|1000000]}]}
{3,[{1000000,[mark|1000000]}]}
{3,[{1000000,[mark|1000000]}]}
{3,[{1000000,[mark|1000000]}]}
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok]
8>

And in case you were wondering about duplicate_bag

Eshell V13.0  (abort with ^G)
1> ets:new(test, [named_table, duplicate_bag, {keypos, 1}]).
test
2> [ets:insert(test, {case (X rem 1000) =:= 0 of true -> moo; false -> X end, X}) || X <- lists:seq(1,1_000_000)].
[true,true,true,true,true,true,true,true,true,true,true,
 true,true,true,true,true,true,true,true,true,true,true,true,
 true,true,true,true,true,true|...]
3> Ms = ets:fun2ms(fun({I, Content}) when I == moo -> {I, [mark | Content]} end). 
[{{'$1','$2'},[{'==','$1',moo}],[{{'$1',[mark|'$2']}}]}]
4> 
4> ManualMs = [{{moo,'$1'},[],[{{moo,[mark |'$1']}}]}].            
[{{moo,'$1'},[],[{{moo,[mark|'$1']}}]}]
5>            
5> {MsTime, MsTimes} = lists:foldl(fun(_, {T0, Times}) -> {T1, _} = timer:tc(fun() -> ets:select(test, Ms) end), {T0 + T1, [T1 | Times]} end, {0, []}, lists:seq(1,10)).
{2060486,
 [205817,203358,210674,201202,198532,200487,207024,208889,
  205530,218973]}
6> lists:foreach(fun(T) -> io:format("~w~n", [T]) end, MsTimes).
205817
203358
210674
201202
198532
200487
207024
208889
205530
218973
ok
7> io:format("MsTime Avg: ~w~n", [MsTime / 10]).
MsTime Avg: 206048.6
ok
8> {ManualMsTime, ManualMsTimes} = lists:foldl(fun(_, {T0, Times}) -> {T1, _} = timer:tc(fun() -> ets:select(test, ManualMs) end), {T0 + T1, [T1 | Times]} end, {0, []}, lists:seq(1,10)).
{1140,"jkj{abe[ZÝ"}
9> lists:foreach(fun(T) -> io:format("~w~n", [T]) end, ManualMsTimes).
106
107
106
123
97
98
101
91
90
221
ok
10> io:format("ManualMsTime Avg: ~w~n", [ManualMsTime / 10]).
ManualMsTime Avg: 114.0
ok

sverker · June 1, 2023, 2:58pm

That’s the “fully bound key” optimization:
“Traversals using match and select functions may not need to scan the entire table depending on how the key is specified. A match pattern with a fully bound key (without any match variables) will optimize the operation to a single key lookup without any table traversal at all.”

jimdigriz · June 1, 2023, 5:11pm

Is there a technical edge case why the constant (I am only thinking of the case of a single =:= guard) is not lifted into the bound key by ms_transform or is it just a case of “patches welcome?”

This seems like a candidate for a common pattern, at least for me it is.

sverker · June 1, 2023, 5:33pm

There already exist a pendng pull request that solves this problem.

github.com/erlang/otp

stdlib: Add query optimisation to `ets:fun2ms/1`

erlang:master ← TD5:ets-opt

opened 05:11PM - 22 Mar 23 UTC

TD5

+1626 -155

Unlike writing match specs directly, `ets:fun2ms/1` generates queries by transla…ting an erlang function expression. This is convenient and makes for readable queries, but it necessarily trades-off some expressiveness in favour of simplicity (for example, it's not possible to generate a match spec pattern guard that matches against an in-scope variable: users are forced to use something like an equality guard instead). Here, we resolve that issue by reading the author's _intention_ from the given function expression, generating the match spec as before via `ms_transform`, but then running an optimisation pass over it during compilation in order to generate more efficient queries. Performance =========== Amongst other things, we optimise equality guards by moving them into the pattern, which can avoid scanning the whole table, making queries `O(1)` or `O(log(n))` (depending on the table type), rather than `O(n)`, (where `n` is the number of rows in the table). In other words, this is not primarily a micro-optimisation, but rather a very substantial algorithmic complexity improvement for many common queries. In practice, I have seen no situations where the new `ets:fun2ms/1` queries are slower, but many simple queries can be executed drastically faster when the number of rows in the table is large. For example, even a simple query over a table of a million rows made up of pairs of keys and values queried with: ```erlang make_query(Key) -> ets:fun2ms(fun({K, V}) when K =:= Key -> {K,V} end). ``` now executes **>1000x faster** with my local benchmarks. Almost any query which requires that a `=:=` guard always hold will potentially see a substantial performance improvement for queries over big datasets. Theory ====== From the existing ETS match spec docs: > Traversals using match and select functions may not need to scan the > entire table depending on how the key is specified. A match pattern with > a fully bound key (without any match variables) will optimize the > operation to a single key lookup without any table traversal at all. For > ordered_set a partially bound key will limit the traversal to only scan > a subset of the table based on term order. A partially bound key is > either a list or a tuple with a prefix that is fully bound. We can leverage this knowledge to re-write queries to make better use of the key. For example: ```erlang make_query(Key) -> ets:fun2ms(fun({K, V}) when K =:= Key -> {K,V} end). ``` was previously compiled to: ```erlang { {'$1', '$2'}, [ {'=:=', '$1', Key} ], [{'$1', '$2'}] } ``` This was sub-optimal, since the equality guard is less efficient than the functionally-equivalent pattern match because the equality guard did not result in a fast lookup using the table's key. Now, the same function expression is compiled to this, more efficient, query: ```erlang { {Key, '$2'}, [], [{Key, '$2'}] } ``` We can also simplify constant parts of queries statically, and perform other rewritings to improve efficiency, but the largest win comes from inlining the values of variables bound by guards such as `(K =:= Key)`. Implementation ============== This optimisation is implemented for all relevant literals that I could find. Floats were given extra consideration and testing because of the differences in `==`/`=:=` vs. pattern matching. In this situation, the handling of floats in `ordered_set` is safe because we only inline `=:=` guards into the the match head and body, but we leave `==` as a guard, since determining statically whether the table type would make this a safe operation or not is not feasible using the the information available in the parse transform. New unit tests cover the parse transform compiling to the expected match expression, the match expression matching the expected rows, and the equivalence between the naive match expression and the optimised one in terms of data returned. See the changes to `ms_transform_SUITE.erl` for more information. This optimisation is specifically applied in `ets:fun2ms/1`, because I think users would expect generated match specs to avoid trivial inefficiencies (and, indeed, utilising the key efficiently when it was given as a parameter was impossible to express before). Moreover, by making use of `ets:fun2ms/1`, users have already ceded some control of the generated match spec to the tooling. Users who construct match specs directly will be unaffected. Notably, since `ets:fun2ms/1` is transformed at compile time (outside of the shell, at least), we don't pay any unnecessary performance penalty at runtime in order to apply these optimisations, and the cost of doing them at compile time is low relative to other operations. Later work could explore runtime query-planning for ETS, but avoiding introducing performance regressions for at least some queries will be harder to guarantee, since we then we would have to consider the runtime cost of computing the optimisation itself. Optimisation can be disabled with the `no_optimise_fun2ms` compiler flag, but by default it is enabled. The flag can be altered via the usual compile flag mechanisms, including the `-compile(no_optimise_fun2ms)` attribute.

and it’s my fault as a slow reviewer that it did not make it into OTP 26.0.

TD5 · June 1, 2023, 5:58pm

Hi folks, I made the PR @sverker referred to for optimising ets:fun2ms as OP wants. I made that PR, then go distracted with a million other things (have you seen feature: String interpolation by TD5 · Pull Request #7343 · erlang/otp · GitHub ), but I’ll revisit this now and get the noise I left in the commit cleaned up.

If there’s any further feedback or requests, see free to comment on the pull request and I’ll see what I can do.

Why use ets:fun2ms at all?

Historically, I suppose you wouldn’t if you were performance sensitive, but it does make for quite readable queries, imo. As you can see from the PR though, we can have the best of both words and just write a query optimised for ets:fun2ms

erlangys · June 1, 2023, 6:23pm

Didn’t know you already had a PR to solve this specific problem. Thank you very much for doing this. It looks like there are always pleasant surprises when you ask in this forum.
Look forward to seeing it merged into mainline.

erlangys · June 5, 2023, 7:32am

Stumble across this pull request Add a new operator ^ for pinning of pattern variables by richcarl · Pull Request #2951 · erlang/otp · GitHub and the corresponding EEP: eep/eep-0055.md at master · erlang/eep · GitHub. If this pinning operation is added, it will fix this variable shadowing issue nicely.

TD5 · June 5, 2023, 9:46am

I think this is a nice alternative, and before making the optimiser PR, I looked into getting variable pinning merged instead, but it looks like there’s quite a lot of discussion still ongoing around the merits of that change (although I am very much in favour of it!), so it didn’t seem to offer an easy solution to the issues with ms_transform.

erlangys · June 5, 2023, 4:36pm

Yes. It looks like this pinning operation may not be accepted in a short time. It will be faster to get your optimizer PR instead of waiting for it.