The question is, how can it determine which values to copy? Does it keep a list of all the referenced variables in the function body in order to do so?
Sort of. In the BEAM instruction that creates the fun there is a list of the values that need to be part of the fun's environment. To demonstrate the concept of the environment, I have created the following function in module t:
incrementer(Inc) ->
    fun(Value) -> Value + Inc end.
That is, the values of any variables from the enclosing function that the fun refers to (here, Inc) are stored in the environment for the fun. When storing a value in the environment, no deep copy is done. If a term in the environment is, for example, a tuple, only a tagged pointer to the contents of the tuple is stored in the environment.
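One way to look at the environment from the shell is erlang:fun_info/2, which the documentation describes as mainly intended for debugging. The sketch below repeats incrementer/1 and adds env_of/1, a helper name invented here for illustration. It assumes the module is compiled, since funs defined directly in the shell are interpreted and carry a different environment.

```erlang
-module(t).
-export([incrementer/1, env_of/1]).

%% The fun refers to Inc, so the value of Inc becomes part of the
%% fun's environment when the fun is created.
incrementer(Inc) ->
    fun(Value) -> Value + Inc end.

%% Hypothetical helper (not part of the original example): returns the
%% list of captured values as reported by erlang:fun_info/2.
env_of(Fun) when is_function(Fun) ->
    {env, Env} = erlang:fun_info(Fun, env),
    Env.
```

For a compiled module, t:env_of(t:incrementer(2)) should return a one-element list holding 2, the captured value of Inc.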
When a fun is passed to spawn/1, as part of copying the fun, all values in the fun's environment will be deep-copied into the heap of the newly spawned process (that is, tagged pointers will be followed and what they point to will be copied).
Thanks for the detailed explanation. That helps a lot in understanding how funs work in Erlang. A follow-up question: since all this information is parsed and stored, is it possible to analyze it further to avoid the accidental copying automatically? In the given example, we can tell from map_get(info, State) that only Info is needed, so we could store just Info instead of the whole State. That way, even if the code is written as in accidental2, the compiler would still optimize it automatically to avoid unnecessary copying.
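The rewrite being asked about can already be done by hand. A minimal sketch, with the module name capture_sketch and both function names made up for illustration (the accidental2 code referred to above is not reproduced here), and assuming State is a map that has an info key:

```erlang
-module(capture_sketch).
-export([captures_state/2, captures_info/2]).

%% The fun refers to State, so the whole map goes into the fun's
%% environment and is deep-copied to the new process by spawn/1.
captures_state(State, ReplyTo) ->
    spawn(fun() -> ReplyTo ! {info, map_get(info, State)} end).

%% Only Info (and the ReplyTo pid) is referenced inside the fun, so
%% only those values are captured and copied. State is not copied.
captures_info(State, ReplyTo) ->
    Info = map_get(info, State),
    spawn(fun() -> ReplyTo ! {info, Info} end).
```

Both functions send the same message to ReplyTo when the key exists. They differ in how much data ends up in the fun's environment, and in where the lookup is evaluated.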
or a "safe" function applied to extractable
expressions, where map_get/2 and + and so on
might count as safe.
Let a Fun contain an extractable expression E
that contains every occurrence of a non-local
variable V in Fun. Then the value of E can be
computed and stored in Fun's environment instead
of V.
This is a well-known compiler optimisation, called
"code motion out of loops". (Well, it's basically
that optimisation wearing a Voltaire mask and
whistling a secret spy tune. It's what code motion
out of loops would look like in Erlang.)
The problem is that there are no safe functions,
and in particular, map_get/2 isn't one.
Risk 1: evaluating an extractable expression might
result in an exception being raised that would NOT
have been raised.
Risk 2: when Fun does use the value of E, and an
exception is raised, it will be raised in the wrong
process, not just the wrong function.
Risk 3: the value of E might be bigger than V.
It's only safe to hoist map_get(info, State) out of
the fun+spawn when you KNOW that State has an
info slot. Type checking can help with that, and in
the presence of type information this optimisation might
actually be useful.
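
The change in where the exception is raised can be seen by comparing the two forms on a State that has no info key. This is a minimal sketch; the module name hoist_risk and both function names are invented for illustration.

```erlang
-module(hoist_risk).
-export([in_fun/1, hoisted/1]).

%% map_get/2 is evaluated inside the spawned process. If the key is
%% missing, that process crashes and the caller still gets a pid back.
in_fun(State) ->
    spawn(fun() -> map_get(info, State) end).

%% map_get/2 is evaluated before spawn/1. If the key is missing, the
%% error is raised in the calling process and nothing is spawned.
hoisted(State) ->
    Info = map_get(info, State),
    spawn(fun() -> Info end).
```

With an empty map, hoist_risk:in_fun(#{}) returns a pid and the badkey error occurs in the spawned process, while hoist_risk:hoisted(#{}) fails in the caller with {badkey, info}. An automatic rewrite from the first form to the second would therefore change observable behaviour unless the compiler knows the key is present.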