Segfault when calling nif after ct:print statement

cfclavijo · December 19, 2024, 11:11pm

Hello y’all! I’m prototyping something and ended up writing a NIF to call tree-sitter; I took the opportunity to learn C and understand how NIFs work.
With a bit of help from Copilot and ELisp I ended up wrapping almost all of the tree-sitter API; everything was good until I got a segfault.
After moving pieces around I found that calling ct:pal would “kind of flush” the resources. Curiously, not all of them, only the Node type ones. Here is the repo for the lib: GitHub - cfclavijo/erl_ts at develop

I added a test to replicate this problem (initially I thought it was related to the function erl_ts:node_end_byte, thus the name)

node_end_byte_segfault_post_print(_Config) ->
  SC = "fun(A)->1+2.",
  {ok, Parser} = erl_ts:parser_new(),
  {ok, Lang} = erl_ts:tree_sitter_erlang(),
  true = erl_ts:parser_set_language(Parser, Lang),
  Tree = erl_ts:parser_parse_string(Parser, SC),
  RootNode = erl_ts:tree_root_node(Tree),
  ct:print(default, ?LOW_IMPORTANCE, "this ends in segfault wtf", [], []),
  %% erl_ts:tree_language(Tree), %% this makes the subsequent calls to erl_ts work

  %% RootNodeB = erl_ts:tree_root_node(Tree), %% this also makes the subsequent calls to erl_ts work
  FunDeclNode = erl_ts:node_child(RootNode, 0), %% here comes the segfault

Running rebar3 ct does not always produce a segfault, though.
If I remove the ct:print or call the NIF passing a different type of reference (like the Tree), everything works.
Any idea what could be happening?
(Suggestions on how to improve the C code are welcome as well.)

Thanks!

vkatsuba · December 19, 2024, 11:43pm

The segfault you’re encountering is likely due to resource management or memory handling issues in your NIF implementation. This kind of behavior is common when there’s a mismatch between how resources are allocated, referenced, and freed, particularly in the interaction between C and the Erlang VM.

It seems like the issue arises with how the RootNode is being handled. The ct:print call might be changing the execution context slightly - such as by triggering a garbage collection or altering resource timing - which could explain why it affects the behavior. Similarly, when you pass a different reference (like the Tree) or re-fetch the RootNode, it might “refresh” or reset the resource, avoiding the segfault.

There are a few things to consider here:

First, look into how you’re managing the lifetime of RootNode and Tree in your C code. If the RootNode depends on the Tree in any way, ensure that the Tree is still valid and hasn’t been released when you try to access the RootNode. If Tree gets garbage-collected or invalidated while RootNode is still in use, it could lead to undefined behavior and the segfault you’re seeing.
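
One way to express that dependency is to have every node take a reference on its tree when it is created and give it back when the node is destroyed. The sketch below models this in plain C so it compiles without erl_nif.h. The `Tree` and `Node` structs and the `tree_keep`/`tree_release` helpers are hypothetical stand-ins, not code from erl_ts; in a real NIF the same roles would presumably be played by `enif_keep_resource` and `enif_release_resource`, called when the node resource is created and in its destructor.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted NIF resource
   wrapping a TSTree. This is a model, not the erl_nif API. */
typedef struct Tree {
    int refcount;
    int *destroyed;   /* set to 1 when the "destructor" runs */
} Tree;

/* Hypothetical node wrapper: its data is only valid while tree lives. */
typedef struct Node {
    Tree *tree;
} Node;

static Tree *tree_new(int *destroyed) {
    Tree *t = malloc(sizeof *t);
    if (t == NULL) return NULL;
    t->refcount = 1;          /* the reference owned by the Erlang term */
    t->destroyed = destroyed;
    *destroyed = 0;
    return t;
}

static void tree_keep(Tree *t) {
    t->refcount++;
}

static void tree_release(Tree *t) {
    assert(t->refcount > 0);
    if (--t->refcount == 0) {
        *t->destroyed = 1;    /* this is where ts_tree_delete would run */
        free(t);
    }
}

static Node *node_new(Tree *t) {
    Node *n = malloc(sizeof *n);
    if (n == NULL) return NULL;
    tree_keep(t);             /* the node now co-owns the tree */
    n->tree = t;
    return n;
}

static void node_destroy(Node *n) {
    tree_release(n->tree);    /* give the node's reference back */
    free(n);
}

/* Returns 1 if the tree survives until the node is destroyed,
   0 if it does not, and -1 on allocation failure. */
int demo_node_keeps_tree_alive(void) {
    int destroyed = 0;
    Tree *tree = tree_new(&destroyed);
    if (tree == NULL) return -1;
    Node *root = node_new(tree);
    if (root == NULL) { tree_release(tree); return -1; }

    tree_release(tree);       /* models the GC collecting the Tree term */
    int alive_while_node_exists = !destroyed;

    node_destroy(root);       /* last reference gone, tree is destroyed */
    return alive_while_node_exists && destroyed;
}
```

With this arrangement the Erlang side would not need to keep the Tree term around just to keep its nodes usable.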

Also, pay attention to the possibility of resource ownership issues. If you’re using enif_alloc_resource, make sure the corresponding enif_release_resource is being called appropriately. Any mismatch in resource allocation and deallocation can cause problems like dangling pointers or double frees.

Since running “rebar3 ct” doesn’t always throw the segfault, there might also be a concurrency or timing issue. Parallel test execution could expose subtle race conditions in your resource handling code. Tools like Valgrind or AddressSanitizer can be very helpful for catching these types of bugs.

Adding detailed debug output to your NIF can also help track resource allocation and deallocation. Logging every resource’s creation, use, and release will give you a clearer picture of what’s happening under the hood.
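
As a rough illustration of that kind of tracing, here is a plain-C sketch that logs each creation, use, and destruction and keeps a count of live resources. All names (`traced_alloc`, `traced_use`, `traced_free`, `res_live_count`) are hypothetical; inside a NIF the `malloc`/`free` calls would be replaced by the resource allocation and destructor paths, and the counters would need to be atomic because destructors can run on a different thread than the NIF call.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counters; plain ints are fine for a single-threaded
   model but would need to be atomic inside the VM. */
static int res_created = 0;
static int res_destroyed = 0;

/* Print one line per lifecycle event: event name, resource kind, address. */
#define RES_TRACE(event, kind, ptr) \
    fprintf(stderr, "[res] %-7s %-4s %p\n", (event), (kind), (void *)(ptr))

static void *traced_alloc(const char *kind, size_t size) {
    void *p = malloc(size);
    if (p != NULL) {
        res_created++;
        RES_TRACE("create", kind, p);
    }
    return p;
}

static void traced_use(const char *kind, void *p) {
    RES_TRACE("use", kind, p);
}

static void traced_free(const char *kind, void *p) {
    RES_TRACE("destroy", kind, p);
    res_destroyed++;
    free(p);
}

/* Number of resources created but not yet destroyed. */
int res_live_count(void) {
    return res_created - res_destroyed;
}
```

A "use" line that appears after the "destroy" line for the same address is the pattern to look for in the output.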

Finally, try to reproduce the issue with a minimal, standalone example. Simplifying the problem can help isolate the root cause, making it easier to identify where things are going wrong in the C implementation. If you can pinpoint the exact function or line in the C code where the segfault occurs, it’ll be much easier to debug.

cfclavijo · December 22, 2024, 11:31pm

@vkatsuba Thanks for your reply, it encouraged me to go down the rabbit hole of how to debug Erlang and NIFs!

I managed to build Erlang and the NIFs with AddressSanitizer, and the report indicates that the NIF resource holding the Tree (the tree-sitter Tree pointer) is being freed by a call to its destructor.

ERROR: AddressSanitizer: heap-use-after-free
...
  in ts_node_child lib/src/node.c:610
  in node_child_nif
...
freed by thread T4 here
...
  in ts_tree_delete lib/src/tree.c:37
  in run_resource_dtor beam/erl_nif.c:2892

Why it happens after the ct:print statement is beyond my knowledge. I want to believe it is an optimization where the resource is released because the Erlang code no longer uses the Tree afterwards, which matches your point about the Tree being released while the RootNode is still in use.

I’m having a hard time thinking of a good solution for this case where Erlang would take care of it automatically. I guess that calling the destructor explicitly is my simplest option.

Once more, thank you!