Seg fault in erl_gc.c OTP 24.1.3

Hi,

This post is related to "Seg Fault OTP 24.1.3" (Issue #9385, erlang/otp on GitHub). I posted that issue and am currently trying to reproduce the problem with OTP 26.2.5, but I am still searching for the root cause.

I have gone through the commits to erl_gc.c between tags OTP-24.1.3 and OTP-26.2.5 and cannot find a commit that fixes the problem, if it has been fixed at all (my OTP 26.2.5 test is still ongoing).

Has anyone come across this issue, or does anyone know what the root cause is or was?

Here is a snippet of the core dump:

Core was generated by `/usr/lib/erlang/erts-12.1.3/bin/beam.smp -C multi_time_warp -K true -A 30 -- -r'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  do_minor (p=p@entry=0x70a9e168, live_hf_end=live_hf_end@entry=0x6c433368, mature=mature@entry=0x6c40ede0 "\200", mature_size=mature_size@entry=556, new_sz=<optimized out>, objv=<optimized out>, 
    objv@entry=0x7457ec88, nobj=nobj@entry=1) at beam/erl_gc.c:1545
1545	in beam/erl_gc.c
[Current thread is 1 (LWP 1310)]

This is the line of code in question: otp/erts/emulator/beam/erl_gc.c at OTP-24.1.3 (erlang/otp on GitHub). Here is a snippet for quick viewing; the line in question is ptr = boxed_val(gval);

for ( ; g_sz--; g_ptr++) {
    gval = *g_ptr;

    switch (primary_tag(gval)) {

    case TAG_PRIMARY_BOXED: {
        ptr = boxed_val(gval);
        val = *ptr;
        if (IS_MOVED_BOXED(val)) {
            ASSERT(is_boxed(val));
            *g_ptr = val;
        } else if (ErtsInArea(ptr, mature, mature_size)) {
            move_boxed(ptr,val,&old_htop,g_ptr);
        } else if (ErtsInYoungGen(gval, ptr, oh, oh_size)) {
            move_boxed(ptr,val,&n_htop,g_ptr);
        }
        break;
    }

    case TAG_PRIMARY_LIST: {
        ptr = list_val(gval);
        val = *ptr;
        if (IS_MOVED_CONS(val)) { /* Moved */
            *g_ptr = ptr[1];
        } else if (ErtsInArea(ptr, mature, mature_size)) {
            move_cons(ptr,val,&old_htop,g_ptr);
        } else if (ErtsInYoungGen(gval, ptr, oh, oh_size)) {
            move_cons(ptr,val,&n_htop,g_ptr);
        }
        break;
    }
    default:
        break;
    }
}

PS:
This issue came out of the thread "Compile regular beam.smp and output symbols to file" (BEAM Forum / BEAM Chat / Discussions, Erlang Forums), but I decided to keep the two separate, as that thread is more about getting the debug information than about fixing this specific bug.

The fault is most likely not in the GC, but somewhere in some code that produces corrupt terms, which means it could be in any C code in the erts or in your own NIFs/linked-in drivers.

Try running the testcase using the debug emulator and see if any fault pops up there.


Thanks @garazdawi .

I am running the regular emulator on an embedded ARMv7 device with very limited resources, and I found that running the debug emulator causes problems of its own, such as gen_servers crashing because their message queues build up, and general performance degradation (which is to be expected, from what I read in the debug emulator documentation).

I currently do not have a specific test case. This seg fault is quite elusive and only seems to appear in a cross-compiled environment after 5-28 days. What I do know is that my device was being SNMP-walked by one machine at the time of the seg fault, so I have set up multiple machines to walk a single device to try to reproduce it faster, but as I said, it still takes many days.

Do you think it is possible that "Fix ets match map copy bug" by garazdawi (Pull Request #7712, erlang/otp) fixed this issue? I read "BEAM crashes with segmentation fault" (Issue #7683, erlang/otp) and saw that the PR fixed that issue in OTP 24.3.4.13 and some variant of OTP 25. The lines of code that triggered the seg faults in those two releases are different, but also very close to my line: they are all within the same case in the switch block.

Read @garazdawi’s response again. The problem is most likely NOT in the GC (implemented in erl_gc.c). Some other code has left a broken Erlang term on the process heap, and it went unnoticed until the process was garbage collected by erl_gc.c.

The hotel housekeeper is in most cases not the murderer even if they were the one finding the dead person in the hotel room.
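
To make that concrete, here is a minimal sketch of why a corrupt term only blows up once the collector touches it. It uses a simplified model of tagged words: the type term_t, the TAG_* values, and the helpers make_boxed, boxed_in_heap and first_bad_root are all hypothetical names for illustration, not the real ERTS macros or heap layout. The scan walks a root set the way a collector would, but checks that each boxed word points into the heap instead of dereferencing it blindly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a tagged term word. These are illustrative
 * definitions only, NOT the real ERTS ones. */
typedef uintptr_t term_t;

enum {
    TAG_MASK  = 0x3, /* low two bits hold the primary tag */
    TAG_LIST  = 0x1,
    TAG_BOXED = 0x2
};

/* Hypothetical helper: tag an aligned heap address as a boxed term. */
static term_t make_boxed(const term_t *p)
{
    assert(((uintptr_t)p & TAG_MASK) == 0);
    return (uintptr_t)p | TAG_BOXED;
}

/* Returns 1 if t is boxed and its address lies in [heap, heap + words). */
static int boxed_in_heap(term_t t, const term_t *heap, size_t words)
{
    uintptr_t addr, lo, hi;

    if ((t & TAG_MASK) != TAG_BOXED)
        return 0;
    addr = t - TAG_BOXED;
    lo = (uintptr_t)heap;
    hi = lo + words * sizeof(term_t);
    return addr >= lo && addr < hi;
}

/* Scan the roots like a collector would, but validate each boxed word
 * instead of dereferencing it. Returns the index of the first boxed
 * word that points outside the heap, or -1 if none does. */
static long first_bad_root(const term_t *roots, size_t nroots,
                           const term_t *heap, size_t words)
{
    size_t i;

    for (i = 0; i < nroots; i++) {
        if ((roots[i] & TAG_MASK) == TAG_BOXED &&
            !boxed_in_heap(roots[i], heap, words))
            return (long)i;
    }
    return -1;
}
```

Whatever code wrote the bad word can finish without any visible error; only a later reader that trusts the tag and dereferences the address will fault. A real collector does not do this kind of validation on every word in a release build, which is why the crash shows up in the GC rather than at the point of corruption.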


Thanks @sverker , but please read my response again, and also perhaps take a read of the issue I linked in it.

I did not say in my response that I thought the fault was in erl_gc.c; I accept what @garazdawi said. I simply asked whether @garazdawi thought another fix may also have fixed my issue, whatever that issue actually is, given that the issue I linked also involved a seg fault in erl_gc.c (a few lines away from mine) without actually being caused by erl_gc.c itself.