Curious 18*10^18 bytes eheap_alloc crash

At my client's site, where I can run commands and take screenshots but not extract files, the RabbitMQ instance crashed with the most interesting memory allocation attempt I have ever seen.

The 4GB erl_crash.dump file’s slogan is:
eheap_alloc: Cannot allocate 18446744073350910976 bytes of memory (of type "heap")

We are using Erlang/OTP 26 (erts-14.2.5). The reason I am posting here instead of in the RabbitMQ chat is that I looked for active processes (lines with “Current Process State: Running”) and found only one, on dirty_io_scheduler:17. The process ID was <0.74.0> and the limited stack trace showed:

(prim_tty:reader_loop/6 + 848)
(proc_lib:init_p_do_apply/3 + 208)
(<terminate process normally>)

To me that looked more like an Erlang kernel kind of error than an application error from RabbitMQ, but I am happy to be corrected.

The entire thing is running inside a Windows Server VM with 8 GB of RAM, so the 18-quintillion-byte allocation failed. The erl_crash.dump had some statistics that may also be of interest:

=memory
total: 9023312216
processes: 8306044992
processes_used: 8305987760
system: 717267224
atom: 1565025
...
binary: 559868256
code: 40598622
ets: 52082272

It is not clear what events might have caused this, because we had limited monitoring of the VM. I cannot say with certainty whether the node was being halted, whether there was increased load at that point, or whether the machine was simply running out of memory either way.

I cannot find any explicit eheap_alloc calls in prim_tty:reader_loop/6 in OTP-26.2.5.7/lib/kernel/src/prim_tty.erl, so I am not sure how to investigate any further.

Any thoughts about how I could investigate this further?

18446744073350910976 decimal equals 16#FFFFFFFFEA9F9400 hex. You can see that the high 32 bits are sign-extended from the low 32 bits. This tells me it’s probably some size or address computation done in 32-bit precision with a signed type that then gets sign-extended to 64 bits. A typical case would be something computed as a plain int and then passed or assigned to something that’s size_t, intptr_t, or similar. (Does Windows x64 still make long 32 bits? That was a major issue 10-15 years ago when I last did development on Windows. If so, watch out for code assuming long is the size of a pointer, because it won’t be.)
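You can verify that arithmetic from any Erlang shell; this is just a sketch of the two’s-complement math, not the actual C code path:

Bytes = 18446744073350910976,
Low32 = Bytes band 16#FFFFFFFF,        %% 3936326656 = 16#EA9F9400
<<AsInt32:32/signed>> = <<Low32:32>>,  %% AsInt32 = -358640640 (roughly -342 MB)
<<AsSizeT:64>> = <<AsInt32:64>>,       %% sign-extend to 64 bits, read back as unsigned
AsSizeT =:= Bytes.                     %% true

So whatever went wrong apparently produced a 32-bit value of about -342 MB before it was widened into a 64-bit allocation size.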

I’m frankly surprised to see this, since I thought we had fixed most of these 64-bit bugs well before OTP 20. I hope you don’t have any NIFs in the process.

Ideally you should attach a debugger to the process and capture a stack trace at the point of this invalid allocation attempt.

Does Windows x64 still make long 32 bits?

Yes :frowning:

That would make a lot more sense than anything else. Sadly, I won’t be able to run a debugger in that environment, and I am also not sure how to reproduce the issue. I was hoping I could keep fiddling around with the erl_crash.dump until I found clues about where to investigate.

From the documentation, I expected that only running processes could be responsible for my crash. But the one I mentioned is purely from the kernel, and I have my doubts that it could be responsible.

Is it possible that the only other process with an Internal State of ACTIVE was the culprit? The reason I did not look at it is that its State is Garbing. Would that mean that the GC of that process might be causing all these issues? Is that where I should investigate? If so, I could imagine that other process (from RabbitMQ) hitting a race condition where some bad arithmetic produces an unexpected negative number that then gets converted back and forth between a signed int and an unsigned size_t.

Before I dive too deep into that very long, dark tunnel, I am going to keep investigating how to analyze the erl_crash.dump without being able to move it to a machine where I have proper developer tools.
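Since the dump is plain text, one option might be to stream it from an Erlang shell on the VM itself and only print the lines of interest, so the whole 4 GB file never has to fit in memory. A rough sketch (the path and the search string are placeholders to adjust):

{ok, Fd} = file:open("C:/path/to/erl_crash.dump", [read, raw, binary, read_ahead]),
Scan = fun Scan(F) ->
           case file:read_line(F) of
               {ok, Line} ->
                   case binary:match(Line, <<"Garbing">>) of
                       nomatch -> ok;
                       _       -> io:format("~s", [Line])
                   end,
                   Scan(F);
               eof ->
                   file:close(F)
           end
       end,
Scan(Fd).

If observer and wx happen to be usable on that host, crashdump_viewer:start/1 can also open a dump directly, but the streaming approach needs nothing beyond the shell.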

The reason I did not look at it is that its State is Garbing. Would that mean that the GC of that process might be causing all these issues?

Yes, that process most likely is the culprit. That the state is “Garbing” means that the process is just now garbage collecting, and part of that is requesting new memory using eheap_alloc. That something fails in the GC usually means that something somewhere else has created a corrupt term that the GC then looks at. So I would look closely at any native code (NIFs or drivers) that the process in question is interacting with. If you do find a way to reproduce the fault, running the debug emulator may cause the error to happen before the GC, which makes it a lot easier to debug. You can do that by passing -emu_type debug to erl.exe.
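If you go that route, you can confirm from the shell that you actually got the debug emulator; erlang:system_info(build_type) returns debug on a debug build and opt on the normal one:

1> erlang:system_info(build_type).
debug

Note that the standard Windows installer may not include a debug emulator, in which case it would have to be built from source.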