Does `max_heap_size` kill the process before/during/after garbage collection?

We’re doing some pretty memory- (and time-) intensive customer calculations (32 GB of memory or more), which requires us to jail our customers’ calculations using the process flag max_heap_size. This flag effectively kills any process whose heap grows beyond the set amount.
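For context, the jailing looks roughly like this (a sketch, not our actual code: `customer_calc/0` is a placeholder for the real calculation, and the 32 GB figure is just for illustration; note that max_heap_size is specified in words, not bytes):

```erlang
%% Sketch of the jail; customer_calc/0 is a placeholder for the real work.
WordSize = erlang:system_info(wordsize),               %% bytes per word, 8 on a 64-bit VM
MaxWords = (32 * 1024 * 1024 * 1024) div WordSize,     %% ~32 GB expressed in words
{Pid, MonRef} = spawn_opt(fun customer_calc/0,
                          [monitor,
                           {max_heap_size, #{size => MaxWords,
                                             kill => true,      %% kill when the limit is exceeded
                                             error_logger => true}}]).
```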

This generally works well, but we’ve noticed that some especially gluttonous calculations tend to enter a kind of “death spike” where memory goes up dramatically near the end. We’ve been using tprof to help us understand what’s going on, but some added understanding of the BEAM would probably help.

Reading through the OTP C code, I noticed that max_heap_size is checked on a minor GC, but it’s not clear to me exactly when during the GC the check happens, or what it happens before/after.

I know garbage collection also needs extra memory to copy live data from the “old” (garbage-filled) heap to the new heap (the one without the garbage). My main question: can this extra memory used by garbage collection cause the VM to kill the process because max_heap_size is hit? My reading of the code is that no, the extra memory used by GC is not counted toward the limit, but I couldn’t disprove it either.

I’m also wondering whether the garbage collection happens before max_heap_size is checked or after, because if the check happens too soon, maybe running the GC first would have saved the process. My reading of the C code is that the check only happens when the process needs more memory (hence the GC being triggered), so at that moment everything on the heap is “needed” and GC wouldn’t have saved it, but I’d love others to confirm.

What does seem likely is that the ~20% jumps in heap size get bigger and bigger in absolute terms, so each growth step can push the process past max_heap_size more quickly when it needs more memory.
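For what it’s worth, you can look at the heap size progression the emulator uses; if I’m reading the docs right, erlang:system_info(heap_sizes) returns the table of candidate heap sizes in words, and the later steps do grow by roughly 20%:

```erlang
%% Sketch: inspect the emulator's heap size table and the growth between steps.
Sizes = lists:sort(erlang:system_info(heap_sizes)),
Ratios = [Next / Prev || {Prev, Next} <- lists:zip(lists:droplast(Sizes), tl(Sizes))],
%% The last few ratios (largest heaps) should be around 1.2, i.e. ~20% jumps.
io:format("~p~n", [lists:sublist(lists:reverse(Ratios), 5)]).
```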

Finally, any tips for using less memory are always appreciated.

Thanks!
~ Gage


max_heap_size is checked at the start of a GC, before the new heap (to which live terms will be copied) is allocated. The size checked against the limit includes both the old heap with the existing terms (live and garbage) and the new, not-yet-allocated heap.

With the current design of the GC, introducing max_heap_size was a bit of a catch-22. We don’t know until the GC is done how much live data we have, but the GC needs to start by allocating a heap large enough to hold all the data that could possibly survive. And we want to kill the process before making that big allocation, which would otherwise exceed the limit.
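From the outside this just looks like an untrappable kill. For example (with a deliberately tiny, made-up limit so the kill happens immediately):

```erlang
%% Sketch: a process whose heap would grow past a tiny, made-up limit gets an
%% untrappable exit signal; a monitor sees the exit reason 'killed'.
{Pid, Ref} = spawn_opt(fun() -> lists:seq(1, 1000000) end,
                       [monitor,
                        {max_heap_size, #{size => 1000,          %% words, intentionally tiny
                                          kill => true,
                                          error_logger => false}}]),
receive
    {'DOWN', Ref, process, Pid, Reason} ->
        io:format("exit reason: ~p~n", [Reason])                 %% prints killed
end.
```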


This is an area I’m terribly interested in. I wonder how much of that 32 GB that ultimately leads to ERTS reaping the process is garbage? I suppose that’s a place to start for folks trying to help you reduce your footprint. You’d have to force a major GC on one of these processes before it gets reaped and observe the reduction.
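A sketch of how you might measure that, assuming Pid is one of the big calculation processes (erlang:garbage_collect/1 defaults to a major/fullsweep GC, and the sizes reported are in words):

```erlang
%% Sketch: estimate how much of a process's heap is garbage at this moment.
measure_garbage(Pid) ->
    {total_heap_size, Before} = erlang:process_info(Pid, total_heap_size),
    true = erlang:garbage_collect(Pid),          %% defaults to a major (fullsweep) GC
    {total_heap_size, After} = erlang:process_info(Pid, total_heap_size),
    #{before_words => Before, after_words => After, freed_words => Before - After}.
```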