Shrunk single blocks in single block carriers - shrunk beyond the relative shrink threshold

masleeds · February 22, 2023, 2:43pm

Apologies in advance, a fairly long setup for my question, to give some context.

For some time I have been struggling with memory issues in a large (in terms of number of keys and data) Riak cluster running OTP 22.3. Nodes in the cluster can suffer from wild fluctuations and unexpected growth in their memory footprint - up from 50GB to 200GB in some cases.

The situation has been improved through a number of changes, mainly focused on trying to reduce the size of the loop state of processes (where there are thousands of such processes), hibernating processes where possible, and being careful with binary handling to avoid holding onto references unnecessarily.

One thing that has been observed was that at times fragmentation of the eheap_alloc multiblock carriers was high (recon_alloc reporting this at around 30%). In Riak there are a lot of processes with a memory footprint of around 150-200KB, so a change that has been tried is to reduce the sbct threshold from 512KB to 128KB - so that these processes would now use single block carriers. This change proved to be consistently successful in reducing memory footprint in test, and so now that setting is in some production systems.

However, in the large and previously problematic production system, a fragmentation issue has now also occurred, even with the threshold change. In this case the erlang:memory/0 was reporting 30GB of memory in use, but the OS was reporting the beam using more than 150GB.

Summing {mbcs_block_size, mbcs_carrier_size, sbcs_block_size, sbcs_carrier_size} across all eheap_alloc instances (1 to 48), as reported by recon_alloc:fragmentation(current), returns the following result:

{2356877512,4138991616,26392513472,158942248960}

So, as would be expected with the reduced sbct threshold, the majority of the blocks allocated by eheap_alloc are in single block carriers rather than multiblock carriers (26GB vs 2.4GB). What is unexpected is the discrepancy between the total block size and the total carrier size for the single block carriers (26GB vs 159GB). It looks like a lot of blocks in single block carriers have been shrunk, so much so that the average block size is only 17% of the average carrier size.
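For anyone checking the arithmetic, here is a small Python sketch (not Erlang, and not part of recon) that takes the four totals quoted above and works out the utilisation of each carrier type. The function and key names are made up for illustration; only the four byte counts come from the post.

```python
# Totals from the post, in bytes, in the order
# (mbcs_block_size, mbcs_carrier_size, sbcs_block_size, sbcs_carrier_size).
TOTALS = (2356877512, 4138991616, 26392513472, 158942248960)


def utilisation(block_bytes, carrier_bytes):
    """Fraction of carrier space that is occupied by blocks."""
    return block_bytes / carrier_bytes


def summarise(totals):
    """Return utilisation per carrier type and the unused sbc space in bytes."""
    mbcs_block, mbcs_carrier, sbcs_block, sbcs_carrier = totals
    return {
        "mbcs_utilisation": utilisation(mbcs_block, mbcs_carrier),
        "sbcs_utilisation": utilisation(sbcs_block, sbcs_carrier),
        "sbcs_unused_bytes": sbcs_carrier - sbcs_block,
    }


if __name__ == "__main__":
    summary = summarise(TOTALS)
    print(f"mbcs utilisation: {summary['mbcs_utilisation']:.1%}")
    print(f"sbcs utilisation: {summary['sbcs_utilisation']:.1%}")
    print(f"sbcs unused: {summary['sbcs_unused_bytes'] / 1e9:.1f} GB")
```

Almost all of the gap between the blocks in use and the carrier space sits in the single block carriers, which is why the rest of the thread concentrates on how those carriers are shrunk.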

So the primary question is how this occurs, especially given that the default rsbcst value for eheap_alloc is 50%. My understanding is that for any single block carrier from eheap_alloc where the block has shrunk to less than half the size of the carrier, the carrier itself should be shrunk to free up the memory.
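The reading of rsbcst described above can be written down as a toy predicate. This is only a Python model of that understanding, assuming the threshold is compared against the unused share of the carrier; it ignores the absolute threshold (asbcst) and anything else the emulator does internally, and the function name and example sizes are hypothetical.

```python
def expect_carrier_shrink(block_bytes, carrier_bytes, rsbcst_percent=50):
    """Toy model: expect a shrink once the unused share of the carrier
    reaches the relative threshold. Not the emulator's actual logic."""
    unused_bytes = carrier_bytes - block_bytes
    return 100 * unused_bytes >= rsbcst_percent * carrier_bytes


if __name__ == "__main__":
    KIB = 1024
    # Hypothetical 1024 KiB carrier holding a block shrunk to 170 KiB (~17% used).
    print(expect_carrier_shrink(170 * KIB, 1024 * KIB))  # True under this model
    # Same carrier with a 600 KiB block (~59% used).
    print(expect_carrier_shrink(600 * KIB, 1024 * KIB))  # False under this model
```

Under this model a carrier that is only 17% used is well past the point where a shrink would be expected, which is what makes the observed totals surprising.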

Is there another process that needs to be initiated to make this happen? Or some known situation where shrinking the carrier will be blocked? Or have I misunderstood the meaning of these thresholds?

Thanks

masleeds · February 24, 2023, 9:26pm

I’ve been speculating as to the cause of these symptoms. I have a hypothesis, which is based on these further questions:

Are single block carriers actually memory-mapped files?
When the carriers are shrunk, is that done by changing the underlying file, or by using MADVISE to indicate that the unused space in the memory-mapped file is no longer required?
If MADVISE is used, and we’re not careful in how we measure memory used by the application, until the OS reclaims the space, might the free space still be reported as being in use by the BEAM process?
When calculating the aggregate carrier size, is the size of the underlying memory-mapped files being calculated, ignoring the fact that some of the space has been shrunk using madvise?

My knowledge of memory-mapped files is weak, but I think this would explain all the symptoms we see. This includes the fact that the aggregate size of all the carriers across all the allocators was bigger than the amount of memory being reported by the OS. The carriers had been shrunk; it was just that the OS had not yet found an alternative use for the freed-up memory, and whilst those pages were still in memory they were still accounted as being in use by the beam.

max-au · February 25, 2023, 4:08am

They are msegs. By default, that’d be anonymous memory mappings (“mmap” in Linux), so you can probably call it “memory mapped files” (although there is no actual file to back it).

As for MADVISE… I recall it being on and off in the BEAM. The official documentation (search for +M<S>acful <utilization>|de) says it is used, and the source code also has this PR, which appears to do exactly what you need (not that I’m suggesting to port RIAK to OTP 26 RC, but it could be helpful).

masleeds · February 25, 2023, 8:20am

Thank you Max, some helpful links, and yes, I am mangling my terminology here. Thank you for the clarification.

Reading the manual, I saw that when using MADV_DONTNEED it states that the resident set size of the process is updated immediately, but the same assurance isn’t given when using MADV_FREE. So perhaps there is a trade-off between performance and observability when using MADV_FREE to reclaim memory. If OTP is using MADV_FREE an operator may not easily be able to see the return of memory, and may erroneously believe a memory problem exists.
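The difference is easy to see outside the BEAM. Below is a rough, Linux-only Python sketch (Python 3.8+ for mmap.madvise) that maps anonymous memory, touches every page, applies an madvise hint and reads the resident set size from /proc/self/statm at each step. The 64 MiB size and the function names are arbitrary choices for illustration, and nothing here says which hint OTP 22.3 actually uses.

```python
import mmap
import os

SIZE = 64 * 1024 * 1024  # 64 MiB test mapping; arbitrary size for illustration


def rss_bytes():
    """Resident set size of this process, read from /proc/self/statm (Linux)."""
    with open("/proc/self/statm") as statm:
        resident_pages = int(statm.read().split()[1])
    return resident_pages * mmap.PAGESIZE


def measure(advice_name):
    """Return RSS before touching, after touching and after madvise.

    Returns None when the hint or /proc/self/statm is not available.
    """
    advice = getattr(mmap, advice_name, None)
    if advice is None or not os.path.exists("/proc/self/statm"):
        return None
    mapping = mmap.mmap(-1, SIZE,
                        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                        prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        baseline = rss_bytes()
        # Write one byte per page so that every page becomes resident.
        for offset in range(0, SIZE, mmap.PAGESIZE):
            mapping[offset] = 1
        touched = rss_bytes()
        mapping.madvise(advice)  # applies to the whole mapping
        advised = rss_bytes()
    finally:
        mapping.close()
    return {"baseline": baseline, "touched": touched, "advised": advised}


if __name__ == "__main__":
    for name in ("MADV_DONTNEED", "MADV_FREE"):
        print(name, measure(name))
```

With MADV_DONTNEED the "advised" figure should fall back towards the baseline straight away. With MADV_FREE it may well stay close to the "touched" figure until the kernel comes under memory pressure, which is the observability problem described above.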

I stumbled across a very helpful blog post from NextRoll, which described the performance side of that trade-off (what happens when MADV_FREE falls back to MADV_DONTNEED). So clearly not a simple choice.

masleeds · February 28, 2023, 9:56am

I’ve been trying to do some research into the issues of monitoring memory usage of a process when MADV_FREE is used, and I came across this from the go community:

github.com/golang/go

runtime: default to MADV_DONTNEED on Linux

opened 12:28AM - 02 Nov 20 UTC

closed 04:15PM - 02 Nov 20 UTC

aclements

NeedsDecision FrozenDueToAge

In Go 1.12, we changed the runtime to use `MADV_FREE` when available on Linux (falling back to `MADV_DONTNEED`) in [CL 135395](https://golang.org/cl/135395) to address issue #23687. While `MADV_FREE` is somewhat faster than `MADV_DONTNEED`, it doesn't affect many of the statistics that `MADV_DONTNEED` does until the memory is actually reclaimed. This generally leads to poor user experience, like confusing stats in `top` and other monitoring tools; and bad integration with management systems that respond to memory usage.

We've seen numerous issues about this user experience, including #41818, #39295, #37585, #33376, and #30904, many questions on Go mailing lists, and requests for mechanisms to change this behavior at run-time, such as #40870. There are also issues that may be a result of this, but root-causing it can be difficult, such as #41444 and #39174. And there's some evidence it may even be incompatible with Android's process management in #37569.

I propose we change the default to prefer `MADV_DONTNEED` over `MADV_FREE`, to favor user-friendliness and minimal surprise over performance. I think it's become clear that Linux's implementation of `MADV_FREE` ultimately doesn't meet our needs. We've also made many improvements to the scavenger since Go 1.12. In particular, it is now far more prompt and it is self-paced, so it will simply trickle memory back to the system a little more slowly with this change.

/cc @mknyszek @rsc

It appears that after originally moving to MADV_FREE for performance reasons, they have now moved back to MADV_DONTNEED, mainly because of the confusion it caused for operators trying to track memory usage.

max-au · February 28, 2023, 4:18pm

That’s exactly why I said “it’s been on and off”, not just for the BEAM, but for the entire industry.

We ended up watching RSS (for the OOM killer) and the erlang:memory/0,1 return values, ignoring other OS-reported counters.
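If MADV_FREE turns out to be in play, one way to see how much of the RSS is only lazily freed is the LazyFree field that Linux reports in /proc/<pid>/smaps_rollup, on kernels that provide that file. The Python sketch below parses that format and subtracts LazyFree from Rss. The helper names are hypothetical, the sample text uses made-up numbers, and reading another process's file normally requires the same user or root.

```python
import re

# Made-up sample in the smaps_rollup format; the values are illustrative only.
SAMPLE = """\
55d0c0000000-7ffd0a1f2000 ---p 00000000 00:00 0    [rollup]
Rss:             1000000 kB
Pss:              990000 kB
Anonymous:        950000 kB
LazyFree:         400000 kB
Swap:                  0 kB
"""


def parse_kb_fields(text):
    """Collect the 'Name:   <n> kB' lines into a dict of integers (kB)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+) kB$", line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def rss_excluding_lazyfree_kb(text):
    """RSS minus pages already handed back with MADV_FREE but not yet reclaimed."""
    fields = parse_kb_fields(text)
    return fields["Rss"] - fields.get("LazyFree", 0)


def read_rollup(pid):
    """Read /proc/<pid>/smaps_rollup for a live process (Linux only)."""
    with open(f"/proc/{pid}/smaps_rollup") as rollup:
        return rollup.read()


if __name__ == "__main__":
    print(rss_excluding_lazyfree_kb(SAMPLE), "kB still genuinely resident")
```

Pointing read_rollup at the beam.smp pid and comparing the result with erlang:memory(total) would show whether the gap is lazily freed pages or space the emulator is still holding.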