VM settings to combat large single block carriers and fragmentation

I’m looking for allocator settings advice. What I am seeing is that total memory remains low, and eventually the VM gets OOM-killed by the Linux kernel. Since memory is low, ~10%, this leads me to believe that fragmentation is causing this crash. It is known that this application does a large number of small allocations. Under small sustained load the memory utilization does not end up dropping; memory is only coalesced when the system is quiet. Additionally, there is a super carrier enabled at 80% of available memory and we allow mmap from the underlying OS.

Here are some metrics.

Using recon to look at the average block sizes, there are pretty large binaries allocated, as well as large heap blocks(?).

> rpc:call(N, recon_alloc, average_block_sizes, [current]).

So there are quite large sbcs allocated. Several guides online suggest that one wants fewer sbcs and more allocation in mbcs, as mbcs coalesce better than sbcs. Here is the ratio at peak (multiple runs look similar):

> rp(rpc:call(N, recon_alloc, sbcs_to_mbcs, [current])).

I’m not sure if this is classified as ‘good’ or ‘bad’, but a ratio exists.

The instrument module shows the following:

> rpc:call(N, instrument, allocations, []).
     #{crypto =>
           #{binary => {0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
       prim_buffer =>
           #{binary => {0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,14,0,7,3,0,0,0,0,0,0,0,0,0,0,0,0,0},
             nif_internal => {0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
       prim_file =>
           #{binary => {0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,9,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0}},
       prim_socket =>
           #{binary => {0,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_mutex => {27,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,0,0,0,0,0,0,0,0,0,2,0,3,0,0,0,0,0},
             nif_internal => {8,1,27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
       system =>
           #{binary => {35,109,17,10,1,0,2526,1,0,0,0,0,1,0,0,0,0,0},
             driver => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_mutex => {3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_rwlock => {41,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tid => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_internal => {0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             microstate_accounting =>
             nif_internal => {0,19,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port => {0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port_data_lock => {2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},

Looks pretty benign to me, but I could be missing something.

Finally, this is what recon_alloc:fragmentation reports:


Are there any diagnostic functions that should be tested?

With these metrics in mind, I’m experimenting with the following settings:

+MBsbct 2000
+MBlmbcs 100000
+MBsmbcs 1024
+MBas aobf

+MHsbct 2000
+MHlmbcs 100000
+MHsmbcs 1024
+MHas aobf

Using these, some of the above functions look better but the general problem still remains. One change is that eheap_alloc is now larger, with a single large block allocated:

> rp(rpc:call(N, instrument, carriers, [])).


There are 9 eheap_allocs with this exact pattern.

Additionally, the sbc to mbc ratio looks better, except for one eheap:

> rp(rpc:call(N, recon_alloc, sbcs_to_mbcs, [current])).

And here is what recon_alloc:fragmentation reports:


Are there any other suggested allocator settings to adjust?


I’m curious, have you tried doing a “global” gc? You can iterate over all processes and perform a garbage_collect/1 on each pid. You can also run recon:bin_leak(N), where N is some arbitrary number, to get the same effect. I wonder if you see a healthy chunk of memory reclaimed and perhaps happy carriers.
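A minimal sketch of that global GC, assuming it is compiled and loaded on the node under test. The module name `global_gc` and the shape of the returned map are made up for illustration; only `erlang:processes/0`, `erlang:garbage_collect/1` and `erlang:memory/1` come from OTP.

```erlang
-module(global_gc).
-export([run/0]).

%% Force a garbage collection on every local process and report how
%% erlang:memory(total) changed. A process that exits mid-sweep is
%% harmless: garbage_collect/1 simply returns false for a dead pid.
run() ->
    Before = erlang:memory(total),
    Pids = erlang:processes(),
    lists:foreach(fun(Pid) -> erlang:garbage_collect(Pid) end, Pids),
    After = erlang:memory(total),
    #{processes => length(Pids),
      before_bytes => Before,
      after_bytes => After,
      %% May be negative if the node allocated more during the sweep.
      reclaimed_bytes => Before - After}.
```

On a remote node this could be called as rpc:call(N, global_gc, run, []). Note that erlang:memory(total) only shows what the VM considers allocated, so a drop here does not by itself mean carriers were returned to the OS.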

I’m curious because we have a similar setup at work and had a similar problem. While we have not adjusted any memory settings, we did find an issue whereby the generational garbage collection would run too late or never, and thus memory would never get freed. We solved this issue by narrowing it all down to a few processes and using a combination of {fullsweep_after, 0} and setting the process to hibernate every so often. This actually straightened out what we thought were memory fragmentation issues that could only be sorted by adjusting allocator settings. That is not to say adjusting allocators isn’t the right fix in the long term, or perhaps examining the life cycle of the process and making adjustments there; however, it did turn out to be a solid mitigation method at the very least.
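For illustration, here is roughly what that combination can look like in a gen_server. This is a sketch under assumptions, not the code from our system: the module name `quiet_worker`, the `store/2` and `fetch/1` API, and the 5 second idle timeout are all hypothetical.

```erlang
-module(quiet_worker).
-behaviour(gen_server).

-export([start_link/0, store/2, fetch/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Idle time before the process hibernates (arbitrary, assumed value).
-define(IDLE_MS, 5000).

start_link() ->
    %% fullsweep_after 0 turns every GC of this process into a full sweep,
    %% so data does not linger on the old heap between collections.
    gen_server:start_link(?MODULE, [], [{spawn_opt, [{fullsweep_after, 0}]}]).

store(Pid, Value) -> gen_server:cast(Pid, {store, Value}).
fetch(Pid) -> gen_server:call(Pid, fetch).

init([]) -> {ok, undefined, ?IDLE_MS}.

handle_call(fetch, _From, State) -> {reply, State, State, ?IDLE_MS}.

handle_cast({store, Value}, _State) -> {noreply, Value, ?IDLE_MS}.

%% No message arrived within ?IDLE_MS: hibernate, which garbage collects
%% the process and shrinks its heap until the next message wakes it up.
handle_info(timeout, State) -> {noreply, State, hibernate};
handle_info(_Other, State) -> {noreply, State, ?IDLE_MS}.
```

Hibernating only after an idle timeout, rather than after every message, keeps the extra GC work away from processes that are busy.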

That said, I would wait for someone with more experience in this area to come along and advise you :slight_smile:


Thank you for the suggestions.

  1. bin_leak doesn’t indicate binaries are leaked really, I think this is a high amount of process memory.
  2. I did try the fullsweep_after for both 10 and 0, didn’t make much of a difference. However I did not try to hibernate the processes. Will try that next.
    a. The old heap does look quite big on some of the processes in the crashdump though.
  3. Have not tried a global gc, will give that a try.

A global GC didn’t have much effect.

Hibernation too didn’t do much.


Have you tried disabling super-carrier?
The way it works, it allocates (mmap) memory from the OS, but never gives it back. Hence you will never see VSZ below 80% of available memory, and if you ever exhaust the super carrier, RSS will also stay at that 80%.

First thing I’d look into is erlang:memory reports compared to the outside view (process RSS, with super-carrier turned off). I am not sure how recon_alloc calculates fragmentation, but you may want to learn erlang:statistics(allocators) to calculate actual fragmentation
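A rough sketch of that inside versus outside comparison, assuming a Unix-like host where `ps -o rss= -p <pid>` is available and reports RSS in kilobytes. The module name `mem_view` is hypothetical.

```erlang
-module(mem_view).
-export([compare/0]).

%% Put the VM's own accounting next to the resident set size the OS
%% reports for the emulator process.
compare() ->
    VmTotal = erlang:memory(total),
    %% os:getpid/0 returns the OS pid of the emulator as a string.
    Out = os:cmd("ps -o rss= -p " ++ os:getpid()),
    RssBytes = case string:to_integer(string:trim(Out)) of
                   {Kb, _Rest} when is_integer(Kb) -> Kb * 1024;
                   _ -> undefined   % ps missing or unexpected output
               end,
    #{vm_total_bytes => VmTotal, os_rss_bytes => RssBytes}.
```

A large and growing gap between os_rss_bytes and vm_total_bytes suggests memory held in carriers that the VM does not count as in use.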


Thank you for your insights.

Turning off the super carrier didn’t help with ever increasing fragmentation. Will try to repro a crash at full memory utilization. I could see that if all of the memory is at least reclaimed by the OS an OOM could be harder to trigger…

I will check the memory stats, however allocators doesn’t seem to be a valid arg to the statistics function. erlang:system_info({allocator, <ALLOCATOR>}). seems to return some promising information, so I’ll dive into that more. Good shout on going directly to the source.
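As a starting point, a sketch that pulls the options each allocator instance is actually running with, which is one way to check that flags such as +MBsbct or +MBas took effect. It assumes each instance's info list carries an `{options, [...]}` entry; that layout is what current OTP releases return but it is not a stable API. The module name `alloc_opts` is hypothetical.

```erlang
-module(alloc_opts).
-export([effective/1]).

%% For an allocator such as binary_alloc or eheap_alloc, return
%% [{InstanceNo, [{as, _}, {sbct, _}, {lmbcs, _}, {smbcs, _}]}].
effective(Alloc) ->
    case erlang:system_info({allocator, Alloc}) of
        Instances when is_list(Instances) ->
            [{N, pick(proplists:get_value(options, Info, []))}
             || {instance, N, Info} <- Instances];
        Other ->
            %% Per the docs: false when the allocator is disabled,
            %% undefined when the name is not a recognized allocator.
            Other
    end.

%% Keep only the options discussed in this thread; a missing key
%% shows up as undefined.
pick(Opts) ->
    [{K, proplists:get_value(K, Opts)} || K <- [as, sbct, lmbcs, smbcs]].
```

For example, rpc:call(N, alloc_opts, effective, [binary_alloc]). As far as I can tell the sizes are reported in bytes, while the command line flags take kilobytes.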


You may also want to look at the instrument module for some useful histograms.


Other question: Can you isolate the issue to a process or two? Also, did you have custom allocator settings before or did you fiddle with those after you noticed the issue?


It seems that removing the supercarrier did indeed prevent OOMing. It was initially added as it reduced fragmentation at a lower load level. It looks catastrophic at higher load levels.

In addition to removing the super carrier, the hibernations did seem to have a slightly positive effect. I have yet to test with the settings from above, but so far these seem like pretty good improvements.


It is quite a few processes causing the issue. The custom allocator settings were added after setting up the super carrier.


Hibernation (mostly the GC coming from it) is great for processes that aren’t active (those that are sleeping most of the time). Supercarrier helps to avoid minor page faults (during memory allocation). So I’d expect higher CPU usage with these mitigations.

Figuring out what causes fragmentation may not be easy. I’d still recommend looking into allocators statistics, calculating fragmentation per-carrier, and adjusting allocation strategy appropriately. E.g. you may be looking into +MHas aobf if your heap allocator tends to be fragmented. See Erlang -- erts_alloc
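To make "fragmentation" concrete, one common measure is how much of the carrier memory is occupied by live blocks. The sketch below only does that arithmetic. The module name `carrier_usage` is hypothetical, and the inputs (block bytes and carrier bytes per allocator or per carrier) are assumed to come from the mbcs/sbcs sections of erlang:system_info({allocator, A}), whose exact layout differs between OTP releases.

```erlang
-module(carrier_usage).
-export([usage/2, wasted/2]).

%% Fraction of carrier memory holding live blocks.
%% 1.0 means fully used; values near 0.0 mean mostly empty carriers.
usage(_BlockBytes, 0) ->
    1.0;   % no carriers allocated, so nothing is wasted
usage(BlockBytes, CarrierBytes) when CarrierBytes > 0 ->
    BlockBytes / CarrierBytes.

%% Bytes held in carriers but not occupied by any block.
wasted(BlockBytes, CarrierBytes) ->
    CarrierBytes - BlockBytes.
```

For instance, a 1024 byte carrier holding 256 bytes of blocks has a usage of 0.25 and 768 wasted bytes. Low usage spread across many carriers of one allocator type is the pattern where a different +M<S>as strategy is worth trying.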


Right, to be clear I didn’t mean literally a few processes as in 3 processes, but rather a few types of actors if you will, you could have one actor in your system but hundreds of thousands of instances. Thus, looking for that one bad apple isn’t possible, which seems to be what you’re saying.


I could have said that a little better with : It’s not one bad apple, it’s one bad type of apple and there’s hundreds of thousands of them :slight_smile:


I thought about making this adjustment, but I’m going to have to wait for memory fragmentation to rear its head again, if it does at all.

Even with the fullsweep_after and hibernation, while they work, it seems like we should be able to dig into the life cycle of the process and sort out what’s going on.

Though, that leads to a good question : Is it worth it? Do we maybe just say “Let’s not alter our code, let’s adjust the properties of the process and call it a day”.