VM settings to combat large single block carriers and fragmentation

I’m looking for allocator settings advice. What I am seeing is that the memory actually in use remains low, yet eventually the VM gets OOM-killed by the Linux kernel. Since usage is low (~10%), this leads me to believe that fragmentation is causing the crash. It is known that this application does a large number of small allocations. Under small sustained load the memory utilization does not drop; only when the system is quiet does memory get coalesced. Additionally, a super carrier is enabled at 80% of available memory and we allow mmap from the underlying OS.
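
For reference, that kind of super carrier setup is normally expressed with the +MMsc* flags (the size below is purely illustrative, not our actual value):

# super carrier size in MB -- illustrative value only
+MMscs 102400
# not "super carrier only": allocators may still mmap directly from the OS
+MMsco false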

Here are some metrics.

Using recon to look at the average block sizes, there are pretty large binaries allocated, as well as large heap blocks(?).

> rpc:call(N, recon_alloc, average_block_sizes, [current]).
[{binary_alloc,[{mbcs,7061.379407616361},{sbcs,8.53e5}]},
 {eheap_alloc,[{mbcs,11122.552204176334},{sbcs,166084924.0}]},

So there are quite large sbcs allocated. Several guides online suggest that one wants fewer sbcs and more allocation in mbcs, since mbcs coalesce better than sbcs. Here is the ratio at peak (running it multiple times looks similar):

> rp(rpc:call(N, recon_alloc, sbcs_to_mbcs, [current])).
[{{eheap_alloc,11},0.14285714285714285},
 {{eheap_alloc,28},0.07692307692307693},
 {{eheap_alloc,1},0.024193548387096774},
 {{eheap_alloc,0},0.022222222222222223},
 {{eheap_alloc,2},0.012195121951219513},
 {{eheap_alloc,3},0.01098901098901099},
 {{binary_alloc,3},0.006147540983606557},
 {{binary_alloc,1},0.001183431952662722},

I’m not sure if this is classified as ‘good’ or ‘bad’, but a ratio exists.

The instrument module shows the following:

> rpc:call(N, instrument, allocations, []).
{ok,{128,0,
     #{crypto =>
           #{binary => {0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
       prim_buffer =>
           #{binary => {0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,14,0,7,3,0,0,0,0,0,0,0,0,0,0,0,0,0},
             nif_internal => {0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
       prim_file =>
           #{binary => {0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,9,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0}},
       prim_socket =>
           #{binary => {0,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_mutex => {27,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,0,0,0,0,0,0,0,0,0,2,0,3,0,0,0,0,0},
             nif_internal => {8,1,27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
       system =>
           #{binary => {35,109,17,10,1,0,2526,1,0,0,0,0,1,0,0,0,0,0},
             driver => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_mutex => {3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_rwlock => {41,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tid => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_internal => {0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             microstate_accounting =>
                 {1,77,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             nif_internal => {0,19,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port => {0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port_data_lock => {2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},

Looks pretty benign to me, but I could be missing something.

Finally, this is what recon_alloc:fragmentation reports:

[{{binary_alloc,1},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.5149541837032711},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,5416560},
   {mbcs_carriers_size,10518528}]},
 {{ll_alloc,0},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.7967681884765625},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,18798120},
   {mbcs_carriers_size,23592960}]},
 {{eheap_alloc,2},
  [{sbcs_usage,0.9977274729793233},
   {mbcs_usage,0.6334435096153846},
   {sbcs_block_size,2174120},
   {sbcs_carriers_size,2179072},
   {mbcs_block_size,5396736},
   {mbcs_carriers_size,8519680}]},
 {{eheap_alloc,1},
  [{sbcs_usage,0.997382155987395},
   {mbcs_usage,0.2850613064236111},
   {sbcs_block_size,972296},
   {sbcs_carriers_size,974848},
   {mbcs_block_size,1008816},
   {mbcs_carriers_size,3538944}]},
 {{binary_alloc,28},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.2025390625},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,431392},
   {mbcs_carriers_size,2129920}]},
 {{binary_alloc,2},
  [{sbcs_usage,0.9965972222222222},
   {mbcs_usage,0.7005076911878882},
   {sbcs_block_size,918464},
   {sbcs_carriers_size,921600},
   {mbcs_block_size,3695632},
   {mbcs_carriers_size,5275648}]},
 {{binary_alloc,11},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.2882286658653846},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,613904},
   {mbcs_carriers_size,2129920}]},

Are there any other diagnostic functions I should be running?

With these metrics in mind, I’m experimenting with the following settings:

+MBsbct 2000
+MBlmbcs 100000
+MBsmbcs 1024
+MBas aobf

+MHsbct 2000
+MHlmbcs 100000
+MHsmbcs 1024
+MHas aobf
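
(Per the erts_alloc documentation, sbct is the single-block carrier threshold and lmbcs/smbcs are the largest/smallest multiblock carrier sizes, all in KB, while as aobf selects address-order best fit.) To double-check that the flags actually took effect on the running node, they can be read back at runtime; the exact shape of the returned term varies between OTP releases, so treat this as a sketch:

%% sbct, lmbcs, smbcs and as in each returned options list should reflect the flags above
> [proplists:get_value(options, Props) ||
      {instance, _, Props} <- rpc:call(N, erlang, system_info, [{allocator, eheap_alloc}])].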

Using these settings, some of the metrics above look better, but the general problem remains. One new observation is that the eheap_alloc carriers are now larger, each with a single large block allocated:

> rp(rpc:call(N, instrument, carriers, [])).

      {eheap_alloc,false,134217728,0,
                   [{eheap_alloc,1,85562816}],
                   {0,0,0,0,0,0,0,0,0,0,0,0,0,1}},

There are 9 eheap_alloc instances with this exact pattern.

Additionally, the sbc-to-mbc ratio looks better, except for one eheap_alloc instance:

> rp(rpc:call(N, recon_alloc, sbcs_to_mbcs, [current])).
[{{eheap_alloc,0},0.06521739130434782},

And here is what recon_alloc:fragmentation reports:

[{{binary_alloc,26},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.07153902666284404},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,1277584},
   {mbcs_carriers_size,17858560}]},
 {{binary_alloc,24},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.02615636041057505},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,439688},
   {mbcs_carriers_size,16809984}]},
 {{binary_alloc,25},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.08361704415137615},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,1493280},
   {mbcs_carriers_size,17858560}]},
 {{binary_alloc,27},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.09941137471330275},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,1775344},
   {mbcs_carriers_size,17858560}]},
 {{eheap_alloc,1},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.0963664683260659},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,1629392},
   {mbcs_carriers_size,16908288}]},

Are there any other suggested allocator settings to adjust?

2 Likes

I’m curious, have you tried doing a “global” GC? You can iterate over all processes and perform a garbage_collect/1 on each pid. You can also run recon:bin_leak(N), where N is some arbitrary number, to get the same effect. I wonder if you would see a healthy chunk of memory reclaimed, and perhaps happier carriers.
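
Something along these lines from the remote shell is what I mean (forcing a GC on every process isn’t free on a busy node, so mind when you run it):

> [erlang:garbage_collect(Pid) || Pid <- erlang:processes()].
> recon:bin_leak(10).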

I’m curious because we have a similar setup at work and had a similar problem. While we have not adjusted any memory settings, we did find an issue whereby the generational garbage collection would run too late (or never), and thus memory would never get freed. We solved this by narrowing it down to a few processes and using a combination of {fullsweep_after, 0} and setting the process to hibernate every so often. This actually straightened out what we thought were memory fragmentation issues that could only be sorted by adjusting allocator settings. Not to say adjusting allocators isn’t the right fix in the long term, or that examining the life cycle of the process and making adjustments there wouldn’t be better, but it did turn out to be a solid mitigation at the very least.
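
As a rough sketch of what that looked like for us (generic gen_server boilerplate, not our actual code; the hibernate_after value is arbitrary):

start_link() ->
    gen_server:start_link(?MODULE, [],
                          [{spawn_opt, [{fullsweep_after, 0}]},
                           {hibernate_after, 5000}]).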

That said, I would wait for someone with more experience in this area to come along and advise you :slight_smile:

2 Likes

Thank you for the suggestions.

  1. bin_leak doesn’t really indicate that binaries are leaked; I think this is mostly a high amount of process memory.
  2. I did try fullsweep_after with both 10 and 0; it didn’t make much of a difference. However, I did not try hibernating the processes. Will try that next.
    a. The old heap does look quite big on some of the processes in the crashdump though.
  3. Have not tried a global gc, will give that a try.
1 Like

A global GC didn’t have much effect.

Hibernation too didn’t do much.

1 Like

Have you tried disabling super-carrier?
The way it works, it allocates (mmap) memory from the OS but never gives it back. Hence you will never see VSZ below 80% of available memory, and if you ever exhaust the super carrier, RSS will also stay at that 80%.

The first thing I’d look into is comparing the erlang:memory reports to the outside view (process RSS, with the super carrier turned off). I am not sure how recon_alloc calculates fragmentation, but you may want to learn erlang:statistics(allocators) to calculate the actual fragmentation.
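
A quick way to make that comparison from the node itself (assuming a Linux host, since it reads /proc):

> erlang:memory(total).
> os:cmd("grep VmRSS /proc/" ++ os:getpid() ++ "/status").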

3 Likes

Thank you for your insights.

Turning off the super carrier didn’t help with the ever-increasing fragmentation. I will try to repro a crash at full memory utilization. I could see that if all of the memory is at least reclaimed by the OS, an OOM could be harder to trigger…

I will check the memory stats; however, allocators doesn’t seem to be a valid argument to the statistics function. erlang:system_info({allocator, <ALLOCATOR>}) does seem to return some promising information, so I’ll dive into that more. Good shout on going directly to the source.
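
For anyone following along, this is roughly how I’m poking at it; the keys inside the info lists differ a bit between OTP releases, so treat this as a sketch rather than anything authoritative:

> Instances = rpc:call(N, erlang, system_info, [{allocator, eheap_alloc}]).
> [{Inst, proplists:get_value(mbcs, Props)} || {instance, Inst, Props} <- Instances].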

2 Likes

You may also want to look at the instrument module for some useful histograms.

Edit:

Other question: Can you isolate the issue to a process or two? Also, did you have custom allocator settings before or did you fiddle with those after you noticed the issue?

2 Likes

It seems that removing the super carrier did indeed prevent OOMing. It was initially added because it reduced fragmentation at lower load levels; it looks catastrophic at higher load levels.

In addition to removing the super carrier, the hibernations did seem to have a slightly positive effect. I have yet to test with the settings from above, but so far these seem like pretty good improvements.

2 Likes

It is quite a few processes causing the issue. The custom allocator settings were added after setting up the super carrier.

2 Likes

Hibernation (mostly the GC that comes with it) is great for processes that aren’t active (those that are sleeping most of the time). The super carrier helps to avoid minor page faults (during memory allocation). So I’d expect higher CPU usage with these mitigations.

Figuring out what causes fragmentation may not be easy. I’d still recommend looking into the allocator statistics, calculating fragmentation per carrier, and adjusting the allocation strategy appropriately. E.g. you may be looking into +MHas aobf if your heap allocator tends to be fragmented. See the erts_alloc documentation.

3 Likes

Right, to be clear, I didn’t mean literally a few processes as in 3 processes, but rather a few types of actors, if you will; you could have one actor type in your system but hundreds of thousands of instances of it. Thus, looking for that one bad apple isn’t possible, which seems to be what you’re saying.

Edit:

I could have said that a little better with: it’s not one bad apple, it’s one bad type of apple, and there are hundreds of thousands of them :slight_smile:

2 Likes

I thought about making this adjustment, but I’m going to have to wait for memory fragmentation to rear its head again, if it does at all.

Even with fullsweep_after and hibernation, while they work, it seems like we should be able to dig into the life cycle of the process and sort out what’s going on.

Though, that leads to a good question: is it worth it? Do we maybe just say “let’s not alter our code, let’s adjust the properties of the process and call it a day”?

2 Likes

As a follow up to this issue:

The offending processes ran for quite some time and performed heavy computation for the entire duration of their lives. They would accumulate massive heaps, create many transient terms, etc. The fix being explored is to spread this work out over many smaller processes running concurrently. In our testing environment this seems to have a great effect: memory consumption was reduced by an order of magnitude. Fragmentation and small free blocks appear to be reduced too. The schedulers were relatively underutilized before this change, and now the work is spread out evenly across them.
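
To give an idea of the shape of the change, here is a heavily simplified sketch (not our actual code; lists:sum/1 stands in for the real per-chunk computation):

run(Items, ChunkSize) ->
    Parent = self(),
    Chunks = chunk(Items, ChunkSize),
    %% each short-lived worker builds its transient terms on its own heap,
    %% which is reclaimed wholesale when the worker exits
    [spawn_link(fun() -> Parent ! {chunk_done, lists:sum(Chunk)} end)
     || Chunk <- Chunks],
    lists:sum([receive {chunk_done, Partial} -> Partial end || _ <- Chunks]).

chunk([], _N) -> [];
chunk(List, N) when length(List) =< N -> [List];
chunk(List, N) ->
    {Head, Tail} = lists:split(N, List),
    [Head | chunk(Tail, N)].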

Results of the suggestions:

  1. Tuning garbage collection did not have the desired (or really any) effect. This includes forcing global garbage collection sometimes as well as forcing garbage collection on single processes at key points.
  2. Hibernating processes at key points had devastating effects on performance.
  3. Changing allocation strategies and allocator sizes had slight benefits but nothing we were willing to commit to long term.
  4. Turning off the super carrier did prevent OOMing; without the super carrier, ERTS appeared to not over-request memory from the kernel. I can’t tell if this is a ‘bug’ in this feature. This is fairly easy to reproduce.
1 Like

It’s quite surprising to hear that changing allocation strategy (to aobf for example) did not help with memory fragmentation.

Since splitting the work into many smaller processes appears to help, I can imagine two potential areas to look into for the original approach:

  1. Terms staying in the process rootset for longer than necessary. We had a number of issues where developers wanted to “keep an instance of some old term for future logging purposes” by storing a reference in the process dictionary or in more elaborate locations. Finding those might get tricky, but at least there is total_heap_size returned by process_info, together with process_info(Pid, binary) (now fixed in OTP 25 too); a small sketch for surveying heap sizes follows after the second example below.
    This may also be happening when functions are written in a way that unnecessarily keeps a larger GC rootset:
inefficient() ->
    Term1 = some_function(),
    Term2 = some_function(),
    ...
    Term3 = some_function(),
    ?LOG_INFO("Term1 was: ~p", [Term1]).

This code creates Term1 at the very beginning of the function, and it stays on the process heap for no good reason (and it often gets promoted to the old heap, staying there until the next major GC). The obvious improvement would be to either move the Term1 creation next to the logging statement, or move the logging statement next to the Term1 creation (so Term1 can be GC-ed in the next minor cycle).

  2. Unexpected large term promotion from young to old heap. It was a surprise for me to find out that any BIF call may trigger an implicit GC promoting terms to the old heap. See this example:
-module(test).

-export([start/0]).

start() ->
    timer:sleep(500),
    spawn(fun no_gc/0),
    timer:sleep(500),
    spawn(fun yes_gc/0).

no_gc() ->
    Bin = crypto:strong_rand_bytes(1024 * 1024),
    Appended = <<Bin/binary, ".">>,
    %% erlang:yield(),
    {total_heap_size, TotalHeap} = erlang:process_info(self(), total_heap_size),
    is_binary(Appended) andalso io:format("Without implicit GC: ~b~n", [TotalHeap]).

yes_gc() ->
    Bin = crypto:strong_rand_bytes(1024 * 1024),
    Appended = <<Bin/binary, ".">>,
    erlang:yield(), %% does not matter which BIF to call, all of them trigger GC
    {total_heap_size, TotalHeap} = erlang:process_info(self(), total_heap_size),
    is_binary(Appended) andalso io:format("With implicit GC: ~b~n", [TotalHeap]).

The yes_gc function reports a larger total heap size:

Without implicit GC: 233
With implicit GC: 284

It isn’t immediately obvious why it happens that way: adding a BIF call that does nothing to the process heap or stack is not expected to affect memory characteristics.
What actually happens is that the yield (or any other) BIF notices some virtual binary heap overhead and decides to run a minor GC, which results in promoting the Appended sub-binary to the old heap (because it is used in the code after process_info). It will stay there until the next major GC.
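
Coming back to the first point, a quick way to survey process heap sizes from the shell looks something like this (recon:proc_count/2 gives a similar ranking with less impact on a busy node):

> lists:sublist(
      lists:reverse(lists:keysort(2,
          [{P, S} || P <- erlang:processes(),
                     {total_heap_size, S} <- [erlang:process_info(P, total_heap_size)]])),
      10).
> recon:proc_count(memory, 10).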

Now to try one more thing, what if you set fullsweep_after to a very small value (say, 2) only for your long running processes? It can be done via process_flag(fullsweep_after, N) or by specifying spawn_opt arguments for that process.
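
Something like this at spawn time, for example:

%% long_running_worker/0 is a placeholder for the heavy process entry point
spawn_opt(fun() -> long_running_worker() end, [{fullsweep_after, 2}]).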

4 Likes

You are correct in your observation about long-held terms. We suspected that a reference to a large binary was held for the duration of the process. It was passed in the spawn function args and only referenced very early on; it was not referenced later in the function. We didn’t have a good read on whether this was actually part of the issue or not. There can be up to 5 of these large binaries being processed at a single time, so it lines up that we could release those references much faster.

Your observation about a BIF causing the heap promotion is an interesting one, thank you for sharing.

As part of the repro I added fullsweep_after to the spawn_opt args, with values of 0 and 5. I can’t recall exactly what the results were, but they were not spectacular. I’ll try to get an exact repro of this soon, but spreading the work between processes dwarfed any gains from the fullsweep_after experiments.

3 Likes

That sounds like GC was never an issue (e.g., GCs were already happening at a good interval). But I agree with Maxim: try fullsweep_after just in case, especially if you’re mostly working with binaries. You can go straight to 0 as a test; we do this in production for some processes, and IIRC the TCP dist module in erlang/otp also uses fullsweep_after 0. But I believe you’re not going to see any gains from this.

That makes sense if your processes are constantly busy, which also would explain why adjusting gc settings didn’t result in improvements.

Are you still facing internal fragmentation issues? You had pointed this out in your original post:

This actually looks quite good, FWIW. That single large free block is likely not a problem; did you take several samples? I’d also be interested to know whether your carrier count grows and grows and never comes back down, at least not considerably. What you do not want to see is lots of free blocks towards the left.

I’ve been experimenting quite a bit myself. We also do lots of small allocations; the payloads we normally work with are very small, but there are lots of them (1KB to 4KB). With that said, going back to the original post in this thread, those settings could hurt you. You may want to try tiny or small carriers for eheap and binary, if your case is like ours.

You may just want to step down the size of the carriers at a slow rate and see what you get each time. I’d try 2 or 4 MB lmbcs for eheap and binary first (see the example flags below). We have found that the default allocation strategies for binary and eheap perform the best, but everyone’s workload is different :slight_smile:
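
For concreteness, since lmbcs is given in KB, stepping eheap and binary down to 4 MB carriers would look like this (and 512 for the 512KB experiment mentioned below):

+MHlmbcs 4096
+MBlmbcs 4096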

I’m currently experimenting with 512KB lmbcs for eheap and binary and have seen improvements, but haven’t tried this in production yet. Experiment is the keyword here! We currently use default settings in production, sans the super carrier settings. While eheap and binary internal fragmentation looks better, I may still ditch these; I have to see what it looks like over a long period of time.

To be clear, my interest in this is purely mitigation to get better odds when things go south in prod vs fighting a nasty problem with ERTS.

Can you tell us how much memory your systems have and what overcommit mode you’re running in?

I’m terribly curious about this. As stated, we use the super carrier in prod with about 750GB of memory, of which we give 700GB (IIRC) to the super carrier; that leaves lots of wiggle room for NIFs and such. We only get OOMed when we actually exhaust memory in the system, and Linux is right to OOM us in those situations. We have used overcommit modes 0 and 1 without problems, though we were more likely to get OOMed with mode 0 than with mode 1. To note, we moved to the super carrier to solve OS fragmentation issues, not internal fragmentation problems. AFAIK I do not believe the super carrier would ever make internal fragmentation better or worse, though it might look a bit different!

Other question: What version of erlang/otp are you using?

2 Likes

It was passed in the spawn function args and only referenced very early on; it was not referenced later in the function

That’s what I often trip over as well. What may be happening is that a BIF-triggered GC moves that binary into the old heap soon after entering the spawned function, which makes the binary’s lifecycle unnecessarily long.
The usual GC heuristic does not work well for such a case, so I may end up putting an erlang:garbage_collect() call in as soon as I know that the binary is no longer referenced in that function.

1 Like

Fragmentation greatly decreased with this change. I attribute this to GCing the large binaries faster, and avoiding large heap accumulation.

You are right, I might have been grasping at straws trying to relate a single large carrier to fragmentation. I was trying to make some correlation to the many small free blocks.

This is interesting, thank you for your insight. We’ll look into these settings more. We only tuned these up to much larger values and not down.

As for memory, this env had 128GB RAM, with the SC set to 50% of available memory. I am not sure of the overcommit setting, so whatever the ‘default’ was. I wasn’t aware this was configurable actually, so maybe this is something else to explore. ‘Overcommit’ isn’t the same as +MMsco, is it?

This was on 24.3

1 Like

Awesome. This gives me some idea to try on my side :slight_smile:

Yup! I’ll update this thread on whether these settings pan out well enough for us to try them in prod, and if so, what that looks like. It all depends on your workload, but that said, large carriers can hurt badly if you’re facing fragmentation problems: the probability of exhausting memory is greater because you keep having to create new (large) carriers to satisfy allocation requests, and you keep doing that song and dance until you’ve run out of memory. They should also be harder to nuke from orbit (i.e., you’ve got many parts all sharing a few large carriers vs a few parts sharing lots of small carriers). The defaults are really good in this regard, but I think that for us smaller than the defaults is probably better; that remains to be seen.

This case study is worth the read: Troubleshooting down the logplex rabbithole

Right, this has to do with Linux (though from what I understand, most operating systems these days use overcommit) and not with the super carrier. Though, as documented and discussed in another thread, the Linux overcommit strategy greatly affects how the super carrier will behave. The default is 0, which is heuristic mode. If you’re not using the super carrier anymore, and don’t plan on it, then changing this is probably not worth it if everything is generally running fine for you now.
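
For reference, the policy lives in the vm.overcommit_memory sysctl (0 = heuristic, 1 = always overcommit, 2 = strict accounting), so checking or changing it is just:

cat /proc/sys/vm/overcommit_memory
sudo sysctl -w vm.overcommit_memory=1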

That said, here are docs on overcommit

1 Like