I’m looking for advice on allocator settings. What I am seeing is that reported memory usage remains low (~10%), yet eventually the VM gets OOM-killed by the Linux kernel. Since reported usage is low, this leads me to believe that fragmentation is causing the crash. It is known that this application does a large number of small allocations. Under small sustained load the memory utilization does not drop; memory is only coalesced when the system is quiet. Additionally, there is a super carrier enabled at 80% of available memory, and we allow mmap from the underlying OS.
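For reference, the super-carrier setup corresponds to flags along these lines (this is a hypothetical vm.args fragment; the scs value here is illustrative, not our exact number):

```
## Hypothetical vm.args fragment; the scs value is illustrative only.
## Super carrier size in MB (~80% of available RAM on our hosts):
+MMscs 20000
## Allow carriers to be created outside the super carrier via OS mmap:
+MMsco false
```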
Here are some metrics.
Using recon to look at the average block sizes, there are fairly large binaries allocated, as well as large heap blocks(?):
> rpc:call(N, recon_alloc, average_block_sizes, [current]).
[{binary_alloc,[{mbcs,7061.379407616361},{sbcs,8.53e5}]},
{eheap_alloc,[{mbcs,11122.552204176334},{sbcs,166084924.0}]},
So there are quite large sbcs allocated. Several guides online suggest that one wants fewer sbcs and more allocation in mbcs, since mbcs coalesce better than sbcs. Here is the ratio at peak (multiple runs look similar):
> rp(rpc:call(N, recon_alloc, sbcs_to_mbcs, [current])).
[{{eheap_alloc,11},0.14285714285714285},
{{eheap_alloc,28},0.07692307692307693},
{{eheap_alloc,1},0.024193548387096774},
{{eheap_alloc,0},0.022222222222222223},
{{eheap_alloc,2},0.012195121951219513},
{{eheap_alloc,3},0.01098901098901099},
{{binary_alloc,3},0.006147540983606557},
{{binary_alloc,1},0.001183431952662722},
I’m not sure whether these ratios should be considered ‘good’ or ‘bad’, but a ratio exists.
The instrument module shows the following:
> rpc:call(N, instrument, allocations, []).
{ok,{128,0,
#{crypto =>
#{binary => {0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
prim_buffer =>
#{binary => {0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
drv_binary => {0,14,0,7,3,0,0,0,0,0,0,0,0,0,0,0,0,0},
nif_internal => {0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
prim_file =>
#{binary => {0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
drv_binary => {0,9,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0}},
prim_socket =>
#{binary => {0,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
driver_mutex => {27,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
drv_binary => {0,0,0,0,0,0,0,0,0,0,2,0,3,0,0,0,0,0},
nif_internal => {8,1,27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
system =>
#{binary => {35,109,17,10,1,0,2526,1,0,0,0,0,1,0,0,0,0,0},
driver => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
driver_mutex => {3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
driver_rwlock => {41,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
driver_tid => {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
driver_tsd => {2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
drv_internal => {0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
microstate_accounting =>
{1,77,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
nif_internal => {0,19,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
port => {0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
port_data_lock => {2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
Looks pretty benign to me, but I could be missing something.
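As an aside on how I’m reading this output: if I understand the instrument docs correctly, the 128 in the result tuple is the histogram start, and each per-type tuple is an allocation-size histogram whose buckets double from there (so the 2526 system binaries fall into a single mid-size bucket). A quick sketch of the bucket lower bounds, assuming 18 buckets starting at 128 bytes:

```
> [128 bsl N || N <- lists:seq(0, 17)].
[128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,
 262144,524288,1048576,2097152,4194304,8388608,16777216]
```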
Finally, this is what recon_alloc:fragmentation reports:
[{{binary_alloc,1},
[{sbcs_usage,1.0},
{mbcs_usage,0.5149541837032711},
{sbcs_block_size,0},
{sbcs_carriers_size,0},
{mbcs_block_size,5416560},
{mbcs_carriers_size,10518528}]},
{{ll_alloc,0},
[{sbcs_usage,1.0},
{mbcs_usage,0.7967681884765625},
{sbcs_block_size,0},
{sbcs_carriers_size,0},
{mbcs_block_size,18798120},
{mbcs_carriers_size,23592960}]},
{{eheap_alloc,2},
[{sbcs_usage,0.9977274729793233},
{mbcs_usage,0.6334435096153846},
{sbcs_block_size,2174120},
{sbcs_carriers_size,2179072},
{mbcs_block_size,5396736},
{mbcs_carriers_size,8519680}]},
{{eheap_alloc,1},
[{sbcs_usage,0.997382155987395},
{mbcs_usage,0.2850613064236111},
{sbcs_block_size,972296},
{sbcs_carriers_size,974848},
{mbcs_block_size,1008816},
{mbcs_carriers_size,3538944}]},
{{binary_alloc,28},
[{sbcs_usage,1.0},
{mbcs_usage,0.2025390625},
{sbcs_block_size,0},
{sbcs_carriers_size,0},
{mbcs_block_size,431392},
{mbcs_carriers_size,2129920}]},
{{binary_alloc,2},
[{sbcs_usage,0.9965972222222222},
{mbcs_usage,0.7005076911878882},
{sbcs_block_size,918464},
{sbcs_carriers_size,921600},
{mbcs_block_size,3695632},
{mbcs_carriers_size,5275648}]},
{{binary_alloc,11},
[{sbcs_usage,1.0},
{mbcs_usage,0.2882286658653846},
{sbcs_block_size,0},
{sbcs_carriers_size,0},
{mbcs_block_size,613904},
{mbcs_carriers_size,2129920}]},
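As I understand recon_alloc, mbcs_usage is simply mbcs_block_size divided by mbcs_carriers_size, so it is a direct measure of how much of each multiblock carrier is live data. For example, for {binary_alloc,1} above:

```
> 5416560 / 10518528.
0.5149541837032711
```

So roughly half of that carrier is free space that has not been returned to the OS.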
Are there any other diagnostic functions worth running?
With these metrics in mind, I’m experimenting with the following settings: raising the single-block carrier threshold (sbct, in KB) so more blocks land in mbcs, raising the largest and smallest multiblock carrier sizes (lmbcs/smbcs, in KB), and switching the allocation strategy to address-order best fit:
+MBsbct 2000
+MBlmbcs 100000
+MBsmbcs 1024
+MBas aobf
+MHsbct 2000
+MHlmbcs 100000
+MHsmbcs 1024
+MHas aobf
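To confirm the flags actually took effect on the running node, I check the live allocator settings (the options list in the result should show sbct, lmbcs, smbcs, and as values matching the flags above):

```
> rpc:call(N, erlang, system_info, [{allocator, binary_alloc}]).
```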
Using these, some of the numbers above look better, but the general problem remains. One change is that eheap_alloc carriers are now larger, each with a single large block allocated:
> rp(rpc:call(N, instrument, carriers, [])).
{eheap_alloc,false,134217728,0,
[{eheap_alloc,1,85562816}],
{0,0,0,0,0,0,0,0,0,0,0,0,0,1}},
There are 9 eheap_alloc carriers with this exact pattern (if I’m reading the carrier tuple right, each is a 128 MB carrier holding a single ~82 MB block).
Additionally, the sbc-to-mbc ratio looks better, except for one eheap_alloc instance:
> rp(rpc:call(N, recon_alloc, sbcs_to_mbcs, [current])).
[{{eheap_alloc,0},0.06521739130434782},
And here is what recon_alloc:fragmentation reports:
[{{binary_alloc,26},
[{sbcs_usage,1.0},
{mbcs_usage,0.07153902666284404},
{sbcs_block_size,0},
{sbcs_carriers_size,0},
{mbcs_block_size,1277584},
{mbcs_carriers_size,17858560}]},
{{binary_alloc,24},
[{sbcs_usage,1.0},
{mbcs_usage,0.02615636041057505},
{sbcs_block_size,0},
{sbcs_carriers_size,0},
{mbcs_block_size,439688},
{mbcs_carriers_size,16809984}]},
{{binary_alloc,25},
[{sbcs_usage,1.0},
{mbcs_usage,0.08361704415137615},
{sbcs_block_size,0},
{sbcs_carriers_size,0},
{mbcs_block_size,1493280},
{mbcs_carriers_size,17858560}]},
{{binary_alloc,27},
[{sbcs_usage,1.0},
{mbcs_usage,0.09941137471330275},
{sbcs_block_size,0},
{sbcs_carriers_size,0},
{mbcs_block_size,1775344},
{mbcs_carriers_size,17858560}]},
{{eheap_alloc,1},
[{sbcs_usage,1.0},
{mbcs_usage,0.0963664683260659},
{sbcs_block_size,0},
{sbcs_carriers_size,0},
{mbcs_block_size,1629392},
{mbcs_carriers_size,16908288}]},
Are there any other suggested allocator settings to adjust?