Any ideas how to locate a memory leak?

benonymus · May 16, 2023, 2:22pm

Hey,

I have an Elixir application that shows the following symptoms:

The application itself reports around 300 MB of memory usage.
If I check in iex, Erlang reports the same:

:erlang.memory
[
  total: 275998648,
  processes: 81298904,
  processes_used: 81293272,
  system: 194699744,
  atom: 2687473,
  atom_used: 2663790,
  binary: 21397256,
  code: 93399782,
  ets: 21150928
]

The host OS, however, reports constantly growing memory usage due to the beam.smp process.
This continues until an OOM occurs and the application restarts.

Any idea how I could find what causes this?
I tried the recommended recon checks from Erlang in Anger to no avail.
https://s3.us-east-2.amazonaws.com/ferd.erlang-in-anger/text.v1.1.0.pdf#subsection.7.2.1

I appreciate any help!

mpope · May 16, 2023, 2:47pm

I’ve had success using recon to debug memory leaks. It is an amazing tool, and it has functions beyond bin_leak, too. fragmentation/1 and sbcs_to_mbcs/1 (both in recon_alloc) are useful for understanding memory usage and general fragmentation. get_state/2 works well if you think a certain process holds a lot of memory but you don’t know exactly what could be causing it. proc_count/2 is useful for seeing which processes rank highest on a given attribute at a specific time.

LeonardB · May 16, 2023, 3:27pm

Another app we’ve found very useful and integrate in our builds is observer_cli.

https://github.com/zhongwencool/observer_cli

It uses recon under the hood and gives us a more logical UI for inspecting processes and finding issues.

jhogberg · May 16, 2023, 3:55pm

Try running instrument:allocations() and see if there’s any allocation type that sticks out or increases a lot over time. By default it’ll only track binaries and NIF allocations made through our APIs, so if you can’t find it right away you’ll want to enable tracking on all our allocators by passing the +Muatags true emulator flag.
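As a rough sketch of how the flag could be passed without editing any start script: erl reads extra emulator flags from the ERL_FLAGS environment variable, so exporting it in the shell that starts the node should be enough. This assumes the node is started from that same shell; releases with their own vm.args may be easier to configure there instead.

```shell
# Enable allocation tagging on all allocators for any VM started from this shell.
# ERL_FLAGS is read by erl, so iex started afterwards picks it up as well.
export ERL_FLAGS="+Muatags true"

# Then, inside the running node (for example one started with `iex -S mix`), call:
#   :instrument.allocations()
```

Running the call a few times some minutes apart makes it easier to see which allocation type keeps growing.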

Chances are you won’t find much since the leak should have been seen with erlang:memory/0 if it were made through our allocators, and if you still can’t find it then the leak is most likely in a NIF that uses plain malloc(2) or similar instead of using our allocators. What NIFs/drivers are you using?

https://www.erlang.org/doc/man/instrument.html
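To put a number on memory that erlang:memory/0 cannot see, one option is to compare the resident size the OS reports for beam.smp with the total the VM reports. The following is a minimal sketch, assuming Linux with /proc available; the function names rss_bytes and unaccounted_bytes are made up for illustration.

```shell
# Print the resident set size (RSS) of a process in bytes.
# Assumes Linux: reads the VmRSS field (reported in kB) from /proc/<pid>/status.
rss_bytes() {
  kb=$(awk '/^VmRSS:/ { print $2 }' "/proc/$1/status")
  echo $(( kb * 1024 ))
}

# Print how many bytes of RSS the VM's own accounting does not explain.
# $1 = RSS in bytes, $2 = the `total` value from :erlang.memory
unaccounted_bytes() {
  echo $(( $1 - $2 ))
}
```

For example, `unaccounted_bytes "$(rss_bytes "$(pgrep -o beam.smp)")" 275998648` uses the total from the first post. Some gap is normal because of allocator overhead and fragmentation, but a gap that keeps growing while the VM total stays flat points at memory allocated outside the VM's allocators.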

benonymus · May 17, 2023, 12:51am

Hey all,

Thanks for all the help!

I tried various functions in recon but nothing revealed the source of the excess memory.
I also connected Observer to the deployment, but all the processes seem fine there too.

@jhogberg when I tried to run :instrument in iex, I got an error saying that the module is not available.

I found a clue that the culprit might be the AppSignal NIF.
When I observe the memory leak, we stop getting reports in AppSignal, and when we redeploy we see them again for a brief period. The graphs line up perfectly.
I am talking to their team.

Is there any way I can validate whether a NIF is causing the problem, and if so, which one?

Thank you

jhogberg · May 17, 2023, 9:18am

In OTP 25 and earlier you need to include the tools application for it to be available, which is often excluded from releases, so that’s probably why (it’s been moved to runtime_tools in OTP 26).

Looking at the source code for the appsignal NIF, I think it would be a good idea to temporarily include the tools application to check this.

The NIF seems to allocate a tiny resource (just a pointer) for each transaction which won’t make much of a dent in erlang:memory() unless tons of them have been allocated, but the pointed-to memory seems to be allocated using system allocators (malloc(2) et al) which will hide it from view: what looks like a megabyte or two in erlang:memory() might actually be hundreds.

If this is the culprit it’ll stick out like a sore thumb in the instrument:allocations() output.

Aside from using instrument you can try running your test cases under valgrind, though it’ll take some editing to make it work with Elixir.

Make sure valgrind is installed, and build OTP from source with the special valgrind emulator type:

$ export ERL_TOP=`pwd`
$ export MAKEFLAGS=-j
$ ./otp_build setup -a
$ (cd erts/emulator/; make valgrind)

# Save this for later
$ readlink -f bin/cerl

Then edit your bin/elixir script to use it, changing the following line…

set -- "$ERTS_BIN$ERL_EXEC" -noshell -elixir_root ... et cetera et cetera

To:

set -- "the path you got from readlink" -valgrind -noshell -elixir_root ... et cetera et cetera

If all goes well, you should see something like the following when starting iex:

$ iex
==607653== Memcheck, a memory error detector
==607653== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==607653== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==607653== Command: xyz
==607653== 
==607653== Warning: set address range perms: large range [0x521b000, 0x4521b000) (noaccess)
Erlang/OTP 27 [DEVELOPMENT] [erts-14.0] [source-5776496e72] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit] [valgrind-compiled]

Interactive Elixir (1.15.0-dev) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)>

It will run like a snail going uphill but if there’s a regular memory leak it’ll tell you all about it. Unfortunately it can’t tell you about things like a forever-expanding queue where all elements are reachable but never end up being freed, though.

benonymus · May 17, 2023, 10:04am

Thanks a lot John!

I passed your points along to AppSignal.
When I find time, I will give valgrind a go if needed.
For now we have disabled AppSignal and seem to be in the green, but time will tell.

I will update the post once this is sorted!

benonymus · June 5, 2023, 1:25am

It seems we found the culprit!
AppSignal creates a folder in the system tmp folder.
We were running a background worker that automatically cleaned that folder too aggressively, which resulted in this memory leak in AppSignal.

Thank you all for the help!

xand · July 19, 2023, 4:55pm

My go-to for slow leaks is https://github.com/xandkar/beam_stats, which collects detailed stats per process and per ETS table.

Per-process collection is not straightforward, since there are both short-lived and anonymous processes at play. My solution, which proved useful enough, was to aggregate per-process data by process ancestry (the code origin of the spawn call).

Watching these metrics in Grafana has been very effective at spotting leaks.

It currently lacks a Prometheus backend though, which is more popular these days.