Memory tracing prototype

Hello!

As part of some optimizations we are doing, I put together a quick prototype for profiling memory usage in an Erlang application. The prototype replaces the current call_time tracing (used by eprof) so that it records the amount of memory allocated by a process in each function instead of the amount of time spent there.

The idea, in the end, is of course not to replace the current call_time, but to add call_memory.
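
To make the intended usage concrete, here is a minimal sketch of how I imagine driving it, assuming a call_memory trace flag and a matching trace_info item analogous to the existing call_time ones; both names are assumptions, not a finished API. It mirrors how eprof drives call_time today, just with words instead of microseconds in the counters:

    %% A minimal sketch, assuming a call_memory trace flag and trace_info
    %% item analogous to the existing call_time ones (both names are
    %% assumptions, not a finished API).
    -module(call_memory_sketch).
    -export([profile/1]).

    %% Run the given zero-arity fun with allocation counting enabled for all
    %% functions, and return the counters recorded for erlang:setelement/3.
    profile(Fun) when is_function(Fun, 0) ->
        erlang:trace_pattern({'_', '_', '_'}, true, [call_memory]),
        erlang:trace(all, true, [call]),
        Result = Fun(),
        erlang:trace(all, false, [call]),
        Info = erlang:trace_info({erlang, setelement, 3}, call_memory),
        erlang:trace_pattern({'_', '_', '_'}, false, [call_memory]),
        {Result, Info}.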

The result when run on the TLS 1.3 payload benchmark of the ssl application looks something like this:

FUNCTION                                            CALLS        %      WORDS  [WORDS/CALL]
gen_statem:loop_actions_list/13                   4797736     1.09    9595472  [      2.00]
crypto:do_exor/2                                  2399796     1.09    9599184  [      4.00]
binary:copy/2                                     3599696     1.09    9599202  [      2.67]
gen_statem:loop_event/5                           4811598     1.10    9623196  [      2.00]
tls_record_1_3:encode_plain_text/3                1199900     1.37   11999000  [     10.00]
tls_sender:send_application_data/4                1199890     1.50   13198790  [     11.00]
tls_record_1_3:encode_plain_text/2                1199900     1.50   13198900  [     11.00]
tls_record_1_3:cipher_aead/6                      1199900     1.50   13198900  [     11.00]
gen_statem:loop_next_events/10                    4811588     1.64   14398776  [      2.99]
gen_statem:loop_state_callback_result/11          4811598     1.64   14434794  [      3.00]
gen_statem:loop_timeouts/12                       7209528     1.65   14470802  [      2.01]
tls_record_1_3:encode_iolist/3                    1199890     1.78   15598570  [     13.00]
tls_record:decode_tls_records/8                   2399792     1.91   16798558  [      7.00]
tls_gen_connection:next_event/4                   3611688     1.91   16798568  [      4.65]
tls_record:binary_from_front/5                    3599688     2.19   19198353  [      5.33]
ssl_gen_statem:get_data/3                         1213744     2.19   19198448  [     15.82]
tls_record_1_3:aead_ciphertext_split/2            1199896     2.32   20398610  [     17.00]
crypto:aead_cipher_nif/7                          2399796     2.60   22798046  [      9.50]
tls_record_1_3:nonce/2                            2399796     3.01   26397770  [     11.00]
erlang:split_binary/2                             2399792     3.01   26397872  [     11.00]
ssl_gen_statem:hibernate_after/3                  2411792     3.02   26529742  [     11.00]
gen_statem:loop_timeouts_cancel/13                4797736     3.82   33558534  [      6.99]
gen:do_call/4                                     2399800     5.61   49195845  [     20.50]
tls_record_1_3:encode_tls_cipher_text/2           1199900     5.61   49195900  [     41.00]
tls_record_1_3:decode_cipher_text/2               1199896     6.43   56395255  [     47.00]
gen_statem:loop_receive/3                         3611692     8.21   72072758  [     19.96]
erlang:setelement/3                               9637069    13.75  120672224  [     12.52]
----------------------------------------------  ---------  -------  ---------  [----------]
Total:                                          307692609  100.00%  877580394  [      2.85]

From this we can see that, of the functions listed, the one that allocates the most memory per call is tls_record_1_3:decode_cipher_text/2 (47 words per call), while the function that allocates the most memory overall is erlang:setelement/3.

Using eprof for memory pressure has the same problem as using it for time: you don’t see the call stack, so you don’t know which functions called the one being measured. It should be relatively simple to extend fprof to have access to the same information and thus get context about which functions called erlang:setelement/3.
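
For reference, this is roughly how fprof is driven today; a memory-aware variant would presumably keep the same flow and attach allocation figures to the caller/callee breakdown in the analysis (that extension is of course speculation at this point, and my_benchmark:run/0 is just a placeholder):

    %% Current fprof flow; the per-function analysis already lists callers
    %% and callees, which is where per-call allocation data could be added.
    %% my_benchmark:run/0 is a placeholder for whatever is being profiled.
    fprof:apply(my_benchmark, run, []),         % run and trace the code
    fprof:profile(),                            % digest the trace file
    fprof:analyse([{dest, "fprof.analysis"}]).  % write callers/callees per function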

My question is this: do you think this would be useful when profiling? Most of the time, the time spent in a function is roughly proportional to the amount of memory it allocates, so the normal eprof and fprof would give much the same results. So I’m a bit unsure how useful this would actually be in practice.

You can find the very incomplete and probably incorrect prototype code here: https://github.com/garazdawi/otp/tree/lukas/erts/call_gcwords_trace

10 Likes

Yes, this is incredibly useful!
So much so that we have already tried to get this done in several different ways:

  1. Use DTrace/SystemTap tracepoints (heap-grow events) - averaging many samples creates a reasonable ‘heatmap’ (similar to a perf flamegraph), because with the JIT we have the stacks where allocation is happening.
  2. Use a parse_transform to inject process_info(memory) calls (similar to cover), although it affects all processes that use the recompiled module.
  3. Monitor large_heap events - however, I didn’t find a good way to grab stacks (see the sketch after this list).
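
For illustration, a minimal sketch of option 3 using erlang:system_monitor/2; grabbing the stack after the fact with process_info(Pid, current_stacktrace) is racy, which is exactly the problem with this approach:

    %% Minimal sketch of option 3: report processes whose heap grows beyond
    %% Words, together with a best-effort stack taken after the GC (which is
    %% racy - hence "no good way to grab stacks").
    -module(large_heap_mon).
    -export([start/1]).

    start(Words) ->
        spawn(fun() ->
                      erlang:system_monitor(self(), [{large_heap, Words}]),
                      loop()
              end).

    loop() ->
        receive
            {monitor, Pid, large_heap, Info} ->
                Stack = case erlang:process_info(Pid, current_stacktrace) of
                            {current_stacktrace, S} -> S;
                            undefined -> process_exited
                        end,
                io:format("large_heap in ~p~n  ~p~n  ~p~n", [Pid, Info, Stack]),
                loop()
        end.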

We no longer use fprof for profiling, as it skews the results towards tiny functions that are executed very often. I think the cost of sending the trace messages plays too significant a role: a small function executed 500 times will show up high in fprof, while a long one executed 3 times will rank rather low.
So we ended up using erlperf in comparison mode: running the original and the allegedly improved code at the same time and measuring throughput.

For memory I believe an fprof-like approach would make a lot more sense, especially if just a single process could be profiled with something like memprof:apply(Mod, Fun, Args), with every allocation tracked and accounted for.
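
To make that concrete, here is a sketch of the interface I have in mind; memprof is a made-up module name and the call_memory flag is just borrowed from the prototype above, so all of this is hypothetical:

    %% Hypothetical memprof:apply/3. The module name is made up, and the
    %% call_memory flag is borrowed from the prototype discussed above.
    %% Only the spawned worker process is traced.
    -module(memprof).
    -compile({no_auto_import, [apply/3]}).
    -export([apply/3]).

    apply(Mod, Fun, Args) ->
        Parent = self(),
        Pid = spawn_link(fun() ->
                                 receive start -> ok end,
                                 Parent ! {done, erlang:apply(Mod, Fun, Args)}
                         end),
        erlang:trace(Pid, true, [call]),
        erlang:trace_pattern({'_', '_', '_'}, true, [call_memory]),
        Pid ! start,
        Result = receive {done, R} -> R end,
        %% Per-function allocation counters for this process would be read
        %% back here with erlang:trace_info(MFA, call_memory) before the
        %% pattern is removed.
        erlang:trace_pattern({'_', '_', '_'}, false, [call_memory]),
        Result.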

4 Likes

It can be extremely helpful!

3 Likes

My main concern is that making sure it is correct will be very hard, and that it could prevent future optimizations.

Maybe it is time to start to tackle some of the problems of fprof, although now that we have perf it is not as useful as it used to be.

I think that my final goal with the functionality is to be able to gather data in order to create simulations of memory allocation patterns to try to tackle GC of large heaps.

2 Likes

Since sending the trace message is, to a first-order approximation, a constant overhead: couldn’t fprof or some wrapper compensate for the call frequency?

If most of the time it is proportional: wouldn’t the time/memory quotient then be interesting? It could be calculated when both are shown. What would it tell us? Ideas:

  • it should show which functions are slow because of the memory pressure they produce (-> reduce memory pressure first when optimising)

  • functions that are slow even at low memory pressure need a different approach to optimisation

  • How would a plot of memory vs. time look? I would expect everything to align along a line (or some other curve?). What can different areas of the plot tell us here?

2 Likes