Memory tracing prototype

Hello!

As part of some optimizations we are doing, I put together a quick prototype that profiles memory usage in an Erlang application. The prototype replaces the current call_time tracing (used by eprof) and records the amount of memory allocated by a process in each function instead of the amount of time spent in it.

The idea, in the end, is of course not to replace the current call_time, but to add call_memory.
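To make the idea concrete, here is a sketch of how a call_memory flag could mirror today's call_time API. The flag and the shape of the trace_info result are my assumptions about how the prototype would be exposed, not a shipped feature:

```erlang
%% Assumed usage, by analogy with the existing call_time tracing:
erlang:trace_pattern({lists, seq, 2}, true, [call_memory]),
erlang:trace(self(), true, [call]),
lists:seq(1, 1000),
%% ...and later, reading back the accumulated counters per process:
erlang:trace_info({lists, seq, 2}, call_memory).
%% assumed result shape: {call_memory, [{Pid, Calls, WordsAllocated}]}
```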

The result when run on the ssl 1.3 payload benchmark is something like this:

FUNCTION                                            CALLS        %      WORDS  [WORDS/CALL]
gen_statem:loop_actions_list/13                   4797736     1.09    9595472  [      2.00]
crypto:do_exor/2                                  2399796     1.09    9599184  [      4.00]
binary:copy/2                                     3599696     1.09    9599202  [      2.67]
gen_statem:loop_event/5                           4811598     1.10    9623196  [      2.00]
tls_record_1_3:encode_plain_text/3                1199900     1.37   11999000  [     10.00]
tls_sender:send_application_data/4                1199890     1.50   13198790  [     11.00]
tls_record_1_3:encode_plain_text/2                1199900     1.50   13198900  [     11.00]
tls_record_1_3:cipher_aead/6                      1199900     1.50   13198900  [     11.00]
gen_statem:loop_next_events/10                    4811588     1.64   14398776  [      2.99]
gen_statem:loop_state_callback_result/11          4811598     1.64   14434794  [      3.00]
gen_statem:loop_timeouts/12                       7209528     1.65   14470802  [      2.01]
tls_record_1_3:encode_iolist/3                    1199890     1.78   15598570  [     13.00]
tls_record:decode_tls_records/8                   2399792     1.91   16798558  [      7.00]
tls_gen_connection:next_event/4                   3611688     1.91   16798568  [      4.65]
tls_record:binary_from_front/5                    3599688     2.19   19198353  [      5.33]
ssl_gen_statem:get_data/3                         1213744     2.19   19198448  [     15.82]
tls_record_1_3:aead_ciphertext_split/2            1199896     2.32   20398610  [     17.00]
crypto:aead_cipher_nif/7                          2399796     2.60   22798046  [      9.50]
tls_record_1_3:nonce/2                            2399796     3.01   26397770  [     11.00]
erlang:split_binary/2                             2399792     3.01   26397872  [     11.00]
ssl_gen_statem:hibernate_after/3                  2411792     3.02   26529742  [     11.00]
gen_statem:loop_timeouts_cancel/13                4797736     3.82   33558534  [      6.99]
gen:do_call/4                                     2399800     5.61   49195845  [     20.50]
tls_record_1_3:encode_tls_cipher_text/2           1199900     5.61   49195900  [     41.00]
tls_record_1_3:decode_cipher_text/2               1199896     6.43   56395255  [     47.00]
gen_statem:loop_receive/3                         3611692     8.21   72072758  [     19.96]
erlang:setelement/3                               9637069    13.75  120672224  [     12.52]
----------------------------------------------  ---------  -------  ---------  [----------]
Total:                                          307692609  100.00%  877580394  [      2.85]

From this we can see that the function that allocates the most memory per call is tls_record_1_3:encode_tls_cipher_text/2 (41 words per call), while the function that allocates the most memory overall is erlang:setelement/3.

Using eprof for memory pressure has the same problem as it does for time: you don’t see the call stack, so you don’t know which functions called the one being measured. It should be relatively simple to extend fprof to have access to the same information and thus get context about which functions called erlang:setelement/3.

My question is this: Do you think this would be useful when profiling? Most of the time, the time spent in a function is proportional to the amount of memory it allocates, so the normal eprof and fprof would give the same results. So I’m a bit unsure how useful this would actually be in practice.

You can find the very incomplete and probably incorrect prototype code here: GitHub - garazdawi/otp at lukas/erts/call_gcwords_trace

11 Likes

Yes, this is incredibly useful!
So much so that we have been trying to get this done in many different ways:

  1. Use DTrace/SystemTap tracepoints (heap-grow events): averaging many samples creates a reasonable ‘heatmap’ (similar to a perf flame graph), because with the JIT we have the stacks where allocation is happening.
  2. Use a parse_transform to inject process_info(memory) calls (similar to cover), although it affects all processes that use the recompiled module.
  3. Monitor large_heap events; however, I didn’t find a good way to grab stacks.
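Approach (2) can also be approximated by hand, without a parse_transform. A rough sketch (measure/1 is a hypothetical helper; a garbage collection during the call will skew the number):

```erlang
%% Hypothetical helper: estimate the memory cost of Fun in the calling
%% process by diffing process_info(memory) around the call. A GC
%% during Fun will skew the result, so this is only a rough estimate.
measure(Fun) ->
    erlang:garbage_collect(),
    {memory, Before} = erlang:process_info(self(), memory),
    Result = Fun(),
    {memory, After} = erlang:process_info(self(), memory),
    {Result, After - Before}.   %% delta in bytes, not words
```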

We no longer use fprof for profiling, as it skews results towards tiny functions: I think the cost of sending the trace message plays too significant a role, so a small function executed 500 times will show up high in fprof, while a long one executed 3 times will rank rather low.
So we ended up using erlperf in comparison mode, running the original and the allegedly improved code at the same time and measuring throughput.

For memory, I believe an fprof-like approach would make a lot more sense, especially if just a single process could be profiled with something like memprof:apply(Mod, Fun, Args), with every allocation tracked and accounted for.

4 Likes

It can be extremely helpful!

3 Likes

My main concern is that making sure that it is correct will be very hard and that it could prevent future optimizations.

Maybe it is time to start to tackle some of the problems of fprof, although now that we have perf it is not as useful as it used to be.

I think that my final goal with the functionality is to be able to gather data in order to create simulations of memory allocation patterns to try to tackle GC of large heaps.

2 Likes

Since sending the trace message is, to a first-order approximation, a constant overhead: couldn’t fprof or some wrapper compensate for the call frequency?
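If the per-call trace overhead really is roughly constant, the compensation could look something like this. Both compensate/2 and the row format are made up for illustration:

```erlang
%% Hypothetical post-processing step: subtract an assumed constant
%% per-call trace overhead (OverheadUs, in microseconds) from each
%% function's accumulated time before ranking the results.
compensate(Rows, OverheadUs) ->
    [{MFA, Calls, TimeUs - Calls * OverheadUs}
     || {MFA, Calls, TimeUs} <- Rows].
```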

If most of the time it is proportional: wouldn’t the time/memory quotient then be interesting? It could be calculated when both are shown. What would it tell us? Ideas:

  • It should show which functions are slower because of the memory pressure they produce (→ reduce memory pressure first to optimise).

  • Functions that are slow even at low memory pressure need a different approach to optimisation.

  • How would a plot of memory vs. time look? I would expect everything to align along a line (or some other curve?). What can different regions tell us here?

2 Likes

What’s the status of this?
I ran into a problem that could indeed be solved by having such a profiling tool. I am trying to emulate this functionality by tracing GC messages (spawning the function in a separate process), but it is a really crufty job.
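For reference, the workaround looks roughly like this (a sketch only; the gc_minor_start trace message and its heap_size field come from the existing garbage_collection tracing, and there is an obvious race between the final GC messages and the done notification):

```erlang
%% Rough sketch of the crufty workaround: run Fun in a separate
%% process with garbage_collection tracing enabled, and add up the
%% heap sizes reported at each minor GC. Only approximates allocation.
run_traced(Fun) ->
    Parent = self(),
    Pid = spawn(fun() ->
                    receive go -> ok end,
                    Fun(),
                    Parent ! done
                end),
    erlang:trace(Pid, true, [garbage_collection]),
    Pid ! go,
    sum_gc(Pid, 0).

sum_gc(Pid, Acc) ->
    receive
        {trace, Pid, gc_minor_start, Info} ->
            {heap_size, Size} = lists:keyfind(heap_size, 1, Info),
            sum_gc(Pid, Acc + Size);
        {trace, Pid, _, _} ->
            sum_gc(Pid, Acc);
        done ->
            Acc
    end.
```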

If there is no bandwidth for it whatsoever, I can probably pick it up and bring it to some PR-ready state…

1 Like

We had a nice debugging session last week and were able to find the source of our performance problems in a flame graph, which showed that 30% of our CPU was consumed by GC.

1 Like

We are planning to do it for OTP 26, but plans do change, so any help with the change would make it more likely to happen and would be very welcome.

3 Likes

Before firing up a PR, I want to have a quick rundown of available implementation options (and my considerations for these options).

  1. The call_time and call_memory flags can be mutually exclusive, or completely separate. The former is easier to implement: less copy-and-paste within beam_bp.c, no need for an extra beam_return_memory_trace emulator instruction, etc. But I just don’t like the piggy-backing here.
  2. As for an fprof-like API approach: it is not exactly related to call_memory, but could be done by introducing something similar to the timestamp option of trace/3. When that flag is specified, call tracing would send a message {trace_memory, [call|return_from|exception_from], {M, F, Arity}, WordsAllocated} to the tracer process. This is completely unrelated to the eprof approach and can be done separately (or not done at all to begin with, as it is a separate PR).
  3. I don’t exactly understand how to attribute spawn memory usage. There is sharing loss when I do L = lists:seq(1, 1000000), spawn(fun() -> do_smth_with(L) end). I don’t know how this should be reflected in the memory trace for either method.
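The sharing loss in (3) can be observed directly with erts_debug:size/1 (internal size, sharing preserved) versus erts_debug:flat_size/1 (size after copying, e.g. into a spawned fun's environment or a message):

```erlang
%% A term that shares a sublist internally...
X = lists:seq(1, 1000),
T = {X, X},
%% ...is flattened when copied across a process boundary:
erts_debug:size(T),       %% words with sharing preserved
erts_debug:flat_size(T).  %% words after the copy duplicates X
```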
1 Like
  1. My original plan was to do them completely separately, but if it is a lot easier we can first make them mutually exclusive and then change it later on if this turns out to be a valuable feature.
  2. The API is separate, but the machinery that counts the amount of memory used is the same. I agree that we can leave this to a separate PR.
  3. IMO it should work as if the spawn BIF received a message with the arguments. The sharing loss should IMO be attributed to the spawned process just as it would when you send a message and sharing is lost there.
2 Likes

Combining the tracers turned out to be even harder (because there are features like “pause” and “reset” for traces).

My naive implementation for separate traces - GitHub - max-au/otp at max-au/trace-memory - I would appreciate a quick look at this commit to see whether I’m moving in the right direction. I don’t really like copy-pasting almost entire functions (e.g. erts_trace_time_call → erts_trace_memory_call), but I feel so rusty with C (and I miss C++ templates)…

(in the meantime I’ll see how to add some tests)

2 Likes

I had a quick look and I think you are on the right track. Creating a copy of the trace functions is not ideal, but if we can first make it work, then we can worry about making the code more maintainable later. Adding tests sounds like a good next step, as it will be important to make sure that all the memory calculations are done correctly.

1 Like

I need an opinion on how to attribute message passing.

One solution (1): when a message has been added to a process heap, the allocation has already happened (and the process heap size changes correspondingly), so for heap-allocation correctness it would be fair to attribute it somehow; otherwise there is a discrepancy between the process heap size and the total allocated amount. But it happens at a somewhat random moment in time, while the process executes a somewhat random function, and it would be wrong to attribute the allocation to that random function.

The second (2) solution would be to do it the way @garazdawi’s implementation does: attribute the memory allocation to the function that runs the receive instruction. On one hand, this makes total sense (this is where the process actually starts working with that memory). On the other hand, it can happen that a process never extracts any messages from the queue (e.g. it runs a selective receive and does not match any message at all). The process heap size could then be very large, yet no allocations would have been recorded by memory tracing.

I think (2) makes more sense, but I also find the point about “invisible” allocations (messages that have never been matched) valid.

(Also, the minimal implementation has a bit of a bug when subtracting the on-heap message size from a Uint.)

1 Like

Or, option #3: maybe we should not count message data at all? We are using this information to try to pinpoint where our programs allocate data, and including messages may not be what we want. Do we want a separate counter for the total size of all received messages?
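If we go that way, a separate “received words” counter could even be prototyped today on top of the existing 'receive' trace flag. A sketch, counting from a helper process that acts as the tracer (the timeout is just a crude way to end the measurement):

```erlang
%% Sketch: account for incoming message sizes separately, using the
%% existing 'receive' trace instead of the call_memory machinery.
%% The calling process becomes the tracer for Pid.
count_received(Pid) ->
    erlang:trace(Pid, true, ['receive']),
    count_loop(Pid, 0).

count_loop(Pid, Words) ->
    receive
        {trace, Pid, 'receive', Msg} ->
            count_loop(Pid, Words + erts_debug:size(Msg))
    after 1000 ->
        Words   %% total words received until one second of silence
    end.
```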

1 Like

That option feels attractive. Given that there is a separate tracing option for receive, it makes sense to add some “words” option there rather than conflate heap allocation by the process with message allocation on the process heap.
Now the only point of disagreement could be adding a “GC” counter to the process structure; I remember @rickard explaining why it should be avoided when possible. Maybe store it in PSD (or find a way not to store it at all).

1 Like

My original plan was to store it in the PSD.

2 Likes

I think I’ve got it working with PSD as well. It surely needs some polish, test cases, actual testing and bug fixes, but I’d appreciate a quick review of [erts] Implement call_memory tracing · max-au/otp@100ae43 · GitHub

(A few things I am aware of: (a) erts_adjust_memory_break can be done more efficiently, avoiding calculations in the caller; (b) there is no need to allocate a Sint for call_time in PSD; (c) the 32-bit emulator may not compile/work, especially under ARM32; I simply don’t have a machine to try it on.)

1 Like