On OTP 23 our performance tests were passing with only a small percentage of slowdown: 7 cases out of 10K iterations ran slowly.
But after moving to OTP 26.0.2 our performance tests slowed down noticeably: 296 cases out of 10K iterations ran slowly.
I tried to find the reason, but I couldn't figure out what's wrong. Our code wasn't changed; only the OTP version was.
And we see the slowdown probabilistically.
We used eprof to collect data for our "slow" function, with this result (measured with the logger turned on):
As you can see, most of the functions have slowed down, not just a few.
Does anybody know what changed between OTP 24 and 26 that could make things work slower?
Maybe some changes in ERTS? Perhaps changes in the scheduler or in GC?
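For context, a minimal sketch of how such eprof data can be collected; the module and entry point here (`my_service:handle/1`) and the sample request are hypothetical placeholders for the real "slow" code path:

```erlang
%% Minimal eprof sketch. my_service:handle/1 and the sample request are
%% hypothetical placeholders; substitute the real slow code path.
profile() ->
    eprof:start(),
    eprof:profile(fun() ->
        my_service:handle(#{action => get, key => <<"some-key">>})
    end),
    eprof:analyze(total),   % prints per-function call counts and time
    eprof:stop().
```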
I would note that eprof may not be the best tool for micro-benchmarking or for figuring out the cause of a slowdown. That said, I would recommend Linux perf to profile the code running on OTP 26; it may give a clue where the bottleneck is.
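As a sketch, assuming a Linux host with perf installed and an OTP 24+ node started with the `+JPperf true` emulator flag (so JIT-compiled Erlang frames get readable symbols):

```shell
# Start the node with perf support so JIT frames are symbolized (OTP 24+):
#   erl +JPperf true ...

# Record call graphs from the running BEAM for ~10 seconds while the
# benchmark loops; beam.smp is the default emulator binary name.
perf record -g -p "$(pidof beam.smp)" -- sleep 10

# Inspect the hottest stacks.
perf report --no-children
```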
Yeah, you are right, and I have no idea where to look :(
I think that information might be terribly helpful.
Of course, sorry that I didn't say more. We use a VMware virtual machine with Debian 10 on x64.
How exactly are the performance tests run?
We send one RPC request to the service through RabbitMQ. The request is processed (doing some work) and the result is saved in a cache (Mnesia).
After that we record the time to handle the request, then send 10K RPC requests one by one (without sleep) and measure the time to handle each of them.
Sometimes requests served from the cache are slower than requests that bypass it. The cache mechanism itself runs in stable time,
but the code around it began to slow down.
Before calling the cache, the service parses the ETF request (a map()) into terms,
and after the cache call it serializes the term back to ETF (a map()).
It all happens in one Erlang process, which handles the RabbitMQ message.
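To see which of those three stages regressed, a minimal sketch that times each stage separately inside that one process; the handler shape and `do_work/1` are hypothetical:

```erlang
%% Hypothetical sketch: time ETF decode, the cache work, and ETF encode
%% separately inside the process that handles the RabbitMQ message.
handle_payload(Payload) when is_binary(Payload) ->
    {DecodeUs, Request} = timer:tc(erlang, binary_to_term, [Payload]),
    {WorkUs, Reply} = timer:tc(fun() -> do_work(Request) end),
    {EncodeUs, Out} = timer:tc(erlang, term_to_binary, [Reply]),
    logger:info("decode=~bus work=~bus encode=~bus",
                [DecodeUs, WorkUs, EncodeUs]),
    Out.
```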
I think that's still not enough information to assess the situation…
The two images appear identical. Anyway, how did you create these graphs? The way I usually do it, I run the benchmark in an endless loop and use `perf record -g -p $(pidof beam.frmptr.smp)` to capture only a small portion of the actual test run.
You're right that (small) PNG images aren't helpful for debugging purposes, so a better way to share things like *.svg files would be to upload them somewhere (a gist?) and provide a link.
I think the question for me still is: are you sure anything is actually going slower?
I had a recent case where with one benchmark tool, term_to_binary/1 came out faster than term_to_iovec/1, which is not the expectation. In another benchmark tool, term_to_iovec/1 came out faster, as expected.
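For illustration, a minimal sketch of such a micro-benchmark using timer:tc; the term and iteration count are arbitrary, and as noted, results can differ between harnesses:

```erlang
%% Compare term_to_binary/1 vs term_to_iovec/1 on the same term.
%% Numbers vary between runs and harnesses; repeat before trusting them.
bench() ->
    Term = [{N, lists:seq(1, 100)} || N <- lists:seq(1, 1000)],
    Iters = 100,
    {T2B, _} = timer:tc(fun() ->
        [erlang:term_to_binary(Term) || _ <- lists:seq(1, Iters)]
    end),
    {T2I, _} = timer:tc(fun() ->
        [erlang:term_to_iovec(Term) || _ <- lists:seq(1, Iters)]
    end),
    io:format("term_to_binary: ~b us, term_to_iovec: ~b us~n", [T2B, T2I]).
```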
Do you have any reason to believe, other than what eprof and such show, that anything is slower? Maybe it's worth checking request times, etc. What do you think?