I’m trying to figure out how best to configure BEAM for our servers. We have a variety of compute nodes at our disposal which basically fall into three categories:
20 cores
24 cores
27 cores
We’re now occasionally experiencing problems with the newer machines (20 and 24 cores).
A realtime application is pinned to 12 cores (logical cores 2-13) and completely hogs them. We limited the number of BEAM schedulers with +S 12, and that worked fine until now.
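For concreteness, the setup looks roughly like this (command lines are simplified placeholders, not our exact invocations, and my_app is just a stand-in name):

    # realtime application restricted to logical cores 2-13 via CPU affinity
    taskset -c 2-13 ./realtime_app

    # BEAM limited to 12 scheduler threads, nothing else configured
    erl +S 12 -noshell -s my_app start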
Now we occasionally see some BEAM processes starving for minutes on the newer HW, and while this may very well be entirely unrelated, I’m still wondering what a good scheduling setup would be. Could this condition even be caused by setting the scheduler thread count to 12 when only 8 cores are truly free? And if so, could they be encouraged to migrate faster? We’re talking about minutes of difference in total runtime during a period with no I/O (the config files have already been read) and no CPU-bound tasks either.
But it’s more general than that: if I have the “right amount” of schedulers online (even if the total scheduler count might still be 12 and hence more than the logical cores available), is it simply left to the underlying Linux OS to schedule them, and can I assume they are smartly kept on the non-pinned cores?
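For example, on the 20-core machines one option I’m considering is keeping 12 scheduler threads but only bringing 8 online, either at startup or at runtime. This is just a sketch of the knobs, not something we run yet:

    # 12 scheduler threads created, only 8 online at boot
    erl +S 12:8 ...

    % or adjusted at runtime from the Erlang shell
    erlang:system_flag(schedulers_online, 8).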
I know this touches on both how Linux does things versus how BEAM does things, but I’m trying to find a good solution for keeping this non-realtime app efficiently co-hosted with the realtime app.
I think it may help if you state how you went about pinning. There might be a step you missed; it’s possible you only gave affinity to the app running alongside your Erlang application, such that your “realtime” app has affinity for cores 2-13, yet those cores are not fully isolated, and other work can still happen on them.
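To illustrate the difference (commands purely as examples): affinity only restricts where the realtime app may run, it does not reserve those cores for it:

    # affinity only: the realtime app runs on cores 2-13, but the kernel may
    # still place other threads (including BEAM schedulers) on those cores
    taskset -c 2-13 ./realtime_app

    # isolation: keep the general scheduler away from cores 2-13 entirely
    # (kernel boot parameter, requires a reboot)
    isolcpus=2-13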
That said, it is up to the OS scheduler to ultimately decide where thread work is executed; the BEAM schedules Erlang processes onto its scheduler threads, whereas Linux schedules those threads onto cores. That might not be the most eloquent way to put it, but it should suffice.
Sounds good! Next question: have you set the bind type for Erlang? The default is unbound, which goes along with what I said in the post above, but to quote the docs:
Schedulers will not be bound to logical processors, i.e., the operating system decides where the scheduler threads execute, and when to migrate them. This is the default.
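If it helps, you can check what the node actually ends up with, for example:

    % on the running node
    erlang:system_info(scheduler_bind_type).
    %% -> unbound (the default)
    erlang:system_info(scheduler_bindings).
    %% -> a tuple with one entry per scheduler, 'unbound' entries when not bound

    # a bind type can be requested at startup, e.g. default_bind
    erl +sbt db ...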
I suppose, as you alluded to, there are quite a few possibilities, and yet I don’t think how the BEAM binds is going to be part of this problem (directly) if you are using the default scheduler bind type. Another question: when you say you see BEAM processes starved for minutes, what are you observing exactly, and how?
Also, I know there are some other kernel settings related to IRQs and I/O when it comes to isolating CPUs. You may want to check those as well.
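For instance, fully isolated realtime cores are often also excluded from IRQ handling and the kernel tick. The parameters below are only an example to compare against your boot command line (core list assumes a 20-core box with cores 2-13 reserved), not a recommendation:

    # example kernel boot parameters
    isolcpus=2-13 nohz_full=2-13 rcu_nocbs=2-13 irqaffinity=0,1,14-19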
Additionally, I agree it doesn’t make sense to specify more schedulers than you actually have hardware for.
Right now the scheduling is not configured beyond the +S flag, and the schedulers show as unbound.
Another question, when you say you see BEAM processes starved for minutes, what are you observing exactly and how?
We see two log lines that normally appear milliseconds apart suddenly showing up minutes apart, in roughly 10% of all runs.
I checked what happens in between, and it’s just starting a supervisor (this is during app startup). I looked at the respective start_link/init functions: they don’t wait for anything and don’t do any I/O. The filesystem I/O is already done at that point, because the previous log line confirms that the configuration files have been read and their contents stored in term storage, so I wouldn’t expect to wait minutes for some very cheap BEAM process inits…
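The code in question is essentially of this shape (heavily simplified placeholder, not our actual module), so there is nothing in it that should block:

    %% placeholder supervisor, names changed
    -module(my_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% no receive, no file or network I/O, just returns the child specs
        {ok, {#{strategy => one_for_one}, []}}.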
Ahh, but how are you logging and observing the logs? Are the timestamps from the app or from an external logging application (e.g., the journal, promtail, etc.)? I’m not suggesting the indicator is wrong here, but it may be worth removing more variables. Have you looked at what’s going on via top on the server? Or perhaps using perf?
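For example, something along these lines (adjust the pid lookup to your setup):

    # per-thread view of the BEAM node; scheduler threads show up individually
    top -H -p $(pgrep -o beam.smp)

    # record and inspect scheduling latency while a stall is happening
    perf sched record -- sleep 30
    perf sched latency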
The log is from the default logger, piped into a file. The timestamps come from within the BEAM app.
In parallel, somebody else is going to run strace to see whether there is a kernel issue. I’m just trying to find out whether this is a known problem when, for example, you specify +S 12 but in reality only 8 cores are truly available on some HW variations.
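Something like this is what we have in mind for the strace side (exact flags still to be decided):

    # attach to the node, follow all threads, timestamp every syscall
    strace -f -tt -T -o beam.strace -p $(pgrep -o beam.smp)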