I’m trying to figure out how to best configure BEAM for our server configurations. We have a variety of different compute nodes at our disposal which basically fall into three different categories:
20 cores
24 cores
27 cores
We’re now occasionally experiencing problems with the newer machines (the 20- and 24-core ones).
12 cores (logical cores 2-13) have a realtime application pinned to them which completely hogs them. We limited the number of BEAM schedulers with +S 12, and that worked fine until now.
Now we occasionally see some BEAM processes starving for minutes on the newer HW, and while this may well be entirely unrelated, I’m still wondering what a good scheduling setup would be. Could this condition even be caused by setting the scheduler count to 12 when only 8 cores are truly free? And if so, could the scheduler threads be encouraged to migrate faster? We’re talking about minutes of difference in total runtime, in a period with no I/O (the config files are already read) and no CPU-bound tasks either.
But it’s more general than that - if I have the “right amount” of schedulers online (even if “schedulers” itself might still be 12 and hence more than the logical cores available), is it simply left to the underlying Linux OS to schedule them, and can I assume they are smartly kept on the non-pinned cores?
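For reference, this is roughly how I’ve been checking what the VM thinks it is working with (plain introspection from a remote shell; the 8 below is just the count of non-pinned cores on the smallest machines):

```erlang
%% Plain introspection from an attached shell; nothing here is persistent.
erlang:system_info(logical_processors_available). % what the OS exposes to the VM
erlang:system_info(schedulers).                   % what +S configured at boot
erlang:system_info(schedulers_online).            % can be lowered at runtime:
erlang:system_flag(schedulers_online, 8).         % returns the previous value
```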
I know this touches on how Linux does things as much as on how BEAM does things, but I’m trying to find a good way to keep this non-realtime app efficiently co-hosted with the realtime app.
I think it may help if you state how you went about pinning? There might be a step you missed; it’s possible you only set affinity for the app running alongside your Erlang application, such that your “real time” app has affinity for cores 2-13, yet those cores are not fully isolated and other work can still happen on them.
That said, it is ultimately up to the OS scheduler to decide where thread work gets executed; the BEAM schedules Erlang processes onto its scheduler threads, whereas Linux schedules those threads onto cores. That might not be the most eloquent way to put it, but it should suffice.
Sounds good! Next question: have you set the bind type for Erlang? The default is unbound, which goes along with what I said in the post above; still, to quote the docs:
Schedulers will not be bound to logical processors, i.e., the operating system decides where the scheduler threads execute, and when to migrate them. This is the default.
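You can check what the running node actually ended up with straight from a shell, along these lines (pure introspection, nothing here changes behaviour):

```erlang
erlang:system_info(scheduler_bind_type). % 'unbound' unless you passed +sbt
erlang:system_info(scheduler_bindings).  % per-scheduler binding; 'unbound' entries when not bound
```

If you ever do want to experiment with binding, the +sbt flag (and +sct for a custom CPU topology) is where that lives.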
I suppose as you alluded to, there’s quite a few possibilities, and yet I don’t think how the BEAM binds is going to be part of this problem (directly) if you are using the default scheduler bind type. Another question, when you say you see BEAM processes starved for minutes, what are you observing exactly and how?
Also, I know there are some other kernel settings related to IRQs and I/O when it comes to isolating CPUs. You may want to check those as well.
Additionally, I agree it doesn’t make sense to specify more schedulers than you actually have hardware for.
Right now scheduling is not configured beyond +S, and the schedulers show as unbound.
Another question, when you say you see BEAM processes starved for minutes, what are you observing exactly and how?
We see two log lines that normally appear milliseconds apart “suddenly” showing up minutes apart - in roughly 10% of all runs.
I checked what happens in between and it’s just starting a supervisor (this is during app startup). I looked at the respective start_link/init functions: they are not waiting for anything and are not doing I/O. The filesystem I/O is already done at that point, because the previous log line confirms that the configuration files have been read and their content stored in term storage - so I wouldn’t expect to wait minutes for some very cheap BEAM process inits…
Ahh, but how are you logging and observing the logs? Are the timestamps from the app or from an external logging application (e.g., the journal, promtail, etc.)? Not suggesting the indicator is wrong here, but it may be worth removing more variables. Have you looked at what’s going on via top on the server? Or perhaps using perf?
The log is from the default logger, piped into a file. Timestamps come from within the BEAM app.
In parallel, somebody else is going to run strace to see whether we are looking at a kernel issue; I’m just trying to find out if this is a known problem when you say, for example, +S 12 and in reality only 8 cores are truly available on some HW variants.
So far the strace has been inconclusive (it’s the first time I’m personally looking at an strace, but another, more experienced colleague also found no “smoking guns”).
In the end, nothing much happens during the “hanging time” except these: pselect | futex | epoll_wait | timerfd_settime | sched_yield. My knowledge is limited, but I assume that means it’s idling?
We found another symptom, though: the kernel reports errors like this on the console:
INFO: task 10_dirty_io_sch:59494 blocked for more than XXX seconds.
These shoot up to more than 700 seconds, and they are only present if the hanging problem is also present.
When experimenting, I also reduced the number of requested schedulers to the minimum core count across all HW configurations (so from +S 12 to +S 8). It had no impact on the problem, though.
What’s more, check I/O stats at the OS level - the dirty I/O scheduler may be blocked, but why? What type of I/O is this? With that amount of time, a reasonable assumption is file I/O.
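One cheap data point during the stall would be scheduler utilisation from runtime_tools (assuming OTP 21+ here), which breaks things down per scheduler, including the dirty CPU and dirty I/O ones. Something like:

```erlang
%% Sample all schedulers, including dirty CPU and dirty IO ones, for ~5 s.
erlang:system_flag(scheduler_wall_time, true),
Sample = scheduler:sample_all(),
timer:sleep(5000),
scheduler:utilization(Sample).
```

That at least tells you whether the dirty I/O schedulers are occupied at all while the node appears to hang.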
With that amount of time, a reasonable assumption is file I/O.
A colleague looked at what we typically look at for any kind of Linux process or problem; I assume either he found no smoking guns or we missed something.
The last log line before the stall shows that we did in fact finish the I/O activities - we parsed a huge XML file, committed its data to term storage for easy access, and wrote some other XML files that are needed elsewhere. But after these activities no longer show in the strace, there seems to be a whole lot of nothing going on (unless, of course, I’m interpreting it wrong).
I have wondered if it could be related to garbage collection, since the problem shows up after we have done something involving files and I/O and then enter a phase of mere initialization of various gen_server and gen_statem processes.
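If it is GC (or something hogging a scheduler), I figure the system monitor should be able to catch it; a minimal sketch, with thresholds picked arbitrarily:

```erlang
%% Ask the runtime to message us about any GC pause or scheduling gap
%% longer than 500 ms (thresholds are arbitrary, for illustration only).
erlang:system_monitor(self(), [{long_gc, 500}, {long_schedule, 500}]),
receive
    {monitor, Pid, long_gc, Info}       -> {long_gc, Pid, Info};
    {monitor, Pid, long_schedule, Info} -> {long_schedule, Pid, Info}
after 60000 ->
    no_long_events_seen
end.
```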
Again, we might have missed something - absolutely possible - but I did a code review, and all the things my supervisor chain starts between the two log lines where the problem occurs (when it occurs; it does not happen every time, but it always hits the same part of bringup) are OTP processes that do minimal data structure initialization in their init/1 and defer anything else by sending themselves a trigger message…
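For clarity, the pattern those children follow is essentially this (simplified, module name made up):

```erlang
-module(example_worker).  % name made up for illustration
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% init/1 only builds a small state and posts itself a trigger;
%% anything expensive happens later in handle_info/2.
init([]) ->
    self() ! trigger,
    {ok, #{ready => false}}.

handle_info(trigger, State) ->
    %% deferred work goes here
    {noreply, State#{ready => true}};
handle_info(_Other, State) ->
    {noreply, State}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
```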
I do believe GC is going to take place on dirty CPU schedulers. There are some tools you should try in addition to what was recommended above, which is a great place to look: I would look at msacc, instrument, and maybe your most valuable tool in a situation like this is going to be perf.
perf is going to be particularly useful if this is related to an I/O bottleneck between ERTS and the OS. This of course requires that you’re using a version of Erlang/OTP with JIT support.
You’ve already broken out strace, so you might as well go for perf.
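If you go the msacc route, the minimal incantation is something like this (needs runtime_tools on the node; run it while you’re inside the stalled window):

```erlang
%% Collect microstate accounting for ~10 seconds, then print a per-thread
%% breakdown (schedulers, dirty schedulers, async and poll threads, ...).
msacc:start(10000),
msacc:print().
```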
Do note, I’m not discounting the possibility of GC stalls on a process; the size of what you’re parsing, coupled with the configuration and available resources of the underlying system, may well be an issue. Problems like this often tend to have multiple contributing factors, but you caught my eye with the I/O bits: extremely slow reads? Extremely slow writes? Or both?
We’ve tried perf and found we had no JIT support, but the perf output we did generate looked innocuous.
The colleagues going through all the available logs found hints that it might be a write to NFS that is blocking, and they are now looking into the possibility of the network driver causing issues.
I’m not sure if I overlooked that write in the strace or simply failed to recognize it as blocking.
What I definitely failed to see was a file copy operation in a start_link, which might explain why it is stalling. For now I have moved that copy to a separate process to unblock the system, but that might only delay the onset of the problem until the file is needed by other parts of the system that read from the same mount.
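Concretely, the change amounts to something like this (paths made up here); the copy now runs in a throwaway process instead of inside the start_link chain:

```erlang
%% Before (simplified): the copy ran synchronously inside start_link,
%% so a blocking NFS write stalled the whole supervision bringup.
%% {ok, _} = file:copy("/mnt/nfs/in.xml", "/mnt/nfs/out.xml"),

%% After: fire-and-forget in its own process so bringup can continue.
%% Anything that needs the file still has to cope with it not being there yet.
spawn(fun() ->
          case file:copy("/mnt/nfs/in.xml", "/mnt/nfs/out.xml") of
              {ok, _BytesCopied} -> ok;
              {error, Reason}    -> logger:error("file copy failed: ~p", [Reason])
          end
      end).
```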