For Riak development, I’ve recently been looking at erl emulator flags to see what impact certain settings may have on performance.
The issue covering this work is here.
As part of this I’ve been looking at different scheduler settings to see if we could improve throughput and performance within our volume tests. What has been discovered is that there appear to be three different ways we can find significant performance and throughput benefits:
- by reducing the default count of schedulers
- by binding schedulers to CPUs
- by disabling “busy wait”
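For illustration, the three approaches above roughly correspond to the following emulator flags (a sketch only; the exact values, and the `8:8` core count in particular, are assumptions for an 8-core host, and the flags should be checked against the erl(1) docs for your OTP version):

```shell
# Hypothetical flag combination illustrating the three approaches:
#   +S 8:8        - reduce the scheduler count (here: 8 schedulers, 8 online)
#   +sbt db       - bind schedulers to CPUs (default bind type)
#   +sbwt none / +sbwtdcpu none / +sbwtdio none
#                 - disable busy wait on normal and dirty schedulers
erl +S 8:8 +sbt db +sbwt none +sbwtdcpu none +sbwtdio none
```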
Details of the improvements can be seen in the GitHub issue. The working hypothesis at present is that all three changes are broadly having the same impact: with default scheduler settings, “competition” between schedulers leads to excessive context switches, as the default combination of standard and dirty-I/O schedulers means there are more schedulers than available CPU cores, especially if each of those schedulers busy-waits to avoid context switching.
For the next release of Riak the current intention is to recommend disabling “busy wait”. This change not only improved performance and throughput, but also improved efficiency: across the cluster running our primary volume test, 27 fewer CPU cores were in use with this change.
However, unlike the other changes, this flag comes with a specific warning that it may not be available in future releases: “This flag can be removed or changed at any time without prior notice.”
I was curious as to why this warning exists. Are there specific problems that have been seen with applications changing this setting? If there is a plan to remove the flag, is the intention to enforce the current default behaviour instead?
Also, a general query: have other applications, especially I/O-intensive ones, found benefit in tuning scheduler settings? Have there been side-effects from these changes?