Performance tuning with Linux scheduling policies
Scheduling is NP-hard. Even if we could predict the future and see how long every process is going to run, it would still be a challenge to achieve optimal scheduling. In a real-world system, we have nanoseconds to make scheduling decisions that affect thousands of threads handling wildly different tasks, ranging from a button press on a menu to copying files to a flash drive.
What we really want is for our systems to have low latency for real-time interactions while maintaining maximum throughput for CPU-heavy tasks. Let's see how well a Linux system handles a specific scheduling scenario and how the behavior can be tuned.
First, construct a sample set of three programs to schedule on a two-core system:
- Workers A and B: two tasks with real-time latency requirements, each pinned to a specific core (pinning is sketched below). This could be a task running a UI or a control loop.
- Worker C: one CPU-heavy "background" process.
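For reference, pinning a process to a core can be done with `sched_setaffinity`. This is a minimal sketch of that kind of setup, not the exact code used for A and B:

```c
#define _GNU_SOURCE
#include <sched.h>

// Pin the calling process to a single CPU core.
// Returns 0 on success, -1 on failure (check errno).
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0 /* 0 = calling process */, sizeof(set), &set);
}
```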
Ideally C is running whenever A or B isn't running on one of the cores. The timeline of system execution would look something like this:
I've set up these programs to run with a 100ms cycle time and captured their execution traces with FTrace. Plotting execution vs. time, we can quickly see how effectively we are using our CPUs. Each row represents the activity of a CPU: when the row shows a thick colored bar, a task is running on that core.
The above plot is in fact how A, B, and C run when they are trivially simple:
```c
while (true) {
    if (now() < end_of_cycle) {
        for (int x = 0; x < 4096; ++x) {} // Do some work!
    } else {
        sleep_until(next_cycle);
        ...
    }
}
```
This program isn't very useful. What happens when we add some I/O and system calls to the busy-work loop?
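For illustration, the modified loop might look something like the sketch below. The exact I/O and system calls are assumptions (the resulting trace is what matters, not these specific calls); `now()`, `end_of_cycle`, and `sleep_until()` are the same pseudocode helpers as above.

```c
char buf[64] = {0};
int fd = open("scratch.dat", O_WRONLY | O_CREAT, 0644); // needs <fcntl.h>
while (true) {
    if (now() < end_of_cycle) {
        for (int x = 0; x < 4096; ++x) {} // Do some work!
        write(fd, buf, sizeof(buf));      // ...now with some I/O...
        fsync(fd);                        // ...and a system call that blocks in the kernel
    } else {
        sleep_until(next_cycle);
    }
}
```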
This looks way messier! There are cases when A and B are interrupted, potentially resulting in a laggy interaction. Also, there are times when C isn't running but there is a free core.
Nice priorities with SCHED_NORMAL
Linux's `nice` priorities give us a way to treat A and B with higher priority. This ought to make it so our real-time A and B tasks get to run when they need to.
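A minimal sketch of raising a process's own priority with `setpriority(2)`; the value -10 is an arbitrary choice, not the exact value used here:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    // A lower nice value means higher priority. Negative values
    // normally require root or CAP_SYS_NICE.
    if (setpriority(PRIO_PROCESS, 0 /* 0 = calling process */, -10) != 0) {
        perror("setpriority");
        return 1;
    }
    // ... run the latency-sensitive work loop here ...
    return 0;
}
```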
| Task | Runtime | % of ideal | Priority |
|---|---|---|---|
| A | 4.78s | 95.6% | `SCHED_NORMAL`, higher `nice` priority |
| B | 4.64s | 92.8% | `SCHED_NORMAL`, higher `nice` priority |
| C | 8.60s | 86.0% | `SCHED_NORMAL`, default `nice` |
| Total | 18.02s | 90.1% | |
Not bad! If scheduling were perfect, we would see 5s of runtime from each of A and B, and C would fill the gaps for 10s of runtime (20s of ideal CPU time across both cores; 18.02s / 20s = 90.1%). We are not far off.
Although we didn't quantify latency, there are still clearly times when A and B get interrupted. For true low-latency performance, this isn't what we want. The scheduler doesn't know that A and B should be prioritized at all costs, only that we `nice`ly ask for A and B to have higher priority.
Linux provides another tool to tune scheduling behavior: scheduling policies. Each policy uses a different algorithm to decide how processes should run. Consider these two policies:
- `SCHED_OTHER`/`SCHED_NORMAL`: This is the default scheduling policy. Tasks at this priority level are scheduled using the standard scheduler, typically the completely fair scheduler (CFS). Normal tasks can be given relative prioritization by modifying their `nice` value. Ultimately this scheduling policy uses a series of heuristics to balance execution time amongst threads, such as "when a task has been running for a long time, stop executing it and allow another thread to run".
- `SCHED_FIFO`: This is a "real time" scheduling policy. Any task at this level has priority over any `SCHED_NORMAL` task regardless of `nice` priority. `SCHED_FIFO` tasks additionally have their own priority level relative to other `SCHED_FIFO` tasks (the ranges are sketched after this list). When a `SCHED_FIFO` task wants to run, it preempts any task with lower priority, while tasks with the same priority are run in first-come, first-served order. Unlike tasks scheduled by the completely fair scheduler, a `SCHED_FIFO` task can run forever: it keeps the CPU until it blocks, is preempted, sleeps, etc.
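To make those priority levels concrete, we can ask the kernel for each policy's priority range. On Linux, `SCHED_FIFO` is typically 1 through 99, while `SCHED_OTHER` reports 0 because it is tuned with `nice` values instead. A quick sketch:

```c
#include <sched.h>
#include <stdio.h>

int main(void) {
    // Print the static priority range supported by each policy.
    printf("SCHED_FIFO:  %d..%d\n",
           sched_get_priority_min(SCHED_FIFO),
           sched_get_priority_max(SCHED_FIFO));
    printf("SCHED_OTHER: %d..%d\n",
           sched_get_priority_min(SCHED_OTHER),
           sched_get_priority_max(SCHED_OTHER));
    return 0;
}
```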
Mixing SCHED_FIFO and SCHED_NORMAL tasks
Next we grant the `SCHED_FIFO` policy to A and B so that they are truly given a higher priority instead of just a `nice` request.
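A minimal sketch of granting the policy, assuming each worker switches itself over at startup (the priority value 50 is an arbitrary choice, not the exact setting used here):

```c
#include <sched.h>
#include <stdio.h>

int main(void) {
    // Switch the calling process to SCHED_FIFO at priority 50
    // (an arbitrary mid-range value). Requires root or CAP_SYS_NICE.
    struct sched_param param = { .sched_priority = 50 };
    if (sched_setscheduler(0 /* 0 = calling process */, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    // ... run A's or B's work loop here ...
    return 0;
}
```

The same change can be made from the shell with util-linux's `chrt`, e.g. `chrt -f 50 ./worker`.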
| Task | Runtime | % of ideal | Priority |
|---|---|---|---|
| A | 5.01s | 100.2% | `SCHED_FIFO` |
| B | 4.98s | 99.6% | `SCHED_FIFO` |
| C | 6.04s | 60.4% | `SCHED_NORMAL` |
| Total | 16.03s | 80.2% | |
Excellent -- A and B are now running as desired (with some error, likely due to not timing sleeps exactly correctly). However, C's performance leaves a lot to be desired. There are clearly times when C could have run on CPU 0, but it waits until the next free period on CPU 1.
This isn't necessarily bad behavior; in many cases it may be beneficial to keep a process running on the same core for the sake of re-using cached data on that core. Maybe CPU 0 is faster than CPU 1, so we would rather wait for CPU 0 to be free than switch a process over to CPU 1.
In this case, it seems like we could do a lot better. First, let's understand what happens to a process like C when it is descheduled -- why couldn't it just migrate to the other CPU?
What happens when a task is descheduled?
We can use an FTrace function graph to examine the difference in behavior when a `SCHED_NORMAL` task is descheduled versus a `SCHED_FIFO` task. Looking at the functions called, a key function to consider is `put_prev_task_<scheduler>`, which is called when a task is being context switched off of a CPU.
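As an aside, here is roughly how such a function graph can be captured through tracefs. This is a sketch; the mount point may be /sys/kernel/debug/tracing on older systems:

```c
#include <stdio.h>

// Write a value into a tracefs control file.
static void tracefs_write(const char *file, const char *value) {
    char path[256];
    snprintf(path, sizeof(path), "/sys/kernel/tracing/%s", file);
    FILE *f = fopen(path, "w");
    if (f) {
        fputs(value, f);
        fclose(f);
    }
}

int main(void) {
    tracefs_write("current_tracer", "function_graph"); // select the tracer
    tracefs_write("set_graph_function", "schedule");   // graph only schedule() and its children
    tracefs_write("tracing_on", "1");                  // start tracing
    // ... run the workload, then read /sys/kernel/tracing/trace ...
    return 0;
}
```

The (heavily simplified) graphs for the two policies follow.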
```
// SCHED_NORMAL switching
schedule() {
    pick_next_task_rt() {
        // A next task different than our current task was picked to run next,
        put_prev_task_fair() {
            // so we must deschedule the current task
            put_prev_entity() {
                update_curr();
                check_cfs_rq_runtime();
                __enqueue_entity(); // Put it on a list of tasks to consider scheduling later, that's all!
            }
        }
    }
    finish_task_switch();
}
// 1645 microseconds spent in the schedule function
```
The CFS scheduler, used for `SCHED_NORMAL` tasks, doesn't do much when a process is descheduled. Instead, the task is enqueued onto the CFS's tree of managed entities for later scheduling (for example, when a CPU becomes idle). Knowing this, it should not be a surprise that C does not migrate between CPU cores right away when descheduled.
```
// SCHED_FIFO switching
schedule() {
    pick_next_task_rt() {
        // A next task different than our current task was picked to run next,
        put_prev_task_rt() {
            // so we must deschedule the current task
            enqueue_pushable_task();
        }
        rt_queue_push_tasks() {
            // Here we keep a list of tasks that need to be rescheduled...
            queue_balance_callback(); // ... and queue up __balance_callback to run and do that work
        }
    }
    finish_task_switch();
    __balance_callback() {
        push_rt_tasks() {
            push_rt_task() {
                pick_next_pushable_task(); // For each task in the reschedule queue...
                find_lowest_rq() {         // Find the best new CPU for it to run on, such as one with a low
                    cpupri_find();         // priority task currently running
                }
                pick_next_pushable_task();
            }
        }
    }
}
// 48014 microseconds spent in the schedule function
```
Real-Time scheduler put_prev_task
In the `SCHED_FIFO` case (managed by the "rt" scheduler), the descheduled task is added to the pushable task list (`enqueue_pushable_task`) and a balance callback is queued. Later, in the same call to `schedule`, `__balance_callback` and `push_rt_tasks` are called, giving the scheduler a chance to move the task to another CPU and run it right away. This is a lot more work than in the `SCHED_NORMAL` case.
In the traces I recorded, context switching out a `SCHED_NORMAL` task was an order of magnitude faster! In this way the `SCHED_NORMAL` policy gains global efficiency by sacrificing the latency of a process.
SCHED_FIFO tasks with different priorities
It is clear that if we want immediate action when a process is descheduled, we need the `SCHED_FIFO` policy. Let's try again with C assigned `SCHED_FIFO` with a lower priority than A and B.
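Reusing the earlier sketch, C simply gets a smaller `sched_priority` than A and B. The specific values below are assumptions; any 1-99 values with A and B above C behave the same way:

```c
#include <sched.h>

// Same policy change as before, parameterized by priority so C can be
// given a lower value than A and B.
static int make_fifo(pid_t pid, int priority) {
    struct sched_param param = { .sched_priority = priority };
    return sched_setscheduler(pid, SCHED_FIFO, &param);
}

// e.g.: make_fifo(pid_a, 50); make_fifo(pid_b, 50); make_fifo(pid_c, 10);
```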
| Task | Runtime | % of ideal | Priority |
|---|---|---|---|
| A | 4.98s | 99.6% | `SCHED_FIFO`, higher priority |
| B | 5.01s | 100.2% | `SCHED_FIFO`, higher priority |
| C | 9.89s | 98.9% | `SCHED_FIFO`, lower priority |
| Total | 19.88s | 99.4% | |
Exactly as desired! If this system were tuned further, we would want to consider the overhead of constantly context switching and measure the wake-up latency of our real-time tasks, not just their total runtime. But if what we are optimizing for is squeezing every second of CPU time out of our system, this configuration is much closer to ideal than our starting state.
A surprise when using real time Linux
When running real-time applications, there is a good chance we will want to use real-time Linux with `PREEMPT_RT` enabled. What happens when we use these policies on an actual real-time Linux system?
| Task | Runtime | % of ideal | Priority |
|---|---|---|---|
| A | 5.02s | 100.4% | `SCHED_FIFO`, higher priority |
| B | 5.02s | 100.4% | `SCHED_FIFO`, higher priority |
| C | 7.06s | 70.6% | `SCHED_FIFO`, lower priority |
| Total | 17.10s | 85.5% | |
Surprisingly, the behavior got worse! For some reason C doesn't get pushed to the other CPU. Using the same FTrace function trace steps as before, the reason becomes clear: C was descheduled while its migration was disabled (during a kernel call)!
When using `PREEMPT_RT`, kernel calls can be preempted. This is a complicated feature to support: the risk of corrupting or deadlocking the entire kernel is massively increased. As a result, some preemptible locks in kernel calls also disable migration of tasks to ensure correctness is maintained (for example, during the manipulation of per-CPU data).
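Conceptually, the kernel-side pattern looks something like this sketch (a paraphrase of the mechanism described above, not verbatim kernel source):

```c
#include <linux/percpu.h>
#include <linux/preempt.h>

// Conceptual sketch only: on PREEMPT_RT, code touching per-CPU data may
// stay preemptible, but must not change CPUs mid-update.
static DEFINE_PER_CPU(int, my_counter);

static void touch_per_cpu_data(void)
{
    migrate_disable();        /* task can still be preempted, but not moved */
    this_cpu_inc(my_counter); /* per-CPU access stays coherent */
    migrate_enable();         /* task may now be pushed to another CPU */
}
```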
In other words, the real-time kernel sacrifices the latency of a lower-priority real-time task for the benefit of absolutely minimizing the latency of a higher-priority real-time task.
While not immediately intuitive, this is documented behavior: https://lwn.net/Articles/836503/
Summary
Out of the box, a Linux system uses the `SCHED_NORMAL` scheduling policy to achieve efficient and reasonable performance. We saw 90.1% utilization of our system when we scheduled three processes on two CPUs. Because of how the CFS scheduler handles "descheduled" processes, deferring their scheduling until later, `SCHED_NORMAL` simply isn't a viable option if low latency is the goal. Any system claiming to have minimized latency while still using `SCHED_NORMAL` is likely to be leaving latency on the table.
Tuning the system further using `SCHED_FIFO`, we can squeeze most of that remaining 10% out of the system. However, while our real-time tasks A and B got much better runtime as a result, the remaining task C had much worse runtime and latency. Ultimately, we had to schedule all tasks on the system with `SCHED_FIFO` to achieve good results (for this specific measure and use case). Unfortunately, this strategy doesn't work as well on Linux running the PREEMPT_RT patch set, due to CPU migration being disabled during some kernel calls.
Is the "90%" out of the box performance good enough? For casual uses, that extra 10% is not likely to be worth it. Squeezing out that extra runtime can be dangerous: even in the small example here, it was easy to overtune our configuration to optimize set of processes and have much worse runtime for another. For those doing any sort of data crunching, robotics, or other low latency and high performance tasks, careful tuning is can be worth the effort.