Performance tuning with Linux scheduling policies
Scheduling is NP-hard. Even if we could predict the future and see how long every process is going to run, it would still be a challenge to achieve optimal scheduling. In a real-world system, we have nanoseconds to make scheduling decisions that affect thousands of threads handling wildly different tasks, ranging from a button press on a menu to copying files to a flash drive.
What we really want is for our systems to have low latency for real-time interactions while maintaining maximum throughput for CPU-heavy tasks. Let's see how well a Linux system handles a specific scheduling scenario and how the behavior can be tuned.
First, construct a sample set of three programs to schedule on a two-core system:
- Workers A and B: two tasks with real-time latency requirements, each pinned to a specific core (pinning is sketched below). This could be a task running a UI or a control loop.
- Worker C: one CPU-heavy "background" process.
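For reference, pinning a process to a core can be done with `sched_setaffinity`. This is a minimal sketch of that kind of setup, not the exact code used for A and B:

```c
#define _GNU_SOURCE
#include <sched.h>

// Pin the calling process to a single CPU core.
// Returns 0 on success, -1 on failure (check errno).
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0 /* 0 = calling process */, sizeof(set), &set);
}
```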
Ideally C is running whenever A or B isn't running on one of the cores. The timeline of system execution would look something like this:
I've set up these programs to run with a 100ms cycle time and captured their execution traces with FTrace. Plotting execution vs. time, we can quickly see how effectively we are using our CPUs. Each row represents the activity of a CPU: when the row shows a thick colored bar, a task is running on that core.
The above plot is in fact how A, B, and C run when they are trivially simple:
```c
while (true) {
    if (now() < end_of_cycle) {
        for (int x = 0; x < 4096; ++x) {} // Do some work!
    } else {
        sleep_until(next_cycle);
        ...
    }
}
```
This program isn't very useful. What happens when we add some I/O and system calls to the busy-work loop?
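For illustration, the modified loop might look something like the sketch below. The exact I/O and system calls are assumptions (the resulting trace is what matters, not these specific calls); `now()`, `end_of_cycle`, and `sleep_until()` are the same pseudocode helpers as above.

```c
char buf[64] = {0};
int fd = open("scratch.dat", O_WRONLY | O_CREAT, 0644); // needs <fcntl.h>
while (true) {
    if (now() < end_of_cycle) {
        for (int x = 0; x < 4096; ++x) {} // Do some work!
        write(fd, buf, sizeof(buf));      // ...now with some I/O...
        fsync(fd);                        // ...and a system call that blocks in the kernel
    } else {
        sleep_until(next_cycle);
    }
}
```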
This looks way messier! There are cases when A and B are interrupted, potentially resulting in a laggy interaction. Also, there are times when C isn't running but there is a free core.
Nice priorities with SCHED_NORMAL
Linux's `nice` priorities give us a way to treat A and B with higher priority. This ought to make it so our real-time A and B tasks get to run when they need to.
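A minimal sketch of raising a process's own priority with `setpriority(2)`; the value -10 is an arbitrary choice, not the exact value used here:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    // A lower nice value means higher priority. Negative values
    // normally require root or CAP_SYS_NICE.
    if (setpriority(PRIO_PROCESS, 0 /* 0 = calling process */, -10) != 0) {
        perror("setpriority");
        return 1;
    }
    // ... run the latency-sensitive work loop here ...
    return 0;
}
```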
| Task | Runtime | % of ideal | Priority |
|---|---|---|---|
| A | 4.78s | 95.6% | `SCHED_NORMAL`, higher `nice` priority |
| B | 4.64s | 92.8% | `SCHED_NORMAL`, higher `nice` priority |
| C | 8.60s | 86.0% | `SCHED_NORMAL`, default `nice` |
| Total | 18.02s | 90.1% | |
Not bad! If scheduling were perfect, we would see 5s of runtime from each of A and B, and C would fill the gaps for 10s of runtime (20s of ideal CPU time across both cores; 18.02s / 20s = 90.1%). We are not far off.
Although we didn't quantify latency, there are still clearly times when A and B get interrupted. For true low-latency performance, this isn't what we want. The scheduler doesn't know that A and B should be prioritized at all costs, only that we `nice`ly ask for A and B to have higher priority.
Linux provides another tool to tune scheduling behavior: scheduling policies. Each policy uses a different algorithm to decide how processes should run. Consider these two policies:
- `SCHED_OTHER`/`SCHED_NORMAL`: This is the default scheduling policy. Tasks at this priority level are scheduled using the standard scheduler, typically the completely fair scheduler (CFS). Normal tasks can be given relative prioritization by modifying their `nice` value. Ultimately this scheduling policy uses a series of heuristics to balance execution time amongst threads, such as "when a task has been running for a long time, stop executing it and allow another thread to run".
- `SCHED_FIFO`: This is a "real time" scheduling policy. Any task at this level has priority over any `SCHED_NORMAL` task regardless of `nice` priority. `SCHED_FIFO` tasks additionally have their own priority level relative to other `SCHED_FIFO` tasks (the ranges are sketched after this list). When a `SCHED_FIFO` task wants to run, it preempts any task with lower priority, while tasks with the same priority are run in first-come, first-served order. Unlike tasks scheduled by the completely fair scheduler, a `SCHED_FIFO` task can run forever: it keeps the CPU until it blocks, is preempted, sleeps, etc.
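To make those priority levels concrete, we can ask the kernel for each policy's priority range. On Linux, `SCHED_FIFO` is typically 1 through 99, while `SCHED_OTHER` reports 0 because it is tuned with `nice` values instead. A quick sketch:

```c
#include <sched.h>
#include <stdio.h>

int main(void) {
    // Print the static priority range supported by each policy.
    printf("SCHED_FIFO:  %d..%d\n",
           sched_get_priority_min(SCHED_FIFO),
           sched_get_priority_max(SCHED_FIFO));
    printf("SCHED_OTHER: %d..%d\n",
           sched_get_priority_min(SCHED_OTHER),
           sched_get_priority_max(SCHED_OTHER));
    return 0;
}
```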
Mixing SCHED_FIFO and SCHED_NORMAL tasks
Next we grant the `SCHED_FIFO` policy to A and B so that they are truly given a higher priority instead of just a `nice` request.
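A minimal sketch of granting the policy, assuming each worker switches itself over at startup (the priority value 50 is an arbitrary choice, not the exact setting used here):

```c
#include <sched.h>
#include <stdio.h>

int main(void) {
    // Switch the calling process to SCHED_FIFO at priority 50
    // (an arbitrary mid-range value). Requires root or CAP_SYS_NICE.
    struct sched_param param = { .sched_priority = 50 };
    if (sched_setscheduler(0 /* 0 = calling process */, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    // ... run A's or B's work loop here ...
    return 0;
}
```

The same change can be made from the shell with util-linux's `chrt`, e.g. `chrt -f 50 ./worker`.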
| Task | Runtime | % of ideal | Priority |
|---|---|---|---|
| A | 5.01s | 100.2% | `SCHED_FIFO` |
| B | 4.98s | 99.6% | `SCHED_FIFO` |
| C | 6.04s | 60.4% | `SCHED_NORMAL` |
| Total | 16.03s | 80.2% | |
Excellent -- A and B are now running as desired (with some error, likely due to not timing sleeps exactly correctly). However, C's performance leaves a lot to be desired. There are clearly times when C could have run on CPU 0, but it waits until the next free period on CPU 1.
This isn't necessarily bad behavior; in many cases it may be beneficial to keep a process running on the same core for the sake of re-using cached data on that core. Maybe CPU 0 is faster than CPU 1, so we would rather wait for CPU 0 to be free than switch a process over to CPU 1.
In this case, it seems like we could do a lot better. First, let's understand what happens to a process like C when it is descheduled -- why couldn't it just migrate to the other CPU?
What happens when a task is descheduled?
We can use an FTrace function graph to examine the difference in behavior when a `SCHED_NORMAL` task is descheduled versus a `SCHED_FIFO` task. Looking at the functions called, a key function to consider is `put_prev_task_<scheduler>`, which is called when a task is being context switched off of a CPU.
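As an aside, here is roughly how such a function graph can be captured through tracefs. This is a sketch; the mount point may be /sys/kernel/debug/tracing on older systems:

```c
#include <stdio.h>

// Write a value into a tracefs control file.
static void tracefs_write(const char *file, const char *value) {
    char path[256];
    snprintf(path, sizeof(path), "/sys/kernel/tracing/%s", file);
    FILE *f = fopen(path, "w");
    if (f) {
        fputs(value, f);
        fclose(f);
    }
}

int main(void) {
    tracefs_write("current_tracer", "function_graph"); // select the tracer
    tracefs_write("set_graph_function", "schedule");   // graph only schedule() and its children
    tracefs_write("tracing_on", "1");                  // start tracing
    // ... run the workload, then read /sys/kernel/tracing/trace ...
    return 0;
}
```

The (heavily simplified) graphs for the two policies follow.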
```
// SCHED_NORMAL switching
schedule() {
    pick_next_task_rt() {
        // A next task different than our current task was picked to run next,
        put_prev_task_fair() {
            // so we must deschedule the current task
            put_prev_entity() {
                update_curr();
                check_cfs_rq_runtime();
                __enqueue_entity(); // Put it on a list of tasks to consider scheduling later, that's all!
            }
        }
    }
    finish_task_switch();
}
// 1645 microseconds spent in the schedule function
```
The CFS scheduler, used for `SCHED_NORMAL` tasks, doesn't do much when a process is descheduled. Instead, the task is enqueued onto the CFS's tree of managed entities for later scheduling (for example, when a CPU becomes idle). Knowing this, it should not be a surprise that C does not migrate between CPU cores right away when descheduled.
```
// SCHED_FIFO switching
schedule() {
    pick_next_task_rt() {
        // A next task different than our current task was picked to run next,
        put_prev_task_rt() {
            // so we must deschedule the current task
            enqueue_pushable_task();
        }
        rt_queue_push_tasks() {
            // Here we keep a list of tasks that need to be rescheduled...
            queue_balance_callback(); // ... and queue up __balance_callback to run and do that work
        }
    }
    finish_task_switch();
    __balance_callback() {
        push_rt_tasks() {
            push_rt_task() {
                pick_next_pushable_task(); // For each task in the reschedule queue...
                find_lowest_rq() {         // Find the best new CPU for it to run on, such as one with a low
                    cpupri_find();         // priority task currently running
                }
                pick_next_pushable_task();
            }
        }
    }
}
// 48014 microseconds spent in the schedule function
```
Real-Time scheduler put_prev_task
In the `SCHED_FIFO` case (managed by the "rt" scheduler), the descheduled task is added to the pushable task list (`enqueue_pushable_task`) and a balance callback is queued. Later, in the same call to `schedule`, `__balance_callback` and `push_rt_tasks` are called, giving the scheduler a chance to move the task to another CPU and run it right away. This is a lot more work than in the `SCHED_NORMAL` case.
In the traces I recorded, context switching out a `SCHED_NORMAL` task was an order of magnitude faster! In this way the `SCHED_NORMAL` policy gains global efficiency by sacrificing the latency of a process.
SCHED_FIFO tasks with different priorities
It is clear that if we want immediate action when a process is descheduled, we need the `SCHED_FIFO` policy. Let's try again with C assigned `SCHED_FIFO` with a lower priority than A and B.
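Reusing the earlier sketch, C simply gets a smaller `sched_priority` than A and B. The specific values below are assumptions; any 1-99 values with A and B above C behave the same way:

```c
#include <sched.h>

// Same policy change as before, parameterized by priority so C can be
// given a lower value than A and B.
static int make_fifo(pid_t pid, int priority) {
    struct sched_param param = { .sched_priority = priority };
    return sched_setscheduler(pid, SCHED_FIFO, &param);
}

// e.g.: make_fifo(pid_a, 50); make_fifo(pid_b, 50); make_fifo(pid_c, 10);
```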
| Task | Runtime | % of ideal | Priority |
|---|---|---|---|
| A | 4.98s | 99.6% | `SCHED_FIFO`, higher priority |
| B | 5.01s | 100.2% | `SCHED_FIFO`, higher priority |
| C | 9.89s | 98.9% | `SCHED_FIFO`, lower priority |
| Total | 19.88s | 99.4% | |
Exactly as desired! If this system were tuned further, we would want to consider the overhead of constantly context switching and measure the wake-up latency of our real-time tasks, not just their total runtime. But if what we are optimizing for is squeezing every second of CPU time out of our system, this configuration is much closer to ideal than our starting state.
A surprise when using real time Linux
When running real-time applications, there is a good chance we will want to use real-time Linux with `PREEMPT_RT` enabled. What happens when we use these policies on an actual real-time Linux system?
| Task | Runtime | % of ideal | Priority |
|---|---|---|---|
| A | 5.02s | 100.4% | `SCHED_FIFO`, higher priority |
| B | 5.02s | 100.4% | `SCHED_FIFO`, higher priority |
| C | 7.06s | 70.6% | `SCHED_FIFO`, lower priority |
| Total | 17.10s | 85.5% | |
Surprisingly, the behavior got worse! For some reason C doesn't get pushed to the other CPU. Using the same FTrace function trace steps as before, the reason becomes clear: C was descheduled while its migration was disabled (during a kernel call)!
When using `PREEMPT_RT`, kernel calls can be preempted. This is a complicated feature to support: the risk of corrupting or deadlocking the entire kernel is massively increased. As a result, some preemptible locks in kernel calls also disable migration of tasks to ensure correctness is maintained (for example, during the manipulation of per-CPU data).
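Conceptually, the kernel-side pattern looks something like this sketch (a paraphrase of the mechanism described above, not verbatim kernel source):

```c
#include <linux/percpu.h>
#include <linux/preempt.h>

// Conceptual sketch only: on PREEMPT_RT, code touching per-CPU data may
// stay preemptible, but must not change CPUs mid-update.
static DEFINE_PER_CPU(int, my_counter);

static void touch_per_cpu_data(void)
{
    migrate_disable();        /* task can still be preempted, but not moved */
    this_cpu_inc(my_counter); /* per-CPU access stays coherent */
    migrate_enable();         /* task may now be pushed to another CPU */
}
```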
In other words, the real-time kernel sacrifices the latency of a lower-priority real-time task for the benefit of absolutely minimizing the latency of a higher-priority real-time task.
While not immediately intuitive, this is documented behavior: https://lwn.net/Articles/836503/
Summary
Out of the box, a Linux system uses the `SCHED_NORMAL` scheduling policy to achieve efficient and reasonable performance. We saw 90.1% utilization of our system when we scheduled three processes on two CPUs. Because of how the CFS scheduler handles "descheduled" processes, deferring their scheduling until later, `SCHED_NORMAL` simply isn't a viable option if low latency is the goal. Any system claiming to have minimized latency while still using `SCHED_NORMAL` is likely to be leaving latency on the table.
Tuning the system further using `SCHED_FIFO`, we can squeeze most of that remaining 10% out of the system. However, while our real-time tasks A and B got much better runtime as a result, the remaining task C had much worse runtime and latency. Ultimately, we had to schedule all tasks on the system with `SCHED_FIFO` to achieve good results (for this specific measure and use case). Unfortunately, this strategy doesn't work as well on Linux running the PREEMPT_RT patch set, due to CPU migration being disabled during some kernel calls.
Is the "90%" out of the box performance good enough? For casual uses, that extra 10% is not likely to be worth it. Squeezing out that extra runtime can be dangerous: even in the small example here, it was easy to overtune our configuration to optimize set of processes and have much worse runtime for another. For those doing any sort of data crunching, robotics, or other low latency and high performance tasks, careful tuning is can be worth the effort.