Preface

This article explains the meaning of Linux's sysctl parameters for the process scheduler, along with the background knowledge needed to understand them. I don't intend to explain every parameter; I only cover the essential ones.

For simplicity, the description in this article ignores the following aspects of process scheduling.

nice value
real-time priority

This article is based on Linux kernel v5.0.

Scheduling Classes

There is a concept called scheduling classes in the Linux kernel. All processes running on Linux belong to one of the scheduling classes. Each scheduling class defines how the processes belonging to it are scheduled.

Processes belong to the fair scheduling class by default. In this article, I call these processes normal processes. On the other hand, processes called real-time processes (described later) belong to the realtime scheduling class.

The following sections describe the meaning of the sysctl parameters for these two scheduling classes, together with a brief explanation of each class.

The sysctl parameters about the `fair` scheduling class

Normal processes, which belong to the fair scheduling class, are scheduled by the Completely Fair Scheduler (CFS). What CFS means is explained in the next section.

`kernel.sched_latency_ns` parameter

If there are two or more runnable processes, CFS divides CPU time among them as fairly as possible. Here, fair means that each process gets an equal share of CPU time.

CFS has a concept called the latency target. CFS tries to give a timeslice to every runnable process once per latency target. The timeslice of each process is (latency target)/(the number of runnable processes). For example, if the latency target is 10ms and there are two runnable processes, each gets 5ms per 10ms. If there are four, each gets 2.5ms per 10ms.

kernel.sched_latency_ns defines the latency target of CFS in nanoseconds. If there are multiple CPUs in the system, the latency target becomes kernel.sched_latency_ns * (1+log2(the number of CPUs)).
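The scaling rule above can be written as a small helper. This is only a sketch that follows the formula in this article: the function name is mine, rounding log2 down to an integer is my assumption, and 6ms is just an example value.

```python
def effective_latency_ns(sched_latency_ns, nr_cpus):
    """Scale the latency target by 1 + log2(number of CPUs), as in the text."""
    if nr_cpus < 1:
        raise ValueError("need at least one CPU")
    # Integer log2, rounded down (an assumption for CPU counts
    # that are not a power of two).
    log2_cpus = nr_cpus.bit_length() - 1
    return sched_latency_ns * (1 + log2_cpus)


# Example with a hypothetical 6ms setting.
for cpus in (1, 4, 8):
    print(cpus, effective_latency_ns(6_000_000, cpus))
```

With one CPU the value is unchanged, and each doubling of the CPU count adds one more multiple of the base value.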

`kernel.sched_min_granularity_ns` parameter

What happens when there are many runnable processes? For example, if the latency target is 10ms and there are 100 runnable processes, does each process get a timeslice of just 100us? That would be too short, because the cost of context switching would become too high.

To prevent this problem, the timeslice is guaranteed to be equal to or longer than the value of the kernel.sched_min_granularity_ns parameter. The unit of this parameter is nanoseconds. Please note that when this guarantee applies, the latency target can no longer be met: it is effectively stretched to kernel.sched_min_granularity_ns * (the number of runnable processes).

As with the latency target, if there are multiple CPUs in the system, the guaranteed timeslice becomes kernel.sched_min_granularity_ns * (1+log2(the number of CPUs)).
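To make the interaction between the latency target and the minimum granularity concrete, here is a minimal sketch. It assumes all runnable processes have equal weight (nice values are ignored, as stated in the preface) and that both inputs are the already-scaled values. The function name and the 1ms minimum granularity are made up for illustration.

```python
def cfs_timeslice_ns(latency_ns, min_granularity_ns, nr_running):
    """Return (period_ns, timeslice_ns) for equally weighted processes."""
    if nr_running < 1:
        raise ValueError("need at least one runnable process")
    # Equal split of the latency target, but never below the minimum.
    timeslice_ns = max(latency_ns // nr_running, min_granularity_ns)
    # If the floor applies, one full round takes longer than the target.
    period_ns = max(latency_ns, timeslice_ns * nr_running)
    return period_ns, timeslice_ns


# 10ms latency target, hypothetical 1ms minimum granularity.
print(cfs_timeslice_ns(10_000_000, 1_000_000, 4))    # 2.5ms slices
print(cfs_timeslice_ns(10_000_000, 1_000_000, 100))  # floor applies
```

With 4 processes each one gets 2.5ms within the 10ms target. With 100 processes the timeslice stays at 1ms, so one full round takes 100ms instead of 10ms.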

`kernel.sched_wakeup_granularity_ns` parameter

Processes that have just woken up from a sleep state tend to go back to sleep after a short time. So, in many cases, it's efficient to give CPU time to a woken-up process as soon as possible.

A typical example is a terminal emulator, which interacts directly with users through keyboard input. When a user types something, the terminal emulator is woken up and echoes back the input. If the echo takes too long, the user experience suffers.

CFS has special logic to shorten the latency of such interactive processes. The details of this logic are somewhat hard to explain, so I will only say that if you decrease the kernel.sched_wakeup_granularity_ns parameter, a woken-up process becomes more likely to preempt the running process, which tends to improve the system's interactivity.

However, please note that there is a tradeoff between interactivity and throughput. If you set a value smaller than the default, the number of context switches increases and throughput gets worse.

The sysctl parameters about the `realtime` scheduling class

The realtime scheduling class is for processes that must run before any normal process, in other words, before any process belonging to the fair scheduling class.

As already described, processes that belong to the realtime scheduling class are called real-time processes. More precisely, real-time processes are processes with the SCHED_FIFO or SCHED_RR scheduling policy. We can set the scheduling policy of a process with the sched_setscheduler() system call.

Let's assume that a real-time process A becomes runnable on a CPU where process B, which belongs to the fair scheduling class, is running. Here A can preempt B at any time by definition. So, what if B is also a real-time process? It depends on the scheduling policy of B.

If B's scheduling policy is SCHED_FIFO, A can't preempt B and can run on this CPU only when B exits or goes to sleep. However, if B's scheduling policy is SCHED_RR, B has a predefined timeslice and A can take over the CPU after B exhausts its timeslice. If A also uses SCHED_RR, A and B then get CPU time in a round-robin manner.
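As an illustration, Python's standard os module wraps this system call on Linux as os.sched_setscheduler(), next to os.sched_getscheduler(). The helper names below are mine, and the priority value 1 is an arbitrary choice for the sketch, since this article does not cover real-time priority.

```python
import os

# Policies that make a process a real-time process.
REALTIME_POLICIES = (os.SCHED_FIFO, os.SCHED_RR)


def is_realtime(pid=0):
    """Return True if the process uses SCHED_FIFO or SCHED_RR (pid 0 = caller)."""
    return os.sched_getscheduler(pid) in REALTIME_POLICIES


def try_make_realtime(pid=0, policy=os.SCHED_RR, priority=1):
    """Try to switch a process to a real-time policy; return True on success."""
    try:
        os.sched_setscheduler(pid, policy, os.sched_param(priority))
        return True
    except PermissionError:
        # Unprivileged users are usually not allowed to do this.
        return False


print("real-time:", is_realtime())
```

Calling try_make_realtime() typically needs root privileges, so please try it only on a test machine.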
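The difference between the two policies can be seen in a toy simulation of one CPU. This is not how the kernel is implemented, only a sketch under simple assumptions: B is already running, A has just become runnable behind it, both use the same policy, and neither ever sleeps. The function name and the time values are made up.

```python
from collections import deque


def simulate(policy, work_ms, timeslice_ms=100):
    """Return the list of (name, ran_ms) chunks executed on one CPU.

    work_ms maps process names to the CPU time they need, in queue order.
    """
    queue = deque(work_ms.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        if policy == "SCHED_FIFO":
            ran = remaining  # runs until it finishes
        else:
            ran = min(remaining, timeslice_ms)  # SCHED_RR: one timeslice
        timeline.append((name, ran))
        if remaining > ran:
            # Timeslice used up: go to the tail of the queue.
            queue.append((name, remaining - ran))
    return timeline


print(simulate("SCHED_FIFO", {"B": 250, "A": 100}))
print(simulate("SCHED_RR", {"B": 250, "A": 100}))
```

Under SCHED_FIFO, A has to wait until B has finished all 250ms of its work. Under SCHED_RR, A gets the CPU as soon as B has used its first timeslice.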

`kernel.sched_rr_timeslice_ms` parameter

This parameter defines the timeslice of real-time processes that use the SCHED_RR scheduling policy. Its unit is milliseconds.

`kernel.sched_rt_period_us` parameter and `kernel.sched_rt_runtime_us` parameter

These parameters prevent out-of-control real-time processes from monopolizing the CPU.

If a real-time process keeps running for a long time without sleeping, normal processes can't get any CPU time during that period. This can cause serious problems, such as the whole system hanging. For example, consider a system with only one CPU on which a real-time process A is running. If A hangs, the system also hangs. In addition, we can't kill the problematic process, because it also prevents bash from being launched.

To prevent this kind of problem, the process scheduler has logic to limit the running time of real-time processes. In short, the total CPU time consumed by real-time processes can't exceed kernel.sched_rt_runtime_us per kernel.sched_rt_period_us. Both values are in microseconds.
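The limit is easier to reason about as a fraction of CPU time. The helper below is a sketch with a made-up name. It treats a negative runtime as "no limit", which is the kernel's convention for -1, and the numbers in the example are the usual default values.

```python
def rt_cpu_share(rt_runtime_us, rt_period_us):
    """Return the largest fraction of each period usable by real-time processes."""
    if rt_period_us <= 0:
        raise ValueError("period must be positive")
    if rt_runtime_us < 0:
        # -1 disables the limit: real-time processes may use all CPU time.
        return 1.0
    return min(rt_runtime_us / rt_period_us, 1.0)


share = rt_cpu_share(950_000, 1_000_000)
print(f"real-time: up to {share:.0%}, left for normal processes: {1 - share:.0%}")
```

With these values, real-time processes can use at most 0.95 seconds out of every second, so normal processes such as bash always have a chance to run.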

Conclusion

This article described some of the Linux scheduler's sysctl parameters and the basic knowledge needed to understand them. If you're interested in this topic, please modify these parameters and run your workload to verify whether the descriptions in this article are correct. For example, the following article may help you.

Visualize the Linux kernel's behavior: process scheduler
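Before changing a parameter, it is useful to record its current value. A sysctl name maps to a file under /proc/sys, with each dot replaced by a slash. The helper below is a small sketch with a made-up name. It assumes a kernel such as v5.0 that exposes these scheduler parameters under /proc/sys/kernel, and the root argument exists only so that the function can be tried against another directory.

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Read an integer sysctl such as 'kernel.sched_latency_ns'."""
    # kernel.sched_latency_ns -> <root>/kernel/sched_latency_ns
    path = Path(root).joinpath(*name.split("."))
    return int(path.read_text().strip())


if __name__ == "__main__":
    try:
        print(read_sysctl("kernel.sched_latency_ns"))
    except OSError as err:
        # The file may be missing on kernels that don't expose it here.
        print("could not read the parameter:", err)
```

The sysctl command does the same job from the shell: `sysctl kernel.sched_latency_ns` prints the value, and `sysctl -w` as root changes it.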

The Linux's sysctl parameters about process scheduler

Satoru Takeuchi
