Under construction

Important notice: the sched_mc power saving mode was removed from the scheduler in v3.5 in order to make room for a new power-aware scheduler.


The sched_mc function adds power saving awareness to the Linux scheduler, which is tuned for performance by default. When sched_mc is enabled, the scheduler tries to gather the running processes on a minimal number of cpus and clusters. The placement of a process relies on the cpu topology, which describes the affinity between cpus (see the ARM topology section). The level of power saving can be set with the sysfs entry /sys/devices/system/cpu/sched_mc_power_savings. There are 3 power saving levels:

  • 0 : No power saving load balance.
  • 1 : Fill one thread/core/package first for long running threads.
  • 2 : Also bias task wakeups to semi-idle cpu package for power savings.
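As a minimal sketch, a user-space tool could select one of the levels above by writing it to the sysfs entry. This assumes a kernel older than v3.5, where the entry still exists, and sufficient privileges. The helper name set_sched_mc_level and its path parameter are hypothetical; they are not part of any kernel or library API.

```c
#include <stdio.h>

/* Sysfs entry named in the text (only present on kernels before v3.5). */
#define SCHED_MC_SYSFS "/sys/devices/system/cpu/sched_mc_power_savings"

/*
 * Hypothetical helper: write a power saving level (0, 1 or 2) to `path`.
 * Returns 0 on success, -1 if the level is out of range or the write fails.
 */
int set_sched_mc_level(const char *path, int level)
{
    FILE *f;
    int ok;

    if (level < 0 || level > 2)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fprintf(f, "%d\n", level) > 0;
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A caller would typically use set_sched_mc_level(SCHED_MC_SYSFS, 2) and check the return value, since the entry is absent on recent kernels.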

An overview of the scheduler and the power saving policy is available in the sched_mc section

Terminology (FIXME)

  • Scheduler flags
    • SD_LOAD_BALANCE: Triggers load balancing between scheduling domains

    • SD_POWERSAVINGS_BALANCE: Triggers balancing to save power. By default this flag is set at the MC domain level.

    • SD_SHARE_PKG_RESOURCES: Used to flag CPUs that share resources such as caches. Set at the MC domain level.

ARM topology definition

The MPIDR register provides the thread/core/package information for the ARM topology description. This register can be read with the instruction MRC p15, 0, <Rd>, c0, c0, 5.

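According to the Cortex-A9 technical reference manual (revision r2p2), the MPIDR bit assignment is:

  • [31]: New multiprocessor format. It's always 1.
  • [30]: U bit (Multiprocessing Extensions). 0 means that the processor is part of a multiprocessor system, 1 means that it is a uniprocessor.
  • [11:8]: Cluster ID
  • [1:0]: CPU ID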

MPIDR register values on various SoCs

  • 0x80000300 and 0x80000301 (ARM core: dual core Cortex-A9)
  • 0x80000900 and 0x80000901 (ARM core: dual core Cortex-A9)
  • 0x80000000 and 0x80000001 (ARM core: dual core Cortex-A9)
  • 0x80000000 and 0x80000001 (ARM core: dual core Cortex-A9)

In addition, the ARM architecture reference manual describes bit 24 of MPIDR as the MT bit. This bit indicates a performance dependency at the lowest level which matches with the SMT feature of the scheduler.
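The values listed above can be decoded with a few shifts and masks. The following is a minimal sketch that assumes the Cortex-A9 layout (CPU ID in bits [1:0], cluster ID in bits [11:8], U bit in bit 30, bit 31 always set) plus the architectural MT bit 24. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the MPIDR fields discussed in the text. */
struct mpidr_fields {
    bool mp_format;        /* bit 31: new multiprocessor format */
    bool uniprocessor;     /* bit 30: U bit */
    bool multithreaded;    /* bit 24: MT bit (architectural, unused on Cortex-A9) */
    unsigned cluster_id;   /* bits [11:8] on Cortex-A9 */
    unsigned cpu_id;       /* bits [1:0] on Cortex-A9 */
};

/* Decode a raw MPIDR value (hypothetical helper, Cortex-A9 layout assumed). */
struct mpidr_fields mpidr_decode(uint32_t mpidr)
{
    struct mpidr_fields f;

    f.mp_format = (mpidr >> 31) & 1u;
    f.uniprocessor = (mpidr >> 30) & 1u;
    f.multithreaded = (mpidr >> 24) & 1u;
    f.cluster_id = (mpidr >> 8) & 0xfu;
    f.cpu_id = mpidr & 0x3u;
    return f;
}
```

With this layout, 0x80000901 decodes to cluster ID 9 and CPU ID 1 in a multiprocessor system without multi-threading.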

Topology Configuration

We should map the MPIDR information onto the kernel topology. Because the MRC instruction only returns the value of the core on which it runs, it has to be executed on each core in order to get the topology information. smp_store_cpu_info can be used for this purpose because it's called on each core before sched_init_smp.

We have to save :

  • thread capabilities, id and sibling if any
  • core id and core map
  • package id

We can already detect and set both the core and the thread topology, even though the thread topology will not be used by sched_mc but only by sched_smt, once multi-threading capable processors become available.

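The proposal for the ARM cpu topology, when MPIDR[30] is clear, is:

  • [24] is set:
    • MPIDR[23:16] (affinity level 2): socket ID
    • MPIDR[15:8] (affinity level 1): core ID
    • MPIDR[7:0] (affinity level 0): thread ID
  • [24] is clear:
    • MPIDR[15:8] (affinity level 1): socket ID
    • MPIDR[7:0] (affinity level 0): core ID
    • thread ID unset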

Once we have the different IDs, we can set the thread and core maps that will be used to build the sched_domain.
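The proposed mapping can be sketched as follows. This is an illustration under stated assumptions, not the content of the patch: the struct and function names are hypothetical, -1 is used to mean "unset", and the handling of a uniprocessor ([30] set) is an assumption because the proposal only covers the case where [30] is clear.

```c
#include <stdint.h>

#define MPIDR_U_BIT   (1u << 30)   /* uniprocessor system when set */
#define MPIDR_MT_BIT  (1u << 24)   /* multi-threading when set */

/* Hypothetical per-cpu topology record; -1 means "unset". */
struct cpu_topo {
    int thread_id;
    int core_id;
    int socket_id;
};

/* Map a raw MPIDR value to thread/core/socket IDs (hypothetical helper). */
struct cpu_topo mpidr_to_topology(uint32_t mpidr)
{
    struct cpu_topo t = { -1, 0, -1 };
    int aff0 = mpidr & 0xffu;           /* MPIDR[7:0]   */
    int aff1 = (mpidr >> 8) & 0xffu;    /* MPIDR[15:8]  */
    int aff2 = (mpidr >> 16) & 0xffu;   /* MPIDR[23:16] */

    if (mpidr & MPIDR_U_BIT)
        return t;   /* assumption: a uniprocessor is reported as core 0 */

    if (mpidr & MPIDR_MT_BIT) {
        /* [24] set: thread, core and socket levels are all present */
        t.thread_id = aff0;
        t.core_id = aff1;
        t.socket_id = aff2;
    } else {
        /* [24] clear: no thread level, thread ID stays unset */
        t.core_id = aff0;
        t.socket_id = aff1;
    }
    return t;
}
```

For example, 0x80000901 gives core ID 1 in socket 9 with no thread ID.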

The ARM cpu topology patch can be found here



The scheduler uses the sched_domain structure to describe the topology of the platform and to set the scheduling policy of each level. There are 3 interesting domain levels which could map to an ARM MPCore system:

  • SMT domain: for multi-threading in a core
  • MC domain: for multi-core in a package
  • CPU domain: for multi-package

For performance reasons, the sched_domain hierarchy is optimized to keep only the sched_domain levels that are relevant for a given system description. Only one sched_domain level will be kept for a Cortex-A9 MPCore, as described below.

The scheduler uses kernel topology for building its sched_domain hierarchy. Some functions are defined by default when an architecture doesn't support the functionality of cpu topology. This default configuration maps all cores as independent cpus.

Without the cpu topology, the default topology of sched_domain for a dual/quad core Cortex-A9 is the following one:

  • 1 CPU level sched_domain with one sched_group for each core

With the cpu topology, the kernel becomes aware of the link between cores and the sched_domain becomes:

  • 1 MC level sched_domain with one sched_group for each core

The main difference is the level of the sched_domain (MC vs CPU) which implies some differences in the configuration flags. The default MC level configuration sets the SD_SHARE_PKG_RESOURCES flag to mark the sharing of resources like the cache between groups. This flag is used when the scheduler wakes up one task and selects its run queue. It always tries to use the current cpu or the previously used one. If this cpu is not idle, the scheduler seeks an idle cpu which shares resources and uses it instead. Therefore, we should have a better spreading of tasks on cpus and an improvement of performance.
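The wake-up selection described above can be sketched as follows. This is a simplified model, not the kernel code: the function name, the fixed NR_CPUS and the idle[] array are hypothetical, and all cpus are assumed to belong to one MC level sched_domain and therefore to share resources.

```c
#include <stdbool.h>

#define NR_CPUS 4   /* example value: quad core Cortex-A9 */

/*
 * Hypothetical helper: pick the cpu on which a waking task will run.
 * `target` is the current or previously used cpu, idle[i] tells whether
 * cpu i is idle, and `share_pkg_resources` reflects the
 * SD_SHARE_PKG_RESOURCES flag of the sched_domain.
 */
int select_wakeup_cpu(int target, const bool idle[NR_CPUS],
                      bool share_pkg_resources)
{
    int cpu;

    /* An idle target is always kept. */
    if (idle[target])
        return target;

    /* Without shared resources, the task stays on the target. */
    if (!share_pkg_resources)
        return target;

    /* Otherwise look for an idle cpu that shares resources. */
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (idle[cpu])
            return cpu;

    return target;
}
```

With the flag set, a task waking on a busy cpu0 moves to the first idle cpu, which spreads tasks; with the flag cleared it stays on cpu0.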

Load balancing

The load balancing is used to ensure that no cpu is overloaded while others are idle or that there is no obvious imbalance between running cpus. This check is effectively done between groups at each sched_domain level that has the SD_LOAD_BALANCE flag. There are 2 main methods for checking the load balance:

  • Monitoring: The periodic check of the load balancing across cpus is done on idle and running cpus.
  • Events: The load balance can be checked on a newly idle cpu (with the SD_BALANCE_NEWIDLE flag) and during the wake-up of a task.

Task wake up

The task wake up event uses a dedicated algorithm which is simpler and faster than the load_balance function. For a new task/program, the scheduler looks for a sched_group that is idler than the local one. A threshold of half the value above 100% of imbalance_pct is used to decide whether a group is idler than the local one (different cpu load estimations are also used). Once a group is selected, the idlest cpu in this group is chosen. For a Cortex-A9 MPCore, we have one group for each cpu, which implies that we are looking for the "idlest" cpu, but on a multi-package system we might not select the idlest cpu if it's not a member of the idlest group.
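The threshold can be illustrated with a small sketch. The function names are hypothetical and the loads are plain numbers, whereas the scheduler uses its own cpu load estimations. An imbalance_pct of 125 is only an example value.

```c
#include <stdbool.h>

/*
 * Wake-up threshold: 100% plus half of what imbalance_pct adds above 100%.
 * For example, imbalance_pct = 125 gives 112 (integer division).
 */
unsigned wakeup_threshold_pct(unsigned imbalance_pct)
{
    return 100 + (imbalance_pct - 100) / 2;
}

/*
 * Hypothetical helper: a remote group counts as idler than the local one
 * only if the local load reaches the remote load scaled by the threshold.
 */
bool group_is_idler(unsigned long local_load, unsigned long remote_load,
                    unsigned imbalance_pct)
{
    return 100 * local_load >=
           wakeup_threshold_pct(imbalance_pct) * remote_load;
}
```

With imbalance_pct = 125, a remote group carrying a load of 1000 is only preferred once the local load reaches 1120; smaller differences keep the task in the local group.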


If the task is not a new one, the SD_WAKE_AFFINE flag ensures that the task will stay in the same sched_domain. The same threshold as above is used to choose a preferred cpu between the previously used cpu and the current one. An additional check is done to select an idle cpu whenever the preferred cpu is not idle but there is an idle cpu which shares resources with it (SD_SHARE_PKG_RESOURCES flag in the sched_domain).



Other load balance checks are done with the load_balance function. For each sched_domain level with the SD_LOAD_BALANCE flag, this function looks for the busiest group from which the cpu could pull some tasks. There are several ways to trigger a load balance. The main condition is described below and the other ones are described later in the document.

First of all, the cpu on which load_balance runs must be the appropriate one: the 1st idle cpu of a group (or the 1st cpu of the group if there is no idle cpu) is the only one which is eligible for the periodic idle and not-idle load balancing, whereas the newly idle load balancing is done on all cpus of a sched_group.

Then, the scheduler must find a busiest group from which it could pull some load. The busiest group is the group with the highest average load per cpu which is also out of capacity. The capacity of a group reflects the number of tasks that can run simultaneously: group_power / SCHED_POWER_SCALE.

Finally, the cpu will not pull additional load if the average load of its local group is already above the sched_domain's average load. Otherwise, a load balance will be tried between the busiest group and the cpu if there is an obvious imbalance (more than imbalance_pct %) between these 2 groups.
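The capacity computation and the final imbalance check can be sketched as follows. The function names are hypothetical, the value 1024 for SCHED_POWER_SCALE and the rounding to the closest integer are assumptions about kernels of that period, and the loads are plain numbers rather than the scheduler's own load estimations.

```c
#include <stdbool.h>

#define SCHED_POWER_SCALE 1024UL   /* assumed value */

/* Number of tasks a group can run simultaneously (rounded to closest). */
unsigned long group_capacity(unsigned long group_power)
{
    return (group_power + SCHED_POWER_SCALE / 2) / SCHED_POWER_SCALE;
}

/* A group is out of capacity when it runs more tasks than its capacity. */
bool group_out_of_capacity(unsigned long nr_running,
                           unsigned long group_power)
{
    return nr_running > group_capacity(group_power);
}

/*
 * Hypothetical helper for the last step: the cpu pulls load only if its
 * local group is below the sched_domain average and the busiest group
 * exceeds the local one by more than imbalance_pct percent.
 */
bool should_pull_load(unsigned long local_load, unsigned long busiest_load,
                      unsigned long domain_avg_load, unsigned imbalance_pct)
{
    if (local_load >= domain_avg_load)
        return false;

    return 100 * busiest_load > imbalance_pct * local_load;
}
```

A single Cortex-A9 core with the default group_power of SCHED_POWER_SCALE has a capacity of 1, so a group made of one core is out of capacity as soon as it runs 2 tasks.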

Other load balancing triggers

In addition to the previous conditions, several other triggers can force or prevent a load balance.

  • TODO: Add other triggers

Power saving balancing

The system tries to do power saving load balance when no obvious imbalance between groups has been found during the check of the load balance.

In order to do power saving load balancing, the sched_domain must have at least 2 sched_groups and the SD_POWERSAVINGS_BALANCE flag. The default sched_domain configuration sets the SD_POWERSAVINGS_BALANCE flag at the CPU level for an SMP system (without multi-threading), but cpus have only an MC sched_domain level on a Cortex-A9 (as explained in the sched_domain section).

At this stage, we have confirmation that sched_mc has been designed for power saving load balancing on multi-package systems and not on single-package ones. We now have 2 solutions for taking advantage of level 2 of sched_mc on a Cortex-A9:

  • either enable the SD_POWERSAVINGS_BALANCE flag at the MC level even if there is no multi-threading,
  • or modify the cpu topology and emulate a multi-package system instead of the real multi-core/single-package system. The scheduler gives the architecture a chance to modify its topology when the scheduling policy is changed, so we can have one topology description which is performance oriented and another one which is power saving oriented. The arch_update_cpu_topology function is called before building the sched_domain.

Even if the 1st solution seems to be the simplest one, it's not the best choice because of the SD_SHARE_PKG_RESOURCES flag. This flag enables the scheduler to migrate a waking task to an idle cpu of the sched_domain which is the opposite of what we want to do to save power - trying to gather tasks on few cpus and migrate tasks only if cpus are out of capacity.

Once we have a sched_domain with the SD_POWERSAVINGS_BALANCE flag and at least 2 sched_groups, the load balancer looks for a nearly idle group and a nearly full group in order to fill one of them with both loads. These nearly idle/full constraints (0 < running threads in the group < group capacity) imply a group capacity higher than 1. For a quad core, we can emulate a dual package / dual core system and create 2 sched_groups with 2 cores each at the CPU level. This virtual topology can be used to keep threads in one virtual package and to delay the migration decision to the periodic load_balance step instead of taking it at wakeup. The screen-shots (kernelshark outputs) below show how cyclictest's threads are spread across cpus for a normal topology and for a virtual dual packages topology (cpu0 and cpu1 in package0, cpu2 and cpu3 in package1). We can see that 3 cores are used with the normal topology but only 2 cores with the virtual one.

  • Screen-shots (kernelshark outputs): 5 threads and 10 threads, each with the quad cores topology and with the virtual dual packages topology.

The virtual dual packages configuration not only enables the power saving load balance but also impacts the normal load balance, and the above results are linked more to the behavior of the latter with this new topology. This topology is of interest only if each core can be power gated independently. For a dual core, there is no other choice (with the current implementation) than to increase the cpu_power of a core in order to increase its capacity and to pull/keep tasks on one core. The scheduler periodically computes the cpu_power of each core, and this computation takes into account hyper-threading, the rt load and an architecture dependent feedback through the arch_scale_freq_power function. The default value returned by arch_scale_freq_power is SCHED_POWER_SCALE and the related cpu capacity is 1.
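The virtual dual packages topology and the nearly idle/full constraint can be sketched as follows. Both helpers are hypothetical illustrations: the first maps a cpu number to its virtual package as in the example above (cpu0 and cpu1 in package0, cpu2 and cpu3 in package1), the second expresses the constraint 0 < running threads < group capacity.

```c
#include <stdbool.h>

#define CORES_PER_VIRTUAL_PACKAGE 2   /* quad core seen as 2 x 2 cores */

/* Hypothetical helper: virtual package of a cpu on the quad core. */
int virtual_package_id(int cpu)
{
    return cpu / CORES_PER_VIRTUAL_PACKAGE;
}

/*
 * Hypothetical helper: a group is a candidate for power saving balance
 * only if it is neither idle nor full.
 */
bool group_is_partially_loaded(unsigned nr_running, unsigned capacity)
{
    return nr_running > 0 && nr_running < capacity;
}
```

A group with a capacity of 1 can never satisfy the constraint, because it is either idle or full. This is why groups of 2 cores, or a larger cpu_power on a dual core, are needed.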

Idle load balancing

Timer and irq migration

We need to check how, and which, pinned timers can impact the power saving efficiency of the sched_mc scheduler.


We have at least 3 different solutions for decreasing a core load:

  • sched_mc and sched_smt with load balancing
  • cgroup + cpuset
  • cpu hotplug

Each solution has different costs, time scales and responsiveness, and we have to define the power saving level targeted by sched_mc on ARM platforms. This target could be different for an embedded device like a smartphone than for a laptop or a server. Once it is defined, we have to check whether the current policy matches the targeted power saving level.

The current sched_mc power saving levels are:

enum powersavings_balance_level {
        POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
        POWERSAVINGS_BALANCE_BASIC,     /* Fill one thread/core/package
                                         * first for long running threads
                                         */
        POWERSAVINGS_BALANCE_WAKEUP,    /* Also bias task wakeups to semi-idle
                                         * cpu package for power savings
                                         */
};

Performance and power saving tests

We must check both the performance regression and the power improvement of sched_mc on ARM platforms.

Performance tests

We will test each sched_mc mode and measure the performance decrease, which should be at most minor with sched_mc_power_savings=0. kernbench and ebizzy are the preferred benchmarks for testing performance, but we might choose another one which better suits our requirements.

Select a bench: kernbench/ebizzy/sysbench/Bbench... other ?

The performance tests list is available here

Power saving tests

We will test the power improvement of sched_mc on relevant use cases.

Select measurements: P and C state statistics, wake-ups per second. The main advantage of these statistics is that they can be easily measured and extracted in order to get results which are only linked to the ARM processor. On the other hand, these statistics might not always imply system-wide power saving improvements.

Define use cases for smartphone/laptop/server

The power saving tests list is available here

Linked presentation


WorkingGroups/PowerManagement/Archives/SchedMc (last modified 2013-08-21 10:56:54)