Multiprocessor overview

Our industry is in a constant search for better performance (better MIPS/watt ratios). An increase in processing power can, in the first instance, be achieved by increasing the clock rate and by making instruction execution more efficient in general. This is done by using more transistors on the chip to reduce the number of clock cycles required to execute an instruction, and by increasing the on-chip cache sizes to reduce how often the processor has to wait for data to be delivered from slow external RAM. Both approaches are made possible by the ever-shrinking size of the transistors on the chip. While previous generations of smartphone chips used 90-nanometer structures, current high-end smartphones use 45-nanometer technology, and the next step is 32- and 28-nanometer structures. When transistors get smaller, more of them fit on the chip and power consumption at high clock rates is lowered. Another way of increasing processing power is to have several CPUs (an SMP architecture) and have the operating system assign tasks that want to execute simultaneously to different processor cores.

Symmetric multiprocessing (SMP)

Symmetric multiprocessing (SMP) is a multicore computer hardware architecture in which two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance, so that multiple CPUs are available to execute individual tasks simultaneously. It differs from traditional asymmetric processing, in which a single CPU executes all tasks. SMP systems require different programming methods to achieve maximum performance. With proper OS support, SMP systems can distribute multiple tasks between processors to balance the workload efficiently.
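As a minimal illustration (not from the original article) of how a program can see the CPUs an SMP operating system exposes, the sketch below uses the widely supported, though non-standard, sysconf() selectors _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN on Linux/glibc:

    /* smp_cpus.c - print how many CPUs the OS exposes (Linux/glibc; build: gcc smp_cpus.c) */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long configured = sysconf(_SC_NPROCESSORS_CONF);  /* CPUs configured in the system */
        long online     = sysconf(_SC_NPROCESSORS_ONLN);  /* CPUs currently online */

        printf("configured CPUs: %ld\n", configured);
        printf("online CPUs:     %ld\n", online);
        return 0;
    }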

Amdahl's law is used predominantly to calculate the maximum theoretical performance improvement when using multiple processors.

Amdahl's law

The maximum performance improvement of a system using N processors, given a factor F that specifies the portion of the system that cannot be parallelized (the portion of the system that is sequential in nature), is Speedup(N) = 1 / (F + (1 - F) / N).

Amdahl's law

In the accompanying graph, the top line shows the number of processors. Ideally, this is what you'd like to see when you add additional processors to solve a problem. Unfortunately, because not all of the problem can be parallelized and there's overhead in managing the processors, the speedup is quite a bit less. At the bottom (purple line) is the case of a problem that is 90% sequential. In the best case for this graph, the brown line shows a problem that's 10% sequential and, therefore, 90% parallelizable. Even in this case, ten processors perform only slightly better than five.
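To make those curves concrete, here is a small illustrative sketch (not part of the original text) that evaluates Amdahl's law, Speedup(N) = 1 / (F + (1 - F)/N), for the two cases mentioned above, F = 0.9 and F = 0.1:

    /* amdahl.c - evaluate Amdahl's law for a few processor counts. */
    #include <stdio.h>

    /* Speedup(N) = 1 / (F + (1 - F) / N), where F is the sequential fraction. */
    static double speedup(double f, int n)
    {
        return 1.0 / (f + (1.0 - f) / n);
    }

    int main(void)
    {
        const double fractions[] = { 0.9, 0.1 };   /* 90% and 10% sequential */
        for (int i = 0; i < 2; i++) {
            printf("F = %.1f:", fractions[i]);
            for (int n = 1; n <= 10; n++)
                printf("  N=%d -> %.2fx", n, speedup(fractions[i], n));
            printf("\n");
        }
        return 0;
    }

For F = 0.1 this prints a speedup of about 3.6x with five processors and about 5.3x with ten, matching the observation that ten processors perform only modestly better than five.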

More performance with less power consumption

big.LITTLE architectures

In October 2011, ARM announced the "big.LITTLE" design. big.LITTLE processing addresses how to create a System on Chip (SoC) that provides both high performance and extreme power efficiency to extend battery life. big.LITTLE combines the performance of a symmetric cluster of big cores (ARM Cortex-A15 MPCore™) with the energy efficiency of a symmetric cluster of little cores (the Cortex-A7 processor), and enables the same application software to be seamlessly switched between them by ensuring cache coherency between the big core cluster and the little core cluster. By selecting the optimum processor for each task, big.LITTLE can extend battery life by up to 70%.

The big core/little core concept has been kicked around in the research community for nearly a decade. One of the first papers on the subject, "Single-ISA Heterogeneous Multi-Core Architectures", was written in 2003 by Rakesh Kumar.

What the big/little model brings to the table is having both types of cores on the same die. Perhaps more importantly, unlike the CPU-GPU integration that AMD is doing with their Fusion chips and what NVIDIA is planning to do with their "Project Denver" platform, the big/little model consolidates on a homogeneous instruction set. A homogeneous instruction set allows easier software development, without a complex toolchain with multiple compilers, runtimes, libraries, and debuggers. Even Intel has concluded that big/little cores (Intel calls them asymmetric cores) seem to deliver the best results (the best of both latency and throughput). Beyond the mobile computing domain, with the unveiling of the 64-bit ARM architecture last year and with companies like HP delving into ARM-based gear for the datacenter, big/little implementations of ARM servers could appear as early as the middle of this decade.

There could be x86-based big/little server chips even sooner: a recent patent Intel filed regarding mixing asymmetric x86 cores in a processor suggests the chipmaker has indeed given serious thought to big/little products. The big/little approach won't be a panacea for energy-efficient computing, but it looks like one of the most promising approaches, at least at the level of the CPU.

NVIDIA vs Qualcomm

NVIDIA's Tegra 3 (Kal-El) processor implements a novel Variable Symmetric Multiprocessing (vSMP) technology, also known as 4-PLUS-1. NVIDIA's latest Tegra design features 4 CPU cores, so four tasks can be run in parallel. In addition, NVIDIA includes a fifth core, which they call a "companion core", that takes over when only a little processing power is needed, for example while the display is off and only low-intensity background tasks have to be served, which is probably most of the time. All five CPU cores are identical ARM Cortex-A9 CPUs, and they are individually enabled and disabled (via aggressive power gating) based on the workload. A demonstration of Tegra 3 vSMP monitoring is available at http://www.youtube.com/watch?v=R1qKdBX4-jc&feature=player_embedded#!

Just as the Linux 3.2 kernel merge window opened, NVIDIA published a set of patches that enable support for their Tegra 3 (Kal-El) processor. The patch series that enables "Tegra 30" (also referred to as T30) support in the Linux kernel is available at https://lkml.org/lkml/2011/10/25/305.

Qualcomm

Qualcomm's latest Krait architecture in the Snapdragon S4 (MSM8960) uses asynchronous SMP (aSMP) to conserve power. Instead of requiring all cores to run at the same clock speed, each core can run at a different speed depending on how much work the operating system is asking that core to do. Each core in the aSMP system has a dedicated voltage and clock, including the L2 cache. This enables each CPU core to run at the most efficient power point (voltage and frequency) for the type of workload being executed. According to Qualcomm, the architecture results in a 25–40% power improvement over current synchronous SMP architectures. Even better, each core that is not being used can be completely collapsed independently, resulting in no power consumption in the idle state.

Qualcomm MSM8960 support has been integrated since kernel 3.0.1. A benchmark result is available at http://www.anandtech.com/show/5559/qualcomm-snapdragon-s4-krait-performance-preview-msm8960-adreno-225-benchmarks/1

  • ARM's big.LITTLE architecture
  • Qualcomm's Krait architecture
  • NVIDIA's Tegra 3 architecture

Linaro approach for big.LITTLE

In general, the Linux scheduler expects all available CPUs to have the same performance characteristics, but in the big.LITTLE case the cores are not symmetric between clusters. To get the best user experience together with the lowest possible battery consumption, the Linux scheduler needs to be improved to support the big.LITTLE architecture.

ARM Switcher

ARM Ltd has produced a prototype software solution consisting of a small hypervisor that uses the virtualization extensions of the Cortex-A15 and Cortex-A7 to make both clusters appear to the underlying operating system as if there were only one Cortex-A15 cluster. It can migrate the CPU state from one cluster to the other and resume system execution on the other cluster without the underlying operating system being aware of the change. It should work for any operating system targeting a Cortex-A15, without modifications to that operating system. However, it needs to virtualize all the differences between the A15 and the A7: the hypervisor has to trap every cache maintenance operation and translate it into the equivalent operation on the core that is actually running. Another disadvantage is the overhead of saving and restoring the full CPU state, because the hypervisor doesn't know what is going on inside the kernel.

Linaro in-kernel switcher

Linaro put the switcher functionality directly in the kernel, based on ARM Ltd's BSD-licensed switcher code, to reduce switching overhead. This solution needs much less support from the hypervisor code and improves switching performance by not having to trap any cache maintenance instructions, by limiting the CPU context transfer to only the minimum set of active registers, and by sharing the same address space with the kernel.

MP scheduler

The alternative is to expose all the cores to the kernel and let the scheduler make the right decisions. This means the MP scheduler work needs to change the Linux scheduler code, which requires a high degree of collaboration with the upstream scheduler maintainers and a good amount of community effort.

In January 2012, Linaro kicked off a big.LITTLE initiative with two projects: big.LITTLE Switcher and big.LITTLE MP. The goal of big.LITTLE MP is to allow both the Cortex-A15 (if required) and the Cortex-A7 to be powered on and simultaneously executing code.

The good news, as described earlier, is that the architectures most likely to adopt the big/little paradigm over the next few years are x86 and ARM. This means many players will want to change the scheduler to support their big/little architecture, to deliver the best user experience together with the lowest possible battery consumption and high performance.

Power consumption

Total chip power consumption is governed by two influences: leakage power and dynamic power. Dynamic power increases roughly linearly with clock frequency but with the square of the voltage, so when processors are run at high clock speeds the voltage must be kept as low as possible. Unfortunately, optimizing the chip for low-voltage operation increases the leakage power, i.e. the power consumed whenever voltage is applied to a transistor, which always requires some power to keep its state. It is this leakage power that becomes the dominant power consumption source when the CPU is idle. In other words, a CPU designed for lower clock speeds has a power consumption advantage in the idle state.
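As a rough, illustrative sketch of the relationship described above (using the common approximation P_dynamic ≈ C·V²·f; the capacitance and operating points below are made-up values, so only the ratio is meaningful):

    /* dynpower.c - illustrate P_dynamic ~ C * V^2 * f for two operating points. */
    #include <stdio.h>

    static double dynamic_power(double c, double v, double f_hz)
    {
        return c * v * v * f_hz;   /* switched capacitance * voltage^2 * frequency */
    }

    int main(void)
    {
        double c  = 1e-9;                                /* hypothetical switched capacitance (F) */
        double hi = dynamic_power(c, 1.2, 1.5e9);        /* 1.2 V at 1.5 GHz */
        double lo = dynamic_power(c, 0.9, 0.5e9);        /* 0.9 V at 500 MHz */

        printf("high operating point: %.2f W\n", hi);
        printf("low operating point:  %.2f W\n", lo);
        printf("ratio:                %.1fx\n", hi / lo);
        return 0;
    }

Tripling the frequency and raising the voltage from 0.9 V to 1.2 V increases the dynamic power by roughly 5x in this toy example, which is why running fast also means running at carefully controlled voltages.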

Scheduler for multiprocessing

An important goal of a scheduler is to allocate CPU time slices efficiently to execute multiple programs at the same time while providing a responsive user experience. The scheduler can also be faced with conflicting goals such as minimizing response times for critical real-time tasks while maximizing overall CPU utilization.

History of Linux schedulers

  • The 1.2 Linux scheduler used a circular queue for runnable task management that operated with a round-robin scheduling policy.
  • Linux version 2.2 introduced scheduling classes, permitting scheduling policies for real-time tasks, non-preemptible tasks, and non-real-time tasks. The 2.2 scheduler also included support for symmetric multiprocessing (SMP).
  • The 2.4 kernel included a relatively simple scheduler that operated in O(N) time (as it iterated over every task during a scheduling event). The 2.4 scheduler divided time into epochs, and within each epoch, every task was allowed to execute up to its time slice. If a task did not use all of its time slice, then half of the remaining time slice was added to the new time slice to allow it to execute longer in the next epoch. Although this approach was relatively simple, it was inefficient, lacked scalability, and was weak for real-time systems. It also lacked features to exploit new hardware architectures such as multi-core processors.
  • In the early 2.6 kernels, Ingo Molnar designed and introduced the O(1) scheduler to solve many of the problems with the 2.4 scheduler; namely, the scheduler was no longer required to iterate the entire task list to identify the next task to schedule. The O(1) scheduler kept track of runnable tasks in per-priority run queues (actually two sets per CPU, one for active and one for expired tasks); to identify the task to execute next, the scheduler simply needed to dequeue the next task off the appropriate active per-priority run queue. Later, Con Kolivas introduced the Rotating Staircase Deadline Scheduler (RSDL). RSDL is based on the concept of "fairness": processes are treated equally and are given the same time slices, and the scheduler doesn't care or even try to guess whether a process is CPU- or I/O-bound.
  • For the later 2.6 kernels, Ingo Molnar, the creator of the O(1) scheduler, developed the CFS (Completely Fair Scheduler), taking the basic design element of fair scheduling from RSDL (created by Con Kolivas). The 2.6.23 kernel shipped with this modular scheduler, CFS, along with CFS group scheduling, and the scheduling algorithm was further enhanced in 2.6.24.
  • Kolivas has continued to update his Brain Fuck Scheduler (BFS) patch for every major Linux kernel release since Linux 2.6.31 in 2009. Kolivas's patches are available from his Kolivas.org server. Kolivas is not motivated to push it upstream, and the Brain Fuck Scheduler is designed to improve the CPU scheduling experience on desktops and systems with few cores or less. BFS is not written to scale to systems with many processing cores, so it is not ideal for powerful servers and workstations and is not a one-for-all replacement for CFS. The latest BFS benchmark results are available.

  • Rakib Mullick announced the Barbershop Load Distribution (BLD) algorithm scheduler on the Linux kernel mailing list on February 12, 2012. It only tries to distribute the load properly by tracking the lowest- and highest-loaded runqueues of the system. This technique never tries to balance the system load at idle context, which the current scheduler does. The motivation behind the technique is to distribute load properly amongst the CPUs, as if a load balancer weren't needed, and to make load distribution easier.

Before the 2.6 kernel, the scheduler had critical limitations:

  • The pre-2.6 scheduler was implemented using an algorithm with O(n) complexity, so the more tasks (n) are active, the longer it takes to schedule a task.
  • The pre-2.6 scheduler used a single runqueue for all processors in a symmetric multiprocessing (SMP) system.
  • The pre-2.6 scheduler also used a single runqueue lock.
  • Preemption wasn't possible in the earlier scheduler.

The early 2.6 O(1) scheduler was designed and implemented by Ingo Molnar to achieve O(1) complexity for wakeup, context-switch, and timer interrupt overhead, using the following techniques:

  • Each CPU has a runqueue made up of 140 priority lists that are serviced in FIFO order. The first 100 priority lists of the runqueue are reserved for real-time tasks, and the last 40 are used for user tasks. Each runqueue has its own lock, which allows all CPUs to schedule tasks without contention from other CPUs.
  • Because there is a task queue per CPU, work can be balanced using the measured load of all CPUs in the system. Every 200 milliseconds, the scheduler performs load balancing to redistribute tasks and maintain a balance across the processor complex.
  • When a CPU's active runqueue is exhausted, it is swapped with that CPU's expired runqueue.
  • To make priority-based scheduling simple and efficient, a bitmap is used to indicate which priority lists contain runnable tasks (a find-first-bit-set instruction finds the highest-priority set bit in one of five 32-bit words); see the sketch after this list.
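As a simplified, user-space sketch of this bitmap technique (this is not the actual kernel code; __builtin_ctz is a GCC/Clang builtin standing in for the kernel's find-first-set helpers):

    /* o1_bitmap.c - simplified sketch of the O(1) scheduler's priority bitmap.
     * 140 priorities, one bit per priority; a lower index means a higher priority.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define WORDS 5                      /* five 32-bit words cover 140 priority bits */

    static uint32_t bitmap[WORDS];

    static void mark_runnable(int prio)  /* a runnable task exists at this priority */
    {
        bitmap[prio / 32] |= 1u << (prio % 32);
    }

    static int find_highest_prio(void)   /* first set bit = highest priority */
    {
        for (int w = 0; w < WORDS; w++)
            if (bitmap[w])
                return w * 32 + __builtin_ctz(bitmap[w]);  /* find-first-set within the word */
        return -1;                        /* no runnable task */
    }

    int main(void)
    {
        mark_runnable(120);   /* a normal user task */
        mark_runnable(15);    /* a real-time task (lower number = higher priority) */
        printf("next priority to run: %d\n", find_highest_prio());   /* prints 15 */
        return 0;
    }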

CFS-related kernel configuration information is also available in "Tuning the Linux Kernel's Completely Fair Scheduler".

Of course, the operating system by itself is not enough for full parallelism. Recall that the power of SMP lies in TLP (thread-level parallelism). Single monolithic (non-threaded) programs can't exploit SMP, but SMP can be exploited in programs that are composed of many threads, which can be distributed across the cores.

pthreads - Portable Operating System Interface (POSIX) threads - are a great way to build threaded applications that are able to take advantage of SMP. POSIX threads provide the threading mechanism as well as shared memory. When a program creates some number of threads, each thread is provided its own stack (local variables and state) but shares the data space of the parent.

POSIX provides the mutex facility to create critical sections that enforce exclusive access to an object (a piece of memory) by a single thread, as shown in the sketch below.
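A minimal POSIX threads sketch along these lines (nothing Linux-specific is assumed beyond pthreads itself): several threads share the parent's data space, and a mutex makes the increment a critical section.

    /* pthread_mutex_demo.c - threads share the process's data; a mutex guards it.
     * Build: gcc -pthread pthread_mutex_demo.c
     */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define LOOPS       100000

    static long counter;                                   /* shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < LOOPS; i++) {
            pthread_mutex_lock(&lock);                     /* enter critical section */
            counter++;
            pthread_mutex_unlock(&lock);                   /* leave critical section */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);

        printf("counter = %ld (expected %d)\n", counter, NUM_THREADS * LOOPS);
        return 0;
    }

Without the mutex, the threads would race on the shared counter and the final value would usually be lower than expected.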

Linux Scheduler simulation

Developing schedulers that provide suitable behavior for various SMP topologies can be difficult. Luckily, the user-space Linux Scheduler Simulator (LinSched) hosts your Linux scheduler in user space (for scheduler prototyping) while modeling arbitrary hardware targets to validate your scheduler across a spectrum of topologies. LinSched was originally developed at the University of North Carolina (with grants from IBM and Intel) but was recently revived by Google to validate scheduling policies.

LinSched is a Linux scheduler simulator that resides in user space. It isolates the Linux scheduler subsystem and builds enough of the kernel environment around it that it can be executed within user space. It also builds a set of application programming interfaces (APIs) around it to provide the necessary stimuli to validate scheduling workloads and collect enough relevant data to understand its behavior.

As a user-space process, it is easy to execute the scheduler and simple to collect data about it, and it's also easy to attach a debugger such as the GNU Debugger (GDB) to provide even greater insight into scheduler operation.

Because LinSched uses the Linux scheduler subsystem itself for its simulation, it's much simpler to make changes and then integrate them back into the kernel.

LinSched architecture

LinSched provides useful functionality that permits the construction of dynamic scheduling scenarios that are also repeatable. LinSched hosts the Linux scheduler as an isolated subsystem in user space, for easier debugging of and experimentation with new and existing scheduling policies. This tool can be extremely useful in the early stages of scheduling policy development, as it allows for the rapid prototyping and evaluation of many different policies, from which the most promising ones can be selected for further detailed experimentation. Porting code back into Linux is made less difficult since LinSched shares most of its code with the Linux scheduler and related subsystems.

Processor time received by the tasks in four different workloads, in LinSched and Linux

Processor affinity

In a multi-tasking operating system, the work to be performed is broken into discrete components called tasks. These tasks are semi-independent portions of the total work that may be scheduled to execute when needed and then swapped out so that another task may run as appropriate. Processor affinity is the term used to describe the rules for associating certain threads with certain processors.

The work to be performed is broken into processes and further into threads. A process is a complete unit that uses a virtual memory space. The threads of a process share the same virtual address space but run semi-independently, and a thread is the unit that the operating system schedules to run from time to time. Processor affinity takes advantage of the fact that some remnants of a process may remain in one processor's state (in particular, in its cache) from the last time the process ran. Scheduling it to run on the same processor the next time could result in the process running more efficiently by reducing performance-degrading situations such as cache misses.

  • On SGI systems, dplace binds a process to a set of CPUs.
  • In Linux, taskset is used to retrieve or set a process's CPU affinity. CPU affinity is a scheduler property that "bonds" a process to a given set of CPUs on the system (see the sketch after this list).
  • In NetBSD, the psrset utility sets a thread's affinity to a certain CPU set. In FreeBSD, the cpuset utility is used to create CPU sets and to assign processes to these sets. NetBSD 5.0, FreeBSD 7.2, and later versions can use pthread_setaffinity_np and pthread_getaffinity_np.
  • On Windows NT, thread and process CPU affinities can be set separately by using the SetThreadAffinityMask and SetProcessAffinityMask API calls. CPUStres.exe in Windows 2000 allows you to start one or more threads that will each consume a CPU.

  • Mac OS X exposes an affinity API that provides hints to the kernel about how to schedule threads according to affinity sets.
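On Linux, the same effect as taskset can be achieved programmatically with sched_setaffinity(); the sketch below (Linux-specific, and the choice of CPU 0 is arbitrary) pins the calling process to one CPU and reads the mask back:

    /* affinity_demo.c - pin the calling process to CPU 0 (Linux-specific). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);                                   /* allow CPU 0 only */

        if (sched_setaffinity(0, sizeof(set), &set) != 0) { /* pid 0 = calling process */
            perror("sched_setaffinity");
            return 1;
        }

        /* Verify: read the mask back and print which CPUs are allowed. */
        if (sched_getaffinity(0, sizeof(set), &set) == 0)
            for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
                if (CPU_ISSET(cpu, &set))
                    printf("allowed to run on CPU %d\n", cpu);
        return 0;
    }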

Preferred Processor

In an SMP system, although each processor has access to all memory, each processor has its own cache copy of certain portions of memory (typically there are both Level 1 and Level 2 caches). When the processor can retrieve code or data from that cache instead of going over the memory bus to slower DRAM-based main memory, it will perform faster. The scheduler uses this preferred processor (also called the ideal processor) method of affinity to increase the odds of cache hits and improve performance. The preferred processor of a thread is the processor it last ran on, so on a system that is not overly busy, threads tend to be scheduled onto the same processor more often than random occurrence would suggest.

The Linux scheduler will honor a given CPU affinity, and the process will not run on any other CPUs. It also supports natural CPU affinity: the scheduler attempts to keep processes on the same CPU as long as practical, for performance reasons.

Parallel computing is a form of computation in which many calculations are carried out simultaneously,[1] operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel").

There are many types of parallelism, such as bit-level parallelism, instruction-level parallelism, data parallelism, task parallelism, and hardware parallelism. Multicore and symmetric multiprocessing can be categorized as hardware-parallel approaches.

SMP Programming

Linux works on SMP (symmetric multiprocessor) machines. SMP support was introduced with kernel version 2.0 and theoretically allows up to 16 CPUs.

Processes and kernel-threads are distributed among processors. User-space threads are not.

  • POSIX Threads
  • PVM / MPI Message Passing Libraries
  • fork() -- Multiple processes

fork() and PVM/MPI processes usually do not share memory; instead they communicate by means of IPC or a messaging API. They are not very specific to SMP, since they are used just as much, or more, on uniprocessor computers.
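In contrast to threads, processes created with fork() get separate address spaces and typically exchange data through IPC. The sketch below (illustrative only) uses a pipe, one of the simplest IPC mechanisms:

    /* fork_pipe_demo.c - parent and child do not share memory; they talk over a pipe. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fd[2];
        char buf[64];

        if (pipe(fd) != 0) {
            perror("pipe");
            return 1;
        }

        pid_t pid = fork();
        if (pid == 0) {                                   /* child: send a message and exit */
            close(fd[0]);
            const char *msg = "hello from the child process";
            write(fd[1], msg, strlen(msg) + 1);
            close(fd[1]);
            return 0;
        }

        close(fd[1]);                                     /* parent: read whatever the child sent */
        if (read(fd[0], buf, sizeof(buf)) > 0)
            printf("parent received: %s\n", buf);
        close(fd[0]);
        waitpid(pid, NULL, 0);
        return 0;
    }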

MITOpenCourseware offers a series of lectures on parallel programming concepts as well as a group project providing hands-on experience with parallel programming at http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-189-multicore-programming-primer-january-iap-2007/lecture-notes-and-video/
