Under construction

System setup

Kernel config


You must ensure that ftrace has been enabled on your platform:

Kernel hacking  --->
    [*] Tracers  --->
        --- Tracers
        [*]   Kernel Function Tracer 
        [*]   Interrupts-off Latency Tracer 
        [*]   Preemption-off Latency Tracer 
        [*]   Scheduling Latency Tracer 
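
Once the kernel is running, you can quickly check that these tracers are available at runtime (the mount command can be skipped if debugfs is already mounted on your platform):

mount -t debugfs nodev /sys/kernel/debug
cat /sys/kernel/debug/tracing/available_tracers

The enabled tracers (function, irqsoff, preemptoff, wakeup, ...) should be listed in addition to the default nop tracer.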

System time precision


The time precision of your system has a major impact on the reliability of the traces and the associated results (and on the behavior of your system too...).

The 2 examples below show the trace of the same workload on the same platform:

cyclictest -t 4 -L c0p15 -i 2000 -q -D 5

that creates 4 threads with wake-up periods of 2ms, 2.5ms, 3ms and 3.5ms respectively; the run time of each thread is 15% of its period. The same workload has been traced with 2 different precisions:

  • The 1st example uses a fine grained sched_clock that is based on the clocksource (41ns)

[snapshot: fine grained trace]

  • The 2nd example uses a coarse grained sched_clock that is based on the tick (5ms)

[snapshot: coarse grained trace]

The difference is quite obvious ...

So, start by checking the sched_clock precision, which is the one used by the tracer and the scheduler. The default sched_clock is based on the tick, but most systems register their clocksource as sched_clock.

You can look for a log like the following on the console:

sched_clock: 64 bits at 24MHz, resolution 41ns, wraps every 2863311519744ns
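
You can also check which clocksource is currently in use with the standard sysfs entry:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource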

NB: The modified version of cyclictest that has been used for this test is available here.

Using ftrace

You can find a complete description of ftrace here, with all the details of its extensive capabilities. Nevertheless, most of the time a "simple" setup will be enough to spy on your use case, and some tools are available to make your life easier :-)

trace-cmd


trace-cmd is a simple command line tool that configures ftrace and extracts data, which can then be viewed on your host with a nice GUI viewer named kernelshark. The source code can be found here: git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git

The easiest way is to compile the command line tool directly on your board, so you are sure to be aligned with the right header files and embedded libraries.
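
As a sketch, assuming git and a toolchain are available on the board, a native build boils down to:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git
cd trace-cmd
make
make install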

If you want to compile trace-cmd for Android, you need to statically link the libraries. As an example for ARM:

make LDFLAGS=-static CC=arm-none-linux-gnueabi-gcc trace-cmd

This presentation gives a complete description of the current and upcoming features of trace-cmd and ftrace.

Kernelshark


More details about the kernelshark GUI can be found here.

And the official page is available here.

The 2 snapshots above have been extracted from kernelshark.

Record a trace


There are 2 main ways to record a trace with trace-cmd, and you will have to choose one according to your constraints:

  • The 1st method is probably the simplest one and will enable you to record a lot of events over a long period without losing activity (see the examples after this list):

trace-cmd record [-v][-e event [-f filter]][-p plugin][-F][-d][-D][-o file]
    [-s usecs][-O option ][-l func][-g func][-n func]
    [-P pid][-N host:port][-t][-r prio][-b size][-B buf][command ...]
    [-m max]

This method creates per-CPU RT threads that periodically wake up and save the traces in a file or send them to a network port. But these extra threads can disturb the activity and the issue that you are trying to record.

  • If you want to minimize the impact of trace-cmd on your system, you should use the start and stop commands and leave the events in ftrace's circular buffers. Your use case will not be disturbed by trace-cmd's threads like above, but the depth of your trace is limited to the size of ftrace's buffers. The oldest events are overwritten by the newest when the buffers become full.

trace-cmd start [-e event][-p plugin][-d][-O option ][-P pid]
     Uses same options as record, but does not run a command.
     It only enables the tracing and exits

run the use case that you want to trace

trace-cmd stop

trace-cmd extract [-p plugin][-O option][-o file][-s]
     Uses same options as record, but only reads an existing trace.
     -s : extract the snapshot instead of the main buffer
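
As an illustration of both methods (the selected events and the traced command below are only examples to be adapted to your use case):

trace-cmd record -e sched -e irq -o trace.dat cyclictest -t 4 -q -D 5

or, with the start/stop variant:

trace-cmd start -e sched -e irq
run the use case that you want to trace
trace-cmd stop
trace-cmd extract -o trace.dat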

In addition, this slideset gives further information about the advantages and drawbacks of the different ways to use trace-cmd and ftrace.

Useful events


There is no exhaustive list of useful events because it clearly depends on the use case that you want to spy on. Nevertheless, you can use the following list when you start your investigation or, more generally speaking, to get a good starting view of the system behavior:

  • sched events will show you task scheduling and latency
  • irq events will show you irq and softirq activity
  • timer events will show you timer and hrtimer activity
  • workqueue events will show you which work items are queued

With this bunch of events, you will have a good view of your use case and you will then be able to tune your list of events according to which activity you want to investigate more deeply.
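
As an example, all 4 classes of events can be enabled in one record command, since the -e option of trace-cmd accepts a subsystem name and enables all the events of that subsystem:

trace-cmd record -e sched -e irq -e timer -e workqueue <your command>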

The attached script is a simple example of how to use trace-cmd to get the main events. It creates a raw data file that can be used with kernelshark and a text file that you can parse on the board from the command line to quickly check that the trace has caught the sequence that you want to spy on.

trace-activity.sh

Just launching the script will start a 5 second trace sequence and save the data in the dft and dft.txt files:

./trace-activity.sh

You can set the duration like below:

./trace-activity.sh 25

You can also specify which command you want to spy on. In this case, replace the duration argument with a "-":

./trace-activity.sh - testfile "cyclictest -t -q -D 10"

Analyse a trace


Once a trace has been recorded, the 1st step is to display the trace with kernelshark.
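
Assuming the trace has been saved in trace.dat (the default output file name of trace-cmd record), just pass it to kernelshark on your host:

kernelshark trace.dat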

To Be Filled

Extract CPU load


There are several ways (listed below) to get the system load from a trace.

Using the Idlestat tool

It is not usable with trace-cmd output for the moment. Idlestat uses irq and cpu_idle events to compute the idle statistics of your system. The tool takes the topology of your system into account to compute additional information like the idle statistics of a cluster or the efficiency of the governor policy.

Idlestat wiki page
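
As a sketch of typical idlestat usage (the options below come from the idlestat documentation; check idlestat --help for the exact options of your version):

idlestat --trace -f /tmp/mytrace -t 10
idlestat --import -f /tmp/mytrace

The 1st command captures a 10 second trace and the 2nd one reports the statistics from it.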

Using the trace-cmd output file

You can parse sched_switch events, the swapper/idle task run time and irq events to estimate the load of your system. In old kernels, you also have the sched_switch tracer, which is redundant with the sched_switch and sched_wakeup events. The trace-activity.sh script enables these events, so you can use the output data to evaluate the load of your use case.

The script below parses various types of trace files and extracts the time spent in idle.

cpuload.pl

If you want to parse a raw data file that has been generated by trace-cmd, you just need to give the file path and ensure that trace-cmd is available.

./cpuload.pl <file path>

As an example:

./cpuload.pl dft
CPU0 : idle 1807234 us / total 1991221 us
CPU0 : load 9.21%
CPU1 : idle 1962905 us / total 1966802 us
CPU1 : load 0.21%
CPU2 : idle 1729518 us / total 1961565 us
CPU2 : load 11.81%
CPU3 : idle 1952525 us / total 1999297 us
CPU3 : load 2.31%
CPU4 : idle 1809197 us / total 1975311 us
CPU4 : load 8.41%
CPU5 : idle 1978687 us / total 1997261 us
CPU5 : load 0.91%
CPU6 : idle 1792770 us / total 1999670 us
CPU6 : load 10.31%
CPU7 : idle 1911737 us / total 1937235 us
CPU7 : load 1.31%
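
As a sanity check, the reported load is essentially the non-idle fraction of each CPU's time over the trace window: load ≈ (total - idle) / total. For CPU2 above, (1961565 - 1729518) / 1961565 gives about 11.8%, consistent with the 11.81% that is reported (modulo the script's own rounding).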

You can also parse the output of ftrace with the nop or sched_switch tracer, but you will have to set the type of file:

./cpuload.pl <file.txt> nop

or

./cpuload.pl <file.txt> sched

Using perf

Perf uses ftrace and the PMU (Performance Monitoring Unit) to provide a lot of useful features for analyzing a use case. Some of these features are described below.

Compiling perf


You must ensure that the perf events have been enabled:

General setup  --->
    Kernel Performance Events And Counters  --->
        [*] Kernel performance events and counters 

Make sure to compile perf against the rootfs of your platform by adding --sysroot=<path to platform rootfs>.
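
As an example for an ARM target, a cross-build from the kernel tree could look like the line below (a sketch: the toolchain prefix is an assumption and the supported make variables may differ between kernel versions):

make -C tools/perf ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi- EXTRA_CFLAGS="--sysroot=<path to platform rootfs>"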

Perf can take advantage of the PMU, which is referred to as "hardware event" in the events list (perf list), but you must enable it and make sure that the device has been declared in the dtb of your platform:

Kernel Features  ---> 
    [*] Enable hardware performance counter support for perf events

        pmu {
                compatible = "arm,cortex-a9-pmu";
                interrupts = <0 7 IRQ_TYPE_LEVEL_HIGH>;
        };
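
Once the PMU has been probed correctly, the hardware events should show up as such in the events list:

./perf list hw

Events like cpu-cycles, instructions or cache-misses should then be tagged as [Hardware event].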

Check latency


Perf can extract the scheduling latency statistics of threads.

Record the use case:

./perf sched  record  <your command>

We can use cyclictest as an example:

#./perf sched  record  cyclictest -t 8 -n -q -D 10
# /dev/cpu_dma_latency set to 0us

T: 0 ( 1413) P: 0 I:1000 C:  10000 Min:     73 Act:   93 Avg:   93 Max:    2113
T: 1 ( 1414) P: 0 I:1500 C:   6667 Min:     78 Act:   94 Avg:   94 Max:     607
T: 2 ( 1415) P: 0 I:2000 C:   5000 Min:     89 Act:   96 Avg:   94 Max:     419
T: 3 ( 1416) P: 0 I:2500 C:   4000 Min:     87 Act:   95 Avg:   93 Max:    1081
T: 4 ( 1417) P: 0 I:3000 C:   3334 Min:     73 Act:   93 Avg:   94 Max:     243
T: 5 ( 1418) P: 0 I:3500 C:   2857 Min:     53 Act:   99 Avg:   93 Max:     173
T: 6 ( 1419) P: 0 I:4000 C:   2500 Min:     89 Act:   94 Avg:   94 Max:     390
T: 7 ( 1420) P: 0 I:4500 C:   2223 Min:     86 Act:   94 Avg:   94 Max:     316
[ perf record: Woken up 11 times to write data ]
[ perf record: Captured and wrote 24.002 MB perf.data (~1048659 samples) ]

Then, the scheduling latency figures are available with

./perf sched  latency

 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  ...
  cyclictest:1412       |     33.006 ms |      995 | avg:    0.024 ms | max:    0.251 ms | max at:     44.310525 s
  ...
  cyclictest:1414       |    157.335 ms |     6668 | avg:    0.021 ms | max:    0.380 ms | max at:     41.195775 s
  cyclictest:1415       |    119.684 ms |     5001 | avg:    0.021 ms | max:    0.084 ms | max at:     48.210471 s
  cyclictest:1417       |     79.254 ms |     3335 | avg:    0.021 ms | max:    0.086 ms | max at:     47.015842 s
  cyclictest:1420       |     52.825 ms |     2224 | avg:    0.021 ms | max:    0.142 ms | max at:     47.014827 s
  cyclictest:1413       |    234.334 ms |    10001 | avg:    0.021 ms | max:    1.860 ms | max at:     41.211163 s
  cyclictest:1419       |     59.219 ms |     2501 | avg:    0.021 ms | max:    0.062 ms | max at:     39.397027 s
  cyclictest:1418       |     66.947 ms |     2858 | avg:    0.021 ms | max:    0.089 ms | max at:     48.040365 s
  cyclictest:1416       |     92.605 ms |     4001 | avg:    0.021 ms | max:    0.619 ms | max at:     41.209499 s
  ...
 -----------------------------------------------------------------------------------------------------------------
  TOTAL:                |   1167.720 ms |    37805 |
 ---------------------------------------------------
  INFO: 0.005% state machine bugs (2 out of 37794)

There are differences between cyclictest's figures and perf's ones because they are not measuring exactly the same things. Perf measures the time between a thread's wake up and its switch onto a CPU, whereas cyclictest measures the time between the wake up time that has been set and the time at which we enter the test function of cyclictest. The latter includes the scheduling latency (measured by perf) but also the wake up latency from the idle state (the C-state's exit latency), the precision of the clockevent and clocksource, and the time to execute the main loop of cyclictest, which can be preempted by other things like IRQs or SoftIRQs.

Check performance


To Be Filled
