-O3 writeup

draft r0

(We work on performance. Some of our improvements aren't turned on until -O3. -O3 can be a performance boost in itself. This writeup shows the state of building at -O3 and lets a product maker know the tradeoffs)

See:

PENDING Title

Introduction

This page looks into the higher optimisation levels available in GCC and how they affect performance, size, and correctness from a product point of view. This information can be used when making a product or distribution to understand the impact of higher optimisation levels and the potential gains from them.

Like all compilers, GCC takes source code, transforms it, optimises it, and emits the machine code that a processor can run. GCC 4.6 contains approximately PENDING passes, each of which performs different transformations or optimisations on the code. For ease of use these are grouped into optimisation levels, where each higher level turns on more passes or tweaks existing ones (PENDING: confirm tweak).

The different levels can be thought of in terms of how debuggable the final code is and how incremental the optimisations are:

-O0 has no optimisations. The program is easy to debug but slow and big.

-O1 targets optimisations that improve the speed and size without making the program harder to debug.

-O2 is the standard optimisation level. Speed and size are further improved, but the program may be harder to debug because variables are optimised away or the program order is changed. Optimisations at -O2 should be a strict improvement over -O1, i.e. any program compiled at -O2 should run faster than at -O1.

-O3 is the highest optimisation level. This turns on the more aggressive and specialised optimisations. The main difference with -O2 is that the optimisation may not be a strict improvement, and in some cases will generate code that is slower than at -O2.

Two related areas are -Ofast and the vectoriser. -Ofast is the same as -O3 plus the 'fast math' option, which allows GCC to transform some floating point operations instead of strictly following the IEEE 754 standard. This allows expressions such as x / 7.0 to be transformed into the numerically slightly different but much faster x * (1 / 7.0).
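The numerical effect of the reciprocal transformation can be seen without a compiler. The sketch below is illustrative only: it uses Python floats, which are IEEE 754 doubles on common platforms, to compare x / 7.0 with x * (1 / 7.0) over a small range. The function name and the input range are assumptions made for this example and are not part of GCC.

```python
def reciprocal_mismatches(divisor, values):
    """Return the values where x / divisor and x * (1 / divisor) differ."""
    reciprocal = 1.0 / divisor  # computed once, as the transformed form would
    return [x for x in values if x / divisor != x * reciprocal]


values = [float(n) for n in range(1, 1001)]
mismatches = reciprocal_mismatches(7.0, values)

# Largest relative difference between the two forms over the range.
worst = max(abs(x / 7.0 - x * (1.0 / 7.0)) / (x / 7.0) for x in values)

print(f"{len(mismatches)} of {len(values)} inputs differ")
print(f"largest relative difference: {worst:.3e}")
```

The differences are confined to the last bit or so of the result, which is why the transformation is acceptable for many programs but is not allowed under strict IEEE 754 rules.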

The vectoriser is turned on by default at -O3. The vectoriser recognises data parallel code and converts it into the equivalent NEON form. Data parallel code is where the same operation is done to different values, such as doing an operation on every pixel in an image. This report includes results both with and without the vectoriser.
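As a rough illustration of what data parallel means, the sketch below applies the same saturating add to every pixel, first one value at a time and then in groups of four, loosely imitating how a vector unit handles several lanes per instruction. This is plain Python with hypothetical names (brighten_scalar, brighten_lanes, LANES) chosen for the example; it does not generate or model real NEON code.

```python
LANES = 4  # assumed lane count, for illustration only


def brighten_scalar(pixels, amount):
    """Add `amount` to each 8-bit pixel, one element per step."""
    out = []
    for p in pixels:
        out.append(min(p + amount, 255))  # saturate at the 8-bit maximum
    return out


def brighten_lanes(pixels, amount):
    """Same operation, but handling LANES pixels per step."""
    out = []
    full = len(pixels) - len(pixels) % LANES
    for i in range(0, full, LANES):
        group = pixels[i:i + LANES]  # one "vector" of pixels
        out.extend(min(p + amount, 255) for p in group)
    # Leftover pixels that do not fill a whole group are done one by one.
    for p in pixels[full:]:
        out.append(min(p + amount, 255))
    return out


pixels = [0, 10, 100, 200, 250, 255, 3]
print(brighten_scalar(pixels, 20))
print(brighten_lanes(pixels, 20))
```

Both functions give the same result; the second only changes how the work is grouped, which is the property the vectoriser relies on.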

Take EEMBC CoreMark as an example. The table below shows the difference in speed and size on a typical Cortex-A9 compared to -O1:

level   size   speed
-O0     158%   32%
-O1     100%   100%
-O2     96%    114%
-O3     205%   127%
-Os     96%    90%

-O0 is a third of the speed and 58 % larger. -O2 is faster and approximately the same size. -O3 is faster again and, in this case, much larger.

PENDING:

  • What -O3 is
  • What it does
  • Why we care
  • Tradeoffs
  • Scope of the writeup: correctness, speed, size

Method

All tests below were performed using Linaro GCC 4.6-2012.01 on a typical dual core Cortex-A9. In the cross build cases, the pre-built binary toolchain was used. For benchmarking, the cbuild auto build system was used to compile and run natively.

Correctness

Compiler errors are divided into build time, such as a package failing to build from source due to triggering an internal compiler error, and run time, such as generating bad code that causes a crash or wrong result.

Linaro GCC was tested by cross-building OpenEmbedded Core. OpenEmbedded is a popular source based distribution that allows you to pick and customise the packages you want in your product and then cross build them from scratch for the target. The Core packages are a cut down, supported subset of the greater OpenEmbedded set and this core is shared with the Yocto project.

While OpenEmbedded Core is a subset, it still includes significant packages such as GTK, Qt, WebKit, and the Linux kernel.

The number of packages and lines of code for the large configurations are:

target   # packages   # lines of code
sato     261          18,366,587
qt       114

OpenEmbedded Core was built in all three configurations with no build time errors.

Run time faults are harder to measure as they are significantly rarer than build time faults. The best approach to checking for run time faults is to run a large body of code, such as the test suites that come with the packages. The slow speed of native testing and the difficulty of cross testing make this hard, so it was not done for this report.

A future project could cover cross testing more packages to gather evidence on run time errors.

Performance

Performance was measured using the industry-standard EEMBC and SPEC 2000 benchmarks. The Linaro toolchain working group target portable devices and as such use mobile and media related benchmarks.

Performance was measured by building and running these benchmarks natively. The graphs below show the improvement relative to the default -O2 optimisation level. -Os is included as a reference to show the impact that optimising for size has.

The first graphs show the overall improvement. The improvement was calculated by taking the geometric mean of all results at each level and dividing it by the geometric mean at the reference level. The geometric mean reduces the effect that outliers have on the results.
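The calculation can be sketched as follows. This is a minimal example, not the script used for the report: the function names are hypothetical and the scores are made-up numbers chosen so that one benchmark is a large outlier.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def overall_improvement(level_scores, reference_scores):
    """Ratio of the geometric means of two sets of benchmark scores."""
    return geometric_mean(level_scores) / geometric_mean(reference_scores)


# Made-up scores (higher is better) for three benchmarks; not measured data.
reference = [100.0, 200.0, 50.0]   # scores at the reference level
candidate = [110.0, 190.0, 200.0]  # the third benchmark is an outlier

ratios = [c / r for c, r in zip(candidate, reference)]
geo = overall_improvement(candidate, reference)
arith = sum(ratios) / len(ratios)

print(f"geometric mean improvement: {geo:.3f}")
print(f"arithmetic mean of ratios:  {arith:.3f}")
```

With these inputs the geometric figure is noticeably lower than the arithmetic one, which shows how a single large gain is prevented from dominating the overall result.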

The later graphs show the improvement on each sub benchmark. This shows the balance of regressions against improvements and highlights the worst case regression and best case improvements.

EEMBC Overall

DENBench Overall

SPEC 2000 Overall

EEMBC by regression

<Particular regressions> <Identification>

DENBench by regression

SPEC 2000 by regression

The EEMBC vectoriser results are worth noting. The EEMBC benchmarks tend to be small, focused kernels that do one function. These kernels are very susceptible to small changes in the compiler but can also benefit greatly from auto vectorisation. Significant results include:

  • Foo by 400 %
  • Bar by 350 %

PENDING:

  • Base off next week's release
  • Method (benchmark)
  • Profile (mobile) and benchmarks we try
  • List of benchmarks
  • PENDING: pick a few open ones that we can test. pybench? scimark? Good ones from Phoronix?
  • Results of -O2 vs -O3 vs -O3 -fno-tree-vectorize
  • Internal: Results of -O2 vs -O3 vs -O3 -fno-tree-vectorize with -mfpu=vfpv3-d16
  • Graphs:
    • Absolute performance?
    • Performance gain, sorted by name
    • Performance gain, sorted by gain

Size

The increase in size is less important than in the past due to how cheap Flash memory has become. It is still of interest, as every byte taken up by an application reduces the space for the end user's files.

Size can be measured in different ways depending on the application:

  • text measures the bytes of executable code
  • On disk measures the size of an executable including the non-executable data
  • File system measures the size of all executables plus other data files such as media, configuration, and support
  • In memory measures the amount of RAM used in a running system

Each measurement mixes in other data which is not affected by the optimisation level, which reduces the percentage increase. This page uses the increase in .text size as it is the most conservative: it measures only the number of bytes of executable code. The final relative increase in file system size will be less.

The on-disk size

The minimal image holds enough to boot to a Busybox command prompt.

core-image-minimal        file size   .text     .data   .bss      increase vs -O2
-O0                       5336716     4849760   66917   1905424   11.0%
-O1                       4876548     4387713   66045   1905416   1.4%
-O2                       4808500     4322441   66053   1905424   0.0%
-O3 -fno-tree-vectorize   4856108     4367396   66049   1905424   1.0%
-O3                       4860448     4370586   66049   1905424   1.1%
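The final column of the table appears to be the growth in file size relative to -O2. That reading is an assumption, but it reproduces the listed percentages, as the short sketch below shows. The helper name percent_increase is hypothetical; the sizes are copied from the table.

```python
# File sizes in bytes for core-image-minimal, copied from the table above.
FILE_SIZE = {
    "-O0": 5336716,
    "-O1": 4876548,
    "-O2": 4808500,
    "-O3 -fno-tree-vectorize": 4856108,
    "-O3": 4860448,
}


def percent_increase(size, baseline):
    """Percentage by which `size` exceeds `baseline`."""
    return (size / baseline - 1.0) * 100.0


baseline = FILE_SIZE["-O2"]
increase = {level: round(percent_increase(size, baseline), 1)
            for level, size in FILE_SIZE.items()}

for level, pct in increase.items():
    print(f"{level:24s} {pct:5.1f}%")
```

The same function can be applied to the .text column to get the increase in executable code alone.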

These results are dominated by GLIBC which, due to using a pre-built toolchain, was the same in all builds.

Sato is an example mobile user interface that is included with OpenEmbedded. It includes applications such as a launcher, calendar, and contacts, which are built on top of a GNOME Mobile stack.

This represents a typical GTK+ based application and includes the X Window System, Matchbox window manager, GTK+, D-Bus, and GStreamer.

core-image-sato           file size   .text      .data     .bss
-O0                       65406023    56627187
-O1                       52029807    42941832
-O2                       51070083    41964949
-O3 -fno-tree-vectorize   54325435    44985158   2109110   6529580
-O3                       54506651    45156576

Qt is an application and UI framework popular in embedded and desktop applications. To compare the impact on a library versus a distribution, Qt 4.8 was built from source and compared. The final install includes a WebKit browser, JavaScript engine, SQL support, and UI library.

qt4e-demo-image           file size   .text       .data     .bss      increase vs -O2
-O0                       0           0           0
-O1                       104594041   98679519
-O2                       103458087   97447245
-O3 -fno-tree-vectorize   108206431   102170749   2250437   2432904   4.6%
-O3                       108419995   102386907   2250437

The in memory footprint was measured on a just-booted but otherwise idle system. This is difficult to measure as the memory use is time dependent.

image                -O0       -O1       -O2       -O3 -fno-tree-vectorize   -O3       Increase at -O3
core-image-minimal   13632.6   13275.8   13215.4   13300.2                   13275.2   0.5%
core-image-sato      55459     49497.6   49127.6   50744.8                   50863.8   3.5%
qt4e-demo-image      n/a       64085.4   61901.6   62950.6                   63047.8   1.9%

PENDING:

  • Graphs of

Summary

Change in size:

  • 1.1 % for a minimal distribution
  • 6.7 % for a GNOME Mobile based application
  • 5.1 % for a Qt based application

Change in speed:

  • PENDING

No correctness problems were found while researching this document. Linaro are focused on performance and will investigate and fix any optimisation problems found.

MichaelHope/Sandbox/O3Writeup (last modified 2012-01-30 22:36:55)