(We work on performance. Some of our improvements are not enabled until -O3, and -O3 can be a performance boost in itself. This writeup shows the state of building at -O3 and lets a product maker understand the tradeoffs.)
This page looks into the higher optimisation levels available in GCC and how they affect performance, size, and correctness from a product point of view. This information can be used when making a product or distribution to understand the impact of the higher optimisation levels and the potential gains from them.
Like all compilers, GCC takes source code, transforms it, optimises it, and emits the machine code that a processor can run. GCC 4.6 contains approximately PENDING passes, each of which performs a different transformation or optimisation on the code. For ease of use these are grouped into optimisation levels, where each higher level turns on more passes or tweaks existing ones PENDING: confirm tweak.
The different levels can be thought of in terms of how debuggable the final code is and how incremental the optimisations are:
-O0 has no optimisations. The program is easy to debug but slow and big.
-O1 targets optimisations that improve the speed and size without making the program harder to debug.
-O2 is the standard optimisation level. Speed and size are further improved, but the program may be harder to debug as variables may be optimised away or the program order changed. Optimisations at -O2 should be a strict improvement over -O1, i.e. any program compiled at -O2 should run faster than at -O1.
-O3 is the highest optimisation level. This turns on the more aggressive and specialised optimisations. The main difference from -O2 is that an optimisation may not be a strict improvement, and in some cases will generate code that is slower than at -O2.
Two related areas are -Ofast and the vectoriser. -Ofast is the same as -O3 plus the 'fast math' option, which allows GCC to transform some floating point operations instead of strictly following the IEEE 754 standard. This allows expressions such as x / 7.0 to be transformed into the numerically slightly different but much faster x * (1 / 7.0).
The vectoriser is turned on by default at -O3. The vectoriser recognises data parallel code and converts it into the equivalent NEON form. Data parallel code is where the same operation is done to different values, such as doing an operation on every pixel in an image. This report includes results both with and without the vectoriser.
Take EEMBC CoreMark as an example. The table below shows the difference in speed and size on a typical Cortex-A9 compared to -O1:
-O0 is a third of the speed and 58 % larger. -O2 is faster and approximately the same size. -O3 is faster again and, in this case, much larger.
- What -O3 is
- What it does
- Why we care
- Scope of the writeup: correctness, speed, size
All tests below were performed using Linaro GCC 4.6-2012.01 on a typical dual core Cortex-A9. In the cross build cases, the pre-built binary toolchain was used. For benchmarking, the cbuild auto build system was used to compile and run natively.
Compiler errors are divided into build time errors, such as a package failing to build from source due to triggering an internal compiler error, and run time errors, such as bad code being generated that causes a crash or a wrong result.
Linaro GCC was tested by cross-building OpenEmbedded Core. OpenEmbedded is a popular source based distribution that allows you to pick and customise the packages you want in your product and then cross build them from scratch for the target. The Core packages are a cut down, supported subset of the greater OpenEmbedded set and this core is shared with the Yocto project.
The number of packages and lines of code for the large configurations are:
OpenEmbedded Core was built in all three configurations with no build time errors.
Run time faults are harder to measure as they are significantly rarer than build time faults. The best approach to checking for them is to run a large body of code, such as the test suites that come with the packages. The slow speed of native testing and the difficulty of cross testing made this impractical, so it was not done for this report.
A future project could cover cross testing more packages to gather evidence on run time errors.
Performance was measured using the industry-standard EEMBC and SPEC 2000 benchmarks. The Linaro toolchain working group targets portable devices and as such uses mobile and media related benchmarks.
The benchmarks were built and run natively. The graphs below show the improvement relative to the default -O2 optimisation level. -Os is included as a reference to show the impact that optimising for size has.
The first graphs show the overall improvement. The improvement was calculated by taking the geometric mean of all results at each level and dividing it by the overall reference. The geometric mean reduces the effect that outliers have on the results.
The later graphs show the improvement on each sub benchmark. This shows the balance of regressions against improvements and highlights the worst case regression and best case improvements.
SPEC 2000 Overall
EEMBC by regression
<Particular regressions> <Identification>
DENBench by regression
SPEC 2000 by regression
The EEMBC vectoriser results are worth noting. The EEMBC benchmarks tend to be small, focused kernels that do one function. These kernels are very susceptible to small changes in the compiler but can also benefit greatly from auto vectorisation. Significant results include:
- Foo by 400 %
- Bar by 350 %
- Based off next week's release
- Method (benchmark)
- Profile (mobile) and benchmarks we try
- List of benchmarks
- PENDING: pick a few open ones that we can test. pybench? scimark? Good ones from Phoronix?
- Results of -O2 vs -O3 vs -O3 -fno-tree-vectorize
- Internal: Results of -O2 vs -O3 vs -O3 -fno-tree-vectorize with -mfpu=vfpv3-d16
- Absolute performance?
- Performance gain, sorted by name
- Performance gain, sorted by gain
The increase in size is less important than it used to be due to how cheap Flash memory has become. It still matters, however, as every byte taken up by an application reduces the space available for the end user's files.
Size can be measured in different ways depending on the application:
- text measures the bytes of executable code
- On disk measures the size of an executable including the non-executable data
- File system measures the size of all executables plus other data files such as media, configuration, and support
- In memory measures the amount of RAM used in a running system
Each measurement mixes in other data which is not affected by the optimisation level, which dilutes the percentage increase. This page uses the increase in text size as it is the most conservative measure, counting only the bytes of executable code. The final relative increase in file system size will be smaller.
The minimal image holds enough to boot to a Busybox command prompt.
These results are dominated by GLIBC which, due to using a pre-built toolchain, was the same in all builds.
Sato is an example mobile user interface that is included with OpenEmbedded. It includes applications such as a launcher, calendar, and contacts, which are built on top of a GNOME Mobile stack.
This represents a typical GTK+ based application and includes the X Window System, Matchbox window manager, GTK+, D-Bus, and GStreamer.
The in memory footprint was measured on a just-booted but otherwise idle system. This is difficult to measure as the memory use is time dependent.
- Graphs of
Change in size:
- 1.1 % for a minimal distribution
- 6.7 % for a GNOME Mobile based application
- 5.1 % for a Qt based application
Change in speed:
No correctness problems were found while researching this document. Linaro are focused on performance and will investigate and fix any optimisation problems found.
MichaelHope/Sandbox/O3Writeup (last modified 2012-01-30 22:36:55)