Profiling Compression Algorithms

Linux perf was used to profile the xz, bzip2 and gzip tools compressing an 82MB file held in a ramdisk (/dev/shm). Both xz and gzip have relatively fast decompression times, so only compression was examined for these tools. For bzip2, the hotspots for decompression and compression lie in the same code path.

Work was done on a TC2 (A15s enabled, A7s held in reset).

Zlib compression

Zlib is based on the Lempel-Ziv 77 (LZ77) sliding-window compression algorithm. It essentially searches for repeating patterns within a 32KB window that advances through the stream. Decompression is significantly faster than compression and doesn't really warrant optimisation in my opinion (it took 2s to decompress the test file in RAM).

The hotspot for compression is the longest_match function (50% of the CPU time is spent there), for which some (old to very old) assembler optimisations exist. Looking at the perf annotations, I couldn't see anywhere that assembler could significantly improve on the compiled C version of longest_match. Most of the hot instructions are loads whose source address depends on the result of a previous load. There is one tight loop that could potentially benefit from being unrolled, but it contains these dependent, pointer-chasing-style loads, and I couldn't see any way of unrolling it.
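
For illustration, here is a simplified sketch of the structure of longest_match (not zlib's actual code; parameter names such as look_ahead are invented for the example). Each iteration's candidate position comes from the load of prev[cur & wmask], so the addresses used by the next round of loads depend on the result of the previous load.

    #include <stdint.h>

    #define MAX_MATCH 258

    /* Simplified sketch of zlib's longest_match: walk the hash chain for
     * the current position, comparing each earlier candidate against the
     * look-ahead data.  The chained load of prev[] is the dependent-load
     * pattern visible in the perf annotations. */
    static unsigned sketch_longest_match(const uint8_t *window,
                                         const uint16_t *prev, unsigned wmask,
                                         unsigned cur, unsigned limit,
                                         unsigned chain_length,
                                         const uint8_t *scan, unsigned look_ahead)
    {
        unsigned best_len = 0;
        unsigned max = look_ahead < MAX_MATCH ? look_ahead : MAX_MATCH;

        do {
            const uint8_t *match = window + cur;   /* candidate earlier occurrence */
            unsigned len = 0;

            while (len < max && match[len] == scan[len])   /* byte-wise compare */
                len++;
            if (len > best_len)
                best_len = len;

            cur = prev[cur & wmask];   /* dependent load: the next candidate's
                                          address comes from this load's result */
        } while (cur > limit && --chain_length != 0);

        return best_len;
    }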

On systems with an MMU, -DUNALIGNED_OK can be enabled; longest_match will then perform unaligned half-word loads, which speeds up the algorithm by about 4%. In my opinion, however, the potential incompatibilities introduced by enabling this significantly outweigh the small performance increase.
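
As a rough illustration of what -DUNALIGNED_OK buys (a sketch, not zlib's exact code): with unaligned 16-bit loads permitted, two bytes can be compared with one load per side instead of two.

    #include <stdint.h>

    /* With UNALIGNED_OK defined, two bytes are rejected with a single
     * unaligned half-word load per side; the cast relies on the target
     * tolerating unaligned 16-bit accesses, which is what the macro asserts. */
    static int two_bytes_differ(const uint8_t *a, const uint8_t *b)
    {
    #ifdef UNALIGNED_OK
        return *(const uint16_t *)a != *(const uint16_t *)b;  /* one load each side  */
    #else
        return a[0] != b[0] || a[1] != b[1];                   /* two loads each side */
    #endif
    }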

BZ2 Compression

BZ2 compression differs from the conventional Lempel-Ziv dictionary-based algorithms in that it operates on blocks of data (typically 900KB) using a Burrows-Wheeler transform, which can be loosely thought of as a "reversible sort" that rearranges the data into a pattern yielding higher compression ratios with some algorithms (RLE and Huffman coding in bz2). This transform is used for both compression and decompression and is heavily reliant on sorting; 90% of the CPU time is spent sorting blocks.
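
To make the "reversible sort" idea concrete, here is a minimal sketch of the transform itself: sort all rotations of the input and keep the last column. bzip2's real sort in blocksort.c is far more sophisticated and works on whole blocks, but this shows why the algorithm is sort-heavy.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Naive Burrows-Wheeler transform: sort every rotation of the input
     * and emit the last column of the sorted rotation matrix. */
    static const char *text = "banana$";   /* '$' acts as a sentinel */
    static size_t n;

    static int cmp_rotation(const void *a, const void *b)
    {
        size_t i = *(const size_t *)a, j = *(const size_t *)b;
        for (size_t k = 0; k < n; k++) {
            char ci = text[(i + k) % n], cj = text[(j + k) % n];
            if (ci != cj)
                return (unsigned char)ci - (unsigned char)cj;
        }
        return 0;
    }

    int main(void)
    {
        n = strlen(text);
        size_t rot[16];
        for (size_t i = 0; i < n; i++)
            rot[i] = i;
        qsort(rot, n, sizeof rot[0], cmp_rotation);   /* the "reversible sort" */
        for (size_t i = 0; i < n; i++)
            putchar(text[(rot[i] + n - 1) % n]);      /* last column: "annb$aa" */
        putchar('\n');
        return 0;
    }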

The main code for sorting blocks is very heavily optimised and can be found in the mainGtU function in blocksort.c. As blocks are sorted, ordering information is incrementally cached in quadrant descriptors; this reduces memory accesses and speeds up the algorithm greatly.

I've been unable to find any convincing optimisations to speed up bzip2. The hotspot, mainGtU, compiles into what appears to be nearly optimal assembler. Using PLD inside mainGtU to pull the quadrant array into cache lines gave a small speed-up on TC2 and a significant slow-down on other platforms (presumably it was evicting important data from the cache).
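
A sketch of the kind of change that was tried is below. It is illustrative only: the real comparison lives in mainGtU, and the look-ahead distance of 64 entries is an assumption for the example. GCC's __builtin_prefetch compiles to PLD on ARM.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative mainGtU-style comparison loop over quadrant entries,
     * with prefetch hints issued ahead of where the loop will read. */
    static int quadrant_compare(const uint16_t *quadrant,
                                size_t i1, size_t i2, size_t len)
    {
        for (size_t k = 0; k < len; k++) {
            __builtin_prefetch(&quadrant[i1 + k + 64]);  /* PLD on ARM; distance */
            __builtin_prefetch(&quadrant[i2 + k + 64]);  /* is illustrative only */
            if (quadrant[i1 + k] != quadrant[i2 + k])
                return quadrant[i1 + k] < quadrant[i2 + k] ? -1 : 1;
        }
        return 0;
    }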

xz compression

The xz compression tool uses the LZMA1 and LZMA2 algorithms, which are greatly enhanced variants of Lempel-Ziv. Repeating patterns are found via a dictionary (indexed with hash chains and binary trees), which is allowed to be quite large (typically tens of megabytes; it can go up to 1.5GB). The resulting matches and literals are then range encoded. All the heuristics used to optimise this process lead to quite a complex algorithm that takes a long time to compress data, but gives the best compression ratios found so far. Decompression is thankfully very quick (3 seconds for the test file). It should also be noted that xz is extremely user-customisable (the dictionary size, for example, can be chosen with --lzma2=dict=SIZE).

The hotspots for xz compression are bt_skip_func (33%) and bt_find_func (28%) in lz_encoder_mf.c. These functions search the binary-tree representation of the dictionary for a pattern, or advance pointers to skip over one. I found a couple of tight loops in these functions that can be unrolled to give a 3.4% speed boost on A15; unfortunately, on other platforms this had a small detrimental effect. The loops that were unrolled have the following form:

    while (++len != len_limit)
            if (pb[len] != cur[len])
                    break;
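
For reference, one way of expressing a two-way unroll of that loop is sketched below. This shows the kind of change measured rather than the exact patch, and reuses the names (len, len_limit, pb, cur) from lz_encoder_mf.c.

    /* 2-way unrolled equivalent of the loop above (a sketch of the kind of
     * change measured, not necessarily the exact patch); short-circuit
     * evaluation keeps the bounds check ahead of the byte compare, as in
     * the original. */
    for (;;) {
            ++len;
            if (len == len_limit || pb[len] != cur[len])
                    break;
            ++len;
            if (len == len_limit || pb[len] != cur[len])
                    break;
    }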

As xz operates over a large area of memory in a relatively scattered manner, the use of huge pages to reduce TLB thrashing was examined. On A15 this gave a better speed-up than the loop unrolling (about 5.5%), and a reasonable speed boost on the other platforms too.
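
For completeness, a minimal sketch of how a program could request huge pages directly is shown below (assuming huge pages have been reserved via /proc/sys/vm/nr_hugepages). The measurements in the next section did not modify the tools at all; instead libhugetlbfs was preloaded so that malloc() is transparently backed by huge pages.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Length must be a multiple of the huge page size. */
        size_t len = 32UL * 1024 * 1024;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        /* ... use buf as a dictionary/window backed by huge pages ... */
        munmap(buf, len);
        return 0;
    }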

Timings

Motivated by the speed up huge pages gave to xz, I tested the other algorithms too.

An 82MB source-code .tar archive was placed in /dev/shm and compressed with xz, bzip2, and gzip using the default compression settings. To gauge the effect of huge pages, the environment variables LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=y were prepended to the command line, activating libhugetlbfs. Times are measured in seconds.

utility | A15     | A15 Huge | A7      | A7 Huge | A9      | A9 Huge | compressed size
--------+---------+----------+---------+---------+---------+---------+----------------
xz      | 111.18s | 105.09s  | 201.27s | 194.71s | 358.88s | 314.61s | 9.9MB
bzip2   |  58.40s |  58.36s  | 126.79s | 122.84s | 227.41s | 218.17s | 13MB
gzip    |  10.44s |  10.25s  |  18.99s |  19.10s |  38.46s |  38.29s | 16MB

-- SteveCapper
