Optimisation and porting - assembly


I scanned a large number of free software packages in two Linux distributions, looking for places where architecture-specific assembly code might be used so that I could identify likely packages where porting and/or optimisation would be necessary or worthwhile for ARMv8-based and (to a lesser extent) ARMv7-based server platforms.

Thankfully, it became clear that most of the software in common usage in Linux distributions does not rely on assembly code. In the places (1435 packages) where assembly code is used, I have analysed it, categorised it by purpose and then applied rough prioritisation by software area.


I worked through scans of packages in Ubuntu and Fedora looking for x86/ARM assembly. Christian Reis and Matthias Klose provided a list of target packages from scanning the Ubuntu archive; Jon Masters and Al Stone gave me a similar list from the Fedora archive. Each of these lists was generated using different locally-written tools. This might seem strange, but I considered the potential differences useful. More variance in input methods would hopefully make it less likely that something would be missed; after all, the two distributions overlap substantially in terms of the packages included.

Given the lists from both sources, I merged them as well as possible, trying to pick up on places where the same package might have different names across the distros. Then I worked through the long list of source packages that was generated, performing the following 4 steps in each case:

  1. Download and unpack the source
  2. Look for all likely-looking assembly files within the source (*.[sS] *.asm *.ASM, etc.)
  3. Look for inline assembly contained in other source files (*.c *.C *.h *.H *.cpp, etc.)
  4. (By far the longest step) In the cases with actual assembly, try to work out:
    1. the purpose of the assembly code

    2. whether or not the assembly code is used

I did not spend any time specifically looking for the use of intrinsics for SIMD operations (MMX, SSE, NEON, etc.), but I have remarked on it where I saw it in passing. Potentially it might be worth quickly scanning again for such intrinsics in future; this should be much quicker than looking for the generic assembly code in this study.

Raw results

The raw results from my analysis are stored in Google Docs at


for anybody else who might be interested. The raw data there is usable, but only just: the spreadsheet is already quite slow, and the many extra cells I used to help with analysis did not work at all - maybe I overstepped the limits of Google Docs. Given that, an extra copy in a better format is also attached here for reference.


Categorisation

I broadly classified the results from each of the packages into a small set of categories based on the purpose of the "assembly" I found:

  1. Atomics (10.0%) - Use of assembly code for memory barriers, locks, atomic increment/decrement, etc.

  2. Embedded library code (18.1%) - Assembly code found in embedded copies of source code from other packages (e.g. libjpeg, gettext)

  3. False positives (11.1%) - Code or text that appeared in the scan results but either was not assembly code (e.g. data files with file names ending in .s), or was commented out or unused for some other reason

  4. Lowlevel (38.1%) - Assembly code for various "lowlevel" purposes such as direct hardware control (registers, stack, hardware ports)

  5. OtherOS (9.3%) - Assembly code included for support for other platforms or operating systems and so not directly relevant to us

  6. Performance (30.4%) - Assembly code intended (though not always managing!) to increase performance in some way (e.g. SIMD for multimedia acceleration, replacement of code in algorithm inner loops)

  7. Symbols/sections etc. (2.9%) - Code using "asm" to directly control symbol access or to designate ELF sections (e.g. linker scripts)

The numbers here show percentages of the 1435 packages that fit each category; the total is clearly not 100%, as packages may fit into more than one category and some cover all of them!

It also became clear that some of the packages could be totally ignored for our purposes, as they will never need porting. Examples would be packages that are inherently architecture-specific (e.g. boot loaders) or programs that only provide useful output with very specific hardware (e.g. programs to display/tune parameters on certain graphics hardware).

Common findings

Looking at the assembly results, it was easy to find patterns in usage:

Byteswap and bitops

Many packages use assembly for performance in byte-swapping code and bitwise operations. Much of this code seemed to be "cargo-culted" from other places, i.e. copied in to make things run faster without necessarily understanding what the code does, or how it works. This type of operation is a great example of code that should be easy for a compiler to recognise and optimise well, and I suspect that many developers added assembly here in the past to work around compiler limitations. It should no longer be necessary, and could even be slower than the output of a good compiler these days. It would be worthwhile digging further into some of this code and verifying that.

Hardware identification

Another common case for architecture-specific assembly code is for hardware identification. The CPUID instruction in x86 assembly returns information about the CPU on which code is running, including things such as hardware family, stepping and (the most commonly-used information) availability of optional features such as SIMD (MMX/SSE/SSE2 etc.) Similar features are available on other platforms too.

In a lot of the packages identifying CPU resources like this, further code will later use the results to choose between various functions when performing CPU-heavy operations such as graphics processing. Some other packages don't actually use the results for anything beyond diagnostic messages at startup; maybe some of the developers here are expecting to make more use of CPU-specific features in the future.

It seems quite a shame that (as far as I could find) nobody has written a simple library to deal with hardware identification like this. There is the hwcaps feature provided by the Linux kernel and exposed via glibc, but for whatever reasons (portability concerns?) developers are clearly not making much use of this. It is quite cumbersome to use, requiring the installation of completely separate copies of libraries rather than simply choosing an optimised version of a function at runtime in an otherwise-portable library.

Timer access

Lots of programs also include direct use of the x86-specific RDTSC instruction to read the timestamp counter, a register providing a simple count of cycles since reset. Similar code exists for direct access to low-level timers on various other platforms. This is typically used in benchmarks and profiling, particularly for measuring the performance of a small number of instructions or short functions. In most cases this timing information is not critical to functionality - it is simply disabled on other platforms, or there is a fallback to gettimeofday() and the like. This is another place where it is frustrating that there is no common macro/inline function provided by system libraries to abstract the interface.


SIMD

A common feature in multimedia libraries is the growing use of SIMD (Single Instruction, Multiple Data) technology for performance. On Intel platforms there are multiple generations of SIMD (MMX, SSE, SSE2, etc.), PowerPC has AltiVec and most ARMv7+ cores include NEON. The recommended way to write code for these instructions is normally to use compiler-provided intrinsics, but much of the SIMD code I found in the scan is still written directly in assembly.

For LEG purposes, multimedia libraries and (especially) desktop applications only merit a low priority for optimisation, so I did not devote much time to deeper investigation of the SIMD code there. There are likely to be other places where we could optimise for ARMv7 and ARMv8 using NEON, however. Common CPU-intensive operations on servers include checksumming, compression and encryption, and depending on the algorithms in use there could be substantial performance improvements available.

A further scan of code to look for use of SIMD intrinsics would not be too difficult, but is beyond the scope of this study. Searching for C/C++ source files including one of the <*mmintrin.h> family of system header files will quickly highlight Intel SIMD usage, for example.


Atomics

A lot of packages are designed to be used safely in multi-threaded environments, and this necessitates the use of atomic operations. There are a number of different primitives available, including many provided as standard in system libraries like pthreads or even by compilers themselves (e.g. __sync_val_compare_and_swap for test-and-set semantics, or __sync_synchronize to provide a memory barrier in recent versions of gcc). Using readily available code like this is good for portability, and also for correctness - it's much more likely that compiler and system library engineers will do a good job of testing and optimising on all platforms than most third-party developers.

Unfortunately, a large number of packages in the scan were found to use assembly of one sort or another to implement atomics. In many cases it was not really clear why developers chose to do this. At least in some places, I found that code would attempt to use the gcc-provided builtin atomic functions if available, only using platform-specific assembly as a fallback. However, other developers rely solely on their own atomics (most likely borrowed from elsewhere), with varying levels of support for less-common (non-x86) platforms:

  • disable threading for non-supported platforms at build time
  • allow threading configuration for non-supported platforms, but #error at compile time
  • (worst) allow threading for all platforms, but without any code to force atomicity on unsupported platforms

Floating-point control

A surprisingly large number of packages include x86 assembly code to directly control the floating point unit, setting options like rounding modes, precision and exception masks. There don't seem to be any similar uses of assembly code for non-x86 platforms. It's especially disappointing to see so much use of assembly here in the first place; C99 defines standard routines for controlling FPU behaviour like this, e.g. fegetenv() and fesetround(). It must be assumed that many of the assembly uses here either pre-date C99 adoption in toolchains or are further examples of code copying.

Embedded libraries

A very common feature found in the scan results was embedded copies of libraries, i.e. a package containing a copy of source code from some other project(s). This is often done for one of two reasons:

  • to make it easier for end users to build and use the package
  • to allow upstream developers to rely on a specific (and maybe locally-modified) version of the library

The first of these reasons is typically unimportant; the vast majority of end users will be using software built and packaged by a distributor in some way, and it is fair to expect that the distributor will be able to find and manage shared libraries correctly. In the latter case, some common projects do not provide useful shared libraries (e.g. with stable API and ABI) for others to work with.

There are a number of downsides to copying library code, though. The most likely source of problems is embedded library copies not being updated over time. As time passes, bugs will hopefully be fixed and other improvements (performance, behaviour changes, etc.) made in the library code, but the embedded copies may lag behind. This can lead to security holes remaining unpatched, and all manner of other issues. Embedded copies of code also lead to inefficiencies on the end user's system.

The most common embedded libraries I found, in popularity order, were:

  • gnulib (essentially a false positive - it's designed as a library of code for people to copy and embed; whether this is a good idea or not...)

  • gettext (again, basically a false positive; it uses "asm" for lowlevel control of symbol exporting in Cygwin)

  • libgc - the Boehm garbage collection library

  • sqlite - embedded SQL database engine

  • zlib - compression library

  • libjpeg - support for the JPEG graphics format

The later entries in this list provide some good examples of embedded library problems. In some of the places where it is embedded, libgc appears to have been modified for various reasons, which will make it harder to keep up with new upstream versions. libgc itself further embeds a copy of libatomic-ops, which will need updating to add support for newer CPU architectures like AArch64. zlib is a commonly-used library with a very good track record of ABI and API stability, but there have been some security holes found and fixed in its long history; packages using an embedded copy of zlib may take a very long time to pick up those security fixes.

As/when/if optimisation work happens on these embedded libraries, it would be worth contacting the upstream developers who use them to make sure that they pick up new versions including the changes.


Patterns

In the scan results, some patterns quickly became clear:

The Good

Many developers do know about the gcc intrinsics and builtins that are available, and are starting to make use of them. In some cases, they have written their code entirely to use the features provided by the compiler; in some others they still use some assembly routines but only as a fallback in case the user is building with an obsolete compiler.

It seems that some packages are moving away from using inline assembly; older versions included code for performance etc. but developers are re-evaluating its effects and have removed it in newer releases.

Both of these patterns will help with software portability.

The Bad

There is far too much cargo-culted assembly code in use today. In many of these cases, developers have clearly seen code used elsewhere and copied it in. If it works elsewhere, it must be worthwhile? At first glance, several such lumps of code also appeared to be buggy; this is a common issue when re-using code that is not well understood!

There are also far too many embedded libraries - see above.

Both of these patterns make software porting much harder than it needs to be, and will be causing unnecessary bugs besides.

The Ugly^WComical

I found a number of places where assembly code was accompanied by comments along the lines of

gcc 2.7 optimises this code incorrectly

This is clearly forgivable in old code that has not been modified in a decade or so, but points to failing maintenance.

Ten packages contained assembly code for Vax machines; in some cases it was clear that the code in question was first written in Vax assembly and then ported to C for those weird, new-fangled Sun workstations (in the 1980s).

There were lots of places where assembly code had been written at some point, but then either disabled for some reason (maybe testing showed it didn't work?) or just not actually used in the code at all. A perfect example here was an IRC client including hand-crafted, carefully-optimised (and then carefully commented-out) assembly routines for string handling performance.

Priorities for porting

In the spreadsheet, I have assigned scores to packages following some rough guidelines (the higher the score, the more important I consider a package to be). See the "Manual Priority" column.

  • -1 for packages that do not merit porting for some reason (e.g. for a totally platform-specific package)

  • 5 for games

  • 10 for desktop applications

  • 20 for most other packages

  • higher numbers for packages that look/feel important (core toolchain, core libraries, etc.)

These scores are mostly arbitrary. I was specifically focusing on packages that would be expected or useful on a server, as my work is in the Linaro Enterprise Group. I would definitely expect other people to prioritise differently! I considered also dropping the scores applied to multimedia libraries, but a likely workload on an AArch64 server farm could well be multimedia content generation / transformation so I left them alone.

Finally, I added some extra scores for dependencies. I started off with likely-looking server / web farm package lists from the Ubuntu Server Guide and used a germinate-based script to create a list of all the packages needed to fulfil dependencies. For any of those packages that were listed in the spreadsheet, I gave a +20 boost to the priority - see the "Dependency Priority" column.

What needs doing?

There is not a simple list here. Many of the obvious candidates for port work are already underway or even completed, as they are core packages:

  • linux
  • gcc
  • (e)glibc
  • klibc
  • libffi
  • binutils
  • gdb
  • device-tree-compiler
  • libunwind
  • openjdk
  • mpfr
  • llvm
  • gmp

There are some obvious packages that LEG members will need/want to see ported and/or optimised:

  • grub2
  • TBB
  • openssl
  • libatomic-ops
  • libgcrypt
  • php
  • postgres
  • mysql
  • libaio

and probably more that will become clear later. There is also a clear list of packages that may not be critical to LEG work, but are very widely used and thus may be helpful in general. They are worth investigating, at least:

  • zlib
  • libgc
  • libjpeg
  • dlmalloc
  • nss
  • gnulib

Other code that will matter to other users of AArch64 in the future will obviously include multimedia libraries, games etc. I expect there will be porting and optimisation work around those areas in the future, most likely community driven.

Help for developers

Alongside direct porting efforts, the results here suggest other work that should be done. Having identified several bad patterns in the packages here, we should really start helping and advising developers to do what we consider the right things. This starts with documentation, ideally well publicised information that people will find useful immediately.

In those places where portable code already exists that could replace unnecessary assembly code, we should provide example code for people to use or learn from as they see fit. For a lot of the obvious cases, developers should be able to drop in replacements without much effort; in less clear cases, good examples of best practice can be invaluable. Most atomics should be covered this way.

Secondly, we should help people to trust their compiler more. Benchmarks of hand-optimised assembly may not be very easy to generate, but highlighting how the code works and how compiler optimisations have improved may be a good start. Fundamentally, outside of special cases the compiler should be much better at producing good fast code than programmers writing assembly directly.

For those places where people are using assembly for more than just performance (e.g. lowlevel hardware access), we should try to provide library routines for people to use instead of the assembly that needs porting all the time. We should push some of these into existing libraries where possible (glibc maybe?), or where not possible then maybe something like libatomic-ops. Help the developers write to clean higher-level APIs instead of in assembly where possible.

Finally, we should at the very least reach out to developers to help them. They may not want direct patches from us or what they consider to be intrusive code changes, but we should be able to provide expert guidance on porting to AArch64 to get the best results.


Conclusions

Most software doesn't need (assembly) porting

Although this document spends a lot of time discussing the assembly code that I found, the first point to make is that most software in a typical Linux distribution does not contain any assembly at all. Ubuntu "Raring Ringtail", the current development release expected in April 2013, contains over 20,000 source packages. The assembly scan found just over 1,200 of those (in Ubuntu) to contain assembly, i.e. approximately 6%.

Most that include assembly will work anyway

Of the packages that do include assembly, most will already work on ARMv7 and ARMv8. Some may need porting for performance or for all of their functionality to be enabled, but many will work just fine regardless.

Most of the assembly has little value

Much of the assembly that I found in the scan here actually has very little value. Lots of it is trivial code that developers may expect to give performance gains, but is likely to be overwhelmed by other considerations. We need to look at some of these in more detail to see what does matter.

Work to do - porting and communication

We do have some porting work to be done, and just as importantly we have some documentation to write and some developers to work with.


Code used here:

I presented this work at LCA13 in Hong Kong: see the slides or the video


While I've tried my best to be accurate in the scanning and analysis here, with such a large body of data to work on it's always possible that I've made mistakes here and there. Also: the code I was looking at in both Ubuntu and Fedora was in their development repositories and hence a moving target. In some cases, package versions found in the scan results had changed by the time I started the deeper analysis.

Apologies for any issues that you may find; please point them out to me at steve.mcintyre@linaro.org and I'll endeavour to fix them.


  • 2013-04-02: Updated entry for fpc after contact from Marco van de Voort; fpc includes embedded assembly code in files such as *.pas and *.inc too, and has some asm for performance as far as I can see. It will need porting for AArch64, and upstream developers are working on that already.

  • 2013-04-03: Updated entry for orc after contact from David Schleef; "Orc is a compiler for SIMD code, much like a JS jit engine, but specifically for SIMD. It also has a NEON backend for 32-bit ARM." No AArch64 support yet due to lack of hardware to test/work with.


LEG/Engineering/OPTIM/Assembly (last modified 2013-04-03 17:51:51)