Summary

Adapt the block layer and file systems to be aware of low-end flash media characteristics to improve performance.

Rationale

The previous cycle's work on analyzing flash media (https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey) has shown systematic problems with the way that Linux treats flash media as if they were hard drives, while the media expect specific access patterns that Linux does not provide, typically those of a single-partition FAT32 file system with a specific geometry. This causes very high write amplification, which manifests as low performance and reduced longevity of the media.

Fortunately, there are multiple ways to improve the current situation by changing the way that Linux operates on these devices.

User stories

  • Cassandra uses SD cards to store data on her mobile phone. Unfortunately, her data constantly gets destroyed as the cards wear out. With the reduced write amplification that we get by improving the block layer, her cards actually last long enough for Cassandra to show other people the documents she created.
  • Steve always buys the newest and fastest hardware. For some reason, it still takes him a long time to do anything with it, because even a $200 SD card takes half a second to write a single data block when he is doing many things in parallel. With the block layer optimization, this can be a hundred times faster.

Work areas

Elevator support for erase block handling

See https://wiki.linaro.org/WorkingGroups/Kernel/Specs/io-scheduler-support-for-erase-blocks

Make sure that the elevator always schedules requests that fall within the same erase block together. This requires coming up with block layer infrastructure for communicating erase block boundaries between the individual block drivers (sata, usb, mmc, ms, ...) and the elevator, possibly with overrides from the file system and from user space.
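As a rough illustration only (an assumed design, not existing kernel API), the per-queue flash geometry that a driver would announce and that the elevator, file systems and user space could read or override might look roughly like this; all names here are assumptions:

struct flash_geometry {                   /* hypothetical, simplified */
	unsigned int erase_block_bytes;   /* 0 = unknown / not flash        */
	unsigned int page_bytes;          /* write page within the block    */
	unsigned int open_erase_blocks;   /* blocks the card handles at once */
	_Bool        random_write_ok;     /* card tolerates out-of-order writes */
};

/* The driver would fill this in from detection heuristics or a quirk
 * table; a sysfs knob could later let user space correct the values. */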

There are multiple elevators in the kernel; the most commonly used one is the CFQ scheduler, so that should be the main focus. One important task of the elevator is to minimize seeks by ordering read and write requests so that each one is issued with the minimum distance from the previous one, which works best for hard drives. There is already a mode for high-end flash storage (the nonrotational mode) that ignores the distance and only groups large contiguous I/Os together.

The idea here is to add a third mode for low-end flash, with the following parameters (a rough sketch of the ordering rules follows the list):

  • Ignore location for read requests, but not for write requests. There are generally no constraints on how reads have to be located for optimum performance even on the low end devices. Possibly avoid merging read requests across an erase block boundary, as this typically involves a performance penalty and is not actually faster than doing separate requests.
  • Never merge write requests across an erase block boundary.
  • After one write has been issued within an erase block, issue all other outstanding writes for that erase block before any other writes. This can drastically reduce write amplification because it avoids the garbage collection inside the drive that is otherwise triggered by alternating between erase blocks.
  • Do all writes within an erase block in ascending order, never descending. While the majority of devices can deal with an arbitrary order, some cannot and take a significant performance hit because they have to garbage-collect the erase block every time they move backwards. This also helps in cases where we fail to detect the hardware erase block size and guess a larger value.
  • After all outstanding writes for one erase block are done, choose the next erase block based on the scheduler's other constraints, independent of its position. There is no need to preserve locality across an erase block boundary, so ignoring it lets us optimize better for other constraints.
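The following is a sketch only, not kernel code: the helpers erase_block_of(), may_merge_writes() and dispatch_write_now() are assumed names used to illustrate the merge and dispatch rules above.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Which erase block a sector falls into, given the erase block size in sectors. */
static inline uint64_t erase_block_of(sector_t sector, unsigned int eb_sectors)
{
	return sector / eb_sectors;
}

/* Rule: never merge two write requests if the merged request would
 * cross an erase block boundary. */
static bool may_merge_writes(sector_t a_start, sector_t a_sectors,
			     sector_t b_start, sector_t b_sectors,
			     unsigned int eb_sectors)
{
	sector_t first = a_start < b_start ? a_start : b_start;
	sector_t a_end = a_start + a_sectors - 1;
	sector_t b_end = b_start + b_sectors - 1;
	sector_t last  = a_end > b_end ? a_end : b_end;

	return erase_block_of(first, eb_sectors) ==
	       erase_block_of(last, eb_sectors);
}

/* Rule: only writes for the currently open erase block are eligible,
 * and only in ascending order relative to the last issued write. */
static bool dispatch_write_now(sector_t candidate, sector_t last_issued,
			       uint64_t open_eb, unsigned int eb_sectors)
{
	if (erase_block_of(candidate, eb_sectors) != open_eb)
		return false;
	return candidate >= last_issued;
}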


Block allocation in file systems

See https://wiki.linaro.org/WorkingGroups/Kernel/Specs/improve-fs-block-allocation and https://wiki.linaro.org/WorkingGroups/Kernel/Specs/investigate-block-allocation-in-fs

The first step is to quantify the write amplification for each of the relevant block based file systems, for a number of real-world workloads. Possible workloads include

  • distribution installation (debootstrap)
  • parallel kernel compile
  • git clone/checkout
  • streaming data write

We will need a new analysis tool that reads the raw blktrace data, counts the number of blocks being written, and estimates the number of garbage collection cycles for a medium with the characteristics recorded in the flash card survey. The input to the tool should include

  • blktrace data set
  • erase block size
  • page size
  • allocation algorithm used in the card (linear-only or random writes allowed)
  • number of open erase blocks supported

Since the performance is mainly determined by the number of garbage collection cycles, the main outputs should be that number and the write amplification factor, defined as (number of GC cycles * erase block size / bytes written).
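As a rough illustration only (an assumed design, not an existing tool), the core of such an estimator could look like the sketch below. It operates on write events already extracted from the blktrace data and models a card that keeps a fixed number of erase blocks open, counting one garbage collection cycle whenever a write forces an open erase block to be evicted.

#include <stdint.h>
#include <stddef.h>

struct write_event {            /* one write extracted from the blktrace data */
	uint64_t byte_offset;
	uint64_t length;
};

struct result {
	uint64_t gc_cycles;
	double   write_amplification;
};

static struct result estimate(const struct write_event *ev, size_t count,
			      uint64_t erase_block_size, unsigned open_blocks)
{
	uint64_t open[64] = { 0 };      /* assumes open_blocks <= 64 */
	unsigned used = 0, victim = 0;
	uint64_t gc = 0, bytes = 0;

	for (size_t i = 0; i < count; i++) {
		uint64_t eb = ev[i].byte_offset / erase_block_size;
		int hit = 0;

		bytes += ev[i].length;
		for (unsigned j = 0; j < used; j++)
			if (open[j] == eb)
				hit = 1;
		if (hit)
			continue;
		if (used < open_blocks) {
			open[used++] = eb;      /* still room: just open it   */
		} else {
			open[victim] = eb;      /* evict one -> one GC cycle  */
			victim = (victim + 1) % open_blocks;
			gc++;
		}
	}

	struct result r = {
		.gc_cycles = gc,
		.write_amplification = bytes ?
			(double)gc * erase_block_size / bytes : 0.0,
	};
	return r;
}

A real tool would also need to model linear-only cards, where a backwards write within an open erase block triggers garbage collection as well, and take the page size into account; both are left out of the sketch for brevity.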

With that data, we can hopefully draw specific conclusions, in the form of e.g.

  • With XXXfs, you need at least N open erase blocks to avoid the worst-case write amplification.
  • XXXfs is always better than YYYfs.
  • YYYfs has a specific problem; slightly changing its block allocation improves it by a factor of C.
  • We do/don't need the flash remapper for optimum performance.
  • If there were a Linux Foundation qualification program for flash devices, it should require support for N erase blocks in random access for a device to be recommended for use with Linux.

Further action on this depends on the specific results.


Change the VFS to deal with larger than 4KB block sizes

See https://wiki.linaro.org/WorkingGroups/Kernel/Specs/vfs-support-for-large-block-sizes

Specifically, as discussed with Nick Piggin, accesses to the b_data field in struct buffer_head no longer work if b_data is backed by multiple pages. Consequently, any dereference of b_data should be encapsulated in a macro, e.g.

memset(bh->b_data, 0, blocksize);

could get replaced with

void *bh_data = bh_get_data(bh);
memset(bh_data, 0, blocksize);
bh_put_data(bh, bh_data);

Once the bh_get/put_data macros are in place, we can replace them with functions that call vmap() to get a linear mapping for the underlying pages.
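As a hedged sketch of where this could end up (assumed code, not existing kernel API; a multi-page buffer_head would also need somewhere to store its backing pages, shown here as a hypothetical b_pages array):

#include <linux/buffer_head.h>
#include <linux/vmalloc.h>

static inline void *bh_get_data(struct buffer_head *bh)
{
	if (bh->b_size <= PAGE_SIZE)
		return bh->b_data;              /* fast path: already linear */
	/* hypothetical b_pages array; today a buffer_head only has b_page */
	return vmap(bh->b_pages, bh->b_size >> PAGE_SHIFT,
		    VM_MAP, PAGE_KERNEL);
}

static inline void bh_put_data(struct buffer_head *bh, void *data)
{
	if (bh->b_size > PAGE_SIZE)
		vunmap(data);                   /* drop the temporary mapping */
}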

As a prototype, we can do this for a single file system plus fs/buffer.c, then see what else breaks.

Specific actions should be discussed with Nick and on linux-mm@kvack.org as well as linux-fsdevel@vger.kernel.org.


CategorySpec
