Add support in VFS for larger than 4KB block size

Release Note

Linux has always limited file system block sizes to at most the system page size, which is architecture dependent but typically 4 KB. Lifting that restriction is important for devices such as flash memory cards that have an underlying block size larger than 4 KB but provide an abstract block interface that emulates smaller writes.


When the underlying flash page size is larger than the 4 KB block size of the file system, every write of a single block results in a very expensive read-modify-write cycle on the card. On other devices that have smaller flash page sizes but can drive multiple channels in parallel, writing anything less than, for example, 16 KB (page_size * num_channels) takes the same time as writing the full 16 KB. Both of these behaviors can be observed in today's flash drives and are becoming more common (see WorkingGroups/Kernel/Projects/FlashCardSurvey). The only way to resolve this is to allow file systems to use larger blocks.
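The cost described above can be illustrated with a small calculation. This is only a simplified model, assuming aligned writes and a device that always programs whole write units; the helper names flash_write_unit() and flash_bytes_programmed() are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: effective write unit of a device that drives
 * several channels in parallel (page_size * num_channels).
 */
static size_t flash_write_unit(size_t page_size, size_t num_channels)
{
	assert(page_size > 0 && num_channels > 0);
	return page_size * num_channels;
}

/*
 * Hypothetical helper: bytes the device effectively programs for an
 * aligned write of write_size bytes, assuming it always works in whole
 * write units. This is a simplification, not a measured behavior.
 */
static size_t flash_bytes_programmed(size_t write_size, size_t write_unit)
{
	assert(write_unit > 0);
	/* Round up to the next multiple of the write unit. */
	return (write_size + write_unit - 1) / write_unit * write_unit;
}
```

Under this model, a 4 KB file system block on a device with a 16 KB write unit costs flash_bytes_programmed(4096, 16384) = 16384 bytes, so only a quarter of what the device programs is payload. A 16 KB block would cost the same 16384 bytes.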


Ideally, the file system block size should match the underlying flash write unit, which is typically 16 KB in 2011 and keeps growing over the years. An obvious disadvantage of larger block sizes is the additional memory consumption, both in RAM and on storage, but the performance advantage is seen as more important for many use cases, and allowing larger block sizes will give users the chance to find the best value for their workloads.

A number of people are interested in seeing this happen, and we are trying to work on different aspects. Nick Piggin suggested to Arnd Bergmann that someone from Linaro could change the way that the b_data field in struct buffer_head is accessed in the kernel, which is just one of the things that need to be worked on.


Code Changes

Right now, the maximum block size is the same as the CPU page size, which means that we can never fill all channels on common devices that don't have command queuing (MMC, USB, PATA). Ideally, a file system should be able to use 16 KB or 64 KB blocks. A number of people are working on this, but it would be good if we could lend a hand there.

Specifically, as discussed with Nick Piggin, accesses to the b_data field in struct buffer_head no longer work if b_data is backed by multiple pages, because those pages need not be contiguous in the kernel's address space. Consequently, any dereference of b_data should be encapsulated in a macro. For example,

memset(bh->b_data, 0, blocksize);

could get replaced with

void *bh_data = bh_get_data(bh);
memset(bh_data, 0, blocksize);
bh_put_data(bh, bh_data);

Once the bh_get/put_data macros are in place, we can replace them with functions that call vmap() to get a linear mapping for the underlying pages.
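The get/put pattern can be tried out in user space before touching the kernel. The following is a minimal sketch under loud assumptions: struct mock_bh, mock_bh_init() and mock_bh_free() are hypothetical stand-ins invented for illustration, and the linear view is produced by copying through a temporary buffer rather than by vmap(), since user space cannot remap pages that way. It only demonstrates why callers need a matching put for every get.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_PAGE_SIZE 4096u
#define MOCK_MAX_PAGES 16u

/* Hypothetical userspace stand-in for struct buffer_head. */
struct mock_bh {
	size_t b_size;                        /* block size in bytes */
	unsigned char *pages[MOCK_MAX_PAGES]; /* separately allocated pages */
};

static size_t mock_bh_nr_pages(const struct mock_bh *bh)
{
	return (bh->b_size + MOCK_PAGE_SIZE - 1) / MOCK_PAGE_SIZE;
}

/* Allocate zeroed, non-contiguous pages; returns 0 on success, -1 on failure. */
static int mock_bh_init(struct mock_bh *bh, size_t block_size)
{
	size_t i, n;

	memset(bh, 0, sizeof(*bh));
	bh->b_size = block_size;
	n = mock_bh_nr_pages(bh);
	if (n == 0 || n > MOCK_MAX_PAGES)
		return -1;
	for (i = 0; i < n; i++) {
		bh->pages[i] = calloc(1, MOCK_PAGE_SIZE);
		if (!bh->pages[i]) {
			while (i > 0)
				free(bh->pages[--i]);
			return -1;
		}
	}
	return 0;
}

static void mock_bh_free(struct mock_bh *bh)
{
	size_t i, n = mock_bh_nr_pages(bh);

	for (i = 0; i < n; i++)
		free(bh->pages[i]);
}

/*
 * Return a linear view of the block, or NULL on allocation failure.
 * Assumption: a kernel version would map the pages (e.g. with vmap())
 * instead of copying them into a temporary buffer as done here.
 */
static void *bh_get_data(struct mock_bh *bh)
{
	size_t i, n = mock_bh_nr_pages(bh);
	unsigned char *linear;

	assert(n > 0 && n <= MOCK_MAX_PAGES);
	linear = malloc(n * MOCK_PAGE_SIZE);
	if (!linear)
		return NULL;
	for (i = 0; i < n; i++)
		memcpy(linear + i * MOCK_PAGE_SIZE, bh->pages[i], MOCK_PAGE_SIZE);
	return linear;
}

/* Write the linear view back to the backing pages and release it. */
static void bh_put_data(struct mock_bh *bh, void *data)
{
	size_t i, n = mock_bh_nr_pages(bh);
	unsigned char *linear = data;

	for (i = 0; i < n; i++)
		memcpy(bh->pages[i], linear + i * MOCK_PAGE_SIZE, MOCK_PAGE_SIZE);
	free(linear);
}
```

A caller would initialize a 16 KB block with mock_bh_init(&bh, 16384), obtain the view with bh_get_data(&bh), memset() it, and hand it back with bh_put_data(&bh, data). Only after the put do all four backing pages reflect the change, which is the contract the real macros would have to establish at every b_data call site.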

As a prototype, we can do this for a single file system and fs/buffer.c, then see what else breaks.

Specific actions should get discussed with Nick and on as well as

Unresolved issues

This is only a partial effort. Completing the work on b_data will not have any positive effect by itself; the benefit depends on further work getting done. Depending on how we progress with this, it might be good to invite Nick or others to LDS-p to discuss further steps.


WorkingGroups/KernelArchived/Specs/vfs-support-for-large-block-sizes (last modified 2013-01-14 19:38:17)