Summary

A new file system is being prototyped, specifically with the goal of having optimal I/O characteristics on low-end managed flash storage such as SD cards.

Release Note

This work is not intended for a specific Linaro release and is not getting tracked in the kernel working group. From Linaro's perspective, the work is low priority, but it complements the other work on storage performance and is developed in close cooperation. The actual implementation is done as part of an internship at IBM.

Rationale

The best performance on flash storage can only be achieved with a file system that knows about the underlying characteristics. Having a file system that does everything as good as possible to the best of our knowledge lets us point out the problems with the existing file systems in order to optimize them. If the work turns out to be vastly superior to existing approaches, we can also think about integrating it into Linux as a follow-on project.

Assumptions

  • Low-end flash storage behaves consistently according to the results of WorkingGroups/Kernel/Projects/FlashCardSurvey.

  • Write performance on these devices is usually dominated by internal garbage collection, almost never by CPU speed or memory bandwidth.

Requirements

Definitions

In order to better describe the specific requirements, we define five categories of importance:

  1. mandatory -- Without this feature, the file system does not serve its purpose of demonstrating the performance win of avoiding internal garbage collection

  2. important -- This feature must be present in order to actually use the file system in standard setup.

  3. secondary -- There should be the option to implement this feature at a later point, because important applications depend on it. It does not have to be part of the initial release.

  4. optional -- A feature that is clearly useful and should eventually be implementable. If it's easy to do, it will be implemented right away, and the design should leave the option to implement it, unless it conflicts with a more important feature.

  5. non-goal -- While there are use cases for it, we do not intend to ever support this feature. No effort will be made to

Supported hardware

  • Types of devices
    • SD/MMC/eMMC (important)
    • USB Stick (important)
    • Other low-end blockdevice (CF, eSATA, cheap SSD) (optional)
    • not supported (non-goal): MTD flash, Hard drive, high-end SSD
  • Storage size
    • less than 4 GB (optional)
    • 4 to 32 GB (mandatory)
    • 64 GB to 2 TB (secondary)
    • more than 2 TB (optional)
  • Erase block sizes
    • regular: 4 MB (mandatory)
    • common sizes: 1.5, 2, 3, 4, 6, 8 MB (important)
    • possible future sizes: 12, 16, 24, 32, 48, 64 MB (secondary)
    • sandisk-style TLC/SLC combo: e.g. 4MB/1.33MB or 3MB/1MB (secondary)
    • arbitrary (optional)
  • Optimum I/O sizes
    • 64 KB (important)
    • arbitrary power-of-two (optional)
    • non-power-of-two (non-goal)
  • Minimum I/O sizes
    • 4 KB (mandatory)
    • 8/16/32 KB (secondary)
    • arbitrary power-of-two (optional)
    • non-power-of-two (non-goal)
  • Number of concurrent erase blocks
    • 1 (optional)
    • 2 (secondary)
    • 3-4 (important)
    • 5 (mandatory)
    • more than 5 (secondary)
  • I/O pattern
    • random I/O within erase block (mandatory)
    • linear I/O within erase block only (secondary)
    • FAT Area: hardware supports random I/O in fixed range of erase blocks (secondary)
    • no erase block limitations, e.g. SSD (non-goal)

File system features

  • Basic features
    • Reading and writing multiple files (mandatory)
    • POSIX file system semantics, e.g. hierarchical directory, symlink, hardlink, attributes (important)
    • NFS exports (optional)
    • mmap (secondary)
    • extended attributes (secondary)
  • Internal features
    • Transactionality (important)
    • garbage collection (important)
    • block checksums (secondary)
    • discard/trim (secondary)
    • extents (optional)
    • sparse files (optional)
    • tail merging (optional)
    • Snapshotting (non-goal)
    • compression (non-goal)
    • encryption (non-goal)
    • preallocation (non-goal)

Components

  • FUSE based implementation (mandatory)
  • mkfs tool (mandatory)
  • fsck (secondary)
  • kernel port (secondary)
  • tunefs (optional)
  • debugfs (optional)

Design

First draft design

The file system is organized in clusters and cluster groups. A cluster has a power-of-two size, typically between 16KB and 64KB (initially 4KB) to allow optimum I/O performance. A cluster group is a multiple of clusters and should represent one or more erase blocks on the device, typically 4 MB. Each group contains either inodes or data, directories are stored as special data files with associative lists linking file names to inodes. All metadata other than inodes is stored in the first few groups of the medium and is intended to normally reside in memory.

Each inode is variable length and can be at most one cluster long. There are three possible inode layouts, depending on the file size:

  1. Embedded data: All file data is part of the inode itself, the file plus the inode metadata fits entirely inside one cluster.
  2. Cluster indirect: The inode has an array of cluster pointers, the file size is limited by the number of pointers that fit inside the inode cluster.
  3. Group indirect: The inode has an array of pointers to cluster groups, file data always fills up entire groups, this should always allow files as large as the storage medium in practice, but wastes half a cluster group per inode on average.

The ideal inode type is guessed at file creation time based on how much data is there. It can change dynamically between embedded data and cluster indirect allocation when the file grows or shrinks, but might not change back from group indirect to cluster indirect on demand.

The number of possible inodes is defined at file system creation time. The metadata area contains a fixed-length array with cluster pointers indexed by inode number. When a file is written, its inode gets copied to a new location and the array is updated to point to the new location. The update of the inode pointers and other data in the metadata are needs to be transactional, while all other cluster groups are only written in copy-on-write operations and are not visible until the metadata are gets updated.

To optimize space for small files, multiple inodes of any of the three types can be stored in one cluster, when retrieving an inode from storage, the entire cluster is read into memory and scanned for the inode number.

When writing data, the file system holds multiple cluster groups open. Each group is filled from start to end and contains data from a relatively uniform category. The minimum set of categories is one for inodes and data clusters, but more categories can be defined based on e.g. file size, age of the file or hints from user space. The number of open cluster groups should not exceed the hardware capabilities that are determined at file system creation time.

For garbage-collection purposes, an index of all cluster groups is kept in the metadata area. It contains the type of the group (inodes, clusters, linear data or empty), time of last write access and amount of valid data. This data is used to determine which group to garbage-collect next. When garbage-collecting a group that contains inodes, each inode is looked up in the inode index to see if it is still valid. Valid inodes are copied to a new location, invalid inodes are discarded.

In order to garbage-collect data clusters, the cluster group contains a list of the inodes that reference it. This is written into the last cluster of the group when the group being filled. Garbage-collection then reads the respective inodes and copies all data blocks that each inode has in the group into a new location and then updates the inodes. File data that is filling entire groups never needs to be garbage-collected, but it may occasionally have to be touched for static wear-levelling purposes.

The decision which group to garbage-collect can take several aspects into account:

  • groups that have very little active data should be garbage-collected to free up space for new data
  • groups that have not been written in a very long time do not participate in wear-leveling, garbage-collecting them contributes to longevity of the storage medium
  • garbage collection should avoid mixing hot (written recently) and cold (not written recently) data in order to reduce the need for garbage-collection later.
  • garbage-collection can be triggered either by the need for free space or by excessive fragmentation of free space. The trade-offs between the above aspects will be different depending on the scenario.

Implementation

The file system is based on the "fuse" framework and implemented in user space. A port to an in-kernel file system is outside of the initial scope but can be implemented later. There will be a library containing the basic data structures and methods to access them, and the library can be shared between the fuse code and the user programs (mkfs, fsck, ...).

Milestones

1. Reading and writing a single file

The necessary logic to is implemented to have a single inode in cluster indirect or embedded data mode, write data to it and read the same data back. The readdir call returns a fixed file name associated with zero-byte file at mount time.

2. Directory operations

A mkfs program creates an empty file system layout. It is possible to mount that file system and create and remove files. Space is freed up when files are removed.

3. Mostly posix compliant operation

Subdirectories, readdir, symlinks, special files, hardlinks and other essential operations all work as expected.

4. Generality

Cluster and group sizes as well as number of open groups are customizable in a reasonable way to adapt to most of the common hardware.

5. All inode formats working

All three inode formats (embedded data, cluster indirect, group indirect) are implemented and tested.

6. Garbage collection

Both inodes and data clusters get garbage-collection. The file system does not lock up when full and/or fragmented.

7. Transactionality

No data is lost during power-outage, writes to the metadata area are performed in a transactional way.

8. Secondary requirements

All important requirements have been met, and secondary items are being implemented. Specifics and additional milestones to be defined then.

Unresolved issues

* Apparently, FUSE cannot really handle block sizes other than the page size, but we want to support cluster sizes between 16KB and 64KB. Initial versions can use 4KB (on x86) and 64KB (on powerpc64), but it will have to become more general eventually.

Final thesis

The thesis is completed, the full text is available as volker-schneider-thesis_colored.pdf. The source code is available on http://git.linaro.org/gitweb?p=people/vschneider/ffsp.git .


CategorySpec

WorkingGroups/KernelArchived/Specs/flash-file-system-prototype (last modified 2013-01-14 19:39:54)