Preventing Filesystem Corruption in Embedded Linux

Published as a Whitepaper on Mar 22, 2010 (Updated Dec 17, 2020)

Introduction

Almost any computer system is subject to unexpected power failures. For some embedded systems, this only occurs when the power grid goes down. For others, it may happen when a user decides to pull the plug instead of using a documented shutdown procedure. Automotive and remote systems need to anticipate that power will stop and start several times a day. If an embedded system is implemented without considering what happens when power is lost, the result can be catastrophic failures down the road. Because of the nature of these failures, an embedded system may run for weeks, months, or years before users experience one. From the user's perspective, the device worked fine yesterday and today it doesn't even turn on, and they don't tie the failure back to an unexpected power loss.

One class of failures caused by unexpected power loss involves the boot medium. Investigating a boot medium failure after power loss may show an unclean filesystem, missing files, or, more commonly, a filesystem that only mounts read-only. The latter happens when the filesystem detects a serious problem with its metadata at runtime that it cannot fix automatically, so it remounts read-only to prevent further corruption on the disk. Many people turn to common journaled filesystems like ext3/ext4 to address these failures. While journaled filesystems are less prone to corruption, they are far from immune.

Explanation

When the filesystem writes to NAND, it must write a page at a time, and the page must be erased before it is written. However, erasing a page requires erasing the entire block that contains it, and a block includes many pages. As an example from a Micron NAND, a page is 4KiB and the erase block is 1MiB. A filesystem or FTL (Flash Translation Layer) will store multiple files within a block, so if power is lost during an erase/write it is possible to lose the entire block and the multiple files stored in it.
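To make these sizes concrete, the following sketch works through the arithmetic for the example geometry above. The 10KiB file size and the assumption that each file occupies whole pages are illustrative only and do not describe any particular filesystem.

```python
# Geometry from the Micron example above: 4 KiB pages, 1 MiB erase blocks.
PAGE_SIZE = 4 * 1024
ERASE_BLOCK_SIZE = 1024 * 1024

# Number of pages that are erased together when one block is erased.
pages_per_block = ERASE_BLOCK_SIZE // PAGE_SIZE


def files_per_block(file_size: int) -> int:
    """Return how many files of file_size bytes fit in one erase block.

    Assumes each file occupies a whole number of pages, which is a
    simplification made for this illustration.
    """
    pages_per_file = max(1, -(-file_size // PAGE_SIZE))  # ceiling division
    return pages_per_block // pages_per_file


print(pages_per_block)              # 256
print(files_per_block(10 * 1024))   # 85 (each 10 KiB file needs 3 pages)
```

Under these assumptions, a single interrupted erase could affect up to 85 small files at once, which is why a poorly timed power loss can do damage far beyond the file being written.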

On some of Technologic Systems' older products we accessed NAND flash using a controller that was either built into the processor or implemented in our FPGA. These hardware controllers were used in conjunction with a flash filesystem like YAFFS2 or JFFS that manages communications with the NAND devices, ECC, wear leveling, bad block management, and filesystem specific storage mechanisms for storing Linux data, permissions, and directories.

After using flash filesystems for years and understanding their limitations, Technologic Systems created XNAND, which significantly improved the reliability of NAND. XNAND can find corrupt blocks, and uses redundant sectors to gracefully recover from failure. When ECC failures occur, YAFFS2/JFFS permanently retires blocks, but XNAND improves on this by attempting to write corrupted blocks back with correct data from redundant sectors. In many cases this allows a sector to work for many thousands more writes. XNAND is closely tied to our 512MB SLC NAND based products and is limited to providing 256MiB of usable space in order to support the redundancy. See the XNAND paper for more details.

More recently, Technologic Systems has transitioned our products to more modern NAND flash devices which integrate a Flash Translation Layer in the device, removing the processor from direct control over the flash. These modern NAND flash devices can then be used in conjunction with standard filesystems.

SD cards are one example, but SATA, USB mass storage, and eMMC devices are all similar underneath with regard to the Flash Translation Layer. The FTL provides NAND flash access to the OS in 512B or 4KiB blocks, which allows the use of standard block device filesystems like ext3/4 or FAT32 rather than requiring a filesystem like YAFFS2 which manages the NAND directly. The Flash Translation Layer also manages wear leveling, bad block management, and garbage collection of leftover data in pages. The wear leveling implemented by an FTL often improves upon YAFFS2 by implementing a dynamic wear leveling algorithm. NAND flash controllers are different and proprietary to each manufacturer, and it is uncommon for a manufacturer to publicly document the process used inside its controller. However, all of them use the same basic principles.

The FTL provides data access in 512B blocks, but the pages and erase blocks on the flash are organized into "allocation groups", which are typically 4MiB on SD cards. The FTL improves on the older NAND implementations by maintaining a table that maps allocation groups to their real location on the NAND flash. During wear leveling or garbage collection, the SD controller may decide to move an allocation group to a different location, which it does by writing the old data to the new location and updating the mapping. Depending on the design of the flash controller, if power is lost during this update you can lose up to an allocation group's worth of data. While an allocation group is 4MiB on SDHC, SDXC allows the card to specify up to 64MiB.
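As a rough illustration of the scale involved, the sketch below computes how many 512B sectors share one allocation group and which group a given sector falls into. It assumes allocation groups are laid out contiguously starting at sector 0. That layout is an assumption made for this example, since controller internals are generally undocumented.

```python
SECTOR_SIZE = 512                    # bytes per block exposed by the FTL
AG_SDHC = 4 * 1024 * 1024            # typical allocation group on SDHC
AG_SDXC_MAX = 64 * 1024 * 1024       # largest size an SDXC card may specify


def sectors_per_group(group_size: int) -> int:
    """Number of 512B sectors that share one allocation group."""
    return group_size // SECTOR_SIZE


def group_of_sector(lba: int, group_size: int = AG_SDHC) -> int:
    """Index of the allocation group containing sector lba.

    Assumes groups are contiguous from sector 0 (illustrative only).
    """
    return (lba * SECTOR_SIZE) // group_size


print(sectors_per_group(AG_SDHC))        # 8192
print(sectors_per_group(AG_SDXC_MAX))    # 131072
print(group_of_sector(8191), group_of_sector(8192))  # 0 1
```

In other words, a write to one 512B sector can put up to 8192 neighboring sectors at risk on a card with 4MiB allocation groups, and far more if the card specifies a larger group.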

Solutions

With an understanding of the NAND flash implementation used in your embedded system, there are several ways to protect against filesystem corruption in embedded Linux.

Use a Read-Only Root Filesystem

Files that don't need to be modified should be kept on a partition that is mounted read-only. If all writes are avoided, there is no risk of a write being interrupted and corrupting the disk. Some of our products include an initial ramdisk that is read-only by default, and more recent products can be configured for the same behavior with a read-only root filesystem. A full Debian boot can still be used as well, with some customization.
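One way to confirm at runtime that a partition really is mounted read-only is to inspect the kernel's mount table. The following is a minimal sketch that parses text in the /proc/mounts format. The device names and mount points in the sample are hypothetical.

```python
def is_mounted_read_only(mount_point: str, mounts_text: str) -> bool:
    """Return True if mount_point is listed with the 'ro' option.

    mounts_text uses the /proc/mounts layout:
    device mountpoint fstype options dump pass
    If a mount point appears more than once, the last entry wins.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            result = "ro" in fields[3].split(",")
    if result is None:
        raise LookupError(f"{mount_point} is not mounted")
    return result


# Hypothetical mount table used only for this demonstration.
sample = """\
/dev/mmcblk0p1 / ext4 ro,relatime 0 0
/dev/mmcblk0p2 /data ext4 rw,noatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""

print(is_mounted_read_only("/", sample))      # True
print(is_mounted_read_only("/data", sample))  # False
```

On a running Linux system the same function can be given the contents of /proc/mounts, for example `is_mounted_read_only("/", open("/proc/mounts").read())`.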

Data logging applications can be implemented without NAND flash writes by storing logs in a ramdisk until they can be offloaded to a networked location. If local data logging is required, a read-write partition can be created on the same medium as the read-only partition, with the understanding that in the rare case of NAND corruption due to sudden power loss it is acceptable to lose data. The read-write partition should be aligned to the allocation group size (typically 4MiB). With this setup, the worst-case scenario from a poorly timed failure is that the system will boot correctly, but the data it has been collecting recently will be corrupted and the filesystem holding the read-write data may need to be recreated.
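The alignment requirement can be checked with simple arithmetic. The sketch below rounds a proposed partition start sector up to the next allocation group boundary, assuming 512B sectors and the typical 4MiB allocation group mentioned above. The actual group size of your media may differ.

```python
SECTOR_SIZE = 512
ALLOCATION_GROUP = 4 * 1024 * 1024  # typical value; check your media


def align_up(sector: int, group_size: int = ALLOCATION_GROUP) -> int:
    """Round a start sector up to the next allocation group boundary."""
    sectors_per_group = group_size // SECTOR_SIZE
    return -(-sector // sectors_per_group) * sectors_per_group


def is_aligned(sector: int, group_size: int = ALLOCATION_GROUP) -> bool:
    """True if the sector's byte offset is a multiple of the group size."""
    return (sector * SECTOR_SIZE) % group_size == 0


print(align_up(2048))      # 8192
print(align_up(8192))      # 8192
print(align_up(1000000))   # 1007616
print(is_aligned(2048))    # False
```

A partition starting at sector 2048 (a common 1MiB default) would share an allocation group with whatever precedes it, so under these assumptions the start would be moved to sector 8192.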

Build an Emergency Backup into Your System

In many cases the failure of an embedded system in a remote location will cost many thousands of dollars just to reach it and replace it. If this is true of your system, it is better to make an investment up front to prevent failures.

Most Technologic Systems single board computers have an initial boot memory, such as a SPI NOR flash device, an eMMC boot partition, or an RTC NVRAM chip. These devices have some non-volatile memory available that a user can leverage to create a fail-safe boot mechanism. For example, on a TS-4900 a byte of SPI flash could be designated for boot status:

1. U-Boot is started.
2. U-Boot reads the boot status byte from SPI flash and checks whether it is equal to 0x55. If it is not, U-Boot takes recovery action. It then immediately writes 0xaa, indicating that it is attempting a boot.
3. U-Boot launches Linux, which starts your application. When your application has successfully started, it writes 0x55 to the boot status byte in SPI flash.

If the system fails to boot you can then take an action such as booting to a separate recovery partition, or using a kernel stored directly in the SPI flash. This can be done on many other Technologic Systems SBCs by using the initial ramdisk to modify the special boot indication byte instead of U-Boot. The special byte could also be located in the RTC NVRAM instead of the SPI flash.
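The boot status scheme described above can be modeled as a small state machine. The following sketch is only a model of the logic, not bootloader code. The StatusByte class is a hypothetical stand-in for the non-volatile byte, and its initial value of 0x55 is an assumption.

```python
BOOT_OK = 0x55       # written by the application after a successful start
BOOT_ATTEMPT = 0xAA  # written by the bootloader before launching Linux


class StatusByte:
    """Hypothetical stand-in for one byte of SPI flash or RTC NVRAM."""

    def __init__(self, value: int = BOOT_OK):
        self.value = value


def bootloader_stage(status: StatusByte) -> str:
    """Choose the boot path, then mark the boot as in progress."""
    path = "normal" if status.value == BOOT_OK else "recovery"
    status.value = BOOT_ATTEMPT
    return path


def application_started(status: StatusByte) -> None:
    """Called by the application once it is up and running."""
    status.value = BOOT_OK


status = StatusByte()
print(bootloader_stage(status))  # normal
application_started(status)      # application confirms the boot
print(bootloader_stage(status))  # normal
# This time the application never confirms, e.g. the root filesystem is corrupt.
print(bootloader_stage(status))  # recovery
```

The important property is that the byte is only ever returned to 0x55 by the application itself, so any failure between the bootloader and a fully started application is detected on the next power cycle.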

Our fail-safe SD card image for the TS-7800 demonstrates a way to guarantee a successful boot process using the initrd. It uses the non-volatile RAM in the RTC to record whether a restore process is necessary. The boot procedure is as follows:

1. Linux boots to an initial ramdisk.
2. The RTC value is checked to confirm a normal boot sequence.
3. A value is written to the RTC indicating failure, the watchdog is started, and Debian is booted.
4. The watchdog is continually fed during Debian initialization.
5. The RTC value is written to indicate success, and the watchdog is disabled.

In real world applications of this example, step 5 is controlled by the application software. The application can be responsible for feeding or disabling the watchdog, and for writing different values to the RTC.

If a failure code is read at step 2, the entire Debian partition is reformatted and restored based on a read-only backup partition. This configuration results in a robust system that is always able to boot Debian. The worst-case behavior is loss of logged data and a lengthy delay while the system is restored.
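When the application controls the final step, its side of the procedure might look like the sketch below. This is not the TS-7800 image's implementation. It assumes a watchdog exposed through the standard Linux /dev/watchdog character device, where any write feeds the timer and writing the character 'V' before closing disables it on drivers that support this. The flag values, the nvram byte array, and the FakeWatchdog class are hypothetical.

```python
FLAG_FAILURE = 0x00  # hypothetical encoding written by the initrd before boot
FLAG_SUCCESS = 0x01  # hypothetical encoding written once startup completes


def finish_boot(watchdog, nvram: bytearray, checks) -> bool:
    """Feed the watchdog while running startup checks.

    On success, record the success flag and disable the watchdog.
    On failure, leave the failure flag in place and stop feeding, so the
    watchdog resets the board and the restore path runs on the next boot.
    """
    for check in checks:
        watchdog.write(b"\0")   # any write feeds a /dev/watchdog style device
        watchdog.flush()
        if not check():
            return False
    nvram[0] = FLAG_SUCCESS
    watchdog.write(b"V")        # "magic close": request disable on close
    watchdog.flush()
    watchdog.close()
    return True


class FakeWatchdog:
    """Test double that records what would be written to the device."""

    def __init__(self):
        self.writes = []
        self.closed = False

    def write(self, data):
        self.writes.append(data)

    def flush(self):
        pass

    def close(self):
        self.closed = True


nvram = bytearray([FLAG_FAILURE])
wd = FakeWatchdog()
ok = finish_boot(wd, nvram, [lambda: True, lambda: True])
print(ok, nvram[0], wd.writes[-1])  # True 1 b'V'
```

On hardware that does provide /dev/watchdog, the fake could be replaced with `open("/dev/watchdog", "wb", buffering=0)`. How the RTC non-volatile RAM is accessed is board specific and is not shown here.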

Use eMMC in Data Reliability Mode

Typical eMMC devices also should not be powered down during a write/erase cycle, as they can be prone to the same failures as an SD card. However, the eMMC devices that Technologic Systems uses in our products include support for a "Write Reliability" mode and a "pseudo SLC" mode. These modes can be accessed by setting a fuse on the eMMC device. With both of these enabled, the eMMC only risks up to 512B during a write. In the event of a power loss, that 512B of data would return the values from a previous write rather than corrupt or erased data as on a typical SD card. Even in cases where the wrong data is present on the next boot, fsck is able to deal with the older data being present in a 512B block.

The downsides to setting these modes are that they roughly halve the usable size of the eMMC module, to 1.759GiB by default, and that write speed will be slightly slower. Used with ext3/4 and the filesystem configured to journal data, this can provide a very robust system. This mode still has the tradeoff that any data not yet committed to disk will be lost.

Build a Battery Backup into Your System

Sometimes, the cheapest way to make your system reliable is to simply make sure power never fails. For low quantity/high reliability embedded applications, the extra cost for an uninterruptible power supply can be a very good investment.

TS-SILO

Offered as an on-board, soldered solution in our newer single board computer products like the TS-7680 and TS-7553-V2, TS-SILO uses super capacitors to provide 20 to 60 seconds of power hold. This is enough time to shut down gracefully when a power outage is detected. Once depleted of stored energy, TS-SILO can be fully recharged in under a minute. You can read more about TS-SILO in our TS-SILO press release.

TS-BAT3 and TS-BAT10

The TS-BAT3 and TS-BAT10 are PC/104 power peripherals that can supply 1000 mAh or 2000 mAh, respectively, of 5V power to an embedded SBC when external power is not available. They communicate to the SBC using a serial port, so the SBC can intelligently shut down if necessary. The TS-BAT3 and TS-BAT10 are strong solutions for an embedded device that needs to shut down gracefully in all cases.

Other Suggestions

Use a Good Media Source

SanDisk has suggested that one third of SD cards for sale are counterfeit. There are also reports of vendors reselling used cards that have already had significant wear. To keep these cards from affecting the reliability of your system, make sure you are buying media from a reputable vendor recommended by your manufacturer of choice rather than from a marketplace like Amazon or eBay.

Be Aware of When Wearout Will Occur

We performed a wearout test with both SanDisk SD cards and the Micron eMMC devices we use on our products. The SanDisk 4GB microSD cards took approximately 1-2 months to wear out completely, and lasted for between 8 and 12 TiB of written data. The 4GiB Micron eMMC devices began to fail after 100-200TiB of writes, and took just over a year of constant writing to reach wearout. These tests were performed on a 10MiB region of the card, so wear leveling was performing very well automatically and should not need to be a concern when writing your application. If wear is a concern, there are steps you can take to reduce writes on a given Linux system. Contact us for more details.
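These endurance figures can be turned into a rough lifetime estimate for a given workload. The sketch below uses the lower bounds from the test above. The 1 GiB per day write volume is a hypothetical workload, and the calculation ignores write amplification, so results for small or scattered writes may be considerably worse.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def years_until_wearout(endurance_tib: float, writes_gib_per_day: float) -> float:
    """Estimate years of service from total endurance and daily write volume.

    Counts only data written by the host and ignores write amplification
    inside the flash controller, so treat the result as an upper bound.
    """
    days = (endurance_tib * TIB) / (writes_gib_per_day * GIB)
    return days / 365


# Lower-bound endurance from the test above, hypothetical 1 GiB/day workload.
print(round(years_until_wearout(8, 1), 1))     # SD card: 22.4
print(round(years_until_wearout(100, 1), 1))   # eMMC: 280.5
```

At modest write rates, wearout is unlikely to be the limiting factor for either medium. The estimate becomes more useful for heavy logging workloads, where the daily write volume can be orders of magnitude higher.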

Use Compatible Cards

Not all SD cards are entirely compatible with all controllers, and even a card that conforms to the SDHC specification may have slight variances. We have seen incompatibilities between many "Industrial" branded SD cards and both our own SD controllers and those built directly into the CPUs we use. We test primarily with SanDisk and expect their media to perform well. We recommend performing an exhaustive wearout test on a card to verify compatibility. Some incompatible cards will lose data immediately, some may corrupt after several GiB of data, and some will last for many TiB before any failure. Contact us for more information on running exhaustive compatibility tests on a specific product.

Conclusion

Filesystem corruption is a frequent problem for embedded Linux systems. System designers need to make a business decision regarding how much per-unit cost and how much engineering to put into preventing it. The inputs to this decision are different for every application. Designers need to weigh the business costs of failure prevention against the business costs of occasional failures in the field. For some systems, it may be enough to provide the customer with an extra SD card that they can install in the event of a failure. For others, a failure in the field is prohibitively expensive, and it is best to spend a significant amount of money per unit on prevention.

This paper outlines four strategies that can reduce or eliminate failures due to filesystem corruption:

Mounting filesystems read-only
Using a software backup
Using a battery backup
Using a TS-SILO equipped solution

The success of an embedded product may depend on evaluating these options during the design phase.

Document History

Date of Issue/Revision	Revision Number	Comments
07/19/2016	2.0	Major update to bring things back up to speed with current software technology.
03/31/2010	1.0	Document created
