ext4 and POSIX

Copyright © 2009 Bojan Smojver, Rexursive.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the licence is here.

Introduction

As the ext4 file system was introduced to stable Linux kernels, more and more people started using it. And, as with any other software, the more it gets used, the more people find out about it.

Well, some people have found out that ext4, as of kernel 2.6.29, behaves differently from ext3 in ordered mode when it comes to system crashes. The difference is that ext4 uses a technique called delayed allocation, which in certain circumstances can leave files empty on disk after a crash, unless they have been explicitly committed beforehand.

Facts

The problem was discovered with programs that use small configuration files, which are written to disk using the open(), write(), close(), rename() sequence (we note here, for completeness, that open() was called without any synchronised I/O flags). Essentially, the application writes new settings to a brand new (truncated) file with a different name and then renames that file to its proper name.
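The sequence just described can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the function name save_settings_fragile and both path parameters are hypothetical, and a short write is simply treated as an error to keep the example small.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `data` using the
 * open(), write(), close(), rename() sequence. No fsync() is issued,
 * so nothing here forces the new contents onto disk.
 * Returns 0 on success, -1 on error. */
int save_settings_fragile(const char *path, const char *tmp_path,
                          const char *data)
{
    /* Create (or truncate) the temporary file, without O_SYNC or similar. */
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(data);
    ssize_t n = write(fd, data, len);  /* may only reach kernel buffers */
    if (n < 0 || (size_t)n != len) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }

    if (close(fd) != 0) {              /* close() does not commit the data */
        unlink(tmp_path);
        return -1;
    }

    /* Give the new file its proper name, replacing the old one. */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

A caller might use it as save_settings_fragile("app.conf", "app.conf.tmp", "colour=blue\n"), where both file names are again just placeholders.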

If the machine happens to crash at that point, with ext4 as of 2.6.29, the file may be left empty on disk, because the blocks for its content have not been allocated yet. Obviously, this made the people who encountered the problem unhappy, because their programs would no longer start properly without the settings.

And, predictably, the ext4 file system, being the new thing in this scenario, got blamed.

As a side note, this particular case of ext4 behaviour has workarounds in place, queued for kernel 2.6.30. Major distributions already integrated these changes in their latest kernels, so everyone should feel more comfortable using ext4.

More facts

All of the above calls (open(), write(), close(), rename()) are defined in POSIX, which is the standard the Linux kernel aims to implement. The write() call doesn't guarantee anything about written data hitting the disk (for I/O that is not synchronised). Neither does close(). Yes, that is correct: when you close the file, it may not have been written to disk. Similarly, rename() doesn't say anything about the data of the file or the directory being written to disk at any particular time or in any particular order.

The only call (for the purposes of this discussion) that does say something about writing data to disk is fsync(), which is supposed to take the data and put it safely on platters. Now, fsync() may also lie to you and not actually do that properly (because hardware may be faulty or there are bugs in the driver etc.), but it is the best guarantee to have data written to disk. So, for all intents and purposes, it is the only explicit way of putting the data onto the disk platters (other Unix-like OSes may need other calls to do this, but we'll just call that fsync() from now on).

What fsync() does is take a file descriptor and commit all the data written through it by physically putting it on disk. The file descriptor may refer to a file or to a directory, and each can be committed to disk via fsync() completely independently of the other.

In the absence of fsync(), the kernel is free to make its own choices as to when the data and metadata associated with the file will be committed to disk. So, sometimes, the metadata gets committed first due to all kinds of different optimisations. In such a case, the contents of the new file, which are not yet committed, are still in kernel buffers and the inode on disk doesn't have any real blocks to point to, so the size of the file on disk is zero. If the system crashes with the directory on disk pointing to such an inode, we get the empty file situation.
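As a minimal sketch of this, the helper below opens a path and calls fsync() on the resulting descriptor; the same code serves for a file or a directory. The name sync_path is hypothetical, and the sketch assumes a platform such as Linux, where a directory can be opened read-only and passed to fsync(); other systems may refuse this.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: commit whatever `path` names (a regular file or
 * a directory) by opening it read-only and calling fsync() on the
 * descriptor. Assumes fsync() accepts a read-only descriptor, as it
 * does on Linux. Returns 0 on success, -1 on error. */
int sync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = (fsync(fd) == 0) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Calling sync_path() on a file does nothing for its directory entry, and calling it on the directory does nothing for the file's contents; the two commits are separate.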

So, to be POSIX compliant, the only real way of having at least some kind of guarantee that the renamed file will have the data you want is to call open(), write(), fsync(), close() and then rename(). Now, this will still not cover actually committing the directory to disk. But, if the kernel does then commit the directory, the data inside the file will be there. If it doesn't, the data of the old file will be there. This is not very important for this discussion, because both the old and the new configuration file are usually acceptable after the crash.
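A sketch of that more robust sequence in C might look like the following. The function name save_settings_durable and the path parameters are hypothetical. The directory is deliberately not committed here, matching the description above.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `data` using
 * open(), write(), fsync(), close(), rename().
 * Returns 0 on success, -1 on error. */
int save_settings_durable(const char *path, const char *tmp_path,
                          const char *data)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Write everything, allowing for short writes. */
    size_t len = strlen(data);
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, data + done, len - done);
        if (n <= 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        done += (size_t)n;
    }

    /* Commit the file's data before the rename can expose it. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }

    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* The directory itself is not fsync()ed in this sketch. */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

The only difference from the plain sequence is the fsync() call between the last write() and close(), which is also where the performance cost comes from.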

Crashes

Crashes are certainly not normal occurrences and the situation after the crash is not defined in POSIX. So, depending on your file system implementation, you may need to format your disks from scratch, reinstall and restore from backup (you have those, right? ;-).

The ext4 file system was designed with reliability in mind, so how come it doesn't save these renamed files in such a way that there is either the new or the old file in place?

The reliability of a file system primarily refers to its ability to be usable after a crash, and relatively quickly. In other words, its ability to correct its internal structures in such a way that the user doesn't have to format, reinstall and restore. ext4 does that. And fast.

Surely, not having data in files is unpleasant, so the primary ext4 developer already put workarounds in place, so that at least for configuration files written using the sequence without fsync(), this does not happen any more. You will now get either the new or the old file, populated with data, after the crash. This then gives ext4 extra reliability, even in these situations, at the expense of some speed.

By providing the high reliability characteristics mentioned above, ext4 already far exceeds the requirements of the POSIX standard.

Coming back to the original question, data that was never written to disk cannot be preserved after the crash. As we have seen from the analysis of the POSIX semantics, the problem with missing data in files is related to the fact that data destined for the new file was never actually committed to disk. In the absence of an explicit commit, it is by pure chance that sometimes these files will have data in them when renamed and sometimes not.

Bugs

So, where is the bug? In the file system or the applications?

Strictly speaking and if you don't expect your application to handle the crash case gracefully (which, we repeat, is not a regular occurrence), there is no bug. If the documentation of the application says that when the system crashes you need to restore files, then using open(), write(), close(), rename() is fine. However, such applications are less reliable in case of a crash, depending on the file system. Hence, if they claim they are doing their best to preserve your data, they do have a bug.

A somewhat more robust application that wants to make stronger guarantees about data integrity will use the more robust sequence of open(), write(), fsync(), close(), rename(). This will, however, carry a performance penalty, so it may not be desirable. This is a trade-off that each application developer needs to make.

A more sophisticated and robust application will create backup files of its configuration files, using fsync() on both the file and the directory. Such backup files will be created rarely, which will then enable the application to continue using the fast open(), write(), close(), rename() cycle during its normal operation. In the case of a crash, the backup files will be used to (automatically) restore and continue.
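One possible shape for such a backup scheme is sketched below. It is an assumption-laden illustration, not a prescribed design: write_backup and choose_settings_file are hypothetical names, an empty or missing main file is taken as the sign that the backup should be used, and the backup is rewritten in place for brevity (a real application might write it under a temporary name first). As before, calling fsync() on a directory opened read-only is assumed to work, as it does on Linux.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: write `data` to `backup_path`, then fsync() both
 * the backup file and the directory `dir` that contains it.
 * Returns 0 on success, -1 on error. */
int write_backup(const char *dir, const char *backup_path, const char *data)
{
    int fd = open(backup_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(data);
    ssize_t n = write(fd, data, len);
    int ok = (n >= 0 && (size_t)n == len && fsync(fd) == 0);
    if (close(fd) != 0)
        ok = 0;
    if (!ok)
        return -1;

    /* Commit the directory entry as well. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    ok = (fsync(dfd) == 0);
    if (close(dfd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: pick the file to load at startup. The main file
 * is preferred if it exists and is non-empty; otherwise fall back to
 * the backup. Returns NULL if neither is usable. */
const char *choose_settings_file(const char *path, const char *backup_path)
{
    struct stat st;

    if (stat(path, &st) == 0 && st.st_size > 0)
        return path;
    if (stat(backup_path, &st) == 0 && st.st_size > 0)
        return backup_path;
    return NULL;
}
```

With something like this in place, write_backup() would be called only occasionally, while everyday saves keep using the fast sequence without fsync().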

The ext4 file system, however, is doing everything correctly (even without the latest workarounds) and there are no bugs there (there could be many other bugs in ext4, but that's not the issue here).

The way forward

It is perfectly fine to write applications that demand a certain environment to run in. So, if application developers want to keep using the not strictly robust sequence of open(), write(), close(), rename(), they should tell the users that a file system that doesn't reorder commits on such a rename is required for better data survival upon a crash.

Or, application developers can adopt one of the more robust techniques presented here (or develop their own) and be portable to many more platforms and file systems.

Last updated: 2009-12-02 21:32:40 AEDT  