Update: I've done a newer article on this subject at http://log.amitshah.net/2009/04/re-comparing-file-systems.html that removes some of the deficiencies in the tests mentioned here and has newer, more accurate results along with some new file systems.
How should one allocate disk space for a file for later writing? ftruncate() (or lseek() followed by write()) create sparse files, not what is needed. A traditional way is to write zeroes to the file till it reaches the desired file size. Doing things this way has a few drawbacks:
- Slow, as small chunks are written one at a time by the write() syscall
- Lots of fragmentation
posix_fallocate() is a library call that handles the chunking of writes in one batch; the application need not have to code his/her own block-by-block writes. But this still is in the userspace.
Linux 2.6.23 introduced the fallocate() system call. The allocation is then moved to kernel space and hence is faster. New file systems that support extents make this call very fast indeed: a single extent is to be marked as being allocated on disk (as traditionally blocks were being marked as 'used'). Fragmentation too is reduced as file systems will now keep track of extents, instead of smaller blocks.
posix_fallocate() will internally use fallocate() if the syscall exists in the running kernel.
So I thought it would be a good idea to make libvirt use posix_fallocate() so that systems with the newer file systems will directly benefit when allocating disk space for virtual machines. I wasn't sure of what method libvirt already used to allocate the space. I found out that it allocated blocks in 4KiB sized chunks.
So I sent a patch to the libvir-list to convert to posix_fallocate() and danpb asked me about what the benefits of this approach were and also asked about using alternative approaches if not writing in 4K chunks. I didn't have any data to back up my claims of "this approach will be fast and will result in less fragmentation, which is desirable". So I set out to do some benchmarking. To do that, though, I first had to make some empty disk space to create a few file systems of sufficiently large sizes. Hunting for a test machine with spare disk space proved futie, so I went about resizing my ext3 partition and creating about 15 GB of free disk space. I intended to test ext3, ext4, xfs and btrfs. I could use my existing ext3 partition for the testing, but that would not give honest results about the fragmentation (existing file systems may already be fragmented, causing big new files surely to be fragmented whereas on a fresh fs, I won't run into that risk).
Though even creating separate partitions on rotating storage and testing file system performance won't give perfectly honest results, I figured if the percentage difference in the results was quite high, that won't matter. I grabbed the latest Linus tree and the latest dev trees for the userspace utilities for all the file systems and created about 5GB partitions for each fs.
I then wrote a program that created a file, allocated disk space and closed it and calculate the time taken in doing so. This was done multiple times for different allocation methods: posix_fallocate(), mmap() + memset() and writing zeroes in 4096 byte chunks and 8192 byte chunks.
So I had four methods of allocating files and 5G partition size. So I decided to check the performance by creating 1GiB file size for each allocation method.
I was quite surprised seeing poor performance for posix_fallocate() on ext4. On digging a bit, I realised mkfs.ext4 didn't create it with extents enabled. I reformatted the partition, but that data was valuable to have as well. Shows how much a file system is better with extents support.
The results are very impressive and the final patches to libvirt were finalised pretty quickly. They're now in the development branch libvirt. Coming soon to a virtual machine management application near you.
Use of posix_fallocate() will be beneficial to programs that know in advance the size of the file being created, like torrent clients, ftp clients, browsers, download managers, etc. It won't be beneficial in the speed sense, as data is only written when it's downloaded, but it's beneficial in the as-less-fragmentation-as-possible sense.