May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix



Science is facts; just as houses are made of stones, so is science made of facts; but a pile of stones is not a house and a collection of facts is not necessarily science.

-- Henri Poincaré

Filesystems are a very interesting area, one of the few in Unix where new algorithms and new ideas can still make a huge difference in performance. Solaris ZFS is one example in this regard.

Often the historical view on filesystems is a bit too Unix-centric and states that the Berkeley Fast File System is the ancestor of most modern file systems. This view ignores competitive and earlier implementations from IBM (HPFS), DEC (VAX/VMS), Microsoft (NTFS) and others.

Still, the Unix filesystem became a classic, and the concepts introduced in it dominate all modern filesystems. It also introduced many interesting features and algorithms into the area. For example, the very interesting concept of extended attributes introduced in the 4.4 BSD filesystem has recently been added to Ext2fs:

Immutable files can only be read: nobody can write or delete them. This can be used to protect sensitive configuration files.

Append-only files can be opened in write mode, but data is always appended at the end of the file. Like immutable files, they cannot be deleted or renamed. This is especially useful for log files, which can only grow. All in all, the following attributes are available in ext2fs:

  1. A (no Access time): if a file or directory has this attribute set, whenever it is accessed, either for reading or for writing, its last access time will not be updated. This can be useful, for example, on files or directories which are very often accessed for reading, especially since this parameter is the only one which changes on an inode when it is opened read-only.
  2. a ( append only): if a file has this attribute set and is open for writing, the only operation possible will be to append data to its previous contents. For a directory, this means that you can only add files to it, but not rename or delete any existing file. Only root can set or clear this attribute.
  3. d (no dump): dump (8) is the standard UNIX utility for backups. It dumps any filesystem for which the dump counter is 1 in /etc/fstab (see chapter "Filesystems and Mount Points"). But if a file or directory has this attribute set, unlike others, it will not be taken into account when a dump is in progress. Note that for directories, this also includes all subdirectories and files under it.
  4. i ( immutable): a file or directory with this attribute set simply can not be modified at all: it can not be renamed, no further link can be created to it [1] and it cannot be removed. Only root can set or clear this attribute. Note that this also prevents changes to access time, therefore you do not need to set the A attribute when i is set.
  5. s ( secure deletion): when a file or directory with this attribute set is deleted, the blocks it was occupying on disk are written back with zeroes.
  6. S ( Synchronous mode): when a file or directory has this attribute set, all modifications on it are synchronous and written back to disk immediately.
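These attributes are managed with the standard lsattr(1) and chattr(1) utilities from e2fsprogs. A minimal sketch (file names are illustrative; setting +i or +a requires root, and the flags only apply on ext-family filesystems):

```shell
# Create a scratch file and inspect its attribute flags.
touch demo.txt
# lsattr works as an ordinary user, but only on ext2/ext3-style filesystems.
lsattr demo.txt 2>/dev/null || echo "attributes not supported on this filesystem"
# As root, one would set and clear attributes like this:
#   chattr +a /var/log/app.log   # append-only: writes may only append
#   chattr +i /etc/app.conf      # immutable: no write, rename, link, or delete
#   chattr -i /etc/app.conf      # clear the flag again
rm -f demo.txt
```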

The Unix filesystem is a classic, but a classic has its own problems: it is actually an old and largely outdated filesystem that has outlived its usefulness. Later ideas implemented in HPFS, BFS and several other more modern filesystems are absent from plain-vanilla implementations of Unix file systems. Balanced trees now serve as the basis of most modern filesystems, including ReiserFS (which started as an NTFS clone but acquired some unique features in the process of development):

The Reiser filesystem by Hans Reiser [and Moscow University researchers] is a very ambitious project, aiming not only to improve performance and add journaling, but to redefine the filesystem as a storage repository for arbitrarily complex objects. ReiserFS is faster than ext2/3 because it uses balanced trees for its directory structures. It was used by SUSE and Gentoo.

Unfortunately, the novel feature introduced in HPFS called extended attributes never got traction in other filesystems. Of course, the fundamental decision to make attributes indexable deserves closer examination, given the costs of indexing, but still the fixed set of attributes (as in UFS) created too many problems to ignore the issue. Still, I think extended attributes should be present in a filesystem, and they could replace such kludges as the #! notation used in Unix for specifying the default processor of executable files.
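As an illustration of the idea, Linux user-space extended attributes could carry an "interpreter" property instead of a #! line. The attribute name user.interpreter and the whole scheme are hypothetical, and not every filesystem supports user xattrs, hence the None fallback:

```python
import os
import tempfile

def set_interpreter(path, interp):
    """Attach an interpreter name to a file as a user xattr.

    Returns the stored value, or None when the filesystem (or OS)
    does not support user extended attributes.
    """
    try:
        os.setxattr(path, b"user.interpreter", interp.encode())
        return os.getxattr(path, b"user.interpreter").decode()
    except (OSError, AttributeError):  # unsupported filesystem, or non-Linux
        return None

with tempfile.NamedTemporaryFile(suffix=".script", delete=False) as tf:
    script = tf.name
stored = set_interpreter(script, "/bin/sh")
os.unlink(script)
```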



Old News ;-)

[Jan 05, 2012] Scalpel

Scalpel is a fast file carver that reads a database of header and footer definitions and extracts matching files or data fragments from a set of image files or raw device files. Scalpel is filesystem-independent and will carve files from FATx, NTFS, ext2/3, HFS+, or raw partitions. It is useful for both digital forensics investigation and file recovery.
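The header/footer carving technique Scalpel implements can be sketched in a few lines. The JPEG start/end markers below are real; the max_size cutoff and the function name are illustrative:

```python
# Minimal sketch of header/footer file carving, the technique Scalpel uses.
HEADER = b"\xff\xd8\xff"   # JPEG SOI marker
FOOTER = b"\xff\xd9"       # JPEG EOI marker

def carve(image: bytes, max_size: int = 1 << 20):
    """Return byte strings that look like complete JPEG files."""
    start = 0
    results = []
    while True:
        h = image.find(HEADER, start)
        if h == -1:
            break                      # no more headers
        f = image.find(FOOTER, h + len(HEADER))
        if f == -1:
            break                      # no footer left in the image
        if f - h > max_size:
            start = h + 1              # implausibly large: skip this header
            continue
        results.append(image[h:f + len(FOOTER)])
        start = f + len(FOOTER)
    return results
```

A real carver would also handle overlapping candidates and write each match out to a numbered file, as Scalpel does.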

Notes on Platforms


The preferred platform for using Scalpel is Linux.


Scalpel will also compile under Windows (32 or 64-bit) using mingw. If you'd like to try Scalpel on Windows without the bother of compiling it yourself, an executable and appropriate libraries are included in the distribution--just untar and go. Note that under Windows, the pthreads DLL must be present in the same directory as the Scalpel executable. Carving physical and logical devices directly under Windows (e.g., using \\.\physicaldrive0 as a target) is not supported in the current release.

Mac OS X

As of v1.53, Scalpel is supported on Mac OS X.

All platforms

As of v1.54, Scalpel supports carving files larger than 4GB on all platforms.

As of v1.60, Scalpel supports preview carving and other new carving modes. See the distribution for details.

As of v2.0, Scalpel supports regular expressions for headers and footers, minimum carve sizes, multithreading and asynchronous I/O, and beta-level support for GPU-accelerated file carving.

[Oct 23, 2009] Linux Overview for Solaris Users Open desktop mechanic

Interesting, albeit outdated discussion

It also doesn't tell how messy Solaris's VFS is ... and nothing about Sun's marketing bullshit like ZFS.

Posted by mamamia on May 24, 2006 at 05:40 PM GMT+00:00

Do you have any details you can share here? It doesn't go into scaling deficiencies in GNU/Linux NFS clients or defects in popular GNU/Linux filesystems either:

But it is a useful guide for those of us who recognize that there are applications where Solaris is a better fit and there are other applications where a GNU/Linux dist is a better fit. I haven't met anyone who has used ZFS, understood it and still shares your opinion. In my opinion, the only thing Sun's marketing might be guilty of is focusing on catchy and confusing names for technology and not enough on explaining how unique and useful the technology is. As I write this, I'm transfering my DV video and photos from an HFS+ volume to a ZFS pool. I'm looking forward to the day when Apple, Linux and Microsoft have a filesystem which can raid, resilver and compress as easily as ZFS and which can validate that what I read from disk is what I wrote. Regardless of what name marketing comes up for these features, I doubt I'm the only one who finds them useful.

Posted by bnitz on May 24, 2006 at 11:33 PM GMT+00:00 #

Perhaps he's referring to the fact that ZFS still hasn't shipped, yet Solaris people (I am guilty of this, too) have been touting it for the past year as an advantage over Linux. ZFS is not yet a supported, shipped Solaris feature, no matter how much we wish it was.

Posted by jofa beetz on May 25, 2006 at 12:10 AM GMT+00:00 #

1) Just compare the Linux VFS and the Solaris VFS; the Solaris one is a mess. 2) Due to this, Solaris UFS has a race in ufs_rename(). 3) I have read about ZFS, including the on-disk structures paper, and I can definitely say that: a) 256-byte block pointers suck; b) if you change a single byte, you need to rewrite and re-calculate checksums of several blocks, which can be very large; this also sucks. 4) Instead of packing all unrelated things into a single one as ZFS does, I hope Linux will just get a good API to ask the RAID for recovery of a given block.

Posted by on May 25, 2006 at 08:31 AM GMT+00:00 #

Jofa, That is a fair point. ZFS is still only available via Solaris Express and other unsupported opensolaris releases. I wish Sun could figure out a support model for products which fall between "more stable than Microsoft Windows" and "stable enough to run multibillion dollar businesses for decades." Someone will.

Posted by bnitz on May 25, 2006 at 09:34 AM GMT+00:00 #

If Linux VFS is so wonderful, how come the unionfs guys are having such a horrid time fixing all the corner cases?

Posted by jmansion on May 25, 2006 at 10:06 AM GMT+00:00 #

VFS just wasn't designed for them, probably. Or they don't understand it well. The point is that VFS implements a model and does it very well. On the contrary, Solaris has no VFS at all. A silly methods switcher and the dnlc can't be considered a good model -- just an ugly hack. I also remember how VFS helped to filter reiser4 crap from getting into the kernel. In the case of Solaris, nothing would prevent a stupid developer from doing absolutely wrong things ...

Posted by mamamia on May 25, 2006 at 11:15 AM GMT+00:00 #

mamamia, The ufs_rename race was fixed as of Nevada Build 25; you should see it in Solaris 10 Update 2. I'm not a ZFS expert, so you might want to post your comments and criticisms on one of the ZFS forums or blogs. Nothing I've seen from userland suggests that "256 byte block pointers suck", and if you want ZFS checksumming to behave as it does on other filesystems, do this: zfs set checksum=off {pool name}

Posted by bnitz on May 25, 2006 at 11:30 AM GMT+00:00 #

Nice. How long did it take to fix the rename race? I want checksums, but definitely not at this cost. And it would be really unexpected to see users complaining about internals like the size of block pointers. Given how much marketing bullshit Sun puts everywhere about Solaris/ZFS, I tend to think the time when Sun was for engineers has gone, and the new niche is "all those idiots who prefer a 128-bit fs over a really good and balanced design".

Posted by mamamia on May 26, 2006 at 07:26 PM GMT+00:00 #

I'd hoped this document would help people become familiar with linux and Solaris. I try to avoid tool zealotry. Yes, the rename race was a bug. There are also bugs in GNU/Linux. I have some 40G drives which mysteriously stopped working and are no longer partitionable or detectable by BIOS immediately after ReiserFS/Kernel 2.6 synced them. I have 2 USB memory sticks which seem to have been killed by this Linux problem and I'm familiar with the 2.6 kernel's ability to destroy certain CD-ROM drives.

If your understanding of ZFS is based on the marketing material, then maybe the material is incorrect. Have you tried downloading Solaris Express and benchmarking ZFS against your favorite filesystem? Try ZFS with checksumming on and off. Even if MD5sums were calculated all the way back to the root block for a 1 byte change (they aren't), why should you care if it performs? If you took any other existing filesystem and made the checksum granularity, inode and block pointer size precisely what you consider to be ideal, it would still be missing most of what makes ZFS a well designed filesystem. As a user I don't even care about these parameters as long as I don't bump into the 64k, 640K, 504M, 2G, 8G, 32G, 127.5G... barriers. I'm more interested in the fact that I can just buy another hard drive, add it to the pool and immediately increase the storage in that pool.

Posted by bnitz on May 27, 2006 at 12:25 AM GMT+00:00 #

There is no way to benchmark ZFS against a different filesystem, because ZFS is Solaris-only and UFS2 isn't that modern. If you're going to use it for your .html/.doc files, re-calculating all blocks is OK. But if you need to modify a few thousand files every second and write >1GB/s, such re-calculation is going to be too expensive. If you're that interested in adding another hard drive, you should know this feature has existed for a decade and is called LVM.

Posted by on May 27, 2006 at 09:09 AM GMT+00:00 #

So you have a process which modifies several thousand files a second, writes >1GB/s, and is CPU bound? This is unusual, but it might be a good case for running zfs set checksum=off {pool name}. You'd be no worse off than with other filesystems, which never checksum all data.

I don't think Linux LVM would allow you to grow a single drive to a RAID1. It certainly wouldn't let you do it online and you would have the extra step of increasing the filesystem size to match the new pool and praying that it works. With ZFS it's just a matter of typing zpool add {pool name} {drive}. LVM wouldn't give you clones or rollbacks either.

Posted by bnitz on May 27, 2006 at 11:48 AM GMT+00:00 #

lol. writing >1GB/s is unusual for Sun/Solaris? when you write at those rates, every written/checksummed block is important for performance. and ZFS does much more than needed. at cost of performance, of course.

LVM2 does all things online, including rw snapshots. I can write a few-line shell script that will be named zpool, will add a drive to a given fs and will resize the fs properly. Surprise?

PS. I'm shocked at how effective Sun's marketing department is, especially on its own staff.

Posted by on May 27, 2006 at 12:20 PM GMT+00:00 #

I didn't say >1GB/s is unusual for Solaris; I said >1GB/s CPU bound is unusual, and it is for any x86 OS until disks and I/O bandwidth improve by an order of magnitude or two.

I'm trying to figure out what kind of hardware/os combination you're using. XFS seems to plateau at about 0.5GB/s, EXT3 is worse. EXT2 might be faster, but its lack of journaling and reliable cache flushing makes it unworthy of enterprise class. You don't sound like a Niagara enthusiast. You sound savvy enough to not be fooled into thinking that cache writes == writes and you certainly wouldn't confuse the time of an LVM2/ZFS snapshot with the time to replicate a volume's data. We could compare theoretical deficiencies in ZFS with a theoretical filesystem all day, but I'd like to see some specifics and numbers. I'll be happy to file a ZFS performance bug when I can demonstrate that it is real.

As I said, "LVM wouldn't give you clones or rollbacks either", yes LVM2 can do snapshots, but it doesn't seem to include clones or live rollback capability. Does it allow you to mount a live snapshot? Here are some differences between ZFS and Linux+LVM+Raid.

I doubt I've read as much ZFS marketing material as you have. I learned about ZFS from bloggers (most of whom aren't Sun employees), from man pages and from using it.

Posted by bnitz on May 29, 2006 at 01:33 PM GMT+00:00 #

[Sep 12, 2009] Lies, Damn Lies and File System Benchmarks by Jeffrey B. Layton

Jeff Layton is an Enterprise Technologist for HPC at Dell.
August 11th, 2009 | Linux Magazine

Nine-Year Review of Benchmarks

Recently there was a paper published by Avishay Traeger and Erez Zadok from Stony Brook University and Nikolai Joukov and Charles P. Wright from the IBM T.J. Watson Research Center entitled, "A Nine Year Study of File System and Storage Benchmarking" (Note: a summary of the paper can be found at this link). The paper examines 415 file system and storage benchmarks from 106 recent papers. Based on this examination the paper makes some very interesting observations and conclusions that are, in many ways, very critical of the way "research" papers have been written about storage and file systems. These results are important to good benchmarking. And, stepping back from that, the authors make recommendations on how to perform good benchmarks (or at the very minimum, "better" benchmarks).

The research included papers from the Symposium on Operating Systems Principles (SOSP), the Symposium on Operating Systems Design and Implementation (OSDI), the USENIX Conference on File and Storage Technologies (FAST), and the USENIX Annual Technical Conference (USENIX). The conferences range from 1999 through 2007. The criteria for the selection of papers were fairly involved but focused on papers of good quality that covered benchmarks focusing on performance, not on correctness or capacity. Of the 106 papers surveyed, the researchers included 8 of their own.

When selecting the papers, they used two underlying themes or guidelines for evaluation:

Breaking Down Good Benchmarks

Repetition

One of the simplest things that can be done for a benchmark is to run the benchmark a number of times and report the median or average. In addition, it would be extremely easy (and helpful) to report some measure of the spread of the data such as a standard deviation. This allows the reader to get an idea of what kind of variation they could see if they tried to reproduce the results and it also allows readers to understand the overall performance over a period of time.
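Reporting the mean, standard deviation, and median over repeated runs takes three lines with any statistics library; a sketch in Python (the run times are made up):

```python
import statistics

runs = [41.2, 39.8, 44.1, 40.3, 42.7]  # hypothetical elapsed times in seconds

mean = statistics.mean(runs)      # central tendency
stdev = statistics.stdev(runs)    # sample standard deviation: the spread
median = statistics.median(runs)  # robust against a single outlier run

print(f"mean={mean:.2f}s  stdev={stdev:.2f}s  median={median:.2f}s")
```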

The paper examined the 106 benchmark papers for the number of times the benchmark was run. The table below is from the review paper for all 388 benchmarks examined and is broken down by conference. Since most of the time the data was unclear, it was assumed that each benchmark was run only once.

Table 1 - Statistics of Number of Runs by Conference

Conference   Mean   Std. Dev.   Median
SOSP         2.1    2.4         1
FAST         3.6    3.6         1
OSDI         3.8    4.3         2
USENIX       4.7    6.2         3

It is fairly obvious that the dispersion in the data is quite large. In some cases the standard deviation is as large or larger than the mean value.

Runtime

The next topic examined is the runtime of the benchmark. Of the 388 benchmarks examined, only 198 (51%) specified the elapsed time of the benchmark. From this data, it was found:

Typically run times that are short (less than one minute) are too fast to achieve any sort of steady-state value.

With 49% of the benchmarks having no known runtime and another 28.6% running for less than a minute, easily three-quarters of these results should cause some of your warning bells to start ringing. If there's no data, it's not a benchmark; it's an advertisement.

[Sep 12, 2009] Metadata Performance of Four Linux File Systems by Jeffrey B. Layton

Linux Magazine

In a previous article, the case was made for how low file system benchmarks have fallen. Benchmarks have become a marketing tool, to the point where they are mere numbers and are not of much use. The article reviewed a paper that examined nine years of storage and file system benchmarking and made some excellent observations. The paper also made some recommendations about how to improve benchmarks.

This article isn't so much about benchmarks as a product; rather, it is an exploration looking for interesting observations or trends, or the lack thereof. In particular, this article examines the metadata performance of several Linux file systems using a specific micro-benchmark. Fundamentally, it is an exploration to understand whether there are any metadata performance differences between four Linux file systems (ext3, ext4, btrfs, and nilfs) using a metadata benchmark called fdtree. So now it's time to eat our own dog food and do benchmarking following the recommendations previously mentioned.

Start at the Beginning - Why?

The previous article made several observations about benchmarking, one of which is that storage and file system benchmarks seldom, if ever, explain why the benchmark is being performed. This is a point that should not be underestimated. Specifically, if the reason why the benchmark was performed cannot be adequately explained, then the benchmark itself becomes suspect (it may just be pure marketing material).

Given this point, the reason the benchmark in this article is being performed is to examine or explore whether, and possibly by how much, the metadata performance of four Linux file systems differs under a single metadata benchmark. The goal is not to find which file system is the best, because a single benchmark, fdtree, cannot establish that. Rather, it is to search for differences and contrast the metadata performance of the file systems.

Why is examining the metadata performance a worthwhile exploration? Glad that you asked. There are a number of applications, workloads, and classes of applications that are metadata intensive. Mail servers can be very metadata intensive applications because of the need to read and write very small files. Sometimes databases have workloads that do a great deal of reading and writing of small files. In the world of technical computing, many bioinformatics applications, such as gene sequencing, do a great deal of small reads and writes.

The metadata benchmark used in this article is called fdtree. It is a simple bash script that stresses the metadata aspects of the file system using standard *nix commands. While it is not the most well known benchmark in the storage and file system world, it is a bit better known in the HPC (High Performance Computing) world.
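fdtree-style metadata stress boils down to timing large numbers of small create/stat/unlink operations. A minimal sketch (the file count and payload size are arbitrary, and this is nowhere near a rigorous benchmark; fdtree itself also exercises directory trees):

```python
import os
import shutil
import tempfile
import time

def metadata_bench(n_files=200, payload=b"x" * 64):
    """Time create, stat, and unlink phases over many small files."""
    d = tempfile.mkdtemp()
    try:
        t0 = time.perf_counter()
        for i in range(n_files):                     # create phase
            with open(os.path.join(d, f"f{i}"), "wb") as fh:
                fh.write(payload)
        create = time.perf_counter() - t0

        t0 = time.perf_counter()
        for i in range(n_files):                     # stat phase
            os.stat(os.path.join(d, f"f{i}"))
        stat = time.perf_counter() - t0

        t0 = time.perf_counter()
        for i in range(n_files):                     # unlink phase
            os.unlink(os.path.join(d, f"f{i}"))
        unlink = time.perf_counter() - t0
        return create, stat, unlink
    finally:
        shutil.rmtree(d, ignore_errors=True)
```

Per the guidelines discussed above, a real run would repeat this many times and report the median and spread, not a single number.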

[Sep 12, 2009] Metadata Performance Exploration Part 2 XFS, JFS, ReiserFS, ext2, and Reiser4 by Jeffrey B. Layton

September 8th, 2009 | Linux Magazine

More performance: We add five file systems to our previous benchmark results to create an "uber" article on metadata file system performance. We follow the "good" benchmarking guidelines presented in a previous article and examine the good, the bad, and the interesting.

[Oct 29, 2008] Guide to Linux Filesystem Mastery

Nice intro...

by Sheryl Calish

What is a "filesystem," anyway? Sheryl Calish explains the concept as well as its practical application

Published November 2004

Although the kernel is the heart of Linux, files are the main vehicles through which users interact with the operating system. This is especially true of Linux because, in the UNIX tradition, it uses the file I/O mechanism to manage hardware devices as well as data files.

Unfortunately, the terminology used to discuss Linux filesystem concepts is a bit confusing for newcomers. The terms filesystem and file system are used interchangeably in the Linux documentation to refer to several different but related concepts. They refer to the data structures as well as the methods that manage the files within the partitions, in addition to specific instances of a disk partition.

To further confuse the uninitiated, these terms are also used to refer to the overall organization of files in a system: the directory tree. Then again, they can refer to each of the subdirectories within the directory tree, as in the /home filesystem. Some hold that these directories and subdirectories cannot truly be called a filesystem unless they each reside on their own disk partition. Nevertheless, others do refer to them as filesystems, contributing to the confusion.

Linux veterans understand, from context, the sense in which these terms are used. Newcomers, however, have an understandably harder time discerning the context.

The overriding objective of this article is to provide enough background to help you discern the context of this terminology for yourself. In the process of untangling the subtleties of the filesystem terminology, however, you will also acquire the knowledge to move beyond the theoretical to the practical application of some very useful related tools.

The article focuses on the Linux disk partitions and file management system features in version 2.4 of the Linux kernel. It also reviews new features available in version 2.6 of the kernel.

[Sep 9, 2008] GNU ddrescue 1.9-pre1 (Development) by Antonio Diaz Diaz

About: GNU ddrescue is a data recovery tool. It copies data from one file or block device (hard disc, cdrom, etc) to another, trying hard to rescue data in case of read errors. GNU ddrescue does not truncate the output file if not asked to. So, every time you run it on the same output file, it tries to fill in the gaps. The basic operation of GNU ddrescue is fully automatic. That is, you don't have to wait for an error, stop the program, read the log, run it in reverse mode, etc. If you use the logfile feature of GNU ddrescue, the data is rescued very efficiently (only the needed blocks are read). Also you can interrupt the rescue at any time and resume it later at the same point.

Changes: The new option "--domain-logfile" has been added. This release is also available in lzip format. To download the lzip version, just replace ".bz2" with ".lz" in the tar.bz2 package name.

[Nov 14, 2007] Anatomy of the Linux SCSI subsystem by M. Tim Jones ([email protected]), Consultant Engineer, Emulex Corp.

The Small Computer Systems Interface (SCSI) is a collection of standards that define the interface and protocols for communicating with a large number of devices (predominantly storage related). Linux® provides a SCSI subsystem to permit communication with these devices. Linux is a great example of a layered architecture that joins high-level drivers, such as disk or CD-ROM drivers, to a physical interface such as Fibre Channel or Serial Attached SCSI (SAS). This article introduces you to the Linux SCSI subsystem and discusses where this subsystem is going in the future.

[Oct 30, 2007] Anatomy of the Linux file system by M. Tim Jones ([email protected]), Consultant Engineer, Emulex Corp.

When it comes to file systems, Linux® is the Swiss Army knife of operating systems. Linux supports a large number of file systems, from journaling to clustering to cryptographic. Linux is a wonderful platform for using standard and more exotic file systems and also for developing file systems. This article explores the virtual file system (VFS), sometimes called the virtual filesystem switch, in the Linux kernel and then reviews some of the major structures that tie file systems together.


data=writeback

While the writeback option provides lower data consistency guarantees than the journal or ordered modes, some applications show very significant speed improvement when it is used. For example, speed improvements can be seen when heavy synchronous writes are performed, or when applications create and delete large volumes of small files, such as delivering a large flow of short email messages. The results of the testing effort described in Chapter 3 illustrate this topic.

When the writeback option is used, data consistency is similar to that provided by the ext2 file system. However, file system integrity is maintained continuously during normal operation in the ext3 file system.

In the event of a power failure or system crash, the file system may not be recoverable if a significant portion of data was held only in system memory and not on permanent storage. In this case, the filesystem must be recreated from backups, and changes made since the file system was last backed up are lost.
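The writeback mode is selected per mount; in /etc/fstab it looks like this (device name and mount point are placeholders):

```
/dev/sda3   /var/mail   ext3   defaults,data=writeback   1 2
```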

[Aug 7, 2007] Linux Replacing atime

August 7, 2007 | KernelTrap

Submitted by Jeremy on August 7, 2007 - 9:26am.

In a recent lkml thread, Linus Torvalds was involved in a discussion about mounting filesystems with the noatime option for better performance, "'noatime,data=writeback' will quite likely be *quite* noticeable (with different effects for different loads), but almost nobody actually runs that way."

He noted that he set O_NOATIME when writing git, "and it was an absolutely huge time-saver for the case of not having 'noatime' in the mount options. Certainly more than your estimated 10% under some loads."

The discussion then looked at using the relatime mount option to improve the situation, "relative atime only updates the atime if the previous atime is older than the mtime or ctime. Like noatime, but useful for applications like mutt that need to know when a file has been read since it was last modified."

Ingo Molnar stressed the significance of fixing this performance issue, "I cannot over-emphasize how much of a deal it is in practice. Atime updates are by far the biggest IO performance deficiency that Linux has today. Getting rid of atime updates would give us more everyday Linux performance than all the pagecache speedups of the past 10 years, _combined_." He submitted some patches to improve relatime, and noted about atime:

"It's also perhaps the most stupid Unix design idea of all times. Unix is really nice and well done, but think about this a bit: 'For every file that is read from the disk, lets do a ... write to the disk! And, for every file that is already cached and which we read from the cache ... do a write to the disk!'"
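The O_NOATIME trick Linus describes is available from any language that can pass raw open(2) flags. A Python sketch (the flag is Linux-only and only permitted on files you own, hence the getattr fallback to a no-op):

```python
import os
import tempfile

# O_NOATIME exists only on Linux; fall back to 0 (no effect) elsewhere.
O_NOATIME = getattr(os, "O_NOATIME", 0)

with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"hello")
    path = tf.name

# Reads through this descriptor will not dirty the inode with atime updates.
fd = os.open(path, os.O_RDONLY | O_NOATIME)
try:
    data = os.read(fd, 16)
finally:
    os.close(fd)
    os.unlink(path)
```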

[Jun 6, 2007] Funtoo Filesystem Guide

This series was originally called "Advanced Filesystem Implementor's Guide" and was published on IBM developerWorks.

[Jun 19, 2006] Tuning UFS in the Solaris OS for Use With Many Small Files


This document describes a methodology for configuring a fast file system that handles several small files on the Solaris Operating System. This could be used for building a Java technology-based product or for handling many operations on a large amount of small files. This methodology utilizes a tmpfs volume, and it can speed up operations approximately three times.

The requirements are as follows:

Warning: Do not develop on a tmpfs volume. A tmpfs volume is only persistent while the system is powered up, so a power loss or system problem will cause you to lose any changes to that volume.


Solaris tmpfs volumes are easy to create, but require a significant amount of RAM and swap space. It is recommended that you have at least 1 Gbyte of RAM, but there have also been major performance gains on systems with 512 Mbytes of RAM. In addition, you should add twice as much swap space as the tmpfs volume you are creating. That is, for a 2-Gbyte tmpfs volume, add 4 Gbytes of swap space to the system. Feel free to experiment with these values.

The following examples are for a 2-Gbyte tmpfs volume, which is approximately what is needed to do a developer build. Replace <swapfilename> with the absolute path to a swapfile (such as /disk1/swapfile), and <mountpoint> with the absolute path to where you want the tmpfs volume mounted (such as /ramdisk).

Add swap space to your workstation:

root# /usr/sbin/mkfile 2000m <swapfilename>

Create a mount point for the tmpfs volume:

root# mkdir <mountpoint>

Edit your /etc/vfstab file to use the swap and create the tmpfs volume at boot time. Add the following two lines:

<swapfilename> - - swap - no -
RAMDISK - <mountpoint> tmpfs - yes size=2000m

Note that on the Solaris 7 OS you may not make a single tmpfs volume larger than 2 Gbytes.

Edit your kernel parameters to increase the number of files you can create in the tmpfs volume. Add the following line to your /etc/system file. (We've had the most success using this value.)

set tmpfs:tmpfs_maxkmem=250000000

Reboot your workstation. Then verify that the tmpfs volume exists at the size you specified:

% df -k <mountpoint>

Make the tmpfs volume writable. Note: This step is necessary after each reboot of the workstation.

root# chmod 777 <mountpoint>

[May 16, 2006] Debian Administration Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch

There are a lot of Linux filesystems comparisons available but most of them are anecdotal, based on artificial tasks or completed under older kernels. This benchmark essay is based on 11 real-world tasks appropriate for a file server with older generation hardware (Pentium II/III, EIDE hard-drive).

Since its initial publication, this article has generated a lot of questions, comments and suggestions to improve it. Consequently, I'm currently working hard on a new batch of tests to answer as many questions as possible (within the original scope of the article).

Results will be available in about two weeks (May 8, 2006)

Many thanks for your interest, and keep in touch!


Why another benchmark test?

I found two quantitative and reproducible benchmark testing studies using the 2.6.x kernel (see References). Benoit (2003) implemented 12 tests using large files (1+ GB) on a Pentium II 500 server with 512MB RAM. This test was quite informative, but its results are beginning to age (kernel 2.6.0) and mostly apply to settings that manipulate large files exclusively (e.g., multimedia, scientific, databases).

Piszcz (2006) implemented 21 tasks simulating a variety of file operations on a PIII-500 with 768MB RAM and a 400GB EIDE-133 hard disk. To date, this testing appears to be the most comprehensive work on the 2.6 kernel. However, since many tasks were "artificial" (e.g., copying and removing 10 000 empty directories, touching 10 000 files, splitting files recursively), it may be difficult to transfer some conclusions to real-world settings.

Thus, the objective of the present benchmark testing is to complement some of Piszcz's (2006) conclusions by focusing exclusively on real-world operations found in small-business file servers (see Tasks description).

Test settings

Description of selected tasks

The sequence of 11 tasks (from creation of the FS to unmounting it) was run as a Bash script and completed three times (the average is reported). Each sequence takes about 7 min. Time to complete each task (in secs), percentage of CPU dedicated to the task, and the number of major/minor page faults during the task were computed by the GNU time utility (version 1.7).


Partition capacity

Initial (after filesystem creation) and residual (after removal of all files) partition capacity was computed as the ratio of the number of available blocks to the number of blocks on the partition. Ext3 has the worst initial capacity (92.77%), while the other FS preserve almost full partition capacity (ReiserFS = 99.83%, JFS = 99.82%, XFS = 99.95%). Interestingly, the residual capacity of Ext3 and ReiserFS was identical to the initial, while JFS and XFS lost about 0.02% of their partition capacity, suggesting that these FS can dynamically grow but do not completely return to their initial state (and size) after file removal.
Conclusion: To use the maximum of your partition capacity, choose ReiserFS, JFS or XFS.
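The capacity metric described above can be sketched with a one-line awk computation over df -k output: available blocks divided by total blocks. The device name and block counts below are made up for illustration; in a real run you would feed the actual df -k line for the test partition.

```shell
# Hypothetical df -k output line for the test partition:
#   Filesystem   1k-blocks   Used    Available  Use%  Mounted on
df_line="/dev/hda2 19550344 32828 19517516 1% /mnt/test"

# Capacity = 100 * available blocks / total blocks, as in the article.
capacity=$(echo "$df_line" | awk '{ printf "%.2f", 100 * $4 / $2 }')
echo "$capacity"   # prints 99.83 for the figures above
```

The same ratio taken right after mkfs gives the initial capacity; taken after deleting every file, it gives the residual capacity.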

File system creation, mounting and unmounting

The creation of FS on the 20GB test partition took 14.7 secs for Ext3, compared to 2 secs or less for the other FS (ReiserFS = 2.2, JFS = 1.3, XFS = 0.7). However, ReiserFS took 5 to 15 times longer to mount the FS (2.3 secs) than the other FS (Ext3 = 0.2, JFS = 0.2, XFS = 0.5), and also 2 times longer to umount the FS (0.4 sec). All FS took comparable amounts of CPU to create the FS (between 59% - ReiserFS and 74% - JFS) and to mount it (between 6 and 9%). However, Ext3 and XFS took about 2 times more CPU to umount (37% and 45%) compared to ReiserFS and JFS (14% and 27%).
Conclusion: For quick FS creation and mounting/unmounting, choose JFS or XFS.

Operations on a large file (ISO image, 700MB)

The initial copy of the large file took longer on Ext3 (38.2 secs) and ReiserFS (41.8) than on JFS and XFS (35.1 and 34.8). The recopy on the same disk favored XFS (33.1 secs) over the other FS (Ext3 = 37.3, JFS = 39.4, ReiserFS = 43.9). The ISO removal was about 100 times faster on JFS and XFS (0.02 sec for both), compared to 1.5 secs for ReiserFS and 2.5 secs for Ext3! All FS took comparable amounts of CPU to copy (between 46 and 51%) and to recopy the ISO (between 38 and 50%). ReiserFS used 49% of CPU to remove the ISO, while the other FS used about 10%. There was a clear trend of JFS using less CPU than any other FS (about 5 to 10% less). The number of minor page faults was quite similar between FS (ranging from 600 - XFS to 661 - ReiserFS).
Conclusion: For quick operations on large files, choose JFS or XFS. If you need to minimize CPU usage, prefer JFS.

Operations on a file tree (7500 files, 900 directories, 1.9GB)

The initial copy of the tree was quicker for Ext3 (158.3 secs) and XFS (166.1) than for ReiserFS and JFS (172.1 and 180.1). Similar results were observed during the recopy on the same disk, which favored Ext3 (120 secs) over the other FS (XFS = 135.2, ReiserFS = 136.9 and JFS = 151). However, the tree removal took about 2 times longer for Ext3 (22 secs) than for ReiserFS (8.2 secs), XFS (10.5 secs) and JFS (12.5 secs)! All FS took comparable amounts of CPU to copy (between 27 and 36%) and to recopy the file tree (between 29% - JFS and 45% - ReiserFS). Surprisingly, ReiserFS and XFS used significantly more CPU to remove the file tree (86% and 65%) while the other FS used about 15% (Ext3 and JFS). Again, there was a clear trend of JFS using less CPU than any other FS. The number of minor page faults was significantly higher for ReiserFS (total = 5843) than for the other FS (1400 to 1490). This difference appears to come from a higher rate (5 to 20 times) of page faults for ReiserFS during recopy and removal of the file tree.
Conclusion: For quick operations on a large file tree, choose Ext3 or XFS. Benchmarks from other authors have supported the use of ReiserFS for operations on large numbers of small files. However, the present results on a tree comprising thousands of files of various sizes (10KB to 5MB) suggest that Ext3 or XFS may be more appropriate for real-world file server operations. Even if JFS minimizes CPU usage, it should be noted that this FS comes with significantly higher latency for large file tree operations.

Directory listing and file search into the previous file tree

The complete (recursive) directory listing of the tree was quicker for ReiserFS (1.4 secs) and XFS (1.8) than for Ext3 and JFS (2.5 and 3.1). Similar results were observed during the file search, where ReiserFS (0.8 sec) and XFS (2.8) yielded quicker results than Ext3 (4.6 secs) and JFS (5 secs). Ext3 and JFS took comparable amounts of CPU for the directory listing (35%) and the file search (6%). XFS took more CPU for the directory listing (70%) but a comparable amount for the file search (10%). ReiserFS appears to be the most CPU-intensive FS, with 71% for the directory listing and 36% for the file search. Again, the number of minor page faults was 3 times higher for ReiserFS (total = 1991) than for the other FS (704 to 712).
Conclusion: Results suggest that, for these tasks, filesystems can be grouped as (a) quick but more CPU-intensive (ReiserFS and XFS) or (b) slower but less CPU-intensive (Ext3 and JFS). XFS appears to be a good compromise, with relatively quick results, moderate CPU usage and an acceptable rate of page faults.
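The two tasks above are typically scripted as a recursive listing and a name-based search. A small illustrative version, run against a throwaway tree (the tree layout and file names are hypothetical, not the benchmark's actual 7500-file tree):

```shell
# Build a tiny stand-in tree in a temporary directory.
tree=$(mktemp -d)
mkdir -p "$tree/a/b" "$tree/c"
touch "$tree/a/file1.txt" "$tree/a/b/file2.log" "$tree/c/file3.txt"

ls -R "$tree" > /dev/null                           # complete recursive listing
matches=$(find "$tree" -name '*.txt' | wc -l | tr -d ' ')   # name-based search
echo "$matches"    # prints 2
rm -rf "$tree"
```

Wrapping each command in GNU time, as the benchmark does, would additionally report the CPU percentage and page fault counts discussed above.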


These results replicate previous observations from Piszcz (2006) about the reduced disk capacity of Ext3, the longer mount time of ReiserFS and the longer FS creation time of Ext3. Moreover, like this report, both earlier reviews observed that JFS has the lowest CPU usage of any FS. Finally, this report appears to be the first to show the high page fault activity of ReiserFS for most usual file operations.

While recognizing the relative merits of each filesystem, only one filesystem can be installed on each partition/disk. Based on all testing done for this benchmark essay, XFS appears to be the most appropriate filesystem to install on a file server for home or small-business needs:

While Piszcz (2006) did not explicitly recommend XFS, he concludes that "Personally, I still choose XFS for filesystem performance and scalability". I can only support this conclusion.


Benoit, M. (2003). Linux File System Benchmarks.

Piszcz, J. (2006). Benchmarking Filesystems Part II. Linux Gazette, 122 (January 2006).


Anatomy of a Read and Write Call - 21k By Pat Shuff

2002-09-20 (Linux Journal)

We look at three different tactics for optimizing read and write performance under Linux.

A few years ago I was tasked with making the Spec96 benchmark suite produce the fastest numbers possible using the Solaris Intel operating system and Compaq Proliant servers. We were given all the resources that Sun Microsystems and Compaq Computer Corporation could muster to help take both companies to the next level in Unix computing on the Intel architecture. Sun had just announced its flagship operating system on the Intel platform and Compaq was in a heated race with Dell for the best departmental servers. Unixware and SCO were the primary challengers since Windows NT 3.5 was not very stable at the time and no one had ever heard of an upstart graduate student from overseas who thought that he could build a kernel that rivaled those of multi-billion dollar corporations.

Now, many years later, Linux has gained considerable market share and is the de facto Unix for all the major hardware manufacturers on the Intel architecture. In this article, I will attempt to take the lessons learned from this tuning exercise and show how they can be applied to the Linux operating system.

As it turned out, the gcc benchmark was the one that everyone seemed to be improving on the most. As we analyzed what the benchmark was doing, we found that it basically opened a file, read its contents, created a new file, wrote new contents, then closed both files. It did this over and over and over. File operations proved to be the bottleneck in performance. We tried faster processors with insignificant improvement. We tried processors with huge (at the time) level 1 and level 2 caches and still found no significant improvement. We tried using a gigabyte of memory and found little or no improvement. By using the vmstat command, we found that the processor was relatively idle and little memory was being used, but we were getting a significant number of reads and writes to the root disk. Using the same hardware and same test programs, Unixware was 25% faster than Solaris Intel. Initially, we decided that Solaris was just really slow. Unfortunately, I was working for Sun at the time, and this was not an answer we could take to management. We had to figure out why it was slow and make recommendations on how to improve the performance. The target was 25% faster than Unixware, not slower.

The first thing that we did was to look at the configurations. It turned out that the two systems were identical hardware; we just booted a different disk to run the other operating system. The Unixware system was configured with /tmp as a tmpfs, whereas the Solaris system had /tmp on the root file system. We changed the Solaris configuration to use tmpfs, but it did not significantly improve performance. Later, we found that this was due to a bug in the tmpfs implementation on Solaris Intel. By breaking down the file operation, we decided to focus on three areas: the libc interface, the inode/dentry layer, and the device drivers managing the disk. In this article, we will look at the three different layers and talk about how to improve performance and how they specifically apply to Linux.

LISA 2001 Paper LISA 2001 Paper about RUF

This paper describes a utility named ruf that reads files from an unmounted file system. The files are accessed by reading disk structures directly so the program is peculiar to the specific file system employed. The current implementation supports the *BSD FFS, SunOS/Solaris UFS, HP-UX HFS, and Linux ext2fs file systems. All these file systems derive from the original FFS, but have peculiar differences in their specific implementations.

The utility can read files from a damaged file system. Since the utility attempts to read only those structures it requires, damaged areas of the disk can be avoided. Files can be accessed by their inode number alone, bypassing damage to structures above it in the directory hierarchy.

The functions of the utility are available in a library named libruf. The utility and library are available under the BSD license.


There are many important reasons for being able to access unmounted file systems, the prime example being a damaged disk. This paper describes a utility that can be used to read a disk file without mounting the file system. The utility behaves similarly to the regular cat utility; it was originally named dog, but was renamed to ruf (for "reading unmounted filesystems") to avoid a name conflict with an older utility.

In order to access an unmounted file system, the utility must read the disk structures directly and perform all the tasks normally performed by the operating system; this requires a detailed understanding of how the file system is implemented. Implementing this utility for a particular file system is an interesting academic exercise and a good way to learn about the file system. The original work on this utility was in fact done in Evi Nemeth's system administration class.
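A toy version of what a tool like ruf has to do is to read an on-disk structure directly, without mounting anything. The sketch below fabricates a 2 KB image containing only the ext2 magic number (0xEF53, little-endian), which lives at byte 56 of the superblock, itself starting at byte 1024, and then decodes that field straight from the image. The image file is a scratch file created here purely for illustration.

```shell
# Fabricate a tiny image and plant the ext2 magic at its on-disk offset.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1024 count=2 2>/dev/null
printf '\123\357' | dd of="$img" bs=1 seek=$((1024 + 56)) conv=notrunc 2>/dev/null
# (octal \123 \357 = bytes 0x53 0xEF, the magic in little-endian order)

# Decode the field directly, as a recovery tool would on a real disk:
magic=$(dd if="$img" bs=1 skip=$((1024 + 56)) count=2 2>/dev/null | od -An -tx1 | tr -d ' ')
echo "$magic"    # prints 53ef
rm -f "$img"
```

Run against a real unmounted ext2/ext3 partition (with read permission on the device), the same dd/od pipeline yields the same two bytes, which is exactly the kind of direct structure decoding the paper describes, extended to group descriptors, inodes and data blocks.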

Getting to know the Solaris filesystem, Part 1 - SunWorld - May 1999

Richard starts this journey into the Solaris filesystem by looking at the fundamental reasons for needing a filesystem and at the functionality various filesystems provide. In this first part of the series, you'll examine the evolution of the Solaris filesystem framework, moving into a study of major filesystem features. You'll focus on filesystems that store data on physical storage devices -- commonly called regular or on-disk filesystems. In future articles, you'll begin to explore the performance characteristics of each filesystem, and how to configure filesystems to provide the required levels of functionality and performance. Richard will also delve into the interaction between Solaris filesystems and the Solaris virtual memory system, and how it all affects performance.

Getting to know the Solaris filesystem, Part 3 - SunWorld - July ...

One of the most important features of a filesystem is its ability to cache file data. Ironically, however, the filesystem cache isn't implemented in the filesystem. In Solaris, the filesystem cache is implemented in the virtual memory system. In Part 3 of this series on the Solaris filesystem, Richard explains how Solaris file caching works and explores the interactions between the filesystem cache and the virtual memory system.

[Mar 7, 2005] CacheKit

CacheKit is a collection of freeware Perl and shell programs that report on cache activity on a Solaris 8 SPARC server. Tools for older Solaris and Solaris x86 are also included in the kit, as well as some SE Toolkit programs and extra Solaris 10 DTrace programs. The caches the kit reports on are: I$, D$, E$, DNLC, inode cache, UFS buffer cache, segmap cache and segvn cache. This kit assists performance tuning.

[Mar 7, 2005] Solaris Tunable Parameters Reference Manual

[Mar 7, 2005] Sun Solaris (Intel) Data Recovery Software for volume recovery

Linux vies with Oracle

Here's a remarkable tale about a company that replaced an Oracle database cluster with a few Linux servers.

The great thing about this story is that the Linux servers did not run database software. The Oracle database had been converted and stored on the Linux hard disk as a collection of some 100,000 files. The work was part of a major application upgrade that involved redesigning all the components of a busy web site.

It's a stunning tale, but it's obviously an exceptional case. There's absolutely no suggestion that Oracle databases are old, slow, poor or anything else of the sort.

But this story shows how design decisions taken a few years ago can rapidly be undermined by new technologies. In this case, the original site served pages from an Oracle back-end via J2EE middleware. The new system uses a Linux back-end and Java XML middleware. First time round, Oracle and the J2EE framework were the best choice, but a few years later Java and XML had matured. The redesign enabled some 40,000 lines of J2EE code to be replaced by 5,000 lines of Java/XML.

The Oracle replacement was another spin-off from redesigning the middleware. This time it was enabled by the Reiser File System (ReiserFS) - a relatively new development that is already the default in Suse, Lindows and Gentoo Linux, largely because it's a journaling file system, so it doesn't lose data following "unplanned outages". Linux servers don't crash very much, but power fails sometimes, so a robust file system is a definite advantage.

ReiserFS uses an improved version of the same basic tree indexing scheme as some database engines. Thus ReiserFS is often very fast and efficient compared with the traditional file systems of Linux and Windows. Of course, for most datasets and applications it's probably not as fast as Oracle's sophisticated database cluster. But in this case a bit less performance was acceptable, and the new option of ReiserFS contributed to the demise of one underused Oracle database.

One might imagine switching from an Oracle cluster to a Linux file system produced huge cost savings, but it's not that simple. The money saved was in fact used to pay for two new staff - software developers who also contribute to the open-source application server used in the new Java XML architecture.

So the decision to move away from Oracle and J2EE was not driven simply by costs. Here, the web site is the firm's core business, and fixing the feature set and technical agenda of its core business to one supplier seemed a poor choice. Now the firm has influence over the features and direction of the application server.

None of this is to disparage other Oracle databases, and this is not a tale about open source versus commercial software. Rather it's about choosing the best technologies available, and re-examining those choices from time to time.

ReiserFS is unlikely to be the ultimate file system. It will probably soon seem old hat compared with the next big thing. My vote would be for something like Coda - a research implementation of a fault-tolerant, distributed file system for long-latency IP networks. Meanwhile, keep building and rebuilding - cut costs and prosper.

Design, Features, and Applicability of Solaris File Systems

  1. Understanding What a File System Is
  2. Understanding File System Taxonomy
  3. Understanding Local File System Functionality
  4. Understanding Differences Between Types of Shared File Systems
  5. Understanding How Applications Interact With Different Types of File Systems
  6. Conclusions
  7. About the Author


Ext2 compatibility (Score:5, Informative)
by Wise Dragon (71071) on Thursday August 14, @02:42PM (#6698412)

Dude, there are papers published about Ext2fs which describe the data structures in exquisite detail. You don't need to look at the code to write an ext2fs clone. I have written proprietary utilities to access ext2fs data structures. I know what I am talking about.

In addition, there are various commercial tools that read and write ext2, such as
Ext2fs Anywhere

So in that case, you're full of crap. I don't know if I am really qualified to comment on the other case, but doesn't BSD have linux compatibility? And isn't BSD available under a much less restrictive license? They could just adapt that code.

Series of interesting papers Project Info - AVFS A Virtual Filesystem

AVFS is a system that enables all programs to look inside gzip, tar, zip, etc. files, or view remote (ftp, http, dav, etc.) files, without recompiling the programs.

Journal File Systems LG #55

As Linux grows up, it aims to satisfy the needs of different users and potential situations. During recent years, we have seen Linux acquire different capabilities and be used in many heterogeneous situations. We have Linux inside micro-controllers, Linux router projects, one-floppy Linux distributions, partial 3-D hardware speedup support, multi-head XFree support, Linux games and a bunch of new window managers as well. Those are important features for end users. There has also been a huge step forward for Linux server needs - mainly as a result of the 2.2.x Linux kernel switch. Furthermore, sometimes as a consequence of industry support and at other times leveraged by Open Source community efforts, Linux is being provided with the most important features of commercial UNIX and large servers. One of these features is the support of new file systems able to deal with large hard-disk partitions, scale up easily to thousands of files, recover quickly from crashes, increase I/O performance, behave well with both small and large files, decrease internal and external fragmentation, and even implement new file system abilities not yet supported by the older ones.

This article is the first in a series of two in which the reader will be introduced to the journal file systems JFS, XFS, Ext3, and ReiserFS. We will also explain the different features and concepts related to these new file systems. The second article is intended to review the journal file systems' behaviour and performance through the use of tests and benchmarks.

New Architect Infrastructure Product Review

FreeBSD uses UFS (the Unix File System), which is a little more complex than Linux's ext2. It offers a better way to ensure filesystem data integrity, mainly with the "softupdates" option. This option decreases synchronous I/O and increases asynchronous I/O because writes to a UFS filesystem aren't synced on a sector basis but according to the filesystem structure. This ensures that the filesystem is always coherent between two updates. In my informal performance tests, softupdates showed significant improvement.

I used two identical boxes, one with Linux and the other with FreeBSD 4.0-RELEASE. I moved a 1.2GB file between two mount points, back and forth. I found that FreeBSD, without softupdates, performs a little slower than Linux. This changed after I added softupdates to the FreeBSD kernel and then updated the mount point (via tunefs). Only then did I notice that FreeBSD's performance was marginally better (10 percent, or so).

These performance tests aren't perfect or anywhere near conclusive. The Linux filesystem can be tweaked for performance; however, currently ext2 gets its performance from having an asynchronous mount. This is great for speed, but if your system crashes it could take out the filesystem, its data, and its current state. Often, a hard crash permanently damages a mount. FreeBSD with softupdates can sustain a very hard crash with only minor data loss, and the filesystem will be remountable with few problems.

Besides performance, FreeBSD UFS also has one major advantage over Linux in security. FreeBSD supports file flags, which can stop a simple script kiddie dead in his tracks. There are several flags that you can add to a file, such as the immutable flag. The immutable (schg) flag won't allow any alteration to the file or directory unless you remove it. Other very handy flags are append only (sappnd), cannot delete (sunlnk), and archive (arch). When you combine these with the kernel security level covered below, you have a very impenetrable system.

libferris

libferris is a virtual filesystem that exposes hierarchical data of all kinds through a common C++ interface.

Access to data is performed using C++ IOStreams, and Extended Attributes (EA) can be attached to each datum to present metadata. Ferris uses a plugin API to read various data sources and expose them as contexts, and to generate interesting EA. Current implementations include Native (kernel disk I/O with event updates using fam), xml (mount an XML file as a filesystem), edb (mount a Berkeley database), ffilter (mount an LDAP filter string) and mbox (mount your mailbox). EA generators include image, audio, and animation decoders.

translucency Linux kernel module 0.5.2 by Bernhard M. Wiedemann - Friday, May 10th 2002 06:42 EDT

About: translucency is a Linux kernel module that virtually merges two directories, making it possible to overwrite files on read-only media and compile projects (such as the Linux kernel) with different options without copying sources each time. No user-space tools have to be changed. The process is also known as inheriting (ifs), stacking, translucency (tfs), loopback (lofs), and overlay (ovlfs).

Changes: This version has enabled ".." handling and improves behavior on existing files.
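The merge the module performs in the kernel can be sketched in user space: a name present in the writable upper layer shadows the same name in the read-only lower layer, while names only in the lower layer show through. The directories, file names and helper function below are all hypothetical, for illustration only.

```shell
# Two layers: a writable upper directory and a read-only lower one.
upper=$(mktemp -d); lower=$(mktemp -d)
echo "upper version" > "$upper/a"    # shadows lower/a
echo "lower version" > "$lower/a"
echo "lower only"    > "$lower/b"

merged_read() {    # resolve a name through the merged view
    if [ -e "$upper/$1" ]; then cat "$upper/$1"; else cat "$lower/$1"; fi
}

out_a=$(merged_read a)   # "upper version" - upper layer wins
out_b=$(merged_read b)   # "lower only"   - falls through to lower layer
echo "$out_a"; echo "$out_b"
rm -rf "$upper" "$lower"
```

The kernel module does this transparently for every process, which is what makes it possible to "write" onto read-only media or build a source tree without copying it.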

[Mar 15, 2002] New Windows could solve age-old puzzle - Tech News

To achieve the long-elusive goal of easily finding information hidden in computer files, Microsoft is returning to a decade-old idea.

The company is building new file organization software that will begin to form the underpinnings of the next major version of its Windows operating system. The complex data software is meant to address a conundrum as old as the computer industry itself: how to quickly find and work with a piece of information, no matter what its format, from any location.

For those using Windows, this will mean easier, faster and more reliable searches for information. Replacing its antiquated file system with modern database technology should also mean a more reliable Windows that's less likely to break and easier to fix when it does, said analysts and software developers familiar with the company's plans.

In the process, the plan could boost Microsoft's high-profile .Net Web services plan and pave the way to enter new markets for document management and portal software, while simultaneously dealing a blow to competitors.

But success won't come overnight. Building a new data store is a massive undertaking, one that will touch virtually every piece of software Microsoft sells. The company plans to include the first pieces of the new data store in the next release of Windows, code-named Longhorn, which is scheduled to debut in test form next year.

"We're going to have to redo the Windows shell; we're going to have to redo Office, and Outlook particularly, to take advantage" of the new data store, Microsoft CEO Steve Ballmer said in a recent interview with CNET "We're working hard on it. It's tough stuff."

Tough indeed. The development of the new file system technology is so difficult that Microsoft may have to market two distinctly different product lines while it completes the work--a move Ballmer concedes would be a huge step backward in the company's long-sought plan to unify its operating systems with Windows XP and Windows .Net Server, which has been delayed until year's end.

For years, Microsoft has sold two operating systems: a consumer version based on the 20-year-old technology DOS, and a corporate version based on the company's newer, built-from-scratch Windows NT kernel. The dual-OS track has frustrated software developers, who needed to support two different operating systems, and has confused customers, who often didn't understand the difference between them.

"Will we have two parallel tracks in the market at once? Not desirable. There are a lot of reasons why that was really a pain in the neck for everybody, and I hope we can avoid that here," Ballmer said. "But it's conceivable that we will wind up with something that will be put on a dual track."

Still, Ballmer and his executive team believe it's a risk well worth taking. Right now, each Windows program includes its own method for storing data, such as the vastly different formats used by Microsoft's Outlook e-mail program and Word document software. Despite advances in Windows' design and networking technology, it's still impossible to search across a corporate network for all e-mails, documents and spreadsheets related to a specific project, for instance. Searching through video, audio and image files is kludgy at best.

Likewise, it's tricky--if not impossible--to build new programs that tap into those files. "If I'm looking for anything where I interacted with one customer in the last 12 months, I need to search for e-mail, Word documents or information in my database," said Chris Pels, president of iDev Technologies, a software consulting and design firm in East Greenwich, R.I. "That kind of stuff is a nightmare from a programming perspective these days."

Other software makers have attempted to solve the same problem. Nearly two years ago, Oracle introduced something called Internet File System, which works with its database server to make storage and retrieval of data--including Microsoft Word and Excel documents--easier and more reliable. "This hasn't been done in a commercial operating system, but it has been done with Oracle's database," said Rob Helm, editor in chief of Directions on Microsoft, crediting Oracle CEO Larry Ellison as an early proponent of the idea.

Oracle continues to challenge Microsoft on this front. Last fall, the company announced an e-mail server option for its 9i database management software along with a migration program to move companies from Microsoft Exchange to Oracle's database.

Yet Oracle's efforts amount to more of a jab between long-time adversaries than a serious competitive challenge. Given Windows' enormous market clout, Microsoft's plan could change the competitive landscape of the software business and affect millions of computer users and technology buyers.

"It's a huge risk for Microsoft," Helm said. "They have so much riding on this. If this is late and doesn't work as advertised, it will have effects that will ripple through the entire company and the industry. But the benefits, if they succeed, will be huge."

Microsoft's first--and perhaps largest--challenges will be internal: how to overcome the technical and organizational obstacles it encountered when it set out to solve the very same problem in the early 1990s. At that time, the company launched an ambitious development project to design and build a new technology called the Object File System, or OFS, which was slated to become part of an operating system project code-named Cairo.

"We've been working hard on the next file system for years, and--not that we've made the progress that we've wanted to--we're at it again," Ballmer said.

While the Cairo project eventually resulted in Microsoft's Windows 2000 operating system, the file system work was abandoned because of complexity, market forces and internal bickering. "It never went away. We just had other things that needed to be done," said Jim Allchin, the group vice president in charge of Windows development.

Those other things most likely included battling "Netscape and Java and the challenge of the Internet and the Department of Justice," Gartner Group analyst David Smith said--issues that continue to persist today.

Microsoft executives say the company plans to resurrect the OFS idea with the Longhorn release of Windows. "This will impact Longhorn deeply, and we will create a new API for applications to take advantage of it," Allchin said.

He said bringing the plan back now makes sense because new technologies such as XML (Extensible Markup Language) will make it much easier to put in place. XML is already a standard for exchanging information between programs and a cornerstone of Microsoft's Web services effort, which is still under development. Longhorn and the new data store are the "next frontier" of software design, Allchin said.

In addition, Microsoft has already developed the database technology it needs for a new file system. A future release of its SQL Server database, code-named Yukon, is being designed to store and manage both standard business data, such as columns of numbers and letters, and unstructured data, such as images. Yukon will also form the data storage core of Microsoft's Exchange Server and other future products.

The more important reasons for the renewed development effort, however, are strategic. If the plan succeeds, it will give Microsoft a huge technological advantage over the competition by making its products more attractive to buyers and giving large companies another reason to install Windows-based servers.

"Having multiple data stores makes life harder for the enterprise customer," Helm said. "Search will become much easier, and this should make it cheaper to build new systems because customers only have to learn one database."

Helm said the database capability in Windows will make it a snap to add document management and more advanced portal development tools. Those applications will in essence be built into the operating system, making it more likely that customers will use them.

Moreover, industry veterans note that the new data store will benefit from Microsoft's tried-and-true strategy for entering new markets--leveraging the overwhelming market share of Windows. Because Microsoft needs the new data store to make its .Net services plan work, analysts say the company is likely to pressure customers to make the move to the Longhorn release of Windows through licensing incentives or other means.

Nevertheless, widespread acceptance is not a foregone conclusion. For big companies not yet ready to install Microsoft's 3-year-old Windows 2000 operating system--much less Windows XP, released last October--the Longhorn plan may be too much to contemplate right now.

"That's the real issue that I see in the trenches: the rate of change--for programmers, for businesses, in terms of making infrastructure technology decisions," Pels said. "People can't keep up with it, and if they want to keep up with it, is it worthwhile for their business?"

Mike Gilpin, an analyst with Giga Information Group, agrees. "It's a great dream," he said. "But it could be hard to make real."

Linux Ext2 filesystem for Windows NT driver

Unix Internals

CptS 302: The Unix Filesystem

Impact of Sudden Power Loss on Journalled Filesystems

"Alan Cox replied tersely, "Which means an ext3 volume cannot be recovered on a hard disk error." And Stephen replied: Depends on the error. If the disk has gone hard-readonly, then we need to recover in core, and that's something which is not yet implemented but is a known todo item."

Large Files in Solaris: A White Paper - by Solaris OS group - 65K .PDF

This document describes Sun's implementation of the Large File Summit's standard for 64 bit file access... including the User level experience of converting existing applications to the new standard.

File System Indexing, and Backup by Jerome H. Saltzer Laboratory for Computer Science Massachusetts Institute of Technology M.I.T. Room NE43-513 Cambridge, Massachusetts 02139 U.S.A.

This paper briefly proposes two operating system ideas: indexing for file systems, and backup by replication rather than tape copy. Both of these ideas have been implemented in various non-operating system contexts; the proposal here is that they become operating system functions.

[Jan 5, 2000] Open source JFS project Web site

IBM's journaled file system technology, currently used in IBM enterprise servers, is designed for high-throughput server environments, key to running intranet and other high-performance e-business file servers. IBM is contributing this technology to the Linux open source community with the hope that some or all of it will be useful in bringing the best of journaling capabilities to the Linux operating system. Work is currently underway to complete the port of this technology to Linux.

Developing JFS

JFS is licensed under the GNU General Public License. If there's a feature that you'd like to see added to JFS, consider becoming a part of the JFS development process. Since JFS is an open source project, it's easy to get involved.

Get the Source

A CVS repository contains the latest stable version of the JFS source code and documentation. All JFS core team members and JFS contributors have read-write access to CVS and WebCVS.

CVS is a system that lets groups of people work simultaneously on groups of files. CERN maintains a Web site with general information on CVS.

For convenience the latest source may be downloaded as jfs-0.0.1.tar.gz.

For details on building the source and a list of ToDo items, examine the README.

Report bugs

Jitterbug is the system for tracking JFS bugs and feature requests. The core team and contributors have read-write access to this database. The community at large has read access through a Web interface.

Jitterbug is a Web-based bug tracking system. It handles bug tracking, problem reports, and queries and is available under the GNU General Public License. General information is available on the JitterBug Web site.

[Dec 12, 1999] First Journaling FS for Linux

ReiserFS article. Interestingly, this project is being funded by SuSE. The filesystem boasts more efficient algorithms and handles small files better.

[Dec 12, 1999] Slashdot Ask Slashdot EXT3

A great way to follow kernel development is to read the excellent kernel mailing list synopses written by Zack Brown at:

ext3fs is a journaled version of ext2fs written by Stephen Tweedie. It's in beta form right now but works pretty well. Stephen and Ted Ts'o talked about ext3fs at our Linux Storage Management Workshop in Darmstadt, Germany.

The ext3 filesystem, of which early alphas are ready (version 0.0.2c, the excitement!!). Development is on the linux-fsdevel mailing list, archived here.

Hello, I've been running ext3 on my laptop computer for about two months now. It works great. Just sync the disks and turn it off. No shutdown. No data loss either.

If you look at e.g. Solaris DiskSuite, you are able to control where you should store your metadata. Say that you want to journal file data also; this normally slows the system down. But if you can specify that all file metadata should be on a separate solid-state disk (naturally mirrored for safety), then journaling of file data will be quick and swift. This is in my view quite important. If I understand everything correctly, you can do that with ext3.

One of the major problems with ext2fs (IMHO) is that it doesn't resize well. This is because there is a copy of every group descriptor in every group [a g.d. contains metadata for a group of blocks/inodes, typically 8M in size]. Therefore enlarging or shrinking the drive causes a major reshuffle of ALL the data; so far, the only utility I know that can do this is resize2fs, which comes with Partition Magic (there are no doubt others now).

This redundancy is good in theory (backups), but keeping a copy of a constant number of group descriptors (perhaps the previous and next 32) in a given group would still give you a lot of redundancy plus make resizing simpler.

Granted, resizing isn't something you do a lot, but having had my system lock up and die while resizing and having to recover using Turbo C++ and the ext2fs spec (code and info on my ext2fs page), it would be nice if ext3fs (or XFS) made this easier.
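The group-descriptor argument above can be made concrete with a toy sketch (this is not ext2's actual on-disk code; the 1024-group volume and the ±32-descriptor window are illustrative numbers taken from the discussion):

```python
# Why growing an ext2 volume under full group-descriptor replication
# rewrites metadata in every block group, versus the windowed scheme
# proposed above. Purely illustrative arithmetic.

def groups_touched_full(total_groups):
    # Full replication: every group holds a copy of ALL descriptors,
    # so adding one group invalidates every copy.
    return total_groups

def groups_touched_windowed(total_groups, new_group, window=32):
    # Hypothetical scheme from the text: each group keeps only the
    # previous/next `window` descriptors, so only nearby copies change.
    return len([g for g in range(total_groups)
                if abs(g - new_group) <= window])

# Growing a 1024-group volume by one group:
full = groups_touched_full(1024)                     # every group rewritten
windowed = groups_touched_windowed(1024, 1024, 32)   # only nearby groups
```

Under full replication all 1024 groups must be rewritten; under the windowed scheme only the 32 neighbouring groups are touched, which is why the poster argues resizing would become far cheaper.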

The Reiser Filesystems by Hans Reiser, a very ambitious project to not only improve performance and add journaling, but to redefine the filesystem as a storage repository for arbitrarily complex objects. reiserfs. Reiserfs is faster than ext2/3 because it uses balanced trees for its directory structures.
The project is now released for 2.2.11 - 2.2.13. Mailing list archive here.

The Xfs site has some docs. The work to unencumber the code is accelerating, and February is the target date for source code release. XFS is the one that I think has the most potential. It's a full logging filesystem from the ground up, not an extension (not that EXT3 or DTFS are bad or misguided efforts). I'm betting it will be the highest-performance filesystem for Linux when it goes gold. I think the tight integration of the log could be a huge plus. It's been a while since filesystem 101, but I would think that there are a ton of ways to optimize performance with log write-back tricks and usage optimizations. You could include a hit counter in metadata and have an optimizer that moves higher-hit files closer to the log in the center of the disk, making your most frequently used files closer to where the head is likely to be. Those kinds of optimizations (if practical, maybe I'm full of it) wouldn't be nearly as easy with ext3, since the FS doesn't have any knowledge of the log. Plus XFS has ACLs and big file support already.
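The hit-counter placement idea above (explicitly speculative, as the poster admits) could be sketched roughly like this; the file names, hit counts, and the choice of alternating placement around the log position are all made up for illustration:

```python
# Toy sketch of "hotter files closer to the log": sort files by hit
# count and assign block positions alternating outward from the log's
# position at the disk centre, so the head travels less between the
# log and frequently used data. Not any real filesystem's policy.

def layout_by_heat(files, log_pos):
    # files: {name: hit_count}; returns {name: block_position}
    order = sorted(files, key=files.get, reverse=True)
    placement = {}
    for i, name in enumerate(order):
        side = 1 if i % 2 == 0 else -1     # alternate sides of the log
        offset = i // 2 + 1                # hottest files land nearest
        placement[name] = log_pos + side * offset
    return placement

placement = layout_by_heat({"a": 10, "b": 5, "c": 1}, log_pos=100)
```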


Stephen also gave a talk on ext3fs at the Linux Kongress in Augsburg, Germany. He is predicting Summer 2000 for production use of ext3fs. Nice features include the fact that ext3fs is backwards compatible with older versions of ext2. In addition, ext3fs uses asynchronous journaling, which means the performance will be as good as or better than ext2fs.
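The essential journaling idea behind ext3fs can be sketched in miniature (this is a toy write-ahead log, not ext3's actual jbd layer; the "inode" record names are invented): updates go to the journal first, a commit record makes them durable, and replay after a crash applies only fully committed transactions.

```python
# Minimal write-ahead-journal sketch: after a crash, replay applies
# committed transactions and discards any uncommitted tail, so the
# filesystem metadata is always consistent without a full fsck.

class Journal:
    def __init__(self):
        self.records = []                 # the on-disk log, append order

    def begin(self):
        self.txn = []

    def write(self, block, data):
        self.txn.append((block, data))    # buffered until commit

    def commit(self):
        self.records.extend(self.txn)
        self.records.append("COMMIT")     # commit marker hits the log last

def replay(records):
    disk, pending = {}, []
    for rec in records:
        if rec == "COMMIT":
            disk.update(pending)          # apply a committed transaction
            pending = []
        else:
            pending.append(rec)           # uncommitted tail is discarded
    return disk

log = Journal()
log.begin(); log.write("inode 7", "size=100"); log.commit()
log.begin(); log.write("inode 9", "half-written")   # crash before commit
disk = replay(log.records)                # only the committed txn survives
```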

I am involved with the SGI effort to port XFS to Linux. The work to unencumber the code is accelerating, and February is the target date for source code release. The read path is working at this time. More work remains however, so stay tuned to

[August 17, 1999] Size: 72kb LREAD v2.3 - Program to read Linux ext2 filesystems on PCs from within DOS

[August 10, 1999] Kernel Traffic

Alan put 2.2.11pre2 up on ftp://ftp.* and posted a changelog against 2.2.10. Linus replied, "Looks good, except aic7xxx is wrong version ;) Tssk, tssk."

One of Alan's changes was "FAT now uses cluster numbering for inode info", which Alexander Viro took exception to.

Another one of Alan's changes was to remove the COMA workaround, and recommend people just use set6x86 if they have that Cyrix CPU bug. Zoltan Boszormenyi said sarcastically that in that case, they might as well remove the f00f bugfix as well. Alan defended the change, and there was a discussion about which fix was enabled in which version and then switched for which other fix.

[June 1, 1999] SGI Contributes World's Most Scalable File System Technology to Open Source Community

Recommended Links

Google matched content

Softpanorama Recommended

Top articles


Recommended Papers


Sprite papers

Extending the Operating System at the User Level the Ufo Global File System by Albert D. Alexandrov, Maximilian Ibel, Klaus E. Schauser, and Chris J. Scheiman, Proceedings of the USENIX 1997 Annual Technical Conference Anaheim, California, January 1997.

In this paper we show how to extend the functionality of standard operating systems completely at the user level. Our approach works by intercepting selected system calls at the user level, using tracing facilities such as the /proc file system provided by many Unix operating systems. The behavior of some intercepted system calls is then modified to implement new functionality. This approach does not require any re-linking or re-compilation of existing applications. In fact, the extensions can even be dynamically ``installed'' into already running processes. The extensions work completely at the user level and install without system administrator assistance.

We used this approach to implement a global file system, called Ufo, which allows users to treat remote files exactly as if they were local. Currently, Ufo supports file access through the FTP and HTTP protocols and allows new protocols to be plugged in. While several other projects have implemented global file system abstractions, they all require either changes to the operating system or modifications to standard libraries. The paper gives a detailed performance analysis of our approach to extending the OS and establishes that Ufo introduces acceptable overhead for common applications even though intercepting system calls incurs a high cost.

Keywords: operating systems, user-level extensions, /proc file system, global file system, global name space, file caching

See also GSCHWIND, M. K. 1994. FTP---Access as a user-defined file system. ACM SIGOPS Oper. Syst. Rev. 28, 2 (Apr.), 73--80.




Versioning File Systems


[Aug 19, 2000] IBM announces AFS as an open source product under the IBM Public License

Re:you probably don't want AFS (Score:1)
by jlrobins_uncc ([email protected]) on Thursday August 17, @05:55PM EDT (#113)
(User #136569 Info)
AFS makes great sense for Web server farms and/or mirrors of the same site across a WAN such as the Internet (think an east coast site and a west coast site). Just edit the file and pow, a server -> client callback notifies any clients caching the file that they need to refetch.

Couple this with having the content in a read-only replicated volume, then go ahead and update many files, get your new site look-and-feel redone, then once you're happy with it, release the read/write volume for replication, and pow -- one atomic transaction to all of the mirroring servers on the WAN!
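The release workflow described above can be sketched as a toy model (illustrative only; real AFS uses `vos release` and the volume location database, and the file names here are invented): edits accumulate invisibly on the read/write volume, and all read-only replicas flip to the new content in one step.

```python
# AFS-style read-only replication in miniature: clients read from the
# replicas, which only change when the read/write master is "released".

class Volume:
    def __init__(self):
        self.rw = {}                            # read/write master
        self.replicas = [dict(), dict()]        # read-only clones elsewhere

    def write(self, name, data):
        self.rw[name] = data                    # invisible to replica readers

    def release(self):
        snapshot = dict(self.rw)
        for i in range(len(self.replicas)):
            self.replicas[i] = snapshot.copy()  # one atomic switch per mirror

vol = Volume()
vol.write("index.html", "new look")
before = vol.replicas[0].get("index.html")      # not visible yet
vol.release()
after = vol.replicas[0]["index.html"]           # visible on every mirror
```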

Maybe this is why AFS is a major component of IBM's Websphere platform. All of this, currently working like a champ, and it'll be free and open source!
---------- Hail Ants!

Why this has so much potential for good. (Score:4, Insightful)
by jlrobins_uncc ([email protected]) on Thursday August 17, @05:47PM EDT (#111)
(User #136569 Info)
AFS is a very stable, tested, enterprise filesystem. It offers the following features:
  • Cross platform: Many UNIXen as well as NT as either client or server.
  • Secure: Uses Kerberos IV for user authentication.
  • Client-side caching: client machines use disk or virtual memory to cache MRU files, greatly reducing # trips down the wire on reads.
  • Unified naming scheme: names of files don't indicate what file server they're on. Makes moving of volumes from one fileserver (or drive on the same fileserver) to another a cinch, since no client-side changes need to happen.
  • Read-only replication: Make your application install directories replicated in each building on campus.

Now, it's not a perfect product, but it is way cooler than vanilla NFSv2 or NFSv3, especially on the server-side management side of things. It doesn't do disconnected operation (which CODA strives to do), byte-range locking, strict UNIX file semantics (data most recently written == data viewable by all file handles to that file), or Kerberos 5, but it is a far simpler system to get running than DCE, which does address some of those issues.
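The client-side caching and callback mechanism mentioned in the feature list can be sketched as a toy model (illustrative; real AFS callbacks are RPC promises with expiry times, and the file names here are invented): the server promises to notify a caching client before its copy goes stale, so reads can hit the local cache without a round trip.

```python
# AFS-style callback caching in miniature: a fetch registers a callback;
# a store on the server "breaks" callbacks, invalidating client caches.

class Server:
    def __init__(self):
        self.files, self.callbacks = {}, {}

    def fetch(self, client, name):
        self.callbacks.setdefault(name, set()).add(client)  # promise made
        return self.files.get(name)

    def store(self, name, data):
        self.files[name] = data
        for client in self.callbacks.pop(name, ()):  # break the callbacks
            client.invalidate(name)

class Client:
    def __init__(self, server):
        self.server, self.cache = server, {}

    def read(self, name):
        if name not in self.cache:                   # miss: go to the server
            self.cache[name] = self.server.fetch(self, name)
        return self.cache[name]                      # hit: no network trip

    def invalidate(self, name):
        self.cache.pop(name, None)

srv = Server()
srv.store("motd", "v1")
c = Client(srv)
first = c.read("motd")       # fetched and cached, callback registered
srv.store("motd", "v2")      # callback break invalidates c's cache
second = c.read("motd")      # refetches the new version
```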

One would hope to see the following things from this open sourcing:

  • *BSD client / servers.
  • MacOS X client (at least!)
  • Millennium / Win2K clients (NT clients exist currently).

If the MacOS X client happens, then there will be a secure, scalable enterprise filesystem for the three major computer platforms -- Wintel, UNIX, and Mac, and it'll even be freely available! I don't believe that there are any products available today that offer secure, robust support for all three platforms (and no, I don't consider protocol translators, such as Samba or CAP, which require you to set up the clients to use cleartext passwords over the wire to authenticate (not to downplay in any way the role of either technology -- it's not their fault that you've got to set up the clients in that fashion to interoperate with AFS as it is now), or using NFSv2 or v3 on the UNIX end to talk to something like Novell 5 (which, AFAIK, doesn't talk at all to Macs anymore)).

This will give us one protocol on the wire, multiple server-side implementations (interoperable in the same cell!), multiple client-side implementations, WAN scalability, and secure authentication. A good day for the world!

As one of the architects designing DFS in IBM (Score:4, Interesting)
by gelfling on Thursday August 17, @08:06PM EDT (#128)
(User #6534 Info)
We've always had a hard time selling DFS internally. In fact we've stopped trying to do that because there weren't enough internal customers. The hurdle costs were too high, the skills were hard to find and expensive, and customers still wanted SMB shares via Samba, which drove the cost even higher. The client-side DCE licence costs drove Samba, since the per-client cost was $65/seat in bulk. AFS as open source can only be a good thing, since we can always find someone to pick up the development and maintenance, and foregoing DCE-Kerberos is really not that big a deal from an internal perspective. In our environment the challenge was to collapse hundreds of LanServer domains. DFS or AFS fit the bill, and the cost dynamics work very well compared to staffing 1 headcount/25-35 servers in the LanServer world. The problem anyone will find, though, is backup and storage management. butc or buta just don't scale very well even with multiple replicas of the fldb core, so whoever tries to manage this, as we did, will be forced to write extensions to their storage management code, as we did with ADSM. Also you will find that Samba doesn't scale nearly as well as you want with only a few hundred accounts on a Samba server, even if it sits on a huge Unix machine. This leaves you with a few hundred or more SMB gateways if you try to scale up to the huge numbers we did.

Once again, AFS open source can only be a good thing - it will propagate a great technology into large sites that shied away from it previously.


Articles IBM Open Sourcing AFS

AFS semantics are very different from UNIX file system semantics: permissions are associated with directories only, access is determined only by the containing directory, if multiple clients modify the same file, updates are lost, you can't have any special files in an AFS file system, etc. AFS uses its own authentication, it doesn't work well for big files, it always requires extra work to get it to work with daemons, and it has severe problems for scientific compute clusters. IBM has long ago moved onto DFS (unrelated to Microsoft DFS), which fixes many of the problems of AFS (but is itself big, even more complex than AFS, and hard to administer). Many places are trying to get rid of AFS because it's just too much of a hassle to run it (and converting back to a UNIX file system isn't easy because AFS encourages permissions and ACLs to mushroom unnecessarily).
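The first complaint above, that permissions are associated with directories only, can be made concrete with a toy sketch (illustrative; the path, users, and rights strings are invented, loosely modelled on AFS's read/lookup/insert/delete/write/lock letters):

```python
# AFS-style access checking in miniature: rights live on the directory,
# not on the file, so access to a file is determined entirely by the
# ACL of its containing directory.

acls = {"/afs/proj": {"alice": "rlidwk", "bob": "rl"}}  # per-directory ACLs

def can(user, right, path):
    directory = path.rsplit("/", 1)[0]       # only the parent dir matters
    return right in acls.get(directory, {}).get(user, "")

assert can("alice", "w", "/afs/proj/notes.txt")    # write via directory ACL
assert not can("bob", "w", "/afs/proj/notes.txt")  # bob has read/lookup only
```

This is also why, as a reply below notes, hard links across directories are problematic: a file reachable from two directories would have two conflicting sources of truth for its permissions.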

AFS may be acceptable for specific applications (in fact, what it was designed for originally): a large untrusted user population, dedicated system management staff, and smallish files and problems (text file editing, small programming jobs). But for many environments where Linux is used--big software development projects, web servers, scientific computing, home networking--it just doesn't seem like a good fit.

If it's the security you care about, NFSv4 might be for you, although it clearly also has some problems. If you want something AFS-like, Coda might be an option (but I don't know how mature it is yet). MFS and GFS are options for compute clusters. Maybe we can get 9P or Styx up on Linux.

Re:you probably don't want AFS (Score:2, Informative)
(User #125105 Info)

The problem you mention with "not working well with daemons" is likely related to the fact that it uses Kerberos IV. If the daemon needs to have more access to AFS directories than you are willing to give to any other user on the system, there is a lot of work to do.

Specifically, you need to stash a password away such that the daemon can authenticate and periodically reauthenticate so that it does not lose the rights that it has.

AFS does allow you to have ACL's based on IP address. As such, if you are running a daemon on a machine that only system administrators have access to, it may not be a big deal to allow everyone on that machine to write to a directory. Other machines, though, may have read-only or no access to the directory.

NFS 4 will have the same problem, as a requirement for it is that Kerberos V be supported as an authentication mechanism. If you don't give world write to a file/directory, then you cannot write to it without a Kerberos V ticket.
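The "stash a password and periodically reauthenticate" workaround described above can be sketched as a toy model (illustrative; the ten-hour lifetime, one-hour renewal margin, and simulated integer clock are invented, and `authenticate` stands in for a real kinit/klog with a stashed password or keytab):

```python
# A daemon that renews its token before expiry so it never loses the
# rights it has. Times are plain integers (seconds) for testability.

TOKEN_LIFETIME = 10 * 3600        # e.g. a 10-hour ticket (assumption)

class Daemon:
    def __init__(self, stashed_password):
        self.password = stashed_password   # stashed so we can re-authenticate
        self.expires = None

    def authenticate(self, now):
        # stand-in for kinit/klog with the stashed credentials
        self.expires = now + TOKEN_LIFETIME

    def tick(self, now):
        # called periodically; renew inside a one-hour safety margin
        if self.expires is None or self.expires - now < 3600:
            self.authenticate(now)

d = Daemon("s3cret")
d.tick(0)                 # first authentication
d.tick(5 * 3600)          # plenty of lifetime left: no renewal
exp_mid = d.expires       # unchanged expiry
d.tick(9 * 3600 + 1)      # inside the margin: reauthenticates
```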

Too little, too late? (Score:4, Insightful)
(User #131596 Info)

As someone who has worked with AFS for the past 8 years, I have to say that I greet this announcement with a somewhat more pessimistic view.

Namely: AFS is now officially dead.

I say "officially" because, IMO, AFS is already dead, and has been for years (ever since Transarc (now IBM Transarc Labs, but I'll refer to them as Transarc for brevity)) came out with DCE/DFS, really).

Oh, there were bouts of heavy maintenance and limited development. These periods were inevitably precipitated by Transarc's AFS customers becoming vocal and complaining. But when the complaints died down, so did Transarc's commitment.

Transarc has never treated AFS like a real product. Their "development" efforts have been limited to ports to new versions of the same operating systems, a few ports to new architectures, bugfixes, and very limited feature additions (mostly backports from DFS).

In fact, this year has seen Transarc's AFS support sink to a new low. From what I've been able to garner, all AFS development is being outsourced to India. Responses from Transarc's AFS hotline support (a support service which customers purchase!) have been inept. There was no Decorum (Transarc's yearly AFS conference) this year, nor even an announcement concerning it. It's been ages since anyone from Transarc has posted on the AFS mailing list.

So, why is Transarc (now IBM Transarc labs) open-sourcing AFS? For one simple reason: AFS is IBM's red-headed stepchild, and they don't know what else to do with it.

If you read the announcement at ss/opensource.html, you'll note this entry in the FAQ:

Is IBM still investing in AFS?

Yes. IBM recognizes that many of our customers will still want a commercially-supported version of AFS, "IBM AFS". IBM/Transarc will still sell, maintain, port (to new versions of currently-supported OS), support, and provide minor enhancements to "IBM AFS".

Good software grows or dies. AFS died a long time ago. I, personally, think this is tragic, because AFS had great potential. But Transarc never made a long-term commitment to anything other than keeping it on life support. Perhaps it can be resuscitated back to health, but I can't help but wonder if the Open Source community's effort would be better spent towards other distributed filesystems efforts, such as CODA (which I admittedly haven't investigated, but plan to).

Re:you probably don't want AFS (Score:1)
by Tower (/dev/whoop-ass) on Thursday August 17, @04:55PM EDT (#89)
(User #37395 Info)

Actually, both AFS and DFS are in use here at IBM (and at every other site I've visited... no AFS on the Windows boxen, but everyone using the RS/6ks seems to prefer AFS). Personally, I prefer the ACLs of AFS to traditional permission structures, and they are really rather flexible. You can still set rwx on the files, so it doesn't take a whole lot away...

I agree that AFS isn't meant for clustering, but it works well from a security standpoint, especially with Kerberos.

Re:you probably don't want AFS (Score:3, Informative)
by Anonymous Coward on Thursday August 17, @05:41PM EDT (#109)

> AFS semantics are very different from UNIX file system semantics: permissions are associated with
> directories only, access is determined only by the containing directory,

Think about hard links: that's why it works this way.

> if multiple clients modify the same file, updates are lost

That's not entirely true but I agree it's stupid. Anyway, it doesn't matter, if you don't use file locking you should expect corruption anyway.

> you can't have any special files in an AFS file system

I hope you don't expect your users to be able to create /dev/mem nodes in their home directories...

> AFS uses its own authentication

Yes, it's called Kerberos... ever heard of it?

> it doesn't work well for big files

It works reasonably well with big files, unlike Coda which unfortunately doesn't work at all with them. For huge amounts of data you shouldn't be creating massive files anyway; look into databases or streaming software.

> it always requires extra work to get it to work with daemons

You mean you want root on a given machine to have "root" in your whole enterprise?

> and it has severe problems for scientific compute clusters

What, rsh doesn't work? Just patch it and it works fine. Otherwise what's the problem?

> IBM has long ago moved onto DFS

No they haven't

> (unrelated to Microsoft DFS)

Thank god. But I'm glad Microsoft has finally invented the automounter.

> which fixes many of the problems of AFS (but is itself big, even more complex than AFS, and hard
> to administer).

And nobody uses it...

> Many places are trying to get rid of AFS because it's just too much of a hassle to run it

There really is no better alternative, though.

> (and converting back to a UNIX file system isn't easy because AFS encourages permissions and ACLs
> to mushroom unnecessarily).

You mean it encourages security? :)

> AFS may be acceptable for specific applications (in fact, what it was designed for originally): a
> large untrusted user population, dedicated system management staff, and smallish files and
> problems (text file editing, small programming jobs).

It lets you solve problems on a big scale. I hope the open source release will make it even better and more available for everyone to use.

> But for many environments where Linux is used--big software development projects, web
> servers, scientific computing, home networking--it just doesn't seem like a good fit.

Big software development is one of the first things AFS was used for. It's only recently, ironically, that local disks+Linux have outperformed network file systems so much.

AFS makes sense on web servers for replicating site data and allowing many people to "upload" without the insecurity of FTP.

And I don't see why anyone wouldn't want to use AFS at home. Again, I hope the open source release will allow as many people to have real security in network filesystems as possible.

> If it's the security you care about, NFSv4 might be for you

Whenever that will be available...

> If you want something AFS-like, Coda might be an option (but I don't know how mature it is yet)

Coda is nice but not packaged well enough for everyone to start using it. It also chokes on big files much worse than AFS, unfortunately.

> MFS and GFS are options for compute clusters.

They're nice for high bandwidth to big files. But they give you no security... do you really want a root exploit on one machine in a cluster to destroy all data in the entire site?

Why? CODA (Score:2, Interesting)
by Anonymous Coward on Thursday August 17, @03:49PM EDT (#26)

Why open source it? Because Coda is about to replace it. CODA is a Free (free software), scalable, distributed file system. It covers every feature of AFS, and goes quite a bit further.

Coda is reaching a point of stability and availability where it's nearly ready for widespread production deployment.

UKUUG Linux 2000 Conference - Timetable

At UKUUG this year, Owen LeBlanc, a Coda expert if there ever was one, said "if you have a small number of users and a relatively small amount of data, then Coda may be just what you need". I also seem to recall him saying he thought AFS is pretty darn nice. He'd be the one to know.

AFS Frequently Asked Questions

RAM disks

[August 3, 1999] How to use a Ramdisk for Linux By Mark Nielsen LG #44 -- good and important How-to


The Linux Virtual File-system Layer Neil Brown [email protected] and others. 29 December 1999 - v1.6

The Linux operating system supports multiple different file-systems, including ext2 (the Second Extended file-system), nfs (the Network File-system), FAT (the MS-DOS File Allocation Table file-system), and others. To enable the upper levels of the kernel to deal equally with all of these and other file-systems, Linux defines an abstract layer, known as the Virtual File-system, or vfs. Each lower-level file-system must present an interface which conforms to this Virtual File-system.

This document describes the vfs interface (as present in Linux 2.3.29). NOTE this document is incomplete.
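What a VFS layer buys you can be shown with a toy sketch (illustrative only, loosely modelled on the idea described above, not on the actual Linux 2.3 kernel structures; the mount points and the fake network filesystem are invented): the upper layer calls one abstract interface, and each filesystem supplies its own implementation.

```python
# A miniature "VFS": mount different filesystem implementations at
# different points; reads dispatch to whichever fs owns the path.

class FileSystemOps:                        # the abstract interface
    def read(self, path):
        raise NotImplementedError

class RamFS(FileSystemOps):                 # one concrete filesystem
    def __init__(self, files):
        self.files = files
    def read(self, path):
        return self.files[path]

class FakeNFS(FileSystemOps):               # a network-backed stand-in
    def read(self, path):
        return f"<fetched {path} over the network>"

class VFS:
    def __init__(self):
        self.mounts = {}
    def mount(self, point, fs):
        self.mounts[point] = fs
    def read(self, path):
        # dispatch on the longest matching mount point
        point = max((m for m in self.mounts if path.startswith(m)), key=len)
        return self.mounts[point].read(path[len(point):])

vfs = VFS()
vfs.mount("/ram/", RamFS({"notes": "local data"}))
vfs.mount("/net/", FakeNFS())
local = vfs.read("/ram/notes")     # served by RamFS
remote = vfs.read("/net/home/me")  # served by the nfs-like fs
```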


Random Findings

The BeOS filesystem (BFS) is a 64-bit journalled filesystem with support for arbitrary file attributes on any node (i.e., you can apply them to directories and symlinks as well as regular files).

Slashdot Tux2 The Filesystem That Would Be King

Practical File System Design with the Be File System

by Dominic Giampaolo
Paperback, 256 pages; Morgan Kaufmann Publishers, November 1998; ISBN 1558604979

Publisher page: Practical File System Design with the Be File System

This is the new guide to the design and implementation of file systems in general, and the Be File System (BFS) in particular. This book covers all topics related to file systems, going into considerable depth where traditional operating systems books often stop. Advanced topics such as journaling, attributes, indexing and query processing are covered in detail. Built from scratch as a modern 64-bit, journaled file system, BFS is the primary file system for the Be Operating System (BeOS), which was designed for high-performance multimedia applications. You do not have to be a kernel architect or file system engineer to use Practical File System Design. Neither do you have to be a BeOS developer or user. Only basic knowledge of C is required. If you have ever wondered about how file systems work, how to implement one, or want to learn more about the Be File System, this book is all you will need.


Dominic Giampaolo has a Masters degree in Computer Science from Worcester Polytechnic and is one of the principal kernel engineers for Be Inc. His responsibilities include the file system and various other parts of the kernel.

5 out of 5 stars: the Big Picture and the specifics, April 25, 2000
Reviewer: gseven
If you are worried that this will only talk about Be file system design, worry no more. It has overviews of several other major file systems and their pros and cons before wading into the Be decisions for a file system and how they are implemented. So, I thought it was nicely organized and broadly applicable.
5 out of 5 stars I wish every technical writer were this good. March 21, 2000
Reviewer: A reader from Texas, USA
I had wanted to buy this book for some time, but as a Unix Admin, I couldn't justify the money nor the study time. Well, now that I've bought it, I'm kicking myself for not doing so earlier. I have gained a much greater understanding of hashes, trees, filesystems, and databases. The book is an epitome of clarity of thought and presentation. It's not often (never?) that I find a technical book that I want to read cover to cover in one sitting! I only wish that the author had more time to revisit the BeFS short-comings that he mentions, and then GPL the end result.
3 out of 5 stars Worth reading, but not the last word in file system design January 18, 1999
Reviewer: [email protected] from rural New Hampshire
This book may be slightly over-sold on its jacket ("guide to the design and implementation of file systems in general ... covers all topics related to file systems") but that's likely not the author's fault. It does provide intermediate levels of detail regarding many, perhaps most, areas of concern to file system designers and deserves a place in the library of anyone embarking on such a project - though people expecting a cookbook rather than a source of detailed ideas will be disappointed.

The ideas are in general sound and representative of the current state of file system practice. The historical view is a bit Unix-centric - to state that the Berkeley Fast File System is the ancestor of most modern file systems is to ignore arguably superior and significantly earlier implementations from IBM, DEC, and others. This bias carries over into aspects of implementation as well, such as use of the Unix direct/indirect/double-indirect mapping mechanism to manage contiguous 'block runs' without adding file address information to the mapping blocks to eliminate the need to scan them sequentially (save for the double-indirect blocks, which avoid the scan by establishing a standard run-length of 4 blocks - arrgh!) when positioning within the file - and the unbalanced Unix-style tree itself would almost certainly be better implemented as a b-tree variant (with its root in-line in the i-node) indexed on file address. And the text occasionally blurs the distinction between what the BFS chose to implement (a journal system that forced meta-data update transactions to be serialized) and what is possible (a multi-threaded journal supporting concurrent transactions simply by allowing each transaction to submit a log record for each individual change it makes - which would also support staged execution of extremely large transactions eliminating the log size as a constraint on them).
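The reviewer's contrast between per-block mapping and contiguous "block runs" can be sketched as a toy model (illustrative; the block numbers and run sizes are invented, and real BFS/ext2 on-disk structures are more involved):

```python
# Two mapping styles for the same 4096-block contiguous file: classic
# Unix indirect blocks use one entry per block, while a BFS-style
# "block run" (extent) covers the whole range in one (start, length)
# pair, so positioning needs no sequential scan of mapping blocks.

def map_with_blocks(start, nblocks):
    # one entry per block, as with direct/indirect block pointers
    return [start + i for i in range(nblocks)]

def map_with_runs(start, nblocks):
    # one extent covers the whole contiguous range
    return [(start, nblocks)]

def lookup_in_runs(runs, logical):
    # positioning within the file: walk the (few) runs
    offset = 0
    for start, length in runs:
        if logical < offset + length:
            return start + (logical - offset)
        offset += length
    raise IndexError(logical)

blocks = map_with_blocks(1000, 4096)   # 4096 map entries
runs = map_with_runs(1000, 4096)       # a single map entry
phys = lookup_in_runs(runs, 2500)      # physical block for logical 2500
```

The reviewer's further point is that once runs carry file-address information (or live in a b-tree keyed on file address), even files with many runs can be positioned in without scanning the map sequentially.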

Some of the choices made in BFS can be questioned, even in its particular use context. The 'allocation group' mechanism interacts in subtle ways with the basic file system block size, and given the ongoing improvement of disk seek time relative to rotational latency, the value of locating related structures relatively near each other (though not actually adjacent) on disk may no longer justify the added complexity (though the effort to place file inodes immediately following the parent directory inode is likely worthwhile if a read-ahead facility exists to take advantage of it). The discussion of on-disk placement also ignores 'disks' that may in fact be composed of multiple striped units, which would further dilute the benefits of allocation groups. Striping would also complicate the read-ahead facility just mentioned, as would a shared-disk environment, unless the disk unit itself performed the read-ahead and any replication present was taken into account (as in the Veritas file system, as I remember).

Even the fundamental decision to make attributes indexable deserves closer examination, given the costs of indexing. Current hardware can perform a complete inode scan on a single-user workstation fast enough to satisfy the occasional random query, and can scan the inodes for files within some limited sub-tree of the directory structure (e.g., a cluster of e-mail directories) relatively quickly for more common queries; in a multi-user environment, indexing individual attributes across all users is frequently not the behavior desired. Placing index management under explicit application control may be a better approach: the application could specify on attribute creation the index, if any, in which its value should be entered (thus preserving the ability to encapsulate the operation within a system-controlled transaction without the need for user-level transaction support), with the index (perhaps identified by its inode) stored alongside the attribute for later change or deletion.
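The application-controlled indexing the reviewer proposes can be sketched as a toy in-memory model. This is a hypothetical API invented for illustration - neither BFS nor the book defines it - but it shows the shape of the idea: each attribute write optionally names the index that should receive the value, and queries consult only indexes that applications chose to maintain:

```python
# Toy sketch of application-controlled attribute indexing (hypothetical API).
class AttrStore:
    def __init__(self):
        self.attrs = {}    # (inode, name) -> (value, index_name or None)
        self.indexes = {}  # index_name -> {value: set of inodes}

    def set_attr(self, inode, name, value, index=None):
        # Drop any stale index entry, then store the value and remember which
        # index (if any) it was entered into, for later change or deletion.
        self.remove_attr(inode, name)
        self.attrs[(inode, name)] = (value, index)
        if index is not None:
            self.indexes.setdefault(index, {}).setdefault(value, set()).add(inode)

    def remove_attr(self, inode, name):
        old = self.attrs.pop((inode, name), None)
        if old is not None and old[1] is not None:
            value, index = old
            self.indexes[index][value].discard(inode)

    def query(self, index, value):
        # Only attributes explicitly entered into `index` are visible here.
        return self.indexes.get(index, {}).get(value, set())
```

A mail client, for example, could index the "from" attribute of its own message files while every other application's attributes stay unindexed, avoiding the cross-user, cross-application index growth the review warns about.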

Conspicuous by their omission are any mentions of how to manage very large allocation bit-maps (which one really must expect when other parts of the system are carefully crafted to handle 2**58-byte files) or of the impact of a shared-disk environment (if BFS were intended to be limited to desktop use this may be more understandable, but even desktops may soon have high-availability configurations). Security is mentioned briefly as a concern to be addressed later - but BFS's dynamic allocation of inodes from the general space pool makes this impossible, given that directory inode addresses can apparently be fed in from user mode (the author does note this near the book's end, but fails to discuss possible remedies).
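Why bitmap management matters at that scale can be shown with a small sketch. A naive bit-by-bit scan of a bitmap covering a huge volume is hopeless; the standard remedy (one of several - hierarchical summary maps are another) is to scan a word at a time and skip fully allocated words with a single comparison. This is a generic illustration, not code from the book:

```python
# Sketch: word-at-a-time free-block search in a large allocation bitmap.
# Fully allocated 64-bit words are skipped with one comparison each, so the
# cost is dominated by the number of words, not the number of bits.
FULL = (1 << 64) - 1  # all 64 blocks covered by this word are allocated

def find_free_block(bitmap_words):
    """Return the index of the first free (0) bit, or -1 if the map is full."""
    for w, word in enumerate(bitmap_words):
        if word == FULL:
            continue                   # skip a fully allocated word outright
        lowest_clear = ~word & (word + 1)  # isolate the lowest 0 bit as a 1 bit
        return w * 64 + lowest_clear.bit_length() - 1
    return -1
```

Even so, a linear scan over a truly enormous map has poor worst-case behavior, which is exactly why the omission of this topic in a design text aimed at 2**58-byte files is worth flagging.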

The author also expresses regret in the introduction at not having had time to include more comparative information on other file systems, both current and historical. Perhaps he is leaving himself room to write a second book. I hope so: despite my comments above, this one was worthwhile - both on its own merits, and because of the lack of competition in this subject area.





Copyright © 1996-2021 by Softpanorama Society. was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: March 12, 2019