|(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix
Note: Discussion below is based on the article by Brian Wong, Design, Features, and Applicability of Solaris File Systems
Like any modern OS, Solaris includes many file systems, and more are available as add-ons. A file system stores named data sets and attributes about those data sets for subsequent data access and interpretation of the attributes. Attributes include things like ownership, access rights, date of last access, and physical location. More advanced attributes might include extended attributes (as in OS/2 HPFS), encryption keys, and so on. We can distinguish five categories of file system available in Solaris.
The Cache File System
You can use the Cache File System (CacheFS) to improve performance of remote file systems or slow devices such as CD-ROM drives. When a file system is cached, the data read from the remote file system or CD-ROM is stored in a cache on the local system. See “Creating Cache File Systems” for more information.
The Temporary File System (TMPFS)
The TMPFS file system uses local memory for disk reads and writes. Access to files in a TMPFS file system is typically much faster than access to files in a UFS file system. Files in the TMPFS file system are not permanent. They are deleted when the file system is unmounted and when the system is shut down or rebooted.
The default file system type for the /tmp directory in the SunOS 5.x system software is TMPFS. You can copy or move files into or out of the /tmp directory, just as you would in a ufs /tmp file system.
Using TMPFS file systems can improve system performance by saving the cost of reading and writing temporary files to a local disk or across the network. For example, temporary files are created when you compile a program. The operating system generates a lot of disk or network input and output activity while manipulating these files. Using TMPFS file systems to hold these temporary files may significantly speed up their creation, manipulation, and deletion.
The TMPFS file system uses swap space as a temporary storage area. If a system with a TMPFS file system does not have adequate swap space, two problems can occur:
See Chapter 9, “Administering Systems,” for information about increasing swap space.
The Loopback File System (LOFS)
The LOFS file system lets you create a new virtual file system. You can access files using an alternative path name. For example, you can create a loopback mount of / onto /tmp/newroot. The entire file system hierarchy looks like it is duplicated under /tmp/newroot, including any file systems that were mounted from NFS servers. All files are accessible either with a path name starting from / or with a path name starting from /tmp/newroot until a different file system is mounted in /tmp/newroot or any of its subdirectories.
Every Solaris system includes UFS. While it is definitely old and lacking some features, it is suitable for a wide variety of applications. The UFS design center handles typical files found in office and business automation systems. The basic I/O characteristics are huge numbers of small, cachable files, accessed randomly by individual processes; bandwidth demand is low. This profile is common in most workloads, such as software development and network services (for example, in name services, web sites, and ftp sites).
When designing server file systems with UFS, pay attention to the role each partition will play for your particular application. Mapping partitions to separate pairs of physical disks (in the case of hardware mirroring) spreads the load across the pairs and improves performance.
If you're running a web server, for example, it would benefit performance to have a separate pair of drives dedicated to website storage. You might configure the web server partition with both the "noatime" and "logging" options, along with the "nosuid" option. This would offload requests to a separate drive and possibly a separate SCSI controller channel.
Web servers have a mostly read-request load and the volume of data is not that big, so RAID 10 can be used.
Software mirroring adds overhead. You should never mirror partitions on the same drive, except for training purposes: you'll seriously degrade performance, since you've effectively doubled your seeks.
For small websites (say, up to 4 GB) it makes sense to use /tmp for the site content, as it is mapped to memory. That also means your pages will be cached. The drawback is that you might need to order more memory for the server, increasing the cost, but it is a better (and cheaper) deal than using SANs. You just need to reload the content when the server reboots. The problem is that beyond 4 GB the reboot time becomes somewhat long, but few websites are that big. In any case, it makes sense to use an entire drive for your web server file system. New USB storage might offer read performance comparable with the best hard drives, with essentially no latency on reads, and might also be an option.
Logs from the web server can be written to the system drive, as their volume is rather small.
You can also tune the UFS file system for a web server by using the noatime option (which saves some writes) and by setting the "highwater" and "lowwater" marks with the "ufs_HW" and "ufs_LW" parameters in /etc/system. See the Sun Performance and Tuning book (p. 172-173) and Sun's Solaris performance tuning course.
In addition to the basic UFS, there are two variants: logging UFS (LUFS) and the older metatrans UFS that was used in Solaris 7. All three versions share the same basic code for block allocation, directory management, and data organization. In particular, older versions of Solaris, up to Solaris 9, have a nominal maximum UFS size of 1 terabyte. This limit was raised to 16 terabytes in the Solaris 10 OS.
The maximum file size is slightly smaller, about 1009 gigabytes out of a 1024 gigabyte file system. There is no reasonable limit to the number of file systems that can be built on a single system; systems have been run with over 2880 UFS file systems. The major differences between the three UFS variants are in how they handle metadata. Metadata is information that the file system stores about the data, such as the name of the file, ownership and access rights, last modified date, file size, and other similar details. Other, less obvious, but possibly more important metadata describe the location of the data on the disk, such as data blocks and the indirect blocks that indicate where data blocks reside on the disk.
Getting this metadata wrong would not only mean that the affected file might be lost, but could lead to serious file system-wide problems or even a system crash in the event that live data found itself in the free space list, or worse, that free blocks somehow appeared in the middle of a file. UFS takes the simplest approach to assuring metadata integrity: it writes metadata synchronously and requires an extensive fsck on recovery from a system crash. The time and expense of the fsck operation is proportional to the number of files in the file system being checked.
Large file systems with millions of small files can take tens of hours to check. Logging file systems were developed to avoid both the ongoing performance issues associated with synchronous writes and excessive time for recovery. Logging uses the two-phase commit technique to ensure that metadata updates are either fully updated on disk, or that they will be fully updated on disk upon crash recovery. Logging implementations store pending metadata in a reserved area, and then update the master file system based on the content of the reserved area or log.
In the event of a crash, metadata integrity is assured by inspecting the log and applying any pending metadata updates to the master file system before accepting any new I/O operations from applications. The size of the log is dependent on the amount of changing metadata, not the size of the file system. Because the amount of pending metadata is quite small (usually on the order of a few hundred kilobytes for typical file systems and several tens of megabytes for very busy file systems), replaying the log against the master is a very fast operation.
Once the metadata integrity is guaranteed, the fsck operation becomes a null operation and crash recovery becomes trivial. Note that for performance reasons, only metadata is logged; user data is not logged.
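The log-then-apply cycle and crash replay described above can be sketched with a toy model. This is only an illustration under simplifying assumptions, not the UFS implementation: the class ToyLoggingFS, its method names, and the crash_after parameter are all hypothetical, and metadata is reduced to a dictionary.

```python
class ToyLoggingFS:
    """Toy model: metadata changes go to a log first, then to the master."""

    def __init__(self):
        self.master = {}   # "on-disk" metadata: file name -> attributes
        self.log = []      # reserved log area holding pending records

    def log_update(self, name, attrs):
        # Phase 1: record the intended change in the log.
        # attrs=None is used here to mean "delete this file".
        self.log.append((name, attrs))

    def apply_log(self, crash_after=None):
        # Phase 2: apply pending records to the master, retiring each
        # record once the master holds it. crash_after simulates a crash
        # after that many records have been applied.
        applied = 0
        while self.log:
            if crash_after is not None and applied == crash_after:
                return False              # crash: the rest stays in the log
            name, attrs = self.log[0]
            if attrs is None:
                self.master.pop(name, None)
            else:
                self.master[name] = attrs
            self.log.pop(0)               # record is now committed
            applied += 1
        return True

    def recover(self):
        # Crash recovery: replay only what is still pending in the log.
        replayed = len(self.log)
        self.apply_log()
        return replayed


fs = ToyLoggingFS()
fs.log_update("a.txt", {"size": 10})
fs.log_update("b.txt", {"size": 20})
fs.log_update("a.txt", None)
fs.apply_log(crash_after=1)               # crash after the first record
print(fs.recover(), fs.master)            # 2 {'b.txt': {'size': 20}}
```

The recovery work depends on the number of pending log records (two here), not on the number of files in the master, which is the property that makes replay fast compared with a full fsck.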
The metatrans implementation was the first version of UFS to implement logging. It was built into Solstice DiskSuite or Solaris Volume Manager software (the name of the product depends on the version of the code, but otherwise, they are the same). The metatrans implementation is limited to Solaris 7 and was replaced by logging UFS (LUFS).
Logging UFS was introduced into the Solaris 8 OS but unfortunately was not enabled by default. The reason for that was performance degradation, found typically only at artificially high-load levels, and almost no cases have been seen in practical applications.
So in reality logging started to be used in typical installations only with Solaris 10, where it is enabled by default. Sun recommends using logging any time that fast crash recovery is required, and it can be used starting from Solaris 8, but this recommendation is largely ignored. This is particularly sad in the case of root file systems, which usually do not have any significant I/O at all.
One of the most confusing issues associated with logging file systems (and particularly with logging UFS, for some reason) is the effect that the log has on performance. First, and most importantly, logging has absolutely no impact on user data operations; this is because only metadata operations are logged.
The performance of metadata operations is another story, and it is not as easy to describe. The log works by writing pending changes to the log, then actually applying the changes to the master file system. When the master is safely updated, the log entry is marked as committed, meaning that it does not need to be reapplied to the master in the event of a crash. This algorithm means that metadata changes that are accomplished primarily when creating or deleting files might actually require twice as many physical I/O operations as a non-logging implementation. The net impact of this aspect of logging performance is that there are more I/O operations going to storage. Typically, this has no real impact on overall performance, but in the case where the underlying storage was already nearly 100 percent busy, the extra operations associated with logging can tip the balance and produce significantly lower file system throughput. (In this case, throughput is not measured in megabytes per second, but rather in file creations and deletions per second.) If the utilization of the underlying storage is less than approximately 90 percent, the logging overhead is inconsequential.
On the positive side of the ledger, the most common impact on performance has to do with the cancellation of some physical metadata operations. These cases occur only when metadata updates are issued very rapidly, such as when doing a tar(1) extract operation or when removing the entire contents of a directory ("rm -f *"). Without logging, the system is required to force the directory to disk after every file is processed (this is the definition of the phrase "writing metadata synchronously"); the effect is to write 512 or 2048 bytes every time 14 bytes are changed. When the file system is logging, the log record is pushed to disk when the log record fills, often when the 512-byte block is completed. This results in a reduction in physical I/O of roughly 512/14, or about 36 times, and obvious performance improvements result.
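The batching arithmetic above can be checked with a few lines of code. This is a back-of-the-envelope sketch only: the function physical_writes is hypothetical, it assumes one 14-byte change per file and a log block that is flushed only when full, and it borrows the 7092-file count from the tar benchmark discussed in this article.

```python
import math


def physical_writes(n_files, entry_bytes=14, block_bytes=512, logging=True):
    """Estimate directory-related physical writes for n_files updates."""
    if not logging:
        return n_files                            # one synchronous write per file
    entries_per_block = block_bytes // entry_bytes    # 36 whole entries per block
    return math.ceil(n_files / entries_per_block)


n = 7092
sync = physical_writes(n, logging=False)
logged = physical_writes(n, logging=True)
print(sync, logged, round(sync / logged, 1))      # 7092 197 36.0
```

Under these assumptions 36 whole 14-byte changes fit in a 512-byte block, so the logged case issues one write where the synchronous case issues 36.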
The following table illustrates these results. The times are given in seconds, and lower scores are better. Times are the average of five runs, and are intended to show relative differences rather than the fastest possible absolute time. These tests were run on Solaris 8 7/01 using a single disk drive.
The tar test consists of extracting 7092 files from a 175 megabyte archive (the contents of /usr/openwin). Although a significant amount of data is moved, this test is dominated by metadata updates for creating the files. Logging is five times faster. The rm test removes the 7092 extracted files. It is also dominated by metadata updates and is an astonishing 37 times faster than the non-logging case.
On the other hand, the dd write test creates a single 1 gigabyte file in the file system, and the difference between logging and non-logging is a measurable, but insignificant, three percent. Reading the created file from the file system shows no performance impact from logging. Both tests use large block sizes (1 megabyte per I/O) to optimize throughput of the underlying storage.
Another feature present in most of the local file systems is the use of direct I/O. UFS, VxFS, and QFS all have forms of this feature, which is primarily intended to avoid the overhead associated with managing cache buffers for large I/O. At first glance, it might seem that caching is a good thing and that it would improve I/O performance.
There is a great deal of reality underlying these expectations. All of the local file systems perform buffer caching by default. The expected improvements occur for typical workloads that are dominated by metadata manipulation and data sets that are very small when compared to main memory sizes. Metadata, in particular, is very small, amounting to less than one kilobyte per file in most UFS applications, and only slightly more in other file systems. Typical user data sets are also quite small; they average about 70 kilobytes. Even the larger files used in everyday work, such as presentations created using StarOffice software, JPEG images, and audio clips, are generally less than 2 megabytes. Compared to typical main memory sizes of 256-2048 megabytes, it is reasonable to expect that these data sets and their attributes can be cached for substantial periods of time. They are reasonably likely to still be in memory when they are accessed again, even if that access comes an hour later.
The situation is quite different with bulk data. Systems that process bulk data tend to have larger memories, up to perhaps 16 gigabytes (for example, 8-64 times larger than typical), but the data sets in these application spaces often exceed 1 gigabyte and sometimes range into the tens or even hundreds of gigabytes. Even if the file literally fits into memory and could theoretically be cached, these data sets are substantially larger than memory that is consistently available for I/O caching. As a result, the likelihood that the data will still be in cache when the data is referenced again is quite low. In practice, cache reuse in these environments is nil.
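The difference in cache reuse between small working sets and bulk data can be illustrated with a small simulation. This is a sketch under stated assumptions: it models the buffer cache as a plain LRU cache (the real Solaris page cache is more elaborate), and the function hit_rate and the block counts are made up for illustration.

```python
from collections import OrderedDict


def hit_rate(accesses, cache_blocks):
    """Replay block accesses through an LRU cache; return the hit fraction."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)          # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)     # evict the least recently used
    return hits / len(accesses)


CACHE = 1000                                  # cache capacity in blocks
small = list(range(100)) * 50                 # 100-block working set, re-read 50 times
bulk = list(range(5000)) * 2                  # 5000-block file streamed twice
print(hit_rate(small, CACHE), hit_rate(bulk, CACHE))   # 0.98 0.0
```

The small working set is served almost entirely from cache, while the file that is five times larger than the cache gets no hits at all on the second pass, because each block is evicted before it is needed again.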
Caching data anyway would be fine, except that the process requires effort on the part of the OS and processors. For small files, this overhead is insignificant. However, the overhead becomes not only significant, but excessive when "tidal waves" of data flow through the system. When reading 1 gigabyte of data from a disk in large blocks, throughput is similar for both direct and buffered cases; the buffered case delivers 13 percent greater throughput. The big difference between these two cases is that the buffered process consumes five times as much CPU effort. Because there is so little practical value to caching large data sets, Sun recommends using the forcedirectio option on file systems that operate on large files. In this context, large generally means more than about 15-20 megabytes. Note that the direct I/O recommendation is especially true when the server in question is exporting large files through NFS.
If direct I/O is so much more efficient, why not use direct I/O all the time? Direct I/O means that caching is disabled. The impact of standard caching becomes obvious when using a UFS file system in direct I/O mode while doing small file operations. The same tar extraction benchmark used in the logging section above takes over 51 minutes, even with logging enabled, roughly 24 times as long as when using regular caching (2:08)! The benchmark results are summarized in the following table.
In this table, throughput is represented by elapsed times in seconds, and smaller numbers are better. The system in question is running Solaris 9 FCS on a 750-megahertz processor. The tests are disk-bound on a single 10K RPM Fibre Channel disk drive. The differences in throughput are mainly attributable to how the file system makes use of the capabilities of the underlying hardware.
TABLE 2 Analyzing the Performance of Direct I/O and Buffered I/O

                    Direct I/O                 Buffered I/O
Test                Throughput (s)   CPU %     Throughput (s)   CPU %
Create 1 GB file    36               5.0%      31               25.0%
Read 1 GB file      30               0.0%      22               22.0%
tar extract         3062             0.0%      128              6.0%
rm -rf *            76               1.2%      65               1.0%

A discussion of buffered and direct I/O methodology is incomplete without addressing one particular attribute of the cached I/O strategy. Because file systems are part of the operating system, they can access the entire capability of the hardware. Of particular relevance is that file systems are able to address all of the physical memory, which now regularly exceeds the ability of 32-bit addressing. As a result, the file system is able to function as a kind of memory management unit (MMU) that permits applications that are strictly 32-bit aware to make direct use of physical memories that are far larger than their address pointers. This technique, known as supercaching, can be particularly useful to provide extended caching for applications that are not 64-bit aware. The best examples of this are the open-source databases, MySQL and Postgres. Both of these are compiled in 32-bit mode, leaving their direct addressing capabilities limited to 4 gigabytes (see note 1). However, when their data tables are hosted on a file system operating in buffered mode, they benefit from cached I/O. This is not as efficient as simply using a 64-bit pointer because the application must run I/O system calls instead of merely dereferencing a 64-bit pointer, but the advantages gained by avoiding I/O outweigh these considerations by a wide margin.

Note 1: They're limited to 4 gigabytes of memory. They obviously can address far more disk space because disk addresses are 63-bit quantities of 512-byte blocks.
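The relative cost of direct I/O in each test can be derived from the Table 2 figures quoted in this article. The snippet below is just that arithmetic; the dictionary layout and the function name slowdown are illustrative choices, and the numbers are copied from the table, not measured here.

```python
# (direct seconds, direct CPU %, buffered seconds, buffered CPU %), from Table 2
TABLE2 = {
    "create 1 GB file": (36, 5.0, 31, 25.0),
    "read 1 GB file": (30, 0.0, 22, 22.0),
    "tar extract": (3062, 0.0, 128, 6.0),
    "rm -rf *": (76, 1.2, 65, 1.0),
}


def slowdown(test):
    """Elapsed time with direct I/O divided by elapsed time with buffered I/O."""
    direct_s, _, buffered_s, _ = TABLE2[test]
    return direct_s / buffered_s


for name in TABLE2:
    print(f"{name:18s} direct/buffered elapsed = {slowdown(name):.2f}x")
```

The large-file tests come out within a factor of about 1.4 of each other, while the small-file tar extraction is about 24 times slower without caching.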
To Solaris users, NFS is by far the most familiar file system. It is an explicit over-the-wire file sharing protocol that has been a part of Solaris since 1986. Its manifest purpose is to permit safe, deterministic access to files located on a server with reasonable security. Although NFS is media independent, it is most commonly seen operating over TCP/IP networks. NFS is specifically designed to operate in multiclient environments and to provide a reasonable tradeoff between performance, consistency, and ease of administration. Although NFS has historically been neither particularly fast nor particularly secure, recent enhancements address both of these areas. Performance improved by 50-60 percent between the Solaris 8 and Solaris 9 OSs, primarily due to greatly increased efficiency in processing attribute-oriented operations (see note 5). Data-intensive operations don't improve by the same margin because they are dominated by data transfer times rather than attribute operations. Security, particularly authentication, has been addressed through the use of much stronger authentication mechanisms such as those available using Kerberos. NFS clients now need to trust only their servers, rather than their servers and their client peers.

Note 5: A 2 x 900 MHz SF280R yielded 7200 NFS operations per second on Solaris 8 2/02. The same system yielded 1717 NFS operations per second on Solaris 9 FCS.
UFS is not a shared file system. Despite a fairly widespread interest in a limited-use configuration (specifically, mounted for read/write operation on one system, while mounted read-only on one or more "secondary" systems), UFS is not sharable without the use of an explicit file sharing protocol such as NFS. Although read-only sharing seems as though it should work, it doesn't. This is due to fairly fundamental decisions made in the UFS implementation many years ago, specifically in the caching of metadata.
UFS was designed with only a single system in mind and it also has a relatively complex data structure for files, notably including "indirect blocks," which are blocks of metadata that contain the addresses of real user data. To maintain reasonable performance, UFS caches metadata in memory, even though it writes metadata to disk synchronously. This way, it is not required to re-read inodes, indirect-blocks, and double-indirect blocks to follow an advancing file pointer. In a single-system environment, this is a safe assumption. However, when another system has access to the metadata, assuming that cached metadata is valid is unsafe at best and catastrophic at worst. A writable UFS file system can change the metadata and write it to disk.
Meanwhile, a read-only UFS file system on another node holds a cached copy of that metadata. If the writable system creates a new file or removes or extends an existing file, the metadata changes to reflect the request. Unfortunately, the read-only system does not see these changes and, therefore, has a stale view of the file system. This is nearly always a serious problem, with the consequences ranging from corrupted data to a system crash. For example, if the writable system removes a file, its blocks are placed in the free list. The read-only system isn't provided with this information; therefore, a read of the same file will cause the read-only system to follow the original data pointers and read blocks that are now on the free list!
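The stale-metadata hazard can be reproduced in a toy model. This sketch is not UFS: the classes SharedDisk and Host, the eight-block disk, and the lowest-free-block allocation policy are all assumptions chosen to keep the example short and deterministic.

```python
class SharedDisk:
    """Blocks and on-disk metadata, visible to every attached host."""

    def __init__(self, nblocks=8):
        self.blocks = [None] * nblocks
        self.inodes = {}                      # file name -> list of block numbers
        self.free = list(range(nblocks))      # free list


class Host:
    def __init__(self, disk):
        self.disk = disk
        self.inode_cache = {}                 # private in-memory metadata cache

    def create(self, name, chunks):
        nums = []
        for chunk in chunks:
            n = self.disk.free.pop(0)         # allocate the first free block
            self.disk.blocks[n] = chunk
            nums.append(n)
        self.disk.inodes[name] = nums         # metadata written to disk
        self.inode_cache[name] = nums

    def remove(self, name):
        nums = self.disk.inodes.pop(name)
        self.inode_cache.pop(name, None)
        self.disk.free = nums + self.disk.free    # blocks return to the free list

    def read(self, name):
        if name not in self.inode_cache:      # first access: fetch metadata from disk
            self.inode_cache[name] = self.disk.inodes[name]
        return [self.disk.blocks[n] for n in self.inode_cache[name]]


disk = SharedDisk()
writer, reader = Host(disk), Host(disk)
writer.create("report", ["good-1", "good-2"])
reader.read("report")                         # reader now caches the block pointers
writer.remove("report")                       # reader is never told about this
writer.create("other", ["new-1", "new-2"])    # freed blocks are reused
print(reader.read("report"))                  # ['new-1', 'new-2']
```

The reading host follows its cached pointers for a file that no longer exists and returns another file's data, which is the mild end of the outcomes described above.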
Rather than risk such extreme consequences, it is better to use one of the many other options that exist. The choice of option is driven by a combination of how often updated data must be made available to the other systems and the size of the data sets involved. If the data is not updated too often, the most logical option is to make a copy of the file system and to provide the copy to other nodes. With point-in-time copy facilities such as Sun Instant Image, HDS ShadowImage, and EMC TimeFinder, copying a file system does not need to be an expensive operation.
It is entirely reasonable to export a point-in-time copy of a UFS file system from storage to another node (for example, for backup) without risk because neither the original nor the copy is being shared. If the data changes frequently, the most practical alternative is to use NFS.
Although performance is usually cited as a reason not to do this, the requirements are usually not demanding enough to warrant other solutions. NFS is far faster than most users realize, especially in environments that involve typical files smaller than 5-10 megabytes.
There are a couple of tricks you can use under Solaris to gain a little extra performance from your file systems and also increase their data reliability. When designing your file system, pay attention to the role it will play for your particular application. Depending on your needs, you can map partitions to physical disks so as to minimize the load on each pair of disks (in the case of mirroring) and improve performance.
If you're running a web server, for example, it would benefit performance to have a separate pair of drives dedicated to website storage. You might configure it with both the "noatime" and "logging" options mentioned below, along with the "nosuid" option. This would offload requests to a separate drive and possibly a separate SCSI controller channel.
Web servers have a mostly read-request load. RAID 10 can be used, but RAID 5 can be used too, as both provide a high read transaction rate and redundancy in case of a drive failure.
You should never mirror partitions on the same drive; otherwise, you'll seriously degrade performance, since you've effectively doubled your seeks.
For small websites (say, up to 4 GB) it makes sense to use /tmp for the site content, as it is mapped to memory. That also means your pages will be cached. The drawback is that you might need to order more memory for the server, increasing the cost, but it is a better deal than using SANs. You just need to reload the content when the server reboots, and beyond 4 GB the reboot time becomes annoyingly long. In any case, it makes sense to use an entire drive for your web server file system. New USB storage might offer read performance comparable with hard drives and is also an option.
Adaptive Server devices usually are raw devices or file system devices. Solaris users have a third option, tmpfs, for tempdb.
tmpfs -- the temporary file system -- caches writes only for a session. Files are not preserved across operating system reboots.
Note: Other UNIX platforms may allow you to create a temporary file system device. See your operating system System Administrator.
Should you use tmpfs?
To determine whether tmpfs would benefit your system, perform benchmarks comparing the memory assigned to tmpfs versus the memory assigned to the data cache.
Usually, it is more effective to give extra memory to the server for use as general data cache rather than creating a tmpfs device for tempdb. If tempdb is used heavily, then it will use a fair share of the data cache. If tempdb is not used often, then the server can use the memory assigned to data cache for non-tempdb data processing, but if the memory is assigned to tmpfs it is wasted.
Servers that are most likely to benefit from using tmpfs are those that are already near the addressable memory limit:
- For Sybase SQL Server 11.0.x, see TechNote 20239: Addressable Memory Limits for Sybase SQL Server 11.0.x .
- For Adaptive Server Enterprise 11.5.x, see TechNote 20101: Addressable Memory Limits in Adaptive Server Enterprise 11.5.x .
- For Adaptive Server 11.9.2, the limits generally are the same as in TechNote 20101.
Addressable memory in Adaptive Server 11.9.3 generally is 4TB, and therefore tmpfs may not be as beneficial.
Creating a tmpfs device
Follow these steps:
- Create and test an operating system startup script that creates tmpfs after every operating system reboot. See the Solaris man page on tmpfs for details on creating a tmpfs filesystem.
- Create the tmpfs device with disk init just like creating any other filesystem device, except that you are specifying the tmpfs filesystem you just created. For example, if you named and mounted it as "/mytmpfs":

    use master
    go
    disk init name = "tempdb1_dev1",
    physname = "/mytmpfs/tempdb",
    vdevno = 3, size = 102400
    go

This creates a 200MB device for tempdb on the /mytmpfs device.
- Use alter database to extend tempdb to the tmpfs device:

    alter database tempdb
    on tempdb1_dev1 = 200
    go

- Modify your RUN_Server file to issue a UNIX touch command against tempdb on the tmpfs device before the call to the dataserver. This creates the file if it does not exist, as might happen if the operating system had been rebooted. Upon startup, the server can activate the device and rewrite tempdb. If the file entry was missing, the server would not be able to activate it and tempdb would not be available. For example:

    RUN_SYBASE:
    ----------------------------------------------
    #!/bin/sh
    #
    # Adaptive Server name: SYBASE
    # Master device path: /devices/master.dev
    # Error log path: /sybase/install/SYBASE.log
    # Directory for shared memory files: /sybase
    #
    touch /mytmpfs/tempdb
    /sybase/bin/dataserver -sSYBASE -d/devices/master.dev \
    -e/sybase/install/SYBASE.log -M/sybase \
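The relationship between size = 102400 in the disk init step and the 200MB device it creates is easy to verify. The sketch below assumes that the size argument counts 2 KB pages, which is inferred from the two figures given in the text rather than stated there; the function name disk_init_size is hypothetical.

```python
PAGE_BYTES = 2048   # assumption: disk init sizes are given in 2 KB pages


def disk_init_size(megabytes, page_bytes=PAGE_BYTES):
    """Convert a device size in megabytes to a page count for disk init."""
    return megabytes * 1024 * 1024 // page_bytes


print(disk_init_size(200))   # 102400
```

The same helper can be used to size a differently sized tempdb device before writing the disk init statement.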
UFS in its various forms has been with us since the days of BSD on VAXen the size of refrigerators. The basic UFS concepts thus date back to the early 1980s and represent the second pass at a workable UNIX filesystem, after the very slow and simple filesystem that shipped with the truly ancient Version 7 UNIX. Almost all commercial UNIX OSs have had a UFS, and ext3 in Linux is similar to UFS in design. Solaris inherited UFS through SunOS, and SunOS in turn got it from BSD.
Until recently, UFS was the only filesystem that shipped with Solaris. Unlike HP, IBM, SGI, and DEC, Sun did not develop a next-generation filesystem during the 1990s. There are probably at least two reasons for this: most competitors developed their new filesystems using third party code which required per-system royalties, and the availability of VxFS from Veritas. Considering that a lot of the other vendors' filesystem IP was licensed from Veritas anyway, this seems like a reasonable decision.
Solaris 10 can only boot from a UFS root filesystem. In the future, ZFS boot will be available, as it already is in OpenSolaris. But for now, every Solaris system must have at least one UFS filesystem.
UFS is old technology but it is a stable and fast filesystem. Sun has continuously tuned and improved the code over the last decade and has probably squeezed as much performance out of this type of FS as is possible. Journaling support was added in Solaris 7 at the turn of the century and has been enabled by default since Solaris 9. Before that, volume level journaling was available. In this older scheme, changes to the raw device are journaled, and the filesystem is not journaling-aware. This is a simple but inefficient scheme, and it worked with a small performance penalty. Volume level journaling is now end-of-lifed, but interestingly, the same sort of system seems to have been added to FreeBSD recently. What is old is new again.
UFS is accompanied by the Solaris Volume Manager, which provides perfectly serviceable software RAID.
Where does UFS fit in in 2008? Besides booting, it provides a filesystem which is stable and predictable and better integrated into the OS than anything else. ZFS will probably replace it eventually, but for now, it is a good choice for databases, which have usually been tuned for a traditional filesystem's access characteristics. It is also a good choice for the pathologically conservative administrator, who may not have an exciting job, but who rarely has his nap time interrupted.
ZFS has gotten a lot of hype. It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves. ZFS is not a magic bullet, but it is very cool. I like to think that if UFS and ext3 were first generation UNIX filesystems, and VxFS and XFS were second generation, then ZFS is the first third generation UNIX FS.
ZFS is not just a filesystem. It is actually a hybrid filesystem and volume manager. The integration of these two functionalities is a main source of the flexibility of ZFS. It is also, in part, the source of the famous "rampant layering violation" quote which has been repeated so many times. Remember, though, that this is just one developer's aesthetic opinion. I have never seen a layering violation that actually stopped me from opening a file.
Being a hybrid means that ZFS manages storage differently than traditional solutions. Traditionally, you have a one to one mapping of filesystems to disk partitions, or alternately, you have a one to one mapping of filesystems to logical volumes, each of which is made up of one or more disks. In ZFS, all disks participate in one storage pool. Each ZFS filesystem has the use of all disk drives in the pool, and since filesystems are not mapped to volumes, all space is shared. Space may be reserved, so that one filesystem can't fill up the whole pool, and reservations may be changed at will. However, if you don't want to decide ahead of time how big each filesystem needs to be, there is no need to, and logical volumes never need to be resized. Growing or shrinking a filesystem isn't just painless, it is irrelevant.
ZFS provides the most robust error checking of any filesystem available. All data and metadata are checksummed (SHA256 is available for the paranoid); the checksum is computed on every write and validated on every read. If validation fails and a second copy is available (metadata blocks are replicated even on single disk pools, and data is typically replicated by RAID), the second copy is fetched and the corrupted block is replaced. This protects against not just bad disks, but bad controllers and fibre paths. On-disk changes are committed transactionally, so although traditional journaling is not used, on-disk state is always valid. There is no ZFS fsck program. ZFS pools may be scrubbed for errors (logical and checksum) without unmounting them.
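The read-verify-repair cycle can be illustrated with a small sketch. It assumes a two-way mirror and SHA-256, and the class MirroredBlock is a made-up name for illustration; real ZFS keeps the checksum in the parent block pointer rather than next to the data, which the separate "expected" field only loosely imitates.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class MirroredBlock:
    """One logical block stored as two copies, with its checksum kept apart (illustrative only)."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = checksum(data)

    def read(self) -> bytes:
        for i, copy in enumerate(self.copies):
            if checksum(bytes(copy)) == self.expected:
                # Found a good copy: rewrite any sibling that fails verification.
                for j, other in enumerate(self.copies):
                    if j != i and checksum(bytes(other)) != self.expected:
                        self.copies[j] = bytearray(copy)
                return bytes(copy)
        raise IOError("all copies failed checksum verification")


blk = MirroredBlock(b"important data")
blk.copies[0][0] ^= 0xFF          # simulate silent corruption on one disk
data = blk.read()                 # served from the good copy, bad copy repaired
```

Because the checksum is stored apart from the block it describes, a disk or controller that returns wrong data without reporting an error is still caught.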
The copy-on-write nature of ZFS provides for nearly free snapshot and clone functionality. Snapshotting a filesystem creates a point in time image of that filesystem, mounted on a dot directory in the filesystem's root. Any number of different snapshots may be mounted, and no separate logical volume is needed, as would be for LVM style snapshots. Unless disk space becomes tight, there is no reason not to keep your snapshots forever. A clone is essentially a writable snapshot and may be mounted anywhere. Thus, multiple filesystems may be created based on the same dataset and may then diverge from the base. This is useful for creating a dozen virtual machines in a second or two from an image. Each new VM will take up no space at all until it is changed.
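Why snapshots and clones are nearly free under copy-on-write can be seen in a toy model. The Dataset class below is hypothetical and far simpler than ZFS: each file is one block, and a snapshot is just a frozen copy of the name-to-block table, so no data is copied until something is rewritten.

```python
import itertools


class Dataset:
    """Toy copy-on-write dataset (a sketch, not the ZFS on-disk design)."""

    def __init__(self, blocks=None, ids=None, table=None):
        self.blocks = {} if blocks is None else blocks         # shared block store
        self.ids = itertools.count() if ids is None else ids   # shared block allocator
        self.table = {} if table is None else dict(table)      # file name -> block id
        self.snapshots = {}                                     # snapshot name -> frozen table

    def write(self, name, data):
        # Never overwrite in place: allocate a new block and repoint the live table.
        block_id = next(self.ids)
        self.blocks[block_id] = data
        self.table[name] = block_id

    def read(self, name, snap=None):
        table = self.table if snap is None else self.snapshots[snap]
        return self.blocks[table[name]]

    def snapshot(self, snap):
        self.snapshots[snap] = dict(self.table)   # copies pointers only, no data

    def clone(self, snap):
        # A clone is a writable dataset that starts from a snapshot's table
        # and shares the same underlying blocks.
        return Dataset(self.blocks, self.ids, self.snapshots[snap])


base = Dataset()
base.write("disk.img", b"golden image")
base.snapshot("golden")
vms = [base.clone("golden") for _ in range(12)]   # a dozen VMs, no new blocks yet
vms[0].write("disk.img", b"vm0 has diverged")     # only now is a new block allocated
```

The dozen clones share the single original block until one of them writes, which mirrors the "no space at all until it is changed" behaviour described above.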
These are just a few interesting features of ZFS. ZFS is not a perfect replacement for traditional filesystems yet - it lacks per-user quota support and performs differently than the usual UFS profile. But for typical applications, I think it is now the best option. Its administrative features and self-healing capability (especially when its built in RAID is used) are hard to beat.
SAM and QFS
SAM and QFS are different things but are closely coupled. QFS is Sun's cluster filesystem, meaning that the same filesystem may be simultaneously mounted by multiple systems. SAM is a hierarchical storage manager; it allows a set of disks to be used as a cache for a tape library. SAM and QFS are designed to work together, but each may be used separately.
QFS has some interesting features. A QFS filesystem may span multiple disks with no extra LVM needed to do striping or concatenation. When multiple disks are used, data may be striped or round-robined. Round-robin allocation means that each file is written to one or two disks in the set. This is useful since, unlike striping, participation by all disks is not needed to fetch a file - each disk may seek totally independently. QFS also allows metadata to be separated from data. In this way, a few disks may serve the random metadata workload while the rest serve a sequential data workload. Finally, as mentioned before, QFS is an asymmetric cluster filesystem.
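The difference between the two allocation modes is easy to see in a sketch. The functions below are invented for illustration and say nothing about how QFS is implemented; files are measured in equal-sized chunks, and the round-robin variant places each file on exactly one disk, which is a simplification of the "one or two disks" behaviour mentioned above.

```python
def striped_layout(file_chunks, n_disks):
    """Spread consecutive chunks of every file across all disks in turn."""
    layout = {}
    next_disk = 0
    for name, chunks in file_chunks.items():
        disks = []
        for _ in range(chunks):
            disks.append(next_disk)
            next_disk = (next_disk + 1) % n_disks
        layout[name] = disks
    return layout


def round_robin_layout(file_chunks, n_disks):
    """Place each whole file on a single disk, rotating disks per file."""
    layout = {}
    for i, (name, chunks) in enumerate(file_chunks.items()):
        layout[name] = [i % n_disks] * chunks
    return layout


files = {"a": 4, "b": 4, "c": 4}   # file name -> number of chunks
striped = striped_layout(files, n_disks=4)
rotated = round_robin_layout(files, n_disks=4)
# Number of distinct disks that must seek to fetch each file:
print({name: len(set(disks)) for name, disks in striped.items()})
print({name: len(set(disks)) for name, disks in rotated.items()})
```

With striping every file touches every disk; with round-robin allocation each file touches one, so three readers can work on three disks without interfering with each other.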
QFS cannot manage its own RAID, besides striping. For this, you need a hardware controller, a traditional volume manager, or a raw ZFS volume.
SAM makes a much larger backing store (typically a tape library) look like a regular UNIX filesystem. This is accomplished by storing metadata and often-referenced data on disk, and migrating infrequently used data in and out of the disk cache as needed. SAM can be configured so that all data is staged out to tape, so that if the disk cache fails, the tapes may be used like a backup. Files staged off of the disk cache are stored in tar-like archives, so that potentially random access of small files can become sequential. This can make further backups much faster.
QFS may be used as a local or cluster filesystem for large-file intensive workloads like Oracle. SAM and QFS are often used for huge data sets such as those encountered in supercomputing. SAM and QFS are optional products and are not cheap, but they have recently been released into OpenSolaris.
The Veritas filesystem and volume manager have their roots in a fault-tolerant proprietary minicomputer built by Veritas in the 1980s. They have been available for Solaris since at least 1993 and have been ported to AIX and Linux. They are integrated into HP-UX and SCO UNIX, and Veritas Volume Manager code has been used (and extensively modified) in Tru64 UNIX and even in Windows. Over the years, Veritas has made a lot of money licensing their tech, and not because it is cheap, but because it works.
VxFS has never been part of Solaris but, when UFS was the only option, it was a popular addition. VxVM and VxFS are tightly integrated. Through vxassist, one may shrink and grow filesystems and their underlying volumes with minimal trouble. VxVM provides online RAID relayout. If you have a RAID5 and want to turn it into a RAID10, no problem, no downtime. If you need more space, just convert it back to a RAID5. VxVM has a reputation for being cryptic, and to some extent it is, but it's not so bad and the flexibility is impressive.
VxFS is a fast, extent based, journaled, clusterable filesystem. In fact, it essentially introduced these features to the world, along with direct IO. Newer versions of VxFS and VxVM have the ability to do cross-platform disk sharing. If you ever wanted to unmount a volume from your AIX box and mount it on Linux or Solaris, now you can.
VxFS and VxVM are still closed source. A version is available from Symantec that is free on small servers, with limitations, but I imagine that most users still pay. Pricing starts around $2500 and can be shocking for larger machines. VxFS and VxVM are solid choices for critical infrastructure workloads, including databases.
These are the four major choices in the Solaris on-disk filesystem world. Other filesystems, such as ext2, have some degree of support in OpenSolaris, and FUSE is also being worked on. But if you are deploying a Solaris server, you are going to be using one or more of these four. I hope that you enjoyed this overview, and if you have any corrections or tales of UNIX filesystem history, please let me know.
Description: A tutorial about one of the really hidden features of Solaris - CacheFS.
CacheFS is something similar to a caching proxy. But this proxy doesn't cache web pages; it caches files from another filesystem.
Contact: joerg.moellenkamp [ at ] sun.com
August 7, 2007 | KernelTrap Submitted by Jeremy on August 7, 2007 - 9:26am.
In a recent lkml thread, Linus Torvalds was involved in a discussion about mounting filesystems with the noatime option for better performance, "'noatime,data=writeback' will quite likely be *quite* noticeable (with different effects for different loads), but almost nobody actually runs that way." He noted that he set O_NOATIME when writing git, "and it was an absolutely huge time-saver for the case of not having 'noatime' in the mount options. Certainly more than your estimated 10% under some loads." The discussion then looked at using the relatime mount option to improve the situation, "relative atime only updates the atime if the previous atime is older than the mtime or ctime. Like noatime, but useful for applications like mutt that need to know when a file has been read since it was last modified." Ingo Molnar stressed the significance of fixing this performance issue, "I cannot over-emphasize how much of a deal it is in practice. Atime updates are by far the biggest IO performance deficiency that Linux has today. Getting rid of atime updates would give us more everyday Linux performance than all the pagecache speedups of the past 10 years, _combined_." He submitted some patches to improve relatime, and noted about atime, "It's also perhaps the most stupid Unix design idea of all times. Unix is really nice and well done, but think about this a bit: 'For every file that is read from the disk, lets do a ... write to the disk! And, for every file that is already cached and which we read from the cache ... do a write to the disk!'"
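The relatime rule quoted in the thread is small enough to state as code. The sketch below is an illustration, not kernel source: the function names are made up, timestamps are plain integers, and treating an atime equal to mtime or ctime as stale is an assumption of this sketch.

```python
def relatime_needs_update(atime, mtime, ctime):
    """Update atime only if it is not newer than the last modification or change."""
    return atime <= mtime or atime <= ctime


def count_atime_writes(events, policy):
    """Count metadata writes caused by reads under a given atime policy.

    events: list of ("read", t) or ("write", t) tuples in time order.
    policy: "strict", "relatime" or "noatime".
    """
    atime = mtime = ctime = 0
    writes = 0
    for kind, t in events:
        if kind == "write":
            mtime = ctime = t
        elif kind == "read":
            if policy == "strict" or (
                policy == "relatime" and relatime_needs_update(atime, mtime, ctime)
            ):
                atime = t
                writes += 1
    return writes


events = [("write", 1), ("read", 2), ("read", 3), ("read", 4),
          ("write", 5), ("read", 6), ("read", 7)]
for policy in ("strict", "relatime", "noatime"):
    print(policy, count_atime_writes(events, policy))
```

Under the strict policy every read costs a write; under relatime only the first read after each modification does, which is enough for a mail reader such as mutt to tell whether a file has been read since it last changed.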
Feb 22, 2007 (blogs.sun.com)
As many of you noticed, Solaris now supports SATA controllers and devices. To simplify writing SATA HBA drivers, a new module and a set of interfaces were created, referred to as either the SATA Framework or the SATA module. I was a principal architect of the SATA framework, but several other Sun engineers participated in the conceptual design and the shaping of the interfaces.
It is not a small piece of software: the source, sata.c, is over 300k in size. Reading this code, with its associated header files, may be a little confusing. So I created an overview of the sata module, explaining what it is, how it fits into the Solaris kernel, what it does, what the interfaces are, and how sample operations are performed. Hopefully, it will be useful for all who want to improve and expand SATA support in Solaris. A similar overview was presented about a year ago at the Silicon Valley Open Solaris User Group meeting in Santa Clara and on various occasions internally at Sun. The overview that I plan to present here will have several parts. Here is the first one...
Often it is necessary to set up a partition table for a disk to be the same as on another disk, for example, where the disks are mirrored. This can be achieved by using the format utility. In the instructions that follow, the original disk is called disk a and the second disk (which will have the same partition table) is called disk b.
format
<select disk a> (Select disk a from the list displayed.)
partition
print (Print out the partition table to list the partition table.)
name
rootdisk (Pick a name of your choice.)
quit (Go back to the format menu.)
disk (Go to the menu that allows you to select disk b.)
<select disk b> (Select disk b from the list displayed.)
partition (Print out the partition table before changing.)
select --Pick rootdisk (Pick from the menu the name you gave above.)
label (Write out the partition table to disk.)
quit (Go back to the format menu.)
quit (Exit format.)
[July 27, 2001] Solaris partitioning: Partitioning in itself doesn't give efficiency and can actually be a hindrance, since you cannot easily expand a partition unless you use LVM (Logical Volume Manager).
It depends on your disk sub system: How many disks, software RAID or hardware RAID (1, 0+1, 5), SCSI or IDE.
Generally, I think of my hard disk content as divided into 3 categories: data, configuration files, and binaries/applications/OS.
Efficiency can be gained by distributing I/O load between different disk subsystems.
For example, say the webserver generates lots of logging info on every request, and every request generates database I/O activity too. It would then make sense to place the webserver logging data and the DB on different disks (and therefore on different partitions). This is especially true for SCSI, but IDE disks should benefit too.
General rules of thumb:
/home should be on its own partition and ideally on its own disk. Of course, this depends on whether your server has local users or uses .maildir (qmail).
If you have users and user data in /home, this is very convenient, especially when performing dangerous upgrades (unmount it), restoring the system after a disk crash or compromise, or if users need more disk space (see IBM's excellent article on moving /home, on their developer network). Size? It depends entirely on your users, but allocate _a lot_, since you can't just clean up in the users' home dirs if size becomes a problem.
/var should be on its own partition. This may give a little extra security and stability, since /var is used for dynamic data and log files. If a process runs amok (or a DoS attack occurs) and generates ever-expanding log files, the damage is constrained to a single partition. This may prevent the system from crashing. A couple of GB is not too little.
Some like a separate /boot partition of, e.g., 50MB. (I don't use that.)
/usr may be a candidate for its own partition. If so, then allocate it lots of free space, since /usr tends to grow a lot with time, and the extra free space may be needed during distribution upgrades. A couple of GB will do fine for many.
swap The official guideline for swap space with kernel 2.4 is swap space = 2*RAM.
So if the server has 256MB RAM, use 512MB for swap. Again, check out IBM's Linux section on their developer network. They have a nice article on swap usage; e.g., if you have 2 disks, make a 256MB swap partition on each. Swapping would then be parallelized, which means that it would have the same speed advantage as RAID 0.
Always allocate much more space on a partition than you need.
Don't make too many partitions
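The swap rule of thumb above (swap = 2*RAM for kernel 2.4, split evenly across disks so that swapping is parallelized) works out as follows. The helper swap_plan is a hypothetical name used only for this worked example; sizes are in MB.

```python
def swap_plan(ram_mb, n_disks, factor=2):
    """Return the swap partition size (MB) to create on each disk."""
    total = factor * ram_mb
    per_disk, remainder = divmod(total, n_disks)
    # Hand out any leftover megabytes one per disk so the sizes sum to the total.
    return [per_disk + (1 if i < remainder else 0) for i in range(n_disks)]


print(swap_plan(256, 1))   # one disk holds the full 2*RAM
print(swap_plan(256, 2))   # two disks share it equally
```

For the 256MB server in the example, this gives a single 512MB partition on one disk, or 256MB on each of two disks.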
The Gartner Group rates IT management processes, referring to different levels of management sophistication as "maturity levels." (Gartner clients can refer to Research Note #DF-08-6312, 'IT Management Process Maturity,' by analysts Donna Scott and David Williams.) Gartner describes a range of maturity levels: Chaotic (no consistent use of performance tools); Reactive (organization uses event consoles); Proactive (organization uses performance monitoring and historical tools); Service (organization employs capacity planning); Value (IT/business metric linkage). This article will discuss why you want to move the maturity of your operations from "chaotic" to "value," and provide an overview of the classes of tool that can simplify that evolution. This broadly based article is intended for those in technical through management positions who are preparing to select or justify the purchase of system performance management tools.
Read this Sun BluePrint by Jon Hill, a consultant for TeamQuest Corporation, and Kemer Thomson, a Senior Staff Engineer in Sun Microsystems' Enterprise Engineering group.
[PDF] File System Performance: The Solaris, UFS, Linux ext3, and ReiserFS
[PDF] Design, Features, and Applicability of Solaris File Systems
You too can understand device numbers and mapping in Solaris ...
High Availability: Configuring Boot, Root and Swap (PDF)
Scrubbing Disk Using the Solaris Operating Environment Format Program (June 2000), by Rob Snevely. Rob explains how to effectively scrub disks on a Solaris Operating Environment system, using the format utility.