A file system consists of blocks of data. The number of bytes constituting a block varies depending on the OS. The internal physical structure of a hard disk consists of cylinders. The hard disk is divided into groups of cylinders known as cylinder groups, further divided into blocks.
The file system comprises five main blocks, including the boot block, super block, inode block, and data block.
Boot block. The boot block is part of the disk label that contains a loader used to boot the operating system.
Super block. All partitions within the Unix filing system usually contain a special block called the super block. The super block contains the basic information about the entire file system. It stores the following details about the file system:
A super block plays an important role during the system boot and shutdown process. When the system boots, the details in the super block are loaded into memory to speed up processing. The super block on disk is then updated at regular intervals from the copy in memory. During system shutdown, a program called sync writes the updated data in memory back to the super block. This step is crucial because an inaccurate super block can lead to an unusable file system, which is why the proper shutdown of a Solaris system is essential.
Because of the critical nature of the super block, it is replicated at the beginning of every cylinder group. These blocks are known as surrogate super blocks. A damaged or corrupted super block is recovered from one of the surrogate super blocks.
Inode block. An inode is a kernel structure that holds a file's metadata, such as file type, permissions, owner and group information, file size, and file modification time, together with pointers to the disk blocks that store the file's data. Note that the inode does not contain the filename. The filename is stored in a directory, which holds a list of filenames and the inode numbers associated with them. When a user attempts to access a given file by name, the name is looked up in the directory, where the corresponding inode number is found. The inode stores the following information about every file:
Each inode has a unique number associated with it, called the inode number. The -li option of the ls command displays the inode number of a file:
# ls -li
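The same inode number can be read programmatically. The following is a minimal Python sketch, not part of the original text: it creates a throwaway file (the name example.txt is made up for illustration) and reads the inode number and link count that `ls -li` would show.

```python
import os
import tempfile


def inode_info(path):
    """Return (inode number, link count) for path, as `ls -li` would report them."""
    st = os.stat(path)
    return st.st_ino, st.st_nlink


# Create a scratch file so the example does not depend on any existing path.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "example.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("hello\n")
    ino, nlink = inode_info(path)
    print(ino, nlink, os.path.basename(path))
```

A freshly created regular file has a link count of 1, because exactly one directory entry refers to its inode.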
When a user creates a file in the directory or modifies it, the following events occur:
Data block
The data block is the storage unit of data in the Solaris file system. The default size of a data block in the Solaris file system is 8192 bytes. After a block is full, the file is allotted another block. The addresses of these blocks are stored as an array in the Inode.
The first 12 pointers in the array are direct addresses; that is, they point to the first 12 data blocks where the file contents are stored. If the file grows larger than these 12 blocks, the 13th pointer is used. It points to a block that does not contain file data. This block, called an indirect block, contains the addresses of the next set of data blocks.
If the file grows still larger, the 14th pointer is used. It points to a block that contains the addresses of a set of indirect blocks. This block is called the double indirect block. If the file grows still larger, the 15th pointer is used. It points to a block that contains the addresses of a set of double indirect blocks. This block is called the triple indirect block.
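To make the addressing scheme concrete, here is a small Python sketch that reports which kind of pointer reaches the block holding a given byte offset. The 8192-byte block size and the 12 direct pointers come from the text; the 4-byte block address (hence 2048 addresses per indirect block) is an assumption made for illustration.

```python
BLOCK_SIZE = 8192                            # default data block size from the text
NUM_DIRECT = 12                              # direct pointers held in the inode
POINTER_SIZE = 4                             # assumption: 4-byte block addresses
PTRS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE  # 2048 addresses per indirect block


def pointer_level(offset):
    """Return which kind of pointer reaches the block containing byte `offset`."""
    block = offset // BLOCK_SIZE
    if block < NUM_DIRECT:
        return "direct"
    block -= NUM_DIRECT
    if block < PTRS_PER_BLOCK:
        return "single indirect"
    block -= PTRS_PER_BLOCK
    if block < PTRS_PER_BLOCK ** 2:
        return "double indirect"
    block -= PTRS_PER_BLOCK ** 2
    if block < PTRS_PER_BLOCK ** 3:
        return "triple indirect"
    raise ValueError("offset beyond what triple indirection can address")


for off in (0, 12 * BLOCK_SIZE - 1, 12 * BLOCK_SIZE, (12 + 2048) * BLOCK_SIZE):
    print(off, pointer_level(off))
```

Under these assumptions the direct pointers cover the first 96 KB of a file, and each additional level of indirection multiplies the reachable range by 2048.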
Vnodes. A Virtual Node or vnode is a data structure that represents an open file, directory, or device that appears in the file system namespace. A vnode does not expose the details of the physical file system that implements it. The vnode interface allows high-level operating system modules to perform uniform operations on vnodes.
Hard and soft links are great features of Unix. A link is a reference in a directory to a file, which may be stored in another directory. A soft link can also refer to a directory. There might be multiple links to a file. Links eliminate redundancy because you do not need to store multiple copies of a file.
Links are of two types: hard and soft (also known as symbolic).
To create a symbolic link, you must use the -s option with the ln command. Soft links show an l as the first character of the permissions string displayed by the ls -l command, whereas hard links do not. A directory can be symbolically linked. However, it cannot be hard linked.
No file exists with a link count less than one. The relative pathnames . and .. are simply links to the current directory and its parent directory. They are present in every directory: any directory stores the two links . and .. along with the names and inode numbers of its files. They can be listed with the ls -lia command. A directory must have a minimum of two links. The number of links increases as the number of subdirectories increases. Whenever you issue a command to list the file attributes, it uses the inode number to look up the inode and retrieve the corresponding data.
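The difference between the two link types can be observed directly. The following Python sketch (file names are made up for illustration) creates one hard link and one symbolic link to the same file in a scratch directory and compares inode numbers and link counts.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "data.txt")  # hypothetical file names
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")
    with open(target, "w") as f:
        f.write("payload\n")

    os.link(target, hard)     # hard link: a second directory entry for the same inode
    os.symlink(target, soft)  # symbolic link: a separate file that stores a path

    same_inode = os.stat(target).st_ino == os.stat(hard).st_ino
    link_count = os.stat(target).st_nlink
    soft_is_link = os.path.islink(soft)
    # lstat() examines the link itself instead of following it to the target.
    soft_own_inode = os.lstat(soft).st_ino != os.stat(target).st_ino

    print(same_inode, link_count, soft_is_link, soft_own_inode)
```

The hard link shares the target's inode and raises its link count, while the symbolic link has an inode of its own and leaves the target's link count unchanged.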
Each file system used in Solaris is intended for a specific purpose.
The root file system is at the top of an inverted tree structure. It is the first file system that the kernel mounts during booting. It contains the kernel and device drivers. The / directory is also called the mount point directory of the file system. All references in the file system are relative to this directory. The entire file system structure is attached to the main system tree at the root directory during the process of mounting, and hence the name. During the creation of the file system, a lost+found directory is created within the mount point directory. This directory holds any unreferenced files that are found during the file system check, which you run with the fsck command.
/ (root)
The directory located at the top of the Unix file system. It is represented by the "/" (forward slash) character.
You create file systems with the newfs command. The newfs command accepts only logical raw device names. The syntax is as follows:
newfs [ -v ] [ mkfs-options ] raw-special-device
For example, to create a file system on the disk slice c0t3d0s4, the following command is used:
# newfs -v /dev/rdsk/c0t3d0s4
The -v option prints the actions in verbose mode. The newfs command calls the mkfs command to create a file system. You can invoke the mkfs command directly by specifying a -F option followed by the type of file system.
Mounting file systems is the next logical step to creating file systems. Mounting refers to naming the file system and attaching it to the inverted tree structure. This enables access from any point in the structure. A file system can be mounted during booting, manually from the command line, or automatically if you have enabled the automount feature.
With remote file systems, the server shares the file system over the network and the client mounts it.
The / and /usr file systems, as mentioned earlier, are mounted during booting. To mount a file system, attach it to a directory anywhere in the main inverted tree structure. This directory is known as the mount point. The syntax of the mount command is as follows:
# mount <logical block device name> <mount point>
The following steps mount a file system c0t2d0s7 on the /export/home directory:
# mkdir /export/home
# mount /dev/dsk/c0t2d0s7 /export/home
You can verify the mounting by using the mount command, which lists all the mounted file systems.
Note: If the mount point directory has any content prior to the mounting operation, it is hidden and remains inaccessible until the file system is unmounted.
Data is stored on and retrieved from the physical disk where the file system resides. Although there are no defined specifications for creating the file systems on the physical disk, slices are usually allocated as follows:
0. Root or /— Files and directories of the OS.
The slices shown above are all allocated on a single disk. However, there is no restriction that all file systems need to be located on a single disk. They can also be spread across multiple disks. Slice 2 refers to the entire disk. Hence, if you want to allocate an entire disk for a file system, you can do so by creating it on slice 2.
The mount command supports a variety of useful options.
Option | Description |
---|---|
-o largefiles | Files larger than 2GB are supported in the file system. |
-o nolargefiles | Does not mount file systems with files larger than 2GB. |
-o rw | File system is mounted with read and write permissions. |
-o ro | File system is mounted with read-only permission. |
-o bg | Repeats mount attempts in the background. Used with non-critical file systems. |
-o fg | Repeats mount attempts in the foreground. Used with critical file systems. |
-p | Prints the list of mounted file systems in /etc/vfstab format. |
-m | Mounts without making an entry in the /etc/mnttab file. |
-O | Performs an overlay mount. Mounts over an existing mount point. |
A file system can be unmounted with the umount command. The following is the syntax for umount:
umount <mount-point or logical block device name>
File systems cannot be unmounted when they are in use or when the umount command is issued from any subdirectory within the file system mount point.
Note: A file system can be unmounted forcibly if you use the -f option of the umount command. Please refer to the man page to learn about the use of these options.
The umountall command is used to unmount a group of file systems. The umountall command unmounts all file systems in the /etc/mnttab file except the /, /usr, /var, and /proc file systems. If you want to unmount all the file systems from a specified host, use the -h option. If you want to unmount all the file systems mounted from remote hosts, use the -r option.
The /etc/vfstab (Virtual File System Table) file plays a very important role in system operations. This file contains one record for every device that has to be automatically mounted when the system enters run level 2.
Column Name | Description |
---|---|
device to mount | The logical block name of the device to be mounted. It can also be a remote resource name for NFS. |
device to fsck | The logical raw device name to be subjected to the fsck check during booting. It is not applicable for read-only file systems, such as High Sierra File System (HSFS), and network file systems such as NFS. |
Mount point | The mount point directory. |
FS type | The type of the file system. |
fsck pass | The number used by fsck to decide whether the file system is to be checked. 0: File system is not checked. 1: File system is checked sequentially. 2: File system is checked simultaneously along with other file systems where this field is set to 2. |
Mount at boot | Determines whether the file system is mounted by the mountall command at boot time. The options are either yes or no. |
Mount options | The mount options to be passed to the mount command when the particular file system is mounted. |
Note the no values in the mount at boot field for the root, /usr, and /var file systems. These are mounted by default. The fd entry refers to the file descriptor file system mounted at /dev/fd, and the swap entry of type tmpfs refers to the tmpfs file system mounted on /tmp.
A sample vfstab file looks like:
#device device mount FS fsck mount mount
#to mount to fsck point type pass at boot options
#
fd - /dev/fd fd - no -
/proc - /proc proc - no -
/dev/dsk/c0t0d0s4 - - swap - no -
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
/dev/dsk/c0t0d0s6 /dev/rdsk/c0t0d0s6 /usr ufs 1 no -
/dev/dsk/c0t0d0s3 /dev/rdsk/c0t0d0s3 /var ufs 1 no -
/dev/dsk/c0t0d0s7 /dev/rdsk/c0t0d0s7 /export/home ufs 2 yes -
/dev/dsk/c0t0d0s5 /dev/rdsk/c0t0d0s5 /opt ufs 2 yes -
/dev/dsk/c0t0d0s1 /dev/rdsk/c0t0d0s1 /usr/openwin ufs 2 yes -
swap - /tmp tmpfs - yes -
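The seven-column layout described above lends itself to simple parsing. The following Python sketch is an illustration only: the VfstabEntry type and parse_vfstab function are hypothetical names, and the sample text is an excerpt of the file shown above. It splits each non-comment line into the seven fields and lists the mount points flagged for mounting at boot.

```python
from typing import List, NamedTuple


class VfstabEntry(NamedTuple):
    """One vfstab record; field names follow the column table in the text."""
    device_to_mount: str
    device_to_fsck: str
    mount_point: str
    fs_type: str
    fsck_pass: str
    mount_at_boot: str
    mount_options: str


def parse_vfstab(text: str) -> List[VfstabEntry]:
    """Parse vfstab-style text, skipping blank lines and # comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
        entries.append(VfstabEntry(*fields))
    return entries


SAMPLE = """\
#device device mount FS fsck mount mount
#to mount to fsck point type pass at boot options
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
/dev/dsk/c0t0d0s7 /dev/rdsk/c0t0d0s7 /export/home ufs 2 yes -
swap - /tmp tmpfs - yes -
"""

boot_mounts = [e.mount_point for e in parse_vfstab(SAMPLE) if e.mount_at_boot == "yes"]
print(boot_mounts)
```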
The /etc/mnttab file comprises a table that defines which partitions and/or disks are currently mounted by the system.
The /etc/mnttab file contains the following details about each mounted file system:
The file system name
The mount point directory
The file system type
The mount command options
A number denoting the time at which the file system was mounted
A sample mnttab file:
/dev/dsk/c0t0d0s0 / ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200000 1014366934
/dev/dsk/c0t0d0s6 /usr ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200006 1014366934
/proc /proc proc dev=4300000 1014366933
mnttab /etc/mnttab mntfs dev=43c0000 1014366933
fd /dev/fd fd rw,suid,dev=4400000 1014366935
/dev/dsk/c0t0d0s3 /var ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200003 1014366937
swap /var/run tmpfs xattr,dev=1 1014366937
swap /tmp tmpfs xattr,dev=2 1014366939
/dev/dsk/c0t0d0s5 /opt ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200005 1014366939
/dev/dsk/c0t0d0s7 /export/home ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200007 1014366939
/dev/dsk/c0t0d0s1 /usr/openwin ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200001 1014366939
-hosts /net autofs indirect,nosuid,ignore,nobrowse,dev=4580001 1014366944
auto_home /home autofs indirect,ignore,nobrowse,dev=4580002 1014366944
-xfn /xfn autofs indirect,ignore,dev=4580003 1014366944
sun:vold(pid295) /vol nfs ignore,dev=4540001 1014366950
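The last field of each mnttab record is easier to read once converted to a date. The sketch below is illustrative only: parse_mnttab_line is a hypothetical helper, and it assumes the mount time is a count of seconds since the Unix epoch. It splits one record from the sample into its five fields and converts the time to UTC.

```python
from datetime import datetime, timezone


def parse_mnttab_line(line):
    """Split one mnttab record into its five whitespace-separated fields."""
    special, mount_point, fs_type, options, mount_time = line.split()
    return {
        "special": special,
        "mount_point": mount_point,
        "fs_type": fs_type,
        "options": options.split(","),
        # Assumption: the time field is seconds since 1970-01-01 UTC.
        "mounted_at": datetime.fromtimestamp(int(mount_time), tz=timezone.utc),
    }


entry = parse_mnttab_line(
    "/dev/dsk/c0t0d0s0 / ufs "
    "rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200000 1014366934"
)
print(entry["mount_point"], entry["fs_type"], entry["mounted_at"].isoformat())
```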
Some applications and processes create temporary files that occupy a lot of hard disk space. As a result, it is necessary to impose a restriction on the size of the files that are created.
Solaris provides tools to control the storage. They are:
The ulimit command
Disk quotas
The ulimit command is a built-in shell command that displays the current file size limit. The default value for the maximum file size, set inside the kernel, is 1500 blocks. The following command displays the current limits:
$ ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 256
memory(kbytes) unlimited
If the limit is not set, it reports as unlimited.
The system administrator and the individual users change this value to set the file size at the system level and at the user level, respectively. The following is the syntax of the ulimit command:
ulimit <value>
For example, the following command sets the file size limit to 1600 blocks:
# ulimit 1600
# ulimit -a
time(seconds) unlimited
file(blocks) 1600
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 256
memory(kbytes) unlimited
The file size can be limited at the system level or the user level. To set it at the system level, change the value of the ulimit variable in the /etc/profile file. To set it at the user level, change the value in the .profile file present in the user's home directory. The user-level setting always takes precedence over the system-level setting. It is the user's profile file that sets the working environment.
Note: The ulimit values set at the user level and system level cannot exceed the default ulimit value set in the kernel.
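A process can also inspect its own file size limit without going through the shell. The following Python sketch uses the standard resource module, which is available on Unix-like systems. Reporting the value in 512-byte blocks is an assumption made here to mirror the file(blocks) line of the ulimit output.

```python
import resource

BLOCK_BYTES = 512  # assumption: ulimit reports file size in 512-byte blocks


def file_size_limit_blocks():
    """Return the soft file size limit in blocks, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft // BLOCK_BYTES


limit = file_size_limit_blocks()
print("unlimited" if limit is None else limit)
```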
Blogged by matty as Solaris Storage - matty Sat 29 Jan 2005 12:14 am
I recently needed to grow a Solaris UFS file system, and accomplished this with the growfs(1m) utility. The growfs(1m) utility takes two arguments. The first argument to growfs ( the value passed to "-M" ) is the mount point of the file system to grow. The second argument is the raw device that backs this mount point. The following example will grow "/test" to the maximum size available on the meta device d100:
$ growfs -M /test /dev/md/rdsk/d100
To see how many sectors will be available on d100 after the grow operation completes, you can run newfs with the "-N" option, and compare that with the current value of df (1m):
$ newfs -N /dev/md/dsk/d100
/dev/md/rdsk/d0: 232331520 sectors in 56944 cylinders of 16 tracks, 255 sectors
113443.1MB in 2191 cyl groups (26 c/g, 51.80MB/g, 6400 i/g)
This will report the number of sectors, cylinders and MBs that would be allocated if a new file system were created on meta device d100. As always, test everything on a non-critical system prior to making changes to critical boxen.
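The figures in the newfs -N output can be cross-checked with a little arithmetic. This Python sketch multiplies out the reported geometry and converts the sector count to megabytes; the 512-byte sector size is an assumption, since the output above does not state it.

```python
SECTOR_BYTES = 512  # assumption: 512-byte sectors


def geometry_to_mb(cylinders, tracks_per_cylinder, sectors_per_track):
    """Return (total sectors, size in MB) for the given disk geometry."""
    sectors = cylinders * tracks_per_cylinder * sectors_per_track
    return sectors, sectors * SECTOR_BYTES / (1024 * 1024)


# Geometry taken from the newfs -N output: 56944 cylinders, 16 tracks, 255 sectors.
sectors, mb = geometry_to_mb(56944, 16, 255)
print(sectors, round(mb, 1))
```

With these inputs the sector count and the rounded size agree with the 232331520 sectors and 113443.1MB printed by newfs.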
Recently, I wanted to create a UFS file system on a Maxtor OneTouch II external hard drive I have. I wanted to use the external hard drive for storing some large files and I was going to use the drive exclusively with one of my Solaris systems. Now, I didn't find much information on the web about how to perform this with Solaris (maybe I wasn't searching very well or something) so I thought I would post the procedure I followed here so I'll know how to do it again if I need to.
After plugging the hard drive into my system via one of the USB ports, we can verify that the disk was recognized by the OS by examining the /var/adm/messages file. With the hard drive I was using, I saw entries like the following:
Mar 2 13:10:33 solaris-filer usba: [ID 912658 kern.info] USB 2.0 device (usbd49,7100) operating at hi speed (USB 2.x) on USB 2.0 root hub: storage@3, scsa2usb0 at bus address 2
Mar 2 13:10:33 solaris-filer usba: [ID 349649 kern.info] Maxtor OneTouch II L60LHYQG
Mar 2 13:10:33 solaris-filer genunix: [ID 936769 kern.info] scsa2usb0 is /pci@0,0/pci1028,11d@1d,7/storage@3
Mar 2 13:10:33 solaris-filer genunix: [ID 408114 kern.info] /pci@0,0/pci1028,11d@1d,7/storage@3 (scsa2usb0) online
Mar 2 13:10:33 solaris-filer scsi: [ID 193665 kern.info] sd1 at scsa2usb0: target 0 lun 0
The dmesg command could also be used to see similar information. Also, we could use the rmformat command (this lists removable media) to see this information in a much nicer format like so:
# rmformat -l
Looking for devices...
1. Logical Node: /dev/rdsk/c1t0d0p0
Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
Connected Device: QSI CDRW/DVD SBW242U UD25
Device Type: DVD Reader
2. Logical Node: /dev/rdsk/c2t0d0p0
Physical Node: /pci@0,0/pci1028,11d@1d,7/storage@3/disk@0,0
Connected Device: Maxtor OneTouch II 023g
Device Type: Removable
#
Now that we know the drive has been identified by Solaris (as /dev/rdsk/c2t0d0p0) we need to create one Solaris partition (this is Solaris 10 running on the x86 architecture) that uses the whole disk. This is accomplished by passing the -B flag to the fdisk command, like so:
# fdisk -B /dev/rdsk/c2t0d0p0
Now we will print the disk table to standard out like so:
# fdisk -W - /dev/rdsk/c2t0d0p0
This will output the following information to the screen for the hard drive I am using:
* /dev/rdsk/c2t0d0p0 default fdisk table
* Dimensions:
* 512 bytes/sector
* 63 sectors/track
* 255 tracks/cylinder
* 36483 cylinders
*
* systid:
* 1: DOSOS12
* 2: PCIXOS
* 4: DOSOS16
* 5: EXTDOS
* 6: DOSBIG
* 7: FDISK_IFS
* 8: FDISK_AIXBOOT
* 9: FDISK_AIXDATA
* 10: FDISK_0S2BOOT
* 11: FDISK_WINDOWS
* 12: FDISK_EXT_WIN
* 14: FDISK_FAT95
* 15: FDISK_EXTLBA
* 18: DIAGPART
* 65: FDISK_LINUX
* 82: FDISK_CPM
* 86: DOSDATA
* 98: OTHEROS
* 99: UNIXOS
* 101: FDISK_NOVELL3
* 119: FDISK_QNX4
* 120: FDISK_QNX42
* 121: FDISK_QNX43
* 130: SUNIXOS
* 131: FDISK_LINUXNAT
* 134: FDISK_NTFSVOL1
* 135: FDISK_NTFSVOL2
* 165: FDISK_BSD
* 167: FDISK_NEXTSTEP
* 183: FDISK_BSDIFS
* 184: FDISK_BSDISWAP
* 190: X86BOOT
* 191: SUNIXOS2
* 238: EFI_PMBR
* 239: EFI_FS
*
* Id Act Bhead Bsect Bcyl Ehead Esect Ecyl Rsect Numsect
191 128 0 1 1 254 63 1023 16065 586083330
We now need to calculate the maximum amount of usable storage. This is done by multiplying bytes/sector (512 in my case) by the number of sectors listed at the bottom of the output shown above. We then divide this number by 1024*1024 to yield MBs. So in my case, this works out as 286173.5009765625 MB.
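The calculation just described can be written out as a short Python sketch. The sector count and sector size are the values from the fdisk output; the function name is made up for illustration.

```python
def sectors_to_mb(num_sectors, bytes_per_sector=512):
    """Convert a sector count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return num_sectors * bytes_per_sector / (1024 * 1024)


# Numsect from the last line of the fdisk table.
usable_mb = sectors_to_mb(586083330)
print(usable_mb)
```

The slice sizes chosen for the partition table file need to stay at or below this figure.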
Now, we need to set up a partition table file. This will be a regular text file and you can name it whatever you like. For the sake of this post, I will name it disk_slices.txt. The contents of this file are:
slices: 0 = 2MB, 286170MB, "wm", "root" :
1 = 0, 1MB, "wu", "boot" :
2 = 0, 286172MB, "wm", "backup"
To create these slices on the disk, we run:
# rmformat -s disk_slices.txt /dev/rdsk/c2t0d0p0
# devfsadm
# devfsadm -C
To create the UFS file system on the newly created slice, I run the following and the output from running this command is also shown:
# newfs /dev/rdsk/c2t0d0s0
newfs: construct a new file system /dev/rdsk/c2t0d0s0: (y/n)? y
/dev/rdsk/c2t0d0s0: 586076160 sectors in 95390 cylinders of 48 tracks, 128 sectors
286170.0MB in 5962 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
...............................................................................
........................................
super-block backups for last 10 cylinder groups at:
585105440, 585203872, 585302304, 585400736, 585499168, 585597600, 585696032, 585794464, 585892896, 585991328
#
And now I'm finished: I have a UFS file system on my USB hard drive which can be mounted by my Solaris system. To mount this file system, I use the block device for the slice that newfs was run on:
# mount -F ufs /dev/dsk/c2t0d0s0 /u01
This document describes a methodology for configuring a fast file system that handles many small files on the Solaris Operating System. This could be used for building a Java technology-based product or for handling many operations on a large number of small files. This methodology utilizes a tmpfs volume, and it can speed up operations approximately threefold.
The requirements are as follows:
- Solaris 7 OS through Solaris 10 OS Update 1
- Some experience with Solaris system administration. This procedure is not recommended for UNIX users who are uncomfortable with using mount, maintaining /etc/vfstab, or modifying their kernel parameters.
Warning: Do not develop on a tmpfs volume. A tmpfs volume is only persistent while the system is powered up, so a power loss or system problem will cause you to lose any changes to that volume.
Procedure
Solaris tmpfs volumes are easy to create, but require a significant amount of RAM and swap space. It is recommended that you have at least 1 Gbyte of RAM, but there have also been major performance gains on systems with 512 Mbytes of RAM. In addition, you should add twice as much swap space as the tmpfs volume you are creating. That is, for a 2-Gbyte tmpfs volume, add 4 Gbytes of swap space to the system. Feel free to experiment with these values.
The following examples are for a 2-Gbyte tmpfs volume, which is approximately what is needed to do a developer build. Replace <swapfilename> with the absolute path to a swapfile (such as /disk1/swapfile), and <mountpoint> with the absolute path to where you want the tmpfs volume mounted (such as /ramdisk).
Add swap space to your workstation:
root# /usr/sbin/mkfile 2000m <swapfilename>
Create a mount point for the tmpfs volume:
root# mkdir <mountpoint>
Edit your /etc/vfstab file to use the swap and create the tmpfs volume at boot time. Add the following two lines:
<swapfilename> - - swap - no -
RAMDISK - <mountpoint> tmpfs - yes size=2000m
Note that on the Solaris 7 OS you may not make a single tmpfs volume larger than 2 Gbytes.
Edit your kernel parameters to increase the number of files you can create in the tmpfs volume. Add the following line to your /etc/system file. (We've had the most success using this value.)
set tmpfs:tmpfs_maxkmem=250000000
Reboot your workstation. Then verify that the tmpfs volume exists at the size you specified:
% df -k <mountpoint>
Make the tmpfs volume writable. Note: This step is necessary after each reboot of the workstation.
root# chmod 777 <mountpoint>
Wednesday Jun 01, 2005
More UFS technical tidbits in anticipation of OpenSolaris. Today's talk is about UFS I/O. It is a complicated beast and has many different parts and paths it can take.
The interaction of UFS and the VM subsystem has been the cause of numerous bugs and hard-to-find problems. Today's blog is an overview of UFS I/O, with particular attention paid to the VM subsystem interaction. The paths taken when a read() system call is initiated are detailed to show the interaction of UFS and the VM subsystem. I am assuming that readers of this blog have some basic Solaris file system knowledge, or at a minimum understand some of the basic Solaris file system terminology.
Basic Solaris VM facts
Solaris virtual memory is demand paged, and globally managed. There is integrated file caching and it is layered to allow VM to describe multiple memory types. The paging vnode cache is the unification of file and memory management by use of a vnode object. 1 page of memory == <vnode, offset> tuple. The UFS file system uses this relationship to implement caching for vnodes. The paging vnode cache provides a set of functions for cache management and I/O for vnodes.
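The idea that a page is named by a <vnode, offset> pair can be pictured with a toy lookup table. The Python sketch below is only an illustration of that naming scheme, not the Solaris implementation: the class name, the 4096-byte page size, and the string used as a vnode identifier are all made up.

```python
PAGE_SIZE = 4096  # assumption: illustrative page size only


class ToyPageCache:
    """Toy cache in which each page is identified by (vnode id, page-aligned offset)."""

    def __init__(self):
        self.pages = {}

    @staticmethod
    def key(vnode_id, offset):
        # Round the byte offset down to the start of its page.
        return (vnode_id, offset - offset % PAGE_SIZE)

    def lookup(self, vnode_id, offset):
        """Return the cached page for this byte offset, or None on a miss."""
        return self.pages.get(self.key(vnode_id, offset))

    def insert(self, vnode_id, offset, data):
        self.pages[self.key(vnode_id, offset)] = data


cache = ToyPageCache()
cache.insert("vnode-A", 8192, b"page contents")
hit = cache.lookup("vnode-A", 8192 + 100)  # same page, different byte offset
miss = cache.lookup("vnode-A", 0)          # this page was never inserted
print(hit is not None, miss is None)
```

Because the key carries the vnode, the same table can serve every file, which is the sense in which the cache unifies file and memory management.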
The paging vnode cache functions are named with a pvn_<xxx> prefix. The source code for this is located at: xxxx. Some of the more important paging vnode functions are listed below, with basic function descriptions. Also shown are pointers to the code so you can get more detailed data about each of these.
Some important paging vnode cache functions:
pvn_read_kluster():
Finds a range of contiguous pages within the supplied address/length that fit within the <vnode, offset> values and that do not already exist.
- Caller should call pagezero() on any part of last page that is not read from disk.
pvn_write_kluster():
Finds dirty pages within the offset and length. Returns a list of locked pages ready to be written.
Caller then sets up write call with pageio_setup().
Write is initiated via a call to bdev_strategy().
Synchronous writes require the caller to call pvn_write_done(). Otherwise io_done() will call this when write is complete.
pvn_vplist_dirty():
Finds all pages in page cache >= offset and pushes these pages.
Will cluster pages with adjacent pages if it can.
What is a seg_map and why do you care?
The seg_map segment maintains mappings of pieces of files into kernel address space. It is only used by file systems and it allows copying of data to or from user to kernel address space. At any given time, the seg_map segment has some portion of the total file system cache mapped into the kernel address space. The seg_map segment driver divides the segment into file system block sized slots.
Some important seg_map functions:
segmap_getmap() && segmap_getmapflt():
Retrieves or creates mapping
getmapflt allows for creation of segment if not found, calls ufs_getpage()
segmap_release():
Releases the mapping for a file segment
segmap_pagecreate():
Creates new pages of memory and slots in the seg_map for a given file
Used for extending files or writing holes to a file
Important in the mapping and getting data from the segmap driver is the fbuf structure. It is defined as follows:
struct fbuf {
	caddr_t fb_addr;	/* pointer to the mapped buffer */
	uint_t fb_count;	/* size of the buffer */
};
This structure is used to get a mapping to part of a file via the segkmap interfaces. It is also used by the pseudo bio functions (shown below) for reading and writing data. fbuf is used by directory reading to get the UFS on-disk contents via a call to blkatoff().
seg_vn and UFS and memory mapped I/O:
Memory mapping allows a file to be mapped into a process's address space. This mapping is done via the VOP_MAP call and the seg_vn memory driver. File pages are read when a fault occurs in the address space. The seg_vn driver enables I/O without process-initiated system calls. I/O is performed in units of pages, upon reference to the pages mapped into the address space. Reads are initiated by a memory access; writes are initiated as the VM subsystem finds dirty pages in the mapped address space.
So, why not use the seg_vn driver for non-mmap'd I/O as well? It could be used for mapping the file into the kernel's address space, but seg_vn is a complex segment driver that manages the mapping of protections, copy-on-write fault handling, shared memory, and so on. This is too heavyweight for what is needed for read and write system calls, so the seg_map driver was developed. Read and write system calls only require a few basic mapping functions since they do not map files into a process's address space. seg_map reduces locking complexity and gives better performance.
Pseudo bio functions:
Solaris has a set of interfaces which are considered buffered I/O interfaces, but which are used to read and write buffers containing directory entries only. These interfaces all use the seg_map driver for mapping to address file data. The functions are fbread(), fbwrite(), fbrelse(), fbdwrite(), fbiwrite(), and fbzero(). Although these are not directly shown in the picture above, they are important enough to be worth mentioning.
A UFS/VM example, read() system call - non mmap'd:
Note: In general UFS caches the pages for write, but will also cache pages for reads if they are frequently reusable.
read()->ufs_read()->rdip():
- Checks for directio [1] enabled, if so tries to bypass page cache
- If cache_read_ahead is set, set appropriate flags for placement of pages on cache list (used in freebehind [2])
- calculate whether we need to free pages(freebehind +) behind our read, this will come in later
- if i_contents (reader) [3] held, drop it to avoid deadlock in ufs_getpage().
- Calls segmap_getmapflt() which transitions to ufs_getpage() since we are forcing a fault via S_READ
ufs_getpage():
If calling thread is thread owning the current i_contents lock no need to acquire the lock. Also checks to see if the vfs_dqrwlock is required.
Checks to see if the file has holes via bmap_has_holes(), this will be important later
- For a read in ufs_getpage() loop through all the pages in the range off, off + len:
- Call ufs_getpage_ra() to initiate an asynchronous read ahead of the current page. This helps us in page_lookup() process later.
- Check if we should initiate a read ahead of the next cluster of bytes, cluster size is determined from the UFS maxcontig [4] value. Read ahead is true if:
seqmode [5] + pageoff + cluster size >= i_nextrio (start of next cluster) && pgoff <= i_nextrio && i_nextrio < current file size
- Call page_lookup() to see if page is in page cache
- if yes, update appropriate pointers, continue
- If no, call ufs_getpage_miss():
- Page is either read from disk or created. It is created, without disk read if we call it with S_CREATE or there is a hole in the file at this offset(not backed by a real disk block) in case of read()
- Calls uiomove() to move data in to pages
- We start freeing pages behind the current read if the i_nextr(next byte offset which was set after reading in the pages) > smallfile offset(32k), because we are reading in sequential mode so we know we won't need them
- Call segmap_release() regardless, if cachemode set to freebehind(SM_FREE|SM_DONTNEED|SM_ASYNC) will put them to the head of the page cache
=====================================================================================
[1] UFS directio will be saved for a later post.
[2] freebehind is always set to 1.
[3] The i_contents lock is a krwlock_t which is part of the ufs inode data structure. It protects most of the inode's contents. See my previous blog posting on UFS locking for more details.
[4] See my previous blog post regarding the use of maxcontig in UFS.
[5] seqmode is determined from the i_nextr field in the current working inode. i_nextr represents the next byte offset for reads. If i_nextr == current offset and we are not creating a page, then we set seqmode == 1.
Just noticed that Solaris has an entry in the Month of Kernel Bugs. While I agree that we have an issue that needs looking at, I also believe that the contributor is making much more of it than it really deserves.
First off, to paraphrase the issue:
If I give you a specially massaged filesystem and can convince someone with the appropriate privilege to mount it, it will crash the system.
I'd hardly call this a "denial of service", let alone exploitable.
First off, in order to mount a ufs filesystem, you need the sys_mount privilege. Solaris currently runs under the concept of "least privilege": a process is given the least amount of privilege that it needs to run. So, in order to exploit this, you need to convince someone with the appropriate level of privilege to mount your filesystem. This would also involve a bit of social engineering, which went unmentioned.
That being said, the system should not panic on this filesystem and I will log a bug to this effect. It is a shame that the contributor did not make the crashdump files available, as it would certainly speed up any analysis.
One other thing that I should add is that anyone who tries to mount an unknown ufs filesystem without at least running "fsck -n" over it probably deserves what they get.
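As a minimal sketch of that precaution, the check can be wrapped in a few lines of Python. The helper names and the device path in the usage note are hypothetical, and treating any non-zero exit status as "do not mount" is a deliberately conservative assumption rather than a reading of the fsck exit-code table.

```python
import subprocess


def fsck_dry_run_argv(raw_device):
    """Build the command line for a read-only UFS check.

    -F ufs selects the UFS-specific fsck; -n answers "no" to every
    question, so the file system is not modified.
    """
    return ["fsck", "-F", "ufs", "-n", raw_device]


def safe_to_mount(raw_device, run=subprocess.run):
    """Return True only if the dry-run check exits with status 0.

    The run parameter exists so the check can be exercised without a real
    device; any non-zero status is treated as a reason not to mount.
    """
    result = run(fsck_dry_run_argv(raw_device))
    return result.returncode == 0
```

On a Solaris host this would be called as `safe_to_mount("/dev/rdsk/c9t9d9s0")`, where the device name is a placeholder for the raw device of the unknown filesystem.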
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...
Disclaimer:
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.
Last modified: March 12, 2019