Softpanorama: May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix

Linux disk subsystem tuning


Even with sufficient memory, most database servers will perform large amounts of disk I/O to bring data records into memory and flush modified data to disk. Therefore, it is important to configure sufficient numbers of disk drives to match the CPU processing power being used.

In general, a minimum of 10 high-speed disk drives is required for each Xeon processor. Optimal configurations can require more than 50 10K-RPM disk drives per Xeon CPU. With most database applications, more drives mean better performance.

The main factors affecting performance include:

The RAID controller's cache size. Depending on storage access patterns, the RAID controller cache may have a major impact on system performance. The cache plays a particularly relevant role for write operations on RAID 5 arrays. Write buffering makes the RAID controller acknowledge the write operation before the write goes to disk. This may positively affect performance in several ways:
- Updates can overwrite previous updates, thereby reducing the number of disk writes.
- By grouping several requests, the disk scheduler may achieve optimal performance.
The choice of RAID level. Disregarding any cost constraint, the best choice is almost always RAID 10. The issue is to understand which activities merit the higher cost of the RAID 10 performance increase:
- In a data warehouse (DW), the typical access pattern to data files is almost 100% reading; therefore the better performance delivered by RAID 10 arrays in write operations is irrelevant and the choice of a RAID 5 level is acceptable.
- On the other hand, in a typical online transaction processing (OLTP) environment, the huge number of write operations makes RAID 10 the best level for data files arrays. Of course, even for the OLTP data files, RAID 5 may be acceptable in case of low concurrency.
- For write operations, RAID 10 arrays are much better than RAID 5 arrays. However, for read operations, the difference between the two levels is minimal.
From this simple rule stem the following recommendations:
- Online redo log files: RAID 1 is strongly recommended. In case of very high performance requirements, RAID 10 may be necessary. However, RAID 10 delivers performance benefits only in the case of quite small stripe unit size.
- Archive redo log files: RAID 1 is recommended. However, the archive redo logs are not as critical for performance as redo log files. Accordingly, it is better to have archive redo logs on RAID 5 arrays than to have redo logs on RAID 5 arrays.
- Temporary segments: RAID 1 (or, even better, RAID 10) is recommended in case of many sort operations, as is typical in data warehouses, for example.
- Data files: RAID 5 is acceptable for data warehouses (because the typical access pattern is reading) and for small databases. Otherwise, RAID 10 is the recommended RAID level.
The RAID array's stripe unit size. Typically, the stripe unit size should be a multiple of the database block size (for example, two times or three times the block size). In addition, consider the average I/O size. Theoretically, the I/O size and the stripe unit size should be identical. However, because block boundaries are not necessarily aligned with stripe units, you should make the stripe unit size at least twice the average I/O size.
The database block size. The database block size is one parameter that significantly affects I/O performance. However, the only way to change this parameter after the database is created is to create a new database and move data to it.
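
The stripe unit rule above can be sketched as a small calculation. This is only an illustration: the helper name stripe_unit_kb and the sample sizes are assumptions, and the floor of two blocks per stripe unit is taken from the "two times or three times" example rather than from any hard requirement.

```shell
# stripe_unit_kb BLOCK_KB AVG_IO_KB   (hypothetical helper)
# Print the smallest multiple of the database block size (in KB) that is
# at least twice the average I/O size (in KB).
stripe_unit_kb() {
    block_kb=$1
    avg_io_kb=$2
    target=$(( avg_io_kb * 2 ))
    # Round the target up to the next whole number of database blocks.
    multiples=$(( (target + block_kb - 1) / block_kb ))
    # Assumption: never go below two blocks per stripe unit.
    if [ "$multiples" -lt 2 ]; then
        multiples=2
    fi
    echo $(( multiples * block_kb ))
}

# Assumed example: 8 KB database blocks and a 20 KB average I/O size.
stripe_unit_kb 8 20
```

The result is a starting point; the stripe unit sizes a given RAID controller actually supports may force rounding to the nearest available value.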

What partition layout to choose?

In the Linux community, the partitioning of a disk subsystem engenders vast discussion. The partition layout is often dictated by application needs, systems management considerations, and personal preference, not performance, so in most cases it will already be given. The only suggestion we want to give here is to use a swap partition. Swap partitions, as opposed to swap files, have a performance benefit because there is no file system overhead. Ideally the swap partition should be on a separate disk drive (preferably solid state). A large swap area can be split into two partitions, each on a different drive.
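
One way to split swap across two drives is to give both partitions the same priority, in which case the kernel allocates swap pages from them in round-robin fashion. The sketch below only prints candidate /etc/fstab lines. The helper name swap_fstab_lines and the device names /dev/sdb1 and /dev/sdc1 are assumptions for illustration.

```shell
# swap_fstab_lines PRIORITY DEVICE...   (hypothetical helper)
# Print one tab-separated /etc/fstab line per swap device, all with the
# same pri= value so the kernel can spread swap pages across them.
swap_fstab_lines() {
    pri=$1
    shift
    for dev in "$@"; do
        printf '%s\tswap\tswap\tpri=%s\t0 0\n' "$dev" "$pri"
    done
}

# Assumed example devices: one swap partition on each of two drives.
swap_fstab_lines 1 /dev/sdb1 /dev/sdc1
```

The same effect can be obtained at run time with swapon -p, passing the same priority for each device.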

What file system to use?

The RHEL 5.6 installer limits the choice of file systems to ext2, ext3 and ext4. It defaults to ext3, which is acceptable in most cases, but we encourage you to consider using ext4. To allow anaconda to manipulate ext4 filesystems, you need to start the 5.6 installer with the "ext4" parameter on the boot command line:

linux ext4

Smaller file systems that have no focus on integrity (for example, a Web server cluster) and systems with a strict need for performance (high-performance computing environments) can benefit from the performance of the ext2 file system. ext2 does not have the overhead of journaling, and while ext3 and ext4 have undergone tremendous improvements, there still is a difference. Also note that ext2 file systems can be upgraded easily to ext3.

On SUSE, ReiserFS can be used for applications that use many small files, such as:

- Mail servers
- NFS servers
- Database servers

or other applications that use synchronous I/O.

When using Ext3 with many files in one directory, consider enabling hashed b-tree directory indexing (the dir_index feature):

# mkfs.ext3 -O dir_index

When using Ext3 with multiple threads appending to files in the same directory, consider turning preallocation on:

# mount -o reservation

You can benefit from using dedicated logging devices:

ReiserFS

mkreiserfs -j /dev/xxx -s 8193 /dev/xxy

reiserfstune --journal-new-device /dev/xxx -s 8193 /dev/xxy

Ext3

mke2fs -O journal_dev /dev/xxx

mke2fs -j -J device=/dev/xxx,size=8193 /dev/xxy

tune2fs -J device=/dev/xxx,size=8193 /dev/xxy

File System Tuning

Split file systems based on data access patterns:

- Keep commit-heavy data away from data that does not have to be synchronous.
- Keep streaming writes and reads on different spindles than random I/O.

Consider disabling atime updates on files and directories

# mount -o noatime,nodiratime

Deadline I/O scheduler (per-request service deadline)

- Caps maximum latency per request
- Maintains good disk throughput
- Best for disk-intensive database applications
- Activated by boot parameter elevator=deadline
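
To check which scheduler a disk is using, 2.6 kernels expose /sys/block/<sdX/hdX>/queue/scheduler, which lists the available schedulers and marks the active one in square brackets. The sketch below only parses such a line. The helper name active_scheduler and the sample line are assumptions for illustration.

```shell
# active_scheduler   (hypothetical helper)
# Read the contents of a queue/scheduler file on stdin and print the
# name shown in square brackets, which is the active scheduler.
# Prints nothing if no name is bracketed.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Assumed sample content of /sys/block/sdX/queue/scheduler:
echo "noop anticipatory deadline [cfq]" | active_scheduler
```

On a live system the same helper can be fed from the real file, for example active_scheduler < /sys/block/sda/queue/scheduler. On kernels that support it, writing a scheduler name to that file as root switches the scheduler for that device without a reboot.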

Block Layer Tunables

Block read ahead buffer

/sys/block/<sdX/hdX>/queue/read_ahead_kb

Default is 128. Increase to 512 for fast storage (SCSI disks or RAID). May speed up streaming reads a lot.

Number of requests

/sys/block/<sdX/hdX>/queue/nr_requests

Default is 128. Increase to 256 with the CFQ scheduler for fast storage. Increases throughput at a minor latency expense.
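
Both tunables are plain files, so applying the suggested values is a matter of writing to them as root. The following is a minimal sketch, not a definitive script: the helper name tune_queue and its optional second argument (an alternative sysfs root, used here so the demo runs against a scratch directory instead of the real /sys) are assumptions.

```shell
# tune_queue DEVICE [SYSFS_ROOT]   (hypothetical helper)
# Write the values suggested above into the device's queue directory
# and report what is now stored there.
tune_queue() {
    dev=$1
    root=${2:-/sys}
    q="$root/block/$dev/queue"
    if [ ! -d "$q" ]; then
        echo "no such queue directory: $q" >&2
        return 1
    fi
    echo 512 > "$q/read_ahead_kb" || return 1
    echo 256 > "$q/nr_requests" || return 1
    echo "$dev: read_ahead_kb=$(cat "$q/read_ahead_kb") nr_requests=$(cat "$q/nr_requests")"
}

# Dry run against a scratch directory that mimics the sysfs layout.
demo=$(mktemp -d)
mkdir -p "$demo/block/sdX/queue"
echo 128 > "$demo/block/sdX/queue/read_ahead_kb"
echo 128 > "$demo/block/sdX/queue/nr_requests"
tune_queue sdX "$demo"
rm -rf "$demo"
```

Values written to sysfs do not survive a reboot, so on a real system they would typically be reapplied from a boot script.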

Notes:

For fast disk subsystems, it is desirable to use large flushes of dirty memory pages.
The value stored in /proc/sys/vm/dirty_background_ratio defines at what percentage of main memory the pdflush daemon should write data out to the disk.

If larger flushes are desired then increasing the default value of 10% to a larger value will cause less frequent flushes.

For example, the value can be changed to 25 as follows:

# sysctl -w vm.dirty_background_ratio=25

Another related setting in the virtual memory subsystem is the ratio at which dirty pages created by application disk writes will be flushed out to disk.
The default value 10 means that data will be written into system memory until the file system cache has a size of 10% of the server’s RAM.

The ratio at which dirty pages are written to disk can be changed, for example to 20% of system memory, as follows:

# sysctl -w vm.dirty_ratio=20
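
To get a feel for what these percentages mean in absolute terms, the ratio can be applied to the amount of RAM. This is a rough sketch: the helper name dirty_threshold_kb and the 16 GB example are assumptions, and the kernel's own accounting may base the percentage on a somewhat smaller figure than total RAM.

```shell
# dirty_threshold_kb MEMTOTAL_KB RATIO   (hypothetical helper)
# Approximate amount of dirty page data, in kB, that a given ratio
# corresponds to, using integer arithmetic.
dirty_threshold_kb() {
    echo $(( $1 * $2 / 100 ))
}

mem_kb=16777216   # assumed example: a server with 16 GB of RAM
for ratio in 10 20 25; do
    echo "ratio $ratio -> about $(dirty_threshold_kb "$mem_kb" "$ratio") kB of dirty pages"
done

# Settings made with sysctl -w are lost at reboot. To keep them, the
# same keys can be added to /etc/sysctl.conf, for example:
#   vm.dirty_background_ratio = 25
#   vm.dirty_ratio = 20
```

On a live system the MemTotal value from /proc/meminfo can be passed in place of the assumed figure.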


Old News ;-)

sg245287

Disk bottlenecks

The disk subsystem is often the most important aspect of server performance, and it is usually the most common bottleneck. However, problems can be hidden by other factors, such as lack of memory. Applications are considered to be I/O-bound when CPU cycles are wasted simply waiting for I/O tasks to finish.

The most common disk bottleneck is having too few disks. Most disk configurations are based on capacity requirements, not performance. The least expensive solution is to purchase the smallest number of the largest-capacity disks possible. However, this places more user data on each disk, causing greater I/O rates to the physical disk and allowing disk bottlenecks to occur.

The second most common problem is having too many logical disks on the same array, which increases seek time and greatly lowers performance.

We discuss the disk subsystem in greater detail in 15.9, "Tuning the file system" on page 480.

As with the other components of the Linux system we discussed, disk metrics are important when identifying performance bottlenecks. Some of the values that may point to a disk bottleneck are:

Iowait -- This is the time the CPU spends waiting for an I/O to occur.

Average queue length -- This is the number of outstanding I/O requests. In general, a value higher than 2 to 3 means there may be a disk I/O bottleneck. This applies to systems with a single disk. In disk arrays, however, the queue length may be different and does not necessarily indicate a Linux bottleneck; it may be under the control of the I/O controller using cache or other methods.

Average wait -- This is a measurement of the average time in ms that it takes for an I/O request to be serviced. The wait time consists of the actual I/O operation and the time it waits in the I/O queue.

Transfers per second -- This refers to the number of I/O operations per second (reads and writes).

Blocks read/write per second -- This refers to the amount of data read and written per second, expressed in blocks of 512 bytes as of kernel 2.6.
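
Several of these values can be derived from /proc/diskstats, which on 2.6 kernels holds per-device counters accumulated since boot. The sketch below is an illustration only: the helper name disk_summary and the sample line are assumptions, and because the counters are cumulative the result is a since-boot average, not the per-interval figure reported by tools such as iostat.

```shell
# disk_summary DEVICE   (hypothetical helper)
# Read /proc/diskstats-style lines on stdin and summarize one device.
# Whole-disk fields used: 3 = name, 4 = reads completed, 6 = sectors read,
# 7 = ms spent reading, 8 = writes completed, 10 = sectors written,
# 11 = ms spent writing, 12 = I/Os currently in progress.
disk_summary() {
    awk -v dev="$1" '$3 == dev {
        ios = $4 + $8
        wait_ms = (ios > 0) ? ($7 + $11) / ios : 0
        printf "%s ios=%d sectors_read=%d sectors_written=%d avg_wait_ms=%.2f in_flight=%d\n",
               $3, ios, $6, $10, wait_ms, $12
    }'
}

# Assumed sample line in /proc/diskstats format:
echo "   8       0 sda 1000 50 80000 4000 3000 200 240000 16000 2 9000 21000" | disk_summary sda
```

On a live system the helper can be run as disk_summary sda < /proc/diskstats. Taking two snapshots a few seconds apart and subtracting the counters gives per-interval rates.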
