In its simplest form, a performance monitor (or system monitor) is a utility that tracks running processes and gives a real-time graphical display of resource utilization. The Unix top utility is a classic example of such a tool. It can help you plan upgrades, track processes that need to be optimized, monitor the results of tuning and configuration changes, and understand a workload and its effect on resource usage in order to identify bottlenecks.
|
Bottlenecks can occur on practically any component of the server, with the typical suspects being I/O, memory, and CPU. A bottleneck can be caused by a malfunctioning resource, by the system not having enough of a resource, or by a program that dominates a particular resource.
Sun's Solaris blueprints used to contain several very good materials about performance tuning. I especially recommend the blueprints written by Adrian Cockcroft. These Sun blueprints have become pretty difficult to find; Google is your friend. An alternative is his (also old, 1994) book Sun Performance and Tuning: Sparc & Solaris, which has some interesting information and costs $0.01 on Amazon.
This disappearance of old Sun blueprints is typical for acquisitions, but at the same time it is pretty sad (Oracle would lose nothing by preserving them for viewing), and I do not know of a good collection that still archives them. Here is a long quote from his Performance Monitoring Short Cuts, which is still available:
- Performance Monitoring Short Cuts
- Introduction
- Quick start - for the impatient
- What data to collect and why
- Concepts for identification of performance degradation root cause
- Using the SE toolkit
- Baselining and data collection
- Source code for scripts
- Using TNF on Solaris 9 - trace probes
- Cookbook
- TOC for Tech Tips.
Introduction
Purpose
This article describes what data points are needed to observe performance, why those exact points were chosen, and how to collect and store the data.
The guideline is to be minimalistic: get the data you really need, nothing more, and keep the data collection tool's footprint as small as possible, so that the tool won't become one of the blips on the radar.
Both the data that is to be collected and the way it is collected are discussed and described. Including the design discussions here makes it much easier to explain the what, why, and how of each aspect of the observation. After this the reader should be able to:
- Gather the relevant Performance data from the System.
- Have a starting point for interpreting the observations.
- Understand why the data being collected are relevant.
- Contribute to the performance discussion.
Scope
This article
- is only about observation; it does not identify the causes of problems
- does not describe how to increase performance
- addresses only the Solaris operating system
The actual data collection scripts are not included.
Acronyms
- APIC: Advanced Programmable Interrupt Controller.
- CPC: CPU Performance Counter DTrace provider.
- HAT: Hardware Address Translation. (Sun06,p3-33)
- HTT: HyperThreading Technology (HTT) is Intel's trademark for their implementation of the SMT.
- L1D: (intel) Level 1 Data cache.
- L1I: (intel) Level 1 Instruction cache.
- L2: Level 2 cache
- Hyper Threading:
- Retired, Instructions:
- SIMD: single-instruction multiple-data
- SMT: simultaneous multithreading technology.
- TLB: Translation Lookaside Buffer
- TNF: Trace Normal Form
- TTE: Translation Table Entry. (McD07,p590)
References
- MySQL Scalability on Nehalem systems (Sun Fire X4270)
- Improving Application Efficiency Through Chip Multi-Threading
- Multi Processing and Multi Threading
- UltraSPARC Processors Documentation
- [BluePrints:Performance+Counters+on+Solaris]
- Intel® 64 and IA-32 Architectures Software Developer's Manual
- Solaris Internals
- CMT Utilization
- http://wikis.sun.com/display/SunStudio/Sun+Studio+Technical+Articles+Collection
- http://prefetch.net/articles/dtracecookbook.html
- TTCP
- [http://developers.sun.com/solaris/articles/tnf.html]
- Coc98_Sun Performance and Tuning: Java and the Internet_, Second Edition, Adrian Cockcroft; Richard Pettit. ISBN-13: 978-0-13-095249-3
- McD06: Solaris Performance and tools ISBN-13: 978-0-13-156819-8
- McD07: Solaris Internals ISBN: 0-13-148209-2
- Sun06: Solaris 10 Operating System Internals SI365-S10 Student Guide
- Read Me First!: A Style Guide for the Computer Industry, Second Edition by Sun Technical Publications ISBN-13: 978-0-13-142899-7
- Resource Management
- [
Configuring and Tuning Databases on the Solaris Platform|http://my.safaribooksonline.com/0130834173]- Web Performance Tuning, 2nd Edition
- System Performance Tuning, 2nd Edition
- http://www.setoolkit.org
- http://oss.oetiker.ch/rrdtool/
- http://opensolaris.org/os/community/dtrace/dtracetoolkit/
- OpenSolaris Project: Zone Statistics
- Solaris Kernel Statistics - Accessing libkstat with C
- Solaris Kernel Statistics, Part II Accessing libkstat with Shell Script
Who should use this article
- System administrator
- Performance agent
How this article is organized
- Quick start - for the impatient: This section describes what tools are needed and where to get them.
- What data to collect and why: Defines the cardinal resources and what sources can affect them. Then the data points needed for each cardinal resource are defined and described.
- Drill down - taking potshots at messengers:
- Tools of the chase: What tools are good for chasing down issues.
Related material
Quick start - for the impatient
Getting going with data collection
To get started, the following needs to be installed on the system:
- SE Toolkit: Tools specifically written to retrieve the kernel statistics data.
- You don't have to be root to run this tool.
- http://www.setoolkit.org/cms/
- http://sourceforge.net/projects/setoolkit/
- RRDtool: The database for storing the collected data. The tool also supports graph generation. It's a round-robin database that always stays the same size.
- Epoch converter: For converting between Human readable time and epoch time. For calculating the start epoch for the RRDtool.
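A separate epoch-converter tool isn't strictly necessary; any scripting language can do the conversion. A minimal sketch in Python (920804400 is the start time used in the RRDtool tutorial's create example, which corresponds to 1999-03-07 12:00 in the tutorial's UTC+1 timezone, i.e. 11:00 UTC):

```python
from datetime import datetime, timezone

def to_epoch(year, month, day, hour=0, minute=0, tz=timezone.utc):
    """Convert a human-readable time to Unix epoch seconds."""
    return int(datetime(year, month, day, hour, minute, tzinfo=tz).timestamp())

def from_epoch(seconds, tz=timezone.utc):
    """Convert epoch seconds back to a human-readable ISO string."""
    return datetime.fromtimestamp(seconds, tz=tz).isoformat()

print(from_epoch(920804400))  # 1999-03-07T11:00:00+00:00
print(to_epoch(1999, 3, 7, 11))  # 920804400
```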
Installing the tools
SE Toolkit from sourceforge.net
RICHPse-xxx.pkg.gz is a Sun package file; it installs into /opt/RICHPse.
The RICHPse package depends on the C pre-processor, cpp.
For Solaris 9 the SUNWcpp package must be installed for the SE toolkit to run.
Verify the SE Toolkit installation
- /opt/RICHPse/bin/se /opt/RICHPse/examples/cpus.se
- /opt/RICHPse/bin/se /opt/RICHPse/examples/net.se
- /opt/RICHPse/bin/se /opt/RICHPse/examples/disks.se
RRDtool from Sunfreeware.com
- freetype
- libart_lgpl
- libgcc
- libiconv
- libpng
- rrdtool
- zlib
Verify the RRDtool installation
- rrdtool create test.rrd --start 920804400 DS:speed:COUNTER:600:U:U RRA:AVERAGE:0.5:1:24 RRA:AVERAGE:0.5:6:10
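To sanity-check what that create line allocates, the DS and RRA fields can be decoded by hand. A small sketch of the arithmetic (assuming rrdtool's default 300-second step, since no --step option was given):

```python
# Decode the example "rrdtool create" line from the RRDtool tutorial.
STEP = 300  # seconds; rrdtool's default --step

def rra_span(steps_per_row, rows, step=STEP):
    """Return the total time span (in seconds) covered by one RRA."""
    return steps_per_row * rows * step

# DS:speed:COUNTER:600:U:U  -> one counter data source named "speed",
#   600 s heartbeat, no min/max bounds.
# RRA:AVERAGE:0.5:1:24 -> 24 rows of 1-step (5 min) averages = 2 hours
# RRA:AVERAGE:0.5:6:10 -> 10 rows of 6-step (30 min) averages = 5 hours
print(rra_span(1, 24) // 3600)  # 2
print(rra_span(6, 10) // 3600)  # 5
```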
from http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html
What data to collect and why
Performance is affected by: tuning/configuration + workload + fluctuations (Coc98)
Cardinal resources available
- CPU
- Memory - this is a kind of near storage. This includes the various level 1 through 3 cache.
- NIC
- Storage
Sources of performance degradation
- cache thrashing
- Getting data from swapped pages(at least when they are on disk)
- On disk data.
- Context switching
- Interrupts
- CPU switching
- Cache thrashing on HTT CPUs
- Starting too many new processes per time unit.
- Using too much memory and swap; actually needing too much data in physical memory at the same time.
Key guidelines
- Get only the data that is needed, as a minimum
- Don't make the data collection part of the problem.
What data to collect
cpu
- csw
- What: context switching. Both Voluntary and Involuntary(McD06,p21)
- Why:
- How: kstat$cpusys.pswitch
- icsw
- What: Involuntary context switching.
- If you get an involuntary context switch, then the thread that is running on that cpu is stopped (pinned) and a new thread runs in its place (Blog: Will a faster cpu make my application faster?)
- Why: cause pinning
- Limit: When this number increases past 500, the system is under a heavy load according to Sun Java System Portal Server 6 2005Q4 Deployment Planning Guide Appendix B
- How: kstat$cpusys.inv_swtch
- xcal
- What: Cross calls between cpu's
- Why: Expensive.
- How: kstat$cpusys.xcalls
- intr
- What: Device interrupts. This is both HW and SW interrupts(McD06,p22)
- Why:
- How: kstat$cpusys.intr
- migr Migration is costly
- What: Moving a process from one CPU to the other.
- Or is this only a thread?
- Why: Dirtying the lookup table, thereby reducing the cache efficiency.
- How: kstat$cpusys.migr
- smtx
- What: the number of times the kernel failed to obtain a mutex immediately. (McD06,p22) Seems to indicate that busy spins aren't counted. It also seems that spinning time is counted toward 'sys'.
- Why: If the number is more than about 200 per CPU, then usually system time begins to climb.(Coc98,ch10)
- How: kstat$cpusys.
- sysexec
- What: number of new processes being started.
- Why: It is resource costly to start a new process.
- How: kstat$cpusys.sysexec
- idle
- What:
- Why:
- How: kstat$cpusys.cpu[CPU_IDLE]
- user
- What:
- Why:
- How: kstat$cpusys.cpu[CPU_USER]
- sys
- What:
- Why:
- How: kstat$cpusys.cpu[CPU_KERNEL]
- run que
- What: this is the total run queue, the number of processes ready for the CPU but not on a CPU yet. Divide by ncpus for comparison. If the run queue is lower than the number of CPUs, we are good.
- Why:
- How: kstat$sysinfo.runque
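All the kstat counters above (pswitch, inv_swtch, xcalls, sysexec, and so on) are cumulative since boot, so a collector has to snapshot them at an interval and report per-second deltas, which is what vmstat and mpstat do internally. A minimal sketch, using hypothetical snapshot values:

```python
def rates(prev, curr, interval):
    """Convert two snapshots of cumulative kstat counters into per-second rates."""
    return {name: (curr[name] - prev[name]) / interval for name in curr}

# Hypothetical snapshots taken 5 seconds apart.
t0 = {"pswitch": 120_000, "inv_swtch": 1_500, "xcalls": 9_000, "sysexec": 40}
t1 = {"pswitch": 121_600, "inv_swtch": 1_550, "xcalls": 9_400, "sysexec": 42}
print(rates(t0, t1, 5))
# {'pswitch': 320.0, 'inv_swtch': 10.0, 'xcalls': 80.0, 'sysexec': 0.4}
```

The first sample after boot should be discarded, since its implied interval is the whole uptime.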
Memory
- scan
- What: pages examined by pageout daemon
- How many pages are scanned; the scanner starts at slowscan when free memory drops to lotsfree (Sun06,p5-6)
- Why: When this goes above 0, the kernel is looking for pages to page out.
- So does that mean that no swapout happens before scan goes above 0?
- How: kstat$cpuvm.scan
- rev
- What: revolutions of the page daemon hand
- Why:
- How: kstat$cpuvm.
- as_fault minor page faults via as_fault()
- What: (Sun06,p5-29). Address Space fault(Sun06,p4-12)
- Will this counter also go up on a segmentation fault? It seems so, according to (McD07,p470).
- It seems that this count goes up on segmentation faults, minor page faults, and major page faults, as per (McD06,p277).
- Why:
- How: kstat$cpuvm.
- hat_fault
- What: minor page faults via hat_fault
- Why:
- How: kstat$cpuvm.
- It looks like minor faults are calculated by mf = as_fault - maj_fault - prot_fault
- Though it seems prot_fault is also fired when attempting to write to a read-only page (McD07,p474)
- A minor fault is caused by an address space or hardware address translation fault that can be resolved without performing a page-in(Coc98, ch13).
- maj_fault
- What: major page faults: Attempted to access a virtual memory address, where the page does not exist in physical memory(McD07,p473)
- Why:
- How: kstat$cpuvm.maj_fault
- de deficit (Coc98, ch13).
- What:
- Why:
- How: kstat$cpuvm.
- freemem
- What: Free memory pages.
- Please note that pages by default have different sizes on SPARC (8K) and x86 (4K)
- Why:
- How: kstat$vminfo.freemem
- swap_free
- What: Unallocated swap pages.
- Why:
- How: kstat$vminfo.
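Two of the calculations implied above are easy to get wrong: converting freemem (reported in pages, whose default size differs between SPARC and x86) into bytes, and deriving minor faults as mf = as_fault - maj_fault - prot_fault. A sketch using hypothetical counter values:

```python
PAGESIZE = {"sparc": 8192, "x86": 4096}  # default base page sizes in bytes

def freemem_bytes(freemem_pages, arch):
    """Convert the freemem page count into bytes for the given architecture."""
    return freemem_pages * PAGESIZE[arch]

def minor_faults(as_fault, maj_fault, prot_fault):
    """Minor faults as derived in the text: mf = as_fault - maj_fault - prot_fault."""
    return as_fault - maj_fault - prot_fault

# The same page count means different amounts of memory per architecture:
# 10,000 free pages is ~78 MB on SPARC but only ~39 MB on x86.
print(freemem_bytes(10_000, "sparc") // (1024 * 1024))  # 78
print(freemem_bytes(10_000, "x86") // (1024 * 1024))    # 39
print(minor_faults(as_fault=5_000, maj_fault=120, prot_fault=380))  # 4500
```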
NIC
Both packets and octets must be sampled, since you might have:
- Few big packets
- Lots of small packets
- ipackets
- What: Packets in
- Why:
- How: nic_itterator.ipackets
- ierrors
- What: Input errors
- Why:
- How: nic_itterator.ierrors
- ioctets
- What: Octets in
- Why:
- How: nic_itterator.ioctets
- opackets
- What: Packets out
- Why:
- How: nic_itterator.opackets
- oerrors
- What: Output errors
- Why:
- How: nic_itterator.oerrors
- ooctets
- What: Octets out
- Why:
- How: nic_itterator.ooctets
- collisions
- What: collisions
- Why:
- How: nic_itterator.collisions
- defer
- What: Ethernet metric that counts the rate at which output packets are delayed before transmission(Coc98,ch16).
- Why:
- How: nic_itterator.defer
- This might not always be supported by the NIC driver, in which case it always returns 0. Please see /opt/RICHPse/include/netif.se to check whether it's set to zero or actually loaded with data.
- nocanput
- What: packets discarded for lack of IP-level buffering on input. This can cause a TCP connection to time out and retransmit the packet from the other end (Coc98,ch16).
- Why:
- How: nic_itterator.nocanput
- Not always supported by the driver.
- norcvbuf, noxmtbuf buffer allocation failure counts
- What:
- Why:
- How: nic_itterator.
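The note at the top of this section, that packets and octets must be read together, comes down to computing the average packet size over the sample interval: the same byte rate can mean a few big packets or lots of small ones, with very different per-packet overhead. A sketch using hypothetical deltas:

```python
def avg_packet_size(octets_delta, packets_delta):
    """Average bytes per packet over a sample interval; None if no traffic."""
    if packets_delta == 0:
        return None
    return octets_delta / packets_delta

# Same byte volume over the interval, very different workloads:
print(avg_packet_size(1_500_000, 1_000))   # 1500.0 -> few big packets
print(avg_packet_size(1_500_000, 20_000))  # 75.0   -> lots of small packets
```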
Disk
- nread
- What: bytes read
- Why:
- How: kstat$disk.nread
- nwritten
- What: bytes written
- Why:
- How: kstat$disk.nwritten
- reads
- What: read operations
- Why:
- How: kstat$disk.reads
- writes
- What: write operations
- Why:
- How: kstat$disk.writes
- wcnt
- What: elements in wait state
- Why:
- How: kstat$disk.wcnt
- rcnt
- What: elements in run state
- Why:
- How: kstat$disk.rcnt
- wtime
- What: cumulative wait (pre-service) time (sys/kstat.h).
- Why: Just the count of how many elements are in the queue is not always enough. As with the network, there may be many elements in the queue that are handled quickly, or few elements that take a long time to handle. I would assume the maximum time observed is the time between samples; so if sampling every 5 seconds, the max is 5 s.
- How: kstat$disk.wtime
- rtime
- What: cumulative run (service) time (sys/kstat.h).
- Why: As for 'wtime'
- How: kstat$disk.rtime
There is also a limit to the amount of unflushed data that can be written to a file. This limitation is implemented by the UFS write throttle algorithm, which tries to prevent too much memory from being consumed by pending write data. For each file, between 256 Kbytes and 384 Kbytes of data can be pending. When there are less than 256 Kbytes (the low-water mark ufs_LW), it is left to fsflush to write the data. If there are between 256 Kbytes and 384 Kbytes (the high-water mark ufs_HW), writes are scheduled to flush the data to disk. If more than 384 Kbytes are pending, then when the process attempts to write more, it is suspended until the amount of pending data drops below the low-water mark. So, at high data rates, writes change from being asynchronous to synchronous, and this change slows down applications. The limitation is per-process, per-file. (Coc98, ch8)
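The UFS write-throttle behavior just described can be summarized as a three-state classifier on the amount of pending data per process, per file. A sketch (the 256/384 Kbyte thresholds are the ufs_LW/ufs_HW defaults quoted above):

```python
# Thresholds from the UFS write throttle described above (per process, per file).
UFS_LW = 256 * 1024  # low-water mark: below this, fsflush writes the data
UFS_HW = 384 * 1024  # high-water mark: above this, the writer is suspended

def write_throttle_state(pending_bytes):
    """Classify a writer by its pending (unflushed) bytes for one file."""
    if pending_bytes < UFS_LW:
        return "async: left to fsflush"
    if pending_bytes <= UFS_HW:
        return "flush scheduled"
    return "writer suspended until pending < low-water mark"

print(write_throttle_state(100 * 1024))  # async: left to fsflush
print(write_throttle_state(300 * 1024))  # flush scheduled
print(write_throttle_state(500 * 1024))  # writer suspended until pending < low-water mark
```

This is why, at high data rates, writes change from asynchronous to synchronous and the application slows down.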
The "wait service time" is actually the time spent in the "wait" queue(Coc98,ch3).
This measures a two-stage queue:
- wait queue, in the device driver
- active queue, in the device itself.
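Since wtime and rtime accumulate in nanoseconds (sys/kstat.h), the usual derived metrics are percent-busy and average service time over a sample interval. A sketch using hypothetical deltas; tools like iostat derive their numbers in the same general way, though the exact formulas here are illustrative:

```python
def disk_utilization(rtime_delta_ns, interval_s):
    """Percent of the interval the device had at least one active request."""
    return 100.0 * rtime_delta_ns / (interval_s * 1_000_000_000)

def avg_service_time_ms(rtime_delta_ns, ops_delta):
    """Average time an operation spent in the active queue, in milliseconds."""
    if ops_delta == 0:
        return 0.0
    return rtime_delta_ns / ops_delta / 1_000_000

# Hypothetical 5 s sample: device active for 2.5 s servicing 500 reads+writes.
print(disk_utilization(2_500_000_000, 5))       # 50.0 (percent busy)
print(avg_service_time_ms(2_500_000_000, 500))  # 5.0 (ms per operation)
```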
Concepts for identification of performance degradation root cause
This section describes ways to get from indicators to the root cause of the problem.
Tools of the chase
- analyzer(1): Sun Studio Performance Analyzer
- cpustat - Display CPU performance counters
- cputrack - Like cpustat, but tracks a single process.
- DTrace toolkit: http://opensolaris.org/os/community/dtrace/dtracetoolkit/
- hotkernel identifies which function is on the CPU the most
- execsnoop lists processes being started.
- Cpu/intoncpu.d interrupt on-cpu usage.
- Cpu/xcallsbypid.d CPU cross calls by PID
- intrstat interrupt statistics.
- kstat Kernel statistics. The mother lode.
- lockstat kernel lock and profiling statistics
- mpstat per-processor or per-processor-set statistics
- poolstat active pool statistics
- psradm change processor operational status. Disable/enable cores etc.
- psradm -f 5 disable cpu5
- psradm -n 5 enable cpu5
- psrinfo information about processors
- psrset manage processor sets
- vmstat (virtual) memory statistics
cpustat
- You need to be root to run this.
- The events given to '-c' are CPU specific.
- cpustat -c inst_queue_write_cycles,inst_queue_writes
time cpu event      pic0     pic1
5.000   3  tick  7508400 25504987
5.001   4  tick  4982002 17097458
5.001   5  tick  8303888 28502794
5.001   7  tick  8092210 28807695
5.002   6  tick 10630526 36690168
kstat
- kstat -p -m cpu_stat -i0 -s 'intr*'
lockstat
According to Coc98, ch10, using lockstat incurs some overhead, at least on Solaris 2.6.
mpstat
- mpstat 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0   11   1  319 2830  756 304   12    0    9   0   360   1   1  0  98
  0    6   0    1 2739  726 286    3    0    2   0   323   0   1  0  99
  0    0   0    6 2733  722 291    5    0    4   0   318   0   1  0  99
  0    0   0    3 2725  713 286    3    0    3   0   330   0   1  0  99
- (man mpstat)
- minf minor faults
- mjf major faults
- xcal inter-processor cross-calls
- intr interrupts
- ithr interrupts as threads (not counting clock interrupt)
- csw context switches
- icsw involuntary context switches
- migr thread migrations (to another processor)
- smtx spins on mutexes (lock not acquired on first try)
- srw spins on readers/writer locks (lock not acquired on first try)
- syscl system calls
- usr percent user time
- sys percent system time
- wt always zero
- idl percent idle time
- sze number of processors in the requested processor set
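When watching a column such as smtx or xcal over time, it can help to parse mpstat's output programmatically rather than eyeball it. A minimal sketch that pairs the header with one data row (the sample values are taken from the mpstat output earlier in this section):

```python
def parse_mpstat(header_line, data_line):
    """Pair an mpstat header with one data row, returning a {column: value} dict."""
    cols = header_line.split()
    vals = [int(v) for v in data_line.split()]
    return dict(zip(cols, vals))

header = "CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl"
row    = "0     11   1  319 2830  756 304   12    0    9   0   360   1   1  0  98"
stats = parse_mpstat(header, row)
print(stats["xcal"], stats["smtx"], stats["idl"])  # 319 9 98
```

From here it is a short step to alerting when, say, smtx per CPU exceeds the ~200 threshold mentioned earlier.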
psrset
- psrset -q
- psrset -e <prset_id> <command>
- psrset -Q 1 | grep -v "lwp id"
- get list of tasks in specific processor set
psrset -Q 1 | grep -v "lwp id" | awk '{print $3}' | tr -d ':' | xargs -n1 ps -fZp | grep -v CMD
vmstat
vmstat 5
kthr      memory            page            disk          faults      cpu
r b w   swap   free  re mf pi po fr de sr cd f0 s0 --   in  sy  cs us sy id
0 0 0 1059032 75600  21 57 28  0  1  0 44  6 -0 -0  0  241 467 236  3 18 79
0 0 0 1059620 66792   1  9  0  0  0  0  0  0  0  0  0  224 145  90  1 13 86
0 0 0 1059540 66708   0  2  0  0  0  0  0  0  0  0  0  227  95  92  1 15 83
0 0 0 1059540 66708   0  1  0  0  0  0  0  0  0  0  0  240 124 105  2 18 80
- The first line is a summary since the system was started.
- r the number of kernel threads in run queue
- b the number of blocked kernel threads that are waiting for resources (I/O, paging, and so forth)
- w the number of swapped out light-weight processes (LWPs) that are waiting for processing resources to finish.
- swap available swap space (Kbytes)
- free size of the free list (Kbytes)
- re page reclaims
- mf minor faults
- pi kilobytes paged in
- po kilobytes paged out
- fr kilobytes freed
- de anticipated short-term memory shortfall (Kbytes)
- sr page scan rate
- Disk the number of disk operations per second (one column per device)
- cd CD drive
- f0 Floppy?
- s0 SCSI disk0
- - -
- Faults
- in interrupts
- Is this the same as intr in mpstat?
- sy system calls
- cs Context switches
- us user time
- sy system time
- id idle time
CPU related investigations
troubleshooting sysexec
How many processes have been started.
- cd /opt/DTraceToolkit-0.99/Proc
- ./shortlived.d
- Let it run for 10 to 20 seconds to get a fair sample.
- The 10 to 20 seconds is arbitrarily selected.
- Find out what each of the processes listed in the PPID section is.
- Find out why each of these processes has these short-lived children, and whether that is OK.
example of sysexec troubleshooting.
short lived processes:   11.778 secs
total sample duration:   11.076 secs

Total time by process name,
        rrdtool         2150 ms
        exit_this.sh    7631 ms

Total time by PPID,
        387             2150 ms
        6535            7631 ms
intr
- cd /opt/DTraceToolkit-0.99/Cpu
- ./inttimes.d
- Let it run for about 15 seconds.
- This will give a list of how much time is spent servicing interrupts from all the sources.
output from Cpu DTrace script
Tracing... Hit Ctrl-C to end.
^C
DEVICE      TIME (ns)
igb2            46248
ahci0           47568
uhci1           52938
uhci0           56073
igb3           100567
uhci2          101641
uhci4          102279
ehci0          116884
uhci5          121902
uhci3          163313
mpt0           469179
ehci1          560201
igb0          1068298
Who is the interrupt culprit
- mpstat 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0    0 2679  663 290    8    0   13   0   239   0   1  0  99
  1    1   0    0   19    0 316   30    0    0   0   343   0   0  0 100
  2    0   0    0   16    0  70    0    0    0   0    65   0   0  0 100
  3   47   0   21   36    7 365    1   50   12   0   470   0   0  0  99
  4   76   0   47   22    6 400    4   73   12   0   515   0   0  0  99
  5   64   0   41   30    4 355    7   58    9   0   408   0   0  0  99
  6   94   0   29   47   17 434    5   58   12   0   471   0   0  0  99
  7   85   0    9   26    0 298    3   59    4   0   317   0   0  0 100
- ./intbycpu.d
- It was run for about 5 sec.
CPU  INTERRUPTS
  7         127
  1         130
  2         132
  6         247
  3         280
  5         301
  4         844
  0        6078
- ./intrtop.d
CPU# PID CMD   Interrupts
   1   0 sched        467
   2   0 sched        467
   6   0 sched        510
   7   0 sched        552
   5   0 sched        621
   3   0 sched        671
   4   0 sched       1869
   0   0 sched      11314
smtx
If you see high levels of mutex contention, you need to identify both the locks that are being contended on and the component of the workload that is causing the contention, for example, system calls or a network protocol stack(Coc98, ch10).
- If smtx increases sharply (for example, from 50 to 500), it is a sign of a system resource bottleneck (e.g., network or disk), according to the Sun Java System Portal Server 7 Deployment Planning Guide.
Possibly use lockstat.
- What locks are being used.
- Use lockstat
- Who incurs the locks?
- Possibly use dtrace
The kernel summit was two weeks ago, and at the end of that I got one of the new 80GB solid state disks from Intel. Since then, I've been wanting to talk to people about it because I'm so impressed with it, but at the same time I don't much like using the kernel mailing list as some kind of odd public publishing place that isn't really kernel-related, so since I'm testing this whole blogging thing, I might as well vent about it here.
That thing absolutely rocks.
I've been impressed by Intel before (Core 2), but they've had their share of total mistakes and idiotic screw-ups too (Itanic), but the things Intel tends to have done well are the things where they do incremental improvements. So it's a nice thing to be able to say that they can do new things very well too. And while I often tend to get early access to technology, seldom have I looked forward to it so much, and seldom have things lived up to my expectations so well.
In fact, I can't recall the last time that a new tech toy I got made such a dramatic difference in performance and just plain usability of a machine of mine.
So what's so special about that Intel SSD, you ask? Sure, it gets up to 250MB/s reads and 70MB/s writes, but fancy disk arrays can certainly do as well or better. Why am I not gushing about some nice NAS box? I didn't even put the thing into a laptop, after all, it's actually in Tove's Mac Mini (running Linux, in case anybody was confused ;), so a RAID NAS box would certainly have been a lot bigger and probably have more features.
But no, forget about the throughput figures. Others can match - or at least come close - to the throughput, but what that Intel SSD does so well is random reads and writes. You can do small random accesses to it and still get great performance, and quite frankly, that's the whole point of not having some stupid mechanical latencies as far as I'm concerned.
And the sad part is that other SSD's generally absolutely suck when it comes to especially random write performance. And small random writes is what you get when you update various filesystem meta-data on any normal filesystem, so it really does matter. For example, a vendor who shall remain nameless has an SSD disk out there that they were also hawking at the Kernel Summit, and while they get fine throughput (something like 50+MB/s on big contiguous writes), they benchmark a pitiful 10 (yes, that's ten, as in "how many fingers do you have") small random writes per second. That is slower than a rotational disk.
In contrast, the Intel SSD does about 8,500 4kB random writes per second. Yeah, that's over eight thousand IOps on random write accesses with a relevant block size, rather than some silly and unrealistic contiguous write test. That's what I call solid-state media.
The whole thing just rocks. Everything performs well. You can put that disk in a machine, and suddenly you almost don't even need to care whether things were in your page cache or not. Firefox starts up pretty much as snappily in the cold-cache case as it does hot-cache. You can do package installation and big untars, and you don't even notice it, because your desktop doesn't get laggy or anything.
So here's the deal: right now, don't buy any other SSD than the Intel ones, because as far as I can tell, all the other ones are pretty much inferior to the much cheaper traditional disks, unless you never do any writes at all (and turn off 'atime', for that matter).
So people - ignore the manufacturer write throughput numbers. They don't mean squat. The fact that you may be able to push 50MB/s to the SSD is meaningless if that can only happen when you do big, aligned, writes.
If anybody knows of any reasonable SSDs that work as well as Intel's, let me know.
About: dim_STAT is a performance analysis and monitoring tool for Solaris and Linux (as well as all other UNIX) systems. Its main features are a Web-based interface, data storage in a SQL database, several data views, interactive (Java) or static (PNG) graphs, real-time monitoring, multi-host monitoring, post analyzing, statistics integration, professional reporting with automated features, and more.
Changes: A major performance update.
About: Sysprof is a sampling CPU profiler that uses a Linux kernel module to profile the entire system, not just a single application. It handles shared libraries, and applications do not need to be recompiled. It profiles all running processes, not just a single application, has a nice graphical interface, shows the time spent in each branch of the call tree, can load and save profiles, and is easy to use.
Changes: Compiles with 2.6.25 and later.
Discusses Capacity Planning and Performance Management techniques.
Information about Solaris operating system accounting to include code examples that extract the data in a usable format and pattern match it into workloads.
Discusses scenario planning techniques to help predict latent demand during overload periods. In this part 1 he explains how to simplify your model down to a single bottleneck.
Presents part two of the Scenario Planning article and explains how to follow-up a simple planning methodology based on a spreadsheet that is used to break down the problem and experiment with alternative future scenarios.
Richard discusses a class of problems that can affect system performance, that are not dynamic in nature, and that cannot be detected by conventional dynamic tuning tools.
This article presents the rationale for formal system performance management from a management, systems administrative and vendor perspective. It describes four classes of systems monitoring tools and their uses. The article discusses the issues of tool integration, "best-of-breed versus integrated suite" and the decision to "buy versus build."
The sysstat package contains the sar, sadf, iostat, mpstat, and pidstat commands for Linux. The sar command collects and reports system activity information. The statistics reported by sar concern I/O transfer rates, paging activity, process-related activities, interrupts, network activity, memory and swap space utilization, CPU utilization, kernel activities, and TTY statistics, among others. The sadf command may be used to display data collected by sar in various formats. The iostat command reports CPU statistics and I/O statistics for tty devices and disks. The pidstat command reports statistics for Linux processes. The mpstat command reports global and per-processor statistics.
Release focus: Minor bugfixes
Changes:
mpstat and sar didn't parse /proc/interrupts correctly when some CPUs had been disabled. This is now fixed. This release also fixes a bug in pidstat which caused confusion between PID and TID, resulting in erroneous statistics values being displayed. The iconfig script has been updated: Help for the --enable-compress-manpg parameter is now available, help for the --enable-install-cron parameter has been updated, and the parameter cron_interval has been added.
Aiming to provide increasingly higher-quality IP and Internet services at lower prices, Sprint Corp. has begun its most comprehensive study to date of traffic behavior on its Internet backbone.
After a year of developing its own test equipment, the carrier began collecting data at its San Jose, Calif., Internet POP (point of presence), the first of many sites slated for testing.
Sprint plans to use the data from the testing, called the Internet Measurement Study, to ensure that its network can handle ever-increasing customer traffic volume and to discover which network monitoring tools will be needed in future network equipment.
"Very little is known about the detailed behavior of Internet backbones," said Bryan Lyles, chief scientist at Sprint, in Kansas City, Mo. "Very fine-grained studies are what we need to make rational decisions on the equipment that goes into the network -- even the standards that go into it."
Sprint hopes the multimillion-dollar, multiyear study will enable it to keep its equipment costs as low as possible and ensure that its network delivers optimal performance.
"The goal is to make sure we make the best use of capital and the other resources we put into the network and to keep our customers happy," Lyles said.
Performance, performance, performance
As the Internet's importance to a company's bottom line increases, users expect ISPs (Internet service providers) or other data carriers to meet increasingly stringent service performance goals.
At Quebecor Printing (USA) Inc., which is installing an IP-based VPN (virtual private network) at its many locations, "class of service will include bandwidth allocation and prioritization for certain applications," said Terry Bush, vice president of data communications, in Greenwich, Conn.
At its bigger printing facilities, the company is installing multiple 1.5M-bps circuits to handle growth in its data traffic because IP bandwidth is more efficient and flexible in a VPN than in more conventional network designs, Bush said. Nevertheless, Quebecor demands service levels that rival private network solutions and has a service-level agreement that specifies zero packet loss and a round-trip, coast-to-coast network delay of less than 75 milliseconds, Bush said.
Sprint isn't alone among carriers and ISPs in its quest to improve Internet service. For example, "2001 will probably be the last year that we will buy narrowband switches," said Fred Briggs, chief technical officer at WorldCom Inc., in Clinton, Miss.
Chat Title: Solaris Utilities for Monitoring System Performance
Guest Speakers: James Liu and Karpagam Narayanan
This is a moderated forum
LizA: Welcome to the Solaris Live Chat, "Solaris Utilities for Monitoring System Performance" with James Liu and Karpagam Narayanan. James was our first Solaris Live! guest and we're very happy to have him back. James is ready to answer your questions on software development and benchmark formation strategies and configuration, scaling analysis, processor management, thread libraries, and so on. He is joined by Karpagam Narayanan, who has lots of experience with all the standard tools like Virtual Adrian (aka SE Toolkit), disk partitioning, network bandwidth trunking, and other things that get your app to run faster on Solaris[tm]. Karpagam and James, let's say that I'm new to Solaris and I want to know what CPU a process takes. Is there a command that shows me this?
jamesliu: I'll take this one. A number of commands can show this. You can use
prstat
which is bundled with Solaris 8 and is probably easiest. If you have the freeware top... you can use this too.
LizA: What does NLWP mean in
prstat
?
karpagam: NLWP refers to the number of light-weight processes, or LWPs, associated with the process.
LizA: How does someone find out which processors are online or off line?
jamesliu: You can find out using the
psrinfo
command. The
-v
option gives you a lot of info on the processors.
LizA: I need to increase the file descriptors on my server... I bumped up the
ulimit
but it still doesn't work. What else do I need to do?karpagam: Increase the
rlim_fd_max
andrlim_fd_cur
parameters in/etc/system
. Remember that these take affect after you reboot.jamesliu: LizA, you can also gain some efficiencies if your problem is related to using network file descriptors (i.e. sockets). You can tune the
tcp/ip
parameters using thendd /dev/tcp
command to shorten thetcp_time_wait_interval
.tefluid: I'm interested in optimizing application servers in order to run Java[tm] engines such as BEA WebLogic and ATG Dynamo. What advice can you give on profiling the system to best determine where the bottlenecks lie?
karpagam: This is a Java on Solaris question. Java has a profiling tool called hprof that can be enabled on the command line. Type -Xrunhprof:help for more info on this. The output gives you the methods that take the most CPU time...
karpagam: tefluid, there is also HAT (Heap Analysis Tool) available. There are also 3rd-party GUI tools; Optimizeit and JProbe are two of them.
LizA: I heard that in Solaris you can allocate certain processors to work on only one process. Will that help, too?
jamesliu: LizA, you can in fact assign certain processors to a specific process. The command to use is psrset. For folks like tefluid, binding the JVM PID to a processor set and excluding interrupts can possibly give a boost in performance.
Craki: I have a farm of Sybase database boxes, all on Solaris 8. Where can I start in making sure that everything that can be optimized is, for database operations?
karpagam: Craki, I would always start with the db monitoring tools. Once you are sure that you do not have any issues go through the system parameters...
karpagam: Craki, start by looking into the shared memory, semaphore, and message queue parameters first in /etc/system. Then look into disk, network, NFS, swapping/paging, memory, CPU, filesystem, and TCP, one at a time...
karpagam: Craki, do look at http://www.sun.com/sun-on-net/performance/perftools-solaris8.pdf for more info on Solaris tools.
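For concreteness, the /etc/system parameters karpagam refers to look roughly like the excerpt below. The parameter names are the Solaris 8 System V IPC tunables; the values are hypothetical placeholders, not recommendations; size them from your database vendor's installation notes, and remember a reboot is required.

```
* Hypothetical /etc/system excerpt for a database server
set shmsys:shminfo_shmmax=4294967295   * max shared memory segment size
set semsys:seminfo_semmni=100          * number of semaphore identifiers
set msgsys:msginfo_msgmni=50           * number of message queue identifiers
```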
Zartaj: I am interested in performance comparisons between Sun Solaris and Wintel. The problem is it is not easy to decide what is the right pair to compare. I have a UE250 450MHz with Solaris 8 and a P3 733 MHz with Windows 2000. I have seen the Wintel box consistently outperform the UE250. But is that a fair comparison? In general if I have a Sun system how do I determine what is the equivalent Wintel system to compare. Going by price alone, Wintel seems to have the edge.
jamesliu: Zartaj, it is often a race for more MIPS/MFLOPS, etc. in the hardware area. I don't know which benchmarks you run, but in those apps that are important to Sun's customers, Sun consistently tunes our applications to outscale and outperform anything on the market. It all depends on the use. In your particular case, it may in fact be that Wintel has better price/performance. For many of Sun's core customers, our value proposition is reliability, availability, and scalability. We've competed well on this philosophy for about 18 years and I predict we'll continue. As for your particulars, perhaps we can communicate offline and discuss how to improve your performance.
alexc: We use some scripts to automate gathering info from ps. We also use sar. We notice that total CPU utilization (by adding up ps info) is usually quite a bit less than what is stated by sar. Why is there a discrepancy?
karpagam: alexc, I am not sure which ps you are referring to - /usr/ucb/ps? On what version of Solaris? I do not know the time interval that ps uses for data gathering. If you are on Solaris 8, try using prstat. There are a lot of parameters that can come into play here - interval, versions, options for the tools, etc...
LizA: What do I need in order to look at mpstat? What do the columns mutexes and context switching mean?
karpagam: LizA, mutex contention occurs when a lot of CPUs are trying to grab the same resource lock. Only one CPU will be successful at any time. We do not want this to happen a lot...
jamesliu: LizA, context switching is also something that, done too often, expends resources... What you want to do is to limit these values to certain levels. smtx, for example, is best below 500 per CPU per second. For context switches you can check at http://www.setoolkit.com.
Zartaj: I'd like to know what tools are available for shared library profiling? Shared libraries cannot be instrumented for prof or gprof. And the LD_PROFILE variable can be used only for one shared library at a time. So how do I go about profiling all shared libraries being used by an app?
karpagam: Zartaj, you can try using truss and sotruss. truss gives shared library activity and an entry/exit trace of user-level function calls. sotruss is good and has less noise than truss...
dmdebertin: Are there any particular columns in vmstat (or other command) output that could indicate hardware or software problems? What are some things to look for that could indicate problems, and what is harmless?
jamesliu: dmdebertin, if your CPU percentage is high but system usage is low, most of the CPU is consumed by your app. You may want to think about tuning your code in this case. If system time is high, check out more with mpstat and look at the context switch and smtx values.
Emory2: Could you please compare the performance of a 24 CPU SunFire 6800 to the performance of a 24 CPU IBM S80 (configured with the same amount of RAM).
karpagam: Emory2, for what workload? You can consider looking into the TPC-C, TPC-D, and SPEC standard benchmark pages that match your workload.
LizA: How do I monitor the network?
karpagam: LizA, the primary tool you can use is netstat. There are options like -in for cumulative data, -s for TCP/UDP stats, and -I for a specific interface. I like to put netstat -in in a while loop...
jamesliu: LizA, Sun also provides some scripts for tuning your network drivers. http://www.sun.com has these scripts. Search for "network tuning" or "syn flood" and you should see some docs on how to tune your network interface.
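The polling loop karpagam describes might be sketched as below; the sketch is bounded to three samples and guarded so it degrades cleanly on hosts where netstat is not installed.

```shell
# Print cumulative per-interface packet/error counters every 5 seconds,
# three times, in the spirit of "netstat -in in a while loop".
if command -v netstat >/dev/null 2>&1; then
    for i in 1 2 3; do
        netstat -in
        sleep 5
    done
else
    echo "netstat not available on this host"
fi
```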
karpagam: LizA, netstat -a gives a lot more information on the sockets/ports open. Look for ESTABLISHED and TIME_WAIT.
LizA: netstat -a tells me that I have over 8000 connections. But I have only 3000 sessions open. More than half of them have a TIME_WAIT status. Is that something to do with my application?
jamesliu: LizA, regarding the netstat output, you'll probably have lots of network sessions still waiting to close. The default setting on Solaris is 240 seconds. You can use ndd /dev/tcp to set the tcp_time_wait_interval to a lower value so that these connections close down more quickly; say, 30 seconds is good. Be careful not to set this too low, as slow connections (e.g. modems) might get dropped.
Zartaj: I believe a 32-bit process can only use around 3GB out of a possible 4GB. So is it useful to have more than 4GB physical memory on a system that allows it?
karpagam: Zartaj, what you need to look into is how much your application uses/needs. Are you running 64-bit Oracle and need more than a 4GB SGA? Use pmap to tell you the process footprint and calculate on that basis.
Zartaj: In the Solaris Multithreading Guide, it recommends against thread pooling, saying it is cheaper to create threads as needed. Do you agree with that?
jamesliu: Zartaj, in general I would agree that threads are relatively cheaper to create than to pool. Pooling creates many potential opportunities for contention. However, in some cases, such as Java, the threading model may be more amenable to pooling since there is a Java layer there.
jd: The way I understand load average to be calculated, it is incremented by 1 for every CPU's worth of time spent. (E.g., a 10-CPU system with 10% user time as shown by vmstat will report a load avg. of 1.) High system time (as shown in vmstat) causes load to jump very high in some cases; I have seen a load avg. of 30 on a 10-CPU system with 40% system time/10% user time. I would like to know how the system comes up with that load avg.
jamesliu: jd, I couldn't tell you exactly how the algorithm works. It's been a while since I've touched on it. Karpagam?
karpagam: jd, a high system time of that ratio clearly shows that there is a bottleneck. Did you check to see how your disks are doing? You also might want to see in mpstat/top/prstat/statit what the utilization per processor is.
Craki: I find that whenever a box has fairly high uptime, reported memory usage is higher than it should be. My DBAs see this and start getting worried about the boxes not being big enough. Is this a Solaris behavioral quirk?
jamesliu: Craki, I can't be certain, but our experience shows that in uptimes of 60+ days, the memory footprint remains stable on many of our servers. The most common area of memory growth over time we've seen has perhaps been in memory leaks on the application or windowing side. Many windowing apps or servers or windows managers do in fact leak lots of memory. This may be the cause of growth over time.
jd: I am not asking about a problem in particular, I have just seen the load avg. jump like that and am curious as to how it's calculated.
karpagam: jd, Did you see this on Solaris 8?
Emory2: Does anyone know if there is a working version of "proctool" for Solaris 8? One version that we tested did not work for multiprocessors.
karpagam: Emory2, you can use the /usr/proc/bin proc tools - right? pmap, ptree, ptime, pldd, etc...
jd: I have seen it on 2.6 and 8; the most recent was on 8, where a Java programmer had an app that went crazy with creating/deleting threads.
jamesliu: jd, I guess you're still asking about how the load average is computed. Again, I can't tell you off hand since it's been a while since I've touched the algorithms. But I can imagine that any process that creates/destroys lots of threads is a contrived and somewhat unique situation. Perhaps we can work offline to discuss optimization and development techniques to reduce the CPU utilization.
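[Editor's note] The algorithm jd asks about, which the guests could not recall offhand, is in classic UNIX kernels an exponentially damped moving average of the run-queue length (runnable plus running threads) sampled every few seconds; the 1-, 5-, and 15-minute figures differ only in their decay constant. Because all runnable threads count, not just the ones currently on a CPU, the average can far exceed the CPU count under heavy churn, which matches jd's observation. A sketch of the arithmetic (the constant run queue of 2 is a made-up input; the 5-second sample period matches the traditional implementation):

```shell
# Sketch of the classic UNIX load-average update rule:
#   load = load*e + n*(1 - e), with e = exp(-5/60) for the 1-minute average
# (exp(-5/300) and exp(-5/900) give the 5- and 15-minute averages).
awk 'BEGIN {
    e = exp(-5/60)            # per-sample decay for the 1-minute average
    load = 0
    for (t = 0; t < 60; t++)  # 60 samples = 5 minutes at runq = 2
        load = load * e + 2 * (1 - e)
    printf "1-min load after 5 minutes at runq=2: %.2f\n", load
}'
```

After enough samples the average converges to the run-queue length itself, here approaching 2.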
LizA: Are there any special libraries I can use to improve performance?
jamesliu: LizA, there are a number of libraries that might boost performance. Some are in Solaris 8, some are third party. If you have a thread-intensive application and have high smtx values due to schedlock, you may want to put /usr/lib/lwp, an alternate thread library, at the top of your LD_LIBRARY_PATH. If your app is memory-allocation intensive, there are three ISV solutions that replace the bundled malloc on Solaris and improve performance.
alexc: A question about threading: the way I understand it, some programmers use multiple processes to do threading (spawning child processes) and some use threads within a single process. Clearly, multiple processes can run on multiple processes simultaneously. However, can threads within a single process run on more than one processor simultaneously?
alexc: Rather, multiple processes can use multiple processORs, but can threads within a single process do the same?
jamesliu: alexc, absolutely. Threads do run on multiple processors on Solaris, as do multiple processes with multiple threads. Solaris supports scheduling that allows a many-to-many relationship between threads or processes and processors.
Craki: Can you recommend a centralized monitoring/management package? I've done a small deployment of Sun[tm] Management Center and liked it. Would Big Brother be a good solution as well?
karpagam: Sun Management Center is very good. If you want to monitor database statistics also, I know that a lot of folks use Foglight from Quest Software. I do not know about Big Brother - sorry.
LizA: We're about out of time. Thanks to Karpagam and James...and all of you who asked such great questions. Karpagam and James, do you have a few parting words?
jamesliu: It has again been a pleasure. I'd be pleased to field questions in this forum again soon. -JCL
jamesliu: Note to all, if you're running any of the vmstat or mpstat tools, just make sure you use a time interval, like 5 seconds, and exclude the first entry from your computations. - jcl
karpagam: Thanks everyone for all the wonderful questions. It has been a pleasure. Thanks, LizA, for running this forum smoothly :)
LizA: Be sure to join us again on June 21, at 10 a.m. PDT, when our guest is Rich Teer and the topic is "Secure C Programming."
Storage performance has failed to keep up with that of other major components of computer systems. Hard disks have gotten larger, but their speed has not kept pace with the relative speed improvements in RAM and CPU technology. Because your hard drive can easily become your system's performance bottleneck, it is important to know how fast your disks and filesystems are and to get quantitative measurements of any improvements you can make to the disk subsystem. One way to make disk access faster is to use more disks in combination, as in a RAID-5 configuration.

To get a basic idea of how fast a physical disk can be accessed from Linux you can use the hdparm tool with the -T and -t options. The -T option takes advantage of the Linux disk cache and gives an indication of how much information the system could read from a disk if the disk were fast enough to keep up. The -t option also reads the disk through the cache, but without any precaching of results. Thus -t can give an idea of how fast a disk can deliver information stored sequentially on disk.

The hdparm tool isn't the best indicator of real-world performance. It operates at a very low level; once you place a filesystem onto a disk partition you might get significantly different results. You will also see large differences in speed between sequential access and random access. It would also be good to be able to benchmark a filesystem stored on a group of disks in a RAID configuration.
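A quick sketch of such an hdparm run follows; the device name /dev/sda is an assumption (substitute your own), and both measurements need root, so the sketch skips cleanly where that is not available.

```shell
# Hypothetical hdparm baseline for one disk; needs root and a real device.
dev=/dev/sda
if [ -b "$dev" ] && [ "$(id -u)" -eq 0 ] && command -v hdparm >/dev/null 2>&1
then
    hdparm -T "$dev"   # cached reads: RAM/cache transfer rate
    hdparm -t "$dev"   # buffered reads: sustained sequential disk rate
else
    echo "skipping: need root, hdparm, and a block device at $dev"
fi
```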
fio was created to allow benchmarking specific disk IO workloads. It can issue its IO requests using one of many synchronous and asynchronous IO APIs, and can also use various APIs which allow many IO requests to be issued with a single API call. You can also tune how large the files fio uses are, at what offsets in those files IO is to happen at, how much delay if any there is between issuing IO requests, and what if any filesystem sync calls are issued between each IO request. A sync call tells the operating system to make sure that any information that is cached in memory has been saved to disk and can thus introduce a significant delay. The options to fio allow you to issue very precisely defined IO patterns and see how long it takes your disk subsystem to complete these tasks.
fio is packaged in the standard repository for Fedora 8 and is available for openSUSE through the openSUSE Build Service. Users of Debian-based distributions will have to compile it from source with the usual "make; sudo make install" combination.

The first test you might like to perform is for random read IO performance. This is one of the nastiest IO loads that can be issued to a disk, because it causes the disk head to seek a lot, and disk head seeks are extremely slow operations relative to other hard disk operations. One area where random disk seeks occur in real applications is during application startup, when files are requested from all over the hard disk. You specify fio benchmarks using configuration files in an ini file format. You need only a few parameters to get started. rw=randread tells fio to use a random reading access pattern, size=128m specifies that it should transfer a total of 128 megabytes of data before calling the test complete, and the directory parameter explicitly tells fio what filesystem to use for the IO benchmark. On my test machine, the /tmp filesystem is an ext3 filesystem stored on a RAID-5 array consisting of three 500GB Samsung SATA disks. If you don't specify directory, fio uses the current directory that the shell is in, which might not be what you want. The configuration file and invocation are shown below.

$ cat random-read-test.fio
; random read of 128mb of data
[random-read]
rw=randread
size=128m
directory=/tmp/fio-testing/data

$ fio random-read-test.fio
random-read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
Starting 1 process
random-read: Laying out IO file(s) (1 file(s) / 128MiB)
Jobs: 1 (f=1): [r] [100.0% done] [ 3588/ 0 kb/s] [eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=30598
  read : io=128MiB, bw=864KiB/s, iops=211, runt=155282msec
    clat (usec): min=139, max=148K, avg=4736.28, stdev=6001.02
    bw (KiB/s) : min= 227, max= 5275, per=100.12%, avg=865.00, stdev=362.99
  cpu : usr=0.07%, sys=1.27%, ctx=32783, majf=0, minf=10
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=32768/0, short=0/0
     lat (usec): 250=34.92%, 500=0.36%, 750=0.02%, 1000=0.05%
     lat (msec): 2=0.41%, 4=12.80%, 10=44.96%, 20=5.16%, 50=0.94%
     lat (msec): 100=0.37%, 250=0.01%

Run status group 0 (all jobs):
   READ: io=128MiB, aggrb=864KiB/s, minb=864KiB/s, maxb=864KiB/s, mint=155282msec, maxt=155282msec

Disk stats (read/write):
  dm-6: ios=32768/148, merge=0/0, ticks=154728/12490, in_queue=167218, util=99.59%

fio produces many figures in this test. Overall, higher values for bandwidth and lower values for latency constitute better results.
The bw result shows the average bandwidth achieved by the test. The clat and bw lines show information about the completion latency and bandwidth respectively. The completion latency is the time between submitting a request and its being completed. The min, max, average, and standard deviation for the latency and bandwidth are shown. In this case, the standard deviation for both completion latency and bandwidth is quite large relative to the average value, so some IO requests were served much faster than others. The cpu line shows you how much impact the IO load had on the CPU, so you can tell if the processor in the machine is too slow for the IO you want to perform. The IO depths section is more interesting when you are testing an IO workload where multiple requests for IO can be outstanding at any point in time, as is done in the next example. Because the above test only allowed a single IO request to be issued at any time, the IO depths were at 1 for 100% of the time. The latency figures indented under the IO depths section show an overview of how long each IO request took to complete; for these results, almost half the requests took between 4 and 10 milliseconds from when the IO request was issued to when the result of that request was reported. The latencies are reported as intervals, so the "4=12.80%, 10=44.96%" section reports that 44.96% of requests took more than 4 milliseconds (the previous reported value) and up to 10 milliseconds to complete.

The large READ line third from last shows the average, min, and max bandwidth for each execution thread or process. fio lets you define many threads or processes to all submit work at the same time during a benchmark, so you can have many threads, each using synchronous APIs to perform IO, and benchmark the result of all these threads running at once. This lets you test IO workloads that are closer to many server applications, where a new thread or process is spawned to handle each connecting client. In this case we have only one thread. As the READ line near the bottom of the output shows, the single thread has an 864KiB/s aggregate bandwidth (aggrb), which tells you that either the disk is slow or the manner in which IO is submitted to the disk system is not friendly, causing the disk head to perform many expensive seeks and thus producing a lower overall IO bandwidth. If you are submitting IO to the disk in a friendly way you should be getting much closer to the speeds that hdparm reports (typically around 40-60MB/s).
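Because each latency bucket covers the interval "more than the previous value, up to this value", the buckets are disjoint and should account for every request. Summing the percentages copied from the lat lines of the random-read output above confirms this:

```shell
# Sum the lat bucket percentages from the random-read test;
# disjoint intervals covering all requests should total ~100%.
printf '%s\n' 34.92 0.36 0.02 0.05 0.41 12.80 44.96 5.16 0.94 0.37 0.01 |
awk '{ sum += $1 } END { printf "total: %.2f%%\n", sum }'
```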
I performed the same test again, this time using the Linux asynchronous IO subsystem in direct IO mode, with the possibility, based on the iodepth parameter, of up to eight asynchronous IO requests being outstanding (issued but not yet fulfilled because the system had to wait for disk IO) at any point in time. The choice of allowing up to only eight IO requests in the queue was arbitrary, but typically an application will limit the number of outstanding requests so the system does not become bogged down. In this test, the benchmark reported almost three times the bandwidth. The abridged results are shown below. The IO depths show how many asynchronous IO requests were issued but had not returned data to the application during the course of execution. The figures are reported for intervals from the previous figure; for example, 8=96.0% tells you that 96% of the time there were five, six, seven, or eight requests in the async IO queue, while, based on 4=4.0%, 4% of the time there were only three or four requests in the queue.

$ cat random-read-test-aio.fio
; same as random-read-test.fio
; ...
ioengine=libaio
iodepth=8
direct=1
invalidate=1

$ fio random-read-test-aio.fio
random-read: (groupid=0, jobs=1): err= 0: pid=31318
  read : io=128MiB, bw=2,352KiB/s, iops=574, runt= 57061msec
    slat (usec): min=8, max=260, avg=25.90, stdev=23.23
    clat (usec): min=1, max=124K, avg=13901.91, stdev=12193.87
    bw (KiB/s) : min= 0, max= 5603, per=97.59%, avg=2295.43, stdev=590.60
  ...
  IO depths : 1=0.1%, 2=0.1%, 4=4.0%, 8=96.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  ...
Run status group 0 (all jobs):
   READ: io=128MiB, aggrb=2,352KiB/s, minb=2,352KiB/s, maxb=2,352KiB/s, mint=57061msec, maxt=57061msec

Random reads are always going to be limited by the seek time of the disk head. Because the async IO test could issue as many as eight IO requests before waiting for any to complete, there was more chance for reads in the same disk area to be completed together, and thus an overall boost in IO bandwidth.
The HOWTO file from the fio distribution gives full details of the options you can use to specify benchmark workloads. One of the more interesting parameters is rw, which can specify sequential or random reads and/or writes in many combinations. The ioengine parameter selects how the IO requests are issued to the kernel. The invalidate option causes the kernel buffer and page cache to be invalidated for a file before beginning the benchmark. The runtime option specifies that a test should run for a given amount of time and then be considered complete. The thinktime parameter inserts a specified delay between IO requests, which is useful for simulating a real application that would normally perform some work on data that is being read from disk. fsync=n can be used to issue a sync call after every n writes issued. write_iolog and read_iolog cause fio to write or read a log of all the IO requests issued. With these commands you can capture a log of the exact IO commands issued, edit that log to give exactly the IO workload you want, and benchmark those exact IO requests. The iolog options are great for importing an IO access pattern from an existing application for use with fio.
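A hypothetical job file pulling several of these options together might look like the following (the option names are real fio parameters as described above; the values are arbitrary illustrations, and thinktime is given in microseconds):

```
; hypothetical job: timed random writes with think time and periodic syncs
[timed-randwrite]
rw=randwrite
size=1g
runtime=60
time_based
thinktime=1000
fsync=32
write_iolog=randwrite.iolog
```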
Simulating servers
You can also specify multiple threads or processes to all submit IO work at the same time to benchmark server-like filesystem interaction. In the following example I have four different processes, each issuing their own IO loads to the system, all running at the same time. I've based the example on having two memory-mapped query engines, a background updater thread, and a background writer thread. The difference between the two writing threads is that the writer thread is to simulate writing a journal, whereas the background updater must read and write (update) data. bgupdater has a thinktime of 40 microseconds, causing the process to sleep for a little while after each completed IO.
$ cat four-threads-randio.fio
; Four threads, two query, two writers.
[global]
rw=randread
size=256m
directory=/tmp/fio-testing/data
ioengine=libaio
iodepth=4
invalidate=1
direct=1
[bgwriter]
rw=randwrite
iodepth=32
[queryA]
iodepth=1
ioengine=mmap
direct=0
thinktime=3
[queryB]
iodepth=1
ioengine=mmap
direct=0
thinktime=5
[bgupdater]
rw=randrw
iodepth=16
thinktime=40
size=32m

$ fio four-threads-randio.fio
bgwriter: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
queryA: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
queryB: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
bgupdater: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
Starting 4 processes
bgwriter: (groupid=0, jobs=1): err= 0: pid=3241
  write: io=256MiB, bw=7,480KiB/s, iops=1,826, runt= 35886msec
    slat (usec): min=9, max=106K, avg=35.29, stdev=583.45
    clat (usec): min=117, max=224K, avg=17365.99, stdev=24002.00
    bw (KiB/s) : min= 0, max=14636, per=72.30%, avg=5746.62, stdev=5225.44
  cpu : usr=0.40%, sys=4.13%, ctx=18254, majf=0, minf=9
  IO depths : 1=0.1%, 2=0.1%, 4=0.4%, 8=3.3%, 16=59.7%, 32=36.5%, >=64=0.0%
     issued r/w: total=0/65536, short=0/0
     lat (usec): 250=0.05%, 500=0.33%, 750=0.70%, 1000=1.11%
     lat (msec): 2=7.06%, 4=14.91%, 10=27.10%, 20=21.82%, 50=20.32%
     lat (msec): 100=4.74%, 250=1.86%
queryA: (groupid=0, jobs=1): err= 0: pid=3242
  read : io=256MiB, bw=589MiB/s, iops=147K, runt= 445msec
    clat (usec): min=2, max=165, avg= 3.48, stdev= 2.38
  cpu : usr=70.05%, sys=30.41%, ctx=91, majf=0, minf=65545
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=65536/0, short=0/0
     lat (usec): 4=76.20%, 10=22.51%, 20=1.17%, 50=0.05%, 100=0.05%
     lat (usec): 250=0.01%
queryB: (groupid=0, jobs=1): err= 0: pid=3243
  read : io=256MiB, bw=455MiB/s, iops=114K, runt= 576msec
    clat (usec): min=2, max=303, avg= 3.48, stdev= 2.31
    bw (KiB/s) : min=464158, max=464158, per=1383.48%, avg=464158.00, stdev= 0.00
  cpu : usr=73.22%, sys=26.43%, ctx=69, majf=0, minf=65545
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=65536/0, short=0/0
     lat (usec): 4=76.81%, 10=21.61%, 20=1.53%, 50=0.02%, 100=0.03%
     lat (usec): 250=0.01%, 500=0.01%
bgupdater: (groupid=0, jobs=1): err= 0: pid=3244
  read : io=16,348KiB, bw=1,014KiB/s, iops=247, runt= 16501msec
    slat (usec): min=7, max=42,515, avg=47.01, stdev=665.19
    clat (usec): min=1, max=137K, avg=14215.23, stdev=20611.53
    bw (KiB/s) : min= 0, max= 1957, per=2.37%, avg=794.90, stdev=495.94
  write: io=16,420KiB, bw=1,018KiB/s, iops=248, runt= 16501msec
    slat (usec): min=9, max=42,510, avg=38.73, stdev=663.37
    clat (usec): min=202, max=229K, avg=49803.02, stdev=34393.32
    bw (KiB/s) : min= 0, max= 1840, per=10.89%, avg=865.54, stdev=411.66
  cpu : usr=0.53%, sys=1.39%, ctx=12089, majf=0, minf=9
  IO depths : 1=0.1%, 2=0.1%, 4=0.3%, 8=22.8%, 16=76.8%, 32=0.0%, >=64=0.0%
     issued r/w: total=4087/4105, short=0/0
     lat (usec): 2=0.02%, 4=0.04%, 20=0.01%, 50=0.06%, 100=1.44%
     lat (usec): 250=8.81%, 500=4.24%, 750=2.56%, 1000=1.17%
     lat (msec): 2=2.36%, 4=2.62%, 10=9.47%, 20=13.57%, 50=29.82%
     lat (msec): 100=19.07%, 250=4.72%

Run status group 0 (all jobs):
   READ: io=528MiB, aggrb=33,550KiB/s, minb=1,014KiB/s, maxb=589MiB/s, mint=445msec, maxt=16501msec
  WRITE: io=272MiB, aggrb=7,948KiB/s, minb=1,018KiB/s, maxb=7,480KiB/s, mint=16501msec, maxt=35886msec

Disk stats (read/write):
  dm-6: ios=4087/69722, merge=0/0, ticks=58049/1345695, in_queue=1403777, util=99.74%

As one would expect, the bandwidth the array achieved for the query and writer processes was vastly different. Queries proceed at roughly 500MiB/s, while writing comes in at about 1MiB/s for the mixed read/write load and 7.5MiB/s for the pure write load. The IO depths show the number of pending IO requests that are queued when an IO request is issued.
For example, for the bgupdater process, nearly a quarter of the async IO requests were fulfilled with eight or fewer requests in the queue out of a potential 16. In contrast, the bgwriter had more than half of its requests performed with 16 or fewer pending requests in the queue.
To contrast with the three-disk RAID-5 configuration, I reran the four-threads-randio.fio test on a single Western Digital 750GB drive. The bgupdater process achieved less than half the bandwidth and each of the query processes ran at 1/3 the overall bandwidth. For this test the Western Digital drive was on a different computer with different CPU and RAM specifications as well, so any comparison should be taken with a grain of salt.
bgwriter: (groupid=0, jobs=1): err= 0: pid=14963
  write: io=256MiB, bw=6,545KiB/s, iops=1,597, runt= 41013msec
queryA: (groupid=0, jobs=1): err= 0: pid=14964
  read : io=256MiB, bw=160MiB/s, iops=39,888, runt= 1643msec
queryB: (groupid=0, jobs=1): err= 0: pid=14965
  read : io=256MiB, bw=163MiB/s, iops=40,680, runt= 1611msec
bgupdater: (groupid=0, jobs=1): err= 0: pid=14966
  read : io=16,416KiB, bw=422KiB/s, iops=103, runt= 39788msec
  write: io=16,352KiB, bw=420KiB/s, iops=102, runt= 39788msec
   READ: io=528MiB, aggrb=13,915KiB/s, minb=422KiB/s, maxb=163MiB/s, mint=1611msec, maxt=39788msec
  WRITE: io=272MiB, aggrb=6,953KiB/s, minb=420KiB/s, maxb=6,545KiB/s, mint=39788msec, maxt=41013msec

The vast array of ways that fio can issue its IO requests lends it to benchmarking IO patterns and the use of various APIs to perform that IO. You can also run identical fio configurations on different filesystems or underlying hardware to see what difference changes at that level will make to performance.
Benchmarking different IO request systems for a particular IO pattern can be handy if you are about to write an IO-intensive application but are not sure which API and design will work best on your hardware. For example, you could keep the disk system and RAM fixed and see how well an IO load would be serviced using memory-mapped IO or the Linux asyncio interface. Of course this requires you to have a very intricate knowledge of the typical IO requests that your application will issue. If you already have a tool that uses something like memory-mapped files, then you can get IO patterns for typical use from the existing tool, feed them into fio using different IO engines, and get a reasonable picture of whether it might be worth porting the application to a different IO API for better performance.
Ben Martin has been working on filesystems for more than 10 years. He completed his Ph.D. and now offers consulting services focused on libferris, filesystems, and search solutions.
Indications of an I/O-bound system (field, threshold, sar option):
  %busy    (% of time the disk is busy)                >85                              sar -d
  %rcache  (reads found in buffer cache)               low, <85                         sar -b
  %wcache  (writes found in buffer cache)              low, <60%                        sar -b
  %wio     (idle CPU waiting for disk I/O)             dev system >30, fileserver >80   sar -u

Indications of a memory-bound system:
  bswot/s  (transfers from memory to disk swap area)   >200                             sar -w
  bswin/s  (transfers to memory)                       >200                             sar -w
  %swpocc  (% of time swap queue is occupied)          >10                              sar -q
  rflt/s   (page reference faults)                     >0                               sar -t
  freemem  (average pages free for user processes)     <100                             sar -r

Indications of a CPU-bound system:
  %idle    (% of time CPU has no work to do)           <5                               sar -u
  runq-sz  (processes in memory waiting for CPU)       >2                               sar -q
  %runocc  (% of time run queue is occupied)           >90                              sar -q
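These counters come straight from sar's option groups, so checking one of them is a one-liner. A guarded sketch for the CPU-bound case (%idle from the table above; on Linux, sar is provided by the sysstat package, and the interval/count values are arbitrary):

```shell
# Sample CPU utilization (including %idle) three times, 5 seconds apart;
# skip cleanly where sar is not installed.
if command -v sar >/dev/null 2>&1; then
    sar -u 5 3
else
    echo "sar not available; install the sysstat package"
fi
```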
-by Jon Hill and Kemer Thomson
This article presents the rationale for formal system performance management from a management, systems administrative and vendor perspective. It describes four classes of systems monitoring tools and their uses. The article discusses the issues of tool integration, "best-of-breed versus integrated suite" and the decision to "buy versus build."
Observability (December 1999)
-by Adrian Cockcroft
Discusses Capacity Planning and Performance Management techniques.
Other Cockcroft columns at www.sun.com
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links, as it develops like a living tree...
You can use PayPal to buy a cup of coffee for the authors of this site.
Disclaimer:
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of Google's privacy policy. If you do not want to be tracked by Google, please disable JavaScript for this site. The site is perfectly usable without JavaScript.
Last modified: March 29, 2020