In its simplest form, a performance monitor (or system monitor) is a utility that tracks running processes and gives a real-time graphical display of resource utilization. The Unix top utility is a classic example of such a tool. It can help you plan upgrades, track processes that need to be optimized, monitor the results of tuning and configuration changes, and understand a workload and its effect on resource usage in order to identify bottlenecks.
|
Bottlenecks can occur on practically any component of the server, with the typical suspects being I/O, memory, and CPU. A bottleneck can be caused by a malfunctioning resource, by the system not having enough of a resource, or by a program that dominates a particular resource.
Sun's Solaris blueprints used to contain several very good materials about performance tuning. I especially recommend the blueprints written by Adrian Cockcroft. These Sun blueprints have become pretty difficult to find; Google is your friend. An alternative is his (also old, 1994) book Sun Performance and Tuning: Sparc & Solaris, which has some interesting information and costs $0.01 on Amazon.
This disappearance of old Sun blueprints is typical for acquisitions, but at the same time it is pretty sad (Oracle would lose nothing by preserving them for viewing), and I do not know of a good collection that still archives them. Here is a long quote from his Performance Monitoring Short Cuts, which is still available:
- Performance Monitoring Short Cuts
- Introduction
- Quick start - for the impatient
- What data to collect and why
- Concepts for identification of performance degradation root cause
- Using the SE toolkit
- Baselining and data collection
- Source code for scripts
- Using TNF on Solaris 9 - trace probes
- Cookbook
- TOC for Tech Tips.
Introduction
Purpose
This article describes what data points are needed to observe performance, why those exact points were chosen, and how to collect and store the data.
The guideline is to be minimalistic: get the data you really need, nothing more, and keep the data collection tool's footprint as small as possible, so that the tool won't become one of the blips on the radar.
Both the data that is to be collected and the way it is collected are discussed and described. Including the design discussions here makes it much easier to explain the what, why, and how of each aspect of the observation. After this the reader should be able to:
- Gather the relevant Performance data from the System.
- Have a starting point for interpreting the observations.
- Understand why the data being collected are relevant.
- Contribute to the performance discussion.
Scope
This article
- is only about observation; it does not identify the causes of problems
- does not describe how to increase performance
- addresses only the Solaris operating system
The actual data collection scripts are not included.
Acronyms
- APIC: Advanced Programmable Interrupt Controller.
- CPC: CPU Performance Counter DTrace provider.
- HAT: Hardware Address Translation. (Sun06,p3-33)
- HTT: HyperThreading Technology (HTT) is Intel's trademark for their implementation of the SMT.
- L1D: (intel) Level 1 Data cache.
- L1I: (intel) Level 1 Instruction cache.
- L2: Level 2 cache
- Hyper Threading:
- Retired, Instructions:
- SIMD: single-instruction multiple-data
- SMT: simultaneous multithreading technology.
- TLB: Translation Lookaside Buffer
- TNF: Trace Normal Form
- TTE: Translation Table Entry. (McD07,p590)
References
- MySQL Scalability on Nehalem systems (Sun Fire X4270)
- Improving Application Efficiency Through Chip Multi-Threading
- Multi Processing and Multi Threading
- UltraSPARC Processors Documentation
- [BluePrints:Performance+Counters+on+Solaris]
- Intel® 64 and IA-32 Architectures Software Developer's Manual
- Solaris Internals
- CMT Utilization
- http://wikis.sun.com/display/SunStudio/Sun+Studio+Technical+Articles+Collection
- http://prefetch.net/articles/dtracecookbook.html
- TTCP
- [http://developers.sun.com/solaris/articles/tnf.html]
- Coc98_Sun Performance and Tuning: Java and the Internet_, Second Edition, Adrian Cockcroft; Richard Pettit. ISBN-13: 978-0-13-095249-3
- McD06: Solaris Performance and tools ISBN-13: 978-0-13-156819-8
- McD07: Solaris Internals ISBN: 0-13-148209-2
- Sun06: Solaris 10 Operating System Internals SI365-S10 Student Guide
- Read Me First!: A Style Guide for the Computer Industry, Second Edition by Sun Technical Publications ISBN-13: 978-0-13-142899-7
- Resource Management
- [
Configuring and Tuning Databases on the Solaris Platform|http://my.safaribooksonline.com/0130834173]- Web Performance Tuning, 2nd Edition
- System Performance Tuning, 2nd Edition
- http://www.setoolkit.org
- http://oss.oetiker.ch/rrdtool/
- http://opensolaris.org/os/community/dtrace/dtracetoolkit/
- OpenSolaris Project: Zone Statistics
- Solaris Kernel Statistics - Accessing libkstat with C
- Solaris Kernel Statistics, Part II Accessing libkstat with Shell Script
Who should use this article
- System administrator
- Performance agent
How this article is organized
- Quick start - for the impatient: This section describes what tools are needed and where to get them.
- What data to collect and why: Defines the cardinal resources and what sources can affect them. Then the data points needed for each cardinal resource are defined and described.
- Drill down - taking potshots at messengers:
- Tools of the chase: What tools are good for chasing down issues.
Related material
Quick start - for the impatient
Getting going with data collection
To get started, the following needs to be installed on the system:
- SE Toolkit: Tools specifically written to retrieve the kernel statistics data.
- You don't have to be root to run this tool.
- http://www.setoolkit.org/cms/
- http://sourceforge.net/projects/setoolkit/
- RRDtool: The database for storing the collected data. The tool also supports graph generation. It's a round-robin database that always stays the same size.
- Epoch converter: For converting between Human readable time and epoch time. For calculating the start epoch for the RRDtool.
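A separate epoch-converter tool isn't strictly necessary; any scripting language can do the conversion. A minimal sketch in Python (920804400 is the start time used in the RRDtool tutorial's create example, which corresponds to 1999-03-07 12:00 in the tutorial's UTC+1 timezone, i.e. 11:00 UTC):

```python
from datetime import datetime, timezone

def to_epoch(year, month, day, hour=0, minute=0, tz=timezone.utc):
    """Convert a human-readable time to Unix epoch seconds."""
    return int(datetime(year, month, day, hour, minute, tzinfo=tz).timestamp())

def from_epoch(seconds, tz=timezone.utc):
    """Convert epoch seconds back to a human-readable ISO string."""
    return datetime.fromtimestamp(seconds, tz=tz).isoformat()

print(from_epoch(920804400))  # 1999-03-07T11:00:00+00:00
print(to_epoch(1999, 3, 7, 11))  # 920804400
```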
Installing the tools
SE Toolkit from sourceforge.net
RICHPse-xxx.pkg.gz is a Sun package file; it installs into /opt/RICHPse.
The RICHPse package depends on the C pre-processor, cpp.
For Solaris 9 the SUNWcpp package must be installed for the SE toolkit to run.
Verify the SE Toolkit installation
- /opt/RICHPse/bin/se /opt/RICHPse/examples/cpus.se
- /opt/RICHPse/bin/se /opt/RICHPse/examples/net.se
- /opt/RICHPse/bin/se /opt/RICHPse/examples/disks.se
RRDtool from Sunfreeware.com
- freetype
- libart_lgpl
- libgcc
- libiconv
- libpng
- rrdtool
- zlib
Verify the RRDtool installation
- rrdtool create test.rrd --start 920804400 DS:speed:COUNTER:600:U:U RRA:AVERAGE:0.5:1:24 RRA:AVERAGE:0.5:6:10
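To sanity-check what that create line allocates, the DS and RRA fields can be decoded by hand. A small sketch of the arithmetic (assuming rrdtool's default 300-second step, since no --step option was given):

```python
# Decode the example "rrdtool create" line from the RRDtool tutorial.
STEP = 300  # seconds; rrdtool's default --step

def rra_span(steps_per_row, rows, step=STEP):
    """Return the total time span (in seconds) covered by one RRA."""
    return steps_per_row * rows * step

# DS:speed:COUNTER:600:U:U  -> one counter data source named "speed",
#   600 s heartbeat, no min/max bounds.
# RRA:AVERAGE:0.5:1:24 -> 24 rows of 1-step (5 min) averages = 2 hours
# RRA:AVERAGE:0.5:6:10 -> 10 rows of 6-step (30 min) averages = 5 hours
print(rra_span(1, 24) // 3600)  # 2
print(rra_span(6, 10) // 3600)  # 5
```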
from http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html
What data to collect and why
Performance is affected by: tuning/configuration + workload + fluctuations (Coc98)
Cardinal resources available
- CPU
- Memory - this is a kind of near storage. This includes the various level 1 through 3 cache.
- NIC
- Storage
Sources of performance degradation
- cache thrashing
- Getting data from swapped pages(at least when they are on disk)
- On disk data.
- Context switching
- Interrupts
- CPU switching
- Cache thrashing on HTT CPUs
- Starting too many new processes per time unit.
- Using too much memory and swap; actually needing too much data in physical memory at the same time.
Key guidelines
- Get only the data that is needed, as a minimum
- Don't make the data collection part of the problem.
What data to collect
cpu
- csw
- What: context switching. Both Voluntary and Involuntary(McD06,p21)
- Why:
- How: kstat$cpusys.pswitch
- icsw
- What: Involuntary context switching.
- If you get an involuntary context switch, then the thread that is running on that cpu is stopped (pinned) and a new thread runs in its place (Blog: Will a faster cpu make my application faster?)
- Why: cause pinning
- Limit: When this number increases past 500, the system is under a heavy load according to Sun Java System Portal Server 6 2005Q4 Deployment Planning Guide Appendix B
- How: kstat$cpusys.inv_swtch
- xcal
- What: Cross calls between cpu's
- Why: Expensive.
- How: kstat$cpusys.xcalls
- intr
- What: Device interrupts. This is both HW and SW interrupts(McD06,p22)
- Why:
- How: kstat$cpusys.intr
- migr Migration is costly
- What: Moving a process from one CPU to the other.
- Or is this only a thread?
- Why: Dirtying the lookup table, thereby reducing the cache efficiency.
- How: kstat$cpusys.migr
- smtx
- What: the number of times the kernel failed to obtain a mutex immediately. (McD06,p22) Seems to indicate that busy spins aren't counted. It also seems that spinning time is counted toward 'sys'.
- Why: If the number is more than about 200 per CPU, then usually system time begins to climb.(Coc98,ch10)
- How: kstat$cpusys.
- sysexec
- What: number of new processes being started.
- Why: It is resource costly to start a new process.
- How: kstat$cpusys.sysexec
- idle
- What:
- Why:
- How: kstat$cpusys.cpu[CPU_IDLE]
- user
- What:
- Why:
- How: kstat$cpusys.cpu[CPU_USER]
- sys
- What:
- Why:
- How: kstat$cpusys.cpu[CPU_KERNEL]
- run que
- What: this is the total run queue, the number of processes ready for the CPU but not on a CPU yet. Divide by ncpus for comparison. If the run queue is lower than the number of CPUs, we are good.
- Why:
- How: kstat$sysinfo.runque
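All the kstat counters above (pswitch, inv_swtch, xcalls, sysexec, and so on) are cumulative since boot, so a collector has to snapshot them at an interval and report per-second deltas, which is what vmstat and mpstat do internally. A minimal sketch, using hypothetical snapshot values:

```python
def rates(prev, curr, interval):
    """Convert two snapshots of cumulative kstat counters into per-second rates."""
    return {name: (curr[name] - prev[name]) / interval for name in curr}

# Hypothetical snapshots taken 5 seconds apart.
t0 = {"pswitch": 120_000, "inv_swtch": 1_500, "xcalls": 9_000, "sysexec": 40}
t1 = {"pswitch": 121_600, "inv_swtch": 1_550, "xcalls": 9_400, "sysexec": 42}
print(rates(t0, t1, 5))
# {'pswitch': 320.0, 'inv_swtch': 10.0, 'xcalls': 80.0, 'sysexec': 0.4}
```

The first sample after boot should be discarded, since its implied interval is the whole uptime.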
Memory
- scan
- What: pages examined by pageout daemon
- How many pages are scanned; the scanner starts at slowscan when free memory drops to lotsfree (Sun06,p5-6)
- Why: When this goes above 0, the kernel is looking for pages to page out.
- So does that mean that no swapout happens before scan goes above 0?
- How: kstat$cpuvm.scan
- rev
- What: revolutions of the page daemon hand
- Why:
- How: kstat$cpuvm.
- as_fault minor page faults via as_fault()
- What: (Sun06,p5-29). Address Space fault(Sun06,p4-12)
- Will this counter also go up on a segmentation fault? It seems so, according to (McD07,p470).
- It seems that this count goes up on segmentation faults, minor page faults, and major page faults, as per (McD06,p277).
- Why:
- How: kstat$cpuvm.
- hat_fault
- What: minor page faults via hat_fault
- Why:
- How: kstat$cpuvm.
- It looks like minor faults are calculated by mf = as_fault - maj_fault - prot_fault
- Though it seems prot_fault is also fired when attempting to write to a read-only page (McD07,p474)
- A minor fault is caused by an address space or hardware address translation fault that can be resolved without performing a page-in(Coc98, ch13).
- maj_fault
- What: major page faults: Attempted to access a virtual memory address, where the page does not exist in physical memory(McD07,p473)
- Why:
- How: kstat$cpuvm.maj_fault
- de deficit (Coc98, ch13).
- What:
- Why:
- How: kstat$cpuvm.
- freemem
- What: Free memory pages.
- Please note that pages by default have different sizes on SPARC (8K) and x86 (4K)
- Why:
- How: kstat$vminfo.freemem
- swap_free
- What: Unallocated swap pages.
- Why:
- How: kstat$vminfo.
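Two of the calculations implied above are easy to get wrong: converting freemem (reported in pages, whose default size differs between SPARC and x86) into bytes, and deriving minor faults as mf = as_fault - maj_fault - prot_fault. A sketch using hypothetical counter values:

```python
PAGESIZE = {"sparc": 8192, "x86": 4096}  # default base page sizes in bytes

def freemem_bytes(freemem_pages, arch):
    """Convert the freemem page count into bytes for the given architecture."""
    return freemem_pages * PAGESIZE[arch]

def minor_faults(as_fault, maj_fault, prot_fault):
    """Minor faults as derived in the text: mf = as_fault - maj_fault - prot_fault."""
    return as_fault - maj_fault - prot_fault

# The same page count means different amounts of memory per architecture:
# 10,000 free pages is ~78 MB on SPARC but only ~39 MB on x86.
print(freemem_bytes(10_000, "sparc") // (1024 * 1024))  # 78
print(freemem_bytes(10_000, "x86") // (1024 * 1024))    # 39
print(minor_faults(as_fault=5_000, maj_fault=120, prot_fault=380))  # 4500
```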
NIC
Both packets and octets must be sampled, since you might have:
- Few big packets
- Lots of small packets
- ipackets
- What: Packets in
- Why:
- How: nic_itterator.ipackets
- ierrors
- What: Input errors
- Why:
- How: nic_itterator.ierrors
- ioctets
- What: Octets in
- Why:
- How: nic_itterator.ioctets
- opackets
- What: Packets out
- Why:
- How: nic_itterator.opackets
- oerrors
- What: Output errors
- Why:
- How: nic_itterator.oerrors
- ooctets
- What: Octets out
- Why:
- How: nic_itterator.ooctets
- collisions
- What: collisions
- Why:
- How: nic_itterator.collisions
- defer
- What: Ethernet metric that counts the rate at which output packets are delayed before transmission(Coc98,ch16).
- Why:
- How: nic_itterator.defer
- This might not always be supported by the NIC driver, in which case it always returns 0. Please see /opt/RICHPse/include/netif.se to check whether it's set to zero or actually loaded with data.
- nocanput
- What: packets discarded for lack of IP-level buffering on input. This can cause a TCP connection to time out and retransmit the packet from the other end (Coc98,ch16).
- Why:
- How: nic_itterator.nocanput
- Not always supported by the driver.
- norcvbuf, noxmtbuf buffer allocation failure counts
- What:
- Why:
- How: nic_itterator.
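The note at the top of this section, that packets and octets must be read together, comes down to computing the average packet size over the sample interval: the same byte rate can mean a few big packets or lots of small ones, with very different per-packet overhead. A sketch using hypothetical deltas:

```python
def avg_packet_size(octets_delta, packets_delta):
    """Average bytes per packet over a sample interval; None if no traffic."""
    if packets_delta == 0:
        return None
    return octets_delta / packets_delta

# Same byte volume over the interval, very different workloads:
print(avg_packet_size(1_500_000, 1_000))   # 1500.0 -> few big packets
print(avg_packet_size(1_500_000, 20_000))  # 75.0   -> lots of small packets
```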
Disk
- nread
- What: bytes read
- Why:
- How: kstat$disk.nread
- nwritten
- What: bytes written
- Why:
- How: kstat$disk.nwritten
- reads
- What: read operations
- Why:
- How: kstat$disk.reads
- writes
- What: write operations
- Why:
- How: kstat$disk.writes
- wcnt
- What: elements in wait state
- Why:
- How: kstat$disk.wcnt
- rcnt
- What: elements in run state
- Why:
- How: kstat$disk.rcnt
- wtime
- What: cumulative wait (pre-service) time (sys/kstat.h).
- Why: Just the count of how many elements are in the queue is not always enough. As with the network, there may be many elements in the queue that are handled quickly, or few elements that take a long time to handle. I would assume the maximum time observed is the time between samples; so if sampling every 5 seconds, the max is 5 s.
- How: kstat$disk.wtime
- rtime
- What: cumulative run (service) time (sys/kstat.h).
- Why: As for 'wtime'
- How: kstat$disk.rtime
There is also a limit to the amount of unflushed data that can be written to a file. This limitation is implemented by the UFS write throttle algorithm, which tries to prevent too much memory from being consumed by pending write data. For each file, between 256 Kbytes and 384 Kbytes of data can be pending. When there are less than 256 Kbytes (the low-water mark ufs_LW), it is left to fsflush to write the data. If there are between 256 Kbytes and 384 Kbytes (the high-water mark ufs_HW), writes are scheduled to flush the data to disk. If more than 384 Kbytes are pending, then when the process attempts to write more, it is suspended until the amount of pending data drops below the low-water mark. So, at high data rates, writes change from being asynchronous to synchronous, and this change slows down applications. The limitation is per-process, per-file. (Coc98, ch8)
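The UFS write-throttle behavior just described can be summarized as a three-state classifier on the amount of pending data per process, per file. A sketch (the 256/384 Kbyte thresholds are the ufs_LW/ufs_HW defaults quoted above):

```python
# Thresholds from the UFS write throttle described above (per process, per file).
UFS_LW = 256 * 1024  # low-water mark: below this, fsflush writes the data
UFS_HW = 384 * 1024  # high-water mark: above this, the writer is suspended

def write_throttle_state(pending_bytes):
    """Classify a writer by its pending (unflushed) bytes for one file."""
    if pending_bytes < UFS_LW:
        return "async: left to fsflush"
    if pending_bytes <= UFS_HW:
        return "flush scheduled"
    return "writer suspended until pending < low-water mark"

print(write_throttle_state(100 * 1024))  # async: left to fsflush
print(write_throttle_state(300 * 1024))  # flush scheduled
print(write_throttle_state(500 * 1024))  # writer suspended until pending < low-water mark
```

This is why, at high data rates, writes change from asynchronous to synchronous and the application slows down.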
The "wait service time" is actually the time spent in the "wait" queue(Coc98,ch3).
This measures a two-stage queue:
- wait queue, in the device driver
- active queue, in the device itself.
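Since wtime and rtime accumulate in nanoseconds (sys/kstat.h), the usual derived metrics are percent-busy and average service time over a sample interval. A sketch using hypothetical deltas; tools like iostat derive their numbers in the same general way, though the exact formulas here are illustrative:

```python
def disk_utilization(rtime_delta_ns, interval_s):
    """Percent of the interval the device had at least one active request."""
    return 100.0 * rtime_delta_ns / (interval_s * 1_000_000_000)

def avg_service_time_ms(rtime_delta_ns, ops_delta):
    """Average time an operation spent in the active queue, in milliseconds."""
    if ops_delta == 0:
        return 0.0
    return rtime_delta_ns / ops_delta / 1_000_000

# Hypothetical 5 s sample: device active for 2.5 s servicing 500 reads+writes.
print(disk_utilization(2_500_000_000, 5))       # 50.0 (percent busy)
print(avg_service_time_ms(2_500_000_000, 500))  # 5.0 (ms per operation)
```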
Concepts for identification of performance degradation root cause
This section describes ways to get from indicators to the root cause of the problem.
Tools of the chase
- analyzer(1): Sun Studio Performance Analyzer
- cpustat - Display CPU performance counters
- cputrack - Like cpustat, but tracks a single process.
- DTrace toolkit: http://opensolaris.org/os/community/dtrace/dtracetoolkit/
- hotkernel identifies which function is on the CPU the most
- execsnoop lists processes being started.
- Cpu/intoncpu.d interrupt on-cpu usage.
- Cpu/xcallsbypid.d CPU cross calls by PID
- intrstat interrupt statistics.
- kstat Kernel statistics. The mother lode.
- lockstat kernel lock and profiling statistics
- mpstat per-processor or per-processor-set statistics
- poolstat active pool statistics
- psradm change processor operational status. Disable/enable cores etc.
- psradm -f 5 disable cpu5
- psradm -n 5 enable cpu5
- psrinfo information about processors
- psrset manage processor sets
- vmstat (virtual) memory statistics
cpustat
- You need to be root to run this.
- The events given to '-c' are CPU specific.
- cpustat -c inst_queue_write_cycles,inst_queue_writes
time cpu event      pic0     pic1
5.000   3  tick  7508400 25504987
5.001   4  tick  4982002 17097458
5.001   5  tick  8303888 28502794
5.001   7  tick  8092210 28807695
5.002   6  tick 10630526 36690168
kstat
- kstat -p -m cpu_stat -i0 -s 'intr*'
lockstat
According to Coc98, ch10, using lockstat incurs some overhead, at least on Solaris 2.6.
mpstat
- mpstat 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0   11   1  319 2830  756 304   12    0    9   0   360   1   1  0  98
  0    6   0    1 2739  726 286    3    0    2   0   323   0   1  0  99
  0    0   0    6 2733  722 291    5    0    4   0   318   0   1  0  99
  0    0   0    3 2725  713 286    3    0    3   0   330   0   1  0  99
- (man mpstat)
- minf minor faults
- mjf major faults
- xcal inter-processor cross-calls
- intr interrupts
- ithr interrupts as threads (not counting clock interrupt)
- csw context switches
- icsw involuntary context switches
- migr thread migrations (to another processor)
- smtx spins on mutexes (lock not acquired on first try)
- srw spins on readers/writer locks (lock not acquired on first try)
- syscl system calls
- usr percent user time
- sys percent system time
- wt always zero
- idl percent idle time
- sze number of processors in the requested processor set
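When watching a column such as smtx or xcal over time, it can help to parse mpstat's output programmatically rather than eyeball it. A minimal sketch that pairs the header with one data row (the sample values are taken from the mpstat output earlier in this section):

```python
def parse_mpstat(header_line, data_line):
    """Pair an mpstat header with one data row, returning a {column: value} dict."""
    cols = header_line.split()
    vals = [int(v) for v in data_line.split()]
    return dict(zip(cols, vals))

header = "CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl"
row    = "0     11   1  319 2830  756 304   12    0    9   0   360   1   1  0  98"
stats = parse_mpstat(header, row)
print(stats["xcal"], stats["smtx"], stats["idl"])  # 319 9 98
```

From here it is a short step to alerting when, say, smtx per CPU exceeds the ~200 threshold mentioned earlier.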
psrset
- psrset -q
- psrset -e <prset_id> <command>
- psrset -Q 1 | grep -v "lwp id"
- get list of tasks in specific processor set
psrset -Q 1 | grep -v "lwp id" | awk '{print $3}' | tr -d ':' | xargs -n1 ps -fZp | grep -v CMD
vmstat
vmstat 5
kthr      memory            page            disk          faults      cpu
r b w   swap   free  re mf pi po fr de sr cd f0 s0 --   in  sy  cs us sy id
0 0 0 1059032 75600  21 57 28  0  1  0 44  6 -0 -0  0  241 467 236  3 18 79
0 0 0 1059620 66792   1  9  0  0  0  0  0  0  0  0  0  224 145  90  1 13 86
0 0 0 1059540 66708   0  2  0  0  0  0  0  0  0  0  0  227  95  92  1 15 83
0 0 0 1059540 66708   0  1  0  0  0  0  0  0  0  0  0  240 124 105  2 18 80
- The first line is a summary since the system was started.
- r the number of kernel threads in run queue
- b the number of blocked kernel threads that are waiting for resources (I/O, paging, and so forth)
- w the number of swapped out light-weight processes (LWPs) that are waiting for processing resources to finish.
- swap available swap space (Kbytes)
- free size of the free list (Kbytes)
- re page reclaims
- mf minor faults
- pi kilobytes paged in
- po kilobytes paged out
- fr kilobytes freed
- de anticipated short-term memory shortfall (Kbytes)
- sr page scan rate
- Disk the number of disk operations per second (one column per device)
- cd CD drive
- f0 Floppy?
- s0 SCSI disk0
- - -
- Faults
- in interrupts
- Is this the same as intr in mpstat?
- sy system calls
- cs Context switches
- us user time
- sy system time
- id idle time
CPU related investigations
troubleshooting sysexec
How many processes have been started.
- cd /opt/DTraceToolkit-0.99/Proc
- ./shortlived.d
- Let it run for 10 to 20 seconds to get a fair sample.
- The 10 to 20 seconds is arbitrarily selected.
- Find out what each of the processes listed in the PPID section is.
- Find out why each of these processes has these short-lived children, and whether that is OK.
example of sysexec troubleshooting.
short lived processes:   11.778 secs
total sample duration:   11.076 secs

Total time by process name,
        rrdtool         2150 ms
        exit_this.sh    7631 ms

Total time by PPID,
        387             2150 ms
        6535            7631 ms
intr
- cd /opt/DTraceToolkit-0.99/Cpu
- ./inttimes.d
- Let it run for about 15 seconds.
- This will give a list of how much time is spent servicing interrupts from all the sources.
output from Cpu DTrace script
Tracing... Hit Ctrl-C to end.
^C
DEVICE      TIME (ns)
igb2            46248
ahci0           47568
uhci1           52938
uhci0           56073
igb3           100567
uhci2          101641
uhci4          102279
ehci0          116884
uhci5          121902
uhci3          163313
mpt0           469179
ehci1          560201
igb0          1068298
Who is the interrupt culprit
- mpstat 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0    0 2679  663 290    8    0   13   0   239   0   1  0  99
  1    1   0    0   19    0 316   30    0    0   0   343   0   0  0 100
  2    0   0    0   16    0  70    0    0    0   0    65   0   0  0 100
  3   47   0   21   36    7 365    1   50   12   0   470   0   0  0  99
  4   76   0   47   22    6 400    4   73   12   0   515   0   0  0  99
  5   64   0   41   30    4 355    7   58    9   0   408   0   0  0  99
  6   94   0   29   47   17 434    5   58   12   0   471   0   0  0  99
  7   85   0    9   26    0 298    3   59    4   0   317   0   0  0 100
- ./intbycpu.d
- It was run for about 5 sec.
CPU  INTERRUPTS
  7         127
  1         130
  2         132
  6         247
  3         280
  5         301
  4         844
  0        6078
- ./intrtop.d
CPU# PID CMD   Interrupts
   1   0 sched        467
   2   0 sched        467
   6   0 sched        510
   7   0 sched        552
   5   0 sched        621
   3   0 sched        671
   4   0 sched       1869
   0   0 sched      11314
smtx
If you see high levels of mutex contention, you need to identify both the locks that are being contended on and the component of the workload that is causing the contention, for example, system calls or a network protocol stack(Coc98, ch10).
- If smtx increases sharply (for example, from 50 to 500), it is a sign of a system resource bottleneck (e.g., network or disk), according to the Sun Java System Portal Server 7 Deployment Planning Guide.
Possibly use lockstat.
- What locks are being used.
- Use lockstat
- Who incurs the locks?
- Possibly use dtrace
The kernel summit was two weeks ago, and at the end of that I got one of the new 80GB solid state disks from Intel. Since then, I've been wanting to talk to people about it because I'm so impressed with it, but at the same time I don't much like using the kernel mailing list as some kind of odd public publishing place that isn't really kernel-related, so since I'm testing this whole blogging thing, I might as well vent about it here.
That thing absolutely rocks.
I've been impressed by Intel before (Core 2), but they've had their share of total mistakes and idiotic screw-ups too (Itanic), but the things Intel tends to have done well are the things where they do incremental improvements. So it's a nice thing to be able to say that they can do new things very well too. And while I often tend to get early access to technology, seldom have I looked forward to it so much, and seldom have things lived up to my expectations so well.
In fact, I can't recall the last time that a new tech toy I got made such a dramatic difference in performance and just plain usability of a machine of mine.
So what's so special about that Intel SSD, you ask? Sure, it gets up to 250MB/s reads and 70MB/s writes, but fancy disk arrays can certainly do as well or better. Why am I not gushing about some nice NAS box? I didn't even put the thing into a laptop, after all, it's actually in Tove's Mac Mini (running Linux, in case anybody was confused ;), so a RAID NAS box would certainly have been a lot bigger and probably have more features.
But no, forget about the throughput figures. Others can match - or at least come close - to the throughput, but what that Intel SSD does so well is random reads and writes. You can do small random accesses to it and still get great performance, and quite frankly, that's the whole point of not having some stupid mechanical latencies as far as I'm concerned.
And the sad part is that other SSD's generally absolutely suck when it comes to especially random write performance. And small random writes is what you get when you update various filesystem meta-data on any normal filesystem, so it really does matter. For example, a vendor who shall remain nameless has an SSD disk out there that they were also hawking at the Kernel Summit, and while they get fine throughput (something like 50+MB/s on big contiguous writes), they benchmark a pitiful 10 (yes, that's ten, as in "how many fingers do you have") small random writes per second. That is slower than a rotational disk.
In contrast, the Intel SSD does about 8,500 4kB random writes per second. Yeah, that's over eight thousand IOps on random write accesses with a relevant block size, rather than some silly and unrealistic contiguous write test. That's what I call solid-state media.
The whole thing just rocks. Everything performs well. You can put that disk in a machine, and suddenly you almost don't even need to care whether things were in your page cache or not. Firefox starts up pretty much as snappily in the cold-cache case as it does hot-cache. You can do package installation and big untars, and you don't even notice it, because your desktop doesn't get laggy or anything.
So here's the deal: right now, don't buy any other SSD than the Intel ones, because as far as I can tell, all the other ones are pretty much inferior to the much cheaper traditional disks, unless you never do any writes at all (and turn off 'atime', for that matter).
So people - ignore the manufacturer write throughput numbers. They don't mean squat. The fact that you may be able to push 50MB/s to the SSD is meaningless if that can only happen when you do big, aligned, writes.
If anybody knows of any reasonable SSDs that work as well as Intel's, let me know.
About: dim_STAT is a performance analysis and monitoring tool for Solaris and Linux (as well as all other UNIX) systems. Its main features are a Web-based interface, data storage in a SQL database, several data views, interactive (Java) or static (PNG) graphs, real-time monitoring, multi-host monitoring, post analyzing, statistics integration, professional reporting with automated features, and more.
Changes: A major performance update.
About: Sysprof is a sampling CPU profiler that uses a Linux kernel module to profile the entire system, not just a single application. It handles shared libraries, and applications do not need to be recompiled. It profiles all running processes, not just a single application, has a nice graphical interface, shows the time spent in each branch of the call tree, can load and save profiles, and is easy to use.
Changes: Compiles with 2.6.25 and later.
Discusses Capacity Planning and Performance Management techniques.
Information about Solaris operating system accounting to include code examples that extract the data in a usable format and pattern match it into workloads.
Discusses scenario planning techniques to help predict latent demand during overload periods. In this part 1 he explains how to simplify your model down to a single bottleneck.
Presents part two of the Scenario Planning article and explains how to follow-up a simple planning methodology based on a spreadsheet that is used to break down the problem and experiment with alternative future scenarios.
Richard discusses a class of problems that can affect system performance, that are not dynamic in nature, and that cannot be detected by conventional dynamic tuning tools.
This article presents the rationale for formal system performance management from a management, systems administrative and vendor perspective. It describes four classes of systems monitoring tools and their uses. The article discusses the issues of tool integration, "best-of-breed versus integrated suite" and the decision to "buy versus build."
The sysstat package contains the sar, sadf, iostat, mpstat, and pidstat commands for Linux. The sar command collects and reports system activity information. The statistics reported by sar concern I/O transfer rates, paging activity, process-related activities, interrupts, network activity, memory and swap space utilization, CPU utilization, kernel activities, and TTY statistics, among others. The sadf command may be used to display data collected by sar in various formats. The iostat command reports CPU statistics and I/O statistics for tty devices and disks. The pidstat command reports statistics for Linux processes. The mpstat command reports global and per-processor statistics.
Release focus: Minor bugfixes
Changes:
mpstat and sar didn't parse /proc/interrupts correctly when some CPUs had been disabled. This is now fixed. This release also fixes a bug in pidstat which caused confusion between PID and TID, resulting in erroneous statistics values being displayed. The iconfig script has been updated: Help for the --enable-compress-manpg parameter is now available, help for the --enable-install-cron parameter has been updated, and the parameter cron_interval has been added.
Aiming to provide increasingly higher-quality IP and Internet services at lower prices, Sprint Corp. has begun its most comprehensive study to date of traffic behavior on its Internet backbone.
After a year of developing its own test equipment, the carrier began collecting data at its San Jose, Calif., Internet POP (point of presence), the first of many sites slated for testing.
Sprint plans to use the data from the testing, called the Internet Measurement Study, to ensure that its network can handle ever-increasing customer traffic volume and to discover which network monitoring tools will be needed in future network equipment.
"Very little is known about the detailed behavior of Internet backbones," said Bryan Lyles, chief scientist at Sprint, in Kansas City, Mo. "Very fine-grained studies are what we need to make rational decisions on the equipment that goes into the network -- even the standards that go into it."
Sprint hopes the multimillion-dollar, multiyear study will enable it to keep its equipment costs as low as possible and ensure that its network delivers optimal performance.
"The goal is to make sure we make the best use of capital and the other resources we put into the network and to keep our customers happy," Lyles said.
Performance, performance, performance
As the Internet's importance to a company's bottom line increases, users expect ISPs (Internet service providers) or other data carriers to meet increasingly stringent service performance goals.
At Quebecor Printing (USA) Inc., which is installing an IP-based VPN (virtual private network) at its many locations, "class of service will include bandwidth allocation and prioritization for certain applications," said Terry Bush, vice president of data communications, in Greenwich, Conn.
At its bigger printing facilities, the company is installing multiple 1.5M-bps circuits to handle growth in its data traffic because IP bandwidth is more efficient and flexible in a VPN than in more conventional network designs, Bush said. Nevertheless, Quebecor demands service levels that rival private network solutions and has a service-level agreement that specifies zero packet loss and a round-trip, coast-to-coast network delay of less than 75 milliseconds, Bush said.
Sprint isn't alone among carriers and ISPs in its quest to improve Internet service. For example, "2001 will probably be the last year that we will buy narrowband switches," said Fred Briggs, chief technical officer at WorldCom Inc., in Clinton, Miss.
Chat Title: Solaris Utilities for Monitoring System Performance
Guest Speakers: James Liu and Karpagam Narayanan
This is a moderated forum
LizA: Welcome to the Solaris Live Chat, "Solaris Utilities for Monitoring System Performance" with James Liu and Karpagam Narayanan. James was our first Solaris Live! guest and we're very happy to have him back. James is ready to answer your questions on software development and benchmark formation strategies and configuration, scaling analysis, processor management, thread libraries, and so on. He is joined by Karpagam Narayanan, who has lots of experience with all the standard tools like Virtual Adrian (aka SE Toolkit), disk partitioning, network bandwidth trunking, and other things that get your app to run faster on Solaris[tm]. Karpagam and James, let's say that I'm new to Solaris and I want to know what CPU a process takes. Is there a command that shows me this?
jamesliu: I'll take this one. A number of commands can show this. You can use
prstat
which is bundled with Solaris 8 and is probably easiest. If you have the freeware top... you can use this too.
LizA: What does NLWP mean in
prstat
?
karpagam: NLWP refers to the number of light-weight processes, or LWPs, associated with the process.
LizA: How does someone find out which processors are online or off line?
jamesliu: You can find out using the
psrinfo
command. The
-v
option gives you a lot of info on the processors.
LizA: I need to increase the file descriptors on my server... I bumped up the
ulimit
but it still doesn't work. What else do I need to do?karpagam: Increase the
rlim_fd_max
andrlim_fd_cur
parameters in/etc/system
. Remember that these take affect after you reboot.jamesliu: LizA, you can also gain some efficiencies if your problem is related to using network file descriptors (i.e. sockets). You can tune the
tcp/ip
parameters using thendd /dev/tcp
command to shorten thetcp_time_wait_interval
.tefluid: I'm interested in optimizing application servers in order to run Java[tm] engines such as BEA WebLogic and ATG Dynamo. What advice can you give on profiling the system to best determine where the bottlenecks lie?
karpagam: This is a Java on Solaris question. Java has a profiling tool called hprof that can be enabled on the command line. Type -Xrunhprof:help for more info on this. The output gives you the methods that take the most CPU time...
karpagam: tefluid, there is also HAT (Heap Analysis Tool) available. There are also 3rd-party GUI tools; Optimizeit and JProbe are two of them.
LizA: I heard that in Solaris you can allocate certain processors to work on only one process. Will that help, too?
jamesliu: LizA, you can in fact assign certain processors to a specific process. The command to use is psrset. For folks like tefluid, binding the JVM PID to a processor set and excluding interrupts can possibly give a boost in performance.
Craki: I have a farm of Sybase database boxes, all on Solaris 8. Where can I start in making sure that everything that can be optimized is, for database operations?
karpagam: Craki, I would always start with the db monitoring tools. Once you are sure that you do not have any issues go through the system parameters...
karpagam: Craki, start by looking into the shared memory, semaphore, and message queue parameters first in /etc/system. Then look into disk, network, NFS, swapping/paging, memory, CPU, filesystem, and TCP, one at a time...
karpagam: Craki, do look at http://www.sun.com/sun-on-net/performance/perftools-solaris8.pdf for more info on Solaris tools.
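For concreteness, the /etc/system parameters karpagam refers to look roughly like the excerpt below. The parameter names are the Solaris 8 System V IPC tunables; the values are hypothetical placeholders, not recommendations; size them from your database vendor's installation notes, and remember a reboot is required.

```
* Hypothetical /etc/system excerpt for a database server
set shmsys:shminfo_shmmax=4294967295   * max shared memory segment size
set semsys:seminfo_semmni=100          * number of semaphore identifiers
set msgsys:msginfo_msgmni=50           * number of message queue identifiers
```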
Zartaj: I am interested in performance comparisons between Sun Solaris and Wintel. The problem is it is not easy to decide what is the right pair to compare. I have a UE250 450MHz with Solaris 8 and a P3 733 MHz with Windows 2000. I have seen the Wintel box consistently outperform the UE250. But is that a fair comparison? In general if I have a Sun system how do I determine what is the equivalent Wintel system to compare. Going by price alone, Wintel seems to have the edge.
jamesliu: Zartaj, it is often a race for more MIPS/MFLOPS, etc. in the hardware area. I don't know which benchmarks you run, but in those apps that are important to Sun's customers, Sun consistently tunes our applications to outscale and outperform anything on the market. It all depends on the use. In your particular case, it may in fact be that Wintel has better price/performance. For many of Sun's core customers, our value proposition is reliability, availability, and scalability. We've competed well on this philosophy for about 18 years and I predict we'll continue. As for your particulars, perhaps we can communicate offline and discuss how to improve your performance.
alexc: We use some scripts to automate gathering info from ps. We also use sar. We notice that total CPU utilization (by adding up ps info) is usually quite a bit less than what is stated by sar. Why is there a discrepancy?
karpagam: alexc, I am not sure which ps you are referring to - /usr/ucb/ps? On what version of Solaris? I do not know the time interval that ps uses for data gathering. If you are on Solaris 8, try using prstat. There are a lot of parameters that can come into play here - interval, versions, options for the tools, etc...
LizA: What do I need in order to look at mpstat? What do the columns mutexes and context switching mean?
karpagam: LizA, mutex contention occurs when a lot of CPUs are trying to grab the same resource lock. Only one CPU will be successful at any time. We do not want this to happen a lot...
jamesliu: LizA, context switching is also something that, done too often, expends resources... What you want to do is to limit these values to certain levels. smtx, for example, is best below 500 per CPU per second. For context switches you can check at http://www.setoolkit.com.
Zartaj: I'd like to know what tools are available for shared library profiling? Shared libraries cannot be instrumented for prof or gprof. And the LD_PROFILE variable can be used only for one shared library at a time. So how do I go about profiling all shared libraries being used by an app?
karpagam: Zartaj, you can try using truss and sotruss. truss gives shared library activity and an entry/exit trace of user-level function calls. sotruss is good and has less noise than truss...
dmdebertin: Are there any particular columns in vmstat (or other command) output that could indicate hardware or software problems? What are some things to look for that could indicate problems, and what is harmless?
jamesliu: dmdebertin, if your CPU percentage is high but system usage is low, most of the CPU is consumed by your app. You may want to think about tuning your code in this case. If system time is high, check out more with mpstat and look at the context switch and smtx values.
Emory2: Could you please compare the performance of a 24 CPU SunFire 6800 to the performance of a 24 CPU IBM S80 (configured with the same amount of RAM).
karpagam: Emory2, for what workload? You can consider looking into the TPC-C, TPC-D, and SPEC standard benchmark pages that match your workload.
LizA: How do I monitor the network?
karpagam: LizA, the primary tool you can use is netstat. There are options like -in for cumulative data, -s for TCP/UDP stats, and -I for a specific interface. I like to put netstat -in in a while loop...
jamesliu: LizA, Sun also provides some scripts for tuning your network drivers. http://www.sun.com has these scripts. Search for "network tuning" or "syn flood" and you should see some docs on how to tune your network interface.
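The polling loop karpagam describes might be sketched as below; the sketch is bounded to three samples and guarded so it degrades cleanly on hosts where netstat is not installed.

```shell
# Print cumulative per-interface packet/error counters every 5 seconds,
# three times, in the spirit of "netstat -in in a while loop".
if command -v netstat >/dev/null 2>&1; then
    for i in 1 2 3; do
        netstat -in
        sleep 5
    done
else
    echo "netstat not available on this host"
fi
```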
karpagam: LizA, netstat -a gives a lot more information on the sockets/ports open. Look for ESTABLISHED and TIME_WAIT.
LizA: netstat -a tells me that I have over 8000 connections. But I have only 3000 sessions open. More than half of them have a TIME_WAIT status. Is that something to do with my application?
jamesliu: LizA, regarding the netstat output, you'll probably have lots of network sessions still waiting to close. The default setting on Solaris is 240 seconds. You can use ndd /dev/tcp to set the tcp_time_wait_interval to a lower value so that these connections close down more quickly; say, 30 seconds is good. Be careful not to set this too low, as slow connections (e.g. modems) might get dropped.
Zartaj: I believe a 32-bit process can only use around 3GB out of a possible 4GB. So is it useful to have more than 4GB physical memory on a system that allows it?
karpagam: Zartaj, what you need to look into is how much your application uses/needs. Are you running 64-bit Oracle and need more than a 4GB SGA? Use pmap to tell you the process footprint and calculate on that basis.
Zartaj: In the Solaris Multithreading Guide, it recommends against thread pooling, saying it is cheaper to create threads as needed. Do you agree with that?
jamesliu: Zartaj, in general I would agree that threads are relatively cheaper to create than to pool. Pooling creates many potential opportunities for contention. However, in some cases, such as Java, the threading model may be more amenable to pooling since there is a Java layer there.
jd: The way I understand load average to be calculated, it is incremented by 1 for every CPU's worth of time spent. (E.g., a 10-CPU system with 10% user time as shown by vmstat will report a load avg. of 1.) High system time (as shown in vmstat) causes load to jump very high in some cases; I have seen a load avg. of 30 on a 10-CPU system with 40% system time/10% user time. I would like to know how the system comes up with that load avg.
jamesliu: jd, I couldn't tell you exactly how the algorithm works. It's been a while since I've touched on it. Karpagam?
karpagam: jd, a high system time of that ratio clearly shows that there is a bottleneck. Did you check to see how your disks are doing? You also might want to see in mpstat/top/prstat/statit what the utilization per processor is.
Craki: I find that whenever a box has fairly high uptime, reported memory usage is higher than it should be. My DBAs see this and start getting worried about the boxes not being big enough. Is this a Solaris behavioral quirk?
jamesliu: Craki, I can't be certain, but our experience shows that in uptimes of 60+ days, the memory footprint remains stable on many of our servers. The most common area of memory growth over time we've seen has perhaps been in memory leaks on the application or windowing side. Many windowing apps or servers or windows managers do in fact leak lots of memory. This may be the cause of growth over time.
jd: I am not asking about a problem in particular, I have just seen the load avg. jump like that and am curious as to how it's calculated.
karpagam: jd, Did you see this on Solaris 8?
Emory2: Does anyone know if there is a working version of "proctool" for Solaris 8? One version that we tested did not work for multiprocessors.
karpagam: Emory2, you can use the /usr/proc/bin proc tools - right? pmap, ptree, ptime, pldd, etc...
jd: I have seen it on 2.6 and 8; the most recent was on 8, where a Java programmer had an app that went crazy with creating/deleting threads.
jamesliu: jd, I guess you're still asking about how the load average is computed. Again, I can't tell you off hand since it's been a while since I've touched the algorithms. But I can imagine that any process that creates/destroys lots of threads is a contrived and somewhat unique situation. Perhaps we can work offline to discuss optimization and development techniques to reduce the CPU utilization.
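[Editor's note] The algorithm jd asks about, which the guests could not recall offhand, is in classic UNIX kernels an exponentially damped moving average of the run-queue length (runnable plus running threads) sampled every few seconds; the 1-, 5-, and 15-minute figures differ only in their decay constant. Because all runnable threads count, not just the ones currently on a CPU, the average can far exceed the CPU count under heavy churn, which matches jd's observation. A sketch of the arithmetic (the constant run queue of 2 is a made-up input; the 5-second sample period matches the traditional implementation):

```shell
# Sketch of the classic UNIX load-average update rule:
#   load = load*e + n*(1 - e), with e = exp(-5/60) for the 1-minute average
# (exp(-5/300) and exp(-5/900) give the 5- and 15-minute averages).
awk 'BEGIN {
    e = exp(-5/60)            # per-sample decay for the 1-minute average
    load = 0
    for (t = 0; t < 60; t++)  # 60 samples = 5 minutes at runq = 2
        load = load * e + 2 * (1 - e)
    printf "1-min load after 5 minutes at runq=2: %.2f\n", load
}'
```

After enough samples the average converges to the run-queue length itself, here approaching 2.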
LizA: Are there any special libraries I can use to improve performance?
jamesliu: LizA, there are a number of libraries that might boost performance. Some are in Solaris 8, some are third party. If you have a thread-intensive application and have high smtx values due to schedlock, you may want to put /usr/lib/lwp, an alternate thread library, at the top of your LD_LIBRARY_PATH. If your app is memory-allocation intensive, there are three ISV solutions that replace the bundled malloc on Solaris and improve performance.
alexc: A question about threading: the way I understand it, some programmers use multiple processes to do threading (spawning child processes) and some use threads within a single process. Clearly, multiple processes can run on multiple processes simultaneously. However, can threads within a single process run on more than one processor simultaneously?
alexc: Rather, multiple processes can use multiple processORs, but can threads within a single process do the same?
jamesliu: alexc, absolutely. Threads do run on multiple processors on Solaris, as do multiple processes with multiple threads. Solaris supports scheduling that allows a many-to-many relationship between threads or processes and processors.
Craki: Can you recommend a centralized monitoring/management package? I've done a small deployment of Sun[tm] Management Center and liked it. Would Big Brother be a good solution as well?
karpagam: Sun Management Center is very good. If you want to monitor database statistics also, I know that a lot of folks use Foglight from Quest Software. I do not know about Big Brother - sorry.
LizA: We're about out of time. Thanks to Karpagam and James...and all of you who asked such great questions. Karpagam and James, do you have a few parting words?
jamesliu: It has again been a pleasure. I'd be pleased to field questions in this forum again soon. -JCL
jamesliu: Note to all, if you're running any of the vmstat or mpstat tools, just make sure you use a time interval, like 5 seconds, and exclude the first entry from your computations. - jcl
karpagam: Thanks everyone for all the wonderful questions. It has been a pleasure. Thanks, LizA, for running this forum smoothly :)
LizA: Be sure to join us again on June 21, at 10 a.m. PDT, when our guest is Rich Teer and the topic is "Secure C Programming."
Storage performance has failed to keep up with that of other major components of computer systems. Hard disks have gotten larger, but their speed has not kept pace with the relative speed improvements in RAM and CPU technology. Because your hard drive can easily become your system's performance bottleneck, it is important to know how fast your disks and filesystems are and to get quantitative measurements of any improvements you can make to the disk subsystem. One way to make disk access faster is to use more disks in combination, as in a RAID-5 configuration.

To get a basic idea of how fast a physical disk can be accessed from Linux you can use the hdparm tool with the -T and -t options. The -T option takes advantage of the Linux disk cache and gives an indication of how much information the system could read from a disk if the disk were fast enough to keep up. The -t option also reads the disk through the cache, but without any precaching of results. Thus -t can give an idea of how fast a disk can deliver information stored sequentially on disk.

The hdparm tool isn't the best indicator of real-world performance. It operates at a very low level; once you place a filesystem onto a disk partition you might get significantly different results. You will also see large differences in speed between sequential access and random access. It would also be good to be able to benchmark a filesystem stored on a group of disks in a RAID configuration.
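A quick sketch of such an hdparm run follows; the device name /dev/sda is an assumption (substitute your own), and both measurements need root, so the sketch skips cleanly where that is not available.

```shell
# Hypothetical hdparm baseline for one disk; needs root and a real device.
dev=/dev/sda
if [ -b "$dev" ] && [ "$(id -u)" -eq 0 ] && command -v hdparm >/dev/null 2>&1
then
    hdparm -T "$dev"   # cached reads: RAM/cache transfer rate
    hdparm -t "$dev"   # buffered reads: sustained sequential disk rate
else
    echo "skipping: need root, hdparm, and a block device at $dev"
fi
```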
fio was created to allow benchmarking specific disk IO workloads. It can issue its IO requests using one of many synchronous and asynchronous IO APIs, and can also use various APIs which allow many IO requests to be issued with a single API call. You can also tune how large the files fio uses are, at what offsets in those files IO is to happen at, how much delay if any there is between issuing IO requests, and what if any filesystem sync calls are issued between each IO request. A sync call tells the operating system to make sure that any information that is cached in memory has been saved to disk and can thus introduce a significant delay. The options to fio allow you to issue very precisely defined IO patterns and see how long it takes your disk subsystem to complete these tasks.
fio is packaged in the standard repository for Fedora 8 and is available for openSUSE through the openSUSE Build Service. Users of Debian-based distributions will have to compile it from source with the usual "make; sudo make install" combination.

The first test you might like to perform is for random read IO performance. This is one of the nastiest IO loads that can be issued to a disk, because it causes the disk head to seek a lot, and disk head seeks are extremely slow operations relative to other hard disk operations. One area where random disk seeks occur in real applications is during application startup, when files are requested from all over the hard disk. You specify fio benchmarks using configuration files in an ini file format. You need only a few parameters to get started. rw=randread tells fio to use a random reading access pattern, size=128m specifies that it should transfer a total of 128 megabytes of data before calling the test complete, and the directory parameter explicitly tells fio what filesystem to use for the IO benchmark. On my test machine, the /tmp filesystem is an ext3 filesystem stored on a RAID-5 array consisting of three 500GB Samsung SATA disks. If you don't specify directory, fio uses the current directory that the shell is in, which might not be what you want. The configuration file and invocation are shown below.

$ cat random-read-test.fio
; random read of 128mb of data
[random-read]
rw=randread
size=128m
directory=/tmp/fio-testing/data

$ fio random-read-test.fio
random-read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
Starting 1 process
random-read: Laying out IO file(s) (1 file(s) / 128MiB)
Jobs: 1 (f=1): [r] [100.0% done] [ 3588/ 0 kb/s] [eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=30598
  read : io=128MiB, bw=864KiB/s, iops=211, runt=155282msec
    clat (usec): min=139, max=148K, avg=4736.28, stdev=6001.02
    bw (KiB/s) : min= 227, max= 5275, per=100.12%, avg=865.00, stdev=362.99
  cpu : usr=0.07%, sys=1.27%, ctx=32783, majf=0, minf=10
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=32768/0, short=0/0
     lat (usec): 250=34.92%, 500=0.36%, 750=0.02%, 1000=0.05%
     lat (msec): 2=0.41%, 4=12.80%, 10=44.96%, 20=5.16%, 50=0.94%
     lat (msec): 100=0.37%, 250=0.01%

Run status group 0 (all jobs):
   READ: io=128MiB, aggrb=864KiB/s, minb=864KiB/s, maxb=864KiB/s, mint=155282msec, maxt=155282msec

Disk stats (read/write):
  dm-6: ios=32768/148, merge=0/0, ticks=154728/12490, in_queue=167218, util=99.59%

fio produces many figures in this test. Overall, higher values for bandwidth and lower values for latency constitute better results.
The bw result shows the average bandwidth achieved by the test. The clat and bw lines show information about the completion latency and bandwidth respectively. The completion latency is the time between submitting a request and its being completed. The min, max, average, and standard deviation for the latency and bandwidth are shown. In this case, the standard deviation for both completion latency and bandwidth is quite large relative to the average value, so some IO requests were served much faster than others. The cpu line shows you how much impact the IO load had on the CPU, so you can tell if the processor in the machine is too slow for the IO you want to perform. The IO depths section is more interesting when you are testing an IO workload where multiple requests for IO can be outstanding at any point in time, as is done in the next example. Because the above test only allowed a single IO request to be issued at any time, the IO depths were at 1 for 100% of the time. The latency figures indented under the IO depths section show an overview of how long each IO request took to complete; for these results, almost half the requests took between 4 and 10 milliseconds from when the IO request was issued to when the result of that request was reported. The latencies are reported as intervals, so the "4=12.80%, 10=44.96%" section reports that 44.96% of requests took more than 4 milliseconds (the previous reported value) and up to 10 milliseconds to complete.

The large READ line third from last shows the average, min, and max bandwidth for each execution thread or process. fio lets you define many threads or processes to all submit work at the same time during a benchmark, so you can have many threads, each using synchronous APIs to perform IO, and benchmark the result of all these threads running at once. This lets you test IO workloads that are closer to many server applications, where a new thread or process is spawned to handle each connecting client. In this case we have only one thread. As the READ line near the bottom of the output shows, the single thread has an 864KiB/s aggregate bandwidth (aggrb), which tells you that either the disk is slow or the manner in which IO is submitted to the disk system is not friendly, causing the disk head to perform many expensive seeks and thus producing a lower overall IO bandwidth. If you are submitting IO to the disk in a friendly way you should be getting much closer to the speeds that hdparm reports (typically around 40-60MB/s).
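Because each latency bucket covers the interval "more than the previous value, up to this value", the buckets are disjoint and should account for every request. Summing the percentages copied from the lat lines of the random-read output above confirms this:

```shell
# Sum the lat bucket percentages from the random-read test;
# disjoint intervals covering all requests should total ~100%.
printf '%s\n' 34.92 0.36 0.02 0.05 0.41 12.80 44.96 5.16 0.94 0.37 0.01 |
awk '{ sum += $1 } END { printf "total: %.2f%%\n", sum }'
```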
I performed the same test again, this time using the Linux asynchronous IO subsystem in direct IO mode, with the possibility, based on the iodepth parameter, of up to eight asynchronous IO requests being outstanding (issued but not yet fulfilled because the system had to wait for disk IO) at any point in time. The choice of allowing up to only eight IO requests in the queue was arbitrary, but typically an application will limit the number of outstanding requests so the system does not become bogged down. In this test, the benchmark reported almost three times the bandwidth. The abridged results are shown below. The IO depths show how many asynchronous IO requests were issued but had not returned data to the application during the course of execution. The figures are reported for intervals from the previous figure; for example, 8=96.0% tells you that 96% of the time there were five, six, seven, or eight requests in the async IO queue, while, based on 4=4.0%, 4% of the time there were only three or four requests in the queue.

$ cat random-read-test-aio.fio
; same as random-read-test.fio
; ...
ioengine=libaio
iodepth=8
direct=1
invalidate=1

$ fio random-read-test-aio.fio
random-read: (groupid=0, jobs=1): err= 0: pid=31318
  read : io=128MiB, bw=2,352KiB/s, iops=574, runt= 57061msec
    slat (usec): min=8, max=260, avg=25.90, stdev=23.23
    clat (usec): min=1, max=124K, avg=13901.91, stdev=12193.87
    bw (KiB/s) : min= 0, max= 5603, per=97.59%, avg=2295.43, stdev=590.60
  ...
  IO depths : 1=0.1%, 2=0.1%, 4=4.0%, 8=96.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  ...
Run status group 0 (all jobs):
   READ: io=128MiB, aggrb=2,352KiB/s, minb=2,352KiB/s, maxb=2,352KiB/s, mint=57061msec, maxt=57061msec

Random reads are always going to be limited by the seek time of the disk head. Because the async IO test could issue as many as eight IO requests before waiting for any to complete, there was more chance for reads in the same disk area to be completed together, and thus an overall boost in IO bandwidth.
The HOWTO file from the fio distribution gives full details of the options you can use to specify benchmark workloads. One of the more interesting parameters is rw, which can specify sequential or random reads and/or writes in many combinations. The ioengine parameter selects how the IO requests are issued to the kernel. The invalidate option causes the kernel buffer and page cache to be invalidated for a file before beginning the benchmark. The runtime option specifies that a test should run for a given amount of time and then be considered complete. The thinktime parameter inserts a specified delay between IO requests, which is useful for simulating a real application that would normally perform some work on data that is being read from disk. fsync=n can be used to issue a sync call after every n writes issued. write_iolog and read_iolog cause fio to write or read a log of all the IO requests issued. With these commands you can capture a log of the exact IO commands issued, edit that log to give exactly the IO workload you want, and benchmark those exact IO requests. The iolog options are great for importing an IO access pattern from an existing application for use with fio.
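A hypothetical job file pulling several of these options together might look like the following (the option names are real fio parameters as described above; the values are arbitrary illustrations, and thinktime is given in microseconds):

```
; hypothetical job: timed random writes with think time and periodic syncs
[timed-randwrite]
rw=randwrite
size=1g
runtime=60
time_based
thinktime=1000
fsync=32
write_iolog=randwrite.iolog
```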
Simulating servers
You can also specify multiple threads or processes to all submit IO work at the same time to benchmark server-like filesystem interaction. In the following example I have four different processes, each issuing their own IO loads to the system, all running at the same time. I've based the example on having two memory-mapped query engines, a background updater thread, and a background writer thread. The difference between the two writing threads is that the writer thread is to simulate writing a journal, whereas the background updater must read and write (update) data. bgupdater has a thinktime of 40 microseconds, causing the process to sleep for a little while after each completed IO.
$ cat four-threads-randio.fio
; Four threads, two query, two writers.
[global]
rw=randread
size=256m
directory=/tmp/fio-testing/data
ioengine=libaio
iodepth=4
invalidate=1
direct=1
[bgwriter]
rw=randwrite
iodepth=32
[queryA]
iodepth=1
ioengine=mmap
direct=0
thinktime=3
[queryB]
iodepth=1
ioengine=mmap
direct=0
thinktime=5
[bgupdater]
rw=randrw
iodepth=16
thinktime=40
size=32m

$ fio four-threads-randio.fio
bgwriter: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
queryA: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
queryB: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
bgupdater: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
Starting 4 processes
bgwriter: (groupid=0, jobs=1): err= 0: pid=3241
  write: io=256MiB, bw=7,480KiB/s, iops=1,826, runt= 35886msec
    slat (usec): min=9, max=106K, avg=35.29, stdev=583.45
    clat (usec): min=117, max=224K, avg=17365.99, stdev=24002.00
    bw (KiB/s) : min= 0, max=14636, per=72.30%, avg=5746.62, stdev=5225.44
  cpu : usr=0.40%, sys=4.13%, ctx=18254, majf=0, minf=9
  IO depths : 1=0.1%, 2=0.1%, 4=0.4%, 8=3.3%, 16=59.7%, 32=36.5%, >=64=0.0%
     issued r/w: total=0/65536, short=0/0
     lat (usec): 250=0.05%, 500=0.33%, 750=0.70%, 1000=1.11%
     lat (msec): 2=7.06%, 4=14.91%, 10=27.10%, 20=21.82%, 50=20.32%
     lat (msec): 100=4.74%, 250=1.86%
queryA: (groupid=0, jobs=1): err= 0: pid=3242
  read : io=256MiB, bw=589MiB/s, iops=147K, runt= 445msec
    clat (usec): min=2, max=165, avg= 3.48, stdev= 2.38
  cpu : usr=70.05%, sys=30.41%, ctx=91, majf=0, minf=65545
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=65536/0, short=0/0
     lat (usec): 4=76.20%, 10=22.51%, 20=1.17%, 50=0.05%, 100=0.05%
     lat (usec): 250=0.01%
queryB: (groupid=0, jobs=1): err= 0: pid=3243
  read : io=256MiB, bw=455MiB/s, iops=114K, runt= 576msec
    clat (usec): min=2, max=303, avg= 3.48, stdev= 2.31
    bw (KiB/s) : min=464158, max=464158, per=1383.48%, avg=464158.00, stdev= 0.00
  cpu : usr=73.22%, sys=26.43%, ctx=69, majf=0, minf=65545
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=65536/0, short=0/0
     lat (usec): 4=76.81%, 10=21.61%, 20=1.53%, 50=0.02%, 100=0.03%
     lat (usec): 250=0.01%, 500=0.01%
bgupdater: (groupid=0, jobs=1): err= 0: pid=3244
  read : io=16,348KiB, bw=1,014KiB/s, iops=247, runt= 16501msec
    slat (usec): min=7, max=42,515, avg=47.01, stdev=665.19
    clat (usec): min=1, max=137K, avg=14215.23, stdev=20611.53
    bw (KiB/s) : min= 0, max= 1957, per=2.37%, avg=794.90, stdev=495.94
  write: io=16,420KiB, bw=1,018KiB/s, iops=248, runt= 16501msec
    slat (usec): min=9, max=42,510, avg=38.73, stdev=663.37
    clat (usec): min=202, max=229K, avg=49803.02, stdev=34393.32
    bw (KiB/s) : min= 0, max= 1840, per=10.89%, avg=865.54, stdev=411.66
  cpu : usr=0.53%, sys=1.39%, ctx=12089, majf=0, minf=9
  IO depths : 1=0.1%, 2=0.1%, 4=0.3%, 8=22.8%, 16=76.8%, 32=0.0%, >=64=0.0%
     issued r/w: total=4087/4105, short=0/0
     lat (usec): 2=0.02%, 4=0.04%, 20=0.01%, 50=0.06%, 100=1.44%
     lat (usec): 250=8.81%, 500=4.24%, 750=2.56%, 1000=1.17%
     lat (msec): 2=2.36%, 4=2.62%, 10=9.47%, 20=13.57%, 50=29.82%
     lat (msec): 100=19.07%, 250=4.72%

Run status group 0 (all jobs):
   READ: io=528MiB, aggrb=33,550KiB/s, minb=1,014KiB/s, maxb=589MiB/s, mint=445msec, maxt=16501msec
  WRITE: io=272MiB, aggrb=7,948KiB/s, minb=1,018KiB/s, maxb=7,480KiB/s, mint=16501msec, maxt=35886msec

Disk stats (read/write):
  dm-6: ios=4087/69722, merge=0/0, ticks=58049/1345695, in_queue=1403777, util=99.74%

As one would expect, the bandwidth the array achieved for the query and writer processes was vastly different. Queries proceed at roughly 500MiB/s, while writing comes in at about 1MiB/s for the mixed read/write load and 7.5MiB/s for the pure write load. The IO depths show the number of pending IO requests that are queued when an IO request is issued.
For example, for the bgupdater process, nearly a quarter of the async IO requests were fulfilled with eight or fewer requests in the queue out of a potential 16. In contrast, the bgwriter had more than half of its requests performed with 16 or fewer pending requests in the queue.
To contrast with the three-disk RAID-5 configuration, I reran the four-threads-randio.fio test on a single Western Digital 750GB drive. The bgupdater process achieved less than half the bandwidth and each of the query processes ran at 1/3 the overall bandwidth. For this test the Western Digital drive was on a different computer with different CPU and RAM specifications as well, so any comparison should be taken with a grain of salt.
bgwriter: (groupid=0, jobs=1): err= 0: pid=14963
  write: io=256MiB, bw=6,545KiB/s, iops=1,597, runt= 41013msec
queryA: (groupid=0, jobs=1): err= 0: pid=14964
  read : io=256MiB, bw=160MiB/s, iops=39,888, runt= 1643msec
queryB: (groupid=0, jobs=1): err= 0: pid=14965
  read : io=256MiB, bw=163MiB/s, iops=40,680, runt= 1611msec
bgupdater: (groupid=0, jobs=1): err= 0: pid=14966
  read : io=16,416KiB, bw=422KiB/s, iops=103, runt= 39788msec
  write: io=16,352KiB, bw=420KiB/s, iops=102, runt= 39788msec
   READ: io=528MiB, aggrb=13,915KiB/s, minb=422KiB/s, maxb=163MiB/s, mint=1611msec, maxt=39788msec
  WRITE: io=272MiB, aggrb=6,953KiB/s, minb=420KiB/s, maxb=6,545KiB/s, mint=39788msec, maxt=41013msec

The vast array of ways that fio can issue its IO requests lends it to benchmarking IO patterns and the use of various APIs to perform that IO. You can also run identical fio configurations on different filesystems or underlying hardware to see what difference changes at that level will make to performance.
Benchmarking different IO request systems for a particular IO pattern can be handy if you are about to write an IO-intensive application but are not sure which API and design will work best on your hardware. For example, you could keep the disk system and RAM fixed and see how well an IO load would be serviced using memory-mapped IO or the Linux asyncio interface. Of course this requires you to have a very intricate knowledge of the typical IO requests that your application will issue. If you already have a tool that uses something like memory-mapped files, then you can get IO patterns for typical use from the existing tool, feed them into fio using different IO engines, and get a reasonable picture of whether it might be worth porting the application to a different IO API for better performance.
Ben Martin has been working on filesystems for more than 10 years. He completed his Ph.D. and now offers consulting services focused on libferris, filesystems, and search solutions.
Indications of an I/O-bound system (field, threshold, sar option):
  %busy    (% of time the disk is busy)                >85                              sar -d
  %rcache  (reads found in buffer cache)               low, <85                         sar -b
  %wcache  (writes found in buffer cache)              low, <60%                        sar -b
  %wio     (idle CPU waiting for disk I/O)             dev system >30, fileserver >80   sar -u

Indications of a memory-bound system:
  bswot/s  (transfers from memory to disk swap area)   >200                             sar -w
  bswin/s  (transfers to memory)                       >200                             sar -w
  %swpocc  (% of time swap queue is occupied)          >10                              sar -q
  rflt/s   (page reference faults)                     >0                               sar -t
  freemem  (average pages free for user processes)     <100                             sar -r

Indications of a CPU-bound system:
  %idle    (% of time CPU has no work to do)           <5                               sar -u
  runq-sz  (processes in memory waiting for CPU)       >2                               sar -q
  %runocc  (% of time run queue is occupied)           >90                              sar -q
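These counters come straight from sar's option groups, so checking one of them is a one-liner. A guarded sketch for the CPU-bound case (%idle from the table above; on Linux, sar is provided by the sysstat package, and the interval/count values are arbitrary):

```shell
# Sample CPU utilization (including %idle) three times, 5 seconds apart;
# skip cleanly where sar is not installed.
if command -v sar >/dev/null 2>&1; then
    sar -u 5 3
else
    echo "sar not available; install the sysstat package"
fi
```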
-by Jon Hill and Kemer Thomson
This article presents the rationale for formal system performance management from a management, systems administrative and vendor perspective. It describes four classes of systems monitoring tools and their uses. The article discusses the issues of tool integration, "best-of-breed versus integrated suite" and the decision to "buy versus build."
Observability (December 1999)
-by Adrian Cockcroft
Discusses Capacity Planning and Performance Management techniques.
Other Cockcroft columns at www.sun.com
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links, as it develops like a living tree...
You can use PayPal to buy a cup of coffee for the authors of this site.
Disclaimer:
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of Google's privacy policy. If you do not want to be tracked by Google, please disable JavaScript for this site. The site is perfectly usable without JavaScript.
Last modified: March 29, 2020