May the source be with you, but remember the KISS principle ;-)
Contents Bulletin Scripting in shell and Perl Network troubleshooting History Humor

High Performance Computing (HPC)


High Performance Components


Recommended Links

Small HPC cluster architecture

HPC hardware vendors Bright Cluster Manager
Grid Engine qalter -- Change Job Priority SGE Queues qstat qsub qconf SGE Parallel Environment
Slot limits and restricting number of slots per server Managing User Access SGE cheat sheet SGE Consumable Resources Installation of Grid Engine Master Host Job or Queue Reported in Error State E SGE hostgroups
SGE Execution Host Installation Installing Mellanox InfiniBand Driver on RHEL 6.5 Infiniband MLNX_OFED Message Passing Interface perf stat  
Tools C3 Tools PDSH -- a parallel remote shell rdist rsync   Parallel command execution
Clustered Parallel File Systems GPFS on Red Hat Lustre Optimizing usage of NFS in Grid Engine NFS performance tuning Unix System Monitoring DDR3-1866 Memory Performance
PowerEdge C6220 II Rack Server  Linux Troubleshooting Linux Performance Tuning Suse performance tuning Trunking / Bonding Multiple Network Interfaces Bonding Ethernet Interfaces in Red Hat Linux  
Intel Composer XE Lmod – Alternative Environment Modules VASP Performance optimization Accelrys install      
uptime command mostat top ps sar ptree  
vmstat iostat nfsstat HPC Humor Admin Horror Stories Humor Etc

Partially based on

Most HPC systems use the concept of parallelism. HPC hardware falls into three categories:

The term "cluster" can take different meanings in different contexts. We discuss here High-performance clusters. These clusters are used to run parallel programs for time-intensive computations and are mostly used by the scientific community. They commonly run simulations and other CPU-intensive programs. Two other types of clusters exist:

Grid computing is a broad term that typically refers to a set of (not necessary uniform like in cluster) servers connected to common scheduler, for example Sun Grid Engine.  From this point of view  HPC is a special case of grid computing in which the nodes are uniform and tightly coupled.

Some features of HPC are as follows:

Parallel programming and Amdahl's Law

Parallel programming mostly is limited to a certain subclass of computational algorithms (and recently to search and genomic decoding algorithms). Many algorithms are no parallyzable.

Parallel programming (like all programming) is as much art as science, always leaving room for major design improvements and performance enhancements. Software and hardware go hand in hand when it comes to achieving high performance on a cluster. Programs must be written to explicitly take advantage of the underlying hardware, and existing non-parallel programs must be re-written if they are to perform well on a cluster.

A parallel program does many things at once. Just how many depends on the problem at hand. Suppose 1/N of the total time taken by a program is in a part that can not be parallelized, and the rest (1-1/N) is in the parallelizable part (see Figure 3).

Figure 3. Illustrating Amdahl's Law
Illustrating Amdahl's Law

In theory you could apply an infinite amount of hardware to do the parallel part in zero time, but the sequential part will see no improvements. As a result, the best you can achieve is to execute the program in 1/N of the original time, but no faster. In parallel programming, this fact is commonly referred to as Amdahl's Law.

Amdahl's Law governs the speedup of using parallel processors on a problem versus using only one serial processor. Speedup is defined as the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many processors):

S = ------

Where T(j) is the time it takes to execute the program when using j processors.

The real hard work in writing a parallel program is to make N as large as possible. But there is an interesting twist to it. You normally attempt bigger problems on more powerful computers, and usually the proportion of the time spent on the sequential parts of the code decreases with increasing problem size (as you tend to modify the program and increase the parallelizable portion to optimize the available resources). Therefore, the value of N automatically becomes large. (See the re-evaluation of Amdhal's Law in the Resources section later in this article.)

Approaches to parallel programming

Creating a parallel program is a huge challenge and typically scalability for multiple cores is limited. Distributing computation between nodes via MPI is another huge challenge and return on investment after certain amount of nodes can be negative.

Most researchers I worked with are extremly naive and generally subscribe to the simple religion "the more cores the better".

There are two major parallel programming approaches:

Distributed memory approach

It is useful to think a master-slave model here:

Obvious practical problems in this approach stem from the distributed-memory organization. Because each node has access to only its own memory, data structures must be duplicated and sent over the network if other nodes want to access them, leading to network overhead.

Shared memory approach

In the shared-memory approach, memory is common to all processors (such as SMP). This approach does not suffer from the problems mentioned in the distributed-memory approach. Also, programming for such systems is easier since all the data is available to all processors and is not much different from sequential programming. The big issue with these systems is scalability: it is not easy to add extra processors.

When file I/O becomes a bottleneck

Some applications frequently need to read and write large amounts of data to the disk, which is often the slowest step in a computation. This is for example can be the case in genome decoding. Faster, SSD drives help, but there are times when they are not enough.

The problem becomes especially pronounced if a physical disk partition is shared between all nodes (using NFS or GPFS, which for simplisity can be viewed as NSF with multiple masters).

Parallel filesystems such as GPFS can slightly improve the sbottlebeck between the NFS server and the switch to which computational nodes are connected (which typically in limited to 40 Mbit or 100 Mbit). 

Parallel filesystems spread the data in a file over several disks attached to multiple specialized (or non-specialized) nodes of the cluster, known as I/O nodes. When a program tries to read a file, small portions of that file are read from several disks in parallel. This reduces the load on any given disk controller and allows it to handle more requests. (PVFS is a good example of an open source parallel filesystem; disk performance of better than 1 GBsec has been achieved on Linux clusters using standard IDE hard disks.)

Open source cluster application resources

Clearly, it will be hard to maintain the cluster above. It is not convenient to copy files to every node, set up SSH and MPI on every node that gets added, make appropriate changes when a node is removed, and so on.

Fortunately, there are some intergrated solution such as Rocks, which provides most of the things we need for the cluster and automate some of the typical tasks.  When it comes to managing a cluster in a production environment with a large user base, job scheduling and monitoring are crucial. 

They include

  1. Scheduler,
  2. Monitoring subsystem
  3. Performance measuring tools
  4. Imaging subsystem


Sun grid engine (now Oracle grid engine) can be used a powerful scheduler for computational clusters. Another  popular scheduling system is OpenPBS, and Torque. Using it you can create queues and submit jobs on them.

You can also create sophisticated job-scheduling policies.

All "grid" schedulers let you view executing jobs, submit jobs, and cancel jobs. It also allows control over the maximum amount of CPU time available to a particular job, which is quite useful for an administrator.

Monitoring subsystem

An important aspect of managing clusters is monitoring, especially if your cluster has a large number of nodes. Several options are available, such as Nagios or Ganglia

Ganglia has a Web-based front end and provides real-time monitoring for CPU and memory usage; you can easily extend it to monitor just about anything. For example, with simple scripts you can make Ganglia report on CPU temperatures, fan speeds, etc.

Measuring performance: Megaflops as false/fake measure of performance fo HPC clusters

Clusters are built to perform, and you need to know how fast they are. Standard benchmark based on  LINPACK calculations is completely bizarre and false measure in many respects but  one: applications with unlimited parallelism.

The real applications usually do not  scale above, say 128 cores well and  many doe not scale above 32 cores. So megaflop value used fro large cluster and in  top500 index is just "art for the  sake of art" and has very little connection to reality.  It is a fake measure that suggests that more nodes and more core are better. That's completly not  true. If your application does not scale above say 16 node that only thing you can do is to run many instances of such  application in paraallel. You will never achieve higher productvity from a cluater with higher  megaflop rating and most probably you will get worse not batter duration of computations.

IBM is among one firms that  propagate this nonsense:

Cluster is generally is as good as each single computational node and megoflops tent  to push designed  toward larger anout of cores per CPU, which is completely wrong approach. It's common to think that the processor frequency determines performance. While this is true to a certain extent, it is of little value in comparing processors from different vendors or even different processor families from the same vendor because different processors do different amounts of work in a given number of clock cycles. This was especially obvious when we compared vector processors with scalar processors (see Part 1).  Speed of memory  also matter.

A more natural way to compare performance is to run some standard tests. Over the years a test known as the LINPACK benchmark has become a gold standard when comparing performance of HPc clusters. It was written by Jack Dongarra more than a decade ago and is still used by (see Resources for a link).  This test is fake in  a sense  that it assumes unlimited parallelism.

This test involves solving a dense system of N linear equations, where the number of floating-point operations is known (of the order of N^3). This test is well suited to speed test computers meant to run scientific applications and simulations because they tend to solve linear equations at some stage or another.

The standard unit of measurement is the number of floating-point operations or flops per second (in this case, a flop is either an addition or a multiplication of a 64-bit number). The test measures the following:

To appreciate these numbers, consider that IBM BlueGene/L can compute in one second that task that on your home computer may take up to five days.

This phaze "IBM BlueGene/L can compute in one second that task that on your home computer may take up to five days" is complete nonsense. Much depends  on particular application.

Top Visited
Past week
Past month


Old News ;-)

[Dec 25, 2017] Huawei Showcases HPC Solutions at SC16

YouTube video.
Nov 29, 2016 |

In this video from SC16, Francis Lam from Huawei describes the company's broad range of HPC and liquid cooling solutions.

"Huawei has increasingly become more prominent in the HPC market. It has successfully deployed HPC clusters for a large number of global vehicle producers, large-scale supercomputing centers, and research institutions. These show that Huawei's HPC platforms are optimized for industry applications which can help customers significantly simplify service processes and improve work efficiency, enabling them to focus on product development and research."

At SC16, Huawei also showcased high-density servers FusionServer E9000 and X6800 with 100G high-speed interconnect technology and powerful cluster scalability. These servers can dramatically accelerate industrial CAE simulation. Huawei was also displaying its KunLun supercomputer, which features the company's proprietary Node Controller interconnect chips. KunLun supports up to 32 CPUs and is perfect for cutting-edge research scenarios that require frequent in-memory computing. Additionally Huawei's high-performance storage solutions, OceanStor 9000 and ES3000 SSD, which can effectively improve data transmission bandwidth and latency, are on display.

Huawei and ANSYS announced the two parties' partnership in building industrial computer aided engineering (CAE) solutions, and jointly released their Fluent Performance white paper, which details optimal system configurations in fluid simulation scenarios for industry customers.

Learn more:

[Oct 24, 2017] LAMMPS -- a classical molecular dynamics software

Oct 24, 2017 |

LAMMPS ( ) is a classical molecular dynamics code that
models an ensemble of particles in a liquid, solid,

or gaseous state. It can model atomic, polymeric,
biological, metallic, granular, and coarse-grained
systems using a variety of force fields and bound-
ary conditions. LAMMPS runs efficiently on
single-processor desktop or laptop machines,
but is designed for parallel computers. It will run
on any parallel machine that compiles C++ and
supports the MPI message-passing library. This
includes distributed- or shared-memory parallel
machines and Beowulf-style clusters. LAMMPS
can model systems with only a few particles up
to millions or billions.
The current version of LAMMPS is written in
C++. In the most general sense, LAMMPS inte-
grates Newton's equations of motion for collec-
tions of atoms, molecules, or macroscopic particles
that interact via short- or long-range forces with a
variety of initial and/or boundary conditions. For
computational efficiency LAMMPS uses neighbor
lists to keep track of nearby particles. The lists
are optimized for systems with particles that are
repulsive at short distances, so that the local density
of particles never becomes too large. On parallel
machines, LAMMPS uses spatial-decomposition
techniques to partition the simulation domain into small 3D sub-domains, one of which is assigned
to each processor. Processors communicate and
store "ghost" atom information for atoms that
border their sub-domain. The simulation used
in this study is a strong scaling analysis with the
RhodoSpin benchmark. The run time to compute
the dynamics of the atomic fluid with 32,000
atoms for 100 time steps is measured. The execu-
tion time is shown in Figure 10. This LAMMPS
benchmark is not memory intensive and does not
show significant difference in performance when
memory and processor affinity arc forced. Red
Storm scales well even beyond 64 tasks although
the balance of computation to communication
is steadily decreased for this strong scaling test.
Instrumentation data is being collected using
performance tools to understand why TLCC does
not scale beyond 64 MPI tasks.

[Oct 24, 2017] The combine law of Parkinson-Murphy

"The increase of capacity and quantity of resources of any system does not affect the efficiency of its operation, since all new resources and even some of the old ones would be wasted on eliminations of internal problems (errors) that arise as a result of the very increase in resources.". One only has to look at the space science sphere right now.

[Oct 23, 2017] Optimizing HPC Applications with Intel Cluster Tools

Oct 23, 2017 |

Table of Contents

Chapter 1: No Time to Read This Book?

Chapter 2: Overview of Platform Architectures

Chapter 3: Top-Down Software Optimization

Chapter 4: Addressing System Bottlenecks

Chapter 5: Addressing Application Bottlenecks: Distributed Memory

Chapter 6: Addressing Application Bottlenecks: Shared Memory

Chapter 7: Addressing Application Bottlenecks: Microarchitecture

Chapter 8: Application Design Considerations

[Oct 17, 2017] Perf- A Performance Monitoring and Analysis Tool for Linux

Oct 17, 2017 |

In a day of fierceless competition between companies, it is important that we learn how to use what we have at the best of its capacity. The waste of hardware or software resources, or the lack of ability to know how to use them more efficiently, ends up being a loss that we just can't afford if we want to be at the top of our game.

At the same time, we must be careful to not take our resources to a limit where sustained use will yield irreparable damage.

In this article we will introduce you to a relatively new performance analysis tool and provide tips that you can use to monitor your Linux systems, including hardware and applications. This will help you to ensure that they operate so that you are capable to produce the desired results without wasting resources or your own energy.

Introducing and installing Perf in Linux

Among others, Linux provides a performance monitoring and analysis tool called conveniently perf . So what distinguishes perf from other well-known tools with which you are already familiar?

The answer is that perf provides access to the Performance Monitoring Unit in the CPU, and thus allows us to have a close look at the behavior of the hardware and its associated events.

In addition, it can also monitor software events, and create reports out of the data that is collected.

You can install perf in RPM-based distributions with:

# yum update && yum install perf     [CentOS / RHEL / Fedora]
# dnf update && dnf install perf     [Fedora 23+ releases]

In Debian and derivatives:

# sudo aptitude update && sudo aptitude install linux-tools-$(uname -r) linux-tools-generic

If uname -r in the command above returns extra strings besides the actual version ( 3.2.0-23-generic in my case), you may have to type linux-tools-3.2.0-23 instead of using the output of uname

It is also important to note that perf yields incomplete results when run in a guest on top of VirtualBox or VMWare as they do not allow access to hardware counters as other virtualization technologies (such as KVM or XEN ) do.

Additionally, keep in mind that some perf commands may be restricted to root by default, which can be disabled (until the system is rebooted) by doing:

# echo 0 > /proc/sys/kernel/perf_event_paranoid

If you need to disable paranoid mode permanently, update the following setting in /etc/sysctl.conf file.

kernel.perf_event_paranoid = 0

Once you have installed perf , you can refer to its man page for a list of available subcommands (you can think of subcommands as special options that open a specific window into the system). For best and more complete results, use perf either as root or through sudo

Perf list

perf list (without options) returns all the symbolic event types (long list). If you want to view the list of events available in a specific category, use perf list followed by the category name ([ hw|sw|cache|tracepoint|pmu|event_glob ]), such as:

Display list of software pre-defined events in Linux:

# perf list sw
List Software Pre-defined Events in Linux

List Software Pre-defined Events in Linux Perf stat

perf stat runs a command and collects Linux performance statistics during the execution of such command. What happens in our system when we run dd

# perf stat dd if=/dev/zero of=test.iso bs=10M count=1
Collects Performance Statistics of Linux Command

Collects Performance Statistics of Linux Command

The stats shown above indicate, among other things:

  1. The execution of the dd command took 21.812281 milliseconds of CPU. If we divide this number by the "seconds time elapsed" value below ( 23.914596 milliseconds), it yields 0.912 (CPU utilized).
  2. While the command was executed, 15 context-switches (also known as process switches) indicate that the CPUs were switched 15 times from one process (or thread) to another.
  3. CPU migrations is the expected result when in a 2-core CPU the workload is distributed evenly between the number of cores.
    During that time ( 21.812281 milliseconds), the total number of CPU cycles that were consumed was 62,025,623 , which divided by 0.021812281 seconds gives 2.843 GHz.
  4. If we divide the number of cycles by the total instructions count we get 4.9 Cycles Per Instruction, which means each instruction took almost 5 CPU cycles to complete (on average). We can blame this (at least in part) on the number of branches and branch-misses (see below), which end up wasting or misusing CPU cycles.
  5. As the command was executed, a total of 3,552,630 branches were encountered. This is the CPU-level representation of decision points and loops in the code. The more branches, the lower the performance. To compensate for this, all modern CPUs attempt to predict the flow the code will take. 51,348 branch-misses indicate the prediction feature was wrong 1.45% of the time.

The same principle applies to gathering stats (or in other words, profiling) while an application is running. Simply launch the desired application and after a reasonable period of time (which is up to you) close it, and perf will display the stats in the screen. By analyzing those stats you can identify potential problems.

Perf top

perf top is similar to top command , in that it displays an almost real-time system profile (also known as live analysis).

With the -a option you will display all of the known event types, whereas the -e option will allow you to choose a specific event category (as returned by perf list ):

Will display all cycles event.

perf top -a

Will display all cpu-clock related events.

perf top -e cpu-clock
Live Analysis of Linux Performance

Live Analysis of Linux Performance

The first column in the output above represents the percentage of samples taken since the beginning of the run, grouped by function Symbol and Shared Object. More options are available in man perf-top

Perf record

perf record runs a command and saves the statistical data into a file named inside the current working directory. It runs similarly to perf stat

Type perf record followed by a command:

# perf record dd if=/dev/null of=test.iso bs=10M count=1
Record Command Statistical Data

Record Command Statistical Data Perf report

perf report formats the data collected in above into a performance report:

# sudo perf report
Perf Linux Performance Report

Perf Linux Performance Report

All of the above subcommands have a dedicated man page that can be invoked as:

# man perf-subcommand

where subcommand is either list , stat top record , or report . These are the most frequently used subcommands; others are listed in the documentation (refer to the Summary section for the link).


In this guide we have introduced you to perf , a performance monitoring and analysis tool for Linux. We highly encourage you to become familiar with its documentation which is maintained in .

If you find applications that are consuming a high percentage of resources, you may consider modifying the source code, or use other alternatives.

If you have questions about this article or suggestions to improve, we are all ears. Feel free to reach us using the comment form below.

[Oct 17, 2017] perf-stat(1) - Linux man page

Oct 17, 2017 |


perf-stat - Run a command and gather performance counter statistics

perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
perf stat [-e <EVENT> | --event=EVENT] [-a] - <command> [<options>]

This command runs a command and gathers performance counter statistics from it.



Any command you can specify in a shell.
-e, --event=
Select the PMU event. Selection can be a symbolic event name (use perf list to list all events) or a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a hexadecimal event descriptor.
-i, --no-inherit
child tasks do not inherit counters
-p, --pid=<pid>
stat events on existing process id (comma separated list)
-t, --tid=<tid>
stat events on existing thread id (comma separated list)
-a, --all-cpus
system-wide collection from all CPUs
-c, --scale
scale/normalize counter values
-r, --repeat=<n>
repeat command and print average + stddev (max: 100)
-B, --big-num
print large numbers with thousands' separators according to locale
-C, --cpu=
Count only on the list of CPUs provided. Multiple CPUs can be provided as a comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2. In per-thread mode, this option is ignored. The -a option is still necessary to activate system-wide monitoring. Default is to count on all CPUs.
-A, --no-aggr
Do not aggregate counts across all monitored CPUs in system-wide mode (-a). This option is only valid in system-wide mode.
-n, --null
null run - don't start any counters
-v, --verbose
be more verbose (show counter open errors, etc)
-x SEP, --field-separator SEP
print counts using a CSV-style output to make it easy to import directly into spreadsheets. Columns are separated by the string specified in SEP.
-G name, --cgroup name
monitor only in the container (cgroup) called "name". This option is available only in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to container "name" are monitored when they run on the monitored CPUs. Multiple cgroups can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup to first event, second cgroup to second event and so on. It is possible to provide an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have corresponding events, i.e., they always refer to events defined earlier on the command line.
-o file, --output file
Print the output into the designated file.
Append to the output file designated with the -o option. Ignored if -o is not specified.
Log output to fd, instead of stderr. Complementary to --output, and mutually exclusive with it. --append may be used here. Examples: 3>results perf stat --log-fd 3 - $cmd 3>>results perf stat --log-fd 3 --append - $cmd

$ perf stat - make -j

Performance counter stats for 'make -j':
8117.370256  task clock ticks     #      11.281 CPU utilization factor
        678  context switches     #       0.000 M/sec
        133  CPU migrations       #       0.000 M/sec
     235724  pagefaults           #       0.029 M/sec
24821162526  CPU cycles           #    3057.784 M/sec
18687303457  instructions         #    2302.138 M/sec
  172158895  cache references     #      21.209 M/sec
   27075259  cache misses         #       3.335 M/sec
Wall-clock time elapsed:   719.554352 msecs
See Also

perf-top (1), perf-list (1)

Referenced By perf (1), perf-record (1), perf-report (1) perf-record(1) - Linux man page Name

perf-record - Run a command and record its profile into

perf record [-e <EVENT> | --event=EVENT] [-l] [-a] <command>
perf record [-e <EVENT> | --event=EVENT] [-l] [-a] - <command> [<options>]

This command runs a command and gathers a performance counter profile from it, into - without displaying anything.

This file can then be inspected later on, using perf report .



Any command you can specify in a shell.
-e, --event=
Select the PMU event. Selection can be a symbolic event name (use perf list to list all events) or a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a hexadecimal event descriptor.
Event filter.
-a, --all-cpus
System-wide collection from all CPUs.
Scale counter values.
-p, --pid=
Record events on existing process ID (comma separated list).
-t, --tid=
Record events on existing thread ID (comma separated list).
-u, --uid=
Record events in threads owned by uid. Name or number.
-r, --realtime=
Collect data with this RT SCHED_FIFO priority.
-D, --no-delay
Collect data without buffering.
-A, --append
Append to the output file to do incremental profiling.
-f, --force
Overwrite existing data file. (deprecated)
-c, --count=
Event period to sample.
-o, --output=
Output file name.
-i, --no-inherit
Child tasks do not inherit counters.
-F, --freq=
Profile at this frequency.
-m, --mmap-pages=
Number of mmap data pages. Must be a power of two.
-g, --call-graph
Do call-graph (stack chain/backtrace) recording.
-q, --quiet
Don't print any message, useful for scripting.
-v, --verbose
Be more verbose (show counter open errors, etc).
-s, --stat
Per thread counts.
-d, --data
Sample addresses.
-T, --timestamp
Sample timestamps. Use it with perf report -D to see the timestamps, for instance.
-n, --no-samples
Don't sample.
-R, --raw-samples
Collect raw sample records from all opened counters (default for tracepoint counters).
-C, --cpu
Collect samples only on the list of CPUs provided. Multiple CPUs can be provided as a comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2. In per-thread mode with inheritance mode on (default), samples are captured only when the thread executes on the designated CPUs. Default is to monitor all CPUs.
-N, --no-buildid-cache
Do not update the builid cache. This saves some overhead in situations where the information in the file (which includes buildids) is sufficient.
-G name,..., --cgroup name,...
monitor only in the container (cgroup) called "name". This option is available only in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to container "name" are monitored when they run on the monitored CPUs. Multiple cgroups can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup to first event, second cgroup to second event and so on. It is possible to provide an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have corresponding events, i.e., they always refer to events defined earlier on the command line.
-b, --branch-any
Enable taken branch stack sampling. Any type of taken branch may be sampled. This is a shortcut for --branch-filter any. See --branch-filter for more infos.
-j, --branch-filter
Enable taken branch stack sampling. Each sample captures a series of consecutive taken branches. The number of branches captured with each sample depends on the underlying hardware, the type of branches of interest, and the executed code. It is possible to select the types of branches captured by enabling filters. The following filters are defined:
Б─╒ any: any type of branches
Б─╒ any_call: any function call or system call
Б─╒ any_ret: any function return or system call return
Б─╒ ind_call: any indirect branch
Б─╒ u: only when the branch target is at the user level
Б─╒ k: only when the branch target is in the kernel
Б─╒ hv: only when the target is at the hypervisor level
The option requires at least one branch type among any, any_call, any_ret, ind_call. The privilege levels may be ommitted, in which case, the privilege levels of the associated event are applied to the branch filter. Both kernel (k) and hypervisor (hv) privilege levels are subject to permissions. When sampling on multiple events, branch stack sampling is enabled for all the sampling events. The sampled branch type is the same for all events. The various filters must be specified as a comma separated list: --branch-filter any_ret,u,k Note that this feature may not be available on all processors.
See Also

perf-stat (1), perf-list (1)

Referenced By perf (1), perf-annotate (1), perf-archive (1), perf-buildid-cache (1), perf-buildid-list (1), perf-diff (1), perf-evlist (1), perf-inject (1), perf-kmem (1), perf-kvm (1), perf-probe (1), perf-sched (1), perf-script (1), perf-timechart (1)

[Oct 15, 2017] cp2k download

Oct 15, 2017 |

[Oct 14, 2017] Performance analysis in Linux

Notable quotes:
"... Based on the example from here . ..."
Oct 14, 2017 |

Posted on 21/03/2017 by Gabriel Krisman Bertazi

Dynamic profilers are tools to collect data statistics about applications while they are running, with minimal intrusion on the application being observed.

The kind of data that can be collected by profilers varies deeply, depending on the requirements of the user. For instance, one may be interested in the amount of memory used by a specific application, or maybe the number of cycles the program executed, or even how long the CPU was stuck waiting for data to be fetched from the disks. All this information is valuable when tracking performance issues, allowing the programmer to identify bottlenecks in the code, or even to learn how to tune an application to a specific environment or workload.

In fact, maximizing performance or even understanding what is slowing down your application is a real challenge on modern computer systems. A modern CPU carries so many hardware techniques to optimize performance for the most common usage case, that if an application doesn't intentionally exploit them, or worse, if it accidentally lies in the special uncommon case, it may end up experiencing terrible results without doing anything apparently wrong.

Let's take a quite non-obvious way of how things can go wrong, as an example.

Forcing branch mispredictions

Based on the example from here .

The code below is a good example of how non-obvious performance assessment can be. In this function, the first for loop initializes a vector of size n with random values ranging from 0 to N. We can assume the values are well distributed enough for the vector elements to be completely unsorted.

The second part of the code has a for loop nested inside another one. The outer loop, going from 0 to K, is actually a measurement trick. By executing the inner loop many times, it stresses out the performance issues in that part of the code. In this case, it helps to reduce any external factor that may affect our measurement.

The inner loop is where things get interesting. This loop crawls over the vector and decides whether the value should be accumulated in another variable, depending on whether the element is higher than N/2 or not. This is done using an if clause, which gets compiled into a conditional branch instruction, which modifies the execution flow depending on the calculated value of the condition, in this case, if vec[i] >= N/2, it will enter the if leg, otherwise it will skip it entirely.

long rand_partsum(int n)
  int i,k;
  long sum = 0;
  int *vec = malloc(n * sizeof(int));

  for (i = 0; i < n; i++)
    vec[i] = rand()%n;

  for (k = 0; k < 1000000; k++)
    for (i = 0; i < n; i++)
      if (vec[i] > n/2)
        sum += vec[i];

  return sum;

When executing the code above on an Intel Core i7-5500U, with a vector size of 5000 elements (N=5000), it takes an average of 29.97 seconds. Can we do any better?

One may notice that this vector is unsorted, since each element comes from a call to rand(). What if we sorted the vector before executing the second for loop? For the sake of the example, let's say we add a call to the glibc implementation of QuickSort right after the initialization loop.

A naive guess would suggest that the algorithm got worse, because we just added a new sorting step, thus raising the complexity of the entire code. One should assume this would result on a higher execution time.

But, in fact, when executing the sorted version in the same machine, the average execution time drops to 13.20 seconds, which is a reduction of 56% in execution time. Why does adding a new step actually reduces the execution time? The fact is that pre-sorting the vector in this case, allows the cpu to do a much better job at internally optimizing the code during execution. In this case, the issue observed was a high number of branch mispredictions, which were triggered by the conditional branch that implements the if clause.

Modern CPUs have quite deep pipelines, meaning that the instruction being fetched on any given cycle is always a few instructions down the road than the instruction actually executed on that cycle. When there is a conditional branch along the way, there are two possible paths that can be followed, and the prefetch unit has no idea which one it should choose, until all the actual condition for that instruction is calculated.

The obvious choice for the Prefetch unit on such cases is to stall and wait until the execution unit decides the correct path to follow, but stalling the pipeline like this is very costly. Instead, a speculative approach can be taken by a unit called Branch Predictor, which tries to guess which path should be taken. After the condition is calculated, the CPU verifies the guessed path: if it got the prediction right, in other words, if a branch prediction hit occurs, the execution just continues without much performance impact, but if it got it wrong, the processor needs to flush the entire pipeline, go back, and restart executing the correct path. The later is called a branch prediction miss, and is also a costly operation.

In systems with a branch predictor, like any modern CPU, the predictor is usually based on the history of the particular branches. If a conditional branch usually goes a specific way, the next time it appears, the predictor will assume it will take the same route.

Back to our example code, that if condition inside the for loop does not have any specific pattern. Since the vector elements are completely random, sometimes it will enter the if leg, sometimes it will skip it entirely. That is a very hard situation for the branch predictor, who keeps guessing wrong and triggering flushes in the pipeline, which keeps delaying the application.

In the sorted version, instead, it is very easy to guess whether it should enter the if leg or not. For the first part of the vector, where the elements are mostly < N/2, the if leg will always be skipped, while for the second part, it will always enter the if leg. The branch predictor is capable of learning this pattern after a few iterations, and is able to do much better guesses about the flow, reducing the number of branch misses, thus increasing the overall performance.

Well, pointing specific issues like this is usually hard, even for a simple code like the example above. How could we be sure that the the program is hitting enough branch mispredictions to affect performance? In fact, there are always many things that could be the cause of slowness, even for a slightly more complex program.

Perf_events is an interface in the Linux kernel and a userspace tool to sample hardware and software performance counters. It allows, among many other things, to query the CPU register for the statistics of the branch predictor, i.e. the number of prediction hits and misses of a given application.

The userspace tool, known as the perf command, is available in the usual channels of common distros. In Debian, for instance, you can install it with:

sudo apt install linux-perf

We'll dig deeper into the perf tool later on another post, but for now, let's use the, perf record and perf annotate commands, which allow tracing the program and annotating the source code with the time spent on each instruction, and the perf stat command, which allows to run a program and display statistics about it:

At first, we can instruct perf to instrument the program and trace its execution:

[krisman@dilma bm]$ perf record ./branch-miss.unsorted
[ perf record: Woken up 19 times to write data ]
[ perf record: Captured and wrote 4.649 MB (121346 samples) ]

The perf record will execute the program passed as parameter and collect performance information into a new file. This file can then be passed to other perf commands. In this case, we pass it to the perf annotate command, which crawls over each address in the program and prints the number of samples that was collected while the program was executing each instruction. Instructions with a higher number of samples indicates that the program spent more time in that region, indicating that it is hot code, and a good part of the program to try to optimize. Notice that, for modern processors, the exact position is an estimation, so this information must be used with care. As a rule of thumb, one should be looking for hot regions, instead of single hot instructions.

Below is the output of perf annotate, when analyzing the function above. The output is truncated to display only the interesting parts.

[krisman@dilma bm]$ perf annotate

        :      int rand_partsum()
        :      {
   0.00 :        74e:   push   %rbp
   0.00 :        74f:   mov    %rsp,%rbp
   0.00 :        752:   push   %rbx
   0.00 :        753:   sub    $0x38,%rsp
   0.00 :        757:   mov    %rsp,%rax
   0.00 :        75a:   mov    %rax,%rbx


   0.00 :        7ce:   mov    $0x0,%edi
   0.00 :        7d3:   callq  5d0 <time@plt>
   0.00 :        7d8:   mov    %eax,%edi
   0.00 :        7da:   callq  5c0 <srand@plt>
        :              for (i = 0; i < n; i++)
   0.00 :        7df:   movl   $0x0,-0x14(%rbp)
   0.00 :        7e6:   jmp    804 <main+0xb6>
        :                      vec[i] = rand()%n;
   0.00 :        7e8:   callq  5e0 <rand@plt>
   0.00 :        7ed:   cltd   
   0.00 :        7ee:   idivl  -0x24(%rbp)
   0.00 :        7f1:   mov    %edx,%ecx
   0.00 :        7f3:   mov    -0x38(%rbp),%rax
   0.00 :        7f7:   mov    -0x14(%rbp),%edx
   0.00 :        7fa:   movslq %edx,%rdx
   0.00 :        7fd:   mov    %ecx,(%rax,%rdx,4)
        :              for (i = 0; i < n; i++)
   0.00 :        800:   addl   $0x1,-0x14(%rbp)
   0.00 :        804:   mov    -0x14(%rbp),%eax
   0.00 :        807:   cmp    -0x24(%rbp),%eax
   0.00 :        80a:   jl     7e8 <main+0x9a>


         :              for (k = 0; k < 1000000; k++)
    0.00 :        80c:   movl   $0x0,-0x18(%rbp)
    0.00 :        813:   jmp    85e <main+0x110>
         :                      for (i = 0; i < n; i++)
    0.01 :        815:   movl   $0x0,-0x14(%rbp)
    0.00 :        81c:   jmp    852 <main+0x104>
         :                              if (vec[i] > n/2)
    0.20 :        81e:   mov    -0x38(%rbp),%rax
    6.47 :        822:   mov    -0x14(%rbp),%edx
    1.94 :        825:   movslq %edx,%rdx
   26.86 :        828:   mov    (%rax,%rdx,4),%edx
    0.08 :        82b:   mov    -0x24(%rbp),%eax
    1.46 :        82e:   mov    %eax,%ecx
    0.62 :        830:   shr    $0x1f,%ecx
    3.82 :        833:   add    %ecx,%eax
    0.06 :        835:   sar    %eax
    0.70 :        837:   cmp    %eax,%edx
    0.42 :        839:   jle    84e <main+0x100>
         :                                      sum += vec[i];
    9.15 :        83b:   mov    -0x38(%rbp),%rax
    5.91 :        83f:   mov    -0x14(%rbp),%edx
    0.26 :        842:   movslq %edx,%rdx
    5.87 :        845:   mov    (%rax,%rdx,4),%eax
    2.09 :        848:   cltq
    9.31 :        84a:   add    %rax,-0x20(%rbp)
         :                      for (i = 0; i < n; i++)
   16.66 :        84e:   addl   $0x1,-0x14(%rbp)
    6.46 :        852:   mov    -0x14(%rbp),%eax
    0.00 :        855:   cmp    -0x24(%rbp),%eax
    1.63 :        858:   jl     81e <main+0xd0>
         :              for (k = 0; k < 1000000; k++)


The first thing to notice is that the perf command tries to interleave C code with the Assembly code. This feature requires compiling the test program with -g3 to include debug information.

The number before the ':' is the percentage of samples collected while the program was executing each instruction. Once again, this is not an exact information, so you should be looking for hot regions, and not specific instructions.

The first and second hunk are the function prologue, which was executed only once, and the vector initialization. According to the profiling data, there is little point in attempting to optimize them, because the execution practically didn't spend any time on it. The third hunk is the second loop, where it spent almost all the execution time. Since that loop is where most of our samples where collected, we can assume that it is a hot region, which we can try to optimize. Also, notice that most of the samples were collected around that if leg. This is another indication that we should look into that specific code.

To find out what might be causing the slowness, we can use the perf stat command, which prints a bunch of performance counters information for the entire program. Let's take a look at its output.

[krisman@dilma bm]$ perf stat ./branch-miss.unsorted

 Performance counter stats for './branch-miss.unsorted:

    29876.773720  task-clock (msec) #    1.000 CPUs utilized
              25  context-switches  #    0.001 K/sec
               0  cpu-migrations    #    0.000 K/sec
              49  page-faults       #    0.002 K/sec
  86,685,961,134  cycles            #    2.901 GHz
  90,235,794,558  instructions      #    1.04  insn per cycle
  10,007,460,614  branches          #  334.958 M/sec
   1,605,231,778  branch-misses     #   16.04% of all branches

   29.878469405 seconds time elapsed

Perf stat will dynamically profile the program passed in the command line and report back a number of statistics about the entire execution. In this case, let's look at the 3 last lines in the output. The first one gives the rate of instructions executed per CPU cycle; the second line, the total number of branches executed; and the third, the percentage of those branches that resulted in a branch miss and pipeline flush.

Perf is even nice enough to put important or unexpected results in red. In this case, the last line, Branch-Misses, was unexpectedly high, thus it was displayed in red in this test.

And now, let's profile the pre-sorted version. Look at the number of branch misses:

[krisman@dilma bm]$ perf stat ./branch-miss.sorted

 Performance counter stats for './branch-miss.sorted:

    14003.066457  task-clock (msec) #    0.999 CPUs utilized
             175  context-switches  #    0.012 K/sec
               4  cpu-migrations    #    0.000 K/sec
              56  page-faults       #    0.004 K/sec
  40,178,067,584  cycles            #    2.869 GHz
  89,689,982,680  instructions      #    2.23  insn per cycle
  10,006,420,927  branches          #  714.588 M/sec
       2,275,488  branch-misses     #    0.02% of all branches

  14.020689833 seconds time elapsed

It went down from over 16% to just 0.02% of the total branches! This is very impressive and is likely to explain the reduction in execution time. Another interesting value is the number of instructions per cycle, which more than doubled. This happens because, once we reduced the number of stalls, we make better use of the pipeline, obtaining a better instruction throughput. Wrapping up

As demonstrated by the example above, figuring out the root cause of a program slowness is not always easy. In fact, it gets more complicated every time a new processor comes out with a bunch of shiny new optimizations.

Despite being a short example code, the branch misprediction case is still quite non-trivial for anyone not familiar with how the branch prediction mechanism works. In fact, if we just look at the algorithm, we could have concluded that adding a sort algorithm would just add more overhead to the algorithm. Thus, this example gives us a high-level view of how helpful profiling tools really are. By using just one of the several features provided by the perf tool, we were able to draw major conclusions about the program being examined.

Comments (10)
  1. Alan:
    Apr 03, 2017 at 11:46 AM

    sum += n[i];
    should be
    sum += vec[i];

    Reply to this comment

    Reply to this comment

    1. Krisman:
      Apr 03, 2017 at 01:47 PM

      Thanks Alan. That's correct, I've fixed it now.

      Reply to this comment

      Reply to this comment

  2. Arvin :
    Apr 03, 2017 at 06:08 PM

    Thank you for the excellent write-up, Krisman. For those following along, I was able to grab perf for my current kernel on Ubuntu with the following command: sudo apt install linux-tools-`uname -r`

    I was amazed at how well the -O3 compiler option was over -O2 and below with the unsorted code (-O2, -O1, and without were pretty much the same interestingly enough).

    Is this essentially doing under-the-hood what the sorted code is doing? Or is the compiler using other tricks to drastically improve performance here? Thanks again!

    Reply to this comment

    Reply to this comment

    1. krisman:
      Apr 03, 2017 at 08:03 PM

      The compiler is likely not sorting the vector, because it can't be sure such transformation would be correct or even helpful. But, which optimization it actually applies when increasing the optimization level depends on the compiler you have and which exact version you used. It may try, for instance, unrolling the loop
      to use more prediction slots, though I don't think it would make a difference here.

      A higher optimization level could also eliminate that outer loop, should it conclude it is useless for calculating the overall sum. To find out what happened in your
      case, you might wish to dump the binary with a tool like objdump and checkout the generated assembly for clues.

      gcc -O3 main.c -o branch-miss
      objdump -D branch-miss | less

      In my system, when compiling with -O3, gcc was able to optimize that inner loop with vector instructions, which eliminated most of the branch misses.

      In the second perf stat you shared, you can see that the result was similar, it drastically reduced the number of branch misses, resulting in an increase of the instructions per cycle rate.

      1. Arvin:
        Apr 03, 2017 at 10:17 PM

        Interesting, thanks! I'll keep playing with it. I was also curious how clang compared. Same number of branch misses, but many more instructions! Notable increase in execution time.

        All in all, this was fun and I learned something new today :)

        1. Anon:
          Apr 06, 2017 at 07:50 AM

          The optimisation change that flattens the results is explained in the most popular stack overflow answer ever:

          Reply to this comment

          Reply to this comment

  3. Thomas:
    Apr 03, 2017 at 06:39 PM

    Nice post, thanks for sharing.
    The return type of rand_partsum() should be long though to match the variable sum.
    1. krisman:
      Apr 03, 2017 at 08:05 PM

      Thanks! fixed that as well.
  4. Solerman Kaplon:
    Apr 03, 2017 at 08:56 PM

    how does the perf annotate looks like in the sort version? I'm curious how the cpu would understand that the data is sorted, never heard of such a thing
    1. krisman:
      Apr 04, 2017 at 01:42 AM

      Hi Solerman,

      It's not that the CPU understands the data is sorted, it doesn't. Instead, we use the knowledge acquired with perf to assemble the data in a specific way to explore the characteristics of the processor.

      In this case, we prepared the data in a way that made the conditional branch taken by the 'if' clause predictable for a history-based branch predictor, like the ones in modern cpus. By sorting the data, we ensure the first part of the array will always skip the 'if' leg, while the second part will always enter the 'if' leg. There might still be branch misses, when entering the vector and when switching from the first part of the vector to the second, for instance. But those branch misses are
      negligible since, by puting some order in the data, we ensured the vast majority of iterations won't trigger mispredictions.

      The expectation for the perf annotation of the optimized version would be a more even distribution of samples along the program code. If we only have this function alone in our program, it's likely that most samples will still be in the nested loops since that is, by far, the hottest path in our simple program. But
      even then, the optimized version may still have a slightly better distribution of samples, since we don't waste too much time stalled on that conditional branch. In the article example, perf annotate allowed us to isolate the region that made the most sense trying to optimize, which are always the parts where the execution
      spends most time.

Performance analysis in Linux (continued)


In this post, I will show one more example of how easy it is to disrupt performance of a modern CPU, and also run a quick discussion on

XDC 2017 - Links to recorded presentations (videos)


Many thanks to Google for recording all the XDC2017 talks. To make them easier to watch, here are direct links to each talk recorded at

DebConf 17: Flatpak and Debian


Last week, I attended DebConf 17 in Montréal, returning to DebConf for the first time in 10 years (last time was DebConf 7 in Edinburgh).

Android: NXP i.MX6 on Etnaviv Update


More progress is being made in the area of i.MX6, etnaviv and Android. Since the last post a lot work has gone into upstreaming and stabilizing

vkmark: more than a Vulkan benchmark


Ever since Vulkan was announced a few years ago, the idea of creating a Vulkan benchmarking tool in the spirit of glmark2 had been floating

Quick hack: Performance debugging Linux graphics on Mesa


Debugging graphics performance in a simple and high-level manner is possible for all Gallium based Mesa drivers using GALLIUM_HUD, a feature About Collabora

Whether writing a line of code or shaping a longer-term strategic software development plan, we'll help you navigate the ever-evolving world of Open Source.

[Jul 28, 2017] Module Environment Developer Notes

Jul 28, 2017 |
  1. Toolchains:
  2. Building Module Files
  3. Naming Modules
  4. Module Directory Organization
  5. Module Migration

Instructions and policies for installing and maintaining environment modules on Peregrine.


Libraries and applications are built around the concept of 'toolchains'; at present a toolchain is defined as a specific version of a compiler and MPI library or lack thereof. Applications are typically built with only a single toolchain, whereas libraries are built with and installed for potentially multiple toolchains as necessary to accommodate ABI differences produced by different toolchains. Workflows are primarily composed of the execution of a sequence of applications which may use different tools and might be orchestrated by an application or other tool. The toolchains presently supported are:

Loading one of the above MPI-compiler modules will also automatically load the associated compiler module (currently gcc 4.8.2 and comp-intel/13.1.3 are the recommended compilers). Certain applications may of course require alternative toolchains. If demand for additional options becomes significant, requests for additional toolchain support will be considered on a case-by-case basis.

Building Module Files Here are the steps for building an associated environment module for the installed mysoft software. First, create the appropriate module location
% mkdir -p /nopt/nrel/apps/modules/candidate/modulefiles/mysoft  # Use a directory and not a file.
% touch /nopt/nrel/apps/modules/candidate/modulefiles/mysoft/1.3 # Place environment module tcl code here.
% touch .version                                                 # If required, indicate default module in this file.
Next, edit the module file itself ("1.3" in the example). The current version of the HPC Standard Module Template is:
#%Module -*- tcl -*-

# Specify conflicts
# conflict 'appname'

# Prerequsite modules
# prereq 'appname/version....'

#################### Set top-level variables #########################

# 'Real' name of package, appears in help,display message
set PKG_NAME      pkg_name

# Version number (eg v major.minor.patch)
set PKG_VERSION   pkg_version 

# Name string from which enviro/path variable names are constructed
# Will be similar to, be not necessarily the same as, PKG_NAME
set PKG_PREFIX    pkg_prefix

# Path to the top-level package install location.
# Other enviro/path variable values constructed from this
set PKG_ROOT      pkg_root

# Library name from which to construct link line
# eg PKG_LIBNAME=fftw ---> -L/usr/lib -lfftw
set PKG_LIBNAME   pkg_libname

proc ModulesHelp { } {
    global PKG_VERSION
    global PKG_ROOT
    global PKG_NAME
    puts stdout "Build:       $PKG_NAME-$PKG_VERSION"
    puts stdout "URL:         http://www.___________"
    puts stdout "Description: ______________________"
    puts stdout "For assistance contact"

module-whatis "$PKG_NAME: One-line basic description"

# Standard install locations
prepend-path PATH             $PKG_ROOT/bin
prepend-path MANPATH          $PKG_ROOT/share/man
prepend-path INFOPATH         $PKG_ROOT/share/info
prepend-path LD_LIBRARY_PATH  $PKG_ROOT/lib
prepend-path LD_RUN_PATH      $PKG_ROOT/lib

# Set environment variables for configure/build

##################### Top level variables ##########################
setenv ${PKG_PREFIX}              "$PKG_ROOT"
setenv ${PKG_PREFIX}_ROOT         "$PKG_ROOT"
setenv ${PKG_PREFIX}_DIR          "$PKG_ROOT"

################ Template include directories ######################
# Only path names
setenv ${PKG_PREFIX}_INCLUDE      "$PKG_ROOT/include"
setenv ${PKG_PREFIX}_INCLUDE_DIR  "$PKG_ROOT/include"
# 'Directives'
setenv ${PKG_PREFIX}_INC          "-I $PKG_ROOT/include"

##################  Template library directories ####################
# Only path names
setenv ${PKG_PREFIX}_LIB          "$PKG_ROOT/lib"    
setenv ${PKG_PREFIX}_LIBDIR       "$PKG_ROOT/lib"
# 'Directives'
setenv ${PKG_PREFIX}_LD           "-L$PKG_ROOT/lib"
setenv ${PKG_PREFIX}_LIBS         "-L$PKG_ROOT/lib -l$PKG_LIBNAME"

The current module file template is maintained in a version control repo at The template file is located in hpc-devel/modules/modTemplate . To see the current file

git clone
cd ./hpc-devel/modules/
cat modTemplate

Next specify a default version of the module package. Here is an example of an an associated .version file for a set of module files

% cat /nopt/nrel/apps/modules/candidate/modulefiles/mysoft/.version
# vim: syntax=tcl

set ModulesVersion "1.3"

The .version file is only useful if there are multiple versions of the software installed. Put notes in the modulefile as necessary in stderr of the modulefile for the user to use the software correctly and for additional pointers.

NOTE : For modules with more than one level of sub-directory, although the default module as specified above is displayed correctly by the modules system, it is not loaded correctly!if more than one version exists, the most recent one will be loaded by default. In other words, the above will work fine for dakota/5.3.1 if 5.3.1 is a file alongside the file dakota/5.4 , but not for dakota/5.3.1/openmpi-gcc when a dakota/5.4 directory is present. In this case, to force the correct default module to be loaded, a dummy symlink needs to be added in dakota/ that points to the module specified in .version


% cat /nopt/nrel/apps/modules/default/modulefiles/dakota/.version
# vim: syntax=tcl

set ModulesVersion "5.3.1/openmpi-gcc"

% module avail dakota
------------------------------------------------------------------ /nopt/nrel/apps/modules/default/modulefiles -------------------------------------------------------------------
dakota/5.3.1/impi-intel           dakota/5.3.1/openmpi-epel         dakota/5.3.1/openmpi-gcc(default) dakota/5.4/openmpi-gcc            dakota/default

% ls -l /nopt/nrel/apps/modules/default/modulefiles/dakota
total 8
drwxrwsr-x 2 ssides   n-apps 8192 Sep 22 13:56 5.3.1
drwxrwsr-x 2 hsorense n-apps   96 Jun 19 10:17 5.4
lrwxrwxrwx 1 cchang   n-apps   17 Sep 22 13:56 default -> 5.3.1/openmpi-gcc
Naming Modules

Software which is made accessible via the modules system generally falls into one of three categories.

  1. Applications: these may be intended to carry out scientific calculations, or tasks like performance profiling of codes.
  2. Libraries: collections of header files and object code intended to be incorporated into an application at build time, and/or accessed via dynamic loading at runtime. The principal exceptions are technical communication libraries such as MPI, which are categorized as toolchain components below.
  3. Toolchains: compilers (e.g., Intel, GCC, PGI) and MPI libraries (OpenMPI, IntelMPI, mvapich2).

Often a package will contain both executable files and libraries. Whether it is classified as an Application or a Library depends on its primary mode of utilization. For example, although the HDF5 package contains a variety of tools for querying HDF5-format files, its primary usage is as a library which applications can use to create or access HDF5-format files. Each package can also be distinguished as a vendor- or developer-supplied binary, or a collection of source code and build components ( e.g. , Makefile(s)).

For pre-built applications or libraries, or for applications built from source code, the basic form of the module name should be


. For libraries built from source, or any package containing components which can be linked against in normal usage, the name should be


The difference arises from two considerations. For supplied binaries, the assumed vendor or developer expectation is that a package will run either on a specified Linux distribution (and may have specific requirements satisfied by the distribution), or across varied distributions (and has fairly generic requirements satisfied by most or all distributions). Thus, the toolchain for supplied binaries is implicitly supplied by the operating system. For source code applications, the user should not be directly burdened with the underlying toolchain requirement; where this is relevant ( i.e. , satisfying dependencies), the associated information should be available in module help output, as well as through dependency statements in the module itself.


{package_name} : This should be chosen such that the associated Application, Library, or Toolchain component is intuitively obvious, while concomitantly distinguishing its target from other Applications, Libraries, or Toolchain components likely to be made available on the system through the modules. So, "gaussian" is a sensible package_name , whereas "gsn" would be too generic and of unclear intent. Within these guidelines, though, there is some discretion left to the module namer.

{version} : The base version generally reflects the state of development of the underlying package, and is supplied by the developers or vendor. However, a great deal of flexibility is permitted here with respect to build options outside of the recognized {toolchain} terms. So, a Scalapack-enabled package version might be distinguished from a LAPACK-linked one by appending "-sc" to the base version, provided this is explained in the "module help" or "module show" information. {version} provides the most flexibility to the module namer.

{toolchain} : This is solely intended to track the compiler and MPI library used to build a source package. It is not intended to track the versions of these toolchain components, nor to track the use of associated toolkits ( e.g. , Cilk Plus) or libraries ( e.g. , MKL, Scalapack). As such, this term takes the form {MPI}-{compiler} , where {MPI} is one of

  1. openmpi
  2. impi (Intel MPI)

and {compiler} is one of

  1. gcc
  2. intel
  3. epel (which implies the gcc supplied with the OS, possibly at a newer version number than that in the base OS exposed in the filesystem without the EPEL module).
Module Directory Organization For general support, modulefiles can be installed in three top locations:

In addition, more specific requests can be satisfied in two other ways:

For the '/nopt/nrel/apps' modules location (where most general installations should be made), the following sub-directories have been created: to manage how modules are developed, tested and provided for production level use. An example directory hierarchy for the module files is as follows:
[wjones@login2 nrel]$ tree -a apps/modules/default/modulefiles/hdf5-parallel/
├── .1.6.4
│   ├── impi-intel
│   ├── openmpi-gcc
│   └── .version
├── 1.8.11
│   ├── impi-intel
│   └── openmpi-gcc
└── .version

[wjones@login2 nrel]$ tree -a apps/modules/default/modulefiles/hdf5
├── .1.6.4
│   └── intel
├── 1.8.11
│   ├── gcc
│   └── intel
└── .version

[wjones@login2 nrel]$ module avail hdf5

------------------------------------------------------- /nopt/nrel/apps/modules/default/modulefiles -------------------------------------------------------
hdf5/1.8.11/gcc                          hdf5-parallel/1.8.11/impi-intel(default)
hdf5/1.8.11/intel(default)               hdf5-parallel/1.8.11/openmpi-gcc
Module Migration
  1. There are three file paths for which this document is intended. Each corresponds to a status of modules within a broader workflow for managing modules. (The other module locations are not directly part of the policy).
    1. /nopt/nrel/apps/modules/candidate/modulefiles : This is the starting point for new modules. Modules are to be created here for testing and validation prior to production release. Modules here are not necessarily expected to work without issues, and may be modified or deleted without warning.
    2. /nopt/nrel/apps/modules/default/modulefiles : This is the production location, visible to the general user community by default. Modules here carry the expectation of functioning properly. Movement of modulefiles into and out of this location is managed through a monthly migration process.
    3. /nopt/nrel/apps/modules/deprecated/modulefiles : This location contains older modules which are intended for eventual archiving. Conflicts with newer software may render these modules non-functional, and so there is not an expectation of maintenance for these. They are retained to permit smooth migration out of the Peregrine software stack ( i.e. , users will still have access to them and may register objections/issues while retaining their productivity).
  2. "modifications" to modules entail
    1. Additions to any of the three stages;
    2. Major changes in functionality for modules in /default or /deprecated;
    3. Archiving modules from /deprecated; or,
    4. Making a module "default"

    These are the only acceptable atomic operations. Thus, a migration is defined as an addition to one path and a subsequent deletion from its original path.

  3. Announcements to users may be one of the following six options:
    1. Addition to /candidate!"New Module";
    2. Migration from /candidate to /default!"Move to Production";
    3. Migration from /default to /deprecated!"Deprecate";
    4. Removing visibility and accessibility from /deprecated!"Archive"; or,
    5. Major change in functionality in /default or /deprecated!"Modify"
    6. Make default!"Make default"

    Changes outside of these options, e.g. , edits in /candidate, will not be announced as batching these changes would inhibit our ability to respond nimbly to urgent problems.

  4. A "major change in functionality" is an edit to the module that could severely compromise users' productivity in the absence of adaptation on their part. So, pointing to a different application binary could result in incompatibilities in datasets generated before and after the module change; changing a module name can break workflows over thousands of jobs. On the other hand, editing inline documentation, setting an environment variable that increases performance with no side effects, or changing a dependency maintenance revision (e.g., a secondary module load of a library from v3.2.1 to v3.2.2) is unlikely to create major problems and does not need explicit attention.
  5. All module modifications are to be documented in the Sharepoint Modules Modifications table prior to making any changes (this table is linked at
  6. Module modifications are to be batched for execution on monthly calendar boundaries, and (a) announced to two weeks prior to execution, and (b) added to as a new page, which will auto-populate the table visible on the front page. Endeavor to make this list final prior to the first announcement.
  7. Modules may not be added to or deleted from /default without a corresponding deletion/addition from one of the other categories, i.e. , they may only be migrated relative to /default, not created or deleted directly.
  8. Good faith testing. There is not currently a formally defined testing mechanism for new modules in /candidate. It is thus left to the individual module steward's (most likely the individual who owns the modulefile in the *NIX sense) discretion what is a defensible test regimen. Within the current document's scope, this specifically relates to the module functionality, not the application functionality.
  9. Library and toolchain dependencies must be checked for prior to removal of modules from .../deprecated. For example, if a user identifies an application dependency on a deprecated library or toolchain, then the application module will point to the specific library or toolchain version!if it were not, then presumably an updated library/toolchain would be breaking the application. Thus, checking for dependencies on deprecated versions can be done via simple grep of all candidate and production modules. (An obvious exception is if the user is handling the dependencies in their own scripts; this case can not be planned around). It is assumed that an identified dependency on a deprecated module would spur rebuilding and testing of the application against newer libraries/toolchain, so that critical dependency on deprecated tools may not often arise in practice.
last modified Jul 06, 2015 03:16 PM

[Jul 28, 2017] HPC Environment Modules

Jul 28, 2017 |
Basic module usage

To know what modules are available, you'll need to run the "module avail" command from an interactive session:

[asrini@consign ~]$ bsub -Is bash
Job <9990024> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node063.hpc.local>>
[asrini@node063 ~]$ module avail

------------------------------------------------------------------- /usr/share/Modules/modulefiles -------------------------------------------------------------------
NAMD-2.9-Linux-x86_64-multicore dot                             module-info                     picard-1.96                     rum-2.0.5_05
STAR-2.3.0e                     java-sdk-1.6.0                  modules                         pkg-config-path                 samtools-0.1.19
STAR-hg19                       java-sdk-1.7.0                  mpich2-x86_64                   python-2.7.5                    use.own
STAR-mm9                        ld-library-path                 null                            r-libs-user
bowtie2-2.1.0                   manpath                         openmpi-1.5.4-x86_64            ruby-1.8.7-p374
devtoolset-2                    module-cvs                      perl5lib                        ruby-1.9.3-p448

The module names should be pretty self-explainatory, but some are not. To see information about a module you can issue a module show [module name] :

[asrini@node063 ~]$ module show null

module-whatis    does absolutely nothing

[asrini@node063 ~]$ module show r-libs-user

module-whatis    Sets R_LIBS_USER=$HOME/R/library
setenv           R_LIBS_USER ~/R/library

[asrini@node063 ~]$ module show devtoolset-2

module-whatis    Devtoolset-2 packages include the newer versions of gcc
prepend-path     PATH /opt/rh/devtoolset-2/root/usr/bin
prepend-path     MANPATH /opt/rh/devtoolset-2/root/usr/share/man
prepend-path     INFOPATH /opt/rh/devtoolset-2/root/usr/share/info

Example use of modules:

[asrini@node063 ~]$ python -V
Python 2.6.6

[asrini@node063 ~]$ which python

[asrini@node063 ~]$ module load python-2.7.5

[asrini@node063 ~]$ python -V
Python 2.7.5

[asrini@node063 ~]$ which python

After running the above commands, you will be able to use python v2.7.5 till you exit out of the interactive session or till you unload the module:

[asrini@node063 ~]$ module unload python-2.7.5

[asrini@node063 ~]$ which python

Modules may also be included in your job scripts and submitted as a batch job.

Using Modules at Login

In order to have modules automatically load into your environment, you would add the module commands to your $HOME/.bashrc file. Note that modules are not available on the PMACS head node, hence, you'll need to ensure that your login script attempts to load a module only if you are on a compute node:

[asrini@consign ~]$ more .bashrc
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
# Modules to load
if [ $HOSTNAME != "consign.hpc.local" ] && [ $HOSTNAME != "" ]; then
        module load python-2.7.5

# more stuff below .....

[asrini@consign ~]$ which python
[asrini@consign ~]$ bsub -Is bash
Job <172129> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node063.hpc.local>>
[asrini@node063 ~]$ which python

[Jun 19, 2017] Source Repository – OpenHPC

Jun 19, 2017 |

Welcome to the OpenHPC site. OpenHPC is a collaborative, community effort that initiated from a desire to aggregate a number of common ingredients required to deploy and manage High Performance Computing (HPC) Linux clusters including provisioning tools, resource management, I/O clients, development tools, and a variety of scientific libraries. Packages provided by OpenHPC have been pre-built with HPC integration in mind with a goal to provide re-usable building blocks for the HPC community. Over time, the community also plans to identify and develop abstraction interfaces between key components to further enhance modularity and interchangeability. The community includes representation from a variety of sources including software vendors, equipment manufacturers, research institutions, supercomputing sites, and others.

All of the source collateral related to the OpenHPC integration effort is managed with git and is hosted on GitHub at the following location:

The top-level organization of the git repository is grouped into into three primary categories:


The components/ directory houses all of the build-related and packaging collateral for each individual packages currently included within OpenHPC. This generally includes items such as RPM .spec files and any patches applied during the build. Note that packages are generally grouped by functionality and the following functional groupings have been identified:

Note that the above functionality groupings are also used to organize work-item issues on the OpenHPC GitHub site via labels assigned to each component.


The docs/ directory in the GitHub repo houses related installation recipes that leverage OpenHPC packaged components.

The documentation is typeset using LaTeX and companion parsing utilities are used to derive automated installation scripts directly from the raw LatTeX files in order to validate the embedded instructions as part of the continuous integration (CI) process.

Copies of the latest documentation products are available on the Downloads page.

[Jun 19, 2017] How to easily install configure the Torque-Maui open source scheduler in Bright by Robert Stober

Jun 19, 2017 |
How to easily install & configure the Torque/Maui open source scheduler in Bright | August 14, 2012 | workload manager , HPC job scheduler , Maui , Torque Bright Cluster Manager makes most cluster management tasks very easy to perform, and installing workload managers is one of them. There are many workload managers that are pre-configured, admin-selectable options when you install Bright, including PBS Pro, SLURM , LSF, openlava, Torque, and Grid Engine .

The open source scheduler Maui is not pre-configured, but it's really easy to install and configure this software in Bright Cluster Manager. This article shows you how. The process is to download and install the Maui scheduler, then to configure Bright to use Maui to schedule torque jobs.

Getting Started

Step1: Download the Maui scheduler from the Adaptive Computing website: You will need to register on their site before you can download it.

Step 2: Install it as shown below. This command will overwrite the Bright zero-length Maui placeholder file.

# cp -f maui-3.3.1.tar.gz /usr/src/redhat/SOURCES/maui-3.3.1.tar.gz

Step 3: Build the Maui RPM.

# rpmbuild -bb /usr/src/redhat/SPECS/maui.spec

Step 4: Install the RPM.

# rpm -ivh /usr/src/redhat/RPMS/x86_64/maui-3.3.1-59_cm6.0.x86_64.rpm

Preparing... ########################################### [100%]

1:maui ########################################### [100%]

Select the node that is running the Torque server (usually the head node) resource, then the "roles" tab. Configure the "scheduler" property of the Torque Server role to use the Maui scheduler.

Step 5. Load the Torque and Maui modules. This adds the Maui commands to your PATH in the current shell.

$ module load torque

$ module load maui

The "initadd" command adds the Torque and Maui modules to your environment so that next time you log in they're automatically loaded.

$ module initadd torque maui

Step 6. Submit a simple Torque job.

$ qsub

The job has been submitted and is running.

$ qstat

Job id Name User Time Use S Queue

------------------------- ---------------- --------------- -------- - -----

5.torque-head stresscpu rstober 0 R shortq

The Maui showq command displays information about active, eligible, blocked, and/or recently completed jobs. Since Torque is not actually scheduling jobs, the showq command displays the actual job ordering.

$ showq

ACTIVE JOBS--------------------


5 rstober Running 1 99:23:59:28 Thu Aug 9 11:40:45

IDLE JOBS----------------------

0 Idle Jobs

BLOCKED JOBS----------------

Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0

The Maui checkjob displays detailed job information for queued, blocked, active, and recently completed jobs.

$ checkjob 5

checking job 5

State: Running
Creds:  user:rstober  group:rstober  class:shortq  qos:DEFAULT
WallTime: 00:01:31 of 99:23:59:59
SubmitTime: Thu Aug  9 11:40:44
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

StartTime: Thu Aug  9 11:40:45
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Allocated Nodes:

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '5' (-00:01:31 -> 99:23:58:28  Duration: 99:23:59:59)
PE:  1.00  StartPriority:  1

[Jun 16, 2017] Tutorial - Submitting a job using qsub by Sreedhar Manchu

Notable quotes:
"... (the path to your home directory) ..."
"... (which language you are using) ..."
"... (the name that you logged in with) ..."
"... (standard path to excecutables) ..."
"... (location of the users mail file) ..."
"... (command shell, i.e bash,sh,zsh,csh, ect.) ..."
"... (the name of the host upon which the qsub command is running) ..."
"... (the hostname of the pbs_server which qsub submits the job to) ..."
"... (the name of the original queue to which the job was submitted) ..."
"... (the absolute path of the current working directory of the qsub command) ..."
"... (each member of a job array is assigned a unique identifier) ..."
"... (set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job) ..."
"... (the job identifier assigned to the job by the batch system) ..."
"... (the job name supplied by the user) ..."
"... (the name of the file contain the list of nodes assigned to the job) ..."
"... (the name of the queue from which the job was executed from) ..."
"... (the walltime requested by the user or default walltime allotted by the scheduler) ..."

Last modified by Yanli Zhang on Jul 10, 2012

qsub Tutorial

  1. Synopsis
  2. What is qsub
  3. What does qsub do?
  4. Arguments to control behavior
Synopsis qsub Synopsis ?
qsub [-a date_time] [-A account_string] [-b secs] [-c checkpoint_options] n No checkpointing is to be performed. s Checkpointing is to be performed only when the server executing the job is shutdown . c Checkpointing is to be performed at the default minimum time for the server executing the job. c=minutes Checkpointing is to be performed at an interval of minutes, which is the integer number of minutes of CPU time used by the job. This value must be greater than zero. [-C directive_prefix] [-d path] [-D path] [-e path] [-f] [-h] [-I ] [-j join ] [-k keep ] [-l resource_list ] [-m mail_options] [-N name] [-o path] [-p priority] [-P user[:group]] [-q destination] [-r c] [-S path_list] [-t array_request] [-u user_list] [- v variable_list] [-V ] [-W additional_attributes] [-X] [-z] [script]

For detailed information, see this page .

What is qsub?

qsub is the command used for job submission to the cluster. It takes several command line arguments and can also use special directives found in the submission scripts or command file. Several of the most widely used arguments are described in detail below.

Useful Information

For more information on qsub do More information on qsub ?
$ man qsub

What does qsub do?


All of our clusters have a batch server referred to as the cluster management server running on the headnode. This batch server monitors the status of the cluster and controls/monitors the various queues and job lists. Tied into the batch server, a scheduler makes decisions about how a job should be run and its placement in the queue. qsub interfaces into the the batch server and lets it know that there is another job that has requested resources on the cluster. Once a job has been received by the batch server, the scheduler decides the placement and notifies the batch server which in turn notifies qsub (Torque/PBS) whether the job can be run or not. The current status (whether the job was successfully scheduled or not) is then returned to the user. You may use a command file or STDIN as input for qsub.

Environment variables in qsub

The qsub command will pass certain environment variables in the Variable_List attribute of the job. These variables will be available to the job. The value for the following variables will be taken from the environment of the qsub command:

These values will be assigned to a new name which is the current name prefixed with the string "PBS_O_". For example, the job will have access to an environment variable named PBS_O_HOME which have the value of the variable HOME in the qsub command environment.

In addition to these standard environment variables, there are additional environment variables available to the job.

Arguments to control behavior

As stated before there are several arguments that you can use to get your jobs to behave a specific way. This is not an exhaustive list, but some of the most widely used and many that you will will probably need to accomplish specific tasks.

Declare the date/time a job becomes eligible for execution

To set the date/time which a job becomes eligible to run, use the -a argument. The date/time format is [[[[CC]YY]MM]DD]hhmm[.SS]. If -a is not specified qsub assumes that the job should be run immediately.


To test -a get the current date from the command line and add a couple of minutes to it. It was 10:45 when I checked. Add hhmm to -a and submit a command from STDIN.

Example: Set the date/time which a job becomes eligible to run ?
$ echo "sleep 30" | qsub -a 1047

Handy Hint

This option can be added to pbs script with a PBS directive such as Equivalent PBS Directive ?
#PBS -a 1047
Defining the working directory path to be used for the job

To define the working directory path to be used for the job -d option can be used. If it is not specified, the default working directory is the home directory.

Example: Define the working directory path to be used for the job ?
$ pwd /home/manchu $ cat dflag.pbs echo "Working directory is $PWD" $ qsub dflag.pbs 5596682.hpc0. local $ cat dflag.pbs.o5596682 Working directory is /home/manchu $ mv dflag.pbs random_pbs/ $ qsub -d /home/manchu/random_pbs/ /home/manchu/random_pbs/dflag.pbs 5596703.hpc0. local $ cat random_ps/dflag.pbs.o5596703 Working directory is /home/manchu/random_pbs $ qsub /home/manchu/random_pbs/dflag.pbs 5596704.hpc0. local $ cat dflag.pbs.o5596704 Working directory is /home/manchu

Handy Hint

This option can be added to pbs script with a PBS directive such as Equivalent PBS Directive ?
#PBS -d /home/manchu/random_pbs

Manipulate the output files

As a default all jobs will print all stdout (standard output) messages to a file with the name in the format <job_name>.o<job_id> and all stderr (standard error) messages will be sent to a file named <job_name>.e<job_id>. These files will be copied to your working directory as soon as the job starts. To rename the file or specify a different location for the standard output and error files, use the -o for standard output and -e for the standard error file. You can also combine the output using -j.

Create a simple submission file: ?
$ cat sleep .pbs #!/bin/sh for i in {1..60} ; do echo $i sleep 1 done
Create a simple submission file: ?
$ qsub -o sleep .log sleep .pbs

Handy Hint

This option can be added to pbs script with a PBS directive such as Equivalent PBS Directive ?
#PBS -o sleep.log
Submit your job with the standard error file renamed: ?
$ qsub -e sleep .log sleep .pbs

Handy Hint

This option can be added to pbs script with a PBS directive such as Equivalent PBS Directive ?
#PBS -e sleep.log
Combine them using the name sleep.log: ?
$ qsub -o sleep .log -j oe .pbs

Handy Hint

This option can be added to pbs script with a PBS directive such as Equivalent PBS Directive ?
#PBS -o sleep.log #PBS -j oe


The order of two letters next to flag -j is important. It should always start with the letter that's been already defined before, in this case 'o'. Place the joined output in another location other than the working directory: ?
$ qsub -o $HOME/tutorials/logs/sleep.log -j oe sleep .pbs

Mail job status at the start and end of a job

The mailing options are set using the -m and -M arguments. The -m argument sets the conditions under which the batch server will send a mail message about the job and -M will define the users that emails will be sent to (multiple users can be specified in a list seperated by commas). The conditions for the -m argument include:

Using the sleep.pbs script created earlier, submit a job that emails you for all conditions: ? $ qsub -m abe -M sleep .pbs

Handy Hint

This option can be added to pbs script with a PBS directive such as Equivalent PBS Directive ?
#PBS -m abe #PBs -M

Submit a job to a specific queue

You can select a queue based on walltime needed for your job. Use the 'qstat -q' command to see the maximum job times for each queue.

Submit a job to the bigmem queue: ?
$ qsub -q bigmem sleep .pbs

Handy Hint

This option can be added to pbs script with a PBS directive such as Equivalent PBS Directive ?
#PBS -q bigmem

Submitting a job that is dependent on the output of another

Often you will have jobs that will be dependent on another for output in order to run. To add a dependency, we will need to use the -W (additional attributes) with the depend option. We will be using the afterok rule, but there are several other rules that may be useful. (man qsub)


To illustrate the ability to hold execution of a specific job until another has completed, we will write two submission scripts. The first will create a list of random numbers. The second will sort those numbers. Since the second script will depend on the list that is created we will need to hold execution until the first has finished.

random.pbs ?
$ cat random.pbs #!/bin/sh cd $HOME sleep 120 for i in {1..100}; do echo $RANDOM >> rand.list done
sort.pbs ?
$ cat sort .pbs #!/bin/sh cd $HOME sort -n rand.list > sorted.list sleep 30

Once the file are created, lets see what happens when they are submitted at the same time:

Submit at the same time ?
$ qsub random.pbs ; qsub sort .pbs 5594670.hpc0. local 5594671.hpc0. local $ ls random.pbs sorted.list sort .pbs sort .pbs.e5594671 sort .pbs.o5594671 $ cat sort .pbs.e5594671 sort : open failed: rand.list: No such file or directory

Since they both ran at the same time, the sort script failed because the file rand.list had not been created yet. Now submit them with the dependencies added.

Submit them with the dependencies added ?
$ qsub random.pbs 5594674.hpc0. local $ qsub -W depend=afterok:5594674.hpc0. local sort .pbs 5594675.hpc0. local $ qstat -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 5594674.hpc0.loc manchu ser2 random.pbs 18029 1 1 -- 48:00 R 00:00 5594675.hpc0.loc manchu ser2 sort .pbs 1 1 -- 48:00 H --

We now see that the sort.pbs job is in a hold state. And once the dependent job completes the sort job runs and we see:

Job status with the dependencies added ?
$ qstat -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 5594675.hpc0.loc manchu ser2 sort .pbs 18165 1 1 -- 48:00 R --

Useful Information

Submitting multiple jobs in a loop that depend on output of another job

This example show how to submit multiple jobs in a loop where each job depends on output of job submitted before it.


Let's say we need to write numbers from 0 to 999999 in order onto a file output.txt. We can do 10 separate runs to achieve this, where each run has a separate pbs script writing 100,000 numbers to output file. Let's see what happens if we submit all 10 jobs at the same time.

The script below creates required pbs scripts for all the runs.

Create PBS Scripts for all the runs ?
$ cat #!/bin/bash for i in {0..9} do cat > pbs.script.$i << EOF #!/bin/bash #PBS -l nodes=1:ppn=1,walltime=600 cd \$PBS_O_WORKDIR for ((i=$((i*100000)); i<$(((i+1)*100000)); i++)) { echo "\$i" >> output.txt } exit 0; EOF done
Change permission to make it an executable ?
$ chmod u+x
Run the Script ?
$ ./
List of Created PBS Scripts ?
$ ls -l pbs.script.* -rw-r--r-- 1 manchu wheel 134 Oct 27 16:32 pbs.script.0 -rw-r--r-- 1 manchu wheel 139 Oct 27 16:32 pbs.script.1 -rw-r--r-- 1 manchu wheel 139 Oct 27 16:32 pbs.script.2 -rw-r--r-- 1 manchu wheel 139 Oct 27 16:32 pbs.script.3 -rw-r--r-- 1 manchu wheel 139 Oct 27 16:32 pbs.script.4 -rw-r--r-- 1 manchu wheel 139 Oct 27 16:32 pbs.script.5 -rw-r--r-- 1 manchu wheel 139 Oct 27 16:32 pbs.script.6 -rw-r--r-- 1 manchu wheel 139 Oct 27 16:32 pbs.script.7 -rw-r--r-- 1 manchu wheel 139 Oct 27 16:32 pbs.script.8 -rw-r--r-- 1 manchu wheel 140 Oct 27 16:32 pbs.script.9
PBS Script ?
$ cat pbs.script.0 #!/bin/bash #PBS -l nodes=1:ppn=1,walltime=600 cd $PBS_O_WORKDIR for ((i=0; i<100000; i++)) { echo "$i" >> output.txt } exit 0;
Submit Multiple Jobs at a Time ?
$ for i in {0..9}; do qsub pbs.script.$i ; done 5633531.hpc0. local 5633532.hpc0. local 5633533.hpc0. local 5633534.hpc0. local 5633535.hpc0. local 5633536.hpc0. local 5633537.hpc0. local 5633538.hpc0. local 5633539.hpc0. local 5633540.hpc0. local $
output.txt ?
$ tail output.txt 699990 699991 699992 699993 699994 699995 699996 699997 699998 699999 - bash -3.1$ grep -n 999999 $_ 210510:999999 $

This clearly shows the nubmers are in no order like we wanted. This is because all the runs wrote to the same file at the same time, which is not what we wanted.

Let's submit jobs using qsub dependency feature. This can be achieved with a simple script shown below.

Simple Script to Submit Multiple Dependent Jobs ?
$ cat dependency.pbs #!/bin/bash job=`qsub pbs.script.0` for i in {1..9} do job_next=`qsub -W depend=afterok:$job pbs.script.$i` job=$job_next done
Let's make it an executable ?
$ chmod u+x dependency.pbs
Submit dependent jobs by running the script ?
$ ./dependency.pbs $ qstat -u manchu hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 5633541.hpc0.loc manchu ser2 pbs.script.0 28646 1 1 -- 00:10 R -- 5633542.hpc0.loc manchu ser2 pbs.script.1 -- 1 1 -- 00:10 H -- 5633543.hpc0.loc manchu ser2 pbs.script.2 -- 1 1 -- 00:10 H -- 5633544.hpc0.loc manchu ser2 pbs.script.3 -- 1 1 -- 00:10 H -- 5633545.hpc0.loc manchu ser2 pbs.script.4 -- 1 1 -- 00:10 H -- 5633546.hpc0.loc manchu ser2 pbs.script.5 -- 1 1 -- 00:10 H -- 5633547.hpc0.loc manchu ser2 pbs.script.6 -- 1 1 -- 00:10 H -- 5633548.hpc0.loc manchu ser2 pbs.script.7 -- 1 1 -- 00:10 H -- 5633549.hpc0.loc manchu ser2 pbs.script.8 -- 1 1 -- 00:10 H -- 5633550.hpc0.loc manchu ser2 pbs.script.9 -- 1 1 -- 00:10 H -- $
Output after first run ?
$ tail output.txt 99990 99991 99992 99993 99994 99995 99996 99997 99998 99999 $
Output after final run ?
$ tail output.txt 999990 999991 999992 999993 999994 999995 999996 999997 999998 999999 $ grep -n 100000 output.txt 100001:100000 $ grep -n 999999 output.txt 1000000:999999 $

This shows that numbers are written in order to output.txt. Which in turn shows that jobs ran one after successful completion of another.

Opening an interactive shell to the compute node

To open an interactive shell to a compute node, use the -I argument. This is often used in conjunction with the -X (X11 Forwarding) and the -V (pass all of the users environment)

Open an interactive shell to a compute node ?
$ qsub -I

Passing an environment variable to your job

You can pass user defined environment variables to a job by using the -v argument.


To test this we will use a simple script that prints out an environment variable.

Passing an environment variable ?
$ cat variable.pbs #!/bin/sh if [ "x" == "x$MYVAR" ] ; then echo "Variable is not set" else echo "Variable says: $MYVAR" fi

Next use qsub without the -v and check your standard out file

qsub without -v ?
$ qsub variable.pbs 5596675.hpc0. local $ cat variable.pbs.o5596675 Variable is not set

Then use the -v to set the variable

qsub with -v ?
$ qsub - v MYVAR= "hello" variable.pbs 5596676.hpc0. local $ cat variable.pbs.o5596676 Variable says: hello

Handy Hint

This option can be added to pbs script with a PBS directive such as Equivalent PBS Directive ?
#PBS -v MYVAR="hello"

Useful Information

Multiple user defined environment variables can be passed to a job at a time. Passing Multiple Variables ?
$ cat variable.pbs #!/bin/sh echo "$VAR1 $VAR2 $VAR3" > output.txt $ $ qsub - v VAR1= "hello" ,VAR2= "Sreedhar" ,VAR3= "How are you?" variable.pbs 5627200.hpc0. local $ cat output.txt hello Sreedhar How are you? $

Passing your environment to your job

You may declare that all of your environment variables are passed to the job by using the -V argument in qsub.


Use qsub to perform an interactive login to one of the nodes:

Passing your environment: qsub with -V ?
$ qsub -I -V

Handy Hint

This option can be added to pbs script with a PBS directive such as Equivalent PBS Directive ?

Once the shell is opened, use the env command to see that your environment was passed to the job correctly. You should still have access to all your modules that you loaded previously.

Submitting an array job: Managing groups of jobs .hostname would have PBS_ARRAYID set to 0. This will allow you to create job arrays where each job in the array will perform slightly different actions based on the value of this variable, such as performing the same tasks on different input files. One other difference in the environment between jobs in the same array is the value of the PBS_JOBNAME variable.


First we need to create data to be read. Note that in a real application, this could be data, configuration setting or anything that your program needs to run.

Create Input Data

To create input data, run this simple one-liner:

Creating input data ?
$ for i in {0..4}; do echo "Input data file for an array $i" > input.$i ; done $ ls input.* input.0 input.1 input.2 input.3 input.4 $ cat input.0 Input data file for an array 0

Submission Script
Submission Script: array.pbs ?
$ cat array.pbs #!/bin/sh #PBS -l nodes=1:ppn=1,walltime=5:00 #PBS -N arraytest cd ${PBS_O_WORKDIR} # Take me to the directory where I launched qsub # This part of the script handles the data. In a real world situation you will probably # be using an existing application. cat input.${PBS_ARRAYID} > output.${PBS_ARRAYID} echo "Job Name is ${PBS_JOBNAME}" >> output.${PBS_ARRAYID} sleep 30 exit 0;

Submit & Monitor

Instead of running five qsub commands, we can simply enter:

Submitting and Monitoring Array of Jobs ?
$ qsub -t 0-4 array.pbs 5534017[].hpc0. local

qstat ?
$ qstat -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 5534017[].hpc0.l sm4082 ser2 arraytest 1 1 -- 00:05 R -- $ qstat -t -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 5534017[0].hpc0. sm4082 ser2 arraytest-0 12017 1 1 -- 00:05 R -- 5534017[1].hpc0. sm4082 ser2 arraytest-1 12050 1 1 -- 00:05 R -- 5534017[2].hpc0. sm4082 ser2 arraytest-2 12084 1 1 -- 00:05 R -- 5534017[3].hpc0. sm4082 ser2 arraytest-3 12117 1 1 -- 00:05 R -- 5534017[4].hpc0. sm4082 ser2 arraytest-4 12150 1 1 -- 00:05 R -- $ ls output.* output.0 output.1 output.2 output.3 output.4 $ cat output.0 Input data file for an array 0 Job Name is arraytest-0


pbstop by default doesn't show all the jobs in the array. Instead, it shows a single job in just one line in the job information. Pressing 'A' shows all the jobs in the array. Same can be achieved by giving the command line option '-A'. This option along with '-u <NetID>' shows all of your jobs including array as well as normal jobs.

pbstop ?
$ pbstop -A -u $USER


Typing 'A' expands/collapses array job representation.

Comma delimited lists

The -t option of qsub also accepts comma delimited lists of job IDs so you are free to choose how to index the members of your job array. For example:

Comma delimited lists ?
$ rm output.* $ qsub -t 2,5,7-9 array.pbs 5534018[].hpc0. local $ qstat -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 5534018[].hpc0.l sm4082 ser2 arraytest 1 1 -- 00:05 Q -- $ qstat -t -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 5534018[2].hpc0. sm4082 ser2 arraytest-2 12319 1 1 -- 00:05 R -- 5534018[5].hpc0. sm4082 ser2 arraytest-5 12353 1 1 -- 00:05 R -- 5534018[7].hpc0. sm4082 ser2 arraytest-7 12386 1 1 -- 00:05 R -- 5534018[8].hpc0. sm4082 ser2 arraytest-8 12419 1 1 -- 00:05 R -- 5534018[9].hpc0. sm4082 ser2 arraytest-9 12452 1 1 -- 00:05 R -- $ ls output.* output.2 output.5 output.7 output.8 output.9 $ cat output.2 Input data file for an array 2 Job Name is arraytest-2

A more general for loop - Arrays with step size

By default, PBS doesn't allow array jobs with step size. qsub -t 0-10 <pbs.script> increments PBS_ARRAYID in 1. To submit jobs in steps of a certain size, let's say step size of 3 starting at 0 and ending at 10, one has to do

qsub -t 0,3,6,9 <pbs.script>

To make it easy for users we have put a wrapper which takes starting point, ending point and step size as arguments for -t flag. This avoids default necessity that PBS_ARRAYID increment be 1. The above request can be accomplished with (which happens behind the scenes with the help of wrapper)

qsub -t 0-10:3 <pbs.script>

Here, 0 is the starting point, 10 is the ending point and 3 is the step size. It is not necessary that starting point must be 0. It can be any number. Incidentally, in a situation in which the upper-bound is not equal to the lower-bound plus an integer-multiple of the increment, for example

qsub -t 0-10:3 <pbs.script>

wrapper automatically changes the upper bound as shown in the example below.

Arrays with step size ?
[sm4082@login-0-0 ~]$ qsub -t 0-10:3 array.pbs 6390152[].hpc0. local [sm4082@login-0-0 ~]$ qstat -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 6390152[].hpc0.l sm4082 ser2 arraytest -- 1 1 -- 00:05 Q -- [sm4082@login-0-0 ~]$ qstat -t -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 6390152[0].hpc0. sm4082 ser2 arraytest-0 25585 1 1 -- 00:05 R -- 6390152[3].hpc0. sm4082 ser2 arraytest-3 28227 1 1 -- 00:05 R -- 6390152[6].hpc0. sm4082 ser2 arraytest-6 8515 1 1 -- 00:05 R 00:00 6390152[9].hpc0. sm4082 ser2 arraytest-9 505 1 1 -- 00:05 R -- [sm4082@login-0-0 ~]$ ls output.* output.0 output.3 output.6 output.9 [sm4082@login-0-0 ~]$ cat output.9 Input data file for an array 9 Job Name is arraytest-9 [sm4082@login-0-0 ~]$


By default, PBS doesn't support arrays with step size. On our clusters, it's been achieved with a wrapper. This option might not be there on clusters at other organizations/schools that use PBS/Torque.


If you're trying to submit jobs through ssh to login nodes from your pbs scripts with statement such as ?
ssh login-0-0 "cd ${PBS_O_WORKDIR};`which qsub` -t 0-10:3 <pbs.script>"

arrays with step size wouldn't work unless you either add

shopt -s expand_aliases

to your pbs script that's in bash or add this to your .bashrc in your home directory. Adding this makes alias for qsub come into effect there by making wrapper act on command line options to qsub (For that matter this brings any alias to effect for commands executed via SSH).

If you have

#PBS -t 0-10:3

in your pbs script you don't need to add this either to your pbs script or to your .bashrc in your home directory.

A List of Input Files/Pulling data from the ith line of a file

Suppose we have a list of 1000 input files, rather than input files explicitly indexed by suffix, in a file file_list.text one per line:

A List of Input Files/Pulling data from the ith line of a file ?
[sm4082@login-0-2 ~]$ cat array.list #!/bin/bash #PBS -S /bin/bash #PBS -l nodes=1:ppn=1,walltime=1:00:00 INPUT_FILE=` awk "NR==$PBS_ARRAYID" file_list.text` # # ...or use sed: # sed -n -e "${PBS_ARRAYID}p" file_list.text # # ...or use head/tail # $(cat file_list.text | head -n $PBS_ARRAYID | tail -n 1) ./executable < $INPUT_FILE

In this example, the '-n' option suppresses all output except that which is explicitly printed (on the line equal to PBS_ARRAYID).

qsub -t 1-1000 array.list

Let's say you have a list of 1000 numbers in a file, one number per line. For example, the numbers could be random number seeds for a simulation. For each task in an array job, you want to get the ith line from the file, where i equals PBS_ARRAYID, and use that value as the seed. This is accomplished by using the Unix head and tail commands or awk or sed just like above.

A List of Input Files/Pulling data from the ith line of a file ?
[sm4082@login-0-2 ~]$ cat array.seed #!/bin/bash #PBS -S /bin/bash #PBS -l nodes=1:ppn=1,walltime=1:00:00 SEEDFILE=~/data/seeds SEED=$( cat $SEEDFILE | head -n $PBS_ARRAYID | tail -n 1) ~/programs/executable $SEED > ~/results/output.$PBS_ARRAYID
qsub -t 1-1000 array.seed 

You can use this trick for all sorts of things. For example, if your jobs all use the same program, but with very different command-line options, you can list all the options in the file, one set per line, and the exercise is basically the same as the above, and you only have two files to handle (or 3, if you have a perl script generate the file of command-lines).

Delete all jobs in array

We can delete all the jobs in array with a single command.

Deleting array of jobs ?
$ qsub -t 2-5 array.pbs 5534020[].hpc0. local $ qstat -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 5534020[].hpc0.l sm4082 ser2 arraytest 1 1 -- 00:05 R -- $ qdel 5534020[] $ qstat -u $USER $

Delete a single job in array

Delete a single job in array, e.g. number 4,5 and 7

Deleting a single job in array ?

$ qsub -t 0-8 array.pbs 5534021[].hpc0. local $ qstat -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time ----------- -- ---- ---------- ---- ---- -- ----- --- - --- 5534021[].hpc0.l sm4082 ser2 arraytest 1 1 -- 00:05 Q -- $ qstat -t -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 5534021[0].hpc0. sm4082 ser2 arraytest-0 26618 1 1 -- 00:05 R -- 5534021[1].hpc0. sm4082 ser2 arraytest-1 14271 1 1 -- 00:05 R -- 5534021[2].hpc0. sm4082 ser2 arraytest-2 14304 1 1 -- 00:05 R -- 5534021[3].hpc0. sm4082 ser2 arraytest-3 14721 1 1 -- 00:05 R -- 5534021[4].hpc0. sm4082 ser2 arraytest-4 14754 1 1 -- 00:05 R -- 5534021[5].hpc0. sm4082 ser2 arraytest-5 14787 1 1 -- 00:05 R -- 5534021[6].hpc0. sm4082 ser2 arraytest-6 10711 1 1 -- 00:05 R -- 5534021[7].hpc0. sm4082 ser2 arraytest-7 10744 1 1 -- 00:05 R -- 5534021[8].hpc0. sm4082 ser2 arraytest-8 9711 1 1 -- 00:05 R -- $ qdel 5534021[4] $ qdel 5534021[5] $ qdel 5534021[7] $ qstat -t -u $USER hpc0. local : Req 'd Req' d Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 5534021[0].hpc0. sm4082 ser2 arraytest-0 26618 1 1 -- 00:05 R -- 5534021[1].hpc0. sm4082 ser2 arraytest-1 14271 1 1 -- 00:05 R -- 5534021[2].hpc0. sm4082 ser2 arraytest-2 14304 1 1 -- 00:05 R -- 5534021[3].hpc0. sm4082 ser2 arraytest-3 14721 1 1 -- 00:05 R -- 5534021[6].hpc0. sm4082 ser2 arraytest-6 10711 1 1 -- 00:05 R -- 5534021[8].hpc0. sm4082 ser2 arraytest-8 9711 1 1 -- 00:05 R -- $ qstat -t -u $USER $

[Feb 08, 2017] Sge,Torque, Pbs WhatS The Best Choise For A Ngs Dedicated Cluster

Feb 08, 2017 |
Question: Sge,Torque, Pbs : What'S The Best Choise For A Ngs Dedicated Cluster ? 11 gravatar for abihouee 4.4 years ago by abihouee • 110 abihouee • 110 wrote:

Sorry, it may be off topics...

We plan to install a scheduler on our cluster (DELL blade cluster over Infiniband storage on Linux CentOS 6.3). This cluster is dedicated to do NGS data analysis.

It seems to me that the most current is SGE, but since Oracle bougth the stuff, there are several alternative developments ( OpenGridEngine , SonGridEngine , Univa Grid Engine ...)

An other possible scheluler is Torque / PBS .

I' m a little bit lost in this scheduler forest ! Is there someone with any experiment on this or who knows some existing benchmark ?

Thanks a lot. Audrey

next-gen analysis clustering • 15k views ADD COMMENT • link • modified 2.1 years ago by joe.cornish826 4.4k • written 4.4 years ago by abihouee • 110 2

I worked with SGE for years at a genome center in Vancouver. Seemed to work quite well. Now I'm at a different genome center and we are using LSF but considering switching to SGE, which is ironic because we are trying to transition from Oracle DB to PostGres to get away from Oracle... SGE and LSF seemed to offer similar functionality and performance as far as I can tell. Both clusters have several 1000 cpus.

ADD REPLY • link modified 4.3 years ago • written 4.3 years ago by Malachi Griffith 14k 1

openlava ( source code ) is an open-source fork of LSF that while lacking some features does work fairly well.

ADD REPLY • link written 2.1 years ago by Malachi Griffith 14k 1

Torque is fine, and very well tested; either of the SGE forks are widely used in this sort of environment, and has qmake, which some people are very fond of. SLURM is another good possibility.

ADD REPLY • link modified 2.1 years ago • written 2.1 years ago by Jonathan Dursi • 250 10 gravatar for matted 4.4 years ago by matted 6.3k Boston, United States matted 6.3k wrote:

I can only offer my personal experiences, with the caveat that we didn't do a ton of testing and so others may have differing opinions.

We use SGE, which installs relatively nicely on Ubuntu with the standard package manager (the gridengine-* packages). I'm not sure what the situation is on CentOS.

We previously used Torque/PBS, but the scheduler performance seemed poor and it bogged down with lots of jobs in the queue. When we switched to SGE, we didn't have any problems. This might be a configuration error on our part, though.

When I last tried out Condor (several years ago), installation was quite painful and I gave up. I believe it claims to work in a cross-platform environment, which might be interesting if for example you want to send jobs to Windows workstations.

LSF is another option, but I believe the licenses cost a lot.

My overall impression is that once you get a system running in your environment, they're mostly interchangeable (once you adapt your submission scripts a bit). The ease with which you can set them up does vary, however. If your situation calls for "advanced" usage (MPI integration, Kerberos authentication, strange network storage, job checkpointing, programmatic job submission with DRMAA, etc. etc.), you should check to see which packages seem to support your world the best.

ADD COMMENT • link written 4.4 years ago by matted 6.3k 1

Recent versions of torque have improved a great deal for large numbers of jobs, but yes, that was a real problem.

I also agree that all are more or less fine once they're up and working, and the main way to decide which to use would be to either (a) just pick something future users are familiar with, or (b) pick some very specific things you want to be able to accomplish with the resource manager/scheduler and start finding out which best support those features/workflows.

ADD REPLY • link written 2.1 years ago by Jonathan Dursi • 250 4 gravatar for Jeremy Leipzig 4.4 years ago by Jeremy Leipzig 16k Philadelphia, PA Jeremy Leipzig 16k wrote:

Unlike PBS, SGE has qrsh , which is a command that actually run jobs in the foreground, allowing you to easily inform a script when a job is done. What will they think of next?

This is one area where I think the support you pay for going commercial might be worthwhile. At least you'll have someone to field your complaints.

ADD COMMENT • link modified 2.1 years ago • written 4.4 years ago by Jeremy Leipzig 16k 2

EDIT: Some versions of PBS also have qsub -W block=true that works in a very similar way to SGE qsrh.

ADD REPLY • link modified 4.4 years ago • written 4.4 years ago by Sean Davis 22k

you must have a newer version than me

>qsub -W block=true 
qsub: Undefined attribute  MSG=detected presence of an unknown attribute
>qsub --version
version: 2.4.11

ADD REPLY • link modified 4.4 years ago • written 4.4 years ago by Jeremy Leipzig 16k

For Torque and perhaps versions of PBS without -W block=true, you can use the following to switches. The behaviour is similar but when called, any embedded options to qsub will be ignored. Also, stderr/stdout is sent to the shell.

qsub -I -x
ADD REPLY • link modified 16 months ago • written 16 months ago by matt.demaere • 0 1

My answer should be updated to say that any DRMAA-compatible cluster engine is fine, though running jobs through DRMAA (e.g. Snakemake --drmaa ) instead of with a batch scheduler may anger your sysadmin, especially if they are not familiar with scientific computing standards.

using qsub -I just to get a exit code is not ok

ADD REPLY • link written 2.1 years ago by Jeremy Leipzig 16k

Torque definitely allows interactive jobs -

qsub -I

As for Condor, I've never seen it used within a cluster; it was designed back in the day for farming out jobs between diverse resources (e.g., workstations after hours) and would have a lot of overhead for working within a homogeneous cluster. Scheduling jobs between clusters, maybe?

ADD REPLY • link modified 2.1 years ago • written 2.1 years ago by Jonathan Dursi • 250 4 gravatar for Ashutosh Pandey 4.4 years ago by Ashutosh Pandey 10k Philadelphia Ashutosh Pandey 10k wrote:

We use Rocks Cluster Distribution that comes with SGE.

ADD COMMENT • link written 4.4 years ago by Ashutosh Pandey 10k 1

+1 Rocks - If you're setting up a dedicated cluster, it will save you a lot of time and pain.

ADD REPLY • link written 4.3 years ago by mike.thon • 30

I'm not a huge rocks fan personally, but one huge advantage, especially (but not only) if you have researchers who use XSEDE compute resources in the US, is that you can use the XSEDE campus bridging rocks rolls which bundle up a large number of relevant software packages as well as the cluster management stuff. That also means that you can directly use XSEDEs extensive training materials to help get the cluster's new users up to speed.

ADD REPLY • link written 2.1 years ago by Jonathan Dursi • 250 3 gravatar for samsara 4.3 years ago by samsara • 470 The Earth samsara • 470 wrote:

It has been more than a year i have been using SGE for processing NGS data. I have not experienced any problem with it. I am happy with it. I have not used any other scheduler except Slurm few times.

ADD COMMENT • link written 4.3 years ago by samsara • 470 2 gravatar for richard.deborja 2.1 years ago by richard.deborja • 80 Canada richard.deborja • 80 wrote:

Used SGE at my old institute, currently using PBS and I really wish we had SGE on the new cluster. Things I miss the most, qmake and the "-sync y" qsub option. These two were completely pipeline savers. I also appreciated the integration of MPI with SGE. Not sure how well it works with PBS as we currently don't have it installed.

ADD COMMENT • link written 2.1 years ago by richard.deborja • 80 1 gravatar for joe.cornish826 2.1 years ago by joe.cornish826 4.4k United States joe.cornish826 4.4k wrote:

NIH's Biowulf system uses PBS, but most of my gripes about PBS are more about the typical user load. PBS always looks for the next smallest job, so your 30 node run that will take an hour can get stuck behind hundreds (and thousands) of single node jobs that take a few hours each. Other than that it seems to work well enough.

In my undergrad our cluster (UMBC Tara) uses SLURM, didn't have as many problems there but usage there was different, more nodes per user (82 nodes with ~100 users) and more MPI/etc based jobs. However, a grad student in my old lab did manage to crash the head nodes because we were rushing to rerun a ton of jobs two days before a conference. I think it was likely a result of the head node hardware and not SLURM. Made for a few good laughs.

ADD COMMENT • link modified 2.1 years ago • written 2.1 years ago by joe.cornish826 4.4k 2

"PBS always looks for the next smallest job" -- just so people know, that's not something inherent to PBS. That's a configurable choice the scheduler (probably maui in this case) makes, but you can easily configure the scheduler so that bigger jobs so that they don't get starved out by little jobs that get "backfilled" into temporarily open slots.

ADD REPLY • link written 2.1 years ago by Jonathan Dursi • 250

Part of it is because Biowulf looks for the next smallest job but also prioritizes by how much cpu time a user has been consuming. If I've run 5 jobs with 30x 24 core nodes each taking 2 hours of wall time, I've used roughly 3600 CPU hours. If someone is using a single core on each node (simple because of memory requirements), they're basically at a 1:1 ratio between wall and cpu time. It will take a while for their CPU hours to catch up to mine.

It is a pain, but unlike math/physics/etc there are fewer programs in bioinformatics that make use of message passing (and when they do, they don't always need low-latency ICs), so it makes more sense to have PBS work for the generic case. This behavior is mostly seen on the ethernet IC nodes, there's a much smaller (245 nodes) system set up with infiniband for jobs that really need it (e.g. MrBayes, structural stuff).

Still I wish they'd try and strike a better balance. I'm guilty of it but it stinks when the queue gets clogged with memory intensive python/perl/R scripts that probably wouldn't need so much memory if they were written in C/C++/etc.

[Mar 02, 2016] Son of Grid engine version 8.1.9 is availble

Mar 02, 2016 |


This is Son of Grid Engine version v8.1.9.

See <> for information on recent changes. See <> for more information.

The .deb and .rpm packages and the source tarball are signed with PGP key B5AEEEA9.

* sge-8.1.9.tar.gz, sge-8.1.9.tar.gz.sig:  Source tarball and PGP signature

* RPMs for Red Hat-ish systems, installing into /opt/sge with GUI
  installer and Hadoop support:

  * gridengine-8.1.9-1.el5.src.rpm:  Source RPM for RHEL, Fedora

  * gridengine-*8.1.9-1.el6.x86_64.rpm:  RPMs for RHEL 6 (and
    CentOS, SL)

  See < > for
  hwloc 1.6 RPMs if you need them for building/installing RHEL5 RPMs.

* Debian packages, installing into /opt/sge, not providing the GUI
  installer or Hadoop support:

  * sge_8.1.9.dsc, sge_8.1.9.tar.gz:  Source packaging.  See
    <> , and see
    <  > if you need (a more
    recent) hwloc.

  * sge-common_8.1.9_all.deb, sge-doc_8.1.9_all.deb,
    sge_8.1.9_amd64.deb, sge-dbg_8.1.9_amd64.deb: Binary packages
    built on Debian Jessie.

* debian-8.1.9.tar.gz:  Alternative Debian packaging, for installing
  into /usr.

* arco-8.1.6.tar.gz:  ARCo source (unchanged from previous version)

* dbwriter-8.1.6.tar.gz:  compiled dbwriter component of ARCo
  (unchanged from previous version)

More RPMs (unsigned, unfortunately) are available at < >.

[Nov 08, 2015] 2013 Keynote: Dan Quinlan: C++ Use in High Performance Computing Within DOE: Past and Future At 31 min there is an interesting slide that gives some information about the scale of system in DOE. Current system has 18,700 News system will have 50K to 500K nodes, 32 core per node (power consumption is ~15 MW equal to a small city power consumption). The cost is around $200M
Jun 09, 2013 | YouTube


[Jan 30, 2014] 12 Best HPC Blogs To Follow

January 30, 2014 |

News Blogs

  1. HPCwire ( - while not strictly a blog, HPCwire is a great source of short articles covering HPC news and opinion pieces written by their professional journalists. A handy feature is their independent RSS feeds that let you keep abreast of specific topics.
  2. InsideHPC ( - is another reliable source of HPC industry news. While they cover many of the same stories as HPCwire, insideHPC often brings a different perspective.
  3. HPCinthecloud ( - is the sister site of HPCwire that focus on covering high-end cloud computing in science, industry and the data center. It's a good source of news, ideas, and inspiration if you have an interest in combining HPC and Cloud.
  4. The Register HPC ( - Brings HPC news & opinions from around the world. Their card-like interface makes it easy to scan for stories that interest you.

Vendor Blogs

  1. Cray Computing Blog ( - Cray has been an important name in supercomputing since - well, since forever - and their blog reflects that heritage. You'll find long, thoughtful posts on a range of supercomputing topics there.
  2. High Performance Computing (HPC) at Dell ( - Dell plays a big role in today's HPC market, with Dell servers bearing the load for many a compute cluster. Their blog tends to be news-oriented, but it also contains a number of good thought leadership pieces too.
  3. Cisco HPC Blog ( - Cisco is a relative newcomer to HPC but their lead HPC blogger, Jeff Squyres, brings a veteran's passion to the task. This blog tends of focus on interesting technical aspects of HPC.
  4. Altair ( - The folks at Altair have a very active blog that covers a range of HPC-related topics. Sometimes it's news about Altair, but often they cover industry news too, and interesting topics such as, "Multiphysics: Towards the Perfect Golf Swing" which talks about golf as an engineering problem. Fun. The modern layout of their page makes it a pleasure to scan.

Other Blogs

  1. Forrester HPC ( - It's not a very active blog, and definitely not the place to look for news, but if you're looking for a straightforward analytic viewpoint on HPC, Forrester's blog is a good place to look.
  2. ISC HPC Blog ( - This community blog is hosted by the folks that put on the International Supercomputing Conference. It's a good source for thought-provoking articles about the science of supercomputing.
  3. HPC Notes ( is a good source for news and information about HPC. It's run by the Vice President of HPC at a consulting firm, but his coverage makes a point of being independent.
  4. Marc Hamilton's Blog ( - This is a personal blog that captures Marc Haminton's interests in HPC and….well…running. So while it's not strictly an HPC blog, I've included it here because his HPC-related posts can be interesting and varied. Just be prepared to skip over his occasional posts about running shoes.

There you have it. That's our list of go-to blogs about HPC. How does it line up with yours? Did I miss any good ones? Are there some on our list you think don't deserve to be there? Let me know.

[Jan 14, 2014] HPC Lessons for the Wider Enterprise World

Is HPC so specialized that the lessons learned from large-scale infrastructure (at all layers) are not transferrable to mirrored challenges in large-scale enterprise settings?

Put another way, are the business-critical problems that companies tackle really so vastly different than the associated hardware and software issues that large supercomputing centers have already faced and in many areas, overcome? Granted, there is already a significant amount of HPC to be found in enterprise datacenters worldwide in a number of areas-oil and gas, financial services, the life sciences, government and more. But as everything in technology seems bent on convergence, is there not a wider application for HPC-driven technologies in an expanding set of markets?

This is the first part of a series of focused pieces around these framing questions about HPC's map into the wider world. The sections of our extended special feature will target HPC-to-enterprise lessons in terms of hardware and infrastructure; software and applications; management at scale; cloud computing; big data; accelerators and more. But to kick things off, we wanted to build consensus around some of the main themes and ideas behind any movement that's happening (or needs to) as HPC lessons trickle into the scale, efficiency, performance and data-conscious world of the modern enterprise.

In some circles, HPC is viewed from afar as an academic-only landscape, dotted with rare peaks representing actual enterprise use. Of course, those inside supercomputing know that this portrait is limited-that HPC has a strong foothold in the areas mentioned above, and tremendous potential to reshape new areas that either thought HPC was out of reach or are using HPC but simply don't use the term. What is needed is a comprehensive view of how HPC can be broadly useful to critical segments enterprise IT…and that's what we intend to offer over the next couple of weeks.

The answer about whether or not there are a multitude of lessons HPC can teach the wider enterprise world, at least according to those we've spoken with for our the series on this subject, is resounding and positive. If there's any disagreement, it's on how those lessons translate, which are truly unique in the HPC experience, and of course, which hold the most promise for improved productivity, competitiveness or even application area.

Addison Snell, CEO of Intersect360 Research, whose research group follows the overlap between enterprise and HPC, made some parallels to put the question in context. "Traditionally, one of the characteristics that separated HPC from enterprise computing was that HPC featured jobs that would run to completion, and there would be a benefit in completing them faster, such as running a weather forecast, simulating a crash test, or searching for proteins that fit together with a given molecule." However, he says by contrast, enterprise environments are designed to run in steady state (email systems, CRM databases, etc.). "HPC purchases would tend to be driven by performance, with relatively faster adoption of new technologies, while enterprise computing was driven by reliability and new technology adoption with slower technology adoption."

"Early adopters and bellwethers in high performance computing are always the first to encounter new challenges as they push the limits of computation and data management," Herb Schultz from IBM's Technical Computing and Analytics group argued. He says that many of the challenges faced in the world of high performance computing "later come to haunt the broader commercial IT community." "How first movers respond to challenges with new technologies and improved techniques establishes a proven foundation that the next waves of users can exploit."

As Fritz Ferstl, CTO at Univa told us, there are essentially three "divisions" of in the HPC industry. There are the national labs and big science organizations; enterprise commercial HPC (as found in the expected verticals, including oil and gas, financial services, life sciences, etc.); and there is "a third not often recognized as HPC but rather as data-centric analysis, also known as big data."

Ferstl says that while the lab-level HPC category is "specific in that its leading edge requires tightly coupled architectures with the densest network interconnects, which drive up cost and complexity. They are geared toward running few ultra-large applications that demand aggregate memory and would take unacceptable amounts of runtime if not executed on such large systems." One step away from this is the commercial sectors that rely on HPC for their competitive edge. Of these, Ferstl notes whether its new reservoirs of oil and gas being explored, next generation products like cars or airplanes being designed and tested, or innovative drugs being discovered, "there would be no progress in any of these cases and many more if it wasn't for HPC as a key instrument for investigation, design, development, experimentation and validation."

But final on his list-and crucial to the enterprise transition (and HPC's lessons to teach it) is the heavy subject of data. What's really driving this forward motion of HPC tech into the enterprise is that buzzword we just can't get away from these days. Some might argue that the trend has actually been one of the best things that's happened for HPC's ability to propel into the wider enterprise world.

Snell commented that, "today, especially with big data analytics, more companies are encountering performance-sensitive applications that run to completion-at least in terms of iterations." He said his research has revealed that new categories of non-HPC enterprise users are emerging, all of whom are considering performance and scalability as top purchase criteria. "In some cases," he said, "these enterprises can be just as likely to explore new technologies as HPC users have been for years."

Some argue that in general, aside from being a question of data pressures, business need, and competitive edge, the real lessons HPC can teach are about talent and R&D capability. As Paul Dlugosch, Automata product director at Micron described, "One of the first lessons that come to mind is that people matter. While the HPC industry often celebrates our accomplishments on the basis of technical and performance benchmarks, the cost of achieving those benchmarks are often not discussed. The cost of system and semiconductor development can be easy enough to quantify. It is far more difficult, though, to determine the 'use' cost of advanced technologies. "While the raw power of our semiconductors and systems is immense it is the organic part of the system, the human being– that is emerging as a significant bottleneck," said Dlugosch.

"Fully exploiting the parallelism that exists in many high performance computing systems continues to absorb incredible amounts of human resources," he argued. "Given the large scale of commercial/enterprise data centers, it is just as important to pay close attention to this human factor. The HPC industry is certainly aware of this problem and is developing new architectures, tools and methodologies to improve human productivity. As commercial and enterprise data centers grow in capability and scale it will become just as important to consider the productivity of the humans involved in system programming, management and scaling."

It should be noted that on any level of this question, it's not a clear matter of teaching from the top to bottom. While HPC has solved a number of problems in some of the most challenging data and compute environment, especially in terms of scale, data movement, application complexity and elsewhere, there are elements that can filter from the enterprise setting to HPC-even that "big national lab" variety Ferstl describes.

There is general agreement that there are multiple lessons that high performance computing can carry into mainstream enterprise environments, no matter what vertical is involved. But on the flipside, there has been general agreement that many innovations are spinning out of the new class of enterprise environments-that the web scale companies with their bare-bones hardware running open source, natively developed, and purpose-built, nimble applications-have something to offer the supercomputing world as well.

Jason Stowe, CEO of HPC cloud company, Cycle Computing put it best when he told us, "We in HPC pay attention to the fastest systems in the world: the fastest CPUs, interconnects, and benchmarks. From petaflops to petabytes, we [in HPC] publish and analyze these numbers unlike any other industry…While we'll continue to measure things like LINPACK, utilization, and queue wait times, we're now looking at things like Dollars per Unit Science, and Dollar per Simulation, which ironically, are lessons that has been learned from enterprise."

From the people who power both enterprise and HPC systems to the functional elements of the machines and how they differ, there are just as many new questions that emerge from the first-what can HPC lend to large-scale business operations?

Stay tuned over the next two weeks as this series expands and hones in on specific issues and topics that influence how enterprises will look to HPC for answers to solving scale, data, management and other challenges.

[Dec 17, 2013] Cambridge U Deploys UK's Fastest Academic-Based Supercomputer By Leila Meyer

12/11/13 |

The University of Cambridge in England has deployed the fastest academic-based supercomputer in the United Kingdom as part of the new Square Kilometer Array (SKA) Open Architecture Lab, a multinational organization that is building the world's largest radio telescope.

The university built the new supercomputer, named Wilkes, in partnership with Dell, NVIDIA, and Mellanox. The system consists of 128 Dell T620 servers and 256 NVIDIA K20 GPUs (graphics processing units) connected by 256 Mellanox Connect IB cards. The system has a computational performance of 240 teraFLOPS (floating-point operations per second) and ranked 166th on the November 2013 Top500 list of supercomputers.

The Wilkes system also has a performance of 3,631 megaFLOPS per watt and ranked second in the November 2013 Green500 list that ranks supercomputers by energy efficiency. According to the university, this extreme energy efficiency is the result of the very high performance per watt provided by the NVIDIA K20 GPUs and the energy efficiency of the Dell T620 servers.

The system uses Mellanox's FDR InfiniBand solution as the interconnect. The dual-rail network was built using Mellanox's Connect-IB adapter cards, which provide throughput of 100 gigabits per second (Gbps) with a message rate of 137 million messages per second. The system also uses NVIDIA RDMA communication acceleration to significantly increase the systems' parallel efficiency.

The Wilkes supercomputer is partly funded by the Science and Technology Facilities Council (STFC) to drive the Square Kilometer Array computing system development in the SKA Open Architecture Lab. According to Gilad Shainer, vice president of marketing at Mellanox, the supercomputer will "enable fundamental advances in many areas of astrophysics and cosmology."

The Cambridge High Performance Computing Service (HPCS) is home to another supercomputer, named Darwin, which ranked 234th on the November 2013 Top500 list of supercomputers.

Recommended Links

Softpanorama hot topic of the month

Softpanorama Recommended

Top articles


Get products and technologies



Other Cockcroft columns at


System Performance Tuning

Oracle and Unix Performance Tuning ~ Usually ships in 24 hours
Ahmed Alomari / Paperback / Published 1997
Amazon price: $35.96 ~ You Save: $8.99 (20%)
Aix Performance Tuning ~ Usually ships in 2-3 days
Frank Waters / Paperback / Published 1996
Amazon price: $63.00
Optimizing Unix for Performance ~ Usually ships in 24 hours
Amir H. Majidimehr / Paperback / Published 1995
Amazon price: $40.00
Solaris Performance Administration : Performance Measurement, Fine Tuning, and Capacity Planning for Releases 2.5.1 and 2.6 ~ Usually ships in 24 hours
H. Frank Cervone / Paperback / Published 1998
Amazon price: $35.96 ~ You Save: $8.99 (20%)
Sun Performance and Tuning : Java and the Internet ~ Usually ships in 24 hours
Adrian Cockcroft, et al / Paperback / Published 1998
Amazon price: $40.80 ~ You Save: $10.20 (20%)
System Performance Tuning (Nutshell Handbooks) ~ Usually ships in 2-3 days
Michael Kosta Loukides, Mike Loukides / Paperback / Published 1991
Amazon price: $23.96 ~ You Save: $5.99 (20%)
UNIX Performance Tuning; Sys Admin-Essential Reference Series ~ Usually ships in 2-3 days
Sys Admin Magazine(Editor) / Paperback / Published 1997
Amazon price: $23.96 ~ You Save: $5.99 (20%)
Hp-Ux Tuning and Performance : Concepts, Tools and Methods (Hewlett-Packard Professional Books)
Robert F. Sauers, Peter Weygant / Paperback / Published 1999
Amazon price: $45.00 (Not Yet Published -- On Order)
Sun Performance and Tuning : Sparc & Solaris
Adrian Cockcroft / Paperback / Published 1994
(Publisher Out Of Stock)
Taming UNIX : UNIX Performance Management Series
Robert A. Lund / Spiral-bound / Published 1997
Amazon price: $59.95 (Special Order)


FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available in our efforts to advance understanding of environmental, political, human rights, economic, democracy, scientific, and social justice issues, etc. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit exclusivly for research and educational purposes.   If you wish to use copyrighted material from this site for purposes of your own that go beyond 'fair use', you must obtain permission from the copyright owner. 

ABUSE: IPs or network segments from which we detect a stream of probes might be blocked for no less then 90 days. Multiple types of probes increase this period.  


Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers :   Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism  : The Iron Law of Oligarchy : Libertarian Philosophy


War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda  : SE quotes : Language Design and Programming Quotes : Random IT-related quotesSomerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose BierceBernard Shaw : Mark Twain Quotes


Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 :  Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method  : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law


Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds  : Larry Wall  : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOSProgramming Languages History : PL/1 : Simula 67 : C : History of GCC developmentScripting Languages : Perl history   : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-MonthHow to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least

Copyright © 1996-2016 by Dr. Nikolai Bezroukov. was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License.

The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

You can use PayPal to make a contribution, supporting development of this site and speed up access. In case is down you can use the at


The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: February 13, 2018