Bright Cluster Manager
While it is called a cluster manager, this is essentially a fairly generic Linux configuration management
system with a cluster tilt. It allows "bare metal" reimaging of nodes from
a set of images (servers can be enrolled into groups, each of which can be assigned an image; if
a node is not assigned to any group, the default image is used). Images are typically stored on the head node
or on a special provisioning server (for really large clusters). There can be multiple images, one for each type
of node.
This is a distribution-agnostic tool that can
support heterogeneous hardware, not only HPC clusters. It allows the deployment of any kind of
Linux distribution (standard or customized) to any kind of target machine.
Besides HPC clusters, suitable deployments might include computer labs and render farms.
Like SystemImager, Bright CM works with file-based (rather than block-based)
system images using rsync. An image is stored as a directory hierarchy of files
representing a snapshot of a machine, containing all the files and directories from the root of that
machine's file system. Images can be acquired in multiple ways, including retrieval from a sample
system.
One method of image creation is using a pre-installed node (the
golden client). This way the user can customize and tweak the golden client's configuration
to his needs and verify its proper operation, which helps to ensure that the image, once deployed,
will behave the same way as the golden client. Incremental updates are possible by syncing an updated
golden client to the image, then syncing that image to deployed machines.
Images are hosted in a central repository on the head node of the cluster or on a special server called the image server,
and they are distributed to the nodes using rsync.
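To illustrate the file-based approach, synchronizing a node from such an image directory conceptually boils down to an rsync of the tree onto the node's root filesystem. The following is only a conceptual sketch (Bright's own node-installer uses its own rsync invocation and exclude lists), with placeholder image path and node name:
# Conceptual illustration only: push the image tree to a node's root filesystem,
# excluding pseudo-filesystems
rsync -aHx --delete --exclude=/proc --exclude=/sys --exclude=/dev \
    /cm/images/default-image/ root@node001:/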
This is commercial software developed by Bright Computing. The development office is in Amsterdam,
NL, and ING Bank is a backing shareholder.
Along with supplying a valuable (albeit complex) package, they also provide value as
pretty capable packagers of important software such as workload managers (SGE, Torque, PBS Pro), etc. This part
of their added value should probably be taken into account when you purchase the license, because
the core imaging functionality is almost identical to that of SystemImager (which became abandonware in 2015).
With Red Hat introducing Red Hat for HPC (later renamed Red Hat for Scientific Computing), the classic compute node
license model no longer works. This means that if a node has this flavor of RHEL installed,
you need to customize each node, as registration is now separate for each node. The idea Bright Cluster Manager
relied upon, working with DHCP and identical images for all nodes, is broken. You also can't patch the image on the head node via chroot, because your nodes are now
licensed for a different distribution, which is only a subset of Enterprise Linux, as if it were a different flavor
of the OS.
This problem is not fatal -- you can always create the image from a node instead (treating
the selected node as a golden image) -- but this is a
new situation: now you need to preserve licensing information for each and every node separately and
populate it as an additional step after the installation.
Like many complex Unix management systems, Bright CM modifies many system files in ways you do not understand,
which makes integration of new software more complex and troubleshooting almost impossible.
Look, for example, at the definition of a parallel environment
in SGE: it contains references to some CM scripts. The SGE environment is loaded via environment modules which are
also kept in CM directories. Changing certain parameters in SGE (for example the default number of
cores on a node) actually requires changing them in CM too. It is unclear why, but that's
how it is. See also the Bright knowledge base article
"Why is my workload manager configuration changed by Bright?"
You can learn some useful stuff from those modifications, but they create unique troubleshooting
problems: you need to guess that CM is the culprit, which is not obvious in the case
I described. I, for example, guessed it only because the configuration reverted to its previous behavior
after I manually corrected the wrong core count in the SGE configuration.
If you manage a small cluster (let's say fewer than 50 nodes), at some point you might ask yourself
whether the game is worth the candle. The key attraction of CM -- the ability to seamlessly
restore a compute node from an image -- can be implemented in several other ways. Beyond that, Bright
Cluster Manager does not provide any indispensable functionality for small clusters.
It has three typical problems inherent in such systems:
- It creates an additional, pretty complex layer that obstructs viewing and understanding the lower
layers. This is especially important for troubleshooting, which is badly affected if you
need to debug issues that touch CM functionality. For example, after CM is installed on the
headnode you can't change the hostname easily. Also, the default design, in which nodes use a dedicated
private network, is suboptimal in cases where you need to connect nodes to an external environment during computations
(unless you have an extra interface, which is often not the case for blades; you probably can use virtual
interfaces, though).
- It introduces a custom command language (cmsh) which, if you use it only episodically, is a pain in the
neck and can be used only on the basis of previously documented examples. As the language is not used often, you
do need a cheat sheet for the most typical commands (a minimal one is sketched after this list).
The commands are not intuitive and the syntax is sometimes pretty weird. For anything more complex than typical
operations, you depend on CM support, which, actually, is pretty good.
- Documentation is weak. There is typically no attempt to provide the most common usage
examples, or to explain the ideas behind a particular feature. There is too much focus on describing the
full capabilities of the system, which are admittedly impressive -- it can manage really big clusters,
including exotic features needed only or mainly for large clusters, etc. In this sense CM documentation
is really bad. The node provisioning chapter, the chapter about the feature that is the crown jewel
of the package, is poorly written and has no examples of typical usage. For example, there is no
way to figure out from it how to create a new image (hint: you first need to clone an
existing image and then overwrite it). No explanation of key ideas, no detailed examples of
typical usage. Nothing. Those guys simply do not understand the importance of documentation and instead
are engaged in a rat race of adding features (malignant featurism ;-)
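As a starting point for such a cheat sheet, here is a minimal set of cmsh one-liners for the most common operations, including the clone-then-modify way of creating a new image mentioned above. Node, category and image names are placeholders, and the exact syntax should be verified against the built-in help of your cmsh version:
# Show the state of all nodes
cmsh -c "device status"
# Power-cycle one node via its BMC
cmsh -c "device use node001; power reset"
# List available software images
cmsh -c "softwareimage list"
# Create a new image by cloning an existing one, then modify the clone
cmsh -c "softwareimage; clone default-image new-image; commit"
# Assign the new image to a node category
cmsh -c "category use default; set softwareimage new-image; commit"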
There are several interesting parts of Bright CM. Among them:
- Working with images. This is the most impressive part: Bright created a pretty elaborate
system for working with images.
- Different images can be assigned to groups of nodes.
- A different mode of synchronization with the image can be assigned to each node.
- An image can be created from any node, so instead of working with the image in chroot mode you
can work with a real node and recreate the image from it (see the sketch after this list).
- Schedulers offered by Bright (SGE, Torque, etc.) provide considerable additional value,
helping to offset the licensing cost of the manager itself. It is not that easy to
install Torque correctly (the RPMs are often junk) or SGE (Son of Grid Engine is the only usable free
distribution), and Bright provides some support for the versions it bundles with the cluster manager. They can also
be used in production environments outside typical clusters.
- Boot process. The idea of associating different ways of booting a server with different
behavior of the manager is pretty slick. For PXE mode you can additionally specify various ways
of synchronizing with the image.
- The ability to work directly with Dell DRAC. For example, you can reboot a server, or
a group of servers, directly from Bright CM, without the hassle of opening DRAC for each node.
- Environment modules integration, and provisioning of several free schedulers (SGE, Torque, etc.).
- A usable, well-designed GUI client based on Firefox. Many operations on nodes can be performed
with it instead of the less user-friendly cmsh. It has some useful monitoring capabilities and as such provides
additional value to the package.
- The ability to update Dell firmware automatically.
- The Bright cluster management daemon (CMDaemon) monitors several important aspects of the functioning of every
node and reports any problems it detects in the software or the hardware, so that you can take
action.
- Provisioning of Puppet.
- Power management capabilities.
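As an illustration of the image-from-node workflow mentioned in the list above, a minimal cmsh sketch might look like this. The node name is a placeholder, and the grabimage behavior and its -w option should be checked against the documentation of your Bright version:
# Pull the current filesystem of node001 back into the software image assigned to it
# (without -w this should be a dry run; -w actually writes the changes)
cmsh -c "device use node001; grabimage -w"
The updated image can then be assigned to other nodes or categories as in the cheat sheet above.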
Working with images is the most interesting part of Bright Cluster Manager. Few systems
implement it as consistently as CM. Here the designers demonstrated some original thinking (for example
the role of the boot record as an indicator of how the node should behave). You can create an image from
a node and distribute it to other nodes; CM takes care of all the customization needed. If a node is configured
for network boot (or if the boot record is absent), CM automatically reimages the node, or synchronizes it
with the image if the image is already installed. Otherwise you get a regular boot. That means that installing or removing
the boot record changes the behavior of a group of servers in a very useful way.
You can have multiple images, each stored as an actual directory tree of files.
Managing images is done using chroot and is not very convenient, but since it is possible to
create an image from a node, you can do everything on a selected node instead, then create an image from
that node and distribute it to the other nodes.
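If you do want to modify an image in place on the head node, the usual approach is to treat the image directory as an installation root. A minimal sketch, assuming the default image lives under /cm/images/default-image (adjust the path and package names to your installation):
# Install or update packages directly inside the image tree
yum --installroot=/cm/images/default-image install -y pdsh
# Or drop into the image for manual edits
chroot /cm/images/default-image /bin/bash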
Bright CM can also be used with non-cluster blade enclosures (sets of 16 or more real servers), if you
can justify the cost of the license. This is an interesting, expandable and scalable turnkey solution
for managing blades, which is especially attractive for Dell blades, as Bright interacts well with
DRAC. For example, for a web server farm, or for DMZ servers.
It also installs a lot of useful software, such as pdsh and environment modules. The latter come
with integrated example packages which can serve as a framework for developing your own set of environment
modules. Generally the environment modules supplied are of high quality.
By default, nodes boot from the network when using Bright Cluster Manager. This is called
a network boot, or sometimes a PXE boot. The head node (or some other node designated as a provisioning node) runs
a tftpd server from within xinetd, which supplies the boot loader for the default software image or for
the image assigned to the node.
You can also install a regular boot record on the node and use PXE boot only as needed;
Bright can provision and sync nodes only via PXE boot.
Aspects of power management in Bright Cluster Manager include:
- managing the main power supply to nodes through power distribution units, baseboard
management controllers, or CMDaemon (mainly with Dell hardware)
- monitoring power consumption over time
- setting CPU scaling governors for power-saving
- setting power-saving options in workload managers
- ensuring the passive head node can safely take over from the active head during failover
- allowing cluster burn tests to be carried out
This creates some opportunities for power savings, which is extremely important in large clusters. You
can, for example, shut down inactive nodes and bring them back when there are jobs in the queue waiting for
resources (see the sketch below).
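A minimal sketch of manual power control from cmsh, reusing the foreach syntax shown later on this page; the node range is a placeholder, BMC or PDU power control must already be configured, and automated shutdown based on queue state would normally be scripted around the workload manager:
# Power off a range of idle nodes out-of-band
cmsh -c "device foreach -n node010..node020 (power off)"
# Bring them back when jobs start waiting for resources
cmsh -c "device foreach -n node010..node020 (power on)"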
As clusters are often used by a large number of researchers, user management presents some problems.
Bright CM allows you (via the usernodelogin setting of cmsh) to restrict direct user logins from outside
the HPC scheduler, and is thus one way of preventing users from using node resources in an unaccountable
manner. The usernodelogin setting is applicable to node categories only, not to individual nodes.
# cmsh
[bright71]% category use default
[bright71->category[default]]% set usernodelogin onlywhenjob
[bright71->category*[default*]]% commit
The attributes for usernodelogin are:
- always (the default): This allows all users to ssh directly
into a node at any time.
- never: This allows no user other than root to directly
ssh into the node.
- onlywhenjob: This allows a user to ssh directly into
the node only when a job of that user is running on it.
Bright Cluster Manager runs its own LDAP service to manage users, rather than using Unix user and
group files. That means that users and groups are managed via the centralizing LDAP database server
running on the head node (accessible via cmgui), and not via entries in the /etc/passwd or /etc/group files.
You can use cmsh too, for example:
[root@bright71 ~]# cmsh
[bright71]% user
[bright71->user]%
[bright71->user]% add user maureen
[bright71->user*[maureen*]]%
[bright71->user*[maureen*]]% commit
[bright71->user[maureen]]% show
You can set user and group properties via the set command. Typing set and then either using tab
to see the possible completions, or following it up with the enter key, suggests several parameters
that can be set, one of which is password:
Example
[bright71->user[maureen]]% set
Name:
set - Set specific user or group property
Usage:
set <parameter>
set user <name> <parameter>
set group <name> <parameter>
You can edit groups with the append and removefrom commands. They are used to add extra
users to, and remove extra users from, a group. For example, it may be useful to have a compiler group
so that several users can share access to the Intel compiler (a sketch follows).
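A hedged sketch of such group editing, built by analogy with the 'set group <name> <parameter>' usage shown above; the group name is a placeholder and the exact command form and member property name are assumptions, so confirm them with tab completion in cmsh before use:
# Assumed syntax, by analogy with "set group <name> <parameter>"
cmsh -c "user; append group compilergroup members maureen; commit"
cmsh -c "user; removefrom group compilergroup members maureen; commit"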
Dell BIOS management in Bright Cluster Manager means that for nodes that run on Dell hardware, the
BIOS settings and BIOS firmware updates can be managed via the standard Bright front end utilities to
CMDaemon, cmgui and cmsh.
In turn, CMDaemon configures the BIOS settings and applies firmware updates to each node via a standard
Dell utility called racadm. The racadm utility is part of the Dell OpenManage software stack. The Dell
hardware supported includes the R430, R630, R730, R730XD, R930, FC430, FC630, FC830, M630, M830, and C6320.
The utility racadm must be present on the Bright Cluster Manager head node. The utility is installed
on the head node if Dell is selected as the node hardware manufacturer during Bright Cluster Manager
installation. IPMI must be working on all of the servers. This means that it should be possible to communicate
out-of-band from the head node to all of the compute nodes, via the IPMI IP address.
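A quick way to verify that out-of-band communication works is to query a node's BMC from the head node with ipmitool; the BMC IP address, user name and password below are placeholders for your own IPMI credentials:
# Check that the BMC of a compute node answers out-of-band
ipmitool -I lanplus -H 10.148.0.1 -U root -P <password> chassis status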
That's typical for complex software packages, but it is still pretty annoying. Truth be told,
cmsh has built-in help which partially compensates for the absence of detailed documentation on it, but the absence
of typical usage examples is really bad.
Important nuances are not mentioned. Generally this documentation is useful only in one case:
if you never read it and rely on CM customer support. If they point you to the documentation, just ignore
it; record how they solved the problem and create your own custom documentation from it. After several
tickets you will have a valuable private database.
They also run a knowledge base that might contain valuable information.
CM changes the behavior of some components, for example SGE, in a way that complicates troubleshooting.
In one case it enforced the wrong number of cores on the servers, and if you correct it in the SGE
all.q queue, after a while it reverts to the incorrect number.
If the initial configuration is incorrect, you are in trouble in more than one way. For example, with
SGE I noticed a very interesting bug: if your server has 24 cores but all.q was mistakenly configured initially
with the number of slots equal to 12, you are in trouble. You change it via the qconf
command in SGE and think that you are done. Wrong. After a while it reverts to the incorrect number.
At this moment you want to kill the CM designers, because they are clearly amateurs.
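Since CMDaemon keeps rewriting the SGE configuration, the slot count apparently has to be corrected on the Bright side as well as in SGE. A hedged sketch of both steps follows; the qconf call is standard SGE, but the role name (sgeclient) and its slots property are assumptions about how Bright exposes this setting, so verify them in cmsh first:
# Fix the slot count in SGE itself (may be reverted by CMDaemon later)
qconf -mattr queue slots 24 all.q
# Assumed Bright-side fix: update the client role so CMDaemon stops pushing the old value
cmsh -c "device use node001; roles; use sgeclient; set slots 24; commit"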
Another case I already mentioned: if a node does not have a boot record, it can be reimaged from
the image, and if there are differences between the current state of the node and the image, all those differences
are lost. In an ideal case there should not be any, but life is far from ideal.
NOTE: what follows is Microsoft-style advertising of the product. They present a nice GUI, but forget
to mention that the GUI is not everything and that you can't manage the whole cluster from it alone.
== quote ==
The sophisticated node provisioning and image management system in Bright Cluster Manager®
allows you to do the following:
- Install individual nodes or complete clusters from bare metal within minutes. This applies to
big data clusters and OpenStack private clouds in addition to HPC clusters.
- Create, manage and use as many node images as required.
- Create, manage and use images that are very different (for example, based on different Linux
kernels or distributions of Linux, Apache Hadoop and OpenStack).
- Create or change images substantially without breaking compatibility with application software.
- Assign images to individual nodes or groups of nodes with a single command or mouse click.
- Make changes to node images on the head node, without having to login to regular nodes.
- Synchronize a regular node image on the head node from a hard disk on a regular node.
- Apply RPM package commands
to node images, manually or automatically (for example, using
Yum).
- Update images incrementally, only transferring changes to the nodes.
- Update images live, without having to reboot nodes.
- Configure how disks should be partitioned (optionally using software
RAID and/or
LVM ).
- Protect disks or disk partitions from being overwritten.
- Provision images to memory and run nodes diskless.
- Use revision control to keep track of changes to node images.
- Return to a previously stored node image if and when required.
- Backup all node images by backing up only the head node.
- Automatically update BIOS images or change BIOS configurations without keyboard or console access
to the nodes.
Bright Computing engineers will be on hand to demonstrate all the 7.1 updates that enable customers
to deploy, manage, use, and maintain complete HPC clusters over bare metal or in the cloud even more
effectively. Leading the list of enhancements is fully integrated support for
Intel® Enterprise Edition for Lustre (IEEL), integrated Dell BIOS operations, and
open source
Puppet. Improved integration with several workload managers and a refactored Web portal round out
the exciting enhancements.
Those who need to deploy, use and maintain a POSIX-compliant parallel file system will find the integrated
IEEL support lets them do so efficiently and with the well-known Bright Cluster Manager interface. Fully
integrated support for Puppet ensures the right services are up and running on the right platforms,
through enforced configurations. With integrated support for Dell BIOS firmware and configuration settings,
users can deploy and maintain supported Dell servers from the BIOS level, using Bright's familiar interface.
Broader and deeper support for Slurm, Sun Grid Engine, and Univa Grid Engine ensures that Bright
Cluster Manager for HPC fully integrates the capability to optimally manage HPC workloads. Users can
quickly and easily monitor their HPC workloads through the updated web interface provided by Bright's
user portal. Version 7.1 also incorporates refactored internals for improved performance, as well as
finer-grained management control that includes customized kernels.
"We are excited to share the latest updates and enhancements we've made to Bright Cluster Manager
for HPC. Collectively, they further reduce the complexity of on-premise HPC and help our customers extend
their on-premise HPC environment into the cloud," said Matthijs van Leeuwen, Bright Computing Founder
and CEO. "The latest version allows our customers to manage their HPC environment alongside their platforms
for Big Data Analytics, based on Apache Hadoop and Apache Spark, from a single management interface."
For more information, visit
http://www.brightcomputing.com/Solutions-HPC
- 20170713: How do I know when a clone operation has completed?
- 20170713: How do I upgrade a Torque package?
- 20170713: How do I integrate a custom torque installation with Bright Cluster Manager?
- 20170713: Installing bacula installs pbspro, which I don't want. What should I do?
- 20170713: How do I add a Bright ISO as a YUM repository?
- 20170713: When does a node need to be restarted?
- 20170713: How can I flash update all my nodes (from Linux)?
- 20170619: How to easily install & configure the Torque-Maui open source scheduler in Bright, by Robert Stober (www.brightcomputing.com)
- 20170619: OpenStack Neutron Mellanox ML2 Driver Configuration in Bright
- 20170619: Bright Cluster Manager 7 for HPC - New
- 200102: How do I set up a local Bright repository?
We would like to know when exactly the clone of an image has
completed. This is so we can automate some image update and test processes. Ie: we clone an
image, apply updates to the clone, assign that updated image to a category, and reboot a
node for testing the updated image.
However, the current "clone/commit" process goes into the background. This makes
programmatically determining when it finished rather difficult. Can we make the commit of
an image clone wait for completion in the cmsh shell so our script will wait before
attempting to apply updates?
In 6.0 the --wait option to the commit command makes cmsh wait for any background task to complete.
A list of tasks that are waiting for completion can be seen with cmsh -A -c "task list"
For versions of BCM prior to 6.0, the following technique can be used:
The CMDaemon will not start the background copy operation if the target directory already
exists. So what you can do from a bash script is something like this:
cp -a /cm/images/default-image /cm/images/new-image
cmsh -c "softwareimage; clone default-image new-image; commit"
The first line guarantees the copy is done (the command exits only after the cp has finished). That means
that the second line does pretty much nothing except housekeeping, which lets CMDaemon know about new-image.
In particular, the clone in the second line, which would normally carry out the copy in the
background, doesn't do any copying because that was already done.
Applying updates to the images can then be carried out without needing to test whether the
clone has completed.
When does a node need to be restarted? Why does a node need to be restarted? Can I ignore it? How do I clear that status?
Can I ignore it?
Not really, unless you really know what you are doing. You can see if a node needs
restarting from the device status command (alias: ds):
In cmsh:
bright60% device status
apc01 .................... [ UP ] health check failed
devhp .................... [ UP ] health check failed
node001 .................. [UP ] restart-required
node002 .................. [ UP ] health check failed
Or from cmgui -> nodes[node001] -> hostname[state]: restart-required.
When does a node need to be restarted?
A restart-required flag is set when a commit is done on a node that changes the state of:
category/image/ip/hostname/diskSetup/pxelabel/initialize script/finalize script/install
boot record.
Similar rules apply for category and image commit.
These settings all have fields used by the node-installer.
It is possible to get false positives. For example adding a newline to a script will mark the
node as restart-required.
There are however potentially many things that can differ when changes are made, and no guarantee
that all settings from the new category have been applied until you reboot the node. The reason
why a restart-required message is there, is to warn you that the node may be in a weird state
(e.g., if moving a node from category B to a new category A, it may still be using the software
image that has been set for category B).
Why does a node need to be restarted?
The reason for the failure is often given within parentheses:
bright60% device status
node060 .................. [ UP ] (eth0 changed) restart-required
node061 .................. [ UP ] (category changed) restart-required
Sometimes the info message gives a clue on the reason for failure:
[bright60->device]% status node001
node001 .................. [ DOWN ] pingable, restart-required, health check failed
In that case you can investigate the reason further, e.g., check the health checks with:
[bright60->device]% latesthealthdata node001
Health Check                 Severity Value            Age (sec.) Info Message
---------------------------- -------- ---------------- ---------- ----------------------------------------
nanchecker                   10       FAIL             1090
DeviceIsUp                   40       FAIL             10
ssh2node                     0        PASS             1090       Not UP according to CMDaemon
[bright60->device]%
How do I clear that status?
You can clear the restart-required flag without a reboot in cmsh by closing and opening
the node:
device open --reset -n node001..node100
How can I flash update all my nodes (from Linux)?
(For update via DOS, see /faq/index.php?action=artikel&cat=20&id=94)
Some manufacturers like Dell provide a flash BIOS upgrade utility that is run from within
Linux. Such a utility typically requires the node to reboot from the hard drive after running
the utility, and only then will the upgrade be complete. It typically therefore does not work
with the nodes of a cluster, because nodes by default do a PXE boot.
Because the flash upgrade utility is usually a binary, it is unclear how it works. The procedure
described next for making it work is based on some commonsense guesswork. The manufacturer should
be contacted to confirm how their utility works before trying out this procedure.
The procedure described next should be
done with care because a node that has a damaged BIOS may not function at all. Such a node
is called a node that is "bricked" because it may be as much use as a brick for its intended
purpose.
The procedure is based on the likelihood that the utility modifies something on the local
drive, probably a service which is loaded on system startup.
The trouble with the firmware update utility doing that (a modifcation that is to run on
startup) is that a regular node in the cluster normally PXE boots from a software image, and
not from the regular node hard drive that the utility has modified. So this is why, in a default
cluster, updating a flash bios from Linux will not succeed for regular nodes.
For such a case, the utility can however usually be made to work by simply setting the node
to non-sync on the next install. For example:
cmsh -c "device use node001; set nextinstallmode
nosync; commit"
After running the firmware installation utility, if the node has the updated BIOS after reboot
and is behaving OK, the following procedure can then install the firmware
on the remaining nodes:
1. Make sure the firmware binary is on all nodes by placing it in your software image, or
by using pcopy in cmsh.
Download the firmware binary and save it to /opt on the head node:
[bright60 ~]# chmod 755 /opt/C6220_BIOS_15R41_LN32_1.1.9.BIN
[bright60 ~]# cmsh
[bright60]% device pcopy -n node002..node200
/opt/C6220_BIOS_15R41_LN32_1.1.9.BIN /opt
[bright60]% device commit
2. Set nextinstallmode to NOSYNC. Example:
[bright60]% device foreach -n node002..node200
(set nextinstallmode nosync)
[bright60]% device commit
3. Use pexec to call the firmware upgrade utility on all nodes. Example:
[bright60]% device pexec -n node002..node200
" /opt/C6220_BIOS_15R41_LN32_1.1.9.BIN -q"
Note that by setting a node's 'nextinstallmode' to NOSYNC, you are telling it to skip image
re-synchronization on the next boot. After this boot, everything will be back to normal. It
is probably wise to do the update on a small number of nodes first (e.g. 5-10) so that all
the nodes are not bricked if something goes wrong. Starting cautiously by doing it on one node
is probably a good idea.
How to easily install & configure the Torque/Maui open source scheduler in Bright
by Robert Stober | August 14, 2012
Bright Cluster Manager makes most cluster management tasks very easy to perform, and installing
workload managers is one of them. There are many workload managers that are pre-configured,
admin-selectable options when you install Bright, including PBS Pro, SLURM, LSF, openlava, Torque,
and Grid Engine.
The open source scheduler Maui is not pre-configured, but it's really easy to install and
configure this software in Bright Cluster Manager. This article shows you how.
The process is to download and install the Maui scheduler, then to configure Bright to use Maui
to schedule Torque jobs.
Getting Started
Step 1: Download the Maui scheduler from the Adaptive Computing website. You will need to
register on their site before you can download it.
Step 2: Install it as shown below. This command will overwrite the Bright zero-length Maui
placeholder file.
# cp -f maui-3.3.1.tar.gz /usr/src/redhat/SOURCES/maui-3.3.1.tar.gz
Step 3: Build the Maui RPM.
# rpmbuild -bb /usr/src/redhat/SPECS/maui.spec
Step 4: Install the RPM.
# rpm -ivh /usr/src/redhat/RPMS/x86_64/maui-3.3.1-59_cm6.0.x86_64.rpm
Preparing...
###########################################
[100%]
1:maui
###########################################
[100%]
Select the node that is running the Torque server (usually the head node), then the "roles" tab.
Configure the "scheduler" property of the Torque Server role to use the Maui scheduler.
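If you prefer the command line over cmgui, a roughly equivalent cmsh sketch might look like the following; the role name (torqueserver) and its scheduler property are assumptions about Bright's role naming, so check them with tab completion before running:
# Assumed role and property names: point the Torque server role at the Maui scheduler
cmsh -c "device use master; roles; use torqueserver; set scheduler maui; commit"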
Step 5. Load the Torque and Maui modules. This adds the Maui commands to your PATH in the
current shell.
$ module load torque
$ module load maui
The "initadd" command adds the Torque and Maui modules to your environment so that next time
you log in they're automatically loaded.
$ module initadd torque maui
Step 6. Submit a simple Torque job.
$ qsub stresscpu.sh
5.torque-head.cm.cluster
The job has been submitted and is running.
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
5.torque-head             stresscpu        rstober                0 R shortq
The Maui showq command displays information about active, eligible, blocked, and/or recently
completed jobs. Since Torque is not actually scheduling jobs, the showq command displays the
actual job ordering.
$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE PROC   REMAINING            STARTTIME
5                  rstober     Running    1 99:23:59:28  Thu Aug  9 11:40:45

1 Active Job

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE PROC     WCLIMIT            QUEUETIME

Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0
The Maui checkjob command displays detailed job information for queued, blocked, active, and
recently completed jobs.
$ checkjob 5
checking job 5
State: Running
Creds: user:rstober group:rstober class:shortq qos:DEFAULT
WallTime: 00:01:31 of 99:23:59:59
SubmitTime: Thu Aug 9 11:40:44
(Time Queued Total: 00:00:01 Eligible: 00:00:01)
StartTime: Thu Aug 9 11:40:45
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Allocated Nodes:
[node003.cm.cluster:1]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '5' (-00:01:31 -> 99:23:58:28 Duration: 99:23:59:59)
PE: 1.00 StartPriority: 1
OpenStack Neutron Mellanox ML2 Driver Configuration in Bright
Download the latest Mellanox OFED package for CentOS/RHEL 6.5:
http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers
The package name looks like this: MLNX_OFED_LINUX-<version>-rhel6.5-x86_64 (the package can be
downloaded either as an ISO or as a tarball).
The OFED package is to be copied (one way or another) to all the compute hosts which require an
upgrade of the firmware. (Note: only at a later stage of the article will we describe
the actual installation of the OFED package into the software images. Right now we only want
the file on the live node.)
An efficient way to upgrade the firmware on multiple hosts is to extract (in the case of a tar.gz
file) or copy (in the case of an ISO) the OFED package directory to a shared location such as /cm/shared
(which is mounted on compute nodes by default).
Then we can use the pdsh tool in combination with category names to parallelize the upgrade.
In our example we extract the OFED package to /cm/shared/ofed.
Before we begin the upgrade we need to remove the cm-config-intelcompliance-slave package to avoid
conflicts:
[root@headnode ~]# pdsh -g category=openstack-compute-hosts-mellanox "yum remove -y cm-config-intelcompliance-slave"
(For now we will only remove it from the live nodes. We will remove it from the software image later
in the article. Do not forget to also run this command on the head node.)
In some cases the package 'qlgc-ofed.x86_64' may also need to be removed; if it is present, the mlnxofed
install will not proceed. A log of the installer can always be viewed in /tmp/MLNX_OFED_LINUX-<version>.<num>.logs/ofed_uninstall.log
to determine which package is conflicting so that it can be removed manually.
And then run the firmware upgrade:
[root@headnode ~]# pdsh -g category=openstack-compute-hosts-mellanox "cd /cm/shared/ofed/MLNX_OFED_LINUX-2.3-1.0.1-rhel6.5-x86_64/
&& echo \"y\" | ./mlnxofedinstall --enable-sriov" | tee -a /tmp/mlnx-firmware-upgrade.log
(Do not forget to execute these two steps on the network node and the headnode)
Note that we are outputting both to the screen and to a temporary file (/tmp/mlnx-firmware-upgrade.log).
This can help in spotting any errors that might occur during the upgrade.
Running the 'mlnxofedinstall --enable-sriov' utility does two things:
- installs OFED on the live nodes
- updates the firmware on the InfiniBand cards and enables the SR-IOV functionality.
Notice that in the case of the compute nodes (node001-node003), at this point we're mostly interested
in the latter (firmware update and enabling SR-IOV). Since we've run this command on the live nodes,
the filesystem changes have not been propagated to the software image used by the nodes (i.e., at
this point they would be lost on reboot). We will take care of that later on in this article by installing
the OFED into the software image as well.
In the case of the head node, however, running this command effectively both installs OFED and
updates the firmware, which is exactly what we want.
Bright Cluster Manager 7
Bright Cluster Manager for HPC lets customers deploy complete HPC clusters on bare metal and
manage them effectively. It provides single-pane-of-glass management for the hardware, operating
system, HPC software, and users. With Bright Cluster Manager for HPC, system administrators can get
their clusters up and running quickly and keep them running reliably throughout their life cycle
– all with the ease and elegance of a fully featured, enterprise-grade cluster manager.
With the latest release, we've added some great new features that make Bright Cluster Manager
for HPC even more powerful.
New Feature Highlights
Image Revision Control – We've added revision control capability which means
you can track changes to software images using standardized methods.
Integrated Cisco UCS Support – With the new integrated support for Cisco UCS
rack servers, you can rapidly introduce flexible, multiplexed servers into your HPC environment.
Native AWS Storage Service Support – Bright Cluster Manager 7 now supports native
AWS storage which means that you can use inexpensive, secure, durable, flexible and simple storage
services for data use, archiving and backup in the AWS cloud.
Intelligent Dynamic Cloud Provisioning – By only instantiating compute resources
in AWS when they're actually needed – such as after the data to be processed has been uploaded, or
when on-site workloads reach a certain threshold – Bright Cluster Manager 7 can save you money.
Bright Cluster Manager Images
- The Cluster Management GUI of Bright Cluster Manager 7 illustrating queued jobs. Some jobs are
running on compute nodes that have been dynamically provisioned in the AWS cloud.
- The Cluster Management GUI of Bright Cluster Manager 7 capturing a summarized description.
How do I set up a local Bright repository?
1. Copy the Bright yum repo file,
/etc/yum.repos.d/cm.repo, from the
head node to the server where you're going to create the local mirror.
2. Get the repository ID (on the mirror server):
# yum clean all
# yum repolist
[...]
cm-rhel6-7.0 Cluster Manager 7.0 for Red Hat Enterprise Linux 6 301+8
cm-rhel6-7.0-updates Cluster Manager 7.0 for Red Hat Enterprise Linux 6 - Updates 371
[...]
3. Sync the repositories locally:
# mkdir -p /path/to/local/yum/repo/cm-rhel6-7.0
# reposync --gpgcheck -l --repoid=cm-rhel6-7.0 -n -p /path/to/local/yum/repo
# createrepo -v /path/to/local/yum/repo/cm-rhel6-7.0
# mkdir -p /path/to/local/yum/repo/cm-rhel6-7.0-updates
# reposync --gpgcheck -l --repoid=cm-rhel6-7.0-updates -n -p /path/to/local/yum/repo
# createrepo -v /path/to/local/yum/repo/cm-rhel6-7.0-updates
4. You may need to create local repositories for ceph-* and epel as well since some Bright
packages may have some dependencies which are provided by these repositories.
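To actually use the mirror from the nodes or from a software image, a yum repo file has to point at it. A minimal sketch, assuming the mirror is exported over HTTP at mirror.example.com with the directory layout used above (the file name and base URL are placeholders):
# /etc/yum.repos.d/cm-local.repo (hypothetical file name and base URL)
[cm-rhel6-7.0-local]
name=Cluster Manager 7.0 for Red Hat Enterprise Linux 6 (local mirror)
baseurl=http://mirror.example.com/repo/cm-rhel6-7.0
enabled=1
gpgcheck=1

[cm-rhel6-7.0-updates-local]
name=Cluster Manager 7.0 for Red Hat Enterprise Linux 6 - Updates (local mirror)
baseurl=http://mirror.example.com/repo/cm-rhel6-7.0-updates
enabled=1
gpgcheck=1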