Bright Cluster Manager
While it is called a cluster manager, this is essentially a fairly generic Linux configuration management
system with a cluster tilt. It allows "bare metal" reimaging of nodes from
a set of images (servers can be enrolled into groups, each of which can be assigned an image; if
a node is not assigned to any group, the default image is used). Images are typically stored on the head node
or on a special provisioning server (for really large clusters). There can be multiple images, one for each type
of node.
This is a distribution-agnostic tool that can
support heterogeneous hardware, not only HPC clusters. It allows the deployment of any kind of
Linux distribution (standard or customized) to any kind of target machine.
Besides HPC clusters, suitable deployments might include computer labs and render farms.
Like SystemImager, Bright CM works with file-based (rather than block-based)
system images using rsync. An image is stored as a directory hierarchy of files
representing a snapshot of a machine, containing all the files and directories from the root of that
machine's file system. Images can be acquired in multiple ways, including retrieval from a sample
system.
One method of image creation is using a pre-installed node (the
golden client). This way the user can customize and tweak the golden client's configuration
to his needs and verify its proper operation, which helps to ensure that the image, once deployed,
will behave the same way as the golden client. Incremental updates are possible by syncing an updated
golden client to the image, then syncing that image to deployed machines.
Images are hosted in a central repository on the head node of the cluster or on a special server called the image server,
and they are distributed to the nodes using rsync.
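To illustrate the file-based approach, synchronizing a node from such an image directory conceptually boils down to an rsync of the tree onto the node's root filesystem. The following is only a conceptual sketch (Bright's own node-installer uses its own rsync invocation and exclude lists), with placeholder image path and node name:
# Conceptual illustration only: push the image tree to a node's root filesystem,
# excluding pseudo-filesystems
rsync -aHx --delete --exclude=/proc --exclude=/sys --exclude=/dev \
    /cm/images/default-image/ root@node001:/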
This is commercial software developed by Bright Computing. The development office is in Amsterdam,
NL, and ING Bank is a backing shareholder.
Along with supplying a valuable (albeit complex) package, they also provide value as
pretty capable packagers of important software such as workload managers (SGE, Torque, PBS Pro), etc. This part
of their added value should probably be taken into account when you purchase the license, because
the core imaging functionality is almost identical to that of SystemImager (which became abandonware in 2015).
With Red Hat introducing Red Hat for HPC (later renamed Red Hat for Scientific Computing), the classic compute node
license model no longer works. This means that if a node has this flavor of RHEL installed,
you need to customize each node, as registration is now separate for each node. The idea Bright Cluster Manager
relied upon, working with DHCP and identical images for all nodes, is broken. You also can't patch the image on the head node via chroot, because your nodes are now
licensed for a different distribution, which is only a subset of Enterprise Linux, as if it were a different flavor
of the OS.
This problem is not fatal -- you can always create the image from a node instead (treating
the selected node as a golden image) -- but this is a
new situation: now you need to preserve licensing information for each and every node separately and
populate it as an additional step after the installation.
Like many complex Unix management systems, Bright CM modifies many system files in ways you do not understand,
which makes integration of new software more complex and troubleshooting almost impossible.
Look, for example, at the definition of a parallel environment
in SGE: it contains references to some CM scripts. The SGE environment is loaded via environment modules which are
also kept in CM directories. Changing certain parameters in SGE (for example the default number of
cores on a node) actually requires changing them in CM too. It is unclear why, but that's
how it is. See also the Bright knowledge base article
"Why is my workload manager configuration changed by Bright?"
You can learn some useful stuff from those modifications, but they create unique troubleshooting
problems: you need to guess that CM is the culprit, which is not obvious in the case
I described. I, for example, guessed it only because the configuration reverted to its previous behavior
after I manually corrected the wrong core count in the SGE configuration.
If you manage a small cluster (let's say fewer than 50 nodes), at some point you might ask yourself
whether the game is worth the candle. The key attraction of CM -- the ability to seamlessly
restore a compute node from an image -- can be implemented in several other ways. Beyond that, Bright
Cluster Manager does not provide any indispensable functionality for small clusters.
It has three typical problems inherent in such systems:
- It creates an additional, pretty complex layer that obstructs viewing and understanding the lower
layers. This is especially important for troubleshooting, which is badly affected if you
need to debug issues that touch CM functionality. For example, after CM is installed on the
headnode you can't change the hostname easily. Also, the default design, in which nodes use a dedicated
private network, is suboptimal in cases where you need to connect nodes to an external environment during computations
(unless you have an extra interface, which is often not the case for blades; you probably can use virtual
interfaces, though).
- It introduces a custom command language (cmsh) which, if you use it only episodically, is a pain in the
neck and can be used only on the basis of previously documented examples. As the language is not used often, you
do need a cheat sheet for the most typical commands (a minimal one is sketched after this list).
The commands are not intuitive and the syntax is sometimes pretty weird. For anything more complex than typical
operations, you depend on CM support, which, actually, is pretty good.
- Documentation is weak. There is typically no attempt to provide the most common usage
examples, or to explain the ideas behind a particular feature. There is too much focus on describing the
full capabilities of the system, which are admittedly impressive -- it can manage really big clusters,
including exotic features needed only or mainly for large clusters, etc. In this sense CM documentation
is really bad. The node provisioning chapter, the chapter about the feature that is the crown jewel
of the package, is poorly written and has no examples of typical usage. For example, there is no
way to figure out from it how to create a new image (hint: you first need to clone an
existing image and then overwrite it). No explanation of key ideas, no detailed examples of
typical usage. Nothing. Those guys simply do not understand the importance of documentation and instead
are engaged in a rat race of adding features (malignant featurism ;-)
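As a starting point for such a cheat sheet, here is a minimal set of cmsh one-liners for the most common operations, including the clone-then-modify way of creating a new image mentioned above. Node, category and image names are placeholders, and the exact syntax should be verified against the built-in help of your cmsh version:
# Show the state of all nodes
cmsh -c "device status"
# Power-cycle one node via its BMC
cmsh -c "device use node001; power reset"
# List available software images
cmsh -c "softwareimage list"
# Create a new image by cloning an existing one, then modify the clone
cmsh -c "softwareimage; clone default-image new-image; commit"
# Assign the new image to a node category
cmsh -c "category use default; set softwareimage new-image; commit"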
There are several interesting parts of Bright CM. Among them:
- Working with images. This is the most impressive part: Bright created a pretty elaborate
system for working with images.
- Different images can be assigned to groups of nodes.
- A different mode of synchronization with the image can be assigned to each node.
- An image can be created from any node, so instead of working with the image in chroot mode you
can work with a real node and recreate the image from it (see the sketch after this list).
- Schedulers offered by Bright (SGE, Torque, etc.) provide considerable additional value,
helping to offset the licensing cost of the manager itself. It is not that easy to
install Torque correctly (the RPMs are often junk) or SGE (Son of Grid Engine is the only usable free
distribution), and Bright provides some support for the versions it bundles with the cluster manager. They can also
be used in production environments outside typical clusters.
- Boot process. The idea of associating different ways of booting a server with different
behavior of the manager is pretty slick. For PXE mode you can additionally specify various ways
of synchronizing with the image.
- The ability to work directly with Dell DRAC. For example, you can reboot a server, or
a group of servers, directly from Bright CM, without the hassle of opening DRAC for each node.
- Environment modules integration, and provisioning of several free schedulers (SGE, Torque, etc.).
- A usable, well-designed GUI client based on Firefox. Many operations on nodes can be performed
with it instead of the less user-friendly cmsh. It has some useful monitoring capabilities and as such provides
additional value to the package.
- The ability to update Dell firmware automatically.
- The Bright cluster management daemon (CMDaemon) monitors several important aspects of the functioning of every
node and reports any problems it detects in the software or the hardware, so that you can take
action.
- Provisioning of Puppet.
- Power management capabilities.
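As an illustration of the image-from-node workflow mentioned in the list above, a minimal cmsh sketch might look like this. The node name is a placeholder, and the grabimage behavior and its -w option should be checked against the documentation of your Bright version:
# Pull the current filesystem of node001 back into the software image assigned to it
# (without -w this should be a dry run; -w actually writes the changes)
cmsh -c "device use node001; grabimage -w"
The updated image can then be assigned to other nodes or categories as in the cheat sheet above.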
Working with images is the most interesting part of Bright Cluster Manager. Few systems
implement it as consistently as CM. Here the designers demonstrated some original thinking (for example
the role of the boot record as an indicator of how the node should behave). You can create an image from
a node and distribute it to other nodes; CM takes care of all the customization needed. If a node is configured
for network boot (or if the boot record is absent), CM automatically reimages the node, or synchronizes it
with the image if the image is already installed. Otherwise you get a regular boot. That means that installing or removing
the boot record changes the behavior of a group of servers in a very useful way.
You can have multiple images, each stored as an actual directory tree of files.
Managing images is done using chroot and is not very convenient, but since it is possible to
create an image from a node, you can do everything on a selected node instead, then create an image from
that node and distribute it to the other nodes.
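If you do want to modify an image in place on the head node, the usual approach is to treat the image directory as an installation root. A minimal sketch, assuming the default image lives under /cm/images/default-image (adjust the path and package names to your installation):
# Install or update packages directly inside the image tree
yum --installroot=/cm/images/default-image install -y pdsh
# Or drop into the image for manual edits
chroot /cm/images/default-image /bin/bash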
Bright CM can also be used with non-cluster blade enclosures (sets of 16 or more real servers), if you
can justify the cost of the license. This is an interesting, expandable and scalable turnkey solution
for managing blades, which is especially attractive for Dell blades, as Bright interacts well with
DRAC. For example, for a web server farm, or for DMZ servers.
It also installs a lot of useful software, such as pdsh and environment modules. The latter come
with integrated example packages which can serve as a framework for developing your own set of environment
modules. Generally the environment modules supplied are of high quality.
By default, nodes boot from the network when using Bright Cluster Manager. This is called
a network boot, or sometimes a PXE boot. The head node (or some other node designated as a provisioning node) runs
a tftpd server from within xinetd, which supplies the boot loader for the default software image or for
the image assigned to the node.
You can also install a regular boot record on the node and use PXE boot only as needed;
Bright can provision and sync nodes only via PXE boot.
Aspects of power management in Bright Cluster Manager include:
- managing the main power supply to nodes through power distribution units, baseboard
management controllers, or CMDaemon (mainly with Dell hardware)
- monitoring power consumption over time
- setting CPU scaling governors for power-saving
- setting power-saving options in workload managers
- ensuring the passive head node can safely take over from the active head during failover
- allowing cluster burn tests to be carried out
This creates some opportunities for power savings, which is extremely important in large clusters. You
can, for example, shut down inactive nodes and bring them back when there are jobs in the queue waiting for
resources (see the sketch below).
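A minimal sketch of manual power control from cmsh, reusing the foreach syntax shown later on this page; the node range is a placeholder, BMC or PDU power control must already be configured, and automated shutdown based on queue state would normally be scripted around the workload manager:
# Power off a range of idle nodes out-of-band
cmsh -c "device foreach -n node010..node020 (power off)"
# Bring them back when jobs start waiting for resources
cmsh -c "device foreach -n node010..node020 (power on)"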
As clusters are often used by a large number of researchers, user management presents some problems.
Bright CM allows you (via the usernodelogin setting of cmsh) to restrict direct user logins from outside
the HPC scheduler, and is thus one way of preventing users from using node resources in an unaccountable
manner. The usernodelogin setting is applicable to node categories only, not to individual nodes.
# cmsh
[bright71]% category use default
[bright71->category[default]]% set usernodelogin onlywhenjob
[bright71->category*[default*]]% commit
The attributes for usernodelogin are:
- always (the default): This allows all users to ssh directly
into a node at any time.
- never: This allows no user other than root to directly
ssh into the node.
- onlywhenjob: This allows a user to ssh directly into
the node only when a job of that user is running on it.
Bright Cluster Manager runs its own LDAP service to manage users, rather than using Unix user and
group files. That means that users and groups are managed via the centralizing LDAP database server
running on the head node (accessible via cmgui), and not via entries in the /etc/passwd or /etc/group files.
You can use cmsh too, for example:
[root@bright71 ~]# cmsh
[bright71]% user
[bright71->user]%
[bright71->user]% add user maureen
[bright71->user*[maureen*]]%
[bright71->user*[maureen*]]% commit
[bright71->user[maureen]]% show
You can set user and group properties via the set command. Typing set and then either using tab
to see the possible completions, or following it up with the enter key, suggests several parameters
that can be set, one of which is password:
Example
[bright71->user[maureen]]% set
Name:
set - Set specific user or group property
Usage:
set <parameter>
set user <name> <parameter>
set group <name> <parameter>
You can edit groups with the append and removefrom commands. They are used to add extra
users to, and remove extra users from, a group. For example, it may be useful to have a compiler group
so that several users can share access to the Intel compiler (a sketch follows).
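A hedged sketch of such group editing, built by analogy with the 'set group <name> <parameter>' usage shown above; the group name is a placeholder and the exact command form and member property name are assumptions, so confirm them with tab completion in cmsh before use:
# Assumed syntax, by analogy with "set group <name> <parameter>"
cmsh -c "user; append group compilergroup members maureen; commit"
cmsh -c "user; removefrom group compilergroup members maureen; commit"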
Dell BIOS management in Bright Cluster Manager means that for nodes that run on Dell hardware, the
BIOS settings and BIOS firmware updates can be managed via the standard Bright front end utilities to
CMDaemon, cmgui and cmsh.
In turn, CMDaemon configures the BIOS settings and applies firmware updates to each node via a standard
Dell utility called racadm. The racadm utility is part of the Dell OpenManage software stack. The Dell
hardware supported includes the R430, R630, R730, R730XD, R930, FC430, FC630, FC830, M630, M830, and C6320.
The utility racadm must be present on the Bright Cluster Manager head node. The utility is installed
on the head node if Dell is selected as the node hardware manufacturer during Bright Cluster Manager
installation. IPMI must be working on all of the servers. This means that it should be possible to communicate
out-of-band from the head node to all of the compute nodes, via the IPMI IP address.
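A quick way to verify that out-of-band communication works is to query a node's BMC from the head node with ipmitool; the BMC IP address, user name and password below are placeholders for your own IPMI credentials:
# Check that the BMC of a compute node answers out-of-band
ipmitool -I lanplus -H 10.148.0.1 -U root -P <password> chassis status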
That's typical for complex software packages, but it is still pretty annoying. Truth be told,
cmsh has built-in help which partially compensates for the absence of detailed documentation on it, but the absence
of typical usage examples is really bad.
Important nuances are not mentioned. Generally this documentation is useful only in one case:
if you never read it and rely on CM customer support. If they point you to the documentation, just ignore
it; record how they solved the problem and create your own custom documentation from it. After several
tickets you will have a valuable private database.
They also run a knowledge base that might contain valuable information.
CM changes the behavior of some components, for example SGE, in a way that complicates troubleshooting.
In one case it enforced the wrong number of cores on the servers, and if you correct it in the SGE
all.q queue, after a while it reverts to the incorrect number.
If the initial configuration is incorrect, you are in trouble in more than one way. For example, with
SGE I noticed a very interesting bug: if your server has 24 cores but all.q was mistakenly configured initially
with the number of slots equal to 12, you are in trouble. You change it via the qconf
command in SGE and think that you are done. Wrong. After a while it reverts to the incorrect number.
At this moment you want to kill the CM designers, because they are clearly amateurs.
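Since CMDaemon keeps rewriting the SGE configuration, the slot count apparently has to be corrected on the Bright side as well as in SGE. A hedged sketch of both steps follows; the qconf call is standard SGE, but the role name (sgeclient) and its slots property are assumptions about how Bright exposes this setting, so verify them in cmsh first:
# Fix the slot count in SGE itself (may be reverted by CMDaemon later)
qconf -mattr queue slots 24 all.q
# Assumed Bright-side fix: update the client role so CMDaemon stops pushing the old value
cmsh -c "device use node001; roles; use sgeclient; set slots 24; commit"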
Another case I already mentioned: if a node does not have a boot record, it can be reimaged from
the image, and if there are differences between the current state of the node and the image, all those differences
are lost. In an ideal case there should not be any, but life is far from ideal.
NOTE: what follows is Microsoft-style advertising of the product. They present a nice GUI, but forget
to mention that the GUI is not everything and that you can't manage the whole cluster from it alone.
== quote ==
The sophisticated node provisioning and image management system in Bright Cluster Manager®
allows you to do the following:
- Install individual nodes or complete clusters from bare metal within minutes. This applies to
big data clusters and OpenStack private clouds in addition to HPC clusters.
- Create, manage and use as many node images as required.
- Create, manage and use images that are very different (for example, based on different Linux
kernels or distributions of Linux, Apache Hadoop and OpenStack).
- Create or change images substantially without breaking compatibility with application software.
- Assign images to individual nodes or groups of nodes with a single command or mouse click.
- Make changes to node images on the head node, without having to login to regular nodes.
- Synchronize a regular node image on the head node from a hard disk on a regular node.
- Apply RPM package commands
to node images, manually or automatically (for example, using
Yum).
- Update images incrementally, only transferring changes to the nodes.
- Update images live, without having to reboot nodes.
- Configure how disks should be partitioned (optionally using software
RAID and/or
LVM ).
- Protect disks or disk partitions from being overwritten.
- Provision images to memory and run nodes diskless.
- Use revision control to keep track of changes to node images.
- Return to a previously stored node image if and when required.
- Backup all node images by backing up only the head node.
- Automatically update BIOS images or change BIOS configurations without keyboard or console access
to the nodes.
Bright Computing engineers will be on hand to demonstrate all the 7.1 updates that enable customers
to deploy, manage, use, and maintain complete HPC clusters over bare metal or in the cloud even more
effectively. Leading the list of enhancements is fully integrated support for
Intel® Enterprise Edition for Lustre (IEEL), integrated Dell BIOS operations, and
open source
Puppet. Improved integration with several workload managers and a refactored Web portal round out
the exciting enhancements.
Those who need to deploy, use and maintain a POSIX-compliant parallel file system will find the integrated
IEEL support lets them do so efficiently and with the well-known Bright Cluster Manager interface. Fully
integrated support for Puppet ensures the right services are up and running on the right platforms,
through enforced configurations. With integrated support for Dell BIOS firmware and configuration settings,
users can deploy and maintain supported Dell servers from the BIOS level, using Bright's familiar interface.
Broader and deeper support for Slurm, Sun Grid Engine, and Univa Grid Engine ensures that Bright
Cluster Manager for HPC fully integrates the capability to optimally manage HPC workloads. Users can
quickly and easily monitor their HPC workloads through the updated web interface provided by Bright's
user portal. Version 7.1 also incorporates refactored internals for improved performance, as well as
finer-grained management control that includes customized kernels.
"We are excited to share the latest updates and enhancements we've made to Bright Cluster Manager
for HPC. Collectively, they further reduce the complexity of on-premise HPC and help our customers extend
their on-premise HPC environment into the cloud," said Matthijs van Leeuwen, Bright Computing Founder
and CEO. "The latest version allows our customers to manage their HPC environment alongside their platforms
for Big Data Analytics, based on Apache Hadoop and Apache Spark, from a single management interface."
For more information, visit
http://www.brightcomputing.com/Solutions-HPC
- 20170713: How do I know when a clone operation has completed?
- 20170713: How do I upgrade a Torque package?
- 20170713: How do I integrate a custom torque installation with Bright Cluster Manager?
- 20170713: Installing bacula installs pbspro, which I don't want. What should I do?
- 20170713: How do I add a Bright ISO as a YUM repository?
- 20170713: When does a node need to be restarted?
- 20170713: How can I flash update all my nodes (from Linux)?
- 20170619: How to easily install & configure the Torque-Maui open source scheduler in Bright, by Robert Stober (www.brightcomputing.com)
- 20170619: OpenStack Neutron Mellanox ML2 Driver Configuration in Bright
- 20170619: Bright Cluster Manager 7 for HPC - New
- 200102: How do I set up a local Bright repository?
We would like to know when exactly the clone of an image has
completed. This is so we can automate some image update and test processes. Ie: we clone an
image, apply updates to the clone, assign that updated image to a category, and reboot a
node for testing the updated image.
However, the current "clone/commit" process goes into the background. This makes
programmatically determining when it finished rather difficult. Can we make the commit of
an image clone wait for completion in the cmsh shell so our script will wait before
attempting to apply updates?
In 6.0 the --wait option to the commit command makes cmsh wait for any background task to complete.
A list of tasks that are waiting for completion can be seen with cmsh -A -c "task list"
For versions of BCM prior to 6.0, the following technique can be used:
The CMDaemon will not start the background copy operation if the target directory already
exists. So what you can do from a bash script is something like this:
cp -a /cm/images/default-image /cm/images/new-image
cmsh -c "softwareimage; clone default-image new-image; commit"
The first line guarantees the copy is done (the command exits only after the cp has finished). That means
that the second line does pretty much nothing except housekeeping, which lets CMDaemon know about new-image.
In particular, the clone in the second line, which would normally carry out the copy in the
background, doesn't do any copying because that was already done.
Applying updates to the images can then be carried out without needing to test whether the
clone has completed.
When does a node need to be restarted? Why does a node need to be restarted? Can I ignore it? How do I clear that status?
Can I ignore it?
Not really, unless you really know what you are doing. You can see if a node needs
restarting from the device status command (alias: ds):
In cmsh:
bright60% device status
apc01 .................... [ UP ] health check failed
devhp .................... [ UP ] health check failed
node001 .................. [UP ] restart-required
node002 .................. [ UP ] health check failed
Or from cmgui -> nodes[node001] -> hostname[state]: restart-required.
When does a node need to be restarted?
A restart-required flag is set when a commit is done on a node that changes the state of:
category/image/ip/hostname/diskSetup/pxelabel/initialize script/finalize script/install
boot record.
Similar rules apply for category and image commit.
These settings all have fields used by the node-installer.
It is possible to get false positives. For example adding a newline to a script will mark the
node as restart-required.
There are however potentially many things that can differ when changes are made, and no guarantee
that all settings from the new category have been applied until you reboot the node. The reason
why a restart-required message is there, is to warn you that the node may be in a weird state
(e.g., if moving a node from category B to a new category A, it may still be using the software
image that has been set for category B).
Why does a node need to be restarted?
The reason for the failure is often given within parentheses:
bright60% device status
node060 .................. [ UP ] (eth0 changed) restart-required
node061 .................. [ UP ] (category changed) restart-required
Sometimes the info message gives a clue on the reason for failure:
[bright60->device]% status node001
node001 .................. [ DOWN ] pingable, restart-required, health check failed
In that case you can investigate the reason further, e.g., check the health checks with:
[bright60->device]% latesthealthdata node001
Health Check                 Severity Value            Age (sec.) Info Message
---------------------------- -------- ---------------- ---------- ----------------------------------------
nanchecker                   10       FAIL             1090
DeviceIsUp                   40       FAIL             10
ssh2node                     0        PASS             1090       Not UP according to CMDaemon
[bright60->device]%
How do I clear that status?
You can clear the restart-required flag without a reboot in cmsh by closing and opening
the node:
device open --reset -n node001..node100
How can I flash update all my nodes (from Linux)?
(For update via DOS, see /faq/index.php?action=artikel&cat=20&id=94)
Some manufacturers like Dell provide a flash BIOS upgrade utility that is run from within
Linux. Such a utility typically requires the node to reboot from the hard drive after running
the utility, and only then will the upgrade be complete. It typically therefore does not work
with the nodes of a cluster, because nodes by default do a PXE boot.
Because the flash upgrade utility is usually a binary, it is unclear how it works. The procedure
described next for making it work is based on some commonsense guesswork. The manufacturer should
be contacted to confirm how their utility works before trying out this procedure.
The procedure described next should be
done with care because a node that has a damaged BIOS may not function at all. Such a node
is called a node that is "bricked" because it may be as much use as a brick for its intended
purpose.
The procedure is based on the likelihood that the utility modifies something on the local
drive, probably a service which is loaded on system startup.
The trouble with the firmware update utility doing that (a modifcation that is to run on
startup) is that a regular node in the cluster normally PXE boots from a software image, and
not from the regular node hard drive that the utility has modified. So this is why, in a default
cluster, updating a flash bios from Linux will not succeed for regular nodes.
For such a case, the utility can however usually be made to work by simply setting the node
to non-sync on the next install. For example:
cmsh -c "device use node001; set nextinstallmode
nosync; commit"
After running the firmware installation utility, if the node has the updated BIOS after reboot
and is behaving OK, the following procedure can then install the firmware
on the remaining nodes:
1. Make sure the firmware binary is on all nodes by placing it in your software image, or
by using pcopy in cmsh.
Download the firmware binary and save it to /opt on the head node:
[bright60 ~]# chmod 755 /opt/C6220_BIOS_15R41_LN32_1.1.9.BIN
[bright60 ~]# cmsh
[bright60]% device pcopy -n node002..node200
/opt/C6220_BIOS_15R41_LN32_1.1.9.BIN /opt
[bright60]% device commit
2. Set nextinstallmode to NOSYNC. Example:
[bright60]% device foreach -n node002..node200
(set nextinstallmode nosync)
[bright60]% device commit
3. Use pexec to call the firmware upgrade utility on all nodes. Example:
[bright60]% device pexec -n node002..node200
" /opt/C6220_BIOS_15R41_LN32_1.1.9.BIN -q"
Note that by setting a node's 'nextinstallmode' to NOSYNC, you are telling it to skip image
re-synchronization on the next boot. After this boot, everything will be back to normal. It
is probably wise to do the update on a small number of nodes first (e.g. 5-10) so that all
the nodes are not bricked if something goes wrong. Starting cautiously by doing it on one node
is probably a good idea.
How to easily install & configure the Torque/Maui open source scheduler in Bright
by Robert Stober | August 14, 2012
Bright Cluster Manager makes most cluster management tasks very easy to perform, and installing
workload managers is one of them. There are many workload managers that are pre-configured,
admin-selectable options when you install Bright, including PBS Pro, SLURM, LSF, openlava, Torque,
and Grid Engine.
The open source scheduler Maui is not pre-configured, but it's really easy to install and
configure this software in Bright Cluster Manager. This article shows you how.
The process is to download and install the Maui scheduler, then to configure Bright to use Maui
to schedule Torque jobs.
Getting Started
Step 1: Download the Maui scheduler from the Adaptive Computing website. You will need to
register on their site before you can download it.
Step 2: Install it as shown below. This command will overwrite the Bright zero-length Maui
placeholder file.
# cp -f maui-3.3.1.tar.gz /usr/src/redhat/SOURCES/maui-3.3.1.tar.gz
Step 3: Build the Maui RPM.
# rpmbuild -bb /usr/src/redhat/SPECS/maui.spec
Step 4: Install the RPM.
# rpm -ivh /usr/src/redhat/RPMS/x86_64/maui-3.3.1-59_cm6.0.x86_64.rpm
Preparing...
###########################################
[100%]
1:maui
###########################################
[100%]
Select the node that is running the Torque server (usually the head node), then the "roles" tab.
Configure the "scheduler" property of the Torque Server role to use the Maui scheduler.
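If you prefer the command line over cmgui, a roughly equivalent cmsh sketch might look like the following; the role name (torqueserver) and its scheduler property are assumptions about Bright's role naming, so check them with tab completion before running:
# Assumed role and property names: point the Torque server role at the Maui scheduler
cmsh -c "device use master; roles; use torqueserver; set scheduler maui; commit"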
Step 5. Load the Torque and Maui modules. This adds the Maui commands to your PATH in the
current shell.
$ module load torque
$ module load maui
The "initadd" command adds the Torque and Maui modules to your environment so that next time
you log in they're automatically loaded.
$ module initadd torque maui
Step 6. Submit a simple Torque job.
$ qsub stresscpu.sh
5.torque-head.cm.cluster
The job has been submitted and is running.
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
5.torque-head             stresscpu        rstober                0 R shortq
The Maui showq command displays information about active, eligible, blocked, and/or recently
completed jobs. Since Torque is not actually scheduling jobs, the showq command displays the
actual job ordering.
$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE PROC   REMAINING            STARTTIME
5                  rstober     Running    1 99:23:59:28  Thu Aug  9 11:40:45

1 Active Job

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE PROC     WCLIMIT            QUEUETIME

Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0
The Maui checkjob command displays detailed job information for queued, blocked, active, and
recently completed jobs.
$ checkjob 5
checking job 5
State: Running
Creds: user:rstober group:rstober class:shortq qos:DEFAULT
WallTime: 00:01:31 of 99:23:59:59
SubmitTime: Thu Aug 9 11:40:44
(Time Queued Total: 00:00:01 Eligible: 00:00:01)
StartTime: Thu Aug 9 11:40:45
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Allocated Nodes:
[node003.cm.cluster:1]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '5' (-00:01:31 -> 99:23:58:28 Duration: 99:23:59:59)
PE: 1.00 StartPriority: 1
OpenStack Neutron Mellanox ML2 Driver Configuration in Bright
Download the latest Mellanox OFED package for CentOS/RHEL 6.5:
http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers
The package name looks like this: MLNX_OFED_LINUX-<version>-rhel6.5-x86_64 (the package can be
downloaded either as an ISO or as a tarball).
The OFED package is to be copied (one way or another) to all the compute hosts which require an
upgrade of the firmware. (Note: only at a later stage of the article will we describe
the actual installation of the OFED package into the software images. Right now we only want
the file on the live node.)
An efficient way to upgrade the firmware on multiple hosts is to extract (in the case of a tar.gz
file) or copy (in the case of an ISO) the OFED package directory to a shared location such as /cm/shared
(which is mounted on compute nodes by default).
Then we can use the pdsh tool in combination with category names to parallelize the upgrade.
In our example we extract the OFED package to /cm/shared/ofed.
Before we begin the upgrade we need to remove the cm-config-intelcompliance-slave package to avoid
conflicts:
[root@headnode ~]# pdsh -g category=openstack-compute-hosts-mellanox "yum remove -y cm-config-intelcompliance-slave"
(For now we will only remove it from the live nodes. We will remove it from the software image later
in the article. Do not forget to also run this command on the head node.)
In some cases the package 'qlgc-ofed.x86_64' may also need to be removed; if it is present, the mlnxofed
install will not proceed. A log of the installer can always be viewed in /tmp/MLNX_OFED_LINUX-<version>.<num>.logs/ofed_uninstall.log
to determine which package is conflicting so that it can be removed manually.
And then run the firmware upgrade:
[root@headnode ~]# pdsh -g category=openstack-compute-hosts-mellanox "cd /cm/shared/ofed/MLNX_OFED_LINUX-2.3-1.0.1-rhel6.5-x86_64/
&& echo \"y\" | ./mlnxofedinstall --enable-sriov" | tee -a /tmp/mlnx-firmware-upgrade.log
(Do not forget to execute these two steps on the network node and the headnode)
Note that we are outputting both to the screen and to a temporary file (/tmp/mlnx-firmware-upgrade.log).
This can help in spotting any errors that might occur during the upgrade.
Running the 'mlnxofedinstall --enable-sriov' utility does two things:
- installs OFED on the live nodes
- updates the firmware on the InfiniBand cards and enables the SR-IOV functionality.
Notice that in the case of the compute nodes (node001-node003), at this point we're mostly interested
in the latter (firmware update and enabling SR-IOV). Since we've run this command on the live nodes,
the filesystem changes have not been propagated to the software image used by the nodes (i.e., at
this point they would be lost on reboot). We will take care of that later on in this article by installing
the OFED into the software image as well.
In the case of the head node, however, running this command effectively both installs OFED and
updates the firmware, which is exactly what we want.
Bright Cluster Manager 7
Bright Cluster Manager for HPC lets customers deploy complete HPC clusters on bare metal and
manage them effectively. It provides single-pane-of-glass management for the hardware, operating
system, HPC software, and users. With Bright Cluster Manager for HPC, system administrators can get
their clusters up and running quickly and keep them running reliably throughout their life cycle
– all with the ease and elegance of a fully featured, enterprise-grade cluster manager.
With the latest release, we've added some great new features that make Bright Cluster Manager
for HPC even more powerful.
New Feature Highlights
Image Revision Control – We've added revision control capability which means
you can track changes to software images using standardized methods.
Integrated Cisco UCS Support – With the new integrated support for Cisco UCS
rack servers, you can rapidly introduce flexible, multiplexed servers into your HPC environment.
Native AWS Storage Service Support – Bright Cluster Manager 7 now supports native
AWS storage which means that you can use inexpensive, secure, durable, flexible and simple storage
services for data use, archiving and backup in the AWS cloud.
Intelligent Dynamic Cloud Provisioning – By only instantiating compute resources
in AWS when they're actually needed – such as after the data to be processed has been uploaded, or
when on-site workloads reach a certain threshold – Bright Cluster Manager 7 can save you money.
Bright Cluster Manager Images
- The Cluster Management GUI of Bright Cluster Manager 7 illustrating queued jobs. Some jobs are
running on compute nodes that have been dynamically provisioned in the AWS cloud.
- The Cluster Management GUI of Bright Cluster Manager 7 capturing a summarized description.
How do I set up a local Bright repository?
1. Copy the Bright yum repo file,
/etc/yum.repos.d/cm.repo, from the
head node to the server where you're going to create the local mirror.
2. Get the repository ID (on the mirror server):
# yum clean all
# yum repolist
[...]
cm-rhel6-7.0 Cluster Manager 7.0 for Red Hat Enterprise Linux 6 301+8
cm-rhel6-7.0-updates Cluster Manager 7.0 for Red Hat Enterprise Linux 6 - Updates 371
[...]
3. Sync the repositories locally:
# mkdir -p /path/to/local/yum/repo/cm-rhel6-7.0
# reposync --gpgcheck -l --repoid=cm-rhel6-7.0 -n -p /path/to/local/yum/repo
# createrepo -v /path/to/local/yum/repo/cm-rhel6-7.0
# mkdir -p /path/to/local/yum/repo/cm-rhel6-7.0-updates
# reposync --gpgcheck -l --repoid=cm-rhel6-7.0-updates -n -p /path/to/local/yum/repo
# createrepo -v /path/to/local/yum/repo/cm-rhel6-7.0-updates
4. You may need to create local repositories for ceph-* and epel as well since some Bright
packages may have some dependencies which are provided by these repositories.
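To actually use the mirror from the nodes or from a software image, a yum repo file has to point at it. A minimal sketch, assuming the mirror is exported over HTTP at mirror.example.com with the directory layout used above (the file name and base URL are placeholders):
# /etc/yum.repos.d/cm-local.repo (hypothetical file name and base URL)
[cm-rhel6-7.0-local]
name=Cluster Manager 7.0 for Red Hat Enterprise Linux 6 (local mirror)
baseurl=http://mirror.example.com/repo/cm-rhel6-7.0
enabled=1
gpgcheck=1

[cm-rhel6-7.0-updates-local]
name=Cluster Manager 7.0 for Red Hat Enterprise Linux 6 - Updates (local mirror)
baseurl=http://mirror.example.com/repo/cm-rhel6-7.0-updates
enabled=1
gpgcheck=1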