CentOS and RHEL users have a strong initial tendency to rely on RPMs for the installation of new packages. For small or widely used packages this approach usually works OK. For complex and rarely used packages you are often in for nasty surprises. Typically you face the "libraries hell" problem. Also, some packagers are pretty perverted: they add extra dependencies to the package or configure it in a completely bizarre way. They are typically volunteers and nobody controls what they are doing.
As a result you can spend an amount of time that vastly exceeds the time and effort of compiling the executables from source.
There are RPMs for Torque available from the Fedora EPEL repository, but those RPMs are fool's gold: the current version for RHEL 6.x is broken due to a SNAFU committed by the maintainer.
As usual for semi-open source packages, installation and configuration documentation is almost nonexistent. This page might slightly compensate for that.
First you need to download the correct version of the RPMs for installation on RHEL/CentOS 6.x: the one that works. The version that yum picks up from the EPEL repository does not work. The package maintainer recklessly enabled NUMA support and screwed up the application, instead of creating a separate set of packages for NUMA-enabled Torque. NUMA is not needed for typical Intel boxes. So this was a typical package maintainer perversion, for which users paid dearly.
Sh*t happens, especially with complex open source packages, which do not have adequate manpower for development, testing or packaging, but this was a real SNAFU that affected many naive users with real and pretty large clusters:
Gavin Nelson 2016-04-06 13:04:57 EDT
Please remove the NUMA support from this package group, or create an alternate package group.
My cluster has been dead for almost 2 weeks and the scientists are getting cranky. This feature does not play well with the MAUI scheduler and, apparently, not at all with the built-in scheduler (http://www.clusterresources.com/pipermail/torqueusers/2013-September/016136.html). Requiring this feature means having to introduce a whole host of changes to the Torque environment as well as forcing recompile of OpenMPI (last I checked epel version of openmpi does not have Torque support) and MAUI, which then means recompiling all the analysis applications, etc... I've tried...I really have.
I even tried rebuilding the package group from the src rpm, but when I remove the enable-numa switch from the torque.spec file it still builds with numa support (not sure what I'm missing there). ;
You need to download the pre-NUMA version (see Bug 1321154 - numa enabled torque don't work)
nucleo 2016-03-24 15:33:31 EDT
Description of problem:
After updating from torque-4.2.10-5.el6 to torque-4.2.10-9.el6 pbs_mom service don't stat.
Version-Release number of selected component (if applicable):
Actual results:pbs_mom.9607;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in read_layout_file, Unable to read the layout file in /var/lib
If I create empty file /var/lib/torque/mom_priv/mom.layout then pbs_mom service starts but never connects to to torque server, so node shown as down.
pbs_mom service should start and work correctly after update without creating any additional files such as mom.layout.
After downgrading to torque-4.2.10-5.el6 pbs_mom works fine without mom.layout file.
NUMA support enabled in 4.2.10-6, so last working version is 4.2.10-5. It can be downloaded from https://kojipkgs.fedoraproject.org//packages/torque/4.2.10/5.el6/
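The version rule just stated (4.2.10-5 works, 4.2.10-6 and later carry NUMA support) can be checked mechanically. The following is a minimal sketch; the function name is_pre_numa is hypothetical, and it only understands the 4.2.10 el6 series discussed here.

```shell
# Return 0 if a torque version-release string predates the NUMA-enabled
# EPEL builds (4.2.10-6 and later), non-zero otherwise.
# Assumption: only the 4.2.10 series described in the text is considered.
is_pre_numa() {
    vr="$1"                    # e.g. "4.2.10-5.el6"
    version="${vr%%-*}"        # "4.2.10"
    release="${vr#*-}"         # "5.el6"
    release="${release%%.*}"   # "5"
    [ "$version" = "4.2.10" ] && [ "$release" -le 5 ] 2>/dev/null
}

# Possible use on a real system:
#   is_pre_numa "$(rpm -q --qf '%{VERSION}-%{RELEASE}' torque)" || echo "NUMA build installed"
```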
Access to EPEL should be configured before installation starts.
Package/library hell is the most distinctive feature of all Linux distributions. In this case you may be able to get through without getting burned. You need to install the following packages, downloaded from https://kojipkgs.fedoraproject.org//packages/torque/4.2.10/5.el6/x86_64/ and not from EPEL:
-rw-r--r-- 1 root root   82428 Jun 30 2015 torque-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  332860 Jun 30 2015 torque-client-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root 3548332 Jun 30 2015 torque-debuginfo-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  200576 Jun 30 2015 torque-devel-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root   34792 Jun 30 2015 torque-drmaa-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root   41852 Jun 30 2015 torque-drmaa-devel-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  243432 Jun 30 2015 torque-gui-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  128116 Jun 30 2015 torque-libs-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  252312 Jun 30 2015 torque-mom-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root   19832 Jun 30 2015 torque-pam-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root   75084 Jun 30 2015 torque-scheduler-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  314052 Jun 30 2015 torque-server-4.2.10-5.el6.x86_64.rpm
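The download of the files listed above can be scripted. This is only a sketch: the base URL and package names come from the listing, while the function name torque_rpm_urls and the use of wget are assumptions.

```shell
# Base URL and package names are taken from the listing above.
BASE_URL="https://kojipkgs.fedoraproject.org//packages/torque/4.2.10/5.el6/x86_64"
PACKAGES="torque torque-client torque-debuginfo torque-devel torque-drmaa torque-drmaa-devel torque-gui torque-libs torque-mom torque-pam torque-scheduler torque-server"

# Print one download URL per package; nothing is downloaded here.
torque_rpm_urls() {
    for pkg in $PACKAGES; do
        printf '%s/%s-4.2.10-5.el6.x86_64.rpm\n' "$BASE_URL" "$pkg"
    done
}

# To actually download into the current directory (assumes wget is installed):
#   torque_rpm_urls | wget -nv -i -
```

Because yum would otherwise offer the newer EPEL build as an update, one common approach is to add an exclude=torque* line to the EPEL repository definition after installation.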
torque.x86_64 0:4.2.10-5.el6
--> Dependency: munge for package: torque-4.2.10-5.el6.x86_64
torque-client.x86_64 0:4.2.10-5.el6 will be installed
torque-debuginfo.x86_64 0:4.2.10-5.el6 will be installed
Package torque-devel.x86_64 0:4.2.10-5.el6 will be installed
Package torque-drmaa.x86_64 0:4.2.10-5.el6 will be installed
Package torque-drmaa-devel.x86_64 0:4.2.10-5.el6 will be installed
Package torque-libs.x86_64 0:4.2.10-5.el6 will be installed
Package torque-mom.x86_64 0:4.2.10-5.el6 will be installed
Package torque-pam.x86_64 0:4.2.10-5.el6 will be installed
Package torque-scheduler.x86_64 0:4.2.10-5.el6 will be installed
Package torque-server.x86_64 0:4.2.10-5.el6 will be installed
If you use more than one node, installing the client on the headnode is not necessary, but it is useful for testing, as it is easier to make the client work on the same server as the headnode. You can de-install it later.
After you have downloaded all the necessary packages into some directory, you can install them using the command
yum localinstall *.rpm
[root@centos x86_64]# yum localinstall *.rpm
Loaded plugins: fastestmirror, refresh-packagekit, security
Setting up Local Package Process
Examining torque-4.2.10-5.el6.x86_64.rpm: torque-4.2.10-5.el6.x86_64
Marking torque-4.2.10-5.el6.x86_64.rpm to be installed
Loading mirror speeds from cached hostfile
 * base: mirrors.centos.webair.com
 * epel: epel.mirror.constant.com
 * extras: mirror.cs.vt.edu
 * updates: mirror.cc.columbia.edu
Examining torque-client-4.2.10-5.el6.x86_64.rpm: torque-client-4.2.10-5.el6.x86_64
Marking torque-client-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-debuginfo-4.2.10-5.el6.x86_64.rpm: torque-debuginfo-4.2.10-5.el6.x86_64
Marking torque-debuginfo-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-devel-4.2.10-5.el6.x86_64.rpm: torque-devel-4.2.10-5.el6.x86_64
Marking torque-devel-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-drmaa-4.2.10-5.el6.x86_64.rpm: torque-drmaa-4.2.10-5.el6.x86_64
Marking torque-drmaa-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-drmaa-devel-4.2.10-5.el6.x86_64.rpm: torque-drmaa-devel-4.2.10-5.el6.x86_64
Marking torque-drmaa-devel-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-libs-4.2.10-5.el6.x86_64.rpm: torque-libs-4.2.10-5.el6.x86_64
Marking torque-libs-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-mom-4.2.10-5.el6.x86_64.rpm: torque-mom-4.2.10-5.el6.x86_64
Marking torque-mom-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-pam-4.2.10-5.el6.x86_64.rpm: torque-pam-4.2.10-5.el6.x86_64
Marking torque-pam-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-scheduler-4.2.10-5.el6.x86_64.rpm: torque-scheduler-4.2.10-5.el6.x86_64
Marking torque-scheduler-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-server-4.2.10-5.el6.x86_64.rpm: torque-server-4.2.10-5.el6.x86_64
Marking torque-server-4.2.10-5.el6.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package torque.x86_64 0:4.2.10-5.el6 will be installed
--> Processing Dependency: munge for package: torque-4.2.10-5.el6.x86_64
---> Package torque-client.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-debuginfo.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-devel.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-drmaa.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-drmaa-devel.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-libs.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-mom.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-pam.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-scheduler.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-server.x86_64 0:4.2.10-5.el6 will be installed
--> Running transaction check
---> Package munge.x86_64 0:0.5.10-1.el6 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
 Package             Arch    Version       Repository                              Size
================================================================================
Installing:
 torque              x86_64  4.2.10-5.el6  /torque-4.2.10-5.el6.x86_64             178 k
 torque-client       x86_64  4.2.10-5.el6  /torque-client-4.2.10-5.el6.x86_64      680 k
 torque-debuginfo    x86_64  4.2.10-5.el6  /torque-debuginfo-4.2.10-5.el6.x86_64    16 M
 torque-devel        x86_64  4.2.10-5.el6  /torque-devel-4.2.10-5.el6.x86_64       421 k
 torque-drmaa        x86_64  4.2.10-5.el6  /torque-drmaa-4.2.10-5.el6.x86_64        51 k
 torque-drmaa-devel  x86_64  4.2.10-5.el6  /torque-drmaa-devel-4.2.10-5.el6.x86_64  40 k
 torque-libs         x86_64  4.2.10-5.el6  /torque-libs-4.2.10-5.el6.x86_64        280 k
 torque-mom          x86_64  4.2.10-5.el6  /torque-mom-4.2.10-5.el6.x86_64         563 k
 torque-pam          x86_64  4.2.10-5.el6  /torque-pam-4.2.10-5.el6.x86_64         8.3 k
 torque-scheduler    x86_64  4.2.10-5.el6  /torque-scheduler-4.2.10-5.el6.x86_64   115 k
 torque-server       x86_64  4.2.10-5.el6  /torque-server-4.2.10-5.el6.x86_64      701 k
Installing for dependencies:
 munge               x86_64  0.5.10-1.el6  epel                                    111 k
Transaction Summary
================================================================================
Install 12 Package(s)
Total size: 19 M
Total download size: 111 k
Installed size: 19 M
Is this ok [y/N]: y
Downloading Packages:
munge-0.5.10-1.el6.x86_64.rpm | 111 kB 00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : munge-0.5.10-1.el6.x86_64 1/12
  Installing : torque-libs-4.2.10-5.el6.x86_64 2/12
  Installing : torque-4.2.10-5.el6.x86_64 3/12
  Installing : torque-devel-4.2.10-5.el6.x86_64 4/12
  Installing : torque-drmaa-4.2.10-5.el6.x86_64 5/12
  Installing : torque-drmaa-devel-4.2.10-5.el6.x86_64 6/12
  Installing : torque-server-4.2.10-5.el6.x86_64 7/12
  Installing : torque-mom-4.2.10-5.el6.x86_64 8/12
  Installing : torque-client-4.2.10-5.el6.x86_64 9/12
  Installing : torque-scheduler-4.2.10-5.el6.x86_64 10/12
  Installing : torque-pam-4.2.10-5.el6.x86_64 11/12
  Installing : torque-debuginfo-4.2.10-5.el6.x86_64 12/12
  Verifying : torque-4.2.10-5.el6.x86_64 1/12
  Verifying : torque-drmaa-devel-4.2.10-5.el6.x86_64 2/12
  Verifying : torque-libs-4.2.10-5.el6.x86_64 3/12
  Verifying : torque-debuginfo-4.2.10-5.el6.x86_64 4/12
  Verifying : torque-server-4.2.10-5.el6.x86_64 5/12
  Verifying : torque-devel-4.2.10-5.el6.x86_64 6/12
  Verifying : torque-mom-4.2.10-5.el6.x86_64 7/12
  Verifying : torque-pam-4.2.10-5.el6.x86_64 8/12
  Verifying : torque-drmaa-4.2.10-5.el6.x86_64 9/12
  Verifying : torque-client-4.2.10-5.el6.x86_64 10/12
  Verifying : torque-scheduler-4.2.10-5.el6.x86_64 11/12
  Verifying : munge-0.5.10-1.el6.x86_64 12/12
Installed:
  torque.x86_64 0:4.2.10-5.el6
  torque-client.x86_64 0:4.2.10-5.el6
  torque-debuginfo.x86_64 0:4.2.10-5.el6
  torque-devel.x86_64 0:4.2.10-5.el6
  torque-drmaa.x86_64 0:4.2.10-5.el6
  torque-drmaa-devel.x86_64 0:4.2.10-5.el6
  torque-libs.x86_64 0:4.2.10-5.el6
  torque-mom.x86_64 0:4.2.10-5.el6
  torque-pam.x86_64 0:4.2.10-5.el6
  torque-scheduler.x86_64 0:4.2.10-5.el6
  torque-server.x86_64 0:4.2.10-5.el6
Dependency Installed:
  munge.x86_64 0:0.5.10-1.el6
torque.x86_64 torque-libs.x86_64 torque-mom
The trqauthd and pbs_mom daemons should run as root on the client node.
The following requirements should be met before you start:
Make sure that /etc/hosts on all of the boxes in the cluster contains the hostnames of every PC in the cluster. Ensure that the hostnames of the server and the nodes are identical in /etc/hosts on every box.
Never use localhost as the name of your headnode or execution node. Use the hostname defined in /etc/hosts for the main interface.
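The /etc/hosts requirements above can be checked with a small helper. This is a sketch under assumptions: the function name host_in_hosts_file is hypothetical, and treating an entry that exists only on a loopback address as a failure is my reading of "the main interface", not something the text spells out.

```shell
# Check that a hostname has a non-loopback entry in a hosts file.
# Arguments: hostname, optional hosts file (defaults to /etc/hosts).
host_in_hosts_file() {
    name="$1"
    file="${2:-/etc/hosts}"
    # "localhost" is never acceptable as a node name
    [ "$name" != "localhost" ] || return 1
    awk -v h="$name" '
        /^[ \t]*#/ { next }                      # skip comment lines
        $1 ~ /^127\./ || $1 == "::1" { next }    # skip loopback addresses
        { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Possible use on every box, for every node name in the cluster:
#   host_in_hosts_file master || echo "master is missing from /etc/hosts"
```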
Be sure to open TCP for all machines using TORQUE or disable the firewall. The pbs_server (server) and pbs_mom (client) by default use TCP and UDP ports 15001-15004. pbs_mom (client) also uses UDP ports 1023 and below if privileged ports are configured (the default).
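If you keep the firewall enabled, the port range named above has to be opened explicitly. The sketch below only prints iptables commands so they can be reviewed first; the function name torque_firewall_rules and the example network are assumptions.

```shell
# Print (do not run) iptables rules opening the TORQUE ports 15001-15004.
# Argument: source network in CIDR form, e.g. 192.168.1.0/24 (hypothetical).
torque_firewall_rules() {
    net="$1"
    for proto in tcp udp; do
        printf 'iptables -I INPUT -s %s -p %s --dport 15001:15004 -j ACCEPT\n' "$net" "$proto"
    done
}

# Review the output, then run it as root, for example:
#   torque_firewall_rules 192.168.1.0/24 | sh
# On RHEL/CentOS 6 the rules are usually made persistent with: service iptables save
```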
Unlike SGE, PBS does not require NFS, but using it simplifies the installation of packages on the nodes. Usually clusters have a shared filesystem for all nodes, so this requirement is automatically met.
The instructions below use the environment variable $PBS_HOME. This is the base directory for configuration directories. It defaults to /var/lib/torque.
Like SGE, PBS relies on certain environment variables to operate. But the RPM does not provide an environment settings file for /etc/profile.d. In the case of the Fedora RPMs, PBS_HOME and other critical environment variables are hardwired directly in each init script. For example, as mentioned before, PBS_HOME is set to /var/lib/torque by an assignment inside each init script.
You need to set this variable before the configuration.
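Since the RPM ships no environment file, one can be created by hand. A minimal sketch, assuming the file name /etc/profile.d/torque.sh (my choice, not something the package defines) and setting only PBS_HOME, the one variable the text names:

```shell
# Write a profile.d-style snippet that exports PBS_HOME.
# Arguments: target file, optional PBS_HOME value (defaults to /var/lib/torque).
write_torque_profile() {
    target="$1"
    home="${2:-/var/lib/torque}"
    printf 'export PBS_HOME=%s\n' "$home" > "$target"
}

# Possible use as root; new login shells will then pick the variable up:
#   write_torque_profile /etc/profile.d/torque.sh
# For the current shell only:
#   export PBS_HOME=/var/lib/torque
```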
After that, three configuration files need to be created or updated.
The following lines can serve as an example of the nodes file ($PBS_HOME/server_priv/nodes):
master np=8
node01 np=8
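For more than a handful of nodes it is easier to generate these lines than to type them. A sketch, in which the function name make_nodes_file and the host:cores argument convention are assumptions:

```shell
# Print nodes-file lines ("<host> np=<cores>") from "host:cores" arguments.
make_nodes_file() {
    for spec in "$@"; do
        printf '%s np=%s\n' "${spec%%:*}" "${spec##*:}"
    done
}

# Possible use as root on the headnode:
#   make_nodes_file master:8 node01:8 > "$PBS_HOME/server_priv/nodes"
```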
The following lines can serve as an example of the pbs_mom configuration file ($PBS_HOME/mom_priv/config):
# Configuration for pbs_mom.
$pbsserver master
$logevent 0x0ff
The $pbsserver directive tells each MOM where the headnode is. The default, localhost, is suitable only for a minimal configuration in which the server and the client run on the same machine.
In our case the server is called master.
The $logevent directive specifies what information should be logged during operation. A value of 0x0ff causes all messages except debug messages to be logged, while 0x1ff causes all messages, including debug messages, to be logged.
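The two directives described above are all a minimal pbs_mom config needs, so the file can be generated per cluster. A sketch; the function name make_mom_config is hypothetical:

```shell
# Print a minimal pbs_mom config for the given server name and log mask.
# Arguments: headnode name, optional $logevent mask (defaults to 0x0ff).
make_mom_config() {
    server="$1"
    mask="${2:-0x0ff}"
    printf '# Configuration for pbs_mom.\n'
    printf '$pbsserver %s\n' "$server"
    printf '$logevent %s\n' "$mask"
}

# The mask is a bit field: 0x0ff is 255 in decimal, and 0x1ff (511) sets one
# more bit, 0x100, which according to the text above turns on debug messages.
# Possible use as root on each node:
#   make_mom_config master > "$PBS_HOME/mom_priv/config"
```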
You can initialize serverdb in two different ways:
/usr/sbin/pbs_server -D -t create
Warning: this will remove any existing serverdb file located at /var/lib/torque/server_priv/serverdb
You need to press Ctrl-C to stop pbs_server after it has started; it takes only a couple of seconds to create this file.
The second way is to use the torque.setup script, which is located in /usr/share/doc/torque-4.2.10/
root@centos: # cd /usr/share/doc/torque-4.2.10/
17/07/03 00:34 /usr/share/doc/torque-4.2.10 ============================centos
root@centos: # ll
total 220
drwxr-xr-x    2 root root   4096 Jul  3 00:17 ./
drwxr-xr-x. 845 root root  36864 Jul  3 00:17 ../
-rw-r--r--    1 root root 143903 Mar 19  2015 CHANGELOG
-rw-r--r--    1 root root   4123 Mar 19  2015 PBS_License_2.3.txt
-rw-r--r--    1 root root   4123 Mar 19  2015 PBS_License.txt
-rw-r--r--    1 root root   2066 Jun 30  2015 README.Fedora
-rw-r--r--    1 root root   1541 Mar 19  2015 README.torque
-rw-r--r--    1 root root   3351 Mar 19  2015 Release_Notes
-rw-r--r--    1 root root   1884 Mar 19  2015 torque.setup
sudo su -
cd /usr/share/doc/torque-4.2.10/
./torque.setup root localhost
NOTE: As a side effect this will wipe out your $PBS_HOME/server_priv/nodes file
NOTE: all daemons should be started as root.
For some reason this version of EPEL Torque is built with munge support. As if trqauthd is not enough. This package should be installed on all nodes. The munge package should already be installed and configured before you start Torque. Configuration consists of distribution of the key from the headnode to all computational nodes. First you need to create a munge key on the headnode using the command:
service munge start
service trqauthd start
service pbs_mom start
service pbs_server start
ATTENTION: If you did not use ./torque.setup script before that point (see above), the first time you run pbs_server, you need to start it with the -t create flag to initialize the server configuration. In this case do not use init script. Use the command line invocation (you need to set up environment first):
pbs_server -t create
PBS_Server master: Create mode and server database exists, do you wish to continue y/(n)?y
Create a new queue which we name batch :
qmgr -c "create queue batch queue_type=execution"
qmgr -c "set queue batch enabled=true"
qmgr -c "set queue batch started=true"
qmgr -c "set server scheduling=True"
or (taken from torque.setup script):
qmgr -c "s s scheduling=true"
qmgr -c "c q batch queue_type=execution"
qmgr -c "s q batch started=true"
qmgr -c "s q batch enabled=true"
qmgr -c "s q batch resources_default.nodes=1"
qmgr -c "s q batch resources_default.walltime=3600"
qmgr -c "s s default_queue=batch"
qmgr -c "c n master" # Add one batch worker to your pbs_server. If this is a single server that will be master
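The same queue definition can be produced for any queue name and fed to qmgr in one go, since qmgr also reads directives from standard input. A sketch; the function name queue_setup_directives is hypothetical, and the values are the ones used in the commands above.

```shell
# Print qmgr directives that create and enable an execution queue.
# Argument: queue name (defaults to "batch", as in the text).
queue_setup_directives() {
    q="${1:-batch}"
    cat <<EOF
create queue $q queue_type=execution
set queue $q enabled=true
set queue $q started=true
set queue $q resources_default.nodes=1
set queue $q resources_default.walltime=3600
set server default_queue=$q
set server scheduling=true
EOF
}

# Possible use as root on the headnode, with pbs_server running:
#   queue_setup_directives batch | qmgr
```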
# pbsnodes
centos68
     state = free
     np = 2
     ntype = cluster
     status = rectime=1499059096,varattr=,jobs=,state=free,netload=878975846,gres=,loadave=0.00,ncpus=2,physmem=1878356kb,availmem=5699504kb,totmem=6072656kb,idletime=5350,nusers=0,nsessions=0,uname=Linux centos68 2.6.32-642.el6.x86_64 #1 SMP Tue May 10 17:27:01 UTC 2016 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
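On a cluster with many nodes only the state line of each record is usually of interest. The following sketch condenses pbsnodes output to one line per node; the function name node_states is hypothetical, and it assumes the layout shown above (node name unindented, attributes indented).

```shell
# Read pbsnodes-style output on stdin and print "<node> <state>" pairs.
node_states() {
    awk '
        /^[^ \t]/ { node = $1 }                            # unindented line: node name
        $1 == "state" && $2 == "=" { print node, $3 }      # indented "state = ..." line
    '
}

# Possible use:
#   pbsnodes -a | node_states
```

Any node reported as down is worth checking first for a pbs_mom that is not running or cannot reach the server.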
If pbsnodes command works OK, use chkconfig to start the services at boot time.
/sbin/chkconfig munge on
/sbin/chkconfig trqauthd on
/sbin/chkconfig pbs_mom on
/sbin/chkconfig pbs_server on
/sbin/chkconfig pbs_sched on
To check the status of the cluster, issue the following:
$ pbsnodes -a
A trivial test is to simply run sleep:
$ echo "sleep 30" | qsub
bezroun@centos68: $ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
3.centos68                STDIN            bezroun                0 R batch
The job should be visible in the queue for 30 seconds and then disappear. To check the queue, use the qstat command.
The second test should produce some output
Note: by default, STDOUT and STDERR of a job are written to text files named after the job, with the suffixes .o and .e followed by the job number (for example test.sh.o4 and test.sh.e4), in the directory from which the qsub command was issued.
For example, try the script test.sh containing the following lines:
#!/bin/bash
#PBS -l walltime=00:1:00
#PBS -l nice=19
#PBS -q test
date
sleep 10
date
Now submit it via qsub command
This should run for about 10 seconds. Check whether the job is in the queue using qstat. Torque should also produce test.sh.e# and test.sh.o# files as output:
$ ls test*
test.sh test.sh.e4 test.sh.o4
Output should look like:
$ cat test.sh.o4
Mon Jul 3 02:11:16 EDT 2017
Mon Jul 3 02:11:26 EDT 2017
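The names of the output files follow from the job name and the numeric part of the job ID, as the test.sh.o4 example shows. A small sketch that derives them; the function name job_output_files is hypothetical:

```shell
# Derive the default stdout/stderr file names for a job.
# Arguments: job name (the script name for a script submission),
#            job ID as printed by qsub, e.g. "4.centos68".
job_output_files() {
    name="$1"
    seq="${2%%.*}"      # "4.centos68" -> "4"
    printf '%s.o%s %s.e%s\n' "$name" "$seq" "$name" "$seq"
}

# Possible use in a test script:
#   jobid=$(qsub test.sh)
#   job_output_files test.sh "$jobid"
```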
As a regular user (not as root), run the following:
qsub <<EOF
hostname
echo "Hi I am a batch job running in torque"
sleep 10
EOF
Monitor the state of that job with qstat.
qstat is used to check job status. Use the -n switch to see which nodes are running which jobs.
Torque-Maui install rpm - Physics Department, Princeton University
- pbs_server.conf file to copy at /var/spool/PBS and run qmgr < /var/spool/PBS/pbs_server.conf
- copy config /var/spool/PBS/mom_priv/config
- no need to change the server_name on nodes. 'head' is default which is recognized
#service pbs_server start
Starting TORQUE Server: PBS_Server: LOG_ERROR::Permission denied (13) in chk_file_sec, Security violation with "/var/spool/PBS/spool/" - /var/spool/PBS/$
PBS_Server: LOG_ERROR::PBS_Server, pbsd_init failed
Solution: make /var/spool/PBS/spool writable by others (that permission was missing).
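Whether the "others" write bit is set can be verified before starting the server. A sketch, assuming GNU stat as shipped with CentOS/RHEL; the function name others_can_write is hypothetical.

```shell
# Return 0 if "others" have write permission on the given directory.
others_can_write() {
    mode=$(stat -c '%a' "$1") || return 2
    last=${mode#"${mode%?}"}      # last octal digit = permissions for "others"
    case "$last" in
        2|3|6|7) return 0 ;;      # write bit (2) is set
        *)       return 1 ;;
    esac
}

# Possible use:
#   others_can_write /var/spool/PBS/spool || echo "spool directory is not writable by others"
# A commonly used fix, stated here as an assumption rather than taken from the notes above:
#   chmod 1777 /var/spool/PBS/spool
```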
You will need to customize these scripts to match your system.
These options can be added to the self-extracting packages. For more details, see the INSTALL file.
Install torque on a single node Centos 6.4 - Discngine
PBS is widely used as a queuing system on clusters from small to huge. This is just a little post to summarize all the installation steps necessary to get a single-node torque server running on CentOS 6.4 (64bit). Don't ask yourself about the use of this. You can definitely manage a heavy workload intelligently on a single machine, but here we want to do this mainly to have a development framework around torque / PBS for bigger clusters, so it'll just be a sort of testing environment. Torque is a fork of PBS and as such is very similar to the widely used PBS Pro.
Resolving dependencies :
I considered a self compiled installation of torque, so for this a few dependencies are necessary :
[root]# yum install libxml2-devel openssl-devel gcc gcc-c++ boost-devel
Configuring your firewall :
Open the following ports for tcp on your firewall : 15003, 15001 (you can use the graphical firewall setup tool available in CENTOS to do that or go through iptables).
Building torque :
First download the latest torque release from the adaptive computing website or via command line :
wget http://www.adaptivecomputing.com/downloading/?file=/torque/torque-4.2.9.tar.gz
tar -xzvf torque-4.2.9.tar.gz
cd torque-4.2.9
Next let's consider a default installation, where binaries and libraries will be installed to /usr/local.
./configure
[root]# make
[root]# make install
If not already done so add /usr/local/bin and /usr/local/sbin to your user and root PATH variables (add it to your .bashrc or .cshrc).
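The PATH change can be made idempotent, so that sourcing the file repeatedly does not keep growing the variable. A sketch for a Bourne-style shell such as bash; the function name path_prepend is hypothetical.

```shell
# Prepend a directory to PATH unless it is already present.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already there, do nothing
        *) PATH="$1:$PATH" ;;
    esac
}

path_prepend /usr/local/sbin
path_prepend /usr/local/bin
export PATH
```

Placed in .bashrc, these lines make the freshly built binaries such as pbs_server, pbs_mom and qsub available without typing full paths.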
Next you need to install and start the torque authorization daemon and we can also copy all files to start torque as a server afterwards :
[root]# cp contrib/init.d/trqauthd /etc/init.d/
[root]# cp contrib/init.d/pbs_mom /etc/init.d/pbs_mom
[root]# cp contrib/init.d/pbs_server /etc/init.d/pbs_server
[root]# cp contrib/init.d/pbs_sched /etc/init.d/pbs_sched
[root]# chkconfig --add trqauthd
[root]# chkconfig --add pbs_mom
[root]# chkconfig --add pbs_server
[root]# chkconfig --add pbs_sched
[root]# echo '/usr/local/lib' > /etc/ld.so.conf.d/torque.conf
[root]# ldconfig
[root]# service trqauthd start
Add the servername hosting the torque server to /var/spool/torque/server_name. Next set the library path to torque.conf :
[root]# echo '/usr/local/lib' > /etc/ld.so.conf.d/torque.conf
[root]# ldconfig
Initialize the serverdb by executing the following as root :
[root]# ./torque.setup root
Add the compute node (the server itself) to the nodes file. This can be done by adding the following into the /var/spool/torque/server_priv/nodes file :
MYMACHINENAME np=4
where MYMACHINENAME is the name of your node and np indicates the number of available CPU's for the queue. Adapt this to your system.
You also need to define the server by adding the following to the /var/spool/torque/mom_priv/config file :
$pbsserver MYMACHINENAME
$logevent 255
Here again MYMACHINENAME indicates the name of the server issuing jobs. As the node and the server is the same in our configuration, specify the same name as in the previous nodes file.
Finish the configuration with :
qterm -t quick
pbs_server
pbs_mom (normally only on the node)
Check if you can see your nodes by issuing the pbsnodes -a command.
Start the scheduler on the server using :
pbs_sched
As a user login at least once onto the server via ssh from the server itself to add the server to the known hosts file :
Create a new queue which we name test here :
qmgr -c "create queue test queue_type=execution"
qmgr -c "set queue test enabled=true"
qmgr -c "set queue test started=true"
qmgr -c "set server scheduling=True"
First test job submission
Create a sample job submission file called test.sh containing the following lines :
#!/bin/bash
#PBS -l walltime=00:1:00
#PBS -l nice=19
#PBS -q test
date
sleep 10
date
This should run during 10 seconds. Check if the job is inside the queue using qstat. Torque should produce also a test.sh.e# and test.sh.o# file as output.
All the following needs to be done as root on the box that will act as single-node cluster. First, of course, one needs to install the necessary packages. This can be done easily, with the caveat that you get Torque v2.4.16, which at this point is at end of life. I do not want to bother with non-packaged installs, as that would make my life harder later, so here goes.
apt-get install torque-server torque-client torque-mom torque-pam
Installing the packages also sets up torque with a default setup that is in no way helpful. So next you'll need to stop all torque services and recreate a clean setup.
/etc/init.d/torque-mom stop
/etc/init.d/torque-scheduler stop
/etc/init.d/torque-server stop
pbs_server -t create
You'll need to answer 'yes' here to overwrite the existing database. Next, kill the just-started server instance so we can set a few things manually.
killall pbs_server
If you don't kill the server, many things you do below will be overwritten the next time the server stops. Next, let's set up the server process; in the following, replace 'SERVER.DOMAIN' with your box's fully-qualified domain name [Note: see just below if your machine doesn't have an official FQDN]. I prefer to use FQDN's so that it's easier later to add other compute nodes, job submission nodes, etc. The following also sets up the server process to allow user 'root' to change configurations in the database. This bit seemed missing from the default install, and it took me a while to figure it out (again).
echo SERVER.DOMAIN > /etc/torque/server_name
echo SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/acl_hosts
echo root@SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/operators
echo root@SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/managers
If the machine you're installing Torque on doesn't have an official FQDN, a simple work-around is to invent one and assign it to the machine's network IP. For example, if eth0 is assigned to 192.168.1.1, we can add the following line to /etc/hosts.
192.168.1.1 SERVER.DOMAIN
The FQDN itself can be anything you want, but ideally choose something that cannot exist in reality, so something with a non-existent top-level domain.
A cluster is nothing without some compute nodes, so next we tell the server process that the box itself is a compute node (with 4 cores, below; change this to suit your requirements).
echo "SERVER.DOMAIN np=4" > /var/spool/torque/server_priv/nodes
We also need to tell the MOM process (i.e. the compute node handler) which server to contact for work.
echo SERVER.DOMAIN > /var/spool/torque/mom_priv/config
Once this is done, we can restart all processes again.
/etc/init.d/torque-server start
/etc/init.d/torque-scheduler start
/etc/init.d/torque-mom start
If you get any errors at this point, I'd suggest stopping any running processes, and restart them one by one in this order. Check the logs (under /var/spool/torque) for whatever is failing. Otherwise, the next step is to start the scheduler.
# set scheduling properties
qmgr -c 'set server scheduling = true'
qmgr -c 'set server keep_completed = 300'
qmgr -c 'set server mom_job_sync = true'
At this point, if you get the dreaded Unauthorized Request error, it's critical to figure out why this is happening. Usually it is because the commands look like they're coming from an unauthorized user/machine, that is anything different from the string 'root@SERVER.DOMAIN'. You can check this with the following command.
grep Unauthorized /var/spool/torque/server_logs/*
We also need to create a default queue (here called 'batch'; you can change this to whatever you want). We'll set this up with default 1-hour time limit and single-node requirement, but you don't have to.
# create default queue
qmgr -c 'create queue batch'
qmgr -c 'set queue batch queue_type = execution'
qmgr -c 'set queue batch started = true'
qmgr -c 'set queue batch enabled = true'
qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
qmgr -c 'set queue batch resources_default.nodes = 1'
qmgr -c 'set server default_queue = batch'
Finally, we need to configure the server to allow submissions from itself. This one stumped me for a while. Note that the submit_hosts list cannot be made up of FQDNs! This will not work if you do that, as the comparison is done after truncating the name of the submitting host!
# configure submission pool
qmgr -c 'set server submit_hosts = SERVER'
qmgr -c 'set server allow_node_submit = true'
For example, if above you use SERVER.DOMAIN instead of just SERVER, you'll get an error like the following the next time you try to submit:
qsub: Bad UID for job execution MSG=ruserok failed validating USER/USER from SERVER.DOMAIN
where 'USER' will be the uid of whichever (non-root) user you try to submit the job as. The solution is simply to list the submission hosts as unqualified names. To test the system you need to try to submit a job (here an interactive one) as a non-root user from the same box.
qsub -I
If this works, you'll get into a shell on the same box (as if you ssh'ed into itself). Note that you'll need authorised SSH keys set up for this user to allow password-less ssh.
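The rule that submit_hosts must hold unqualified names can be applied mechanically when a host list contains FQDNs. A sketch; the function name short_host is hypothetical.

```shell
# Strip the domain part from a host name: SERVER.DOMAIN -> SERVER.
# A name without a dot is returned unchanged.
short_host() {
    printf '%s\n' "${1%%.*}"
}

# Possible use as root on the server:
#   qmgr -c "set server submit_hosts = $(short_host SERVER.DOMAIN)"
```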
Eventually I'll need to set up additional boxes as submission hosts. I'll write about that process once it's done. Meanwhile, if you have any questions about the above, or if there's anything that could do with some clarification, feel free to let me know in the comments.
linux - Torque installation on RHEL 6.5 - Stack Overflow
I want to install TORQUE on a RHEL 6 single machine (32 CPUs).
I followed every instructions of the manual to install it, but I am facing an error in the end. Here are all the steps I followed:
First step, make sure that libxml2-devel openssl-devel gcc gcc-c++ are installed and up-to-date:
# yum install libxml2-devel openssl-devel gcc gcc-c++
Setting up Install Process
Package libxml2-devel-2.7.6-14.el6.x86_64 already installed and latest version
Package openssl-devel-1.0.1e-16.el6_5.x86_64 already installed and latest version
Package gcc-4.4.7-4.el6.x86_64 already installed and latest version
Package gcc-c++-4.4.7-4.el6.x86_64 already installed and latest version
Nothing to do
Then I downloaded and extracted the last version. Then I ran the default configure:
I ran make and make install:
# make
# make install
With no errors.
I configured the trqauthd daemon to start automatically at system boot:
# cp contrib/init.d/trqauthd /etc/init.d/
# chkconfig --add trqauthd
# echo /usr/local/lib > /etc/ld.so.conf.d/torque.conf
# ldconfig
# service trqauthd start
Starting TORQUE Authorization Daemon: hostname: x6540
Currently no servers active. Default server will be listed as active server. Error 15133
Active server name: x6540 pbs_server port is: 15001
trqauthd daemonized - port 15005
[ OK ]
There is the first error there.
The error code means:
PBSE_SERVER_NOT_FOUND 15133 Could not connect to batch server
I continued the installation until the end anyway, and I managed to start mom and server services, but finally end up with:
# pbsnodes
localhost
     state = down
     np = 30
     properties = CIS
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
Can you help me? I can provide you with all logs/info needed. Thanks!!
Just run trqauthd and pbs_mom as root in the client node.
This is not an error actually, it just tells you that it cannot find any active pbs_server process. Later when you start pbs_server process, everything will work as normal.
Or if you run "service pbs_server start" first, you will not see the error.
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...
Last modified: March, 12, 2019