Torque installation from EPEL RPMs


Introduction

There is a strong initial tendency among CentOS and RHEL users to rely on RPMs for the installation of new packages. For small or widely used packages this approach usually works fine. For complex and rarely used packages you are often in for nasty surprises. Typically you face the "libraries hell" problem. Also, some packagers are pretty perverse and add extra dependencies to the package, or configure it in a completely bizarre way. They are typically volunteers, and nobody controls what they are doing.

As a result, you can spend an amount of time that vastly exceeds the time and effort of compiling the executables from source.

There are RPMs for Torque available from the Fedora EPEL repository, but those RPMs are fool's gold: the current version for RHEL 6.x is broken due to a SNAFU committed by the maintainer.

As usual for semi-open-source packages, installation and configuration documentation is almost non-existent. This page might slightly compensate for that.

First you need to download the correct version of the RPMs for installation on RHEL/CentOS 6.x: the one that works. The version that yum picks up from the EPEL repository does not work. The package maintainer recklessly enabled NUMA support and screwed up the application, instead of creating a separate set of packages for NUMA-enabled Torque. NUMA is not needed on typical Intel boxes. So this was a typical package maintainer perversion, for which users paid dearly.

Sh*t happens, especially with complex open source packages which do not have adequate manpower for development, testing, or packaging, but this was a real SNAFU that affected many naive users with real and pretty large clusters:

Gavin Nelson 2016-04-06 13:04:57 EDT

Please remove the NUMA support from this package group, or create an alternate package group.

My cluster has been dead for almost 2 weeks and the scientists are getting cranky. This feature does not play well with the MAUI scheduler and, apparently, not at all with the built-in scheduler (http://www.clusterresources.com/pipermail/torqueusers/2013-September/016136.html). Requiring this feature means having to introduce a whole host of changes to the Torque environment as well as forcing recompile of OpenMPI (last I checked epel version of openmpi does not have Torque support) and MAUI, which then means recompiling all the analysis applications, etc... I've tried...I really have.

I even tried rebuilding the package group from the src rpm, but when I remove the enable-numa switch from the torque.spec file it still builds with numa support (not sure what I'm missing there).

You need to download the pre-NUMA version (see Bug 1321154 – numa enabled torque don't work).

nucleo 2016-03-24 15:33:31 EDT

Description of problem:

After updating from torque-4.2.10-5.el6 to torque-4.2.10-9.el6 the pbs_mom service doesn't start.

Version-Release number of selected component (if applicable):
torque-mom-4.2.10-9.el6.x86_64

Actual results:

pbs_mom.9607;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in read_layout_file, Unable to read the layout file in /var/lib/torque/mom_priv/mom.layout

If I create an empty file /var/lib/torque/mom_priv/mom.layout then the pbs_mom service starts but never connects to the torque server, so the node is shown as down.

Expected results:

pbs_mom service should start and work correctly after update without creating any additional files such as mom.layout.

Additional info:

After downgrading to torque-4.2.10-5.el6 pbs_mom works fine without mom.layout file.

NUMA support was enabled in 4.2.10-6, so the last working version is 4.2.10-5. It can be downloaded from https://kojipkgs.fedoraproject.org//packages/torque/4.2.10/5.el6/
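
Note that once the 4.2.10-5 packages are installed, a plain yum update will pull the broken NUMA-enabled build right back in. Below is a minimal sketch of pinning the version with the yum versionlock plugin (assuming the plugin is available in your configured repositories):

yum install -y yum-plugin-versionlock
yum versionlock add 'torque*'   # lock all installed torque packages at 4.2.10-5
yum versionlock list            # verify that the locks are in place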

Packages required

Access to EPEL should be configured before installation starts.

Package/library hell is the most distinctive feature of all Linux distributions. In this case you may be able to get through without getting burned. You need to install the following packages, downloaded from https://kojipkgs.fedoraproject.org//packages/torque/4.2.10/5.el6/x86_64/ and not from EPEL:

-rw-r--r-- 1 root root   82428 Jun 30  2015 torque-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  332860 Jun 30  2015 torque-client-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root 3548332 Jun 30  2015 torque-debuginfo-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  200576 Jun 30  2015 torque-devel-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root   34792 Jun 30  2015 torque-drmaa-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root   41852 Jun 30  2015 torque-drmaa-devel-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  243432 Jun 30  2015 torque-gui-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  128116 Jun 30  2015 torque-libs-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  252312 Jun 30  2015 torque-mom-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root   19832 Jun 30  2015 torque-pam-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root   75084 Jun 30  2015 torque-scheduler-4.2.10-5.el6.x86_64.rpm
-rw-r--r-- 1 root root  314052 Jun 30  2015 torque-server-4.2.10-5.el6.x86_64.rpm
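
A minimal sketch of fetching all of them in one go (assuming the koji directory layout shown above is still in place):

# download the known-good 4.2.10-5 RPMs into the current directory
BASE=https://kojipkgs.fedoraproject.org/packages/torque/4.2.10/5.el6/x86_64
for p in torque torque-client torque-debuginfo torque-devel torque-drmaa \
         torque-drmaa-devel torque-gui torque-libs torque-mom torque-pam \
         torque-scheduler torque-server; do
    wget "$BASE/$p-4.2.10-5.el6.x86_64.rpm"
done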

Prerequisite packages

RPMs that should be installed on the headnode

torque.x86_64 0:4.2.10-5.el6 (pulls in munge as a dependency)
torque-client.x86_64 0:4.2.10-5.el6
torque-debuginfo.x86_64 0:4.2.10-5.el6
torque-devel.x86_64 0:4.2.10-5.el6
torque-drmaa.x86_64 0:4.2.10-5.el6
torque-drmaa-devel.x86_64 0:4.2.10-5.el6
torque-libs.x86_64 0:4.2.10-5.el6
torque-mom.x86_64 0:4.2.10-5.el6
torque-pam.x86_64 0:4.2.10-5.el6
torque-scheduler.x86_64 0:4.2.10-5.el6
torque-server.x86_64 0:4.2.10-5.el6

If you use more than one node, installation of the client on the headnode is not necessary, but it is useful for testing, as it is easier to make the client work on the same server as the headnode. You can uninstall it later.

After you have downloaded all the necessary packages into some directory, you can install them using the command

yum localinstall *.rpm

Here is how it looks:

[root@centos x86_64]# yum localinstall *.rpm
Loaded plugins: fastestmirror, refresh-packagekit, security
Setting up Local Package Process
Examining torque-4.2.10-5.el6.x86_64.rpm: torque-4.2.10-5.el6.x86_64
Marking torque-4.2.10-5.el6.x86_64.rpm to be installed
Loading mirror speeds from cached hostfile
 * base: mirrors.centos.webair.com
 * epel: epel.mirror.constant.com
 * extras: mirror.cs.vt.edu
 * updates: mirror.cc.columbia.edu
Examining torque-client-4.2.10-5.el6.x86_64.rpm: torque-client-4.2.10-5.el6.x86_64
Marking torque-client-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-debuginfo-4.2.10-5.el6.x86_64.rpm: torque-debuginfo-4.2.10-5.el6.x86_64
Marking torque-debuginfo-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-devel-4.2.10-5.el6.x86_64.rpm: torque-devel-4.2.10-5.el6.x86_64
Marking torque-devel-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-drmaa-4.2.10-5.el6.x86_64.rpm: torque-drmaa-4.2.10-5.el6.x86_64
Marking torque-drmaa-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-drmaa-devel-4.2.10-5.el6.x86_64.rpm: torque-drmaa-devel-4.2.10-5.el6.x86_64
Marking torque-drmaa-devel-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-libs-4.2.10-5.el6.x86_64.rpm: torque-libs-4.2.10-5.el6.x86_64
Marking torque-libs-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-mom-4.2.10-5.el6.x86_64.rpm: torque-mom-4.2.10-5.el6.x86_64
Marking torque-mom-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-pam-4.2.10-5.el6.x86_64.rpm: torque-pam-4.2.10-5.el6.x86_64
Marking torque-pam-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-scheduler-4.2.10-5.el6.x86_64.rpm: torque-scheduler-4.2.10-5.el6.x86_64
Marking torque-scheduler-4.2.10-5.el6.x86_64.rpm to be installed
Examining torque-server-4.2.10-5.el6.x86_64.rpm: torque-server-4.2.10-5.el6.x86_64
Marking torque-server-4.2.10-5.el6.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package torque.x86_64 0:4.2.10-5.el6 will be installed
--> Processing Dependency: munge for package: torque-4.2.10-5.el6.x86_64
---> Package torque-client.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-debuginfo.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-devel.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-drmaa.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-drmaa-devel.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-libs.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-mom.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-pam.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-scheduler.x86_64 0:4.2.10-5.el6 will be installed
---> Package torque-server.x86_64 0:4.2.10-5.el6 will be installed
--> Running transaction check
---> Package munge.x86_64 0:0.5.10-1.el6 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

========================================================================================================================
 Package                    Arch           Version                Repository                                       Size
========================================================================================================================
Installing:
 torque                     x86_64         4.2.10-5.el6           /torque-4.2.10-5.el6.x86_64                     178 k
 torque-client              x86_64         4.2.10-5.el6           /torque-client-4.2.10-5.el6.x86_64              680 k
 torque-debuginfo           x86_64         4.2.10-5.el6           /torque-debuginfo-4.2.10-5.el6.x86_64            16 M
 torque-devel               x86_64         4.2.10-5.el6           /torque-devel-4.2.10-5.el6.x86_64               421 k
 torque-drmaa               x86_64         4.2.10-5.el6           /torque-drmaa-4.2.10-5.el6.x86_64                51 k
 torque-drmaa-devel         x86_64         4.2.10-5.el6           /torque-drmaa-devel-4.2.10-5.el6.x86_64          40 k
 torque-libs                x86_64         4.2.10-5.el6           /torque-libs-4.2.10-5.el6.x86_64                280 k
 torque-mom                 x86_64         4.2.10-5.el6           /torque-mom-4.2.10-5.el6.x86_64                 563 k
 torque-pam                 x86_64         4.2.10-5.el6           /torque-pam-4.2.10-5.el6.x86_64                 8.3 k
 torque-scheduler           x86_64         4.2.10-5.el6           /torque-scheduler-4.2.10-5.el6.x86_64           115 k
 torque-server              x86_64         4.2.10-5.el6           /torque-server-4.2.10-5.el6.x86_64              701 k
Installing for dependencies:
 munge                      x86_64         0.5.10-1.el6           epel                                            111 k

Transaction Summary
========================================================================================================================
Install      12 Package(s)

Total size: 19 M
Total download size: 111 k
Installed size: 19 M
Is this ok [y/N]: y
Downloading Packages:
munge-0.5.10-1.el6.x86_64.rpm                                                                    | 111 kB     00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing : munge-0.5.10-1.el6.x86_64                                                                           1/12
  Installing : torque-libs-4.2.10-5.el6.x86_64                                                                     2/12
  Installing : torque-4.2.10-5.el6.x86_64                                                                          3/12
  Installing : torque-devel-4.2.10-5.el6.x86_64                                                                    4/12
  Installing : torque-drmaa-4.2.10-5.el6.x86_64                                                                    5/12
  Installing : torque-drmaa-devel-4.2.10-5.el6.x86_64                                                              6/12
  Installing : torque-server-4.2.10-5.el6.x86_64                                                                   7/12
  Installing : torque-mom-4.2.10-5.el6.x86_64                                                                      8/12
  Installing : torque-client-4.2.10-5.el6.x86_64                                                                   9/12
  Installing : torque-scheduler-4.2.10-5.el6.x86_64                                                               10/12
  Installing : torque-pam-4.2.10-5.el6.x86_64                                                                     11/12
  Installing : torque-debuginfo-4.2.10-5.el6.x86_64                                                               12/12
  Verifying  : torque-4.2.10-5.el6.x86_64                                                                          1/12
  Verifying  : torque-drmaa-devel-4.2.10-5.el6.x86_64                                                              2/12
  Verifying  : torque-libs-4.2.10-5.el6.x86_64                                                                     3/12
  Verifying  : torque-debuginfo-4.2.10-5.el6.x86_64                                                                4/12
  Verifying  : torque-server-4.2.10-5.el6.x86_64                                                                   5/12
  Verifying  : torque-devel-4.2.10-5.el6.x86_64                                                                    6/12
  Verifying  : torque-mom-4.2.10-5.el6.x86_64                                                                      7/12
  Verifying  : torque-pam-4.2.10-5.el6.x86_64                                                                      8/12
  Verifying  : torque-drmaa-4.2.10-5.el6.x86_64                                                                    9/12
  Verifying  : torque-client-4.2.10-5.el6.x86_64                                                                  10/12
  Verifying  : torque-scheduler-4.2.10-5.el6.x86_64                                                               11/12
  Verifying  : munge-0.5.10-1.el6.x86_64                                                                          12/12

Installed:
  torque.x86_64 0:4.2.10-5.el6            torque-client.x86_64 0:4.2.10-5.el6  torque-debuginfo.x86_64 0:4.2.10-5.el6
  torque-devel.x86_64 0:4.2.10-5.el6      torque-drmaa.x86_64 0:4.2.10-5.el6   torque-drmaa-devel.x86_64 0:4.2.10-5.el6
  torque-libs.x86_64 0:4.2.10-5.el6       torque-mom.x86_64 0:4.2.10-5.el6     torque-pam.x86_64 0:4.2.10-5.el6
  torque-scheduler.x86_64 0:4.2.10-5.el6  torque-server.x86_64 0:4.2.10-5.el6

Dependency Installed:
  munge.x86_64 0:0.5.10-1.el6

RPMs that should be installed on the computational nodes:

torque.x86_64
torque-libs.x86_64
torque-mom.x86_64

The trqauthd and pbs_mom daemons should run as root on each computational node.

Required changes in your server configuration

The following requirements should be met before you start:

  1. Passwordless ssh from the headnode to the clients should be enabled

  2. /etc/hosts for servers that constitute the cluster should be identical on all hosts

    Make sure that /etc/hosts on every box in the cluster contains the hostnames of every machine in the cluster. Ensure that the hostnames of the server and the nodes are identical in /etc/hosts on all of them.

    Never use localhost as the name of your headnode or execution node. Use the hostname defined in /etc/hosts for the main interface.

  3. Firewall should be temporarily shut down. Installation should be done with firewall daemon disabled; you will have  enough problems without it ;-)

    Be sure to open TCP for all machines using TORQUE, or disable the firewall. pbs_server (the server) and pbs_mom (the client) by default use TCP and UDP ports 15001-15004. pbs_mom (the client) also uses UDP ports 1023 and below if privileged ports are configured (the default). If you keep the firewall up, see the iptables sketch after this list.

  4. NFS is desirable

    Unlike SGE, one does not need to use NFS with PBS, but doing so simplifies the installation of packages on the nodes. Usually clusters have a shared filesystem for all nodes, so this requirement is met automatically.
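
If you prefer to keep iptables running despite the advice in item 3, here is a minimal sketch of opening the default Torque ports on RHEL/CentOS 6 (adjust to your local firewall policy):

# open the default TORQUE ports (pbs_server/pbs_mom use 15001-15004)
iptables -I INPUT -p tcp --dport 15001:15004 -j ACCEPT
iptables -I INPUT -p udp --dport 15001:15004 -j ACCEPT
service iptables save   # persist the rules across reboots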

Configuration

The instructions below use the environment variable $PBS_HOME. This is the base directory for the configuration directories; it defaults to /var/lib/torque.

Like SGE, PBS relies on certain environment variables to operate. But the RPMs do not provide an environment-setting file for /etc/profile.d. In the case of the Fedora RPMs, PBS_HOME and other critical environment variables are hardwired directly into each init script. For example, as mentioned before, PBS_HOME is set to /var/lib/torque via an instruction inside each init script:

export PBS_HOME=/var/lib/torque

You need to set this variable before starting the configuration.
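
Since the RPMs do not ship a profile.d file, you can create one yourself so that PBS_HOME is set in every interactive shell; a minimal sketch:

# create the environment file that the Fedora/EPEL RPMs omit
cat > /etc/profile.d/torque.sh <<'EOF'
export PBS_HOME=/var/lib/torque
EOF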

After that, three configuration files need to be created or updated (a non-interactive sketch follows the list):

  1. $PBS_HOME/server_name. Each node needs to know which machine is running the server. This is conveyed through the $PBS_HOME/server_name file which, for our configuration, should contain the result of running hostname --long on the headnode. For example: master
    vi $PBS_HOME/server_name
  2. $PBS_HOME/server_priv/nodes. The pbs_server daemon must know which nodes are available for executing jobs. This information is kept in the file $PBS_HOME/server_priv/nodes, which resides on the headnode only. You can set various properties for each node listed in the nodes file, but for this simple configuration only the number of processors is included.

    The following lines can serve as an example:

    master np=8
    node01 np=8
    vi $PBS_HOME/server_priv/nodes
  3. $PBS_HOME/mom_priv/config. Each pbs_mom daemon needs some basic information to participate in the batch system. This configuration information is contained in $PBS_HOME/mom_priv/config on every node.
    vi $PBS_HOME/mom_priv/config

    The following lines can serve as an example:

    # Configuration for pbs_mom.
    $pbsserver master
    $logevent  0x0ff

    The $pbsserver directive tells each MOM where the headnode is. The default value of localhost is suitable only for a minimal configuration where the server and the client run on the same machine.

    In our case the server is called master.

    The $logevent directive specifies what information should be logged during operation. A value of 0x0ff causes all messages except debug messages to be logged, while 0x1ff causes all messages, including debug messages, to be logged. 
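
Instead of editing each file with vi, the same three files can be created non-interactively; a minimal sketch using the example names from this section (master, node01) and assuming PBS_HOME is set as described above:

# 1. tell every node which machine runs the server
echo "master" > $PBS_HOME/server_name

# 2. list the execution nodes (headnode only)
cat > $PBS_HOME/server_priv/nodes <<'EOF'
master np=8
node01 np=8
EOF

# 3. point each MOM at the headnode
cat > $PBS_HOME/mom_priv/config <<'EOF'
$pbsserver master
$logevent  0x0ff
EOF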

After that you also need to create the initial configuration for the server. The server configuration is maintained in a file named serverdb, located in $PBS_HOME/server_priv. The serverdb file contains all parameters pertaining to the operation of Torque plus all of the queues that are in the configuration. If you have a previous configuration that you want to keep, back this file up first.

You can initialize serverdb in two different ways:

  1. The preferable way is to use pbs_server -t create (see -t option)
    /usr/sbin/pbs_server -D -t create

    Warning: this will remove any existing serverdb file located at /var/lib/torque/server_priv/serverdb

    You need to Ctrl-C pbs_server after it has started: it only takes a couple of seconds to create this file.

     
  2. As root, execute the ./torque.setup script, which in addition to setting up /var/lib/torque/server_priv/serverdb also creates the initial queue, called batch. The drawback is that this particular version will wipe out your $PBS_HOME/server_priv/nodes file, and you will need to restore it. You should view torque.setup and extract the queue-creation commands for further use.

    The script needs to be executed as root. You might need to add the path to the torque binaries and libraries to the corresponding variables (the root user does not use the PATH variable of normal users).

The script is located in /usr/share/doc/torque-4.2.10/

[1] root@centos: # cd /usr/share/doc/torque-4.2.10/
17/07/03 00:34 /usr/share/doc/torque-4.2.10 ============================centos
[0]root@centos: # ll
total 220
drwxr-xr-x    2 root root   4096 Jul  3 00:17 ./
drwxr-xr-x. 845 root root  36864 Jul  3 00:17 ../
-rw-r--r--    1 root root 143903 Mar 19  2015 CHANGELOG
-rw-r--r--    1 root root   4123 Mar 19  2015 PBS_License_2.3.txt
-rw-r--r--    1 root root   4123 Mar 19  2015 PBS_License.txt
-rw-r--r--    1 root root   2066 Jun 30  2015 README.Fedora
-rw-r--r--    1 root root   1541 Mar 19  2015 README.torque
-rw-r--r--    1 root root   3351 Mar 19  2015 Release_Notes
-rw-r--r--    1 root root   1884 Mar 19  2015 torque.setup
sudo su -
cd /usr/share/doc/torque-4.2.10/
./torque.setup root localhost

NOTE: As a side effect this will wipe out your $PBS_HOME/server_priv/nodes file.
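
Since torque.setup clobbers the nodes file, here is a minimal sketch of preserving it across the run (assuming PBS_HOME is set as above):

# torque.setup overwrites server_priv/nodes, so save and restore it
cp -p $PBS_HOME/server_priv/nodes /root/nodes.bak
./torque.setup root localhost
cp -p /root/nodes.bak $PBS_HOME/server_priv/nodes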

Starting Up PBS

NOTE: all daemons should be started as root.

For some reason this version of EPEL Torque is built with munge support, as if trqauthd were not enough. The munge package should be installed and configured on all nodes before you start Torque. Configuration consists of distributing the key from the headnode to all computational nodes. First you need to create a munge key on the headnode using the command:

/usr/sbin/create-munge-key
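
Then distribute the key to every computational node; a minimal sketch, assuming the default key location and a node named node01:

# copy the munge key from the headnode to each node (node01 is an example)
scp -p /etc/munge/munge.key node01:/etc/munge/munge.key
ssh node01 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'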
  1. Start munge service on the headnode
    service munge start
  2. Start trqauthd by running the init script created by the RPMs. This should be done on the headnode and on all nodes.
    service trqauthd start
  3. Start the MOM daemon on all nodes (or on the headnode, if that is the only node configured). It is better to start the MOM daemons on the computational nodes (and on the headnode, if you use it for computations too) first, so they are ready to communicate with the server daemon on the headnode once it is launched.
    service pbs_mom start
  4. Start the pbs_server and pbs_sched daemons.
    service pbs_server start
    service pbs_sched start

    ATTENTION: If you did not use the ./torque.setup script before this point (see above), then the first time you run pbs_server you need to start it with the -t create flag to initialize the server configuration. In this case do not use the init script; use the command-line invocation (you need to set up the environment first):

    pbs_server -t create
    PBS_Server master: Create mode and server database exists,
    do you wish to continue y/(n)?y
  5. Configure the queue batch

    Create a new queue which we name batch (see the verification command after this list):

    qmgr -c "create queue batch queue_type=execution"
    qmgr -c "set queue batch enabled=true"
    qmgr -c "set queue batch started=true"
    qmgr -c "set server scheduling=True" 

    or (taken from torque.setup script):

    qmgr -c "s s scheduling=true"
    qmgr -c "c q batch queue_type=execution"
    qmgr -c "s q batch started=true"
    qmgr -c "s q batch enabled=true"
    qmgr -c "s q batch resources_default.nodes=1"
    qmgr -c "s q batch resources_default.walltime=3600"
    qmgr -c "s s default_queue=batch"
    qmgr -c "c n master" # Add one batch worker to your pbs_server. If this is a single server that will be master
  6. Check if the node is visible and is listed as free
    # pbsnodes
    centos68
         state = free
         np = 2
         ntype = cluster
         status = rectime=1499059096,varattr=,jobs=,state=free,netload=878975846,gres=,loadave=0.00,ncpus=2,physmem=1878356kb,availmem=5699504kb,totmem=6072656kb,idletime=5350,nusers=0,nsessions=0,uname=Linux centos68 2.6.32-642.el6.x86_64 #1 SMP Tue May 10 17:27:01 UTC 2016 x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
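
To double-check the queue created in step 5 and the server attributes, you can print the full configuration with qmgr:

qmgr -c 'p s'   # print the complete server and queue configuration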
    

Register daemons via chkconfig

If the pbsnodes command works OK, use chkconfig to make the services start at boot time.

/sbin/chkconfig munge on
/sbin/chkconfig trqauthd on
/sbin/chkconfig pbs_mom on
/sbin/chkconfig pbs_server on
/sbin/chkconfig pbs_sched on 

Verifying cluster status

To check the status of the cluster, issue the following:

$ pbsnodes -a

First test job submission

Switch to a regular user. Tests should be run as a regular user, not as root.

A trivial test is to simply run sleep:

$ echo "sleep 30" | qsub
[0]bezroun@centos68: $ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
3.centos68                 STDIN            bezroun                0 R batch

The job should be visible in the queue for 30 seconds and then disappear. To check the queue, use the qstat command.

The second test should produce some output.

Note: STDOUT and STDERR for a queued job are logged by default in text files named after the job (<jobname>.o<jobid> and <jobname>.e<jobid>) and are written to the directory from which the qsub command was issued.

For example, try the script test.sh containing the following lines:

#!/bin/bash
#PBS -l walltime=00:01:00
#PBS -l nice=19
#PBS -q batch
date
sleep 10
date

Now submit it via the qsub command:

qsub test.sh

This should run for 10 seconds. Check whether the job is in the queue using qstat. Torque should also produce test.sh.e# and test.sh.o# files as output:

$ ls test*
test.sh  test.sh.e4  test.sh.o4

The output should look like:

$ cat test.sh.o4
Mon Jul  3 02:11:16 EDT 2017
Mon Jul  3 02:11:26 EDT 2017

As a regular user (not root), run the following:

qsub <<EOF
hostname
echo "Hi I am a batch job running in torque"
sleep 10
EOF

 Monitor the state of that job with qstat.

Checking job status

qstat is used to check job status.

Append the -n switch to see which nodes are doing which jobs.
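
A few common invocations (job ID 3 is the example job from above):

qstat           # one-line summary of all jobs
qstat -n        # same, plus the nodes allocated to each job
qstat -f 3      # full details for job 3: resources, variables, exit status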



Old News ;-)

Torque-Maui install rpm - Physics Department, Princeton University

RPM installed

on server:

torque.x86_64
torque-client.x86_64
torque-devel.x86_64
torque-libs.x86_64
torque-scheduler.x86_64
torque-server.x86_64

maui.x86_64
maui-client.x86_64
maui-server.x86_64

on nodes:

torque.x86_64
torque-libs.x86_64
torque-mom

Other changes

- pbs_server.conf file to copy at /var/spool/PBS and run qmgr < /var/spool/PBS/pbs_server.conf

on server:

on nodes:

- copy config /var/spool/PBS/mom_priv/config

- no need to change the server_name on nodes; 'head' is the default, which is recognized

Error encountered

#service pbs_server start

Starting TORQUE Server: PBS_Server: LOG_ERROR::Permission denied (13) in chk_file_sec, Security violation with "/var/spool/PBS/spool/" - /var/spool/PBS/$
PBS_Server: LOG_ERROR::PBS_Server, pbsd_init failed

Solution: the 'others' write permission on /var/spool/PBS/spool was missing; make the directory writable by others.


Install torque on a single node Centos 6.4 - Discngine

PBS is widely used as a queuing system on small to huge clusters. This is just a little post to summarize all the installation steps necessary to get a single-node torque server running on Centos 6.4 (64bit). Don't ask yourself about the use of this. You can definitely manage a heavy workload intelligently on a single machine, but here we want to do this mainly to have a development framework around torque / PBS for bigger clusters... so it'll just be a sort of testing environment. Torque is a fork of PBS and as such is very similar to the widely used PBS Pro.

Resolving dependencies :

I considered a self-compiled installation of torque, so a few dependencies are necessary:

[root]# yum install libxml2-devel openssl-devel gcc gcc-c++ boost-devel

Configuring your firewall :

Open the following ports for tcp on your firewall : 15003, 15001 (you can use the graphical firewall setup tool available in CENTOS to do that or go through iptables).

Building torque :

First download the latest torque release from the adaptive computing website or via command line :

wget http://www.adaptivecomputing.com/downloading/?file=/torque/torque-4.2.9.tar.gz
tar -xzvf torque-4.2.9.tar.gz
cd torque-4.2.9

Next let's consider a default installation, where binaries and libraries will be installed to /usr/local.

[root]# ./configure
[root]# make
[root]# make install

If not already done, add /usr/local/bin and /usr/local/sbin to your user and root PATH variables (add them to your .bashrc or .cshrc).

Next you need to install and start the torque authorization daemon; we can also copy all the files needed to start torque as a server afterwards:

[root]# cp contrib/init.d/trqauthd /etc/init.d/
[root]# cp contrib/init.d/pbs_mom /etc/init.d/pbs_mom
[root]# cp contrib/init.d/pbs_server /etc/init.d/pbs_server
[root]# cp contrib/init.d/pbs_sched /etc/init.d/pbs_sched

[root]# chkconfig --add trqauthd
[root]# chkconfig --add pbs_mom
[root]# chkconfig --add pbs_server
[root]# chkconfig --add pbs_sched

[root]# echo '/usr/local/lib' > /etc/ld.so.conf.d/torque.conf
[root]# ldconfig
[root]# service trqauthd start

Configuring torque

Add the servername hosting the torque server to /var/spool/torque/server_name. (The library path in /etc/ld.so.conf.d/torque.conf and the ldconfig run were already handled above.)

Initialize the serverdb by executing the following as root :

[root]# ./torque.setup root

Add the compute node (the server itself) to the nodes file. This can be done by adding the following into the /var/spool/torque/server_priv/nodes file :

MYMACHINENAME np=4

where MYMACHINENAME is the name of your node and np indicates the number of CPUs available to the queue. Adapt this to your system.
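
If you do not want to hard-code the CPU count, here is a minimal sketch that derives both values from the machine itself (nproc is part of coreutils):

# generate the nodes file from the actual hostname and CPU count
echo "$(hostname) np=$(nproc)" > /var/spool/torque/server_priv/nodes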

You also need to define the server by adding the following to the /var/spool/torque/mom_priv/config file :

$pbsserver MYMACHINENAME
$logevent 255

Here again MYMACHINENAME indicates the name of the server issuing jobs. As the node and the server are the same in our configuration, specify the same name as in the previous nodes file.

Finish the configuration with :

qterm -t quick
pbs_server
pbs_mom (normally only on the node)

Check if you can see your nodes by issuing the pbsnodes -a command.

Start the scheduler on the server using :

pbs_sched

As a user login at least once onto the server via ssh from the server itself to add the server to the known hosts file :

ssh username@MYMACHINENAME

Queue configuration

Create a new queue which we name test here :

qmgr -c "create queue test queue_type=execution"
qmgr -c "set queue test enabled=true"
qmgr -c "set queue test started=true"
qmgr -c "set server scheduling=True" 

First test job submission

Create a sample job submission file called test.sh containing the following lines :

#!/bin/bash
#PBS -l walltime=00:1:00
#PBS -l nice=19
#PBS -q test
date
sleep 10
date

This should run during 10 seconds. Check if the job is inside the queue using qstat. Torque should produce also a test.sh.e# and test.sh.o# file as output.

Muddy Boots

All the following needs to be done as root on the box that will act as single-node cluster. First, of course, one needs to install the necessary packages. This can be done easily, with the caveat that you get Torque v2.4.16, which at this point is at end of life. I do not want to bother with non-packaged installs, as that would make my life harder later, so here goes.

apt-get install torque-server torque-client torque-mom torque-pam

Installing the packages also sets up torque with a default setup that is in no way helpful. So next you'll need to stop all torque services and recreate a clean setup.

/etc/init.d/torque-mom stop
/etc/init.d/torque-scheduler stop
/etc/init.d/torque-server stop
pbs_server -t create

You'll need to answer 'yes' here to overwrite the existing database. Next, kill the just-started server instance so we can set a few things manually.

killall pbs_server

If you don't kill the server, many things you do below will be overwritten the next time the server stops. Next, let's set up the server process; in the following, replace 'SERVER.DOMAIN' with your box's fully-qualified domain name [Note: see just below if your machine doesn't have an official FQDN]. I prefer to use FQDN's so that it's easier later to add other compute nodes, job submission nodes, etc. The following also sets up the server process to allow user 'root' to change configurations in the database. This bit seemed missing from the default install, and it took me a while to figure it out (again).

echo SERVER.DOMAIN > /etc/torque/server_name
echo SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/acl_hosts
echo root@SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/operators
echo root@SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/managers

If the machine you're installing Torque on doesn't have an official FQDN, a simple work-around is to invent one and assign it to the machine's network IP. For example, if eth0 is assigned to 192.168.1.1, we can add the following line to /etc/hosts.

192.168.1.1 SERVER.DOMAIN

The FQDN itself can be anything you want, but ideally choose something that cannot exist in reality, so something with a non-existent top-level domain.

A cluster is nothing without some compute nodes, so next we tell the server process that the box itself is a compute node (with 4 cores, below – change this to suit your requirements).

echo "SERVER.DOMAIN np=4" > /var/spool/torque/server_priv/nodes

We also need to tell the MOM process (i.e. the compute node handler) which server to contact for work.

echo '$pbsserver SERVER.DOMAIN' > /var/spool/torque/mom_priv/config

Once this is done, we can restart all processes again.

/etc/init.d/torque-server start
/etc/init.d/torque-scheduler start
/etc/init.d/torque-mom start

If you get any errors at this point, I'd suggest stopping any running processes, and restart them one by one in this order. Check the logs (under /var/spool/torque) for whatever is failing. Otherwise, the next step is to start the scheduler.

# set scheduling properties
qmgr -c 'set server scheduling = true'
qmgr -c 'set server keep_completed = 300'
qmgr -c 'set server mom_job_sync = true'

At this point, if you get the dreaded Unauthorized Request error, it's critical to figure out why this is happening. Usually it is because the commands look like they're coming from an unauthorized user/machine, that is anything different from the string 'root@SERVER.DOMAIN'. You can check this with the following command.

grep Unauthorized /var/spool/torque/server_logs/*

We also need to create a default queue (here called 'batch' – you can change this to whatever you want). We'll set this up with default 1-hour time limit and single-node requirement, but you don't have to.

# create default queue
 qmgr -c 'create queue batch'
 qmgr -c 'set queue batch queue_type = execution'
 qmgr -c 'set queue batch started = true'
 qmgr -c 'set queue batch enabled = true'
 qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
 qmgr -c 'set queue batch resources_default.nodes = 1'
 qmgr -c 'set server default_queue = batch'

Finally, we need to configure the server to allow submissions from itself. This one stumped me for a while. Note that the submit_hosts list cannot be made up of FQDNs! This will not work if you do that, as the comparison is done after truncating the name of the submitting host!

# configure submission pool
 qmgr -c 'set server submit_hosts = SERVER'
 qmgr -c 'set server allow_node_submit = true'

For example, if above you use SERVER.DOMAIN instead of just SERVER, you'll get an error like the following the next time you try to submit:

qsub: Bad UID for job execution MSG=ruserok failed validating USER/USER from SERVER.DOMAIN

where 'USER' will be the uid of whichever (non-root) user you try to submit the job as. The solution is simply to list the submission hosts as unqualified names. To test the system you need to try to submit a job (here an interactive one) as a non-root user from the same box.

qsub -I

If this works, you'll get into a shell on the same box (as if you ssh'ed into itself). Note that you'll need authorised SSH keys set up for this user to allow password-less ssh.
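
Here is a minimal sketch of setting those keys up for the submitting user (run as that user, not as root; SERVER.DOMAIN is the placeholder FQDN from above):

# generate a passphrase-less key pair and authorise it for login to this box
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
ssh-copy-id SERVER.DOMAIN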

Eventually I'll need to set up additional boxes as submission hosts. I'll write about that process once it's done. Meanwhile, if you have any questions about the above, or if there's anything that could do with some clarification, feel free to let me know in the comments.

linux - Torque installation on RHEL 6.5 - Stack Overflow

I want to install TORQUE on a RHEL 6 single machine (32 CPUs).

I followed every instructions of the manual to install it, but I am facing an error in the end. Here are all the steps I followed:

First step, make sure that libxml2-devel openssl-devel gcc gcc-c++ are installed and up-to-date:

    # yum install libxml2-devel openssl-devel gcc gcc-c++
    Setting up Install Process
    Package libxml2-devel-2.7.6-14.el6.x86_64 already installed and latest version
    Package openssl-devel-1.0.1e-16.el6_5.x86_64 already installed and latest version
    Package gcc-4.4.7-4.el6.x86_64 already installed and latest version
    Package gcc-c++-4.4.7-4.el6.x86_64 already installed and latest version
    Nothing to do

Then I downloaded and extracted the latest version, and ran the default configure:

    # ./configure

I ran make and make install:

    # make
    # make install

With no errors.

I configured the trqauthd daemon to start automatically at system boot:

    # cp contrib/init.d/trqauthd /etc/init.d/
    # chkconfig --add trqauthd
    # echo /usr/local/lib > /etc/ld.so.conf.d/torque.conf
    # ldconfig
    # service trqauthd start
    Starting TORQUE Authorization Daemon: hostname: x6540
    Currently no servers active. Default server will be listed as active server. Error  15133
    Active server name: x6540  pbs_server port is: 15001
    trqauthd daemonized - port 15005
                                                       [  OK  ]

There is the first error.

The error code means:

    PBSE_SERVER_NOT_FOUND   15133   Could not connect to batch server

I continued the installation until the end anyway, and I managed to start the mom and server services, but finally ended up with:

    # pbsnodes
    localhost
         state = down
         np = 30
         properties = CIS
         ntype = cluster
         mom_service_port = 15002
         mom_manager_port = 15003

Can you help me? I can provide you with all logs/info needed. Thanks!!

Just run trqauthd and pbs_mom as root on the client node.

(answered Jan 5 '15 by Chenming Zhang)

This is not an error actually; it just tells you that it cannot find any active pbs_server process. Later, when you start the pbs_server process, everything will work as normal.

Or if you run "service pbs_server start" first, you will not see the error.
