The best way to manage licenses in SGE is to use consumable resources (CR). Floating licenses
can easily be managed with a global CR; the classic example of a built-in consumable resource in SGE is
slots.
The SGE batch scheduling system allows arbitrary "consumable resources" to be created that users
can then make requests against. They can therefore be used to limit access to software licenses based on
the availability of license tokens. When a job that uses a special software package starts, it requests
one (or more) licenses from SGE, and the consumable resource bookkeeping decrements the counter for that license pool.
If no more resources are available (i.e. the internal counter is at 0), then the job is delayed
until a currently used resource is freed up.
Types of consumables
The consumable parameter can have three values:
'yes' ('y'): allowed only for numeric attributes (INT, DOUBLE, MEMORY, TIME - see type above).
'no' ('n'): the attribute is not consumable.
'JOB' ('j'): allowed only for numeric attributes (INT, DOUBLE, MEMORY, TIME - see type above).
If set to 'yes' or 'JOB', the consumption of the corresponding resource can be managed
by Sun Grid Engine internal bookkeeping. In this case Sun Grid Engine accounts for the consumption of
this resource for all running jobs and ensures that jobs are only dispatched if the internal
bookkeeping indicates enough available consumable resources. Consumables are an efficient means
to manage limited resources such as available memory, free space on a file system, network bandwidth
or floating software licenses.
There are two types of consumables: per slot and per job.
A consumable defined with 'y' is a per-slot consumable, which means the requested amount is multiplied
by the number of slots being used by the job before being applied.
With 'j' the consumable is a per-job consumable. This resource is debited as requested
(without multiplication) from the allocated master queue; the resource need not be available in
the slave task queues.
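As a sketch of the difference, suppose two hypothetical consumables have been defined: "mem" with consumable=YES (per slot) and "lic" with consumable=JOB (per job), and a job asks for 4 slots in an smp parallel environment:
% qsub -pe smp 4 -l mem=2G,lic=1 myjob.sh
# per-slot bookkeeping: mem is debited 4 x 2G = 8G
# per-job bookkeeping:  lic is debited 1 token in total, from the master queue only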
Consumables can be combined with default or user defined load parameters (see sge_conf(5) and host_conf(5)),
i.e. load values can be reported for consumable attributes or the consumable flag can be set for load
attributes.
In this case the Sun Grid Engine consumable resource management takes both the load (measuring availability
of the resource) and the internal bookkeeping into account, and makes sure that neither
of the two exceeds the given limit.
To enable consumable resource management the basic availability of a resource has to be defined.
This can be done on a cluster-global, per-host and per-queue basis, and these categories may supersede
each other in the given order (i.e. a host can restrict availability of a cluster resource, and a
queue can restrict host and cluster resources).
Defining consumables
The definition of resource availability is performed with the complex_values entry in host_conf(5)
and queue_conf(5).
Basically, a complex is a resource or value that can be requested by a
job with the -l switch to qsub. Setting a complex to be consumable
means that when a job requests that complex, the number available is decreased.
The complex_values definition of the "global" host specifies cluster
global consumable settings. To each consumable complex attribute in a complex_values list a value
is assigned which denotes the maximum available amount for that resource. The internal bookkeeping will
subtract from this total the assumed resource consumption by all running jobs as expressed through the
jobs' resource requests.
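For instance, the same consumable can be capped at several levels; a sketch with a hypothetical "verilog" pool and illustrative host/queue names:
# cluster-wide pool on the pseudo host "global" (edit the complex_values line):
% qconf -me global
   complex_values   verilog=10
# a particular host may restrict it further:
% qconf -me node01
   complex_values   verilog=2
# and a queue can restrict host and cluster resources again:
% qconf -mq all.q
   complex_values   verilog=1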
Notes:
Jobs can be forced to request a resource, and thus to specify their assumed consumption,
via the 'forced' value of the requestable parameter (see above).
A default resource consumption value can be pre-defined by the administrator
for consumable attributes not explicitly requested by the job (see the default parameter
below). This is meaningful only if requesting the attribute is not enforced as explained above.
See the Sun Grid Engine Installation and Administration Guide for examples on the usage
of the consumable resources facility.
Here is how to achieve "license token" consumption in SGE (aka license token management).
Add the consumable
(configure a "per job" consumable complex attribute):
% qconf -mc
#name shortcut type relop requestable consumable default urgency
accel accel INT <= YES JOB 0 0
add the total tokens to the "global" host
% qconf -me global
complex_values accel=19
submit job requesting slots and license tokens:
% qsub -l accel=10 -pe mpi 8 <myjob.sh>
The "per job" setting ensure that the requested tokens are *not*
multiplied with the number of requested slots.
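Had the attribute been defined with consumable=YES instead of JOB, the bookkeeping would have tried to debit 10 x 8 = 80 tokens for this job, far more than the 19 configured on the "global" host, and the job would never be dispatched.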
First create/modify a complex called "global" (the name is reserved; similarly, the complexes that
manage resources on a per-host/per-queue basis are called "host" and "queue"). This can be reached by clicking
the "Complexes Configuration" button in qmon.
Enter the following values for the complex (verilog is used in this example):
#name shortcut type value relop requestable consumable default
#-------------------------------------------------------------
verilog vl INT 0 <= YES YES 0
The above says: there is a complex attribute called "verilog" with the shortcut name "vl", and it is
of type integer. The "value" field has no meaning for consumable resources (therefore it is 0). This
resource is requestable (YES), and it is consumable (YES).
The "default" field should be set to 0 (it is a default value for jobs that don't request anything,
but for a global resource it is not useful here).
When using qmon, do not forget to press the "Add" button to add the new complex definition to the
table below before applying with the "Ok" button.
After the complex is configured, it can be viewed by running the following command at the prompt:
% qconf -sc global
Step 2: Configure the "global" host
Since a global consumable resource is being created (all hosts have access to this resource), the
pseudo host "global" must be configured.
Using qmon:
qmon -> Host Configuration -> Execution host
Select the "global" host and click on "Modify". Select the tab titled "Consumable/Fixed Attributes".
It is correct that the "global" complex does not show in the window (the global host has it by default,
just as a host has the "host" complex by default).
Now click on the "Name/Value" title bar on the right (above the trash bin icon). A window pops up
and there will be the resource "verilog". Select OK and verilog will be added to the first column of
the table. Now enter the number of licenses of verilog in the second column.
Press "Ok" and the new resource and number in the will appear in the "Consumables/Fixed Attributes"
window. Click the "Done" button to close this window.
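The same setup can also be done entirely from the command line instead of qmon; a minimal sketch (the pool size of 10 and the script name are only examples):
# add the consumable attribute (opens an editor; add a line like this):
% qconf -mc
   verilog   vl   INT   <=   YES   YES   0   0
# assign the number of available licenses to the pseudo host "global":
% qconf -me global
   complex_values   verilog=10
# users then request a token at submission time:
% qsub -l verilog=1 my_simulation.sh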
We're using SGE (Sun Grid Engine). We have some limitations on
the total number of concurrent jobs from all users.
I would like to know if it's possible to set a temporary, voluntary
limit on the number of concurrent running jobs for a specific
user.
For example, user dave is about to submit 500
jobs, but he would like no more than 100 to run concurrently,
e.g. since he knows the jobs do a lot of I/O which chokes the
filesystem (true story, unfortunately).
You can define a complex with qconf -mc. Call it
something like high_io or whatever you'd like, and
set the consumable field to YES. Then in either the
global configuration with qconf -me global or in a
particular queue with qconf -mq <queue name> set
high_io=500 in the complex values. Now tell your
users to specify -l high_io=1 or however many
"tokens" you'd like them to use. This will limit the number of
concurrent jobs to whatever you set the complex value to.
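As a sketch, the pieces fit together like this (the queue name and numbers are only examples):
# complex definition added via qconf -mc:
#   high_io   hio   INT   <=   YES   YES   0   0
# pool size set via qconf -me global (or qconf -mq all.q):
#   complex_values   high_io=500
# each job then requests tokens; with a pool of 500 and 5 tokens per job,
# at most 100 such jobs run concurrently:
% qsub -l high_io=5 myjob.sh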
The other way to do this is with quotas. Add a quota with
qconf -arqs that looks something like:
{
name dave_max_slots
description "Limit dave to 500 slots"
enabled true
limit users {dave} to slots=500
}
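Once such a resource quota set is in place, its current consumption can be checked with qquota, e.g. qquota -u dave (the output format varies between Grid Engine versions).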
Thanks Kamil, and sorry for the late reply. A couple of
follow-ups, since I'm quite new to qconf.
Regarding your first suggestion, could you be a
bit more explicit? What is "consumable"? After
configuring as mentioned, do I simply tell the
user to qsub with -l high_io=1?
– David B, Sep 28 '10 at 9:39
Basically a complex is a resource or value that can be requested by
a job with the -l switch to qsub. By setting a complex to be
consumable, it means that when a job requests
that complex the number available is decreased.
So if a queue has 500 of the high_io complex,
and a job requests 20, there will be 480
available for other jobs. You'd request the
complex just as in your example.
– Kamil Kisiel, Sep 28 '10 at 22:42
We have found that for some tasks it is advantageous to tell SGE about the required resources.
This makes sense when heavy use of RAM or network storage is expected. The limits come in
soft and hard variants (parameters -soft, -hard); the limits themselves are specified as:
-l resource=value
For example, if a job needs at least 400MB of RAM: qsub -l ram_free=400M my_script.sh. Another
often requested resource is space in /tmp: qsub -l tmp_free=10G my_script.sh. Or both:
qsub -l ram_free=400M,tmp_free=10G my_script.sh
Of course, it is possible (and preferable if the number does not change) to use the construction
#$ -l ram_free=400M directly in the script. The actual status of a given resource on all nodes can
be obtained with qstat -F ram_free, or for several resources with qstat -F ram_free,tmp_free.
Details on other standard available resources are in /usr/local/share/SGE/doc/load_parameters.asc.
If you do not specify a value for a given resource, a default value is used (1GB for space in /tmp,
100MB for RAM).
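Putting the requests directly into a job script might look like this (a sketch; the script name and values are only examples):
#!/bin/bash
#$ -N my_experiment
#$ -l ram_free=400M,tmp_free=10G
# work in an explicitly chosen directory (see the note about -cwd below)
cd /where/do/i/want
./run_experiment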
WARNING: You need to distinguish between requesting resources
that must be available at the time of submission (so-called non-consumable resources) and allocating
a given resource for the whole runtime of your computation - for example, your program
will need 400MB of memory, but in the first 10 minutes of computation it will allocate only 100MB. If
you use the standard resource mem_free and other jobs are submitted
to the given node during those first 10 minutes, SGE will interpret it in the following way: you wanted 400MB but you actually
use only 100MB, so the remaining 300MB can be given to someone else (i.e. it will dispatch another
task requesting this memory).
For these purposes it is better to use consumable resources, which are accounted independently
of the current status of the task - for memory it is ram_free, for disk tmp_free.
For example, the resource ram_free does not look at the actual free RAM; it computes the
occupation of RAM based solely on the requests of the individual scripts. It takes the size of RAM
of the given machine and subtracts the amounts requested by the jobs scheduled to run on this machine.
If a job does not specify ram_free, the default value of ram_free=100M will
be used.
For the disk space in /tmp (tmp_free), the situation is trickier: if a job does not properly clean
up its mess after it finishes, the disk can actually have less free space than indicated by the
resource. Unfortunately, nothing can be done about this.
Known problems with SGE
Use of paths - for the home directory it is necessary to use the official path, i.e. /homes/kazi/...
or /homes/eva (or simply the variable $HOME). If the path of the internal mount point of
the automounter is used, i.e. /var/mnt/..., an error will occur. (This is not an error of
SGE; the internal path is not fully functional for access.)
Availability of nodes - due to the existence of nodes with limited access (employees' PCs),
it is necessary to specify a list of nodes on which your job can run. This can be done using
the parameter -q. The machines that are generally available are the nodes in IBM Blades and also some
computer labs, provided you turn the machines on over night. The list of queues for -q
must be on a single line, even if it is very long. To select given groups of nodes,
the parameter -q can be used in the following way:
#$ -q all.q@@blade,all.q@@PCNxxx,all.q@@servers
Main groups of computers are: @blade, @servers, @speech, @PCNxxx, @PCN2xxx - the full and current
list can be obtained with qconf -shgrpl
The syntax for access is QUEUE@OBJECT, i.e. all.q@OBJECT. The object is either a single computer,
for example all.q@svatava, or a group of computers (whose name also begins with @, e.g. @blade), i.e. all.q@@blade.
The computers in the labs are sometimes restarted by students during a computation - we can't
do much about this. If you really need the computation to finish (i.e. it is not easy to
re-run a job if it is brutally killed), use the newly defined groups of computers:
@stable - @blade, @servers - servers that run all the time without restarting
@PCOxxx, @PCNxxx - computer labs; any node might be restarted at any time, and a student or someone
can shut a machine down by error or "by error". It is more or less certain that these
machines will run smoothly over night and during weekends. There is also a group for each individual lab, e.g. @PCN103.
Running scripts other than bash - it is necessary to specify the interpreter on the first
line of your script (it is probably already there), for example #!/usr/bin/perl, etc.
Does your script generate heavy traffic on the matylda servers? Then it is necessary to set -l matyldaX=10
(for example 10, i.e. at most 100/10 = 10 concurrent jobs on the given matyldaX), where X is
the number of the matylda used (if you use several matyldas, specify -l matyldaX=Y
several times). We have created an SGE resource for each matylda (each matylda has 100 points
in total), and jobs using -l matyldaX=Y are dispatched as long as the given matylda has free
points. This can be used to balance the load of a given storage server from the user side. The
same holds for the scratch0X servers.
Pay attention to the parameter -cwd: it is not guaranteed to work all the time;
it is better to use cd /where/do/i/want at the beginning of your script.
If a node is restarted, a job will still be shown in SGE although it is not running
any more. This is because SGE waits until the node confirms termination of the computation
(i.e. until it boots Linux again and starts the SGE client). If you use qdel to
delete such a job, it will only be marked with the flag d. Jobs marked with this flag are automatically
deleted by the server every hour.
Parallel jobs - OpenMP
For parallel tasks with threads, it is enough to use the parallel environment smp and to
set the number of threads accordingly.
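A sketch of such a job script (4 threads is only an example; the smp environment must exist on the cluster):
#!/bin/bash
#$ -N omp_job
#$ -pe smp 4
# use the granted slot count as the OpenMP thread count
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_program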
Open MPI is now fully supported, and it is the default parallel environment (mpirun
is Open MPI's by default).
The SGE parallel environment is openmpi.
The allocation rule is $fill_up, which means that the preferred allocation is on the same
machine.
Open MPI is compiled with tight SGE integration:
mpirun will automatically start processes on the machines reserved by SGE
qdel will automatically clean up all MPI stubs
In the parallel task, do not forget (preferably directly in the script) to use the parameter
-R y; this turns on the reservation of slots, i.e. your job won't be jumped by jobs
requesting fewer slots.
If a parallel task is launched using qlogin, there is no variable containing
information on which slots were reserved. A useful tool is then qstat -u `whoami` -g t |
grep QLOGIN, which shows which parallel jobs are running.
Listing follows:
#!/bin/bash
# ---------------------------
# our name
#$ -N MPI_Job
#
# use reservation to stop starvation
#$ -R y
#
# pe request
#$ -pe openmpi 2-4
#
# ---------------------------
#
# $NSLOTS
# the number of tasks to be used
echo "Got $NSLOTS slots."
mpirun -n $NSLOTS /full/path/to/your/executable
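The script is then submitted as usual with qsub; SGE picks a slot count in the requested 2-4 range and exports it in $NSLOTS, which is passed on to mpirun.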
Added by John Pormann, Jul 16, 2008
The SGE batch scheduling system allows for arbitrary "consumable resources" to be created that
users can then make requests against. In general, this is used to limit access to a pool of software
licenses or make sure that memory usage is planned for properly. E.g. when a user wants to use a
special software package, they request 1 license from SGE and it will decrement its internal counter
for that license pool. If no more resources are available (i.e. the internal counter is at 0), then
the job will be delayed until a currently-used resource is freed up.
We can also create arbitrary consumable resources to help users self-limit their usage of the
DSCR. We can set up a resource, or counter, that will be decremented every time you submit a job.
This way, you can submit 1000's of jobs to SGE, but you won't be swamping the machines or otherwise
impeding other users.
If a user is given their own job-control resource, say 'cpus_user001', they should then submit
jobs with an extra resource request using the '-l' option:
% qsub -l cpus_user001=1 myjob.q
Before running the job, SGE will make sure that there are sufficient resources. Thus, if there are
100 resources set aside for 'cpus_user001', then the 101st simultaneous job-request will have to
wait for one of the previous jobs to complete, even if there are empty machines in the cluster.
Alternately, you can embed this within your SGE submission script by placing the corresponding #$
directive at the top of the file ("myjob.q" in the above example).
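A sketch of what that looks like (only the directive line matters here; the rest of the script is whatever the job normally does):
#!/bin/bash
#$ -l cpus_user001=1
# ... the rest of the job script follows as usual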
I am using a tool called starcluster http://star.mit.edu/cluster
to boot up an SGE-configured cluster in the Amazon cloud. The problem is that it doesn't seem to
be configured with any pre-set consumable resources, except for SLOTS, which I don't seem to be
able to request directly with qsub -l slots=X. Each time I boot up a cluster, I may
ask for a different type of EC2 node, so the fact that this slot resource is preconfigured is really
nice. I can request a certain number of slots using a pre-configured parallel environment, but the
problem is that it was set up for MPI, so requesting slots using that parallel environment sometimes
grants the job slots spread out across several compute nodes.
Is there a way to either 1) make a parallel environment that takes advantage of the existing
pre-configured HOST=X slots settings that starcluster sets up, so that you can request slots on
a single node, or 2) use some kind of resource that SGE is automatically aware of? Running
qhost makes me think that even though NCPU and MEMTOT
are not defined anywhere I can see, SGE is somehow aware of those resources. Are there settings
where I can make those resources requestable without explicitly defining how much of each is available?
The qconf -sc output is:
#name shortcut type relop requestable consumable default urgency
#----------------------------------------------------------------------------------------
arch a RESTRING == YES NO NONE 0
calendar c RESTRING == YES NO NONE 0
cpu cpu DOUBLE >= YES NO 0 0
display_win_gui dwg BOOL == YES NO 0 0
h_core h_core MEMORY <= YES NO 0 0
h_cpu h_cpu TIME <= YES NO 0:0:0 0
h_data h_data MEMORY <= YES NO 0 0
h_fsize h_fsize MEMORY <= YES NO 0 0
h_rss h_rss MEMORY <= YES NO 0 0
h_rt h_rt TIME <= YES NO 0:0:0 0
h_stack h_stack MEMORY <= YES NO 0 0
h_vmem h_vmem MEMORY <= YES NO 0 0
hostname h HOST == YES NO NONE 0
load_avg la DOUBLE >= NO NO 0 0
load_long ll DOUBLE >= NO NO 0 0
load_medium lm DOUBLE >= NO NO 0 0
load_short ls DOUBLE >= NO NO 0 0
m_core core INT <= YES NO 0 0
m_socket socket INT <= YES NO 0 0
m_topology topo RESTRING == YES NO NONE 0
m_topology_inuse utopo RESTRING == YES NO NONE 0
mem_free mf MEMORY <= YES NO 0 0
mem_total mt MEMORY <= YES NO 0 0
mem_used mu MEMORY >= YES NO 0 0
min_cpu_interval mci TIME <= NO NO 0:0:0 0
np_load_avg nla DOUBLE >= NO NO 0 0
np_load_long nll DOUBLE >= NO NO 0 0
np_load_medium nlm DOUBLE >= NO NO 0 0
np_load_short nls DOUBLE >= NO NO 0 0
num_proc p INT == YES NO 0 0
qname q RESTRING == YES NO NONE 0
rerun re BOOL == NO NO 0 0
s_core s_core MEMORY <= YES NO 0 0
s_cpu s_cpu TIME <= YES NO 0:0:0 0
s_data s_data MEMORY <= YES NO 0 0
s_fsize s_fsize MEMORY <= YES NO 0 0
s_rss s_rss MEMORY <= YES NO 0 0
s_rt s_rt TIME <= YES NO 0:0:0 0
s_stack s_stack MEMORY <= YES NO 0 0
s_vmem s_vmem MEMORY <= YES NO 0 0
seq_no seq INT == NO NO 0 0
slots s INT <= YES YES 1 1000
swap_free sf MEMORY <= YES NO 0 0
swap_rate sr MEMORY >= YES NO 0 0
swap_rsvd srsv MEMORY >= YES NO 0 0
The solution I found is to make a new parallel environment that has the $pe_slots
allocation rule (see man sge_pe). I set the number of slots available to that parallel
environment equal to the maximum, since $pe_slots limits the slot usage to a single node.
Since starcluster sets up the slots at cluster boot-up time, this seems to do the trick nicely. You
also need to add the new parallel environment to the queue. So, to make this dead simple:
qconf -ap by_node
and edit the pe configuration in the editor that comes up.
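A plausible sketch of such a pe configuration for this purpose (the slot count is illustrative; the key setting is allocation_rule, see man sge_pe):
pe_name            by_node
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE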
Also modify the queue (called all.q by starcluster) to add this new parallel environment
to the list.
qconf -mq all.q
and change this line:
pe_list make orte
to this:
pe_list make orte by_node
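With that in place, a job that needs all of its slots on one node can be submitted along these lines (the slot count and script name are only examples):
% qsub -pe by_node 2 myjob.sh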
I was concerned that jobs spawned from a given job would be limited to a single node, but this
doesn't seem to be the case. I have a cluster with two nodes with two slots each.
After a little while, qstat shows both test jobs running on different nodes:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
25 0.55500 test root r 10/17/2012 21:42:57 all.q@master 2
26 0.55500 sleep root r 10/17/2012 21:43:12 all.q@node001 2
I also tested submitting 3 jobs at once requesting the same number of slots on a single node,
and only two run at a time, one per node. So this seems to be properly set up!
We have a cluster of machines, each with 4 GPUs. Each job should be able to ask for 1-4 GPUs. Here's
the catch: I would like SGE to tell each job which GPU(s) it should take. Unlike the
CPU, a GPU works best if only one process accesses it at a time. So I would like:
Job #1: GPUs 0, 1, 3
Job #2: GPU 2
Job #4: wait until 1-4 GPUs are available
The problem I've run into is that SGE will let me create a GPU resource with 4 units on
each node, but it won't explicitly tell a job which GPU to use (only that it gets 1, or 3, or whatever).
I thought of creating 4 resources (gpu0, gpu1, gpu2, gpu3), but I am not sure if the
-l flag will take a glob pattern, and I can't figure out how SGE would tell the job
which gpu resources it received. Any ideas?
– Daniel Blezek
When you have multiple GPUs and you want your jobs to request a GPU, but the Grid Engine scheduler
should handle and select the free GPUs, you can configure an RSMAP (resource map) complex (instead
of an INT). This allows you to specify the amount as well as the names of the GPUs on a specific
host in the host configuration. You can also set it up as a HOST consumable, so that independent
of the number of slots you request, the amount of GPU devices requested with -l gpu=2 is 2 for each host
(even if the parallel job got, say, 8 slots on different hosts).
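The exact configuration is Univa-specific; a sketch of what it might look like (the complex and device names are illustrative):
# qconf -mc entry: an RSMAP complex consumed per HOST
#   gpu   gpu   RSMAP   <=   YES   HOST   0   0
# qconf -me <hostname>: two named GPU devices on this host
#   complex_values   gpu=2(GPU1 GPU2)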
Then, when requesting -l gpu=1, the Univa Grid Engine scheduler will select GPU2 if GPU1 is already
used by a different job. You can see the actual selection in the qstat -j output. The job learns the
selected GPU by reading the $SGE_HGR_gpu environment variable, which in this case contains the
chosen id/name "GPU2". This can be used to access the right GPU without collisions.
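On the job side, a sketch of how the granted device could be picked up (assuming the devices were named after their CUDA indices, e.g. 0 1 2 3, rather than "GPU1 GPU2"):
#!/bin/bash
#$ -l gpu=1
# Univa Grid Engine exports the granted resource id(s) in $SGE_HGR_gpu
echo "Granted GPU(s): $SGE_HGR_gpu"
export CUDA_VISIBLE_DEVICES=$SGE_HGR_gpu
./my_cuda_program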
If you have a multi-socket host, you can even attach a GPU directly to the CPU cores near the
GPU (near the PCIe bus) in order to speed up communication between GPU and CPUs. This is possible
by attaching a topology mask in the execution host configuration.
Now when the UGE scheduler selects GPU2, it automatically binds the job to all 4 cores (C) of
the second socket (S), so that the job is not allowed to run on the first socket. This does not even
require the -binding qsub parameter.
Note that all these features are only available in Univa Grid Engine (8.1.0/8.1.3 and higher),
and not in SGE 6.2u5 and other Grid Engine versions (like OGE, Son of Grid Engine, etc.). You can
try it out by downloading the 48-core limited free version from univa.com.