Most SGE commands are structured using a common template of options: a capitalized option argument means 'read it in from a file', while the lowercase form means 'do it interactively' (in an editor). The command-line user interface is a set of ancillary programs (commands) for administering and using the cluster. A few convenient aliases:
alias qcall='qconf -mq all.q'
alias qerrors='qstat -f -explain E'
alias qsummary='qstat -g c'
alias qclear='qmod -c "*"'
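The file/interactive convention can be seen with qconf, for example (a sketch; the file name is illustrative):

# qconf -aq              (lowercase: compose a new queue interactively in an editor)
# qconf -Aq test.q.conf  (capitalized: add the same queue by reading the file test.q.conf)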
The default scheduler can keep track of why jobs could not be scheduled during the last scheduler run; the schedd_job_info parameter enables or disables that monitoring. The value true enables it, false turns it off. It is also possible to activate the observation only for certain jobs: set the parameter to job_list followed by a comma-separated list of job ids.
If schedd_job_info=true, the user can obtain the collected information with the command qstat -j <job number>. This makes schedd_job_info one of the most important parameters in this file, since it determines whether qstat -j provides scheduling information about jobs. As Chris Dagdigian put it:
In this case the change is that with 6.2 the parameter "schedd_job_info" now defaults to FALSE where in the past it was TRUE.
I *completely* understand why the change happened since the 6.2 design goal was for massive scalability and schedd_job_info can put a massive load on the SGE system particularly in massive clusters like Ranger where 6.2 was tested out.
But ... are most 6.2 deployments going on to systems where the exechost count or job throughput rates means that setting schedd_job_info=FALSE has a measurable performance gain, significant enough to offset the massive loss of end-user-accessible troubleshooting information? I suspect ... not.
The schedd_job_info output appended in the output of "qstat -j" is the single most effective troubleshooting and "why does my job not get dispatched" resource that is available to non SGE administrators. Taking this tool away from users (in my opinion) has a bigger negative impact than any performance gains realized (at least for the types of systems I work on most often).
So -- just like I recommend and tell people to use classic spooling on smaller systems I also plan on telling people to re-enable schedd_job_info feature on their 6.2 systems (if their system and workflow allows).
I'm bringing this up on the list for two reasons:
- Just to see what others think
An example scheduler configuration (qconf -ssconf):

algorithm                          default
schedule_interval                  0:0:15
maxujobs                           0
queue_sort_method                  load
job_load_adjustments               np_load_avg=0.50
load_adjustment_decay_time         0:7:30
load_formula                       np_load_avg
schedd_job_info                    true
flush_submit_sec                   0
flush_finish_sec                   0
params                             none
reprioritize_interval              0:0:0
halftime                           168
usage_weight_list                  cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor                5.000000
weight_user                        0.250000
weight_project                     0.250000
weight_department                  0.250000
weight_job                         0.250000
weight_tickets_functional          0
weight_tickets_share               0
share_override_tickets             TRUE
share_functional_shares            TRUE
max_functional_jobs_to_schedule    200
report_pjob_tickets                TRUE
max_pending_tasks_per_job          50
halflife_decay_list                none
policy_hierarchy                   OFS
weight_ticket                      0.010000
weight_waiting_time                0.000000
weight_deadline                    3600000.000000
weight_urgency                     0.100000
weight_priority                    1.000000
max_reservation                    0
default_duration                   INFINITY
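Since 6.2 ships with schedd_job_info set to false, re-enabling it is a one-line change in the scheduler configuration (a sketch; the job id is illustrative):

# qconf -msconf          (set: schedd_job_info true)
$ qstat -j 4242          (the "scheduling info:" section now explains why the job is pending)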
Typically a cluster is fully loaded with jobs almost all the time, which makes it difficult to submit small (1-2 minute) test jobs that run on more than one node.
1. Same queue structure as before (see bioteam.net)
2. Attach "slots=2" as a host resource on all nodes
3. Submit test jobs to all queues
The wizard solution:
qconf -aattr exechost complex_values slots=2 <host>
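To apply the same resource to every execution host, a small loop over the host list is handy (a sketch; assumes every node should get slots=2):

for host in $(qconf -sel); do
    qconf -aattr exechost complex_values slots=2 $host
done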
What did we do?
Slot limits "solve" the oversubscription problem
Still have these problems:
FIFO job execution
Priority is handled by OS after SGE scheduling
We can still do better (stay tuned)…
Here are some of the Grid Engine configuration steps we should take on a new install. I recommend doing all of these from the very beginning, to prevent later changes that might confuse users or break their workflow.
There is one thing we must always do with a new compute cluster: enable hard memory limits. Users are usually not too keen on any kind of limit, because jobs will eventually run into them. Once they realize that limits ensure node stability and uptime, though, users will demand them. Without limits, one bad job can crash a node and bring down many other jobs.
To enable hard memory limits, we modify the complex configuration to make h_vmem requestable.
# qconf -mc
h_vmem  h_vmem  MEMORY  <=  YES  YES  1g  0

Once this complex is set, it is a good idea to define a default option for qsub in the $SGE_ROOT/default/common/sge_request file. When enabling h_vmem, we should also set a default value for h_stack. h_vmem sets a limit on virtual memory, while h_stack sets a limit on stack space for binary execution. Without a sufficient value for h_stack, programs like Python, Matlab or IDL will fail to start. Here, we are also binding each job to a single core.

-binding linear:1 -q all.q -l h_vmem=1g -l h_stack=128m

If we want to manually set values for each individual node, like slots and memory, a for-loop is very helpful.

# qconf -rattr exechost complex_values slots=8,num_proc=8,h_vmem=8g node01
# for ((I=1; I <= 16 ; I++)); do
>   NODE=`printf "node%02d\n" $I`
>   MEM=`ssh $NODE 'free -b |grep Mem |cut -d" " -f 5'`
>   SWAP=`ssh $NODE 'free -b |grep Swap |cut -d" " -f 4'`
>   VMEM=`echo $MEM+$SWAP|bc`
>   qconf -rattr exechost complex_values slots=8,num_proc=8,h_vmem=$VMEM $NODE
> done

To submit a job with a 4 GB limit, use the -l command-line option:

$ qsub -l h_vmem=4g -l h_stack=256m myjob.sh

To see available memory, use qstat:

$ qstat -F h_vmem

It is also a good idea to place limits on the amount of memory any single process on the login node may allocate, in the /etc/security/limits.conf file. This example will limit any user in the clusterusers group to 4 GB per process. Anything larger should be run via qlogin. When adding new users, make sure to add them to this now-default group.

# limit any process to 4GB = 1024*1024*4KB = 4194304
@clusterusers hard rss 4194304
@clusterusers hard as 4194304

There should also be a limit on how many jobs a single user can queue at once. If a user must submit over 2000 jobs simultaneously, they may want to consider a more manageable workflow utilizing array jobs.

# qconf -mconf
max_u_jobs 2000

To limit the number of jobs a single user can have in the running state simultaneously:

# qconf -msconf
max_reservation 128
maxujobs 128

If the queue will be accepting multi-slot parallel jobs, slot reservation should be enabled to prevent starvation. Otherwise, single-slot jobs will constantly fill in space ahead of the big job. This can be done by submitting multi-slot jobs with the "-R y" option.

To enable a simple fairshare policy between all users, there are only three options to check:
# qconf -mconf
enforce_user auto
auto_user_fshare 100
# qconf -msconf
weight_tickets_functional 10000

We should also collect job scheduler information:

# man sched_conf
# qconf -msconf
schedd_job_info true

Now we can see why a job is or is not scheduled:

$ qstat -j 427997
$ qacct -j 427997

If we plan to allow graphical GUI programs in the queue, we must set up a qlogin wrapper script with proper X11 forwarding.

# vim /usr/global/sge/qlogin_wrapper
# chmod +x /usr/global/sge/qlogin_wrapper

qlogin_wrapper:

#!/bin/sh
HOST=$1
PORT=$2
shift
shift
echo /usr/bin/ssh -Y -p $PORT $HOST
/usr/bin/ssh -Y -p $PORT $HOST

Set the qlogin wrapper and ssh shell:

# qconf -mconf
qlogin_command /usr/global/sge/qlogin_wrapper
qlogin_daemon /usr/sbin/sshd -i

If we have a floating license server with a limited number of seats, we will want to configure a consumable complex resource. When a user submits a job, the qsub option '-l idl=1' must be used. In this example, the number of jobs that specify idl will be limited to 15 at any one time.

# qconf -mc
matlab  ml   INT  <=  YES  YES  0  0
idl     idl  INT  <=  YES  YES  0  0
# qconf -me global
complex_values matlab=10,idl=15

If we want to have multiple queues across the same hosts, we can define a resource quota so that nodes do not become oversubscribed.

# qconf -arqs
{
   name         limit_slots_to_cores_rqs
   description  Prevents core oversubscription across queues.
   enabled      TRUE
   limit        hosts {*} to slots=$num_proc
}
Dave Love
2013-08-30

Table of Contents

Script Execution
    Unix behaviour
    Modules environment
Parallel Environments
    Heterogeneous/Isolated Node Groups
JSVs
    Wildcarded PEs
    Checking for Windows-style line endings in job scripts
Scheduling Policies
    Host Group Ordering
    Fill Up Hosts
    Avoiding Starvation (Resource Reservation/Backfilling)
    Fair Share
Resource Management
    Slot Limits
    Memory Limits
    Licence Tokens
    Killing Detached Processes
    Core Binding
Administration
    Maintenance Periods
    Rebooting execution hosts
    Broken/For-testing Hosts
This is a somewhat-random collection of commonly-required configuration recipes. It is written mainly from the point of view of high performance computing clusters, and some of the configuration suggestions may not be relevant in other circumstances or for old versions of gridengine. See also the other howto documents. Suggestions for additions/corrections are welcome (to d.love @ liverpool.ac.uk).
Script Execution
Unix behaviour
Set shell_start_mode to unix_behavior in your queue configurations to get the normally-expected behaviour of starting job scripts with the shell specified in the initial #! line.
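For an already-defined cluster queue this can be set non-interactively (a sketch, assuming the usual all.q):

# qconf -mattr queue shell_start_mode unix_behavior all.q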
Modules environment
A side-effect of unix_behaviour is usually not getting the normal login environment, specifically not having the module command available at those sites that use environment modules. At least for use with the bash shell, add the following to the site sge_request file to avoid users having to source modules.sh etc. in job scripts, assuming the sessions from which jobs are submitted have modules available:
-v module -v MODULESHOME -v MODULEPATH

This may not work with other shells.
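With those variables passed through, a bash job script can then call module directly (a sketch; the module and program names are hypothetical):

#!/bin/bash
#$ -cwd
# 'module' is available here because the submitting session exported it via sge_request
module load gcc    # hypothetical module name
./my_program       # hypothetical binary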
Parallel Environments
Heterogeneous/Isolated Node Groups
Suppose you have various sets of compute nodes over which you want to run parallel jobs, but each job must be confined to a specific set. Possible reasons are that you have
- significantly heterogeneous nodes (different CPU speeds, architectures, core numbers, etc.),
- groups of nodes which have restricted access, e.g. dedicated to a user group, controlled by an ACL,
- or islands of connectivity on your MPI fabric(s) which are either actually isolated or have slow communication boundaries over switch layers.
Then you'll want to define multiple parallel environments and host groups. There will typically be one PE and one host group (with possibly an ACL) for each node type or island. The PEs will all be the same, unless you want a different fixed allocation_rule for each, but with different names. The names need to be chosen so that you can conveniently use wildcard specifications for them. Normally the names will all have the same name base, e.g. mpi-…. As an example, for different architectures, with different numbers of cores which get exclusive use for any tightly integrated MPI:
$ qconf -sp mpi-8
pe_name            mpi-8
slots              99999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

$ qconf -sp mpi-12
pe_name            mpi-12
slots              99999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    12
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

with corresponding host groups @quadcore and @hexcore for each type of dual-processor box. Those PEs are assigned one-to-one to host groups, ensuring that jobs can't run across the groups (since parallel jobs are always granted a unique PE by the scheduler, whereas they can be split across queues).
$ qconf -sq parallel
...
seq_no     2,[@quadcore=3],[@hexcore-eth=4],...
...
pe_list    NONE,[@quadcore=make mpi-8 smp],[@hexcore=make mpi-12 smp],...
...
slots      0,[@quadcore=8],[@hexcore=12],...
...

Now the PE naming comes in useful, since you can submit to a wildcarded PE, -pe mpi-*, if you're not fussy about the PE you actually get. See Wildcarded PEs for the next step.
Suppose you want to retain the possibility of running across all the PEs (assuming they're not isolated). Then you can define an extra PE, say allmpi, which isn't matched by the wildcard.
Note SGE 8.1.1+ allows general PE wildcards (actually patterns), as documented, fixing the bug which meant that only * was available in older versions. The correct treatment might be useful with such configurations, e.g. selecting mpi-[1-4].

JSVs
See jsv(1) and jsv_script_interface(3) for documentation on job submission verifiers, as well as the examples in $SGE_ROOT/util/resources/jsv/.
See also the Resource Reservation section.
Wildcarded PEs
When you use a wildcarded PE as above, for convenience and abstraction you can use a JSV to write the wildcard pattern. This JSV fragment from jsv_on_verify in Bourne shell re-writes -pe mpi to -pe mpi-*:
if [ $(jsv_is_param pe_name) = true ]; then
    pe=$(jsv_get_param pe_name)
    ...
    case $pe in
        mpi)
            jsv_set_param pe_name "$pe-*"
            pe="$pe-*"
            modified=1
            ...

Suppose you want to retain the possibility of running across all the PEs (assuming the groups aren't isolated). Then you can define an extra PE, say allmpi, which isn't re-written by the JSV.
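A complete client-side JSV wrapping that fragment might look like the following (a minimal sketch following the examples in $SGE_ROOT/util/resources/jsv/):

#!/bin/sh
# Minimal JSV: rewrite -pe mpi into the wildcard -pe mpi-*
. $SGE_ROOT/util/resources/jsv/jsv_include.sh

jsv_on_start () {
    return    # nothing needed before verification starts
}

jsv_on_verify () {
    modified=0
    if [ "$(jsv_is_param pe_name)" = true ]; then
        pe=$(jsv_get_param pe_name)
        case $pe in
            mpi)
                jsv_set_param pe_name "$pe-*"
                modified=1
                ;;
        esac
    fi
    if [ $modified -eq 1 ]; then
        jsv_correct "PE request rewritten to wildcard"
    else
        jsv_accept "Job is accepted"
    fi
}

jsv_main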
Checking for Windows-style line endings in job scripts
Users sometimes transfer job scripts from MS Windows systems in binary mode, so that they end up with CRLF line endings, which typically fail, often with a rather obscure failure to execute if shell_start_mode is set to unix_behavior. The following fragment from jsv_on_verify in a shell script JSV prevents submitting such job scripts
# Avoid having to use literal ^M
ctrlM=$(printf "\15")
...
jsv_on_verify () {
    ...
    cmd=$(jsv_get_param CMDNAME)
    case $(jsv_get_param b) in
        y|yes) binary=y;;
    esac
    [ "$cmd" != NONE -a "$cmd" != STDIN -a "$binary" != y ] &&
        [ -f "$cmd" ] &&
        grep -q "$ctrlM\$" "$cmd" &&
        # Can't use multi-line messages, unfortunately.
        jsv_reject "\
Script has Windows-style line endings; transfer in text mode or use dos2unix"
    ...

Scheduling Policies
See sched_conf(5) for detailed information on the scheduling configuration.
Host Group Ordering
To change scheduling so that hosts in different host groups are preferentially used in some defined order, set queue_sort_method to seqno:
$ qconf -ssconf
...
queue_sort_method    seqno
...

and define the ordering in the relevant queue(s) as required with seq_no:
$ qconf -sq ...
...
seq_no    10,[@group1=4],[@group2=3],...
...

It is possible to use seqno, for instance, to schedule serial jobs preferentially to one 'end' of the hosts and parallel jobs to the other 'end'.
Fill Up Hosts
To schedule preferentially to hosts which are already running a job, as opposed to the default of roughly round robin according to the load level, change the load_formula:
$ qconf -ssconf
...
queue_sort_method    load
load_formula         slots
...

assuming the slots consumable is defined on each node.
Reasons for compacting jobs onto as few nodes as possible include avoiding fragmentation (so that parallel jobs which require complete nodes have a better chance of being fitted in), or being able to power down complete unused nodes.
Note Scheduling is done according to load values reported at scheduling time, without lookahead, so that it only takes effect over time. Since the load formula is used to determine scheduling when hosts are equal according to queue_sort_method, you can schedule to the preferred host groups by seqno as above, and still compact jobs onto the nodes using slots in the load formula, as above, i.e. with this configuration:
$ qconf -ssconf
...
queue_sort_method    seqno
load_formula         slots
...

Avoiding Starvation (Resource Reservation/Backfilling)
To avoid "starvation" of larger, higher-priority jobs by smaller, lower-priority ones (i.e. the smaller ones always run in front of the larger ones) enable resource reservation by setting max_reservation to a reasonable value (maybe around 100), and arrange that relevant jobs are submitted with -R y, e.g. using a JSV. Here is a JSV fragment suitable for client side use, to add reservation to jobs over a certain size, assuming that PE slots is the only relevant resource:
if [ $(jsv_is_param pe_name) = true ]; then
    pe=$(jsv_get_param pe_name)
    pemin=$(jsv_get_param pe_min)
    ...
    # Check for an appropriate pe_min with no existing reservation.
    if [ $(jsv_is_param R) = false ]; then
        if [ $pemin -ge $pe_min_reserve ]; then
            jsv_set_param R y
            modified=1
        fi
    fi
Note For "backfilling" (shorter jobs can fill the gaps before reservations) to work properly with jobs which do not specify an h_rt value at submission, the scheduler default_duration must be set to a value other then the default infinity, e.g. to the longest runtime you allow. To monitor reservations, set MONITOR=1 in sched_conf(5) params and use qsched(1) after running process_scheduler_log; qsched -a summarizes all the current reservations.
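Putting those scheduler settings together, the relevant sched_conf(5) entries might look like this (a sketch; the runtime value is illustrative):

$ qconf -ssconf
...
max_reservation     100
default_duration    72:00:00
params              MONITOR=1
...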
Fair Share
It is often required to provide a fair share of resources in some sense, whereby heavier users get reduced priority. There are two SGE policies for this. The share tree policy assigns priorities based on historical usage with a specified lifetime, and the functional policy only takes current usage into account, i.e. is similar to the share tree with a very short decay time. (It isn't actually possible to use the share tree with a lifetime less than one hour.) Use one or the other, but not both, to avoid confusion.
With both methods, ensure that the default scheduler parameters are changed so that weight_ticket is a lot larger than weight_urgency and weight_priority or set the latter two to zero if you don't need them. Otherwise it is possible to defeat the fair share by submitting with a high priority (-p) or with resources with a high urgency attached. See sge_priority(5) for details.
You may also want to set ACCT_RESERVED_USAGE in execd_params to use effectively 'wall clock' time in the accounting that determines the shares.
Functional
For simple use of the functional policy, add
weight_tickets_functional 10000

to the default scheduling parameters (qconf -msconf) and define a non-zero fshare for each user (qconf -muser). If you use enforce_user auto in the configuration,
auto_user_fshare 1000

could be used to set up automatically-created users (new ones only).
Warning enforce_user auto implies not using CSP security, which typically is not wise.

Share Tree
See share_tree(5).
To make a simple tree, use qconf -Astree with a file with contents similar to:
id=0
name=Root
type=0
shares=1
childnodes=1
id=1
name=default
type=0
shares=1000
childnodes=NONE

and give the share tree policy a high weight (qconf -msconf):
weight_tickets_share 10000

If you have auto-creation of users (see the warning above), you probably want to ensure that they are preserved with:
auto_user_delete_time 0

The share tree usage decays with a half-life of 7 days by default; modify halftime (specified in hours) to change it.
Resource Management
Slot Limits
You normally want to prevent over-subscription of cores on execution hosts by limiting the slots allocated on a host to its core (or actually processor) count - where "processors" might mean hardware threads. There are multiple ways of doing so, according to taste, administrative convenience, and efficiency.
If you only have a single queue, you can get away with specifying the slot counts in the queue definition (qconf -mq), e.g. by host group:

slots 0,[@hexcore=12],[@quadcore=8]...

but with multiple queues on the same hosts, you may need to avoid over-subscription due to contributions from each queue.
An easy way for an inhomogeneous cluster is with the following RQS (with qconf -arqs), although it may lead to slow scheduling in a large cluster:
{
   name         host-slots
   description  restrict slots to core count
   enabled      true
   limit        hosts {*} to slots=$num_proc
}

This would probably be the best solution if num_proc, the processor count, varies as hardware threads are turned on and off.
Alternatively, with a host group for each hardware type, you can use a set of limits like
limit hosts {@hexcore} to slots=12
limit hosts {@quadcore} to slots=8

which will avoid the possible scheduling inefficiency of the $num_proc dynamic limit.
Finally, and possibly the most foolproof way in normal situations, set the complex on each host, e.g.:

$ for n in 8 16; do qconf -mattr exechost complex_values slots=$n \
    `qconf -sobjl exechost load_values "*num_proc=$n*"`; done

Memory Limits
Normally it is advisable to prevent jobs swapping. To do so, make the h_vmem complex consumable, and give it a default value that is (probably slightly less than) the lowest memory/core that you have on execution hosts, e.g.:
$ qconf -sc | grep h_vmem
h_vmem  h_vmem  MEMORY  <=  YES  YES  2000m  0

(See complex(5) and the definition of memory_specifier.)
Also set h_vmem to an appropriate value on each execution host, leaving some head-room for system processes, e.g. (with bash-style expansion):
$ qconf -mattr exechost complex_values h_vmem=31.3G node{1..32}

Then single-process jobs can't over-subscribe memory on the hosts (at least not on their own), and multi-process ones can't over-subscribe long term (see below).
Jobs which need more than the default (2000m per slot above) need to request it at job submission with -l h_vmem=…, and may end up under-subscribing hosts' slots to get enough memory in total.
Each process is limited by the system to the requested memory (see setrlimit(2)), and attempts to allocate more will fail. If it is a stack allocation, the program will typically die; if it is an attempt to malloc(3) too much, well-written programs should report an allocation failure. Also, the qmaster tracks the total memory accounted to the job, and will kill it if allocated memory exceeds the total requested.
These mechanisms are not ideal in the case of MPI-style jobs, in particular. The rlimit applied is the h_vmem request multiplied by the slot count for the job on the host, but it applies to each process in the job separately; the limit does not apply to the process tree as a whole. This means that MPI processes, for instance, can over-subscribe in the PDC_INTERVAL before the execd notices, and out-of-memory system errors may still occur. Future use of memory control groups will help address this on Linux.
Note Killing by qmaster due to the memory limit may occur spuriously, at least under Linux, if the execd over-accounts memory usage. Older SGE versions, and possibly newer ones on old Linux versions, use the value of VmSize that Linux reports (see proc(5)); that includes cached data, and takes no account of sharing. The current SGE uses a more accurate value if possible (see execd_params USE_SMAPS). Also, if a job maps large files into memory (see mmap(2)), that may cause it to fail due to the rlimit, since that counts the memory mapped data, at least under Linux. A future version of SGE is expected to provide control over using the rlimit.
Note Suspended jobs contribute to the h_vmem consumed on the host, which may need to be taken into account if you allow jobs to preempt others by suspension.
Note Setting h_vmem can cause trouble with programs using pthreads(7), typically appearing as a segmentation violation. This is apparently because the pthreads runtime (at least on GNU/Linux) defines a per-thread stack from the h_vmem limit. The solution is to specify a reasonable value for h_stack in addition; typically a value from a few tens up to 100 or so MB works, but it may depend on the program.
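So a typical submission on such a system requests both limits (a sketch; the values are illustrative):

$ qsub -l h_vmem=4G -l h_stack=128M myjob.sh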
Note There is also an issue with recent OpenJDK Java. It allegedly tries to allocate 1/4 of physical memory for the heap initially by default, which will fail with a typical h_vmem on recent systems. The (only?) solution is to use java -XmxN explicitly, with N derived from h_vmem.

Licence Tokens
For managing Flexlm licence tokens, see Olesen's method. This could be adapted to similar systems, assuming they can be interrogated suitably. There's also the licence juggler for multiple locations.
Killing Detached Processes
If any of a job's processes detach themselves from the process tree under the shepherd, they are not killed directly when the job terminates. Use ENABLE_ADDGRP_KILL to turn on finding and killing them at job termination. It will probably be on by default in a future version.
Core Binding
Binding processes to cores (or 'CPU affinity') is normally important for performance on 'modern' systems (in the mainstream at least since the SGI Origin). Assuming cores are not over-subscribed, a good default (since SGE 8.0.0c) is to set a default in sge_request(5) of
-binding linear:slots

The allocated binding is accessible via SGE_BINDING in the job's environment, which can be assigned directly to GOMP_CPU_AFFINITY for the benefit of the GNU OpenMP implementation, for instance. If you happen to use OpenMPI, good defaults matching the SGE -binding are (at least for OpenMPI 1.6):

rmaps_base_schedule_policy = core
orte_process_binding = core
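For instance, a job script for a threaded program might hand the allocated cores to the GNU OpenMP runtime like this (a sketch; the binary name is hypothetical, and NSLOTS is the slot count SGE sets in the job's environment):

#!/bin/bash
# Pass the core list SGE allocated (via -binding) to GNU OpenMP
export GOMP_CPU_AFFINITY="$SGE_BINDING"
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_program    # hypothetical binary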
Administration

Maintenance Periods
Rejecting Jobs
In case you want to drain the system, adding $SGE_ROOT/util/resources/jsv/jsv_reject_all.sh as a server JSV will reject all jobs at submission with a suitable message.
Down Time
If you want jobs to stay queued, there are two approaches to avoid starting ones that might run into a maintenance period, assuming you enforce a runtime limit and the maintenance won't start any sooner than that period: a calendar and an advance reservation.
Calendar
You can define a calendar for the shutdown period and attach it to all your queues, e.g.
# qconf -scal shutdown
calendar_name    shutdown
year             6.9.2013-9.9.2013=off
week             NONE
# qconf -mattr queue calendar shutdown serial parallel
root@head modified "serial" in cluster queue list
root@head modified "parallel" in cluster queue list
Note To get the scheduler to look ahead to the calendar, you need to enable resource reservation (issue #493); that reservation may interact poorly with calendars (issue #722), though it's not clear whether this is still a problem.

Advance reservation
Define a fake PE with allocation_rule 1 and access only by the admin ACL, say, and attach it to all your hosts, possibly via a new queue if you already have a complex pe_list setup:
$ qconf -sp all
slots              99999
user_lists         admin
...
allocation_rule    1
...
$ qconf -sq shutdown
qname       shutdown
hostlist    @allhosts
...
pe_list     all
...

Now you can make an advance reservation (assuming max_advance_reservations allows it, and you're in arusers as well as admin):
$ qrsub -l exclusive -pe all $(qselect -pe all|wc -l) -a 201309061200 -d 201309091200

Rebooting execution hosts
To reboot execution hosts, you need to ensure they're empty and avoid races with job submission. Thus, submit a job which requires exclusive access to the host and then does a reboot. Since you want to avoid root being able to run jobs for security reasons, use sudo(1) with appropriate settings to allow password-less execution of the commands by the appropriate users. You will want to comment out Defaults requiretty from /etc/sudoers, add !requiretty to the relevant policy line, or use -pty y on the job submission. It is cleanest to shut down the execd before the reboot.
The job submission parameters will depend on what is allowed to run on the hosts in question, but assuming you can run SMP jobs on all hosts (some might not be allowed serial jobs), a suitable job might be
qsub -pe smp 1 -R y -N boot-$node -l h=$node,exclusive -p 1024 -l h_rt=60 -j y <<EOF
/usr/bin/sudo /sbin/service sgeexecd.ulgbc5 softstop
/usr/bin/sudo /sbin/reboot
EOF

where $node is the host in question, and we try to ensure the job runs early by using a resource reservation and a high priority.
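A driver for a whole set of hosts might wrap that in a loop (a sketch; assumes the snippet above is saved as a hypothetical reboot-node.sh taking the host name as its argument):

for node in $(qconf -sel); do
    ./reboot-node.sh "$node"
done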
Broken/For-testing Hosts
Administrative Control
A useful tactic for dealing with hosts which are broken, or possibly required for testing and not available to normal users is to make a host group for them, say @testing (qconf -ahgrp testing), and restrict access to it only to admin users with an RQS rule like
limit users {!@admin} hosts {@testing} to slots=0

It can also be useful to have a host-level string-valued complex (say comment or broken) with information on the breakage, say with a URL pointing to your monitoring/ticketing system. A utility script can look after adding to the host group, setting the complex and, for instance, assigning downtime (in Nagios' terms) for the host in your monitoring system.
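Such a complex could be defined via qconf -mc with a line like this (a sketch; columns as described in complex(5), and the name broken is just an example):

#name   shortcut  type      relop  requestable  consumable  default  urgency
broken  broken    RESTRING  ==     YES          NO          NONE     0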
Alternatively the RQS could control access on the basis of the broken complex rather than using host group separately.
A monitoring system like Nagios (which has hooks for such actions and is allowed admin access to the SGE system) can set the status as above when it detects a problem.
Using a restricted host group or complex is more flexible than disabling the relevant queues on the host, as sometimes recommended; that stops you running test jobs on them and can cause confusion if queues are disabled for other reasons.
Using Alarm States
As an alternative to explicitly restricting access as above, one can put a host into an alarm state to stop it getting jobs. This can be done by defining an appropriate complex and a load formula involving it, along with a suitable load sensor. The sensor executes periodic tests, e.g. using existing frameworks, and sets the load value high via the complex if it detects an error. However, since it takes time for the load to be reported, jobs might still get scheduled for a while after the problem occurs.
Tests could also be run in the prolog, potentially setting the queue into an error state before trying to run the job. However, that is queue-specific, and the prolog only runs on parallel jobs' master node.
Copyright © 2012, 2013, Dave Love, University of Liverpool
Last updated 2014-02-27 15:36:43 GMT
HowTo:
- We want MPI jobs to consume all of the Infiniband on a node, so that no two MPI jobs can run on the same node, while still allowing multiple instances of the same job on one node. Solution: it's complicated, but see Daniel Templeton's blog for how to do this.
Helpful Hints
- Current Working Directory
- To ensure that your job runs in the directory from which you submit it (and to ensure that its standard output and error files land there), use the -cwd option:

  % qsub -cwd runme

- Running Now
- If you want GridEngine to run your job now or else fail, give it the -now option:

  % qsub -now y runme

- Embedding Options
- You don't have to remember all the qsub options you need for every job you run. You can embed them in your script:

  % cat runme
  #!/bin/sh
  #
  # Execute from the current working directory
  #$ -cwd
  #
  # This is a long-running job
  #$ -l inf
  #
  # Can use up to 6GB of memory
  #$ -l vf=6G
  #
  ~/project/sim

  With all the options in the script, executing it is simple:

  % qsub runme

  You can, of course, still use command-line arguments to augment or override embedded options.

- Mail Notification
- To receive email notifications about your job, use the -m option:

  % qsub -m as runme

  In the example above, you will get mail if the job aborts or is suspended. The mail options are:

  a - abort
  b - begin
  e - exit
  s - suspend

Deleting Your Jobs

Deleting your submitted jobs can be done with the qdel command:

  % qdel job-id       The specified job-id is deleted.
  % qdel -u username  All the jobs by username are deleted.

Users can only delete their own jobs.