SGE user commands

Understand SGE and its queueing system. Submit batch jobs to an SGE cluster.




Using a cluster is similar to working in any Linux environment, except that you must request the resources your job needs before it runs. The cluster is controlled by SGE (Sun Grid Engine), software that organizes the queues and resources. This sort of scheduling system is necessary when limited computational resources are shared by many users. Here I show how to use Sun Grid Engine for job submission, monitoring and troubleshooting.

Sun Grid Engine

On a cluster such as MBB, submitting a job (e.g. running your scripts or other software) works much as it does on any Unix/Linux machine, except that resources are requested up front and SGE decides when and where the job runs.

SGE handles job scheduling. You can submit all your jobs at once, and SGE will queue them and run each one when the resources you requested become available. SGE also performs load balancing: jobs are distributed so that no single node gets overloaded. In addition, SGE provides job monitoring, which is useful for checking whether your job is running and, if it failed, for understanding what went wrong.

Access and use the cluster

The SSH protocol (Secure Shell) is a method for secure remote login from one computer to another. It provides several options for strong authentication, and it protects the security and integrity of communications with strong encryption.

To access a cluster, your administrator must first create an account for you. You can then log in via SSH:

ssh <yourUserName>@<clusterAddress>

Job scheduling: command to submit a job

Here is an example of a bash script, submitBatchJob.sh, that submits a batch job via SGE:

#!/bin/bash
# Job name
#$ -N myjob_1
# Using current working directory (otherwise, you will have to use '#$ -wd /path/to/run')
#$ -cwd
# Job time limits (h_rt is required; s_rt is the soft run-time limit, h_rt the hard run-time limit)
# The soft limit should be below the hard limit so the job is warned before it is killed
#$ -l s_rt=919:55:00
#$ -l h_rt=920:00:00
# choose to run on a specific queue
#$ -q cemeb20.q
# Get a mail when the job begins, ends or is suspended
#$ -m ebs
#$ -M me@mail.com
# Redirects the standard output to the named file.
#$ -o submitjob_sge_cluster/qsub_outputs/myjob_20190903-110055.out
##$ -e submitjob_sge_cluster/qsub_outputs/myjob_20190903-110055.err
# merge standard and error outputs
#$ -j yes
# choose a parallel environment and run on 16 slots (use $PE_HOSTFILE)
# disabled here: change the leading '#' to '#$' to enable it
# -pe mpi 16
# Export all my environment variables into job runtime context
#$ -V
################### job to submit #############################################
bash myjob.sh

To submit the job with these parameters, simply use the qsub command:

qsub submitBatchJob.sh
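
When scripting around qsub, it is often handy to capture the ID of the job just submitted so that it can be passed to qstat or qdel later. The following is a minimal sketch, assuming qsub prints its usual confirmation message of the form: Your job <job-ID> ("<name>") has been submitted. The names parse_job_id and submit_and_report are hypothetical helpers, not part of SGE.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from qsub's confirmation
# message, assumed to look like:
#   Your job 2041601 ("myjob_1") has been submitted
parse_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Hypothetical wrapper: submit a script and report its job ID.
# Only meaningful on a machine where qsub is installed.
submit_and_report() {
    local msg job_id
    msg=$(qsub "$1") || return 1
    job_id=$(parse_job_id "$msg")
    echo "Submitted $1 as job $job_id"
}
```

On the cluster this could be called as submit_and_report submitBatchJob.sh. Array jobs print a differently worded message and are not handled by this sketch.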

Job monitoring: check the current status of my submitted job

You can see the current status of the available Grid Engine queues and the jobs associated with the queues with qstat:

qstat

It returns:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2041601 0.53741 myjob_1    peguerin     r     09/17/2020 12:08:30 cemeb20.q@mbbnode-0-30.local       1        
2041602 0.54128 myjob_2    peguerin     qw    09/17/2020 12:07:12                                    1       
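
With many jobs in flight, a per-state count is often easier to read than the full listing. The sketch below assumes the default qstat layout shown above (two header lines, state in the fifth column); count_by_state is a hypothetical helper name, and the sample text is modelled on the listing above so that the example runs without a cluster.

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state from default qstat output on stdin.
# Assumes two header lines and the state in the 5th whitespace-separated column.
count_by_state() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input modelled on the qstat listing above.
sample_qstat='job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2041601 0.53741 myjob_1    peguerin     r     09/17/2020 12:08:30 cemeb20.q@mbbnode-0-30.local       1
2041602 0.54128 myjob_2    peguerin     qw    09/17/2020 12:07:12                                    1'

# Prints one "state count" line per state found in the sample.
printf '%s\n' "$sample_qstat" | count_by_state
```

On a real cluster the same function can be fed directly: qstat | count_by_state.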

To display the information in full format:

qstat -f

See jobs owned by the user $USER:

qstat -u "$USER"

See a specific job status with a given job-ID <job-ID>:

qstat -j <job-ID>
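
A common use of qstat -j in scripts is to wait until a job has left the system before starting a follow-up step. The sketch below polls at a fixed interval. It rests on an assumption worth verifying on your installation: that qstat -j <job-ID> exits with a non-zero status once the job is no longer known to SGE. The name wait_for_job and the 30-second default are illustrative choices.

```shell
#!/bin/bash
# Hypothetical helper: block until a job no longer appears in the scheduler.
# Assumption: 'qstat -j <job-ID>' exits with a non-zero status once the job
# is unknown to SGE (finished or deleted); verify this on your installation.
wait_for_job() {
    local job_id=$1
    local interval=${2:-30}   # seconds between polls
    while qstat -j "$job_id" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "Job $job_id is no longer queued or running"
}
```

For example, wait_for_job <job-ID> 60 checks once a minute. The function only reports that the job has left the queue, not whether it succeeded, so the job's output file still needs to be inspected.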

To delete a Grid Engine job with a given job-ID <job-ID> from the queues:

qdel <job-ID>

Queue information

This command displays a list of available queues in a cluster:

qconf -sql

This command shows the status of the Grid Engine hosts (add -q or -j to also list the queues or jobs on each host):

qhost

The qconf command manages the configuration of the system's queues. To print the configuration of the queue named <queue>:

qconf -sq <queue>

Finally, this qstat command displays a cluster queue summary:

qstat -g c

It returns:

CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE  
--------------------------------------------------------------------------------
cemeb.q                           0.70     64      0     39    104      8      0 
cemeb20.q                         1.17    121      0     20    176     40     48 
mbb.q                             0.00      0      0     40     40      0      0 
mem.q                             0.08      3      0     37     40      0      0
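
When choosing a queue for the -q option, it can help to rank the queues by the number of free slots. This is a minimal sketch, assuming the qstat -g c layout shown above (two header lines, queue name in the first column, AVAIL in the fifth); queues_by_avail is a hypothetical helper name, and the sample text repeats the summary above so that the example runs without a cluster.

```shell
#!/bin/bash
# Hypothetical helper: list cluster queues by free slots, most available first.
# Assumes the 'qstat -g c' layout: two header lines, queue name in column 1,
# AVAIL in column 5.
queues_by_avail() {
    awk 'NR > 2 && NF >= 6 { print $1, $5 }' | sort -k2,2nr
}

# Sample input repeating the cluster queue summary above.
sample_summary='CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
cemeb.q                           0.70     64      0     39    104      8      0
cemeb20.q                         1.17    121      0     20    176     40     48
mbb.q                             0.00      0      0     40     40      0      0
mem.q                             0.08      3      0     37     40      0      0'

# Prints "queue avail" pairs, starting with the queue that has the most free slots.
printf '%s\n' "$sample_summary" | queues_by_avail
```

On a real cluster: qstat -g c | queues_by_avail. Free slots alone do not guarantee that a job will start at once, since queue limits and access rules also apply.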