Submit jobs

sbatch <run_script>.sh
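When one job must wait for another, the job ID printed by sbatch can be captured and passed on as a dependency. A minimal sketch, assuming the standard `--parsable` and `--dependency=afterok` options of sbatch; the function name `submit_chain` and the script names are illustrative only:

```shell
#!/bin/bash
# submit_chain FIRST SECOND: submit FIRST, then submit SECOND so that it
# starts only if FIRST finishes successfully. Prints both job IDs.
submit_chain() {
    local first="$1" second="$2" first_id second_id
    # --parsable makes sbatch print only "jobid" (or "jobid;cluster").
    first_id="$(sbatch --parsable "$first")" || return 1
    first_id="${first_id%%;*}"   # drop the optional ";cluster" suffix
    second_id="$(sbatch --parsable --dependency=afterok:"$first_id" "$second")" || return 1
    echo "${first_id} ${second_id%%;*}"
}
```

Usage, with two hypothetical scripts: `submit_chain preprocess.sh train.sh`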

Examples of job scripts

  • Serial job
#!/bin/bash
#SBATCH --account=def-afyshe-ab
# time (DD-HH:MM:SS)
#SBATCH --time=00-00:01:00
echo 'Hello, world!'
sleep 5
  • Array job
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=def-afyshe-ab
#SBATCH --time=0-01:00:00
#SBATCH --mem-per-cpu=500M
# job name (%x), job ID (%j)
#SBATCH --output=output/%x-%j.txt
# run a 10-job array, with a maximum of 5 running at a time
#SBATCH --array=1-10%5
# alternative: job array with indexes [1,2,3,5,7]; use only one --array line
##SBATCH --array=1,2,3,5,7

./myapplication $SLURM_ARRAY_TASK_ID
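A common way to use SLURM_ARRAY_TASK_ID is to pick one input per task from a list file. A minimal sketch, assuming a hypothetical file `inputs.txt` with one input per line and array indexes starting at 1; the helper name `input_for_task` is not part of Slurm:

```shell
#!/bin/bash
# input_for_task ID LIST: print line ID of LIST, i.e. the input that
# array task ID should process. LIST holds one input per line.
input_for_task() {
    sed -n "${1}p" "$2"
}

# In the job script (inputs.txt is a hypothetical file name):
#   input="$(input_for_task "$SLURM_ARRAY_TASK_ID" inputs.txt)"
#   ./myapplication "$input"
```

If the array index is larger than the number of lines, the helper prints nothing, so the job script may want to check for an empty result before running.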

Interactive job

salloc --time=1:0:0 --ntasks=2 --account=def-someuser
salloc --time=0-01:00:00 --cpus-per-task=2 --account=def-afyshe-ab --mem-per-cpu=512M
salloc: Granted job allocation 1234567
...        # do some work
exit       # terminate the allocation
salloc: Relinquishing job allocation 1234567

Monitoring jobs

  • Show jobs for a specific user
squeue -u <user> -r
  • Show only running jobs, or only pending jobs
squeue -u <user> -t running
squeue -u <user> -t pending
# Job states accepted by -t include: PENDING, RUNNING, SUSPENDED, COMPLETING, COMPLETED, OUT_OF_MEMORY, FAILED
  • Show (detailed) information for a specific job
scontrol show job <jobid>
scontrol show job -dd <jobid>
  • Show status information for a running job
sstat -j <jobid>
# List the resources used by a job: average CPU time, max memory, max virtual memory, job ID
sstat -j <jobid> --format=AveCPU,MaxRSS,MaxVMSize,JobID
  • Attach to a running job
srun --jobid=<jobid> --pty bash -i
  • Email notification
#SBATCH --mail-user=<email_address>
#SBATCH --mail-type=ALL
#SBATCH --mail-type=TIME_LIMIT
#SBATCH --mail-type=TIME_LIMIT_80

Completed jobs

  • Show a short summary of a completed job
seff <jobid>
  • Show a detailed summary of a completed job or all jobs of a user
sacct -j <jobid>
sacct -j <jobid> --format=JobID,JobName,AveCPU,MaxRSS,MaxVMSize,Elapsed
sacct -u <user> --format=JobID,JobName,AveCPU,MaxRSS,MaxVMSize,Elapsed
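sacct reports MaxRSS per job step, so finding the peak memory of a whole job means taking the largest value across its steps. A minimal sketch, assuming sacct is run with `--parsable2 --noheader` and that MaxRSS values carry a K, M or G suffix (or are plain byte counts); the helper name `peak_rss_mb` is illustrative:

```shell
#!/bin/bash
# peak_rss_mb: read "JobID|MaxRSS" lines on stdin and print the largest
# MaxRSS over all job steps, in MiB. Steps with an empty MaxRSS are skipped.
peak_rss_mb() {
    awk -F'|' '
        $2 != "" {
            value = $2 + 0                     # numeric part, suffix ignored
            unit = substr($2, length($2), 1)   # last character
            if (unit == "K")      kib = value
            else if (unit == "M") kib = value * 1024
            else if (unit == "G") kib = value * 1024 * 1024
            else                  kib = value / 1024
            if (kib > max) max = kib
        }
        END { printf "%.1f\n", max / 1024 }
    '
}
```

Usage: `sacct -j <jobid> --format=JobID,MaxRSS --parsable2 --noheader | peak_rss_mb`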

Controlling jobs

# Cancel a specific job
scancel <jobid>
# Cancel all jobs for a specific user
scancel -u $USER
# Cancel all pending jobs for a specific user
scancel -t PENDING -u $USER
# Cancel all running jobs for a specific user
scancel -t RUNNING -u $USER
# Cancel one or more jobs by name
scancel --name <jobName>
# Hold a job, preventing it from starting
scontrol hold <jobid>
# Release a job hold, allowing the job to try to start
scontrol release <jobid>
# Resume a previously suspended job
scontrol resume <jobid>
# Requeue a running, suspended or finished job into pending state
scontrol requeue <jobid>
# List the job IDs of a user's running jobs (no header)
squeue -u <user> -ho %A -t RUNNING
# Set a new time limit for a pending job, or for a running job (needs admin privileges)
scontrol update jobid=<jobid> TimeLimit=<TimeLimit>
# Set other parameters for a job
scontrol update jobid=<jobid> Account=<account> CPUsPerTask=<count> MinMemoryCPU=<MB> Gres=<list>
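The header-less job ID listing from squeue can be fed to the other control commands. A minimal sketch that holds every pending job of a user, assuming the `-h`, `-o %A` and `-t` options shown above; the function name `hold_pending` and the `DRY_RUN` variable are illustrative, not part of Slurm:

```shell
#!/bin/bash
# hold_pending USER: place a hold on every pending job of USER.
# Set DRY_RUN=1 to print the scontrol commands instead of running them.
hold_pending() {
    local user="$1" jobid
    # -h drops the header, -o %A prints only the job ID.
    squeue -u "$user" -h -o %A -t PENDING | while read -r jobid; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scontrol hold ${jobid}"
        else
            scontrol hold "$jobid"
        fi
    done
}
```

Running it once with `DRY_RUN=1` first shows which jobs would be affected before anything is changed.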

SLURM Environment Variables

Environment Variable      Description
SLURM_JOB_NAME            User-specified job name
SLURM_JOB_ID              Slurm job ID
SLURM_NNODES              Number of nodes allocated to the job
SLURM_NTASKS              Number of tasks allocated to the job
SLURM_ARRAY_TASK_ID       Array index of this task
SLURM_ARRAY_TASK_MAX      Highest array index of the job
SLURM_MEM_PER_CPU         Memory allocated per CPU
SLURM_JOB_NODELIST        List of nodes on which resources are allocated to the job
SLURM_JOB_CPUS_PER_NODE   Number of CPUs allocated per node
SLURM_JOB_PARTITION       Partition(s) that the job is in
SLURM_JOB_ACCOUNT         Account under which the job is run
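These variables are handy for logging what a job actually received. A minimal sketch of a helper that could be called at the top of a job script; the name `job_summary` and the "n/a" fallback are illustrative choices:

```shell
#!/bin/bash
# job_summary: print one line describing the current Slurm job.
# Outside a job the variables are unset, so each falls back to "n/a".
job_summary() {
    echo "job=${SLURM_JOB_ID:-n/a} name=${SLURM_JOB_NAME:-n/a} nodes=${SLURM_NNODES:-n/a} tasks=${SLURM_NTASKS:-n/a} task_id=${SLURM_ARRAY_TASK_ID:-n/a}"
}
```

Because of the fallbacks, the same script can be tried on a login node before it is submitted.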

Account information

# List user and their default account (accounting group)
sacctmgr show user <user> withassoc
# Show usage info for user
sshare -l -U <user>
# Show usage info for all users under a specific account
sshare -l -A <account>_cpu --all

Cluster information

# Show idle nodes on the cluster
sinfo --states=idle
# Show down, drained and draining nodes and their reason
sinfo -R
# Show detailed node info
sinfo --Node --long
# Show reservations on the cluster
scontrol show reservation 
# Show configuration descriptions
man slurm.conf
# Check configuration values
scontrol show config | grep Max
# Show queued and running job counts per partition
partition-stats

Software modules

# Show currently loaded modules
module list
# List modules matching a name that can be loaded in the current environment
module avail <name>
# Search all modules, including ones not currently loadable, and show how to load them
module spider <name>
# Load a module
module load <moduleName>
# Unload a module
module unload <moduleName>
# Show the commands a module runs when it is loaded
module show <moduleName>

Disk usage

quota
quota --per_user
diskusage_report --per_user --all_users

References