Slurm

Submit jobs

$ sbatch <run_script>.sh
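
When a second job should start only after a first one succeeds, the job ID printed by sbatch has to be captured. The sketch below assumes sbatch's default confirmation message ("Submitted batch job <id>"). The helper name job_id_from_sbatch and the script names first_step.sh and second_step.sh are hypothetical.

```shell
#!/bin/bash
# Extract the job ID from sbatch's default confirmation message,
# e.g. "Submitted batch job 1234567". (Hypothetical helper name.)
job_id_from_sbatch() {
    # The job ID is the last whitespace-separated field of the message.
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Only submit when sbatch is available and the (hypothetical) script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f first_step.sh ]; then
    msg=$(sbatch first_step.sh)
    jobid=$(job_id_from_sbatch "$msg")
    # Start second_step.sh only after the first job finishes successfully.
    sbatch --dependency=afterok:"$jobid" second_step.sh
fi
```

Alternatively, sbatch --parsable prints the job ID without the surrounding text, which may make the parsing step unnecessary.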

Examples of job scripts

  • Serial job
#!/bin/bash
#SBATCH --account=def-afyshe-ab
# time (DD-HH:MM:SS)
#SBATCH --time=00-00:01:00
echo 'Hello, world!'
sleep 5
  • Array job
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=def-afyshe-ab
#SBATCH --time=0-01:00:00
#SBATCH --mem-per-cpu=500M
# job name (%x), job ID (%j)
#SBATCH --output=output/%x-%j.txt
# run a 10-job array, with a maximum of 5 running at a time
#SBATCH --array=1-10%5
# alternative: a job array with indexes [1,2,3,5,7] (use only one --array line)
##SBATCH --array=1,2,3,5,7

./myapplication $SLURM_ARRAY_TASK_ID
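
A common pattern in the body of an array job script is to use the array index to pick one line from a parameter file, so that each task runs with different arguments. This is a minimal sketch: the helper param_for_task and the file params.txt are hypothetical, and the fallback to index 1 is an assumption made so the script can also be tried outside a job array.

```shell
#!/bin/bash
# Print the line of a parameter file that corresponds to an array index.
# Usage: param_for_task <file> <index>   (hypothetical helper)
param_for_task() {
    sed -n "${2}p" "$1"
}

# Outside a job array SLURM_ARRAY_TASK_ID is unset, so fall back to 1.
task_id=${SLURM_ARRAY_TASK_ID:-1}

# params.txt is a hypothetical file with one set of arguments per line.
if [ -f params.txt ]; then
    args=$(param_for_task params.txt "$task_id")
    echo "Task $task_id would run: ./myapplication $args"
fi
```

With --array=1-10, params.txt would need at least 10 lines; task 3 reads line 3, and so on.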

Interactive job

$ salloc --time=1:0:0 --ntasks=2 --account=def-someuser
$ salloc --time=0-01:00:00 --cpus-per-task=2 --account=def-afyshe-ab --mem-per-cpu=512M
salloc: Granted job allocation 1234567
$ ...        # do some work
$ exit       # terminate the allocation
salloc: Relinquishing job allocation 1234567

Monitoring jobs

  • Show jobs for a specific user
$ squeue -u <user> -r
  • Show only running jobs, or only pending jobs
$ squeue -u <user> -t running
$ squeue -u <user> -t pending
# Job states include: PENDING, RUNNING, SUSPENDED, COMPLETING, COMPLETED, OUT_OF_MEMORY, FAILED
  • Show (detailed) information for a specific job
$ scontrol show job <jobid>
$ scontrol show job -dd <jobid>
  • Show status information for a running job
$ sstat -j <jobid>
# List resources used by a job: average CPU time, max memory, max virtual memory, job ID
$ sstat -j <jobid> --format=AveCPU,MaxRSS,MaxVMSize,JobID
  • Email notification
#SBATCH --mail-user=<email_address>
# Pick one --mail-type line, or combine values with commas (e.g. ALL,TIME_LIMIT_80)
#SBATCH --mail-type=ALL
#SBATCH --mail-type=TIME_LIMIT
#SBATCH --mail-type=TIME_LIMIT_80

Completed jobs

  • Show a short summary of a completed job
$ seff <jobid>
  • Show a detailed summary of a completed job or all jobs of a user
$ sacct -j <jobid>
$ sacct -j <jobid> --format=JobID,JobName,AveCPU,MaxRSS,MaxVMSize,Elapsed
$ sacct -u <user> --format=JobID,JobName,AveCPU,MaxRSS,MaxVMSize,Elapsed
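
To compare MaxRSS with the memory a job requested, it can help to convert the value to MiB. The sketch below assumes sacct prints memory with a unit suffix such as K, M or G (and treats a bare number as bytes, which is an assumption). The helper name to_mib is hypothetical.

```shell
#!/bin/bash
# Convert a memory value such as "524288K" or "2G" to MiB.
# (Hypothetical helper; a value without a suffix is assumed to be bytes.)
to_mib() {
    printf '%s\n' "$1" | awk '
        /K$/ { printf "%.1f\n", $0 / 1024; next }
        /M$/ { printf "%.1f\n", $0 + 0;    next }
        /G$/ { printf "%.1f\n", $0 * 1024; next }
             { printf "%.1f\n", $0 / 1048576 }
    '
}

# If sacct is available and a job ID was passed as the first argument,
# print MaxRSS in MiB for each job step that reports one.
if command -v sacct >/dev/null 2>&1 && [ -n "${1:-}" ]; then
    sacct -j "$1" --noheader --parsable2 --format=JobID,MaxRSS |
    while IFS='|' read -r id rss; do
        if [ -n "$rss" ]; then
            echo "$id $(to_mib "$rss") MiB"
        fi
    done
fi
```

Steps that report no MaxRSS (often the top-level job line) are skipped rather than printed as zero.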

Controlling jobs

# Cancel a specific job
$ scancel <jobid>
# Cancel all jobs for a specific user
$ scancel -u $USER
# Cancel all pending jobs for a specific user
$ scancel -t PENDING -u $USER
# Cancel all running jobs for a specific user
$ scancel -t RUNNING -u $USER
# Cancel one or more jobs by name
$ scancel --name <jobName>
# Hold a job, preventing it from starting
$ scontrol hold <jobid>
# Release a job hold, allowing the job to try to start
$ scontrol release <jobid>
# Requeue a running, suspended or finished job into pending state
$ scontrol requeue <jobid>
# List the job IDs of a user's running jobs (no header)
$ squeue -u <user> -ho %A -t RUNNING
# Set a new time limit for a pending job, or for a running job (needs admin privileges)
$ scontrol update jobid=<jobid> TimeLimit=<TimeLimit>

Slurm environment variables

Environment variable      Description
SLURM_JOB_NAME            User-specified job name
SLURM_JOB_ID              Slurm job ID
SLURM_NNODES              Number of nodes allocated to the job
SLURM_NTASKS              Number of tasks allocated to the job
SLURM_ARRAY_TASK_ID       Array index of this task
SLURM_ARRAY_TASK_MAX      Highest array index in the job array
SLURM_MEM_PER_CPU         Memory allocated per CPU
SLURM_JOB_NODELIST        List of nodes on which resources are allocated to the job
SLURM_JOB_CPUS_PER_NODE   Number of CPUs allocated per node
SLURM_JOB_PARTITION       Partition(s) in which the job is running
SLURM_JOB_ACCOUNT         Account under which the job is run
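
These variables can be logged at the start of a job script to make output files easier to trace back to a job. A minimal sketch follows. The function name job_summary and the placeholder values are hypothetical, and the defaults exist only so the function also runs outside a job, where the variables are unset.

```shell
#!/bin/bash
# Print a one-line summary of the current job from Slurm's environment.
# Each variable falls back to a placeholder when it is unset,
# e.g. when the script is tried on a login node.
job_summary() {
    echo "job=${SLURM_JOB_NAME:-none} id=${SLURM_JOB_ID:-none} nodes=${SLURM_NNODES:-0} tasks=${SLURM_NTASKS:-0}"
}

job_summary
```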

Account information

# List a user and their associated accounts (accounting groups)
$ sacctmgr show user <user> withassoc
# Show usage info for user
$ sshare -l -U <user>
# Show usage info for all users under a specific account
$ sshare -l -A <account>_cpu --all

Cluster information

# Show idle nodes on the cluster
$ sinfo --states=idle
# Show down, drained and draining nodes and their reason
$ sinfo -R
# Show detailed node info
$ sinfo --Node --long
# Show reservations on the cluster
$ scontrol show reservation
# Show configuration descriptions
$ man slurm.conf
# Check configuration values
$ scontrol show config | grep Max
# Show job info per partition (site-provided utility; may not exist on every cluster)
$ partition-stats

Software modules

# Show currently loaded modules
$ module list
# Search for a module among those that can currently be loaded
$ module avail <name>
# Search all modules and show more detail, including how to load them
$ module spider <name>
# Load a module
$ module load <moduleName>
# Unload a module
$ module unload <moduleName>
# Show the commands a module runs when it is loaded
$ module show <moduleName>

Disk usage

$ quota