SLURM is a cluster management and job scheduling system. This is the software we use in the CS clusters for resource management.
This page contains general instructions for all SLURM clusters in CS. Cluster-specific information is at the end.
To send jobs to a cluster, one must first connect to a submission node. For each cluster there's at least one submission node named <cluster>-gw, e.g.: phoenix-gw, hm-gw, etc.
First login to a submission node. E.g. if you're working on the phoenix cluster:
ssh phoenix-gw
Or, from outside the CS network:
ssh -J mylogin@bava.cs.huji.ac.il mylogin@phoenix-gw
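Alternatively, the jump host can be configured once in ~/.ssh/config so that ssh phoenix-gw works directly from outside (a minimal sketch; replace mylogin with your actual login):
Host phoenix-gw hm-gw
    User mylogin
    ProxyJump mylogin@bava.cs.huji.ac.il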
Submit a script (myscript) that requires 4 cpus, 400M RAM and will run for at most 2 hours:
sbatch --mem=400m -c4 --time=2:0:0 "myscript"
Submit a binary executable (myexecutable), for maximum 3 days:
sbatch --mem=400m -c4 --time=3-0 --wrap="myexecutable"
Submit a script that requires 2 gpus (on clusters that have gpus):
sbatch --mem=500m -c2 --gres=gpu:2 "myscript"
Submit a script that requires 1 gpu with at least 6 GB video memory (framebuffer memory):
sbatch --mem=10g -c2 --gres=gpu:1,vmem:6g "myscript"
Submit an array of 50 jobs with 2 cpus and 1G RAM (the jobs are independent, each will be scheduled separately):
sbatch --mem=1g -c2 --array=0-49 "myscript"
Run a shell interactively (might have limited resources):
srun --mem=400m -c2 --time=1-12 --pty $SHELL
To run graphical programs one needs to connect to the gw using ssh (not rlogin or telnet) with X11 forwarding enabled. Then GUI programs should work normally. E.g.:
srun --mem=400m -c2 --time=2:0:0 xterm
Note: There are several limitations for GUI programs. Please see Graphical Commands for more details.
Each job submission must declare how many resources it will require. Resources are RAM, CPUs, time, and GPUs. Requesting too few resources will cause the job either to be killed or to suffer a considerable performance impact. Requesting too many resources will delay the job's starting time (possibly by a few days) and reduce the priority of all the user's jobs.
There are different time limits on different clusters. Most clusters won't allow jobs longer than 7, 14 or 21 days. Some clusters have dedicated nodes for short jobs (usually up to 2 days).
Requesting the maximum allowed time (for no good reason) will cause the job's starting time to be delayed. This is due to the backfill[1] algorithm. Also, priority is based on fair-share usage, so keeping a job alive without using the resources will reduce the priority of the user and the lab (and delay the jobs of other users). I.e. don't keep a shell running overnight.
Job management takes resources and time: each submission includes queuing, dispatching, running and finalizing the job. As such it is best to avoid very short jobs. If jobs take less than 5 minutes, it is best to combine several of them into a single script and run them sequentially, instead of separating them into different slurm jobs.
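For example, a minimal sketch of a script that runs several short tasks sequentially inside one slurm job (the short_task program and its input files are hypothetical):
#!/bin/bash
# myscript: run several short tasks one after the other inside a single job
for i in 1 2 3 4 5; do
    ./short_task "input_$i.txt"
done
Submitted with, e.g.:
sbatch --mem=400m -c2 --time=1:0:0 myscript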
Each job must declare how many CPUs it will require. Due to hyper-threading (on nodes with hyper-threading enabled), this number needs to be even (it will be rounded up if odd).
The CPU allocation is enforced: if the number of processes/threads exceeds the number of allocated CPUs, they will share the allocated CPUs (even if other CPUs are available), reducing performance.
Requesting too many CPUs can cause a delay of the starting time (as it will wait for the resources to become available), and cause other jobs to wait for the occupied but unused CPUs.
Each job submission must declare how much RAM it will take.
Requesting too much RAM will cause a delay in the starting time of the job and other jobs.
Requesting too little will either cause the job to be killed or to use virtual memory (swap) instead. Swapping can cause severe performance degradation and it is best to avoid it.
GPU misuse has the same effect as CPU misuse. I.e. starting time and performance issues. It is important to make sure your jobs can use more than one GPU before requesting multiple GPUs per job.
As GPUs are considered an expensive resource, it is important not to request more GPUs than you will use. If your job doesn't need a GPU, don't request one; it is best to run such jobs on a cluster without GPUs.
If the lower-end GPUs aren't sufficient for a job, instead of requesting a specific type of GPU (e.g. --gres=gpu:m60) you can request a minimum amount of video memory with the vmem gres (e.g. --gres=gpu:1,vmem:6g).
Each call to the scheduler causes some load. A single call to squeue or sbatch is fine, but any loop over a slurm command (whether submitting or querying) will make the load significant and should be avoided.
To submit multiple jobs, instead of looping over sbatch, the --array option should be used (e.g. `--array=0-49` for 50 jobs). With an array, hundreds of jobs can be submitted in less than a second without unnecessarily loading the scheduler. Submitting hundreds of jobs with a loop over sbatch can take several minutes or more, during which the scheduler will be busy processing the new jobs instead of scheduling them. The scheduler also has other performance optimizations when processing pending array jobs vs. pending single jobs.
To distinguish between the jobs inside an array, the SLURM_ARRAY_TASK_ID environment variable is set with the task id of each job in the array.
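For example, a minimal sketch of an array job script (the process_data program and the input naming scheme are hypothetical):
#!/bin/bash
# Submitted with: sbatch --mem=1g -c2 --array=0-49 myscript
# Each task in the array processes a different input file, selected by its task id
./process_data "input_${SLURM_ARRAY_TASK_ID}.txt"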
In a similar fashion, a loop (or `watch`) over squeue should also be avoided. To know when a job ends (or starts) the --mail-type option should be used. To add dependencies between jobs, the --dependency option should be used. Please check the man page of sbatch for more details regarding these options.
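For example (the job id in the dependency is illustrative):
sbatch --mail-type=END,FAIL --mem=400m -c2 --time=2:0:0 myscript
sbatch --dependency=afterok:123456 --mem=400m -c2 --time=2:0:0 myscript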
Used to schedule a script to run as soon as resources are available.
usage:
sbatch [options] <script>
options:
-c n | Allocate n cpus (per task). |
-t t | Total run time limit (e.g. "2:0:0" for 2 hours, or "2-0" for 2 days and 0 hours). |
--mem-per-cpu m | Allocate m MB per cpu. |
--mem m | Allocate m MB per node (--mem and --mem-per-cpu are mutually exclusive). |
--array=1-k%p | Run the script k times (from 1 to k). The array index of the current run is in the SLURM_ARRAY_TASK_ID environment variable, accessible from within the script. The optional %p parameter limits the array to at most p simultaneously running jobs (usually it's nicer to the other users). |
--wrap cmd | Instead of giving a script to sbatch, run the command cmd. |
-M cluster | The cluster to run on. Can be a comma-separated list of clusters, in which case the cluster offering the earliest expected job initiation time will be chosen. |
-n n | Allocate resources for n tasks. Default is 1. Only relevant for parallel jobs, e.g. with MPI. |
--gres resource | Specify a general resource to use. Currently only GPUs and vmem are supported, e.g. gpu:2 for two GPUs. On clusters with several types of GPUs, a specific GPU type can be requested with e.g. 'gpu:m60:2' for 2 M60 GPUs, or a minimum video memory with e.g. 'gpu:1,vmem:6g'. |
More info in "man sbatch"
Shows the status of submitted jobs.
Usage:
squeue -M <cluster>
More info in "man squeue"
A shortcut for a different output format of squeue.
Usage:
ssqueue
Cancels a job:
Usage:
scancel <job id>
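scancel can also filter jobs, e.g. to cancel all of your own pending jobs:
scancel --state=PENDING -u $USER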
More info in "man scancel"
To hold a job from executing (e.g. to give another job a chance to run), run:
scontrol hold <job id>
To release it:
scontrol release <job id>
To run commands interactively, use the srun command. This will block until there are resources available, and will redirect the input/output of the program to the executing shell. srun has most of the same parameters as sbatch.
If the input/output isn't working correctly (e.g. with shell jobs), adding the --pty flag usually solves the issue.
On some of the clusters interactive jobs have some limitations compared to normal batch jobs.
Used to view statistics about previous jobs.
e.g.
sacct
Long format:
sacct -l
All users:
sacct -a
Since 1/1/2013:
sacct -S 2013-01-01
Or any combination of the options.
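For example, combining the above (long format, all users, since 1/1/2013):
sacct -l -a -S 2013-01-01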
A shortcut for different output formats of sacct.
Usage:
ssacct
or
ssacct --res
Show data about the cluster and the nodes
More info in "man sinfo"
Show detailed data about each node. Usage:
ssinfo
Show data about running jobs (e.g. memory, time, etc.)
Show general information about the available resources of the cluster (memory, GPUs...) and about the current usage of different users.
Show current usage and limits of an account.
Using the sbatch command, a script is executed once the resources are available. The script must be a text file, i.e. most scripting languages are accepted (sh, bash, csh, python, perl, etc.), but not compiled binary files.
All parameters to sbatch can be incorporated into the script itself, simplifying the batch submission command. The parameters inside the script file are passed via lines beginning with '#SBATCH'. These lines must be after the first line (e.g. after the #!/bin/bash line) but before any real command.
This way, instead of:
sbatch --mem=400m -c4 --time=2:0:0 --gres=gpu:3 script.sh
One can use the script:
#!/bin/bash
#SBATCH --mem=400m
#SBATCH -c4
#SBATCH --time=2:0:0
#SBATCH --gres=gpu:3
some script lines
...
and submit using just:
sbatch script.sh
All programs will be terminated once the batch script terminates. So if commands are executed in the background, it is usually helpful to end the batch script with the 'wait' command (assuming bash).
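For example, a minimal sketch (the worker program and its inputs are hypothetical):
#!/bin/bash
#SBATCH --mem=1g
#SBATCH -c4
#SBATCH --time=2:0:0
# Start two workers in the background
./worker input_a.txt &
./worker input_b.txt &
# Wait for all background processes, otherwise the job ends immediately
wait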
For simple interactive sessions, srun --pty should suffice. Using graphical commands is not recommended as it creates additional failure points for the job: if the X connection is cut off (on the local machine or the submission node) the job will be killed. Moreover, programs that require advanced features such as OpenGL might not work properly (or at all).
Also, if the cluster is full, it might take time for the job to start, during which the user cannot log out of the display (or the job will die on startup).
Nonetheless, if a graphical display is required, the DISPLAY environment variable should be set appropriately. The simplest method is ssh'ing to the gw machine with display forwarding. This should set up everything.
Another method is setting it manually:
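For example (a sketch; mymachine is a hypothetical host running an X server that accepts connections from the cluster nodes):
export DISPLAY=mymachine.cs.huji.ac.il:0
srun --mem=400m -c2 --time=2:0:0 xterm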
This will open an xterm with the specified resources, but it will open only when the resources are allocated.
Each job is given priority according to several weighted factors:
There are three main QOS levels: high, normal, and low. The default is normal. To use a different QOS, use the --qos flag of sbatch.
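For example, a sketch of submitting with the high QOS (resource values are illustrative):
sbatch --qos=high --mem=400m -c2 --time=2:0:0 myscript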
Jobs with the high QOS will be allocated before jobs with the other QOS levels. Don't abuse this QOS, otherwise everyone will use it and it will lose its purpose.
This QOS is limited to 5 jobs per lab.
The default QOS.
The low QOS is used to submit jobs that will run only if there are no other jobs to run. Currently no jobs are killed, so if a low-priority job runs for 30 days, it can still cause normal- and high-priority jobs to wait.
These QOSes are used for the killable jobs. Please see Killable Jobs for more information. It is best not to use these QOSes directly.
This factor takes into account past resource use by the user/account, with some decay factor. If user1 used the cluster intensively in the past week, user2 will get higher priority. But if user1 used the cluster 2 years ago, it probably won't affect the current priority.
The share applies to the labs and the users. I.e. if a lab used many resources in the recent past, a new user in that lab might still get low priority compared to users from other labs (but not compared to users from the same lab).
Each lab has an account on the cluster, which is limited in the amount of simultaneously used resources. This amount is relative to the total cluster resources and to the amount of resources the lab has contributed to the cluster.
The longer the job is in the queue, the higher priority it will gain over other younger jobs.
If the lab has used all the resources it is allowed, users can send killable jobs. These jobs have the lowest priority, can surpass the lab limit, but will be killed if a higher priority job is submitted.
To submit a killable job, the --killable flag should be used with srun or sbatch. e.g.:
sbatch [other options] --killable myscript.sh
Note that killable will set the account and the QOS, so they cannot be set by the sbatch parameters.
It is advisable to add the --requeue flag to killable jobs. This flag ensures that the job will be re-queued when killed.
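For example, combining the two (resource values are illustrative):
sbatch --killable --requeue --mem=2g -c2 --time=1-0 myscript.sh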
The --killable flag replaces the previous "-A requeue" option, and might not be available on all clusters (yet).
Queued jobs can have many different 'Reasons' that reflect why they are not running. A few of the more commonly seen reasons are listed here; the full list can be found in man squeue. A job can be pending for multiple reasons but only one reason will be shown, which can in some cases be a bit confusing. On some of our clusters jobs are automatically submitted to all partitions and then run on the partition that suits them best; this means a job may also be submitted to a partition whose time limit is too low, or to a partition that is not accessible to the user who submitted it.
One or more higher priority jobs exist for this partition or advanced reservation.
The job is waiting for resources to become available.
The job's time limit exceeds the current time limit of one or more of its partitions.
The user/account does not have access to one or more of the partitions that the job was submitted to.
The lab (account) has passed the allowed simultaneous resource usage. See Lab Limit and Killable Jobs.
Usually means the job is interactive and the user has exceeded the allowed number of simultaneous interactive jobs on the cluster. When submitting with sbatch this shouldn't appear.
To show on which clusters you have an account, use the sacctmgr command. e.g.
sacctmgr show users -s user=$USER format=user,cluster,account,defaultaccount | awk '$3 != "default" && $3 != "requeue"'
Clusters:
cluster | nodes | RAM | swap | cpu (sockets:cores:threads) | Max time limit | gres | defaults | interactive jobs | Notes |
---|---|---|---|---|---|---|---|---|---|
phoenix | eye-01..04 | 190GB | 250GB | 32 (2:8:2) | 3 weeks (21-0) | | -c2 --mem 50 --time 2:0:0 | up to 2 | Phoenix cluster policy |
| sulfur-01..16 | 62GB | 60GB | 8 (2:4:1) | 3 weeks (21-0) | | | | |
| gsm-01..04 | 256GB | 250GB | 32 (2:8:2) | 2 days (2-0) | gpu:black:4 (nvidia titan black), vmem:6G | | | |
| gsm-03..04 | | | | 1 week (7-0) | | | | |
| cortex-01..05 | 252GB | 50GB | 16 (2:8:1) | 2 days (2-0) | gpu:m60:8 (nvidia tesla M60), vmem:8G | | | |
| cortex-06..08 | | | 24 (2:12:1) | | | | | |
| cortex-06..07 | | | | 1 week (7-0) | | | | |
| cortex-03..05 | | | 24 (2:12:1) | | | | | |
| lucy-01..03 | 384GB | 8GB | 48 (2:12:2) | 1 week (7-0) | gpu:gtx980:2 (nvidia gtx 980), vmem:4G | | | |
| cb-05..20 | 64GB | 128GB | 16 (2:4:2) | 3 weeks (21-0) | | | | |
| oxygen-01..08 | 252GB | 48GB | 48 (2:12:2) | | | | | |
| sm-01..08 | 48GB | 48GB | 16 (2:4:2) | | | | | |
| sm-09..16 | 24GB | 48GB | 16 (2:4:2) | | | | | |
| sm-17..20 | 64GB | 122GB | 24 (2:6:2) | | | | | |
| dumfries-001..008 | 125GB | 95GB | 32 (2:8:2) | 1 week (7-0) | gpu:rtx2080:4 (RTX 2080Ti), vmem:10G | | | |
| dumfries-009..010 | | | | 2 days (2-0) | | | | |
| creek-01 | 251GB | 91GB | 72 (2:18:2) | 2 days (2-0) | gpu:rtx2080:8 (RTX 2080Ti), vmem:10G | | | |
| creek-02..04 | | | | 1 week (7-0) | | | | |
| firth-01..02 | 376GB | 95GB | 40 (2:10:2) | 1 week (7-0) | gpu:rtx6000:4 (Quadro RTX 6000), vmem:24G | | | |
| wadi-01..04 | 503GB | 15GB | 128 (2:32:2) | 3 weeks (21-0) | | | | |
| wadi-05 | | | | 2 days (2-0) | | | | |
| ampere-01 | 376GB | 15GB | 32 (2:16:1) | 1 week (7-0) | gpu:a10:8 (A10), vmem:22G | | | |
hm | hm-05..38 | 64GB | 128GB | 32 (2:8:2) | 3 weeks (21-0) | | -c2 --mem 50 --time 2:0:0 | | |
| hm-43..52 | 128GB | 128GB | 32 (2:8:2) | | | | | |
| hm-53..71 | 256GB | 128GB | 48 (2:12:2) | | | | | |
sed | sed-01..16 | 256GB | 128GB | 40 (2:10:2) | 3 weeks (21-0) | | -c2 --mem 50 --time 2:0:0 | | |
picasso | picasso-02..16 | 62GB | 128GB | 40 (2:10:2) | 3 weeks (21-0) | | -c2 --mem 50 --time 2:0:0 | | |
| warhol-01..15 | 24GB | 48GB | 8 (2:4:1) | | | | | |
blaise | blaise-001..005 | 255GB | 255GB | 160 (2:10:8) | 2 days (2-0) | gpu:4 (nvidia p100) | -c8 --mem 50 --time 2:0:0 | up to 2 | ppc64le architecture |
| blaise-002,003,005 | | | | 1 week (7-0) | none | | | |
silico | silico-001 | 128GB | 120GB | 48 (2:12:2) | 2 days (2-0) | gpu:gtx1080:4 (GTX 1080Ti) | -c2 --mem 50 --time 2:0:0 | up to 2 | |
| silico-002..008 | | | 32 (2:8:2) | 2 weeks (14-0) | | | | |
| silico-009..010 | | | 32 (2:8:2) | | gpu:rtx2080:4 (RTX 2080Ti) | | | |
The blaise machines are PowerPC based. This is a different architecture from intel, with a different linux distribution (software might need to be recompiled). Note that blaise-gw does not have the same distribution as the blaise compute nodes.
Man pages: sbatch, srun, sacct, squeue, scancel, sinfo, sstat, sprio.
Web pages:
General: http://slurm.schedmd.com/documentation.html
User guide: http://slurm.schedmd.com/quickstart.html