Slurm

SLURM is a cluster management and job scheduling system. This is the software we use in the CS clusters for resource management.

This page contains general instructions for all SLURM clusters in CS. Specific information per cluster is at the end.

To send jobs to a cluster, one must first connect to a submission node. For each cluster there's at least one submission node named <cluster>-gw, e.g.: eye-gw, cb-gw, hm-gw, etc.

Quick start

First log in to a submission node. E.g. if you're working on the phoenix cluster:

 ssh phoenix-gw

Or, from outside the CS network:

 ssh mylogin%phoenix-gw@gw.cs.huji.ac.il

Important: phoenix is an example, you most likely cannot access it! Currently access to clusters is per lab, so you should ask your supervisor which clusters you have access to.

Submit a script (myscript) that requires 4 CPUs and 400M RAM, and will run for at most 2 hours:

 sbatch --mem=400m -c4 --time=2:0:0 "myscript" 

Submit a binary executable (myexecutable) that will run for at most 3 days:

 sbatch --mem=400m -c4 --time=3-0 --wrap="myexecutable"

Submit a script that requires 2 GPUs (on clusters that have GPUs):

 sbatch --mem=500m -c2 --gres=gpu:2 "myscript" 

Run a shell interactively (might have limited resources):

 srun --mem=400m -c2 --time=1-12 --pty $SHELL

To run graphical programs one needs to connect to the gw using ssh (not rlogin or telnet) with X11 forwarding enabled. Then GUI programs should work normally. E.g.:

 srun --mem=400m -c2 --time=2:0:0 xterm

Note: There are several limitations for GUI programs. Please see Graphical Commands for more details.

Submission Guidelines

Each job submission must declare how many resources it will require. The resources are RAM, CPUs, time, and GPUs. Requesting too few resources will cause the job either to be killed or to suffer a considerable performance impact. Requesting too many resources will cause the job's starting time to be delayed (possibly by a few days), and the priority of all the user's jobs to be reduced.
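
For example, a submission that declares all four resources explicitly might look like this (the values are only illustrative and should match what the job really needs; --gres is relevant only on clusters with GPUs):

 sbatch --mem=1000m -c4 --time=1-0 --gres=gpu:1 myscript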

Time

There are different time limits on different clusters. Most clusters won't allow jobs longer than 7, 14 or 21 days. Some clusters have dedicated nodes for short jobs (usually up to 2 days).

Requesting the maximum allowed time (for no good reason) will cause the job's starting time to be delayed. This is due to the backfill algorithm (https://slurm.schedmd.com/sched_config.html). Also, priority is based on fairshare use, so keeping a job alive without using the resources will reduce the priority of the user and the lab (and delay the jobs of other users). I.e. don't keep a shell running overnight.

Job management takes resources and time: each submission includes queuing, dispatching, running and finalizing the job. As such it is best to avoid very short jobs. If jobs take less than 5 minutes, it is best to combine several of them in a single script and run them sequentially instead of separating them into different Slurm jobs.
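
For example, instead of submitting many one-minute jobs, a single batch script can loop over the cases sequentially. This is only a sketch; run_case and the case_*.conf files are hypothetical placeholders:

 #!/bin/bash
 #SBATCH --mem=400m
 #SBATCH -c2
 #SBATCH --time=1:0:0
 # Run many short cases one after the other inside a single Slurm job,
 # instead of submitting each case as its own job.
 for cfg in case_*.conf; do
     ./run_case "$cfg"
 done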

CPUs

Each job must declare how many CPUs it will require. Due to hyper-threading (on nodes with hyper-threading enabled), this number needs to be even (it will be rounded up if odd).

The number of CPUs is enforced: if the number of processes/threads exceeds the number of allocated CPUs, they will share the allocated CPUs (even if other CPUs are available), causing a performance reduction.

Requesting too many CPUs can delay the starting time (as the job will wait for the resources to become available), and cause other jobs to wait for the occupied but unused CPUs.
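
As a sketch of matching the thread count to the allocation: when -c is used, Slurm sets the SLURM_CPUS_PER_TASK environment variable inside the job, so a threaded program (my_threaded_program is just a placeholder here) can be told to use exactly the allocated CPUs:

 #!/bin/bash
 #SBATCH -c4
 #SBATCH --mem=2000m
 #SBATCH --time=2:0:0
 # Use exactly as many threads as CPUs were allocated.
 export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
 ./my_threaded_program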

Memory

Each job submission must declare how much RAM it will take.

Requesting too much RAM will cause a delay in the starting time of the job and other jobs.

Requesting too little will either cause the job to be killed or to use virtual memory (swap) instead. Swapping can cause severe performance degradation and it is best to avoid it.
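
To pick a reasonable value, it can help to check how much memory a finished job actually used, e.g. with sacct (see below); the job id 1234 is just a placeholder:

 sacct -j 1234 -o JobID,MaxRSS,Elapsed,State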

GPUs

GPU misuse has the same effects as CPU misuse, i.e. starting time and performance issues. It is important to make sure your jobs can use more than one GPU before requesting multiple GPUs per job.

As GPUs are considered an expensive resource, it is important not to request too many GPUs without using them. If your job doesn't need a GPU, don't request one; it is best to run it on a cluster without GPUs.
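
To check what a job actually sees, it can help to run a short interactive command on a GPU node, e.g. (assuming nvidia-smi is installed on the GPU nodes of your cluster):

 srun --mem=400m -c2 --time=0:10:0 --gres=gpu:2 nvidia-smi

The output should list the allocated GPUs; if your program cannot make use of both, request a single GPU instead.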


Commands

sbatch

Used to schedule a script to run as soon as resources are available.

Usage:

 sbatch [options] <script>

Options:

 -c n             Allocate n CPUs (per task).
 -t t             Total run time limit (e.g. "2:0:0" for 2 hours, or "2-0" for 2 days and 0 hours).
 --mem-per-cpu m  Allocate m MB per CPU.
 --mem m          Allocate m MB per node (--mem and --mem-per-cpu are mutually exclusive).
 --array=1-k%p    Run the script k times (from 1 to k). The array index of the current run is in the
                  SLURM_ARRAY_TASK_ID environment variable, accessible from within the script. The
                  optional %p parameter limits the array to at most p simultaneous jobs (usually
                  it's nicer to the other users).
 --wrap cmd       Instead of giving a script to sbatch, run the command cmd.
 -M cluster       The cluster to run on. Can be a comma-separated list of clusters, in which case
                  the one with the earliest expected job initiation time is chosen.
 -n n             Allocate resources for n tasks. Default is 1. Only relevant for parallel jobs,
                  e.g. with MPI.
 --gres resource  Specify a general resource to use. Currently only GPU is supported, e.g. gpu:2
                  for two GPUs.

More info in "man sbatch"
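
As an example of the --array option, a minimal array submission might look like the following sketch (process_input and the input files are hypothetical); at most 4 of the 20 array tasks will run simultaneously:

 #!/bin/bash
 #SBATCH --mem=400m
 #SBATCH -c2
 #SBATCH --time=1:0:0
 #SBATCH --array=1-20%4
 # Each array task processes its own input file.
 ./process_input input_${SLURM_ARRAY_TASK_ID}.txt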

squeue

Shows the status of submitted jobs.

Usage:

 squeue -M<cluster>

More info in "man squeue"

ssqueue

A shortcut for a different output format of squeue.

Usage:

 ssqueue

scancel

Cancels a job.

Usage:

 scancel <job id>

More info in "man scancel"

hold and release

To hold a job from executing (e.g. to give another job a chance to run), run:

 scontrol hold <job id>

To release it:

 scontrol release <job id>

srun

To run a command interactively, use the srun command. This will block until resources are available, and will redirect the input/output of the program to the executing shell. srun has most of the same parameters as sbatch.

If the input/output isn't working correctly (e.g. with shell jobs), adding the --pty flag usually solves the issue.

On some of the clusters interactive jobs have some limitations compared to normal batch jobs.

sacct

Used to view statistics about previous jobs.

E.g.:

 sacct

Long format:

 sacct -l

All users:

 sacct -a

Since 1/1/2013:

 sacct -S 2013-01-01

Or any combination of the options.
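
E.g., to show all users' jobs since the start of 2013 in the long format:

 sacct -a -l -S 2013-01-01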

ssacct

A shortcut for a different output format of sacct.

Usage:

 ssacct

or

 ssacct --res

sinfo

Shows data about the cluster and the nodes.

More info in "man sinfo"

ssinfo

Shows detailed data about each node. Usage:

 ssinfo

sstat

Shows data about running jobs (e.g. memory, time, etc.).

susage

Shows general information about the available resources of the cluster (memory, GPUs, etc.) and about the current usage by different users.

Batch Scripts

Using the sbatch command, a script is executed once the resources are available. The script must be a text file, so most scripting languages are accepted (sh, bash, csh, python, perl, etc.), but not compiled binary files.

All parameters to sbatch can be incorporated into the script itself, simplifying the batch submission command. The parameters inside the script file are passed by lines beginning with '#SBATCH'. These lines must be after the first line (e.g. after the #!/bin/bash line) but before any real command.

This way, instead of:

 sbatch --mem=400m -c4 --time=2:0:0 --gres=gpu:3 script.sh

One can use the script:

 #!/bin/bash
 #SBATCH --mem=400m
 #SBATCH -c4
 #SBATCH --time=2:0:0
 #SBATCH --gres=gpu:3

 some script lines
 ...

and submit using just:

 sbatch script.sh


All programs will be terminated once the batch script terminates. So when executing a command in the background, it's usually helpful to finish the batch script with the 'wait' command (assuming bash).
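
For example, a batch script that runs two programs in the background (prog_a and prog_b are placeholders) and waits for both to finish:

 #!/bin/bash
 #SBATCH -c4
 #SBATCH --mem=800m
 #SBATCH --time=2:0:0
 # Run two programs concurrently within the same allocation.
 ./prog_a &
 ./prog_b &
 # Without 'wait' the script would exit immediately and the
 # background programs would be terminated with it.
 wait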

Graphical Commands

For a simple interactive session, srun --pty should suffice. Using graphical commands is not recommended as it creates additional failure points for the job, i.e. if the X connection is cut off (on the local machine or the submission node) the job will be killed. Moreover, programs that require advanced options such as OpenGL might not work properly (or at all).

Also, if the cluster is full, it might take time for the job to start, during which the user cannot log out from the display (or the job will die on startup).

Nonetheless, if a graphical display is required, the DISPLAY environment variable should be set appropriately. The simplest method is ssh'ing to the gw machine with display forwarding. This should set up everything.
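
E.g., from your local machine (phoenix-gw is just the example gateway from above; -X enables X11 forwarding, and some setups may need -Y instead):

 ssh -X phoenix-gw
 srun --mem=400m -c2 --time=2:0:0 xterm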

Another method is setting it manually:

  1. On the machine where the X server is running (where the window will be opened), before connecting to the gw, run:
    xauth list $HOST:0
    This will return a line similar to:
    ant-87.cs.huji.ac.il:0  MIT-MAGIC-COOKIE-1  fe8332fcbfd2de8fb37d4acdf64767be
  2. Log in to the gw machine.
  3. Run:
    xauth add <line returned from step 1>
  4. Set the DISPLAY according to <host>:0. E.g. if working on ant-87:
    setenv DISPLAY ant-87:0
  5. Verify that it works by running e.g. xeyes.
  6. Run the command, e.g.:
    srun -n1 -c4 xterm

This will open an xterm with the specified resources, but it will open only when the resources are allocated.

Priority/Scheduling

Each job is given priority according to several weighted factors:

  1. QOS - Requested quality of service
  2. Fairshare - The past resource consumption of the user/account
  3. Job age - How long the job is waiting in the queue

QOS

There are four QOS: high, normal, low and requeue. The default is normal. To use a different QOS, use the --qos flag of sbatch.
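
E.g., to submit a job with the low QOS:

 sbatch --qos=low --mem=400m -c2 --time=2:0:0 myscript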

high

Jobs with the high QOS will be allocated before the other QOS. Don't abuse this QOS, otherwise everyone will use it and it will lose its purpose.

In the future, we might limit the use of the high QOS.

normal

The default QOS.

low

The low QOS is used to submit jobs that will run only if there are no other jobs to run. Currently no jobs are killed, so if a low priority job runs for 30 days, it can still cause normal and high priority jobs to wait.

requeue

This QOS has the same priority as the low QOS, but jobs in this QOS will be killed and requeued if doing so allows jobs from the normal or high QOS to be dispatched sooner.


Fairshare

This factor takes into account past resource use by the user/account, with some decay factor. If user1 used the cluster intensively in the past week, user2 will get higher priority. But if user1 used the cluster 2 years ago, it probably won't affect the current priority.

The share applies to the labs and the users. I.e. if a lab used many resources in the recent past, a new user in that lab might still get low priority compared to users from other labs (but not compared to users from the same lab).

Job age

The longer the job is in the queue, the higher priority it will gain over other younger jobs.

Requeue account

There's a special "requeue" account which allows users to run jobs on almost all clusters, but in the requeue QOS. The advantage is access to many unused resources; the disadvantage is that the jobs are killed once "real" jobs (i.e. normal QOS) want to run.

Access to the requeue account is gained by request to the system.

To use the requeue account use the "-A requeue" option, e.g.:

 sbatch [other options] -A requeue myscript.sh

To submit a job to a different cluster (e.g. clusterA) use the -M option:

 sbatch [other options] -A requeue -M clusterA myscript.sh

The -M option is only available on sbatch (not srun), so no interactive shells on remote clusters.

Some clusters aren't running the most up-to-date Linux, so the jobs might not work the same on all clusters (though usually they should).

To submit a job to either clusterA or clusterB (selected on the earliest expected job initiation time):

 sbatch [other options] -A requeue -M clusterA,clusterB myscript.sh

Once a job is submitted to a cluster, it cannot be moved to a different cluster. If both clusters A and B are occupied and the job is submitted to cluster A, it won't run on cluster B even if cluster B becomes available before cluster A.

To submit a job to all clusters (whether available or not):

 sbatch [other options] -A requeue -M all myscript.sh

Not all clusters have a requeue account, so when using the "-M all" option there will be some warnings about an invalid account; those are OK and should be ignored.

Information about specific clusters

To show on which clusters you have an account, use the sacctmgr command. E.g.:

 sacctmgr show users -s user=$USER format=user,cluster,account,defaultaccount | awk '$3 != "default" && $3 != "requeue"'

To show on which clusters you have a requeue account:

 sacctmgr show users -s user=$USER format=user,cluster,account account=requeue

Clusters:

 cluster | nodes              | RAM   | swap  | cpu (sockets:cores:threads) | Max time limit | gres                       | defaults                  | interactive jobs | Notes
 eye     | eye-01..04         | 190GB | 250GB | 32 (2:8:2)                  | 3 weeks (21-0) |                            | -c2 --mem 50 --time 2:0:0 |                  |
 cb      | cb-05..20          | 64GB  | 128GB | 16 (2:4:2)                  | 7 weeks (50-0) |                            | -c2 --mem 50 --time 2:0:0 |                  |
 hm      | hm-05..38          | 64GB  | 128GB | 32 (2:8:2)                  | 3 weeks (21-0) |                            | -c2 --mem 50 --time 2:0:0 |                  |
         | hm-43..52          | 128GB | 128GB | 32 (2:8:2)                  |                |                            |                           |                  |
         | hm-53..71          | 256GB | 128GB | 48 (2:12:2)                 |                |                            |                           |                  |
 sed     | sed-01..16         | 256GB | 128GB | 40 (2:10:2)                 | 3 weeks (21-0) |                            | -c2 --mem 50 --time 2:0:0 |                  |
 picasso | picasso-02..16     | 62GB  | 128GB | 40 (2:10:2)                 | 3 weeks (21-0) |                            | -c2 --mem 50 --time 2:0:0 |                  |
         | warhol-01..15      | 24GB  | 48GB  | 8 (2:4:1)                   |                |                            |                           |                  |
 gsm     | gsm-01..04         | 256GB | 250GB | 32 (2:8:2)                  | 2 days (2-0)   | gpu:4 (nvidia titan black) | -c2 --mem 50 --time 2:0:0 | up to 2          |
         | gsm-03..04         |       |       |                             | 1 week (7-0)   | none                       |                           |                  |
 lucy    | lucy-01..03        | 384GB | 8GB   | 48 (2:12:2)                 | 3 weeks (21-0) | gpu:2 (nvidia gtx 980)     | -c2 --mem 50 --time 2:0:0 |                  |
 sm      | sm-01..08          | 48GB  | 48GB  | 16 (2:4:2)                  | 3 weeks (21-0) |                            | -c2 --mem 50 --time 2:0:0 |                  |
         | sm-09..16          | 24GB  | 48GB  | 16 (2:4:2)                  | 3 weeks (21-0) |                            | -c2 --mem 50 --time 2:0:0 |                  |
         | sm-17..20          | 64GB  | 122GB | 24 (2:6:2)                  | 3 weeks (21-0) |                            | -c2 --mem 50 --time 2:0:0 |                  |
 sulfur  | sulfur-01..16      | 62GB  | 60GB  | 8 (2:4:1)                   | 3 weeks (21-0) |                            | -c2 --mem 50 --time 2:0:0 |                  |
 oxygen  | oxygen-01..07      | 252GB | 48GB  | 48 (2:12:2)                 | 3 weeks (21-0) |                            | -c2 --mem 50 --time 2:0:0 |                  |
 cortex  | cortex-01..05      | 252GB | 50GB  | 16 (2:8:1)                  | 2 days (2-0)   | gpu:8 (nvidia tesla M60)   | -c2 --mem 50 --time 2:0:0 | up to 2          |
         | cortex-06..08      |       |       | 24 (2:12:1)                 |                |                            |                           |                  |
         | cortex-06..07      |       |       |                             | 1 week (7-0)   | none                       |                           |                  |
         | cortex-03..05      |       |       | 24 (2:12:1)                 |                |                            |                           |                  |
 blaise  | blaise-001..005    | 255GB | 255GB | 160 (2:10:8)                | 2 days (2-0)   | gpu:4 (nvidia p100)        | -c8 --mem 50 --time 2:0:0 | up to 2          | ppc64le architecture
         | blaise-002,003,005 |       |       |                             | 1 week (7-0)   | none                       |                           |                  |
 silico  | silico-001         | 128GB | 120GB | 48 (2:12:2)                 | 2 days (2-0)   | gpu:4 (GTX 1080Ti)         | -c2 --mem 50 --time 2:0:0 | up to 2          |
         | silico-002..008    |       |       | 32 (2:8:2)                  | 2 weeks (14-0) |                            |                           |                  |

The blaise machines are PowerPC based. This is a different architecture from Intel, with a different Linux distribution (software might need to be recompiled). Note that blaise-gw does not have the same distribution as the blaise compute nodes.

More information

Man pages: sbatch, srun, sacct, squeue, scancel, sinfo, sstat, sprio.

web pages:

    general: http://slurm.schedmd.com/documentation.html
 user guide: http://slurm.schedmd.com/quickstart.html