There are two NVIDIA DGX systems in the cluster, each containing 8 A100 GPU cards with 80GB of memory.
The cards are partitioned into smaller units using a technology called MIG (Multi-Instance GPU).
These GPU nodes are named dogfish-01 and dogfish-02.
On dogfish-01 the cards are partitioned into 7 units, each with 10GB of memory - a total of 56 units.
On dogfish-02 the cards are partitioned into 2 units, each with 40GB of memory - a total of 16 units.
Two additional nodes are named puffin-01 and puffin-02.
Each contains 8 NVIDIA A30 cards with 24GB of memory. These cards are not partitioned into smaller units.
This partition includes node dogfish-01, whose A100 GPUs are partitioned into units with 10GB of memory each.
You can allocate only one GPU unit per job on this partition!
To allocate resources on one of the GPU units use this flag in your submit command or script:
--gres=gpu:a100-1-10
And in an sbatch script:
#SBATCH --gres=gpu:a100-1-10
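Putting this together, a minimal sbatch script for this partition could look as follows. This is a sketch: the job name, time limit, and output file name are placeholders, not cluster policy.

```shell
#!/bin/bash
#SBATCH --job-name=mig-example     # placeholder job name
#SBATCH --gres=gpu:a100-1-10       # one 10GB MIG unit on dogfish-01
#SBATCH --time=01:00:00            # placeholder time limit
#SBATCH --output=%j.out            # write output to <jobid>.out

# List the MIG instance that Slurm assigned to this job
nvidia-smi -L
```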
This partition includes node dogfish-02, whose A100 GPUs are partitioned into units with 40GB of memory each.
You can allocate only one GPU unit per job on this partition!
To allocate resources on one of the GPU units use this flag in your submit command or script:
--gres=gpu:a100-3-40
And in an sbatch script:
#SBATCH --gres=gpu:a100-3-40
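The same flag also works on the command line. For example, a sketch of requesting an interactive session on one 40GB unit (the time limit here is illustrative):

```shell
# Request an interactive shell with one 40GB MIG unit attached
srun --gres=gpu:a100-3-40 --time=00:30:00 --pty bash
```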
The puffin partition includes nodes puffin-01 and puffin-02 with A30 GPUs, each with 24GB of memory.
You can allocate more than one GPU per job on this partition.
To submit a job to this partition use the following flags in your submit command or script:
--partition=puffin
--gres=gpu:a30
For 2 GPUs use:
--gres=gpu:a30:2
And in an sbatch script:
#SBATCH --partition=puffin
#SBATCH --gres=gpu:a30
And for 2 GPUs use:
#SBATCH --gres=gpu:a30:2
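Putting the flags together, a sketch of a two-GPU job script for puffin (the job name and time limit are placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=a30-example   # placeholder job name
#SBATCH --partition=puffin
#SBATCH --gres=gpu:a30:2         # two full A30 cards
#SBATCH --time=02:00:00          # placeholder time limit

# Slurm exposes the allocated GPUs to the job; list them
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi -L
```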
DO NOT use the following option to request GPU resources:
--gres gpu
With this option Slurm will allocate the first available GPU of any type.
This can result in allocating a GPU with more memory than you need and, as a result, in unwanted expenses.
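If you are unsure which GRES string a partition accepts, you can query Slurm before submitting, using standard sinfo output-format options:

```shell
# Show each partition alongside its generic resources (GRES)
sinfo -o "%P %G"
```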