High Performance Computing

This is the second CTIT cluster, using SLURM as the workload manager.

Slurm is a highly configurable open-source workload manager. In its simplest configuration, it can be installed and configured in a few minutes. Use of optional plugins provides the functionality needed to satisfy the needs of demanding HPC centers. More complex configurations rely upon a database for archiving accounting records, managing resource limits, and supporting sophisticated scheduling algorithms.

Architecture.

[Figure: Slurm Architecture]
As depicted in the above picture, Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications.

Partitions.

The HPC/SLURM cluster contains multiple partitions:
Partition name  Nodes         Details
main            all
m610            ctit061..70   DDR Infiniband
r930            caserta
r730            ctit080       2x Tesla P100
t630            ctit081..83   2x Titan-X / 4x Titan-X / 4x 1080-Ti
gpu_p100        ctit080       2x Tesla P100
gpu_titan-x     ctit081..82   2x Titan-X / 4x Titan-X
gpu_1080-ti     ctit083       4x 1080-Ti
debug           all
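
The current layout and state of these partitions on the live system can be checked with sinfo; a quick sketch of possible invocations (the partition name gpu_p100 is taken from the table above):

  sinfo                      # overview of all partitions and node states
  sinfo -p gpu_p100 -N -l    # detailed per-node listing for one partition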

The main partition can be used to submit a job to any of the nodes.
The m610, r930, r730 and t630 partitions can be used to submit to nodes of a specific hardware model.
The gpu_p100, gpu_titan-x and gpu_1080-ti partitions can be used to submit to the nodes containing the specific required GPU models.
The debug partition is for testing only.
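
For example, a job script can be sent to a specific partition with the --partition option of sbatch (my_job.sh is just a placeholder name for the job script):

  sbatch --partition=main my_job.sh        # may run on any node
  sbatch --partition=gpu_p100 my_job.sh    # run on the Tesla P100 node (ctit080)
  sbatch --partition=debug my_job.sh       # short test runs only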

Features & Generic consumable Resources.

This cluster also supports features (--constraint) and/or generic consumable resources (--gres).

The available features are:
  • amd/intel, refers to the manufacturer of the node's CPUs
  • avx/avx2, refers to the AVX and AVX2 instruction sets, available on the newer nodes (required by some software, for example Keras and TensorFlow v1.6 and above)
  • tesla, refers to the Tesla family of cards (ctit080)
  • geforce, refers to the GeForce family of cards (ctit[081-083])
  • quadro, refers to the Quadro family of cards (ctit[084-085])
  • p100, refers to the specific Tesla P100 model (ctit080)
  • titan-x, refers to the specific Titan-X model (ctit[081-082])
  • gtx-1080ti, refers to the specific GeForce GTX 1080 Ti model (ctit083)
  • rtx-6000, refers to the specific Quadro RTX 6000 model (ctit[084-085])
The generic consumable resources are:
  • gpu[:pascal/turing][:amount] (currently only Pascal- and Turing-based GPUs are available)

Keep in mind that for GPU jobs you need to load the module for the required CUDA version!
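
As an illustration, a GPU job that combines a feature constraint, a generic resource request and the CUDA module could look like the sketch below (the module name cuda/10.0 and the program name my_gpu_app are assumptions; check module avail for the versions actually installed):

  #!/bin/bash
  #SBATCH --partition=gpu_p100
  #SBATCH --constraint=p100         # feature: only the Tesla P100 node
  #SBATCH --gres=gpu:pascal:1       # generic resource: one Pascal-based GPU

  module load cuda/10.0             # assumed module name, see module avail
  srun ./my_gpu_app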

Submitting Jobs

Before submitting jobs, please note the maximum number of jobs and the maximum number of job steps per job that can be scheduled.
These limits can be obtained with the scontrol show config command on korenvliet.
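
For example (MaxJobCount and MaxStepCount are the relevant keywords in the scontrol show config output):

  scontrol show config | grep -E 'MaxJobCount|MaxStepCount'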

sbatch is used to submit a job script for later execution. The script typically contains one task or, if required, multiple srun commands to launch parallel tasks.
See the slurm_sbatch and slurm_srun wiki pages for more details.
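
A minimal job script and its submission could look like the sketch below (the job name, resource values and the program ./my_app are placeholders to be adapted to the actual job):

  #!/bin/bash
  #SBATCH --job-name=example
  #SBATCH --partition=main
  #SBATCH --ntasks=4                # four parallel tasks
  #SBATCH --time=00:10:00           # ten minute time limit

  srun ./my_app                     # launch the tasks through srun

Submit the script with:

  sbatch example.sh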
