Basics

Slurm Basics

Slurm jobs are scheduled with the Slurm client tools. For a brief introduction, check out the Slurm Quickstart.

There are several ways to schedule a task in a Slurm-controlled system.

The main paradigm for job scheduling in a Slurm cluster is sbatch. It schedules a job and requests the resources claimed by the user.

The cluster provides two types of resources, CPUs and GPUs, which can be requested for jobs in variable amounts.

The GPUs in the cluster come in two flavours: the GPU objects tesla and gtx.

You may request a single GPU object via the option --gres=gpu:1. The Slurm scheduler reserves one GPU object exclusively for your job and therefore schedules jobs according to the free resources.

CPUs are requested with the -c or --cpus-per-task= option. For further information have a look at the man pages of srun and sbatch. Reading the Slurm documentation is also highly recommended.
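
For example, an interactive test run that requests one gtx GPU object and two CPUs could look like this (my_program.sh stands in for your own executable):

srun --partition=gpu --gres=gpu:gtx:1 -c 2 ./my_program.sh

The same options can be given as #SBATCH lines in a batch script, as shown in the example further below.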

The commands sinfo and squeue provide detailed information about the cluster’s state and the jobs running on it.
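
For example:

sinfo                 # overview of partitions and node states
sinfo -N -l           # node-oriented long listing
squeue -u $USER       # your own pending and running jobs
squeue -p gpu         # all jobs in the gpu partition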

CPU management and CPU-only jobs

Though the facility is called a GPU cluster, it is also appropriate for CPU-only computing, as it provides not only 12 GPUs but also 240 CPU cores. Effective utilization of the CPU resources can be tricky, so you should familiarize yourself with CPU management.

Choosing the appropriate partition

The cluster offers two partitions. Partitions can be considered as separate queues with slightly different features.

Partition selection is done with the parameter -p or --partition= in your srun commands and sbatch scripts. The default partition is cpu; jobs that are not explicitly assigned to a partition will be started there.

We have the cpu and the gpu partition. If you have a CPU-only job (not requesting any GPU resources with --gres=gpu:n), you should start it on the cpu partition.
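
A minimal CPU-only batch script for the cpu partition could look like the following sketch (cpu_task.sh stands in for your own payload):

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=8
#SBATCH --time=2:00:00
srun cpu_task.sh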

A job using a GPU should be started on the gpu partition, with one exception: jobs which request one GPU (with --gres=gpu:1) and more than 2 CPUs (with the -c or --cpus-per-task option) should use the cpu partition.
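
For example, a job that needs one GPU but eight CPUs would, according to this rule, be submitted with a script like the following sketch (my_mixed_job.sh stands in for your own payload):

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
srun my_mixed_job.sh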

The reason for this policy is not obvious and is explained below under GPU-blocking.

The example example.job.sbatch requests one GTX 1080 Ti for the job and calls the payload example.job.sh via srun.

File: example.job.sbatch

#!/bin/bash
#SBATCH --gres=gpu:gtx:1
#SBATCH --partition=gpu
#SBATCH --time=4:00:00
srun example.job.sh

File: example.job.sh

#!/bin/bash
module load cuda/9.0
# Select the GPU assigned to this task: take the (SLURM_LOCALID+1)-th entry of CUDA_VISIBLE_DEVICES
CUDA_DEVICE=$(echo "$CUDA_VISIBLE_DEVICES," | cut -d',' -f $((SLURM_LOCALID + 1)) );
# Sanity check: make sure we actually got a single device id
T_REGEX='^[0-9]$';
if ! [[ "$CUDA_DEVICE" =~ $T_REGEX ]]; then
        echo "error: no reserved gpu provided"
        exit 1;
fi
echo "Process $SLURM_PROCID of Job $SLURM_JOBID with the local id $SLURM_LOCALID using gpu id $CUDA_DEVICE (we may use gpu: $CUDA_VISIBLE_DEVICES on $(hostname))"
echo "computing on $(nvidia-smi --query-gpu=gpu_name --format=csv -i $CUDA_DEVICE | tail -n 1)"
sleep 15
echo "done"

In the payload script example.job.sh the variable CUDA_DEVICE tells your job which GPU device to use; it is derived from CUDA_VISIBLE_DEVICES and the task’s SLURM_LOCALID.
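
If your payload launches a GPU application, you can restrict it to the reserved device before starting it. A minimal sketch (my_gpu_app stands in for your own program):

export CUDA_VISIBLE_DEVICES=$CUDA_DEVICE   # make only the selected device visible to the application
./my_gpu_app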

If you want to be informed on job completion, you may add the following sbatch parameters to your sbatch file.

#SBATCH --mail-user=user@techfak.uni-bielefeld.de
#SBATCH --mail-type=END

Please only use your e-mail address @techfak.uni-bielefeld.de, because mails to other addresses will be dropped. You may set up a mail forward within the TechFak webmailer. For a tutorial see /media/compute/vol/tutorial.
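
Once both files are in place, the job is submitted and monitored with the standard Slurm commands:

sbatch example.job.sbatch     # submit the batch script
squeue -u $USER               # watch the job while it is pending or running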

GPU-blocking

The Slurm resource manager provides an abstraction layer for resource allocation: the user simply submits jobs together with resource requests to the cluster, and Slurm distributes the jobs to the hardware resources automatically. An appropriate node is chosen for each job; if the requested resources are not available, the job is queued and started later when the requested resources become available again.

Even though the resource abstraction layer exists, it’s a good idea to keep the real hardware structure of the cluster in mind. It consists of 6 nodes (real machines), each providing 2 GPUs and 40 CPUs (2 sockets with 10 real cores + 10 virtual cores each).

Starting with the simple default configuration, we had nodes in a state where GPUs were unusable because CPU-intensive jobs were able to consume all CPU resources on a single node. If all 40 CPUs on a node are in use, no further jobs can be started on that node; the node is then in the allocated state. If there are unused GPUs on an allocated node, they cannot be utilized, because every job needs at least one CPU to run.

To prevent this GPU lock, we established a second partition (this is the Slurm term for a job queue) called ‘cpu’. The key feature of this partition is that on a single node a maximum of 36 CPUs will be allocated to jobs from the ‘cpu’ partition. So jobs started via the ‘cpu’ partition can’t eat up all CPUs of a node; there are always at least 4 CPUs left.

Jobs using GPU resources, on the other hand, should be started on the other partition, called ‘gpu’. These jobs are able to utilize the 4 ‘spare’ CPUs on a node.

But even with this configuration it is still possible to produce a GPU lock: imagine a node where 36 CPUs are allocated to CPU-only jobs and no GPU is in use yet. Then a job from the ‘gpu’ partition which requests 1 GPU and 4 CPUs is started on this node. The node is now fully allocated (36 + 4 CPUs are in use), but only 1 GPU is used. The second GPU is blocked because there is no CPU left to start another job on the node.

To avoid this, just follow the simple rule: if a job utilizes one GPU and more than two CPUs, start it on the ‘cpu’ partition. Jobs requesting 2 GPUs don’t fall under this category, as they already utilize both GPUs on a node and no blocking condition can occur.
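
To check the current CPU and GPU allocation per node yourself (and to spot a potential blocking situation), sinfo’s format options are helpful, for example:

sinfo -N -p cpu,gpu -o "%n %P %C %G"   # per node: hostname, partition, CPUs allocated/idle/other/total, configured GPUs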