# GPU Computing on Odyssey

Odyssey has a number of nodes with NVIDIA Tesla general-purpose graphics processing units (GPGPUs) attached to them. You can use CUDA tools to run computational work on them and, for some workloads, see very significant speedups.

One node with 8 Tesla K20Xm GPUs is available for general use in the gpu partition; the remaining nodes are owned by various research groups and may be available when idle through the gpu_requeue partition. FAS members have access to the fas_gpu partition, which has 64 nodes with 2x K80s, 16 nodes with 2x K20Xm, and 8 nodes with 2x K20m. Direct access to these nodes by members of other groups is by special request: please visit the RC Portal and submit a help request for more information.
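
If you want to check the current state of these partitions before submitting, the standard SLURM query commands can be used; the sketch below simply plugs in the partition names mentioned above.

```bash
# Show node counts and states for the GPU partitions mentioned above
sinfo -p gpu,gpu_requeue,fas_gpu

# Show the full configuration (limits, allowed groups/accounts) of one partition
scontrol show partition gpu
```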

## GPGPUs on SLURM

To request a single GPU on SLURM, just add #SBATCH --gres=gpu to your submission script and it will give you access to one GPU. To request multiple GPUs, add #SBATCH --gres=gpu:n, where n is the number of GPUs. You can use this method to request CPUs and GPGPUs independently. So if you want 1 CPU core and 2 GPUs from our general-use GPU nodes in the gpu partition, you would specify:

```bash
#SBATCH -p gpu
#SBATCH -n 1
#SBATCH --gres=gpu:2
```

When you submit a GPU job, SLURM automatically selects GPUs for you and restricts your job to those GPUs. In your code you reference them using zero-based indexing, i.e. indices in [0, n) where n is the number of GPUs requested. For example, if you're using a GPU-enabled TensorFlow build and requested 2 GPUs, you would simply reference gpu:0 or gpu:1 from your code.
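
Putting the pieces together, here is a minimal sketch of a complete submission script for two GPUs in the gpu partition; the runtime, memory, and the my_gpu_program executable are placeholders for your own workload.

```bash
#!/bin/bash
#SBATCH -p gpu                # general-use GPU partition
#SBATCH -n 1                  # one CPU core
#SBATCH -t 0-01:00            # one hour of runtime (placeholder, adjust as needed)
#SBATCH --mem 8000            # 8 GB of memory (placeholder)
#SBATCH --gres=gpu:2          # two GPUs

# SLURM restricts the job to the GPUs it assigned; inside the job they are
# indexed 0..n-1 (here 0 and 1), and CUDA_VISIBLE_DEVICES is typically set
# to the assigned devices.
echo "Assigned GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi

module load cuda/9.0-fasrc02  # CUDA toolkit and runtime libraries
./my_gpu_program              # placeholder: your GPU-enabled executable
```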

For an interactive session to work with the GPUs, you can use the following. Once on the GPU node, you can run nvidia-smi to get information about the assigned GPUs.
```bash
srun --pty -p gpu -t 0-06:00 --mem 8000 --gres=gpu:1 /bin/bash
```
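
Once the interactive shell starts on the GPU node, a couple of quick checks can confirm what was assigned; this is just a sketch using standard nvidia-smi options.

```bash
# Run these inside the interactive session, on the GPU node itself
nvidia-smi -L          # list the GPU(s) assigned to this job (model and UUID)
watch -n 1 nvidia-smi  # refresh utilization and memory usage every second
```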

## CUDA Runtime

The current version of the NVIDIA driver installed on all GPU-enabled nodes of the Odyssey cluster is 396.26, which supports CUDA version 9.

To use the toolkit and the additional runtime libraries (cuBLAS, cuFFTW, ...), remember to always load the CUDA module in your SLURM job script or interactive session:

```bash
$ module load cuda/9.0-fasrc02
```
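
With the module loaded, the nvcc compiler and the runtime libraries are on your path. The sketch below compiles and runs a hypothetical CUDA source file, hello.cu; the file name is a placeholder.

```bash
module load cuda/9.0-fasrc02

nvcc --version      # the CUDA compiler is provided by the cuda module
nvcc -o hello hello.cu   # compile a (hypothetical) CUDA source file
./hello             # run it on a GPU node (batch job or interactive session)
```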

NOTE: In the past our CUDA installations were heterogeneous, and different nodes on the cluster provided different versions of the CUDA driver. For this reason you might have used the SLURM flag --constraint=cuda-$version (for example --constraint=cuda-7.5) in your job submissions to specifically request nodes that supported that version. This is no longer needed, as our CUDA modules are the same throughout the cluster, and you should remove those flags from your scripts.

## Using CUDA-dependent modules

CUDA-dependent applications are accessed on Odyssey in a manner similar to compilers and MPI libraries: a CUDA module must be loaded before the application's module becomes available. For example, to use cuDNN, a CUDA-based neural network library from NVIDIA, the following command will work:

```bash
$ module load cuda/9.0-fasrc02 cudnn/7.0_cuda9.0-fasrc01
```

If you don't load the CUDA module first, the cuDNN module is not available:

```
$ module purge
$ module load cudnn/7.0_cuda9.0-fasrc01
Lmod has detected the following error:
The following module(s) are unknown: "cudnn/7.0_cuda9.0-fasrc01"
```
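
To recover, load the CUDA module first and then cuDNN, as in the working example above; a quick sketch:

```bash
# Load the CUDA module first, then the CUDA-dependent cuDNN module
module load cuda/9.0-fasrc02 cudnn/7.0_cuda9.0-fasrc01
module list   # confirm that both cuda and cudnn are now loaded
```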
 
Please use the module-query command or our user Portal to find available versions and how to load them. More information on software modules, and on how to run jobs, can be found in our documentation.
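
If you prefer the standard Lmod commands, module spider can also locate versions that stay hidden until their prerequisite modules are loaded; a short sketch:

```bash
# module spider searches all versions known to Lmod, including modules
# that only become loadable after a prerequisite (such as cuda) is loaded
module spider cudnn

# show which modules must be loaded before this particular version is available
module spider cudnn/7.0_cuda9.0-fasrc01
```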

See an example of how to use the cuda module to install and use TensorFlow.

CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at Attribution.