Machine Learning - Deep Learning - Artificial Intelligence jobs

Deep learning frameworks

Currently the following machine learning libraries are installed (a quick version check is sketched after this list):

  • tensorflow

  • keras

  • pytorch

  • sklearn
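To verify what a loaded environment actually provides, a quick import check such as the following can be used (a minimal sketch; note that pytorch imports as torch and scikit-learn imports as sklearn):

# quick sanity check of the installed frameworks and their versions
import tensorflow, keras, torch, sklearn
print(tensorflow.__version__, keras.__version__, torch.__version__, sklearn.__version__)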

Hardware optimized for deep learning

The following hosts are available for running deep learning jobs:

GPUs   host(s)        GPU / host        GPU ram (GB)   GPU resource flag
4      onode10        1 x Nvidia V100   32             v100d32q:1
       onode11        1 x Nvidia V100   32             v100d32q:1
       onode12        1 x Nvidia V100   32             v100d32q:1
       onode17        1 x Nvidia V100   32             v100d32q:1
8      anode[01-08]   1 x Nvidia K20x   4.5            k20:1

Allocating GPU resources

In order to use a GPU for a deep learning job (or any other job that requires a GPU), the following flag must be specified in the job script:

#SBATCH --gres=gpu

Not all the GPUs have the same amount of memory. Using --gres=gpu will allocate any available GPU. A specific GPU type can be selected by passing extra flags to --gres; the flag for each GPU type is listed in the GPU resource flag column of the table above. For example, to allocate an Nvidia V100 GPU with 32 GB of GPU ram, use the flag:

#SBATCH --gres=gpu:v100d32q:1
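Similarly, to allocate one of the K20x GPUs on anode[01-08] (using the flag from the table above):

#SBATCH --gres=gpu:k20:1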

Using tensorflow, keras or pytorch

The default environment for:

  • tensorflow, keras and sklearn: python/tensorflow

  • pytorch: python/pytorch

For any of these environments, the cuda module must be loaded as well.

A typical batch job script looks like:

#!/bin/bash

#SBATCH --job-name=keras-classify
#SBATCH --partition=gpu

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu
#SBATCH --mem=12000
#SBATCH --time=0-01:00:00

## set the environment modules
module purge
module load cuda
module load python/tensorflow

## execute the python job
python3 keras_classification.py
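Assuming the script above is saved as, for example, keras_classify.sh (a hypothetical name), it can be submitted with:

sbatch keras_classify.sh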

To connect to a jupyter notebook with the deep learning environment, copy the jupyter notebook server job script from the python jupyter server guide, then load the cuda module as shown above in addition to the needed machine learning framework module.

Deep learning jobs tips and best practices

It is recommended to:

  • develop and prototype using interactive jobs, such as jupyter notebooks, VNC sessions, or interactive batch jobs, and run the production models using batch jobs.

  • use checkpoints in order to allow a higher turnover of GPU jobs, since GPU resources are scarce.

Tensorflow has built-in checkpointing features for training models; a minimal sketch is shown below. Details on possible workflows for jobs with checkpoints can be found in the slurm jobs guide.
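The following is a minimal sketch of Keras checkpointing with a hypothetical toy model and file path; only the callback wiring is the point:

import tensorflow as tf

# hypothetical toy model; replace with the real one
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# save the weights after every epoch so that an interrupted job can resume
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="checkpoints/model-{epoch:02d}.h5",  # hypothetical path
    save_weights_only=True,
)
# model.fit(x_train, y_train, epochs=10, callbacks=[checkpoint])
# after a restart: model.load_weights(<latest checkpoint file>)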

Distributed training and inference with torch

Please follow the official documentation for distributed training and inference with torch.

Job script for octopus using GPUs

master:

torchrun --nproc-per-node=1 --nnodes=4 --node-rank=0 --master-addr=<SLURM_SUBMIT_HOST> --master-port=4444 \
   $PWD/my_torch_script.py baz --arg1=foo --arg2=bar

slave(s) (note that --master-addr on every node must point to the same master host and port):

torchrun --nproc-per-node=1 --nnodes=4 --node-rank=1 --master-addr=<COMPUTE_HOST> --master-port=4444 \
   $PWD/my_torch_script.py baz --arg1=foo --arg2=bar

torchrun --nproc-per-node=1 --nnodes=4 --node-rank=2 --master-addr=<COMPUTE_HOST> --master-port=4444 \
   $PWD/my_torch_script.py baz --arg1=foo --arg2=bar

torchrun --nproc-per-node=1 --nnodes=4 --node-rank=3 --master-addr=<COMPUTE_HOST> --master-port=4444 \
   $PWD/my_torch_script.py baz --arg1=foo --arg2=bar
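This guide does not reproduce the contents of my_torch_script.py; as a hedged sketch, a minimal DistributedDataParallel script launched with torchrun as above could look like the following (torchrun itself provides the rendezvous environment variables):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK and WORLD_SIZE
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# hypothetical toy model; gradients are all-reduced across the 4 nodes
model = DDP(torch.nn.Linear(10, 1).cuda(local_rank), device_ids=[local_rank])

# ... training loop goes here ...

dist.destroy_process_group()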

Distributed training with tensorflow and keras

Please follow the official documentation for distributed tensorflow training.

Job script for octopus using GPUs

Multi-worker training with MultiWorkerMirroredStrategy:

#!/bin/bash

#SBATCH --job-name=tf_dist
#SBATCH --partition=gpu

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu
#SBATCH --mem=32000
#SBATCH --time=0-01:00:00

## set the environment modules
module purge
module load cuda

# on all the nodes
module load python/ai-4

# define the port number
export TF_PORT=19090

# use srun to dump the compute node hostnames
srun hostname -s > hosts.out

# ensure that the tf config env var is unset
unset TF_CONFIG

srun python /home/shared/tensorflow_distributed/tensorflow_distributes_multi_worker_mirrored_strategy.py
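The shared script above is not reproduced in this guide; as a hedged sketch assuming a TensorFlow 2.x environment, a multi-worker script driven by this job script could build TF_CONFIG from hosts.out and the slurm task index like this:

import json
import os
import tensorflow as tf

# build the cluster spec from the hostnames dumped by srun above
port = os.environ.get("TF_PORT", "19090")
with open("hosts.out") as f:
    workers = [f"{line.strip()}:{port}" for line in f if line.strip()]

# each slurm task identifies itself through SLURM_PROCID
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": workers},
    "task": {"type": "worker", "index": int(os.environ["SLURM_PROCID"])},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    # hypothetical toy model; the real model is built inside the scope
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")
# model.fit(...) would follow, with the dataset sharded across the workers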

Troubleshooting

check the nvidia driver

To make sure that the job has been dispatched to a node that has a GPU, the following command can be included in the job script before the command that executes a notebook or runs the training, for example:

# BUNCH OF SBATCH COMMANDS (JOB HEADER)

## set the environment modules
module purge
module load cuda
module load python/tensorflow

nvidia-smi

The expected output should be similar to the following, where the Nvidia driver version is shown in addition to the CUDA toolkit version, some other specs of the GPU(s), and the list of GPU processes at the end (in this case, none):

[john@onode12 ~]$ nvidia-smi
Sun Dec  8 00:41:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.30       Driver Version: 430.30       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID V100D-32Q      On   | 00000000:02:02.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |  31657MiB / 32638MiB |     13%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|   No running processes found                                                |
+-----------------------------------------------------------------------------+

This snippet can be included in the job script.

check the deep learning framework backend

For tensorflow, when the following snippet is executed:

import tensorflow as tf
with tf.Session() as sess:
   devices = sess.list_devices()

the GPU(s) should be displayed in the output (search for ``StreamExecutor device (0): GRID V100D-32Q, Compute Capability 7.0``):

2019-12-08 01:01:44.211101: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-12-08 01:01:44.246405: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.247114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GRID V100D-32Q major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:02:02.0
2019-12-08 01:01:44.254377: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-12-08 01:01:44.288733: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-12-08 01:01:44.310036: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-12-08 01:01:44.345122: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-12-08 01:01:44.378862: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-12-08 01:01:44.395244: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-12-08 01:01:44.448277: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-12-08 01:01:44.448677: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.449664: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.450245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-12-08 01:01:44.451105: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-12-08 01:01:44.461730: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1996250000 Hz
2019-12-08 01:01:44.462592: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5650b0feed20 executing computations on platform Host. Devices:
2019-12-08 01:01:44.462644: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-12-08 01:01:44.463168: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.463942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GRID V100D-32Q major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:02:02.0
2019-12-08 01:01:44.464020: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-12-08 01:01:44.464037: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-12-08 01:01:44.464052: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-12-08 01:01:44.464067: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-12-08 01:01:44.464080: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-12-08 01:01:44.464094: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-12-08 01:01:44.464109: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-12-08 01:01:44.464181: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.464867: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.465426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-12-08 01:01:44.465481: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-12-08 01:01:44.729323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-08 01:01:44.729383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-12-08 01:01:44.729399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-12-08 01:01:44.729779: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.730551: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.731236: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.731866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14226 MB memory) -> physical GPU (device: 0, name: GRID V100D-32Q, pci bus id: 0000:02:02.0, compute capability: 7.0)
2019-12-08 01:01:44.734308: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5650b1acf9a0 executing computations on platform CUDA. Devices:
2019-12-08 01:01:44.734353: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GRID V100D-32Q, Compute Capability 7.0

This snippet can be included at the top of the notebook or python script.

Similar checks can be done for pytorch, for example:
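The following minimal check uses the standard pytorch CUDA API (not octopus-specific):

import torch

# should print True on a node with a visible GPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # e.g. the V100 or K20x device name
    print(torch.cuda.get_device_name(0))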