Machine Learning - Deep Learning - Artificial Intelligence jobs

Deep learning frameworks

Currently the following machine learning libraries are installed (a quick version check is sketched after this list):

  • tensorflow

  • keras

  • pytorch

  • sklearn
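To verify what a loaded environment actually provides, a quick import check such as the following can be used (a minimal sketch; note that pytorch imports as torch and scikit-learn imports as sklearn):

# quick sanity check of the installed frameworks and their versions
import tensorflow, keras, torch, sklearn
print(tensorflow.__version__, keras.__version__, torch.__version__, sklearn.__version__)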

Hardware optimized for deep learning

The following hosts are available for running deep learning jobs:

GPUs   host(s)        GPU / host        GPU ram (GB)   GPU resource flag
4      onode10        1 x Nvidia V100   32             v100d32q:1
       onode11        1 x Nvidia V100   32             v100d32q:1
       onode12        1 x Nvidia V100   32             v100d32q:1
       onode17        1 x Nvidia V100   32             v100d32q:1
8      anode[01-08]   1 x Nvidia K20x   4.5            k20:1

Allocating GPU resources

In order to use a GPU for a deep learning job (or any other job that requires a GPU), the following flag must be specified in the job script:

#SBATCH --gres=gpu

Not all the GPUs have the same amount of memory. Using --gres=gpu will allocate any available GPU. A specific GPU type can be selected by passing extra flags to --gres; the flag for each GPU type is listed in the GPU resource flag column of the table above. For example, to allocate an Nvidia V100 GPU with 32 GB of GPU ram, use the flag:

#SBATCH --gres=gpu:v100d32q:1
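Similarly, to allocate one of the K20x GPUs on anode[01-08] (using the flag from the table above):

#SBATCH --gres=gpu:k20:1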

Using tensorflow, keras or pytorch

The default environment for:

  • tensorflow, keras and sklearn: python/tensorflow

  • pytorch: python/pytorch

For any of these environments, the cuda module must be loaded as well.

A typical batch job script looks like:

#!/bin/bash

#SBATCH --job-name=keras-classify
#SBATCH --partition=gpu

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu
#SBATCH --mem=12000
#SBATCH --time=0-01:00:00

## set the environment modules
module purge
module load cuda
module load python/tensorflow

## execute the python job
python3 keras_classification.py
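Assuming the script above is saved as, for example, keras_classify.sh (a hypothetical name), it can be submitted with:

sbatch keras_classify.sh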

To connect to a jupyter notebook with the deep learning environment, copy the jupyter notebook server job script from the python jupyter server guide, then load the cuda module as shown above in addition to the needed machine learning framework module.

Deep learning jobs tips and best practices

It is recommended to:

  • develop and prototype using interactive jobs, such as jupyter notebooks, VNC sessions, or interactive batch jobs, and run the production models using batch jobs.

  • use checkpoints in order to allow a higher turnover of GPU jobs, since GPU resources are scarce.

Tensorflow has built-in checkpointing features for training models; a minimal sketch is shown below. Details on possible workflows for jobs with checkpoints can be found in the slurm jobs guide.
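The following is a minimal sketch of Keras checkpointing with a hypothetical toy model and file path; only the callback wiring is the point:

import tensorflow as tf

# hypothetical toy model; replace with the real one
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# save the weights after every epoch so that an interrupted job can resume
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="checkpoints/model-{epoch:02d}.h5",  # hypothetical path
    save_weights_only=True,
)
# model.fit(x_train, y_train, epochs=10, callbacks=[checkpoint])
# after a restart: model.load_weights(<latest checkpoint file>)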

Distributed training and inference with torch

Please follow the official documentation for distributed training and inference with torch.

Job script for octopus using GPUs

master:

torchrun --nproc-per-node=1 --nnodes=4 --node-rank=0 --master-addr=<SLURM_SUBMIT_HOST> --master-port=4444 \
   $PWD/my_torch_script.py baz --arg1=foo --arg2=bar

slave(s) (note that --master-addr on every node must point to the same master host and port):

torchrun --nproc-per-node=1 --nnodes=4 --node-rank=1 --master-addr=<COMPUTE_HOST> --master-port=4444 \
   $PWD/my_torch_script.py baz --arg1=foo --arg2=bar

torchrun --nproc-per-node=1 --nnodes=4 --node-rank=2 --master-addr=<COMPUTE_HOST> --master-port=4444 \
   $PWD/my_torch_script.py baz --arg1=foo --arg2=bar

torchrun --nproc-per-node=1 --nnodes=4 --node-rank=3 --master-addr=<COMPUTE_HOST> --master-port=4444 \
   $PWD/my_torch_script.py baz --arg1=foo --arg2=bar
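This guide does not reproduce the contents of my_torch_script.py; as a hedged sketch, a minimal DistributedDataParallel script launched with torchrun as above could look like the following (torchrun itself provides the rendezvous environment variables):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK and WORLD_SIZE
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# hypothetical toy model; gradients are all-reduced across the 4 nodes
model = DDP(torch.nn.Linear(10, 1).cuda(local_rank), device_ids=[local_rank])

# ... training loop goes here ...

dist.destroy_process_group()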

Distributed training with tensorflow and keras

Please follow the official documentation for distributed tensorflow training.

Job script for octopus using GPUs

Multi-worker training with MultiWorkerMirroredStrategy:

#!/bin/bash

#SBATCH --job-name=tf_dist
#SBATCH --partition=gpu

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu
#SBATCH --mem=32000
#SBATCH --time=0-01:00:00

## set the environment modules
module purge
module load cuda

# on all the nodes
module load python/ai-4

# define the port number
export TF_PORT=19090

# use srun to dump the compute node hostnames
srun hostname -s > hosts.out

# ensure that the tf config env var is unset
unset TF_CONFIG

srun python /home/shared/tensorflow_distributed/tensorflow_distributes_multi_worker_mirrored_strategy.py
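The shared script above is not reproduced in this guide; as a hedged sketch assuming a TensorFlow 2.x environment, a multi-worker script driven by this job script could build TF_CONFIG from hosts.out and the slurm task index like this:

import json
import os
import tensorflow as tf

# build the cluster spec from the hostnames dumped by srun above
port = os.environ.get("TF_PORT", "19090")
with open("hosts.out") as f:
    workers = [f"{line.strip()}:{port}" for line in f if line.strip()]

# each slurm task identifies itself through SLURM_PROCID
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": workers},
    "task": {"type": "worker", "index": int(os.environ["SLURM_PROCID"])},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    # hypothetical toy model; the real model is built inside the scope
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")
# model.fit(...) would follow, with the dataset sharded across the workers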

Troubleshooting

check the nvidia driver

To make sure that the job has been dispatched to a node that has a GPU, the following command can be included in the job script before the command that executes a notebook or runs the training, for example:

# BUNCH OF SBATCH COMMANDS (JOB HEADER)

## set the environment modules
module purge
module load cuda
module load python/tensorflow

nvidia-smi

The expected output should be similar to the following, where the Nvidia driver version is shown in addition to the CUDA toolkit version, some other specs of the GPU(s), and the list of GPU processes at the end (in this case, none):

[john@onode12 ~]$ nvidia-smi
Sun Dec  8 00:41:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.30       Driver Version: 430.30       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID V100D-32Q      On   | 00000000:02:02.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |  31657MiB / 32638MiB |     13%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|   No running processes found                                                |
+-----------------------------------------------------------------------------+

This snippet can be included in the job script.

check the deep learning framework backend

For tensorflow, when the following snippet is executed:

import tensorflow as tf
with tf.Session() as sess:
   devices = sess.list_devices()

the GPU(s) should be displayed in the output (search for ``StreamExecutor device (0): GRID V100D-32Q, Compute Capability 7.0``):

2019-12-08 01:01:44.211101: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-12-08 01:01:44.246405: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.247114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GRID V100D-32Q major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:02:02.0
2019-12-08 01:01:44.254377: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-12-08 01:01:44.288733: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-12-08 01:01:44.310036: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-12-08 01:01:44.345122: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-12-08 01:01:44.378862: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-12-08 01:01:44.395244: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-12-08 01:01:44.448277: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-12-08 01:01:44.448677: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.449664: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.450245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-12-08 01:01:44.451105: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-12-08 01:01:44.461730: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1996250000 Hz
2019-12-08 01:01:44.462592: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5650b0feed20 executing computations on platform Host. Devices:
2019-12-08 01:01:44.462644: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-12-08 01:01:44.463168: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.463942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GRID V100D-32Q major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:02:02.0
2019-12-08 01:01:44.464020: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-12-08 01:01:44.464037: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-12-08 01:01:44.464052: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-12-08 01:01:44.464067: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-12-08 01:01:44.464080: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-12-08 01:01:44.464094: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-12-08 01:01:44.464109: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-12-08 01:01:44.464181: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.464867: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.465426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-12-08 01:01:44.465481: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-12-08 01:01:44.729323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-08 01:01:44.729383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-12-08 01:01:44.729399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-12-08 01:01:44.729779: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.730551: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.731236: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-08 01:01:44.731866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14226 MB memory) -> physical GPU (device: 0, name: GRID V100D-32Q, pci bus id: 0000:02:02.0, compute capability: 7.0)
2019-12-08 01:01:44.734308: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5650b1acf9a0 executing computations on platform CUDA. Devices:
2019-12-08 01:01:44.734353: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GRID V100D-32Q, Compute Capability 7.0

This snippet can be included at the top of the notebook or python script.

Similar checks can be done for pytorch, for example:
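The following minimal check uses the standard pytorch CUDA API (not octopus-specific):

import torch

# should print True on a node with a visible GPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # e.g. the V100 or K20x device name
    print(torch.cuda.get_device_name(0))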