Large language models#

Author: Mher Kazandjian

Warning

This document is a work in progress and is subject to change.

Environments#

The following environments are available on octopus for running large language models and developing new models:

  • python/ai-4

  • python/transformers/r1

Resource requirements estimation tips and tricks#

Warning

The following tips and tricks are not guaranteed to work for all models. They are just a starting point for estimating the resources required to run a model.

Warning

Do not expect code or notebooks copied and pasted from huggingface, kaggle, github, stack overflow or elsewhere to work on octopus out of the box. HPC constraints and optimizations must be taken into account, and you should know what you are doing and understand the code being executed. Running such code blindly is a common mistake that results in degraded performance or wrong results.

Warning

Make sure that your scripts make use of the accelerators whenever GPU resources are requested in the job script. The V100 GPUs are enterprise-grade, high-end GPUs; if the performance you are getting is slower than expected, there is most likely a bottleneck in your script. To track it down (a minimal GPU sanity check is sketched after this list):

  • Make sure that the compute node you are running on has a GPU by executing nvidia-smi. If you do not get any output, the compute node does not have a GPU.

  • Check the data loading, data augmentation, model architecture, and the optimizer.

  • Make sure that you are using the GPU builds of e.g. PyTorch, TensorFlow, etc. and not the CPU builds.

  • If you installed additional packages using pip or conda, or set up your environment from scratch, ensure that your changes do not override the pre-installed packages.

  • Profile your script to identify the bottleneck. You can use basic profilers, or monitor GPU usage with gpu_usage_live or nvtop.

  • Other advanced profilers are available on octopus. You can use nsys or nvprof to profile your script and nvvp to visualize the profiling results.

  • Understand the resource requirements of your model for training or inference and compare them to the capabilities of the GPU.

  • To optimize reads and writes you can cache data to the RAM disk in /dev/shm on onode10, onode11, onode12 and onode17. These nodes have 128 GB of RAM that can be used to cache data.

  • If all of the above attempts fail, please contact HPC support for further assistance.
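
The following is a minimal sanity check, sketched for a PyTorch-based script (adapt the names to your own code), that confirms the job actually sees a GPU and that the model and inputs are placed on it:

import torch

# if this prints False, check the #SBATCH --gres line of your job script and
# run nvidia-smi on the compute node
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

# a model or inputs left on the CPU is a common reason why nvtop or
# gpu_usage_live report an idle GPU; move both to the GPU explicitly
device = "cuda" if torch.cuda.is_available() else "cpu"
# model = model.to(device)
# batch = batch.to(device)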

Available models in the model library#

Currently the HPC service provides two main repositories for large language models:

  • hugging face models: /scratch/shared/ai/models/llms/hugging_face

  • ollama models: /scratch/shared/ai/models/llms/ollama

In total, around 100 models are available in the model library with a total size of 5 TB. The following Hugging Face models are available on octopus (last updated 2024-12-13):

core24/
└── jais-13b-chat
FreedomIntelligence
├── AceGPT-13B
├── AceGPT-7B
└── AceGPT-7B-chat
google
├── gemma-2-2b-it
└── gemma-2-9b-it
inceptionai
├── jais-13b
├── jais-13b-chat
├── jais-30b-chat-v3
└── jais-30b-v3
meta-llama
├── llama-2-13b
├── llama-2-13b-chat
├── llama-2-13b-hf
├── llama-2-7b
├── llama-2-7b-chat
├── llama-2-7b-chat-hf
├── llama-2-7b-hf
├── Llama-3.1-70B-Instruct
├── Llama-3.1-8B-Instruct
├── Llama-3.2-11B-Vision-Instruct
├── Llama-3.2-1B-Instruct
├── Llama-3.2-3B-Instruct
├── Llama-3.2-90B-Vision-Instruct
├── Meta-Llama-3-8B
└── Meta-Llama-3-8B-Instruct
mistralai
├── Mistral-7B-Instruct-v0.3
├── Mistral-7B-v0.1
├── mistral-7b-v0.1.Q4_K_M
├── Mistral-7B-v0.3
├── Mistral-Large-Instruct-2407
└── Mixtral-8x7B-Instruct-v0.1
pfnet
└── Llama3-Preferred-MedSwallow-70B
Qwen
├── Qwen2.5-72B-Instruct
├── Qwen2.5-7B-Instruct
├── Qwen2-VL-2B-Instruct
├── Qwen2-VL-72B-Instruct
└── Qwen2-VL-7B-Instruct
tiiaue
├── falcon-180b
├── falcon-1b
├── falcon-40b
├── falcon-40b-instruct
└── falcon-7b

The following ollama models are available (last updated 2024-12-13) (see also here):

llama3.1:latest                     42182419e950    4.7 GB    2 weeks ago
llama3.1:405b                       65fa6b82bfda    228 GB    7 weeks ago
nemotron:latest                     2262f047a28a    42 GB     7 weeks ago
llama3.2:1b                         baf6a787fdff    1.3 GB    7 weeks ago
gemma:latest                        a72c7f4d0a15    5.0 GB    7 months ago
gemma:2b                            b50d6c999e59    1.7 GB    7 months ago
llama3:70b-instruct                 bcfb190ca3a7    39 GB     7 months ago
llama3:70b                          bcfb190ca3a7    39 GB     7 months ago
llama3:latest                       71a106a91016    4.7 GB    7 months ago
llama3:instruct                     71a106a91016    4.7 GB    7 months ago
mixtral:8x7b-text-v0.1-fp16         221f0bf341e3    93 GB     10 months ago
codellama:70b-code                  f51f75d243f2    38 GB     10 months ago
codellama:70b                       e59b580dfce7    38 GB     10 months ago
codellama:70b-instruct              e59b580dfce7    38 GB     10 months ago
deepseek-coder:33b-instruct-fp16    b54904179335    66 GB     10 months ago
deepseek-coder:33b-base-q4_0        ca50732c8ee1    18 GB     10 months ago
deepseek-coder:33b                  acec7c0b0fd9    18 GB     10 months ago
mistral:latest                      61e88e884507    4.1 GB    10 months ago
deepseek-coder:6.7b                 ce298d984115    3.8 GB    10 months ago
deepseek-coder:1.3b-base-q8_0       71f702eff852    1.4 GB    10 months ago
deepseek-coder:1.3b                 3ddd2d3fc8d2    776 MB    10 months ago
deepseek-coder:latest               3ddd2d3fc8d2    776 MB    10 months ago
deepseek-coder:1.3b-instruct        3ddd2d3fc8d2    776 MB    10 months ago
megadolphin:latest                  8fa55398527b    67 GB     10 months ago
dolphin-mixtral:8x7b                cfada4ba31c7    26 GB     10 months ago
zephyr:latest                       bbe38b81adec    4.1 GB    10 months ago
stablelm-zephyr:latest              0a108dbd846e    1.6 GB    10 months ago
deepseek-coder:33b-instruct         acec7c0b0fd9    18 GB     10 months ago
wizardlm:70b-llama2-q4_0            2d269a65a092    38 GB     10 months ago
yarn-mistral:7b-128k                6511b83c33d5    4.1 GB    10 months ago
wizardlm-uncensored:13b             886a369d74fc    7.4 GB    10 months ago
falcon:180b-chat                    e2bc879d7cee    101 GB    10 months ago
mixtral:latest                      7708c059a8bb    26 GB     10 months ago
starcoder:7b                        53fdbc3a2006    4.3 GB    10 months ago
starcoder:15b                       fc59c84e00c5    9.0 GB    10 months ago
codellama:34b                       685be00e1532    19 GB     10 months ago
starcoder:3b                        847e5a7aa26f    1.8 GB    10 months ago
starcoder:1b                        77e6c46054d9    726 MB    10 months ago
falcon:7b                           4280f7257e73    4.2 GB    10 months ago
medllama2:latest                    a53737ec0c72    3.8 GB    10 months ago
mixtral:8x7b-instruct-v0.1-q8_0     a6689be5de7d    49 GB     10 months ago
llava:latest                        cd3274b81a85    4.5 GB    10 months ago
mistral:instruct                    61e88e884507    4.1 GB    10 months ago
phi:latest                          e2fd6321a5fe    1.6 GB    10 months ago
tinyllama:latest                    2644915ede35    637 MB    10 months ago

For the latest list check the content of the directories listed above.

Since downloading large models is time consuming, please email it.helpdesk@aub.edu.lb about any additional models that you would like to have deployed that are not in the list above.

It is good practice to cache the model (if it fits) to /dev/shm/ to speed up loading for repeated use. The read and write speed of /dev/shm is around 4 GB/s; loading the Hugging Face mistral 7B model takes about 5 seconds.

Note

Some of the models that are gated on hugging face require permission to access them on octopus as well. These models are the following:

  • meta-llama

  • mistral

  • inceptionai

In order to access them, please contact it.helpdesk@aub.edu.lb. Having access to the models on huggingface is a pre-requisite for being granted access to them on octopus. All the ollama models are available to all users.

To cache a model (e.g. Llama-3.1-8B-Instruct) to /dev/shm/ the following command can be used:

rsync -PrlHvtpog /scratch/shared/ai/models/llms/hugging_face/meta-llama/Llama-3.1-8B-Instruct \
    --exclude "*.git*" --exclude "*original*" \
    /dev/shm

The following snippet can be used to load a model from the model library on octopus:

import os
from transformers import AutoModelForCausalLM, AutoTokenizer

cache_dir = '/scratch/shared/ai/models/llms/hugging_face'
model_name = "meta-llama/Llama-3.2-1B-Instruct"
model_path = os.path.join(cache_dir, model_name)
# the tokenizer is loaded from the same path as the model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, cache_dir=cache_dir)

#
# do something with the model ...
#

# dump the modified model to disk (expand '~' before creating the directory)
trial_no = 0
model_out = os.path.expanduser(
    os.path.join('~/scratch/models_workspace/', f'{model_name}_{trial_no}'))
os.makedirs(model_out, exist_ok=True)
model.save_pretrained(model_out)

print('done')
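
If the model has already been cached to /dev/shm with the rsync command above, it can be loaded directly from the RAM disk instead; a minimal sketch assuming the Llama-3.1-8B-Instruct copy from that example:

from transformers import AutoModelForCausalLM, AutoTokenizer

# path created by the rsync command above (the basename of the source directory)
model_path = "/dev/shm/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)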

Download the model and datasets#

If the model is small, you can download it into your scratch directory. Please set your HF_HOME environment variable to a location in your scratch directory before downloading the model. Executing the following commands will create a cache directory for models and Hugging Face datasets in your scratch directory or in a custom local directory.

# read about the huggingface-cli tool
module load python/ai-4
huggingface-cli --help

export HF_HOME=~/scratch/huggingface
# to make this permanent add the above line to your .bashrc file, it will take effect next time
# you login to the system or source ~/.bashrc
echo "export HF_HOME=~/scratch/huggingface" >> ~/.bashrc

# (optional) only for gated content
# login to huggingface or set up your api key via the environment variable HF_TOKEN
huggingface-cli login
# or set the HF_TOKEN environment variable
export HF_TOKEN="replace_this_your_huggingface_token_replace_this"

#
# specify the model or dataset name and download it
#
MODEL="Qwen/Qwen2.5-VL-3B-Instruct"
huggingface-cli download ${MODEL}

# to download the model to a custom path
DIRPATH=~/scratch/my_custom_dir         # use $PWD to download in the local dir
mkdir -p ${DIRPATH}/${MODEL}
huggingface-cli download ${MODEL} --local-dir ${DIRPATH}/${MODEL}
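
The same download can also be triggered from Python via the huggingface_hub package (a minimal sketch, assuming HF_HOME has been exported as above so the files land on scratch):

from huggingface_hub import snapshot_download

# downloads any missing files into the HF_HOME cache and returns the local
# snapshot path; repo_type="dataset" does the same for datasets
local_path = snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct")
print(local_path)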

A similar approach can be followed to download datasets from huggingface.

# (optional) only for gated content
# login to huggingface or set up your api key via the environment variable HF_TOKEN
huggingface-cli login
# or set the HF_TOKEN environment variable
export HF_TOKEN="replace_this_your_huggingface_token_replace_this"

# specify the dataset name and download it
DATASET="PULSE-ECG/ECGBench"
huggingface-cli download --repo-type dataset "${DATASET}"

# to download the dataset to a custom path
DIRPATH=~/scratch/my_custom_dir         # use $PWD to download in the local dir
mkdir -p "${DIRPATH}/${DATASET}"
huggingface-cli download --repo-type dataset "${DATASET}" --local-dir "${DIRPATH}/${DATASET}"
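
Once downloaded (or cached), a dataset can be loaded directly in Python; a minimal sketch using the yahma/alpaca-cleaned dataset that also appears in the fine-tuning example further below:

from datasets import load_dataset

# with HF_HOME pointing at ~/scratch/huggingface (exported above), the cached
# copy on scratch is reused instead of re-downloading into the home directory
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
print(dataset)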

Running inference and evaluating models#

Hugging face models using the transformers package#

In the following example the mistral 7B model will be evaluated using the transformers/r1 pre-deployed environment. The job script and the python script that runs the model are available on octopus at:

/apps/shared/...../path/to/example1

The expected evaluation time of the example below is ?? seconds. This example produces ?? tokens at an average rate of ?? tokens / min. During this test a total of ?? GB is transferred from the disk to the GPU and a total of ?? (float??) operations are performed. The total memory transfer from VRAM to the GPU is ?? GB at an average rate of ?? GB/s and a peak of ?? GB/s.

The job script is the following:

############################ eval_mistral.sh ###############################
#!/bin/bash

#SBATCH --job-name=eval-mistral
#SBATCH --account=abc123

#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32000
#SBATCH --gres=gpu:v100d32q:1
#SBATCH --time=0-00:10:00

#SBATCH --mail-type=ALL
#SBATCH --mail-user=abc123@mail.aub.edu

# prepare the scripts and cache the model
cp /scratch/llms/.../mistral7b... /dev/shm
cp /apps/shared/ai/.../eval_mistral_userguide.py /dev/shm/

# load the transformers environment and evaluate the model
module load python/transformers/r1
cd /dev/shm
python eval_mistral_userguide.py
########################## end eval_mistral.sh #############################
The Python script eval_mistral_userguide.py is the following:

import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model_name = "mistralai/Mistral-7B-v0.1"

cache_dir = '/dev/shm/huggingface_cache'

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    cache_dir=cache_dir)

# move the model to the GPU once, before the evaluation loop
model.to(device)

# evaluate the model for 10 identical prompts
prompts = ["My favourite condiment is"] * 10
for prompt in tqdm.tqdm(prompts):
    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
    print(tokenizer.batch_decode(generated_ids)[0])

Evaluating quantized models#

Once a model is fine-tuned or trained (see below), it is convenient to quantize it and evaluate the quantized model for testing purposes, assuming that the loss in accuracy is not too high. For use cases that do not require high accuracy, quantized models are good enough and they outperform the llama7B model.

Todo

double check this statement.

Using llama.cpp#

In this section I will explain the basics of quantization and how to evaluate such models without any optimization on a CPU. Later in this section I will describe and demonstrate how to scale the model evaluation using a single GPU and multiple GPUs across several hosts, or across multiple hosts using only CPUs, and compare the performance.

Evaluate the quantized model on a CPU - non optimized#

module load gcc/12
rsync -PrlHvtpog /scratch/shared/ai/models/llms/mistralai/Mistral-7B-v0.1/mistral-7b-v0.1.Q4_K_M /dev/shm/
/apps/sw/llama.cpp/amd-avx2/bin/main -t 16 -ngl 24 --color --temp 0.7 -n 1 -m /dev/shm/mistral-7b-v0.1.Q4_K_M/mistral-7b-v0.1.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

Evaluate the quantized model on a CPU (optimized)#

module load gcc/12
module load cuda/12
rsync -PrlHvtpog /scratch/shared/ai/models/llms/mistralai/Mistral-7B-v0.1/mistral-7b-v0.1.Q4_K_M /dev/shm/
/apps/sw/llama.cpp/amd-v100-cublas-12/bin/main -t 8 -ngl 24 --color --temp 0.7 -n 1 -m /dev/shm/mistral-7b-v0.1.Q4_K_M/mistral-7b-v0.1.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

Evaluate the quantized model on a CPU across multiple hosts#

module load llama.cpp/mpi

Evaluate the quantized model on a GPU#

module load llama.cpp/gpu-v100
...

module load llama.cpp/gpu-k20
...

Evaluate the quantized model across multiple GPUs#

module load llama.cpp/gpu-v100-mpi
...

module load llama.cpp/gpu-k20-mpi
...

Benchmark the quantized model#

[test01@onode12 work]$ /apps/sw/llama.cpp/amd-v100-cublas-12/bin/llama-bench -m /dev/shm/mistral-7b-v0.1.Q4_K_M/mistral-7b-v0.1.Q4_K_M.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla V100-PCIE-32GB, compute capability 7.0, VMM: yes
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | CUDA       |  99 | pp 512     |  2233.80 ± 65.69 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | CUDA       |  99 | tg 128     |     82.05 ± 0.15 |

Farm the evaluation of quantized models#

Todo

under development

Todo

Cache the model to some RAM disks and then rsync it to other RAM disks. Decide, depending on the read time from /scratch, which strategy leads to having the model on all the machines the fastest, i.e. figure out the best strategy to broadcast the model.

# define your prompts in a .txt file with one prompt per line
python farm_llama_cpp.py \
  --partitions=all \
  --prompts-file=/path/to/my_prompts.txt \
  --stats

Fine-tuning#

Using Accelerate#

Todo

in progress

Using Deepspeed#

Todo

in progress

Using unsloth#

It is possible to fine-tune quantized models of up to 70B parameters using unsloth on two V100 GPUs. Smaller models can be run on a single V100 GPU. In order to use unsloth, a singularity container has been prepared that works out of the box.

The official unsloth documentation can be found here: https://docs.unsloth.ai/

The procedure for running the fine-tuning is as follows:

  • Create a slurm job script that allocates two V100 GPUs

  • Run the unsloth container

  • Inside the container, run the fine-tuning script

The unsloth container is available on octopus at the following location:

/apps/sw/apptainer/images/unsloth-2025-03.sif

The following is a job script that is also located at

/home/shared/fine_tune_llama_70B_unsloth/job-2025-03.sh

Unsloth training script
  1# see https://github.com/unslothai/unsloth
  2
  3# %%
  4import sys
  5from unsloth import FastLanguageModel
  6import torch
  7max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
  8dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
  9load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
 10
 11# %%
 12model_name = "/scratch/shared/ai/models/llms/hugging_face/unsloth/Meta-Llama-3.1-70B-bnb-4bit"
 13#model_name = "/scratch/shared/ai/models/llms/hugging_face/unsloth/Llama-3.2-1B-Instruct"
 14#model_name = "/dev/shm/unsloth/Meta-Llama-3.1-70B-bnb-4bit"
 15#model_name = "/dev/shm/unsloth/Llama-3.2-1B-Instruct"
 16model, tokenizer = FastLanguageModel.from_pretrained(
 17    model_name = model_name,
 18    max_seq_length = max_seq_length,
 19    dtype = dtype,
 20    load_in_4bit = load_in_4bit,
 21    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
 22    device_map = 'auto'
 23)
 24
 25# %%
 26"""We now add LoRA adapters so we only need to update 1 to 10% of all parameters!"""
 27
 28model = FastLanguageModel.get_peft_model(
 29    model,
 30    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
 31    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
 32                      "gate_proj", "up_proj", "down_proj",],
 33    lora_alpha = 16,
 34    lora_dropout = 0, # Supports any, but = 0 is optimized
 35    bias = "none",    # Supports any, but = "none" is optimized
 36    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
 37    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
 38    random_state = 3407,
 39    use_rslora = False,  # We support rank stabilized LoRA
 40    loftq_config = None, # And LoftQ
 41)
 42
 43# %%
 44### Define the prompt formatting function and load the dataset
 45
 46alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
 47
 48### Instruction:
 49{}
 50
 51### Input:
 52{}
 53
 54### Response:
 55{}"""
 56
 57EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
 58def formatting_prompts_func(examples):
 59    instructions = examples["instruction"]
 60    inputs       = examples["input"]
 61    outputs      = examples["output"]
 62    texts = []
 63    for instruction, input, output in zip(instructions, inputs, outputs):
 64        # Must add EOS_TOKEN, otherwise your generation will go on forever!
 65        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
 66        texts.append(text)
 67    return { "text" : texts, }
 68
 69from datasets import load_dataset
 70dataset = load_dataset("yahma/alpaca-cleaned", split="train")
 71dataset = dataset.map(formatting_prompts_func, batched=True,)
 72
 73# %%
 74from trl import SFTTrainer
 75from transformers import TrainingArguments
 76from unsloth import is_bfloat16_supported
 77
 78trainer = SFTTrainer(
 79    model = model,
 80    tokenizer = tokenizer,
 81    train_dataset = dataset,
 82    dataset_text_field = "text",
 83    max_seq_length = max_seq_length,
 84    dataset_num_proc = 2,
 85    packing = False, # Can make training 5x faster for short sequences.
 86    args = TrainingArguments(
 87        per_device_train_batch_size = 2,
 88        gradient_accumulation_steps = 4,
 89        warmup_steps = 5,
 90        #num_train_epochs = 3, # Set this for 1 full training run.
 91        max_steps = 60,
 92        learning_rate = 2e-4,
 93        fp16 = not is_bfloat16_supported(),
 94        bf16 = is_bfloat16_supported(),
 95        logging_steps = 1,
 96        optim = "adamw_8bit",
 97        weight_decay = 0.01,
 98        lr_scheduler_type = "linear",
 99        seed = 3407,
100        output_dir = "outputs",
101        report_to = "none", # Use this for WandB etc
102        # Multi-GPU specific settings
103        ddp_find_unused_parameters=False,  # Important for efficiency
104        local_rank=-1,  # Will be set by accelerate
105    ),
106)
107
108# %%
109gpu_stats = torch.cuda.get_device_properties(0)
110start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
111max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
112print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
113print(f"{start_gpu_memory} GB of memory reserved.")
114
115# %%
116trainer_stats = trainer.train()
117
118# %%
119# @title Show final memory and time stats
120used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
121used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
122used_percentage = round(used_memory / max_memory * 100, 3)
123lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
124print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
125print(
126    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
127)
128print(f"Peak reserved memory = {used_memory} GB.")
129print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
130print(f"Peak reserved memory % of max memory = {used_percentage} %.")
131print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
132
133# %%
134### save the fine-tuned model
135model.save_pretrained("lora_model")  # Local saving
136tokenizer.save_pretrained("lora_model")
137
138# %%
139if True:
140    sys.exit(0)
141
142# %%
143"""<a name="Inference"></a>
144### Inference
145Let's run the model! You can change the instruction and input - leave the output blank!
146
147**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**
148"""
149
150# %%
151# alpaca_prompt = Copied from above
152FastLanguageModel.for_inference(model) # Enable native 2x faster inference
153inputs = tokenizer(
154[
155    alpaca_prompt.format(
156        "Continue the fibonnaci sequence.", # instruction
157        "1, 1, 2, 3, 5, 8", # input
158        "", # output - leave this blank for generation!
159    )
160], return_tensors = "pt").to("cuda")
161
162outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
163tokenizer.batch_decode(outputs)
164
165""" You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!"""
166
167# alpaca_prompt = Copied from above
168FastLanguageModel.for_inference(model) # Enable native 2x faster inference
169inputs = tokenizer(
170[
171    alpaca_prompt.format(
172        "Continue the fibonnaci sequence.", # instruction
173        "1, 1, 2, 3, 5, 8", # input
174        "", # output - leave this blank for generation!
175    )
176], return_tensors = "pt").to("cuda")
177
178from transformers import TextStreamer
179text_streamer = TextStreamer(tokenizer)
180_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
181
182"""<a name="Save"></a>
183### Saving, loading finetuned models
184To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.
185
186**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
187"""
188
189model.save_pretrained("lora_model")  # Local saving
190tokenizer.save_pretrained("lora_model")
191# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
192# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving
193
194"""Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:"""
195
196if False:
197    from unsloth import FastLanguageModel
198    model, tokenizer = FastLanguageModel.from_pretrained(
199        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
200        max_seq_length = max_seq_length,
201        dtype = dtype,
202        load_in_4bit = load_in_4bit,
203    )
204    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
205
206# alpaca_prompt = You MUST copy from above!
207
208inputs = tokenizer(
209[
210    alpaca_prompt.format(
211        "What is a famous tall tower in Paris?", # instruction
212        "", # input
213        "", # output - leave this blank for generation!
214    )
215], return_tensors = "pt").to("cuda")
216
217from transformers import TextStreamer
218text_streamer = TextStreamer(tokenizer)
219_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
220
221"""You can also use Hugging Face's `AutoModelForPeftCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**."""
222
223if False:
224    # I highly do NOT suggest - use Unsloth if possible
225    from peft import AutoPeftModelForCausalLM
226    from transformers import AutoTokenizer
227    model = AutoPeftModelForCausalLM.from_pretrained(
228        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
229        load_in_4bit = load_in_4bit,
230    )
231    tokenizer = AutoTokenizer.from_pretrained("lora_model")
232
233"""### Saving to float16 for VLLM
234
235We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
236"""
237
238# Merge to 16bit
239if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
240if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
241
242# Merge to 4bit
243if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
244if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")
245
246# Just LoRA adapters
247if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
248if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
249
250"""### GGUF / llama.cpp Conversion
251To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
252
253Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
254* `q8_0` - Fast conversion. High resource use, but generally acceptable.
255* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
256* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
257
258[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
259"""
260
261# Save to 8bit Q8_0
262if False: model.save_pretrained_gguf("model", tokenizer,)
263# Remember to go to https://huggingface.co/settings/tokens for a token!
264# And change hf to your username!
265if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")
266
267# Save to 16bit GGUF
268if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
269if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")
270
271# Save to q4_k_m GGUF
272if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
273if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
274
275# Save to multiple GGUF options - much faster if you want multiple!
276if False:
277    model.push_to_hub_gguf(
278        "hf/model", # Change hf to your username!
279        tokenizer,
280        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
281        token = "",
282    )
283
284"""Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)
285
286And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
287
288Some other links:
2891. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2902. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
2913. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
2926. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!
293
294<div class="align-center">
295  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
296  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
297  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
298
299  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
300</div>
301
302"""
 1#!/bin/bash
 2
 3#SBATCH --job-name=test-job
 4#SBATCH --account=abc123
 5
 6#SBATCH --partition=gpu
 7#SBATCH --nodes=1
 8#SBATCH --ntasks-per-node=1
 9#SBATCH --cpus-per-task=8
10#SBATCH --mem=128000
11#SBATCH --gres=gpu:v100d32q:2
12#SBATCH --time=0-03:00:00
13
14#SBATCH --mail-type=ALL
15#SBATCH --mail-user=abc123@mail.aub.edu
16
17module load singularity
18
19hostname
20
21# list/check the available GPUs that are on the node
22singularity exec --nv /apps/sw/apptainer/images/unsloth-2025-03.sif nvidia-smi
23
24WORKDIR=~/scratch/llama_70B_unsloth_test1
25#WORKDIR=/dev/shm/llama_70B_unsloth_test1
26
27# create a directory and place all the needed files in there
28mkdir -p ${WORKDIR}
29cd ${WORKDIR}
30
31TRAIN_SCRIPT_NAME=fine_tune_llama_70B_unsloth_job-2025-03.py
32cp /home/shared/fine_tune_llama_70B_unsloth/${TRAIN_SCRIPT_NAME} .
33
34singularity exec --nv \
35    --bind /scratch:/scratch \
36    /apps/sw/apptainer/images/unsloth-2025-03.sif \
37    /bin/bash -c \
38     ". /apps/miniconda/etc/profile.d/conda.sh && \
39      conda activate unsloth && \
40      python3 ${TRAIN_SCRIPT_NAME} 2>&1 | tee output.log"

These scripts are available on the cluster at the following path:

/home/shared/fine_tune_llama_70B_unsloth/fine_tune_llama_70B_unsloth_job-2025-03.sh

To reproduce this example, the following steps are required:

  • copy the script to your home directory

  • change the account to your account

  • change the mail-user to your email

  • submit the job

  • you should get the fine-tuned model in ~/scratch/llama_70B_unsloth_test1 (the WORKDIR set in the job script)

The expected output should look something like this (the output below is trimmed):

Tue Mar 11 15:33:42 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-32GB           Off | 00000000:04:00.0 Off |                  Off |
| N/A   42C    P0              27W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-32GB           Off | 00000000:1B:00.0 Off |                  Off |
| N/A   38C    P0              23W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
 Unsloth: Will patch your computer to enable 2x faster free finetuning.
 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.9: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    Tesla V100-PCIE-32GB. Num GPUs = 2. Max memory: 31.739 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards: 100%|██████████| 6/6 [00:17<00:00,  2.95s/it]
Unsloth 2025.3.9 patched 80 layers with 80 QKV layers, 80 O layers and 80 MLP layers.
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 2
   \\   /|    Num examples = 51,760 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 207,093,760/36,535,279,616 (0.57% trained)
GPU = Tesla V100-PCIE-32GB. Max memory = 31.739 GB.
15.875 GB of memory reserved.
100%|██████████| 60/60 [22:14<00:00, 22.24s/it]
Unsloth: Will smartly offload gradients to save VRAM!
{'loss': 1.3177, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 5.6959, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.0}
.
.
.
{'loss': 1.9101, 'grad_norm': nan, 'learning_rate': 0.00019272727272727274, 'epoch': 0.01}
{'loss': 4.2287, 'grad_norm': nan, 'learning_rate': 0.00019272727272727274, 'epoch': 0.01}
{'train_runtime': 1334.31, 'train_samples_per_second': 0.36, 'train_steps_per_second': 0.045, 'train_loss': 4.311967599391937, 'epoch': 0.01}
1334.31 seconds used for training.
22.24 minutes used for training.
Peak reserved memory = 18.598 GB. (per device)
Peak reserved memory for training = 2.723 GB. (per device)
Peak reserved memory % of max memory = 58.597 %.
Peak reserved memory for training % of max memory = 8.579 %.

Fine-tuning llama2 7B using the official facebook llama repo#

TL;DR Procedure to fine-tune llama2 7B on one V100 GPU on octopus.

The following pre-requisites are required to fine tune the llama2 7B model:

  • The facebook llama-recipes repo (already installed on octopus)

  • The LLaMA 7B HF model (email it.helpdesk@aub.edu.lb to request access by presenting a copy of your signed agreement https://llama.meta.com/llama-downloads/ or place your own copy in the right location - see below).

  • A python environment with the right requirements (already installed on octopus)

  • The job script with the octopus specific hardware / software configuration that runs the fine tuning.

To run the fine-tuning as described in the llama-recipes repo, the following steps are needed:

  1. Load the llama-recipes environment

  2. Clone and install the llama-recipes repo

  3. Cache the model to /dev/shm to speed up the loading of the model

  4. Run the fine tuning script

module load llama
cp -fvr /apps/sw/llama-recipes . && cd llama-recipes
git checkout 2e768b1
pip install .
rsync -PrlHvtpog  /scratch/shared/ai/models/llms/llama/llama-2-7b-hf /dev/shm/
mkdir models
ln -s /dev/shm/llama-2-7b-hf models/7B
time python -m llama_recipes.finetuning  \
  --use_peft --peft_method lora --quantization \
  --model_name models/7B --output_dir /dev/shm/PEFT/model/

The following output is expected:

[test04@onode11 llama-recipes]$ time python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization       --model_name models/7B --output_dir /dev/shm/PEFT/model/
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.51s/it]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroug
hly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
--> Model models/7B
--> models/7B has 262.41024 Million params
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:01<00:00, 10651.47 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:24<00:00, 598.36 examples/s]
--> Training Set Length = 14732
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:00<00:00, 8043.54 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:01<00:00, 582.25 examples/s]
--> Validation Set Length = 818
Preprocessing dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:07<00:00, 1920.48it/s]
Preprocessing dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:00<00:00, 1971.12it/s]
Training Epoch: 1:   0%|                                                                                | 0/388 [00:00<?, ?it/s]/home/mher/progs/sw/miniconda/envs/llama-orig-bench-1/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
Training Epoch: 1/3, step 387/388 completed (loss: 1.7123626470565796): 100%|███████████████| 388/388 [3:34:24<00:00, 33.16s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 24 GB
Peak active CUDA memory was 21 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 100%|█████████████████████████████████████████████████████████████████████████| 84/84 [04:29<00:00,  3.21s/it]
 eval_ppl=tensor(5.2620, device='cuda:0') eval_epoch_loss=tensor(1.6605, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in /dev/shm/PEFT/model/ directory
best eval loss on epoch 1 is 1.660506010055542
Epoch 1: train_perplexity=5.3824, train_epoch_loss=1.6831, epoch time 12864.613309495151s
Training Epoch: 2/3, step 387/388 completed (loss: 1.6909533739089966): 100%|███████████████| 388/388 [3:33:44<00:00, 33.05s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 24 GB
Peak active CUDA memory was 21 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 100%|█████████████████████████████████████████████████████████████████████████| 84/84 [04:29<00:00,  3.20s/it]
 eval_ppl=tensor(5.2127, device='cuda:0') eval_epoch_loss=tensor(1.6511, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in /dev/shm/PEFT/model/ directory
best eval loss on epoch 2 is 1.6511057615280151
Epoch 2: train_perplexity=5.1402, train_epoch_loss=1.6371, epoch time 12824.521782848984s
Training Epoch: 3/3, step 11/388 completed (loss: 1.5718340873718262):   3%|                | 12/388 [06:36<3:26:57, 33.03s/it]
Training Epoch: 3/3, step 387/388 completed (loss: 1.6727845668792725): 100%|███████████████| 388/388 [3:33:37<00:00, 33.03s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 24 GB
Peak active CUDA memory was 21 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 84/84 [04:28<00:00,  3.20s/it]
 eval_ppl=tensor(5.1962, device='cuda:0') eval_epoch_loss=tensor(1.6479, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in /dev/shm/PEFT/model/ directory
best eval loss on epoch 3 is 1.647936224937439
Epoch 3: train_perplexity=5.0411, train_epoch_loss=1.6176, epoch time 12817.443107374012s
Key: avg_train_prep, Value: 5.1879143714904785
Key: avg_train_loss, Value: 1.6459529399871826
Key: avg_eval_prep, Value: 5.223653316497803
Key: avg_eval_loss, Value: 1.6531827449798584
Key: avg_epoch_time, Value: 12835.526066572716
Key: avg_checkpoint_time, Value: 0.040507279336452484

real    659m9.844s
user    349m26.981s
sys     351m6.738s

The following table summarizes the performance of fine-tuning llama2 7B:

Model       GPU           Epochs   Wall Time
---------   -----------   ------   ---------
llama2 7B   Nvidia V100   3        10h 50m

The full job script (below) that reproduces the results can be found at /home/shared/fine_tune_llama_7b/job.sh. It can be copied to your home directory and executed as follows (replace test04 with your username):

#!/bin/bash

#SBATCH --job-name=llama7b-finetune
#SBATCH --account=test04

#SBATCH --partition=msfea-ai
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:v100d32q:1
#SBATCH --mem=32000
#SBATCH --time=0-12:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=test04@mail.aub.edu

module load llama
cp -fvr /apps/sw/llama-recipes . && cd llama-recipes
git checkout 2e768b1
pip install .
rsync -PrlHvtpog  /scratch/shared/ai/models/llms/llama/llama-2-7b-hf /dev/shm/
mkdir models
ln -s /dev/shm/llama-2-7b-hf models/7B
time python -m llama_recipes.finetuning  \
  --use_peft --peft_method lora --quantization \
  --model_name models/7B --output_dir /dev/shm/PEFT/model/

Todo

Add instructions for resuming from an epoch

Todo

Add instructions for providing a custom fine-tuning dataset

Fine tuning llama2 13B#

Prior to fine-tuning the 13B llama2 model, it must be sharded in order to fit on two or four V100 GPUs.

Note

I am not sure if it was possible to fine-tune 13B on two GPUs; this should be tried again.

Fine-tuning using llama_recipes#

# 4 GPUs
python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name models/13B --output_dir /dev/shm/PEFT/model

# across 4 hosts with one GPU each: on the master node (onode10)
torchrun --nproc-per-node=1 --nnodes=4 --node-rank=0 --master-addr=onode10 --master-port=4444 examples/finetuning.py --use_peft --peft_method lora --quantization --model_name models/13B --output_dir /dev/shm/PEFT/model

# on the remaining (slave) nodes
torchrun --nproc-per-node=1 --nnodes=4 --node-rank=1 --master-addr=onode10 --master-port=4444 examples/finetuning.py --use_peft --peft_method lora --quantization --model_name models/13B --output_dir /dev/shm/PEFT/model
torchrun --nproc-per-node=1 --nnodes=4 --node-rank=2 --master-addr=onode10 --master-port=4444 examples/finetuning.py --use_peft --peft_method lora --quantization --model_name models/13B --output_dir /dev/shm/PEFT/model
torchrun --nproc-per-node=1 --nnodes=4 --node-rank=3 --master-addr=onode10 --master-port=4444 examples/finetuning.py --use_peft --peft_method lora --quantization --model_name models/13B --output_dir /dev/shm/PEFT/model

Sharding#

Todo

add a section here on how to shard llama2 13B

Serving models using ollama#

A number of ollama models are available on octopus. The models are:

$ ollama list
NAME                                    ID              SIZE    MODIFIED
codellama:34b                           685be00e1532    19 GB   9 days ago
codellama:70b                           e59b580dfce7    38 GB   2 days ago
codellama:70b-code                      f51f75d243f2    38 GB   2 days ago
codellama:70b-instruct                  e59b580dfce7    38 GB   2 days ago
deepseek-coder:1.3b                     3ddd2d3fc8d2    776 MB  8 days ago
....
....
....

Since downloading large models is time consuming, please email it.helpdesk@aub.edu.lb about any additional models that you would like to have deployed that are not in the list above.

Note

The environment variable OLLAMA_MODELS is set to /scratch/shared/ai/models/llms/ollama/models. This is the default location where the models are stored. If you would like to use a different location, you can set the environment variable OLLAMA_MODELS to the desired location. If a model needs to be loaded / offloaded multiple times for some reason (such as a script that exits and re-runs many times), then caching the models you use to /dev/shm is a good idea. In this case, set the env variable OLLAMA_MODELS to /dev/shm/ollama/models and put your models in there by copying them from the default location.

Todo

add a bash function that caches a certain named model to /dev/shm

Load and list the models#

module load ollama
ollama list

Run a model in interactive mode#

module load ollama

ollama serve > /dev/null 2>&1 &
# wait a bit (~ 20 seconds) until the server is up and running
ollama run phi:latest

Run a model in batch mode#

Create a python script that uses the ollama client to run a model. In the example below the phi model is used since it is small and can be loaded quickly.

import ollama
response = ollama.chat(model='phi', messages=[
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  },
])
print(response['message']['content'])
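
Then load the modules, start the ollama server in the background, and run the script (saved here as ollama_eval.py, matching the file name used below):
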
module load ollama
module load python/ai-4

ollama serve > /dev/null 2>&1 &
sleep 20
python ollama_eval.py

Quantizing models#

Todo

add notes here