Large language models#
Author: Mher Kazandjian
Warning
This document is a work in progress and is subject to change.
Environments#
The following environments are available on octopus
for running large
language models and developing new models:
python/ai-4
python/transformers/r1
Resource requirements estimation tips and tricks#
Warning
The following tips and tricks are not guaranteed to work for all models. They are just a starting point for estimating the resources required to run a model.
Warning
Do not expect code or notebooks copied and pasted from
huggingface, kaggle, github, stack overflow or elsewhere to work on
octopus
out of the box. The HPC constraints and optimizations must be taken
into account, and you should understand the code you are executing.
This is a common mistake that frequently results in degraded
performance or wrong results.
Warning
Make sure that your scripts make use of the accelerators whenever GPU resources are specified in the job script. The V100 GPUs are enterprise-grade, high-end GPUs. If the performance you are getting is slower than what you expect, most probably there is a bottleneck in your script, e.g. in data loading, data augmentation, the model architecture, or the optimizer. In that case:
- Make sure that the compute node you are running on has a GPU by executing nvidia-smi. If you are not getting any output, then the compute node does not have a GPU.
- Make sure that you are using the GPU packages of e.g. PyTorch, TensorFlow, etc. and not the CPU packages (see the sketch after this list).
- If you installed additional packages using pip or conda, or set up your environment from scratch, ensure that your changes do not override the pre-installed packages.
- Profile your script to identify the bottleneck. You can use basic profilers, or gpu_usage_live or nvtop to monitor the GPU usage. Other, more advanced profilers are available on octopus: nsys or nvprof can be used to profile your script and nvvp to visualize the profiling results.
- Understand the resource requirements of your model for training or inference and compare them to the capabilities of the GPU. If the performance is much slower than expected, then most probably there is a bottleneck in your script.
- To optimize reads and writes you can cache data to the RAM disk in /dev/shm on onode10, onode11, onode12 and onode17. These nodes have 128 GB of RAM that can be used to cache the data.
- If all the above attempts fail, please contact HPC support for further assistance.
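As a quick check of the second point above, a minimal sketch along the following lines can be run from the loaded Python environment (assuming a PyTorch-based environment such as python/ai-4):
import torch

# check that the CUDA-enabled build of PyTorch is installed and that the
# job can actually see the allocated GPU(s)
print("torch version :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))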
Available models in the model library#
Currently the HPC service provides two main repositories for large language models:
hugging face models:
/scratch/shared/ai/models/llms/hugging_face
ollama models:
/scratch/shared/ai/models/llms/ollama
In total around 100 models are available in the model library with a total size of 5 TB.
The following is the list of Hugging Face models available on octopus
(last updated 2024-12-13):
core24/
└── jais-13b-chat
FreedomIntelligence
├── AceGPT-13B
├── AceGPT-7B
└── AceGPT-7B-chat
google
├── gemma-2-2b-it
└── gemma-2-9b-it
inceptionai
├── jais-13b
├── jais-13b-chat
├── jais-30b-chat-v3
└── jais-30b-v3
meta-llama
├── llama-2-13b
├── llama-2-13b-chat
├── llama-2-13b-hf
├── llama-2-7b
├── llama-2-7b-chat
├── llama-2-7b-chat-hf
├── llama-2-7b-hf
├── Llama-3.1-70B-Instruct
├── Llama-3.1-8B-Instruct
├── Llama-3.2-11B-Vision-Instruct
├── Llama-3.2-1B-Instruct
├── Llama-3.2-3B-Instruct
├── Llama-3.2-90B-Vision-Instruct
├── Meta-Llama-3-8B
└── Meta-Llama-3-8B-Instruct
mistralai
├── Mistral-7B-Instruct-v0.3
├── Mistral-7B-v0.1
├── mistral-7b-v0.1.Q4_K_M
├── Mistral-7B-v0.3
├── Mistral-Large-Instruct-2407
└── Mixtral-8x7B-Instruct-v0.1
pfnet
└── Llama3-Preferred-MedSwallow-70B
Qwen
├── Qwen2.5-72B-Instruct
├── Qwen2.5-7B-Instruct
├── Qwen2-VL-2B-Instruct
├── Qwen2-VL-72B-Instruct
└── Qwen2-VL-7B-Instruct
tiiaue
├── falcon-180b
├── falcon-1b
├── falcon-40b
├── falcon-40b-instruct
└── falcon-7b
The following is the list of ollama models (last updated 2024-12-13) (see also here):
llama3.1:latest 42182419e950 4.7 GB 2 weeks ago
llama3.1:405b 65fa6b82bfda 228 GB 7 weeks ago
nemotron:latest 2262f047a28a 42 GB 7 weeks ago
llama3.2:1b baf6a787fdff 1.3 GB 7 weeks ago
gemma:latest a72c7f4d0a15 5.0 GB 7 months ago
gemma:2b b50d6c999e59 1.7 GB 7 months ago
llama3:70b-instruct bcfb190ca3a7 39 GB 7 months ago
llama3:70b bcfb190ca3a7 39 GB 7 months ago
llama3:latest 71a106a91016 4.7 GB 7 months ago
llama3:instruct 71a106a91016 4.7 GB 7 months ago
mixtral:8x7b-text-v0.1-fp16 221f0bf341e3 93 GB 10 months ago
codellama:70b-code f51f75d243f2 38 GB 10 months ago
codellama:70b e59b580dfce7 38 GB 10 months ago
codellama:70b-instruct e59b580dfce7 38 GB 10 months ago
deepseek-coder:33b-instruct-fp16 b54904179335 66 GB 10 months ago
deepseek-coder:33b-base-q4_0 ca50732c8ee1 18 GB 10 months ago
deepseek-coder:33b acec7c0b0fd9 18 GB 10 months ago
mistral:latest 61e88e884507 4.1 GB 10 months ago
deepseek-coder:6.7b ce298d984115 3.8 GB 10 months ago
deepseek-coder:1.3b-base-q8_0 71f702eff852 1.4 GB 10 months ago
deepseek-coder:1.3b 3ddd2d3fc8d2 776 MB 10 months ago
deepseek-coder:latest 3ddd2d3fc8d2 776 MB 10 months ago
deepseek-coder:1.3b-instruct 3ddd2d3fc8d2 776 MB 10 months ago
megadolphin:latest 8fa55398527b 67 GB 10 months ago
dolphin-mixtral:8x7b cfada4ba31c7 26 GB 10 months ago
zephyr:latest bbe38b81adec 4.1 GB 10 months ago
stablelm-zephyr:latest 0a108dbd846e 1.6 GB 10 months ago
deepseek-coder:33b-instruct acec7c0b0fd9 18 GB 10 months ago
wizardlm:70b-llama2-q4_0 2d269a65a092 38 GB 10 months ago
yarn-mistral:7b-128k 6511b83c33d5 4.1 GB 10 months ago
wizardlm-uncensored:13b 886a369d74fc 7.4 GB 10 months ago
falcon:180b-chat e2bc879d7cee 101 GB 10 months ago
mixtral:latest 7708c059a8bb 26 GB 10 months ago
starcoder:7b 53fdbc3a2006 4.3 GB 10 months ago
starcoder:15b fc59c84e00c5 9.0 GB 10 months ago
codellama:34b 685be00e1532 19 GB 10 months ago
starcoder:3b 847e5a7aa26f 1.8 GB 10 months ago
starcoder:1b 77e6c46054d9 726 MB 10 months ago
falcon:7b 4280f7257e73 4.2 GB 10 months ago
medllama2:latest a53737ec0c72 3.8 GB 10 months ago
mixtral:8x7b-instruct-v0.1-q8_0 a6689be5de7d 49 GB 10 months ago
llava:latest cd3274b81a85 4.5 GB 10 months ago
mistral:instruct 61e88e884507 4.1 GB 10 months ago
phi:latest e2fd6321a5fe 1.6 GB 10 months ago
tinyllama:latest 2644915ede35 637 MB 10 months ago
For the latest list check the content of the directories listed above.
Since downloading large models is time consuming, please email
it.helpdesk@aub.edu.lb
to request additional models that are not in the list above.
It is a good practice to cache the model (if it fits) to /dev/shm/
to speed
up loading the models for repeated use. The read and write speed to /dev/shm
is around 4 GB/s. Loading the hugging face mistral 7B model can be done in about
5 seconds.
Note
Some of the models that are gated on hugging face require permission to access them on octopus as well. These models are the following:
meta-llama
mistral
inceptionai
To access them, please contact it.helpdesk@aub.edu.lb. It is a prerequisite to already have access to these models on Hugging Face before being granted access to them on octopus. All the ollama models are available to all users.
To cache a model (e.g. Llama-3.1-8B-Instruct, as in the command below) to /dev/shm/
the following command can be used:
rsync -PrlHvtpog /scratch/shared/ai/models/llms/hugging_face/meta-llama/Llama-3.1-8B-Instruct \
--exclude "*.git*" --exclude "*original*" \
/dev/shm
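As a quick sanity check, a minimal sketch along these lines (assuming the rsync above has completed and the transformers environment is loaded) loads the cached copy from /dev/shm and prints the load time, which should be on the order of seconds:
import time
from transformers import AutoModelForCausalLM

t0 = time.time()
# load the copy cached on the RAM disk; torch_dtype="auto" keeps the
# checkpoint's native precision instead of upcasting to float32
model = AutoModelForCausalLM.from_pretrained("/dev/shm/Llama-3.1-8B-Instruct",
                                             torch_dtype="auto")
print(f"model loaded from /dev/shm in {time.time() - t0:.1f} s")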
The following snippet can be used to load a model from the model library on octopus:
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
cache_dir = '/scratch/shared/ai/models/llms/hugging_face'
model_name = "meta-llama/Llama-3.2-1B-Instruct"
model_path = os.path.join(cache_dir, model_name)
model = AutoModelForCausalLM.from_pretrained(model_path, cache_dir=cache_dir)
#
# do something with the model ...
#
# dump the modified model to disk
trial_no = 0
model_out = os.path.expanduser(
    os.path.join('~/scratch/models_workspace', f'{model_name}_{trial_no}'))
os.makedirs(model_out, exist_ok=True)
model.save_pretrained(model_out)
print('done')
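To generate text with a model loaded from the library, the tokenizer can be loaded from the same path and a prompt passed to model.generate. The following is a minimal sketch (assuming a GPU is allocated to the job; the prompt is only illustrative):
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

cache_dir = '/scratch/shared/ai/models/llms/hugging_face'
model_path = os.path.join(cache_dir, "meta-llama/Llama-3.2-1B-Instruct")

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto").to(device)

# encode a prompt, generate a continuation and decode it back to text
inputs = tokenizer("The capital of Lebanon is", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))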
Download the model and datasets#
If the model is small, you can download it to the local disk in your scratch directory. Set the HF_HOME environment variable to the scratch directory before downloading the model. The following commands create a cache directory for models and Hugging Face datasets in your scratch directory, or download into a custom local directory.
# read about the huggingface-cli tool
module load python/ai-4
huggingface-cli --help
export HF_HOME=~/scratch/huggingface
# to make this permanent add the above line to your .bashrc file, it will take effect next time
# you login to the system or source ~/.bashrc
echo "export HF_HOME=~/scratch/huggingface" >> ~/.bashrc
# (optional) only for gated content
# login to huggingface or set up your api key via the environment variable HF_TOKEN
huggingface-cli login
# or set the HF_TOKEN environment variable
export HF_TOKEN="replace_this_your_huggingface_token_replace_this"
#
# specify the model or dataset name and download it
#
MODEL="Qwen/Qwen2.5-VL-3B-Instruct"
huggingface-cli download ${MODEL}
# to download the model to a custom path
DIRPATH=~/scratch/my_custom_dir # use $PWD to download in the local dir
mkdir -p ${DIRPATH}/${MODEL}
huggingface-cli download ${MODEL} --local-dir ${DIRPATH}/${MODEL}
A similar approach can be followed to download datasets from huggingface.
# (optional) only for gated content
# login to huggingface or set up your api key via the environment variable HF_TOKEN
huggingface-cli login
# or set the HF_TOKEN environment variable
export HF_TOKEN="replace_this_your_huggingface_token_replace_this"
# specify the dataset name and download it
DATASET="PULSE-ECG/ECGBench"
huggingface-cli download --repo-type dataset "${DATASET}"
# to download the dataset to a custom path
DIRPATH=~/scratch/my_custom_dir # use $PWD to download in the local dir
mkdir -p "${DIRPATH}/${DATASET}"
huggingface-cli download --repo-type dataset "${DATASET}" --local-dir "${DIRPATH}/${DATASET}"
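Once downloaded, the dataset can be opened from the local copy with the datasets package. The following is a minimal sketch, assuming the custom directory used above and that the repository stores its data in a layout that load_dataset can read directly:
import os
from datasets import load_dataset

# point load_dataset at the snapshot created by
# `huggingface-cli download --repo-type dataset ... --local-dir ...`
dirpath = os.path.expanduser("~/scratch/my_custom_dir")
dataset = load_dataset(os.path.join(dirpath, "PULSE-ECG/ECGBench"))
print(dataset)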
Running inference and evaluating models#
Hugging face models using the transformers package#
In the following example the mistral 7B model will be evaluated using the
transformers/r1 pre-deployed environment. The job script and the python script
that runs the model are available on octopus
at:
/apps/shared/...../path/to/example1
The expected evaluation time of the example below is ?? seconds. This example produces ?? tokens at an average rate of ?? tokens/min. During this test a total of ?? GB is transferred from the disk to the GPU and a total of ?? (float??) operations are performed. The total memory transfer from VRAM to the GPU is ?? GB at an average rate of ?? GB/s and a peak of ?? GB/s.
The job script is the following:
############################ eval_mistral.sh ###############################
#!/bin/bash
#SBATCH --job-name=eval-mistral
#SBATCH --account=abc123
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32000
#SBATCH --gres=gpu:v100d32q:1
#SBATCH --time=0-00:10:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=abc123@mail.aub.edu
# prepare the scripts and cache the model
cp /scratch/llms/.../mistral7b... /dev/shm
cp /apps/shared/ai/.../eval_mistral_userguide.py /dev/shm/
# load the transformers environment and evaluate the model
module load python/transformers/r1
cd /dev/shm
python eval_mistral_userguide.py
########################## end eval_mistral.sh #############################
import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto
model_name = "mistralai/Mistral-7B-v0.1"
cache_dir = '/dev/shm/huggingface_cache'
model = AutoModelForCausalLM.from_pretrained(
model_name,
cache_dir=cache_dir)
tokenizer = AutoTokenizer.from_pretrained(
model_name,
trust_remote_code=True,
cache_dir=cache_dir)
# evaluate the model for 10 prompts
prompts = [
"My favourite condiment is",
"My favourite condiment is",
"My favourite condiment is",
"My favourite condiment is",
"My favourite condiment is",
"My favourite condiment is",
"My favourite condiment is",
"My favourite condiment is",
"My favourite condiment is",
"My favourite condiment is"
]
model.to(device)
for prompt in tqdm.tqdm(prompts):
    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
    print(tokenizer.batch_decode(generated_ids)[0])
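To get a rough idea of the generation throughput, the evaluation loop above can be timed and an approximate tokens-per-second figure printed. This is a minimal sketch intended to be appended to the script above (it reuses model, tokenizer, prompts and device from there); the placeholder figures quoted earlier are still to be measured:
import time

start = time.time()
total_new_tokens = 0
for prompt in prompts:
    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
    # count only the newly generated tokens, not the prompt tokens
    total_new_tokens += generated_ids.shape[1] - model_inputs["input_ids"].shape[1]
elapsed = time.time() - start
print(f"{total_new_tokens} new tokens in {elapsed:.1f} s "
      f"({total_new_tokens / elapsed:.1f} tokens/s)")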
Evaluating quantized models#
Once a model is fine-tuned or trained (see below), it is convenient to quantize it (assuming the loss in accuracy is not too high) and evaluate the quantized model for testing purposes. For use cases that do not require high accuracy, quantized models are good enough and they outperform the llama 7B model.
Todo
double check this statement.
Using llama.cpp#
In this section I will explain the basics of quantization and how to evaluate such models without any optimization on a CPU. Later in this section I will describe and demonstrate how to scale the model evaluation using a single GPU, multiple GPUs across several hosts, or multiple hosts using only CPUs, and compare the performance.
Evaluate the quantized model on a CPU - non optimized#
module load gcc/12
rsync -PrlHvtpog /scratch/shared/ai/models/llms/mistralai/Mistral-7B-v0.1/mistral-7b-v0.1.Q4_K_M /dev/shm/
/apps/sw/llama.cpp/amd-avx2/bin/main -t 16 -ngl 24 --color --temp 0.7 -n 1 -m /dev/shm/mistral-7b-v0.1.Q4_K_M/mistral-7b-v0.1.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Evaluate the quantized model on a CPU (optimized)#
module load gcc/12
module load cuda/12
rsync -PrlHvtpog /scratch/shared/ai/models/llms/mistralai/Mistral-7B-v0.1/mistral-7b-v0.1.Q4_K_M /dev/shm/
/apps/sw/llama.cpp/amd-v100-cublas-12/bin/main -t 8 -ngl 24 --color --temp 0.7 -n 1 -m /dev/shm/mistral-7b-v0.1.Q4_K_M/mistral-7b-v0.1.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Evaluate the quantized model on a CPU across multiple hosts#
module load llama.cpp/mpi
Evaluate the quantized model on a GPU#
module load llama.cpp/gpu-v100
...
module load llama.cpp/gpu-k20
...
Evaluate the quantized model across multiple GPUs#
module load llama.cpp/gpu-v100-mpi
...
module load llama.cpp/gpu-k20-mpi
...
Benchmark the quantized model#
[test01@onode12 work]$ /apps/sw/llama.cpp/amd-v100-cublas-12/bin/llama-bench -m /dev/shm/mistral-7b-v0.1.Q4_K_M/mistral-7b-v0.1.Q4_K_M.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla V100-PCIE-32GB, compute capability 7.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CUDA | 99 | pp 512 | 2233.80 ± 65.69 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CUDA | 99 | tg 128 | 82.05 ± 0.15 |
Farm the evaluation of quantized models#
Todo
under development: cache the model to some RAM disks and then rsync it to the other RAM disks. Decide, depending on the read time from /scratch, which strategy gets the model onto all the machines the fastest, i.e. figure out the best strategy to broadcast the model.
# define your prompts in a .txt file with one prompt per line
python farm_llama_cpp.py \
--partitions=all \
--prompts-file=/path/to/my_prompts.txt \
--stats
Fine-tuning#
Using Accelerate#
Todo
in progress
Using Deepspeed#
Todo
in progress
Using unsloth#
It is possible to fine-tune quantized models of up to 70B parameters using unsloth on two V100 GPUs. Smaller models can be run on one V100 GPU. To use unsloth, a singularity container has been prepared that works out of the box.
The official unsloth documentation can be found here: https://docs.unsloth.ai/
The procedure for running the fine-tuning is as follows:
Create a slurm job script that allocates two V100 GPUs
Run the unsloth container
Inside the container run the fine tuning script
The unsloth container is available on octopus
at the following location:
/apps/sw/apptainer/images/unsloth-2025-03.sif
The training script and the corresponding job script are listed below. The job script is also located at
/home/shared/fine_tune_llama_70B_unsloth/job-2025-03.sh
Unsloth training script
1# see https://github.com/unslothai/unsloth
2
3# %%
4import sys
5from unsloth import FastLanguageModel
6import torch
7max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
8dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
9load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
10
11# %%
12model_name = "/scratch/shared/ai/models/llms/hugging_face/unsloth/Meta-Llama-3.1-70B-bnb-4bit"
13#model_name = "/scratch/shared/ai/models/llms/hugging_face/unsloth/Llama-3.2-1B-Instruct"
14#model_name = "/dev/shm/unsloth/Meta-Llama-3.1-70B-bnb-4bit"
15#model_name = "/dev/shm/unsloth/Llama-3.2-1B-Instruct"
16model, tokenizer = FastLanguageModel.from_pretrained(
17 model_name = model_name,
18 max_seq_length = max_seq_length,
19 dtype = dtype,
20 load_in_4bit = load_in_4bit,
21 # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
22 device_map = 'auto'
23)
24
25# %%
26"""We now add LoRA adapters so we only need to update 1 to 10% of all parameters!"""
27
28model = FastLanguageModel.get_peft_model(
29 model,
30 r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
31 target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
32 "gate_proj", "up_proj", "down_proj",],
33 lora_alpha = 16,
34 lora_dropout = 0, # Supports any, but = 0 is optimized
35 bias = "none", # Supports any, but = "none" is optimized
36 # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
37 use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
38 random_state = 3407,
39 use_rslora = False, # We support rank stabilized LoRA
40 loftq_config = None, # And LoftQ
41)
42
43# %%
44### Define the prompt formatting function and load the dataset
45
46alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
47
48### Instruction:
49{}
50
51### Input:
52{}
53
54### Response:
55{}"""
56
57EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
58def formatting_prompts_func(examples):
59 instructions = examples["instruction"]
60 inputs = examples["input"]
61 outputs = examples["output"]
62 texts = []
63 for instruction, input, output in zip(instructions, inputs, outputs):
64 # Must add EOS_TOKEN, otherwise your generation will go on forever!
65 text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
66 texts.append(text)
67 return { "text" : texts, }
68
69from datasets import load_dataset
70dataset = load_dataset("yahma/alpaca-cleaned", split="train")
71dataset = dataset.map(formatting_prompts_func, batched=True,)
72
73# %%
74from trl import SFTTrainer
75from transformers import TrainingArguments
76from unsloth import is_bfloat16_supported
77
78trainer = SFTTrainer(
79 model = model,
80 tokenizer = tokenizer,
81 train_dataset = dataset,
82 dataset_text_field = "text",
83 max_seq_length = max_seq_length,
84 dataset_num_proc = 2,
85 packing = False, # Can make training 5x faster for short sequences.
86 args = TrainingArguments(
87 per_device_train_batch_size = 2,
88 gradient_accumulation_steps = 4,
89 warmup_steps = 5,
90 #num_train_epochs = 3, # Set this for 1 full training run.
91 max_steps = 60,
92 learning_rate = 2e-4,
93 fp16 = not is_bfloat16_supported(),
94 bf16 = is_bfloat16_supported(),
95 logging_steps = 1,
96 optim = "adamw_8bit",
97 weight_decay = 0.01,
98 lr_scheduler_type = "linear",
99 seed = 3407,
100 output_dir = "outputs",
101 report_to = "none", # Use this for WandB etc
102 # Multi-GPU specific settings
103 ddp_find_unused_parameters=False, # Important for efficiency
104 local_rank=-1, # Will be set by accelerate
105 ),
106)
107
108# %%
109gpu_stats = torch.cuda.get_device_properties(0)
110start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
111max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
112print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
113print(f"{start_gpu_memory} GB of memory reserved.")
114
115# %%
116trainer_stats = trainer.train()
117
118# %%
119# @title Show final memory and time stats
120used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
121used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
122used_percentage = round(used_memory / max_memory * 100, 3)
123lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
124print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
125print(
126 f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
127)
128print(f"Peak reserved memory = {used_memory} GB.")
129print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
130print(f"Peak reserved memory % of max memory = {used_percentage} %.")
131print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
132
133# %%
134### save the fine-tuned model
135model.save_pretrained("lora_model") # Local saving
136tokenizer.save_pretrained("lora_model")
137
138# %%
139if True:
140 sys.exit(0)
141
142# %%
143"""<a name="Inference"></a>
144### Inference
145Let's run the model! You can change the instruction and input - leave the output blank!
146
147**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**
148"""
149
150# %%
151# alpaca_prompt = Copied from above
152FastLanguageModel.for_inference(model) # Enable native 2x faster inference
153inputs = tokenizer(
154[
155 alpaca_prompt.format(
156 "Continue the fibonnaci sequence.", # instruction
157 "1, 1, 2, 3, 5, 8", # input
158 "", # output - leave this blank for generation!
159 )
160], return_tensors = "pt").to("cuda")
161
162outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
163tokenizer.batch_decode(outputs)
164
165""" You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!"""
166
167# alpaca_prompt = Copied from above
168FastLanguageModel.for_inference(model) # Enable native 2x faster inference
169inputs = tokenizer(
170[
171 alpaca_prompt.format(
172 "Continue the fibonnaci sequence.", # instruction
173 "1, 1, 2, 3, 5, 8", # input
174 "", # output - leave this blank for generation!
175 )
176], return_tensors = "pt").to("cuda")
177
178from transformers import TextStreamer
179text_streamer = TextStreamer(tokenizer)
180_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
181
182"""<a name="Save"></a>
183### Saving, loading finetuned models
184To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.
185
186**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
187"""
188
189model.save_pretrained("lora_model") # Local saving
190tokenizer.save_pretrained("lora_model")
191# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
192# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving
193
194"""Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:"""
195
196if False:
197 from unsloth import FastLanguageModel
198 model, tokenizer = FastLanguageModel.from_pretrained(
199 model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
200 max_seq_length = max_seq_length,
201 dtype = dtype,
202 load_in_4bit = load_in_4bit,
203 )
204 FastLanguageModel.for_inference(model) # Enable native 2x faster inference
205
206# alpaca_prompt = You MUST copy from above!
207
208inputs = tokenizer(
209[
210 alpaca_prompt.format(
211 "What is a famous tall tower in Paris?", # instruction
212 "", # input
213 "", # output - leave this blank for generation!
214 )
215], return_tensors = "pt").to("cuda")
216
217from transformers import TextStreamer
218text_streamer = TextStreamer(tokenizer)
219_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
220
221"""You can also use Hugging Face's `AutoModelForPeftCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**."""
222
223if False:
224 # I highly do NOT suggest - use Unsloth if possible
225 from peft import AutoPeftModelForCausalLM
226 from transformers import AutoTokenizer
227 model = AutoPeftModelForCausalLM.from_pretrained(
228 "lora_model", # YOUR MODEL YOU USED FOR TRAINING
229 load_in_4bit = load_in_4bit,
230 )
231 tokenizer = AutoTokenizer.from_pretrained("lora_model")
232
233"""### Saving to float16 for VLLM
234
235We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
236"""
237
238# Merge to 16bit
239if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
240if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
241
242# Merge to 4bit
243if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
244if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")
245
246# Just LoRA adapters
247if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
248if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
249
250"""### GGUF / llama.cpp Conversion
251To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
252
253Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
254* `q8_0` - Fast conversion. High resource use, but generally acceptable.
255* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
256* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
257
258[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
259"""
260
261# Save to 8bit Q8_0
262if False: model.save_pretrained_gguf("model", tokenizer,)
263# Remember to go to https://huggingface.co/settings/tokens for a token!
264# And change hf to your username!
265if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")
266
267# Save to 16bit GGUF
268if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
269if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")
270
271# Save to q4_k_m GGUF
272if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
273if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
274
275# Save to multiple GGUF options - much faster if you want multiple!
276if False:
277 model.push_to_hub_gguf(
278 "hf/model", # Change hf to your username!
279 tokenizer,
280 quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
281 token = "",
282 )
283
284"""Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)
285
286And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
287
288Some other links:
2891. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2902. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
2913. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
2926. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!
293
294<div class="align-center">
295 <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
296 <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
297 <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
298
299 Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
300</div>
301
302"""
1#!/bin/bash
2
3#SBATCH --job-name=test-job
4#SBATCH --account=abc123
5
6#SBATCH --partition=gpu
7#SBATCH --nodes=1
8#SBATCH --ntasks-per-node=1
9#SBATCH --cpus-per-task=8
10#SBATCH --mem=128000
11#SBATCH --gres=gpu:v100d32q:2
12#SBATCH --time=0-03:00:00
13
14#SBATCH --mail-type=ALL
15#SBATCH --mail-user=abc123@mail.aub.edu
16
17module load singularity
18
19hostname
20
21# list/check the available GPUs that are on the node
22singularity exec --nv /apps/sw/apptainer/images/unsloth-2025-03.sif nvidia-smi
23
24WORKDIR=~/scratch/llama_70B_unsloth_test1
25#WORKDIR=/dev/shm/llama_70B_unsloth_test1
26
27# create a directory and place all the needed files in there
28mkdir -p ${WORKDIR}
29cd ${WORKDIR}
30
31TRAIN_SCRIPT_NAME=fine_tune_llama_70B_unsloth_job-2025-03.py
32cp /home/shared/fine_tune_llama_70B_unsloth/${TRAIN_SCRIPT_NAME} .
33
34singularity exec --nv \
35 --bind /scratch:/scratch \
36 /apps/sw/apptainer/images/unsloth-2025-03.sif \
37 /bin/bash -c \
38 ". /apps/miniconda/etc/profile.d/conda.sh && \
39 conda activate unsloth && \
40 python3 ${TRAIN_SCRIPT_NAME} 2>&1 | tee output.log"
These scripts are available on the cluster at the following path:
/home/shared/fine_tune_llama_70B_unsloth/fine_tune_llama_70B_unsloth_job-2025-03.sh
To reproduce this example, the following steps are required:
- copy the script to your home directory
- change the account to your account
- change the mail-user to your email
- submit the job
You should get the fine-tuned model in ~/scratch/llama_70B_unsloth_test1.
The expected output should look something like this (the output below is trimmed):
Tue Mar 11 15:33:42 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-32GB Off | 00000000:04:00.0 Off | Off |
| N/A 42C P0 27W / 250W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE-32GB Off | 00000000:1B:00.0 Off | Off |
| N/A 38C P0 23W / 250W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth Zoo will now patch everything to make training faster!
==((====))== Unsloth 2025.3.9: Fast Llama patching. Transformers: 4.49.0.
\\ /| Tesla V100-PCIE-32GB. Num GPUs = 2. Max memory: 31.739 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.6.0+cu124. CUDA: 7.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\ / Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards: 100%|██████████| 6/6 [00:17<00:00, 2.95s/it]
Unsloth 2025.3.9 patched 80 layers with 80 QKV layers, 80 O layers and 80 MLP layers.
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 2
\\ /| Num examples = 51,760 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \ Batch size per device = 2 | Gradient accumulation steps = 4
\ / Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
"-____-" Trainable parameters = 207,093,760/36,535,279,616 (0.57% trained)
GPU = Tesla V100-PCIE-32GB. Max memory = 31.739 GB.
15.875 GB of memory reserved.
100%|██████████| 60/60 [22:14<00:00, 22.24s/it]
Unsloth: Will smartly offload gradients to save VRAM!
{'loss': 1.3177, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 5.6959, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.0}
.
.
.
{'loss': 1.9101, 'grad_norm': nan, 'learning_rate': 0.00019272727272727274, 'epoch': 0.01}
{'loss': 4.2287, 'grad_norm': nan, 'learning_rate': 0.00019272727272727274, 'epoch': 0.01}
{'train_runtime': 1334.31, 'train_samples_per_second': 0.36, 'train_steps_per_second': 0.045, 'train_loss': 4.311967599391937, 'epoch': 0.01}
1334.31 seconds used for training.
22.24 minutes used for training.
Peak reserved memory = 18.598 GB. (per device)
Peak reserved memory for training = 2.723 GB. (per device)
Peak reserved memory % of max memory = 58.597 %.
Peak reserved memory for training % of max memory = 8.579 %.
Fine-tuning llama2 7B using the official facebook llama repo#
TL;DR Procedure to fine-tune llama2 7B on one V100 GPU on octopus
.
The following prerequisites are needed to fine-tune the llama2 7B model:
- The facebook llama-recipes repo (already installed on octopus)
- The LLaMA 7B HF model (email it.helpdesk@aub.edu.lb to request access by presenting a copy of your signed agreement https://llama.meta.com/llama-downloads/ or place your own copy in the right location - see below)
- A python environment with the right requirements (already installed on octopus)
- The job script with the octopus specific hardware / software configuration that runs the fine-tuning
To run the fine tuning as described in the llama-recipes repo, the following steps are done:
- Load the llama-recipes environment
- Clone and install the llama-recipes repo
- Cache the model to /dev/shm to speed up the loading of the model
- Run the fine tuning script
module load llama
cp -fvr /apps/sw/llama-recipes . && cd llama-recipes
git checkout 2e768b1
pip install .
rsync -PrlHvtpog /scratch/shared/ai/models/llms/llama/llama-2-7b-hf /dev/shm/
mkdir models
ln -s /dev/shm/llama-2-7b-hf models/7B
time python -m llama_recipes.finetuning \
--use_peft --peft_method lora --quantization \
--model_name models/7B --output_dir /dev/shm/PEFT/model/
The following output is expected:
[test04@onode11 llama-recipes]$ time python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name models/7B --output_dir /dev/shm/PEFT/model/
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.51s/it]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroug
hly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
--> Model models/7B
--> models/7B has 262.41024 Million params
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:01<00:00, 10651.47 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:24<00:00, 598.36 examples/s]
--> Training Set Length = 14732
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:00<00:00, 8043.54 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:01<00:00, 582.25 examples/s]
--> Validation Set Length = 818
Preprocessing dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14732/14732 [00:07<00:00, 1920.48it/s]
Preprocessing dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 818/818 [00:00<00:00, 1971.12it/s]
Training Epoch: 1: 0%| | 0/388 [00:00<?, ?it/s]/home/mher/progs/sw/miniconda/envs/llama-orig-bench-1/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
Training Epoch: 1/3, step 387/388 completed (loss: 1.7123626470565796): 100%|███████████████| 388/388 [3:34:24<00:00, 33.16s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 24 GB
Peak active CUDA memory was 21 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 100%|█████████████████████████████████████████████████████████████████████████| 84/84 [04:29<00:00, 3.21s/it]
eval_ppl=tensor(5.2620, device='cuda:0') eval_epoch_loss=tensor(1.6605, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in /dev/shm/PEFT/model/ directory
best eval loss on epoch 1 is 1.660506010055542
Epoch 1: train_perplexity=5.3824, train_epoch_loss=1.6831, epoch time 12864.613309495151s
Training Epoch: 2/3, step 387/388 completed (loss: 1.6909533739089966): 100%|███████████████| 388/388 [3:33:44<00:00, 33.05s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 24 GB
Peak active CUDA memory was 21 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 100%|█████████████████████████████████████████████████████████████████████████| 84/84 [04:29<00:00, 3.20s/it]
eval_ppl=tensor(5.2127, device='cuda:0') eval_epoch_loss=tensor(1.6511, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in /dev/shm/PEFT/model/ directory
best eval loss on epoch 2 is 1.6511057615280151
Epoch 2: train_perplexity=5.1402, train_epoch_loss=1.6371, epoch time 12824.521782848984s
Training Epoch: 3/3, step 11/388 completed (loss: 1.5718340873718262): 3%|▌ | 12/388 [06:36<3:26:57, 33.03s/it]
Training Epoch: 3/3, step 387/388 completed (loss: 1.6727845668792725): 100%|███████████████| 388/388 [3:33:37<00:00, 33.03s/it]
Max CUDA memory allocated was 21 GB
Max CUDA memory reserved was 24 GB
Peak active CUDA memory was 21 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 84/84 [04:28<00:00, 3.20s/it]
eval_ppl=tensor(5.1962, device='cuda:0') eval_epoch_loss=tensor(1.6479, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in /dev/shm/PEFT/model/ directory
best eval loss on epoch 3 is 1.647936224937439
Epoch 3: train_perplexity=5.0411, train_epoch_loss=1.6176, epoch time 12817.443107374012s
Key: avg_train_prep, Value: 5.1879143714904785
Key: avg_train_loss, Value: 1.6459529399871826
Key: avg_eval_prep, Value: 5.223653316497803
Key: avg_eval_loss, Value: 1.6531827449798584
Key: avg_epoch_time, Value: 12835.526066572716
Key: avg_checkpoint_time, Value: 0.040507279336452484
real 659m9.844s
user 349m26.981s
sys 351m6.738s
The following table summarizes the performance of the fine-tuning of the llama2 7B model:
| Model     | GPU         | Epochs | Wall Time |
| --------- | ----------- | ------ | --------- |
| llama2 7B | Nvidia V100 | 3      | 10h 50m   |
The full job script (below) that reproduces the results can be found
at /home/shared/fine_tune_llama_7b/job.sh
. It can be copied to your
home directory and executed as follows (replace test04 with your username):
#!/bin/bash
#SBATCH --job-name=llama7b-finetune
#SBATCH --account=test04
#SBATCH --partition=msfea-ai
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:v100d32q:1
#SBATCH --mem=32000
#SBATCH --time=0-12:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=test04@mail.aub.edu
module load llama
cp -fvr /apps/sw/llama-recipes . && cd llama-recipes
git checkout 2e768b1
pip install .
rsync -PrlHvtpog /scratch/shared/ai/models/llms/llama/llama-2-7b-hf /dev/shm/
mkdir models
ln -s /dev/shm/llama-2-7b-hf models/7B
time python -m llama_recipes.finetuning \
--use_peft --peft_method lora --quantization \
--model_name models/7B --output_dir /dev/shm/PEFT/model/
Todo
Add instructions for resuming from an epoch
Todo
Add instructions for providing a custom fine-tuning dataset
Fine tuning llama2 13B#
Prior to fine-tuning the 13B llama2 model, it must be sharded in order to fit on two or four V100 GPUs.
Note
It is not yet confirmed whether the 13B model can be fine-tuned on two GPUs; this needs to be verified.
Fine-tuning using llama_recipes#
# 4 GPUs
python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name models/13B --output_dir /dev/shm/PEFT/model

# master
$ torchrun --nproc-per-node=1 --nnodes=4 --node-rank=0 --master-addr=onode10 --master-port=4444 examples/finetuning.py --use_peft --peft_method lora --quantization --model_name models/13B --output_dir /dev/shm/PEFT/model

# slaves
$ torchrun --nproc-per-node=1 --nnodes=4 --node-rank=1 --master-addr=onode10 --master-port=4444 examples/finetuning.py --use_peft --peft_method lora --quantization --model_name models/13B --output_dir /dev/shm/PEFT/model
$ torchrun --nproc-per-node=1 --nnodes=4 --node-rank=2 --master-addr=onode10 --master-port=4444 examples/finetuning.py --use_peft --peft_method lora --quantization --model_name models/13B --output_dir /dev/shm/PEFT/model
$ torchrun --nproc-per-node=1 --nnodes=4 --node-rank=3 --master-addr=onode10 --master-port=4444 examples/finetuning.py --use_peft --peft_method lora --quantization --model_name models/13B --output_dir /dev/shm/PEFT/model
Serving models using ollama#
A number of models are available on octopus. They can be listed with:
$ ollama list
NAME ID SIZE MODIFIED
codellama:34b 685be00e1532 19 GB 9 days ago
codellama:70b e59b580dfce7 38 GB 2 days ago
codellama:70b-code f51f75d243f2 38 GB 2 days ago
codellama:70b-instruct e59b580dfce7 38 GB 2 days ago
deepseek-coder:1.3b 3ddd2d3fc8d2 776 MB 8 days ago
....
....
....
Since downloading large models is time consuming, please email
it.helpdesk@aub.edu.lb
to request additional models that are not in the list above.
Note
The environment variable OLLAMA_MODELS is set to
/scratch/shared/ai/models/llms/ollama/models
. This is the default location where the models are stored. If you would
like to use a different location, you can set the environment variable
OLLAMA_MODELS to the desired location. If a model needs to be loaded and
offloaded multiple times for some reason (such as a script that runs
repeatedly, exits, and is re-run), then caching the models to /dev/shm
is a good idea. In this case set the environment variable OLLAMA_MODELS
to /dev/shm/ollama/models and put your models there by copying them
from the default location.
Todo
add a bash function that caches a certain named model to /dev/shm
Load and list the models#
module load ollama
ollama list
Run a model in interactive mode#
module load ollama
ollama serve > /dev/null 2>&1 &
# wait a bit (~ 20 seconds) until the server is up and running
ollama run phi:latest
Run a model in batch mode#
Create a python script that uses the ollama client to run a model.
In the example below the phi
model is used since it is small and can
be loaded quickly.
import ollama
response = ollama.chat(model='phi', messages=[
{
'role': 'user',
'content': 'Why is the sky blue?',
},
])
print(response['message']['content'])
module load ollama
module load python/ai-4
ollama serve > /dev/null 2>&1 &
sleep 20
python ollama_eval.py
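The ollama Python client also supports streaming the response chunk by chunk, which is convenient for long generations. The following is a minimal sketch (assuming the ollama server started above is still running and the phi model is available):
import ollama

# stream the answer as it is generated instead of waiting for the full response
stream = ollama.chat(
    model='phi',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()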
Quantizing models#
Todo
add notes here