I Compared PEFT-LoRA vs Full Fine-Tuning on OpenAI’s Whisper

Tim Cvetko

ML Engineer
The need for increasingly domain-specific LLMs is driving a wave of advances aimed at surpassing the limitations of the truly “large” language models. At the expense of generalisability, fine-tuned models are being developed to cover niche reasoning, namely BloombergGPT, FinanceGPT, etc. Now, algorithms like LoRA make LLM fine-tuning possible on local machines.
With that in mind, I wanted to put PEFT-LoRA to the test. Here’s the experiment:

Compare a Whisper model fine-tuned with PEFT-LoRA to a fully fine-tuned Whisper model along these dimensions:
Total Training Time
Inference Speed
Total Benchmark Accuracy
Number of Parameters
This should give us a bit of perspective on the algorithm's effectiveness. Here’s what this article contains:
Intuitive Understanding of PEFT-LoRA Fine-Tuning
Overview of the Training Process (+ Code + Stats)

Who should read this?

Who is this blog post useful for? ML Researchers, but also VCs, consultants, etc.
How advanced is this post? Anybody previously acquainted with ML terms should be able to follow along.
Replicate my code: GitHub or Colab

What the Heck is LoRA? Or PEFT? Or QLoRA?!?

(Skip to training if you know this stuff)
Def.
PEFT = parameter-efficient fine-tuning. Full fine-tuning can lead to catastrophic forgetting because it changes all of the model’s parameters. Since PEFT only updates a small subset of parameters, it is more robust against this catastrophic forgetting effect. PEFT strikes a balance between retaining knowledge from the pre-trained model and adapting it effectively to target tasks with fewer parameters. PEFT weights are trained separately for each task.
LoRA = low-rank adaptation of the original model weights. LoRA is a strategy that reduces the number of parameters to be trained during fine-tuning by freezing all of the original model parameters and then injecting a pair of rank decomposition matrices alongside the original weights.
QLoRA is an even more memory-efficient version of LoRA, where the pretrained model is loaded into GPU memory as quantized 4-bit weights (compared to 8-bit in the case of LoRA).
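
To make the low-rank idea concrete, here is a minimal PyTorch sketch (my own illustration, not code from the experiment) of how a LoRA layer wraps a frozen linear weight with a pair of rank-r matrices; the class and variable names are assumptions chosen for clarity.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Toy illustration: output = base(x) + scaling * x @ (B @ A)^T, with the base weights frozen.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # freeze the original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # rank-r "down" matrix
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))        # rank-r "up" matrix, starts at zero
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # frozen original path + trainable low-rank update path
        return self.base(x) + self.dropout(x) @ (self.lora_B @ self.lora_A).T * self.scaling

Because lora_B starts at zero, the wrapped layer initially behaves exactly like the frozen original, and only the small A/B matrices receive gradient updates.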

Let’s Run this Baby

(Here’s What You Need to Know)
For LoRA, I’m using a dropout of 0.5 and a decomposition rank r = 32.
Training with batch_size=4 for 3 epochs on the Common Voice dataset, using the WhisperProcessor for openai/whisper-large-v2.
I’m running this on TPU4 on Google Colab. Do it yourself, too!
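
Here’s a rough sketch of how that LoRA configuration can be attached to Whisper with the peft and transformers libraries. The r and lora_dropout values mirror the settings above; lora_alpha and target_modules are my assumptions (common attention-projection targets), not values confirmed by the original run.

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import LoraConfig, get_peft_model

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

lora_config = LoraConfig(
    r=32,                                  # decomposition rank from the setup above
    lora_alpha=64,                         # assumption: 2 * r is a common choice
    lora_dropout=0.5,                      # dropout from the setup above
    target_modules=["q_proj", "v_proj"],   # assumption: attention query/value projections
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # prints the small trainable-parameter fraction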

Training Whisper with PEFT-LoRA

Seq2SeqTrainingArguments:
Seq2SeqTrainingArguments is a class provided by the transformers library, specifically designed for configuring training parameters for sequence-to-sequence models.
per_device_train_batch_size: Batch size per device (usually GPU) during training. Here, each device processes a batch of 8 samples.
gradient_accumulation_steps: Number of steps to accumulate gradients before updating the model weights. It's set to 1, meaning gradients are updated after each batch.
learning_rate: Initial learning rate for the optimizer.
warmup_steps: Number of steps for which the learning rate increases linearly from 0 to the initial learning rate.
num_train_epochs: Number of training epochs (passes through the entire training dataset).
evaluation_strategy: Strategy for evaluating the model during training. Here, evaluation is done at the end of each epoch.
fp16: Whether to use 16-bit floating-point precision (mixed precision training) to speed up training and reduce memory usage.
per_device_eval_batch_size: Batch size per device for evaluation.
generation_max_length: Maximum length of generated sequences during inference.
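
Putting those arguments together, the PEFT-LoRA run’s configuration looks roughly like this; output_dir, learning_rate, warmup_steps, and generation_max_length are placeholder values I chose for illustration, since they are not spelled out above.

from transformers import Seq2SeqTrainingArguments

lora_training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v2-lora",    # placeholder output directory
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-3,                    # assumed value; typical for LoRA adapters
    warmup_steps=50,                       # assumed value
    num_train_epochs=3,
    evaluation_strategy="epoch",
    fp16=True,
    per_device_eval_batch_size=8,
    generation_max_length=128,             # assumed value
)

These arguments are then passed to a Seq2SeqTrainer together with the PEFT-wrapped model, the processed Common Voice splits, and a data collator that pads the audio features and label sequences.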

Full Fine-Tuning Whisper

TrainingArguments:
per_device_train_batch_size: This parameter defines the batch size per GPU during training. In this case, each GPU processes a batch of 4 samples.
gradient_accumulation_steps: This parameter determines how many steps to accumulate gradients before updating the model weights. Here, gradients are accumulated over 4 steps before updating.
warmup_steps: Number of steps for which the learning rate increases linearly from 0 to the initial learning rate. This helps in stabilizing the training process by preventing large updates at the beginning.
max_steps: Maximum number of training steps. Training will stop once this number of steps is reached.
learning_rate: Initial learning rate for the optimizer. Here, it is set to 2e-4.
fp16: Whether to use 16-bit floating-point precision (mixed precision training) to speed up training and reduce memory usage.
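
For the full fine-tune, the listed settings translate into a configuration along these lines; output_dir, warmup_steps, and max_steps are placeholders, since the exact values are not given above.

from transformers import TrainingArguments

full_ft_args = TrainingArguments(
    output_dir="whisper-large-v2-full-ft", # placeholder output directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=500,                      # assumed value
    max_steps=4000,                        # assumed value
    learning_rate=2e-4,
    fp16=True,
)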

Conclusion

LoRA fine-tuning demonstrates superior parameter efficiency, with 880k trainable parameters compared to 74M in full fine-tuning.
LoRA fine-tuning significantly reduces training time to just 56 minutes, whereas full fine-tuning requires 456 minutes.
Full fine-tuning exhibits faster inference speed, at 88 tokens/sec compared to 15 tokens/sec with LoRA fine-tuning.
Both approaches achieve comparable Word Error Rate (WER) accuracy, with LoRA fine-tuning slightly higher at 36.14 compared to 35.64 for full fine-tuning.
The choice between LoRA fine-tuning and full fine-tuning depends on your specific use case, but for local LLM development, parameter-efficient methods are not just a nice-to-have; they are practically a necessity.
Replicate this code via GitHub or Colab.

Hey, thanks for reading!

Thanks for getting to the end of this article. My name is Tim, and I love to break down ML research papers and ML applications with an emphasis on business use cases! Get in touch!