Everything you need to know about LoRA fine-tuning, from theory and inner workings to implementation.
In the current era of large language models (LLMs), new models are being released more frequently than ever. However, most of these models are not ready for immediate use in a specific domain, so we need an additional adaptation stage to customize the model for our use case. This stage is called fine-tuning. During fine-tuning, the weights of the model are adjusted. Depending on how we modify the model's weights, we can classify fine-tuning strategies into a few categories.
Throughout this article, we will discuss LoRA fine-tuning, explain why it’s a more efficient fine-tuning strategy compared to other options, and list its advantages and disadvantages.
LoRA is short for Low-Rank Adaptation. LoRA injects trainable rank decomposition matrices into selected pre-trained layers while keeping the pre-trained weights frozen.
When I first read LoRA's abstract, two questions came to mind: what exactly are rank decomposition matrices, and why inject them into the model while keeping the pre-trained weights frozen, instead of fine-tuning the model the usual way?
I will be using these two questions to explain LoRA. Let's begin!
Rank decomposition matrices are matrices that represent a larger matrix as a product of smaller ones. The goal is to reduce the number of trainable parameters by expressing one large matrix using two smaller matrices. Let's take an example. Assume we have a 10 x 10 matrix; we can approximate it using a 10 x r matrix and an r x 10 matrix, where r is the rank we choose for the decomposition (much smaller than the original dimension).
The initial 10 x 10 matrix had 100 elements, or in machine learning terms, 100 trainable parameters. After decomposition with r = 2, this reduces to 10 x 2 + 2 x 10 = 40 parameters. In this toy example the saving looks modest, but at the scale of real models it translates into huge compute and memory savings!
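To make the arithmetic concrete, here is a minimal NumPy sketch of the idea, using the same sizes as the example above (the variable names are my own):

```python
import numpy as np

d, r = 10, 2  # original dimension and chosen rank, as in the example above

# Two small matrices whose product reconstructs a full 10 x 10 update.
B = np.random.randn(d, r)   # 10 x 2
A = np.random.randn(r, d)   # 2 x 10
delta_W = B @ A             # 10 x 10, but built from far fewer parameters

full_params = d * d                    # 100 trainable parameters in the full matrix
decomposed_params = A.size + B.size    # 10*2 + 2*10 = 40 trainable parameters
print(delta_W.shape, full_params, decomposed_params)  # (10, 10) 100 40
```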
LoRA and adapters differ in how they are attached to the model. Adapters add extra layers in sequence with the original model, which increases the network's depth and causes additional latency during the inference phase.
As we can see, this additional latency is problematic in latency-sensitive applications, so we need an approach that avoids it. What if we do the following?
The method described in the above diagram is called regular fine-tuning.
As shown in the third diagram, the model learns a new task while maintaining its depth by combining pre-trained and learned weights before the output layer.
But this approach again has several inefficiencies, the most obvious being that the weight update ΔW is the same size as the original weight matrix W, so we still train just as many parameters as in full fine-tuning.
To tackle these challenges, researchers have introduced LoRA, which involves reducing the learnable parameters in newly initialized layers.
This is where the previously discussed rank decomposition comes into play: instead of training the large weight-update matrix (ΔW) directly, we train two much smaller matrices (A and B) whose product approximates ΔW.
Let's assume that our W matrix has a size of 2,500 x 2,500, which means we have a total of 6.25 million parameters. Now, suppose we select r as 8. This results in having two LoRA matrices of size 2,500 x 8 and 8 x 2,500, respectively. In this case, the total number of trainable parameters is 40,000, which is a reduction of approximately 156 times.
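A quick back-of-the-envelope check of these numbers (a throwaway snippet, not part of any library):

```python
d, r = 2_500, 8

full_params = d * d            # 6,250,000 parameters in the full ΔW
lora_params = d * r + r * d    # 2,500*8 + 8*2,500 = 40,000 parameters
print(full_params // lora_params)  # ~156x fewer trainable parameters
```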
In summary, we can represent the new weights as W_new = W + ΔW ≈ W + B·A, where B and A are the two small LoRA matrices.
In the original paper, the researchers introduced a scaling factor of α/r for ΔW. This controls how much weight is put on the LoRA update relative to the pre-trained weights. The final equation looks like this: W_new = W + (α/r)·B·A
In this equation, alpha is another hyperparameter we should tune.
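To see how all of this fits together, here is a minimal PyTorch sketch of a LoRA-style linear layer implementing the equation above (a simplified illustration, not the official implementation; the class and variable names are my own):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update: h = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Pre-trained weight W is kept frozen.
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Low-rank matrices A and B are the only trainable parameters.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero-init so ΔW starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T                                    # frozen pre-trained path
        lora = (x @ self.lora_A.T) @ self.lora_B.T * self.scaling   # low-rank update path
        return base + lora

layer = LoRALinear(in_features=2_500, out_features=2_500, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 40000 -- only A and B are trained
```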
When implementing LoRA for tasks like this, my go-to solution is Hugging Face's peft library. To install it, you can use pip (`pip install peft`). After installation, we can customize the LoRA settings as per our requirements: we can choose which layers to add LoRA to, the rank r, the LoRA alpha, and more. When selecting alpha, a common recommendation is to set it to two times r.
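A rough sketch of what that configuration can look like with peft (the base model and the `target_modules` list below are assumptions; they depend on the architecture you are fine-tuning):

```python
# pip install peft transformers
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # any causal LM works here

lora_config = LoraConfig(
    r=8,                        # rank of the decomposition
    lora_alpha=16,              # alpha, set to 2 * r as suggested above
    target_modules=["c_attn"],  # which layers get LoRA; depends on the model (assumption)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```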
LoRA fine-tuning is a type of parameter-efficient fine-tuning that involves fewer learnable parameters due to matrix approximation. One of its advantages is that we can have multiple LoRA weights specialized for different tasks, which can be swapped based on the task we are working on. The small size of LoRA makes it easy to store and swap between models. However, swapping delays could be an issue in some cases, so it's important to be mindful of when to use regular fine-tuning, adapters, and LoRA.
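As a sketch of what swapping task-specific LoRA weights can look like with peft (the adapter paths and names here are hypothetical):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Load one set of LoRA weights on top of the frozen base model (paths are hypothetical).
model = PeftModel.from_pretrained(
    base_model, "lora-weights/summarization", adapter_name="summarization"
)

# Load a second set of LoRA weights and switch between them without reloading the base model.
model.load_adapter("lora-weights/qa", adapter_name="qa")
model.set_adapter("qa")             # activate the QA adapter
model.set_adapter("summarization")  # switch back to the summarization adapter
```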