LoRA (Low-Rank Adaptation) is a technique for reducing the cost of fine-tuning large models.

Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.^{1}

Aside from the benefits in training speed and memory, LoRA makes it possible to deploy "weights as modules": a single larger base model can be shared, with small task-specific adapters swapped in to make it more accurate for specific tasks.

A traditional fine-tuning approach updates all of the weights and records the difference from the original weights ($\Delta W$). Instead, LoRA decomposes the weight update, relying on the intrinsic rank hypothesis: the update learned during adaptation has a low intrinsic rank, so not all directions in weight space are equally important.

A full model is expressed by its weights ($W$), which fine-tuning splits into the original pre-trained weights and the deltas learned during fine-tuning ($W=W_{0}+\Delta W$). At this point $\Delta W$ has the same dimensions as the original weights, but it can be approximated by the product of two much smaller matrices, $A$ and $B$, such that $\Delta W=AB$. If $\Delta W$ has dimensions $d\times k$, those values are fixed by the model, but we can choose a rank $r$ (much smaller than $d$ and $k$) and size the new matrices so that $\dim(A)=d\times r$ and $\dim(B)=r\times k$.
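To see why this saves parameters, here is a small sketch with hypothetical sizes ($d$, $k$, and $r$ below are made-up values for illustration, not from any particular model):

```python
import numpy as np

# Hypothetical sizes for illustration: a d x k weight matrix and a chosen rank r.
d, k, r = 1024, 1024, 8

# Full fine-tuning stores a delta with d * k entries.
full_params = d * k          # 1,048,576

# LoRA stores A (d x r) and B (r x k) instead.
lora_params = d * r + r * k  # 16,384

print(full_params / lora_params)  # 64x fewer trainable parameters

# The low-rank product AB still has the same shape as the full delta.
A = np.random.randn(d, r)
B = np.random.randn(r, k)
assert (A @ B).shape == (d, k)
```

The savings grow with the size of the weight matrix, since the full delta scales as $d\cdot k$ while the factors scale as $r\cdot(d+k)$.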

Finally, $A$ is initialized with a random Gaussian and $B$ is initialized to zero, so that $\Delta W=AB=0$ and the model starts out unchanged ($W=W_{0}$). Then the pre-trained weights $W_{0}$ are frozen, and only $A$ and $B$ are trained.