Instead of updating the pre-trained model weights $$W_0$$ directly, a low-rank decomposition is added, $$W_0 + BA$$, where $$B \in R^{d \times r}$$, $$A \in R^{r \times k}$$, and only $$B$$ and $$A$$ are fine-tuned while $$W_0$$ stays frozen. After training, $$BA$$ can be added to $$W_0$$ to get the final model (sketched in code after the list below).

• Inference cost is the same: $$Wx = W_0x + BAx = (W_0+BA)x$$
• It is efficient and has a small training footprint (only the $$BA$$ matrices are trained)
• Swapping models in deployment is fast: $$W = (W - BA) + B^\prime A^\prime$$
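A minimal sketch of the idea in PyTorch follows; the class name, shapes, and initialization are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W0 plus a trainable low-rank update BA (illustrative sketch)."""

    def __init__(self, d: int, k: int, r: int):
        super().__init__()
        # Pre-trained weight W0 (d x k), kept frozen during fine-tuning.
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # Trainable low-rank factors: B (d x r) starts at zero so BA = 0 initially,
        # A (r x k) gets a small random init.
        self.B = nn.Parameter(torch.zeros(d, r))
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Wx = W0 x + B A x, the same form as a single dense layer at inference.
        return x @ self.W0.T + (x @ self.A.T) @ self.B.T

    @torch.no_grad()
    def merged_weight(self) -> torch.Tensor:
        # After training, fold the update back in: W = W0 + BA.
        return self.W0 + self.B @ self.A
```

A layer built this way exposes only $$B$$ and $$A$$ to the optimizer; $$W_0$$ contributes to the forward pass but receives no gradient.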
The goal is to use a small number of parameters when adapting to new tasks. The hypothesis is that the updates to the network during fine-tuning have a low “intrinsic rank”, so we can restrict the weight update to the low-rank decomposition $$W_0 + \Delta W = W_0 + BA$$, where $$B \in R^{d\times r}$$, $$A \in R^{r\times k}$$, and $$r \ll \min(d,k)$$. If $$r$$ is as large as $$\min(d,k)$$ (e.g., $$r=d$$ for a square weight), the decomposition can express any $$\Delta W$$, so in theory it can recover full fine-tuning.
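For a single $$d \times k$$ weight, the saving follows directly from the factor shapes above: a full update $$\Delta W$$ has $$dk$$ parameters, while the factors store only $$dr + rk = r(d+k)$$, which is far smaller whenever $$r \ll \min(d,k)$$.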
For multiple tasks during inference, a different $$(BA)_t$$ can be swapped in for each task $$t$$ while keeping the large $$W_0$$ in memory. For reference, $$BA$$ has $$2dr$$ parameters while $$W_0$$ has $$d^2$$ (taking a square $$d \times d$$ weight).
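A hedged sketch of this per-task swap (the task names, sizes, and random adapter values are made up for illustration, assuming the square $$d \times d$$ case):

```python
import torch

d, r = 4096, 8
W0 = torch.randn(d, d)  # large frozen base weight, kept in memory once

# One small (B, A) pair per task (illustrative random values).
adapters = {
    "task_a": (torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01),
    "task_b": (torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01),
}

# Serve task_a with the merged weight W = W0 + BA.
B, A = adapters["task_a"]
W = W0 + B @ A

# Switch to task_b: subtract the old update, add the new one.
B_new, A_new = adapters["task_b"]
W = (W - B @ A) + B_new @ A_new

# Per-adapter storage is 2*d*r values vs. d*d for W0.
print(2 * d * r, d * d)  # 65536 vs 16777216
```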