LoRA (Low Rank Adaptation)

Introduction

LoRA, also known as Low-Rank Adaptation, is a fine-tuning technique that further trains a pretrained model by introducing low-rank trainable parameters while keeping the original (larger-dimension) pretrained parameters frozen.

../../_static/resources/lora/lora_intro.png
  • During fine-tuning, LoRA branches (aka adapters) containing low-rank matrices are attached to the graph; only the adapter weights are trained, and the resulting weights are stored separately (e.g., as safetensors)

  • For each specialized task (i.e., specialized training data) the model is fine-tuned separately; even if the same LoRA branch is used for each task, it produces different adapter weights (a different safetensors file)

  • During inference, these task-specific weights can be loaded into the adapter branches and executed for the corresponding specialized tasks

  • Architecture:

  • Assume W0, with dimension d x d, is the pre-trained weight matrix

  • When the model is fine-tuned, the new weight is W = W0 + ΔW (note: ΔW has the same dimension, d x d)

  • Using LoRA, ΔW is decomposed into two smaller matrices (A and B) such that

    • Dimension of A is d x r and dimension of B is r x d

    • Matrix multiplication of A and B will return d x d dimension (ΔW)

  • So, h = W0x + ΔWx = W0x + ABx

  • Further, a strength α (a scalar with value between 0 and 1) can be applied to scale the effect of the LoRA adapter weights, such that ΔWx = αABx

../../_static/resources/lora/lora_intro_alpha.png
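The decomposition above can be sketched numerically. The following is a minimal plain-Python illustration (the helper names are hypothetical, not part of any SDK) showing that the branch product A·B recovers a d x d update from only 2·d·r trainable values, and that the scalar α scales its contribution:

```python
# Minimal LoRA forward-pass sketch: h = W0·x + α·(A·B)·x
# Plain-Python matrix helpers (hypothetical names, stdlib only).

def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def matvec(M, x):
    """Multiply matrix M by column vector x."""
    return [sum(M[i][k] * x[k] for k in range(len(x))) for i in range(len(M))]

d, r = 4, 1          # hidden size d, LoRA rank r (r << d)
alpha = 0.5          # adapter strength, a scalar in [0, 1]

W0 = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weights
A = [[1.0] for _ in range(d)]      # d x r trainable matrix
B = [[1.0, 0.0, 0.0, 0.0]]         # r x d trainable matrix

x = [1.0, 2.0, 3.0, 4.0]

delta_W = matmul(A, B)             # d x d, same shape as W0
assert len(delta_W) == d and len(delta_W[0]) == d

# h = W0·x + alpha * (A·B)·x
h = [w + alpha * dw for w, dw in zip(matvec(W0, x), matvec(delta_W, x))]

# The branch trains 2*d*r parameters instead of d*d
print(d * d, 2 * d * r)   # 16 base-sized values vs 8 adapter values
print(h)
```

Merging ΔW back into W0 after training, or keeping it as a separate branch, yields the same output; keeping the branch separate is what makes per-task adapter swapping possible.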


LoRA Terminologies

  • Base-model: The original LLM, LVM, or LMM pre-trained model

  • Adapters: Additional model parameters and add-on architecture to be fine-tuned

    • Branches: augmentations to the base-model that enable adapters

    • Adapter Weights: populated weights on LoRA branches after fine-tuning the model

    • Attachment-points: Set of base-model “modules” where branches are attached to enable the adapter (also referred to as “target-modules” in PEFT world)

    • Strength or LoRA Alpha: A scalar multiplier in [0, 1] applied to each adapter (common for all attachment points), also referred to as alpha “\(\alpha\)”.

  • Compatible Adapters: Adapters that share the same rank and attachment points

  • Incompatible Adapters: Adapters that differ in rank and/or attachment points

  • Adapter concurrency (also termed an adapter use-case): A model configuration that specifies the base-model and one or more enabled adapters. Adapters may be trained independently, but a concurrency combines them at inference time

    • Single adapter concurrency: One adapter is enabled/active

    • Multi-adapter concurrency: Two or more adapters are enabled/active
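A concurrency can be pictured as summing the enabled adapter branches on top of the base-model output. The sketch below (plain Python with hypothetical names; not the QAIRT API) models each adapter as contributing α·(A·B)·x, with an alpha of 0 disabling a branch:

```python
# Sketch of adapter concurrency: each enabled adapter adds its branch output.
# Hypothetical plain-Python model, not the QAIRT SDK API.

def branch_out(A, B, alpha, x):
    """alpha * (A @ B) @ x for one LoRA branch, with A: d x r, B: r x d."""
    Bx = [sum(B[k][j] * x[j] for j in range(len(x))) for k in range(len(B))]
    return [alpha * sum(A[i][k] * Bx[k] for k in range(len(Bx)))
            for i in range(len(A))]

d = 2
W0 = [[1.0, 0.0], [0.0, 1.0]]      # frozen base weights
x = [1.0, 1.0]
base = [sum(W0[i][j] * x[j] for j in range(d)) for i in range(d)]

# Two independently trained adapters (same rank, same attachment point)
adapters = {
    "task_a": ([[1.0], [0.0]], [[1.0, 0.0]]),   # (A, B), rank 1
    "task_b": ([[0.0], [1.0]], [[0.0, 1.0]]),
}

def run(alphas):
    """Combine the base output with every enabled adapter branch."""
    h = list(base)
    for name, (A, B) in adapters.items():
        for i, v in enumerate(branch_out(A, B, alphas.get(name, 0.0), x)):
            h[i] += v
    return h

single = run({"task_a": 1.0})                 # single-adapter concurrency
multi = run({"task_a": 1.0, "task_b": 1.0})   # multi-adapter concurrency
print(single, multi)
```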

  • Quantization Encodings Flexibility

    • Non-updatable Encodings: All the use-cases/concurrencies share the same set of quantization encodings

    • All-updatable Encodings (Fully-updatable): Each use-case/concurrency has its own set of quantization encodings

    • Adapter-only-updatable Encodings: Across all use-cases/concurrencies, only the tensors in LoRA branch may have different encodings

  • Use-case/Concurrency Switch

    • Hot Switch: The adapter switch is achieved by changing LoRA alphas, with the adapter weights already preloaded in RAM

    • Cold Switch: The adapter switch first requires loading the adapter weights from ROM into RAM
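The hot/cold distinction can be sketched as follows (plain Python with hypothetical names; ROM and RAM are simulated with dicts, and this is not the SDK API):

```python
# Sketch of hot vs. cold adapter switching (hypothetical, not the SDK API).
# "ROM" and "RAM" are simulated as plain dicts of adapter weights.

rom = {"task_a": {"A": [[1.0]], "B": [[1.0]]},
       "task_b": {"A": [[2.0]], "B": [[2.0]]}}

class AdapterRuntime:
    def __init__(self, preload=()):
        # Hot switching requires the weights to already be resident in RAM.
        self.ram = {name: rom[name] for name in preload}
        self.alphas = {name: 0.0 for name in self.ram}

    def hot_switch(self, name):
        """Only the alphas change; weights were preloaded, so this is fast."""
        if name not in self.ram:
            raise KeyError(f"{name} not preloaded; a cold switch is required")
        self.alphas = {n: (1.0 if n == name else 0.0) for n in self.ram}

    def cold_switch(self, name):
        """Weights must first be copied from ROM into RAM, then enabled."""
        self.ram[name] = rom[name]          # simulated ROM -> RAM load
        self.alphas = {n: (1.0 if n == name else 0.0) for n in self.ram}

rt = AdapterRuntime(preload=["task_a", "task_b"])
rt.hot_switch("task_b")                     # no weight load, alphas flip
print(rt.alphas)
```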

LoRA Solutions Feature Summary

QAIRT SDK (Release 2.34 onwards) supports three different generations of solutions for LoRA:

LoRAv1 – LoRA adapter weights are integrated into the graph as inputs. All use-cases share the same set of quantization encodings, referred to as non-updatable encodings. Use-case or adapter switching is achieved by modifying the graph inputs.

LoRAv2 – LoRA adapter weights are integrated into the graph as parameters. Each use-case has its own set of quantization encodings to ensure optimal accuracy, referred to as all-updatable encodings. Adapter switching is accomplished by loading a patch to replace the adapter weights within the graph.

LoRAv3 - LoRA adapter weights are integrated into the graph as parameters. In addition to supporting legacy LoRAv1 and LoRAv2 flows, LoRAv3 supports multi-adapter concurrency and incompatible adapters (different ranks, different attachment points). Across all use cases, only the tensors in the LoRA branch may have different encodings, referred to as adapter-only-updatable. Notably, the adapter strength (alpha) is no longer a scalar but a vector instead.
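The concatenation-based fusion that enables multi-adapter concurrency, together with the vector-valued alpha, can be sketched numerically. In this plain-Python illustration (hypothetical names; not the SDK's internal representation), two adapters of ranks r1 and r2 are fused into a single branch of rank r1 + r2, with one alpha entry per rank column:

```python
# Sketch: fusing two adapters of different ranks into one LoRA branch.
# A1: d x r1, A2: d x r2  ->  A: d x (r1 + r2)  (column concatenation)
# B1: r1 x d, B2: r2 x d  ->  B: (r1 + r2) x d  (row concatenation)
# alpha becomes a vector with one entry per rank column.

d, r1, r2 = 2, 1, 2

A1 = [[1.0], [0.0]]
B1 = [[1.0, 0.0]]
A2 = [[1.0, 0.0], [0.0, 1.0]]
B2 = [[0.0, 1.0], [1.0, 0.0]]

A = [row1 + row2 for row1, row2 in zip(A1, A2)]   # d x (r1 + r2)
B = B1 + B2                                       # (r1 + r2) x d
alpha = [1.0] * r1 + [0.5] * r2                   # per-adapter strengths, per rank

x = [1.0, 2.0]
# Branch output: A @ diag(alpha) @ B @ x
Bx = [sum(B[k][j] * x[j] for j in range(d)) for k in range(r1 + r2)]
scaled = [alpha[k] * Bx[k] for k in range(r1 + r2)]
h_branch = [sum(A[i][k] * scaled[k] for k in range(r1 + r2)) for i in range(d)]
print(h_branch)
```

Setting an adapter's alpha entries to zero disables that adapter within the fused branch, which is what allows per-adapter enable/disable without restructuring the graph.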

The table below summarizes these differences:

| Adapter Weight Solutions | Weights as graph inputs ¹ | Non-updatable encodings | Adapter-only updatable encodings | Fully-updatable encodings | Multi-adapter concurrency ² |
|--------------------------|---------------------------|-------------------------|----------------------------------|---------------------------|-----------------------------|
| Legacy V1                | Yes                       | Yes                     | No                               | No                        | No                          |
| Legacy V2                | No                        | No                      | No                               | Yes                       | No                          |
| V3                       | No                        | Yes ³                   | Yes                              | No                        | Yes                         |

Notes:

  • 1 With weights as graph inputs, encodings are non-updatable

  • 2 Multi-adapter concurrency and incompatible-adapter support are achieved by concatenating adapters together

  • 3 An optional hot-switch configuration for adapters (zero switch latency) is available when encodings are not updatable

The following graphical representations illustrate the differences:

LoRAv1

../../_static/resources/lora/lora_intro_loraV1.png

LoRAv2

../../_static/resources/lora/lora_intro_loraV2.png

LoRAv3

../../_static/resources/lora/lora_intro_loraV3.png