LoRA v2 Overview

../../_static/resources/lora/qnn_tutorial_lora_intro.png

Adapter Requirements

  • PEFT-based Adapters (PEFT = Parameter-Efficient Fine-Tuning)

  • All adapters attached to a given base graph must:

    1. Share the same architecture and attachment points

    2. Share the same max rank and the same precision
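The shared-rank requirement above follows from the LoRA formulation itself: each adapter contributes a low-rank delta to a frozen base weight, scaled by alpha over the rank. A minimal sketch (illustrative math only, not the SDK API; all dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4.0   # hypothetical dimensions; r is the shared max rank

W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # adapter down-projection
B = np.zeros((d_out, r))                 # adapter up-projection (conventionally zero-initialized)

# Effective weight seen at inference: W + (alpha / r) * B @ A
W_eff = W + (alpha / r) * B @ A

# With B initialized to zero, the adapter is a no-op until trained.
assert np.allclose(W_eff, W)
```

Because every adapter for a base graph shares the same rank and attachment points, the runtime can swap the (B, A) pair without rebuilding the base graph.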

Supported Functionality

  • Currently supported for Android only

  • Currently supported for ONNX models and the HTP backend only

  • Apply a single adapter at a given time (i.e., a single branch)

  • Dynamic switching of adapters (i.e., without loading/unloading the base model)

  • Regain accuracy lost to quantization by creating tailored encodings per adapter during offline conversion

    • Adapter weights are quantized, and each weight has its own encodings

    • Activation encodings for each adapter are different and are optimized during the calibration process

  • Requires full offline preparation of both the base model and adapters, with quantization performed by AIMET

    • Offline preparation is only supported on Linux x86 host platforms

    • Disclaimer: Both the base model context binary and the adapter binary files MUST be prepared using the same QAIRT SDK version.
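The per-adapter encodings described above can be illustrated with a minimal asymmetric min/max quantizer. This is a simplified stand-in, not the AIMET API; the function name and the 8-bit choice are assumptions for illustration:

```python
import numpy as np

def compute_encoding(tensor, bitwidth=8):
    """Derive an asymmetric quantization encoding (scale, offset) from a
    tensor's observed range -- a simplified sketch of the per-weight
    encodings produced during offline conversion."""
    t_min = min(float(tensor.min()), 0.0)   # range must include zero
    t_max = max(float(tensor.max()), 0.0)
    n_steps = 2 ** bitwidth - 1
    scale = (t_max - t_min) / n_steps
    offset = round(t_min / scale) if scale else 0
    return scale, offset

# Each adapter's weights get their own encodings, so two adapters with
# different dynamic ranges quantize independently instead of sharing
# one compromise range.
adapter_a = np.linspace(-0.5, 0.5, 16)
adapter_b = np.linspace(-2.0, 2.0, 16)
scale_a, _ = compute_encoding(adapter_a)
scale_b, _ = compute_encoding(adapter_b)
assert scale_b > scale_a   # wider range -> coarser quantization step
```

This is why tailored per-adapter encodings help regain accuracy: a single shared encoding would have to cover the widest adapter's range, wasting precision on the narrower ones.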

High-Level End-to-End Workflow

../../_static/resources/lora/qnn_tutorial_lora_high_level_e2e_workflow.png

Notes:

  1. For switching adapters, “Apply adapter” can be called as needed between calls to execute.

  2. Setting alpha is done by updating the relevant input tensor.