Offline Flow: Generating LoRA binary sections with QNN API

[Figure: ../../_static/resources/lora/qnn_tutorial_lora_offline_generate_binary_sections_directly.png]
  • As shown on the previous page, the qnn-context-binary-generator tool is extended to produce LoRA binary sections (also referred to as adapters).

  • This page explains how to apply the adapter weights and retrieve the binary sections directly using the QNN API (not via the QNN tools).

  • This is done in the following manner:

    1. Create the QNN Context and Graphs (either from scratch or from a binary)

      • In case the Context/Graph was created from scratch, call QnnContext_getBinary to receive a binary blob of the unmodified QNN context.

    2. Call the new QNN API: QnnTensor_updateGraphTensors / QnnTensor_updateContextTensors

      • Tensors must be of UPDATEABLE type, created during graph composition (in step 1)

    3. Call QnnGraph_finalize (important! Updates are not applied until finalize is called)

    4. Call QnnContext_getBinarySectionSize to receive the size of the binary section

      • Allocate a buffer of at least that size and pass it to the QNN backend as part of the next API call

    5. Call QnnContext_getBinarySection to receive a binary blob containing the LoRA update

  • Steps 2-5 can be repeated, each time applying a different adapter (by updating the weights) and retrieving the corresponding binary section
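
The ordering of steps 2-5 can be sketched as a small simulation. This is only an illustration of the call sequence, not the QNN API: `MockContext` and its methods are hypothetical in-memory stand-ins for QnnTensor_updateGraphTensors, QnnGraph_finalize, QnnContext_getBinarySectionSize and QnnContext_getBinarySection, and the tensor names and byte values are made up. The real functions are C calls with their own signatures and error handling.

```python
class MockContext:
    """Hypothetical stand-in for a QNN context with one graph (not the QNN API)."""

    def __init__(self, updateable_names):
        # Weights that are visible in the binary section (after finalize).
        self.applied = {name: b"" for name in updateable_names}
        # Updates that were submitted but not yet finalized.
        self.pending = {}

    def update_graph_tensors(self, tensors):
        # Step 2: only tensors declared UPDATEABLE at composition time are accepted.
        for name in tensors:
            if name not in self.applied:
                raise KeyError(f"{name} is not an UPDATEABLE tensor")
        self.pending.update(tensors)

    def graph_finalize(self):
        # Step 3: pending updates take effect only here.
        self.applied.update(self.pending)
        self.pending.clear()

    def _section(self):
        return b"".join(self.applied[name] for name in sorted(self.applied))

    def get_binary_section_size(self):
        # Step 4: report how many bytes the caller must allocate.
        return len(self._section())

    def get_binary_section(self, buffer):
        # Step 5: write into the caller-allocated buffer, return bytes written.
        data = self._section()
        if len(buffer) < len(data):
            raise ValueError("buffer too small")
        buffer[:len(data)] = data
        return len(data)


def generate_adapter_sections(context, adapters):
    """Repeat steps 2-5 once per adapter and collect one section per adapter."""
    sections = {}
    for adapter_name, weights in adapters.items():
        context.update_graph_tensors(weights)       # step 2
        context.graph_finalize()                    # step 3
        size = context.get_binary_section_size()    # step 4
        buffer = bytearray(size)                    # caller owns the buffer
        written = context.get_binary_section(buffer)  # step 5
        sections[adapter_name] = bytes(buffer[:written])
    return sections


# Made-up adapters: two sets of weights for the same two UPDATEABLE tensors.
adapters = {
    "adapter_a": {"lora_A": b"\x01\x02", "lora_B": b"\x03"},
    "adapter_b": {"lora_A": b"\x0a\x0b", "lora_B": b"\x0c"},
}
ctx = MockContext(["lora_A", "lora_B"])
sections = generate_adapter_sections(ctx, adapters)
```

The point of the sketch is the ordering: the size query and the retrieval come after finalize, and the buffer is allocated by the caller between the two calls.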

Generating LoRA weight-shared binary sections

[Figure: ../../_static/resources/lora/qnn_tutorial_lora_offline_generate_weight_shared_binary_sections_directly.png]
  • There is a slight call-flow change when generating weight-shared binary sections (Super Adapters).

  • With regular adapters, every QnnGraph_finalize is followed by QnnContext_getBinarySectionSize and then QnnContext_getBinarySection. With weight-shared adapters, QnnContext_getBinarySection is instead called once per group of graphs that are in the same context. The combined adapter file, also referred to as a “Super Adapter”, enables weight sharing between adapters, thereby reducing total adapter size.

  • As mentioned on the previous page, Super Adapters are generated through qnn-context-binary-generator by adding the following option to the config YAML file:

    share_adapters_between_graphs: Yes
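
When the QNN API is used directly rather than the tool, the call-flow difference for weight-shared sections can be sketched as follows. As before, this is a simulation with hypothetical stand-ins (`MockSharedContext` and its methods are not QNN functions), and the graph names and weights are invented. Two things are assumptions made for illustration only: that sharing works by storing identical weight blobs once, and that the size query is also issued once per group.

```python
class MockSharedContext:
    """Hypothetical stand-in for a context holding several graphs (not the QNN API)."""

    def __init__(self, graph_names):
        self.pending = {graph: {} for graph in graph_names}
        self.applied = {graph: {} for graph in graph_names}

    def update_graph_tensors(self, graph, tensors):
        self.pending[graph].update(tensors)

    def graph_finalize(self, graph):
        # Updates for this graph take effect only at finalize.
        self.applied[graph].update(self.pending[graph])
        self.pending[graph].clear()

    def _section(self):
        # Assumed sharing model: each distinct weight blob is stored once.
        unique = []
        for graph in sorted(self.applied):
            for name in sorted(self.applied[graph]):
                blob = self.applied[graph][name]
                if blob not in unique:
                    unique.append(blob)
        return b"".join(unique)

    def get_binary_section_size(self):
        return len(self._section())

    def get_binary_section(self, buffer):
        data = self._section()
        if len(buffer) < len(data):
            raise ValueError("buffer too small")
        buffer[:len(data)] = data
        return len(data)


def generate_super_adapter(context, per_graph_weights):
    """Update and finalize every graph first, then retrieve one combined section."""
    for graph, weights in per_graph_weights.items():
        context.update_graph_tensors(graph, weights)
        context.graph_finalize(graph)
    # Single size query and single retrieval for the whole group of graphs.
    size = context.get_binary_section_size()
    buffer = bytearray(size)
    written = context.get_binary_section(buffer)
    return bytes(buffer[:written])


# Made-up example: both graphs use the same lora_A weights.
shared = b"\x01\x02\x03\x04"
per_graph_weights = {
    "graph_prefill": {"lora_A": shared, "lora_B": b"\x10"},
    "graph_decode": {"lora_A": shared, "lora_B": b"\x20"},
}
ctx = MockSharedContext(list(per_graph_weights))
super_adapter = generate_super_adapter(ctx, per_graph_weights)
separate_total = sum(len(b) for w in per_graph_weights.values() for b in w.values())
```

In this toy data the shared blob appears once in the combined section, so `super_adapter` is smaller than the sum of the per-graph weights, which is the size reduction the Super Adapter is meant to provide.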