LoRA + Graph Switch Implementation

  • Graph switching can now be used with LoRA to reduce RAM usage, at the cost of a slight token-rate hit

  • As per the QNN SDK documentation, to enable graph switching the user must set the following context config options:

    QNN_CONTEXT_CONFIG_MEMORY_LIMIT_HINT : non-zero value

    QNN_CONTEXT_CONFIG_PERSISTENT_BINARY : true

  • When using qnn-net-run or qnn-throughput-net-run, this can be done by setting the corresponding options in the backend extension config file:

    memory_limit_hint : non-zero value

    is_persistent_binary : true
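A sketch of what the backend extension config file might look like with these two options. The key names are the ones listed above; the surrounding JSON structure (whether they sit under a `"context"` section) and the 512 MiB value are assumptions, so consult the qnn-net-run backend extensions documentation for the exact schema:

```json
{
  "context": {
    "memory_limit_hint": 536870912,
    "is_persistent_binary": true
  }
}
```

This file is the one passed to qnn-net-run / qnn-throughput-net-run via the backend extension mechanism.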

  • The adapter buffer should be kept persistent (like the context binary buffer) for graph switching

    • During QnnContext_applyBinarySection, if the graph is in an unloaded state, the HTP backend deserializes the graph and then applies the adapter.

    • During QnnGraph_execute, if the graph is in an unloaded state, the HTP backend reloads the graph and then reapplies the last applied adapter from the persistent buffer.