LoRA + Graph Switch Implementation¶
Graph switching can now be used with LoRA to reduce RAM usage, at the cost of a slight token-rate hit.
As described in the QNN SDK documentation, to enable graph switching the user needs to set the following context config options:

- QNN_CONTEXT_CONFIG_MEMORY_LIMIT_HINT: non-zero value
- QNN_CONTEXT_CONFIG_PERSISTENT_BINARY: true

If the user runs qnn-net-run or qnn-throughput-net-run, this can be done by setting the corresponding options in the backend extension config file:

- memory_limit_hint: non-zero value
- is_persistent_binary: true

The adapter buffer should be kept persistent (like the context binary buffer) for graph switching.
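As a rough illustration, the two backend extension options above might appear in the config file as follows. This is only a sketch: the option names come from the text, but the surrounding JSON nesting (the `context` section) and the example value are assumptions; consult the QNN SDK backend extension documentation for the exact file layout.

```json
{
  "context": {
    "memory_limit_hint": 100000,
    "is_persistent_binary": true
  }
}
```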
During QnnContext_applyBinarySection, if the graph is in an unloaded state, the HTP backend deserializes the graph and then applies the adapter.
During QnnGraph_execute, if the graph is in an unloaded state, the HTP backend loads the unloaded graph and then reapplies the last applied adapter from the persistent buffer.
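The two behaviors above can be modeled with a short sketch. This is a hypothetical Python model of the flow, not the QNN SDK itself: the `Graph` class and its method names are illustrative stand-ins for the backend's internal state, mirroring how an unloaded graph is re-deserialized on demand and how the last-applied adapter is reapplied from its persistent buffer.

```python
class Graph:
    """Toy model of an HTP graph that supports graph switching with LoRA."""

    def __init__(self, serialized_bytes: bytes):
        self.serialized = serialized_bytes  # persistent context binary buffer
        self.loaded = False
        self.last_adapter = None            # persistent adapter buffer

    def _deserialize(self) -> None:
        # Stand-in for the backend deserializing the graph from the
        # persistent context binary buffer.
        self.loaded = True

    def apply_binary_section(self, adapter_buffer: bytes) -> None:
        # Mirrors QnnContext_applyBinarySection: if the graph is unloaded,
        # deserialize it first, then apply the adapter and remember it so it
        # can be reapplied after a later unload/reload cycle.
        if not self.loaded:
            self._deserialize()
        self.last_adapter = adapter_buffer

    def unload(self) -> None:
        # Graph switching: evict the loaded graph to reduce RAM usage.
        self.loaded = False

    def execute(self, inputs):
        # Mirrors QnnGraph_execute: if the graph is unloaded, reload it and
        # reapply the last adapter from the persistent buffer before running.
        if not self.loaded:
            self._deserialize()
        return (inputs, self.last_adapter)


g = Graph(b"ctx-binary")
g.apply_binary_section(b"lora-A")  # adapter applied and remembered
g.unload()                         # graph switched out to save RAM
out = g.execute("prompt")          # graph reloaded, "lora-A" reapplied
print(out)  # → ('prompt', b'lora-A')
```

The key design point the model captures is that the adapter buffer must outlive the graph's loaded state: because the backend reapplies it on reload, freeing it after the first apply would break subsequent executions.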