QNN HTP Qmem Graph¶
Currently, QnnGraph supports inference with RAW buffers and MemHandles. RAW buffers are not accessible from the DSP side, so at graph creation QNN HTP reserves extra RPCMem buffers into which RAW input buffers are copied before inference (and from which output buffers are copied out after inference). If the client uses MemHandles (handles to buffers allocated with rpcmem_alloc and accessible from the DSP side), no copy is needed and those internal buffers are not used. Qmem Graph allows the client to pass a hint at graph preparation time that this use case will use RPCMem buffers, so there is no need to allocate the extra internal RPCMem memory. If the client passes the Qmem Graph hint at graph creation time but still provides RAW buffers at inference time, QNN HTP will allocate the extra buffers at run time, with an expected performance impact for that inference.
Online Prepare¶
Preparation
When doing online prepare, the hint (the IO tensor memory type) that tells the QnnHtp backend to reduce memory allocation is embedded in the model.so file.
This information can be passed in via QnnTensor_createGraphTensor and QnnGraph_addNode.
// example graph. for details, please refer to Sample App
Qnn_GraphHandle_t graph;
// IO tensors
Qnn_Tensor_t inputTensor;
Qnn_Tensor_t outputTensor;
// Set up common settings for tensors ......
/* There are 2 specific settings for shared buffer:
 * 1. memType should be QNN_TENSORMEMTYPE_MEMHANDLE;
 * 2. the union member memHandle should be used instead of clientBuf, and it
 *    should be set to nullptr.
 */
inputTensor.v1.memType = QNN_TENSORMEMTYPE_MEMHANDLE;
inputTensor.v1.memHandle = nullptr;
outputTensor.v1.memType = QNN_TENSORMEMTYPE_MEMHANDLE;
outputTensor.v1.memHandle = nullptr;
QnnTensor_createGraphTensor(graph, &inputTensor);
QnnTensor_createGraphTensor(graph, &outputTensor);

// create a Qnn_OpConfig_t with the IO tensors just created
Qnn_OpConfig_t opConfig;
QnnGraph_addNode(graph, opConfig);
Execution
Please refer to: QNN HTP Shared Buffer Tutorial
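The tutorial linked above covers the full execution flow. As a rough, non-runnable sketch of the key steps, assuming the rpcmem API from libcdsprpc and the memory-registration calls from the QNN API (exact field and function names should be verified against your SDK version's headers):

```cpp
// Hedged sketch: allocate a DSP-visible buffer with rpcmem, register it with
// the context to obtain a Qnn_MemHandle_t, and attach that handle to the
// tensor before executing. Heap ids, flags, and descriptor fields below
// follow the QNN SDK headers but may differ across SDK versions.

// From rpcmem.h (libcdsprpc): allocate a shared buffer and get its file descriptor
void* rpcBuf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, bufSize);
int fd = rpcmem_to_fd(rpcBuf);

// Describe the buffer and register it with the context
Qnn_MemDescriptor_t desc = QNN_MEM_DESCRIPTOR_INIT;
desc.memShape = {rank, dims, nullptr};
desc.dataType = QNN_DATATYPE_FLOAT_32;
desc.memType  = QNN_MEM_TYPE_ION;
desc.ionInfo.fd = fd;
Qnn_MemHandle_t memHandle = nullptr;
qnnInterface.memRegister(context, &desc, 1 /*numDescriptors*/, &memHandle);

// Attach the handle to the IO tensor and execute; no copy is needed because
// the DSP can access the registered buffer directly.
inputTensor.v1.memType   = QNN_TENSORMEMTYPE_MEMHANDLE;
inputTensor.v1.memHandle = memHandle;
qnnInterface.graphExecute(graph, inputs, numInputs, outputs, numOutputs,
                          nullptr /*profile*/, nullptr /*signal*/);
```

Buffers should be deregistered and freed (rpcmem_free) when no longer needed; see the tutorial for the complete lifecycle.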
Offline Prepare¶
When generating serialized.bin, it is recommended to pass the option --input_output_tensor_mem_type memhandle to reduce the memory footprint. With this option, qnn-context-binary-generator changes the IO tensor memory type to memhandle. When the QnnHtp backend loads serialized.bin, it can then skip memory allocation for the IO tensors, understanding that the user intends to use shared buffers during execution.
Skipping this option will not impact inference performance; only the memory footprint is affected.
Preparation
# Prerequisites: model.so, qnn-context-binary-generator, QnnHtp backend .so library

./qnn-context-binary-generator --model libqnn_model.so --backend libQnnHtp.so --binary_file qnngraph.serialized --output_dir output --input_output_tensor_mem_type memhandle

# qnngraph.serialized.bin is generated and saved at output/qnngraph.serialized.bin
Execution
Please refer to: QNN HTP Shared Buffer Tutorial
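As an illustrative sketch, the generated binary can be exercised with the SDK's qnn-net-run tool; the flag names below follow common QNN SDK tool conventions and should be verified against your SDK's qnn-net-run --help output:

```
# Hedged sketch: run the serialized context with DSP-shared (rpcmem) IO buffers.
./qnn-net-run --backend libQnnHtp.so \
              --retrieve_context output/qnngraph.serialized.bin \
              --input_list input_list.txt \
              --shared_buffer
```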
Mis-matching mem_type during preparation and execution
| Preparation | Execution | Behavior |
|---|---|---|
| raw | raw | Internal RPCMem buffers are used to copy inputs in before inference and outputs out after inference (default behavior). |
| raw | memhandle | No copy is performed; the internal RPCMem buffers are reserved but unused. |
| memhandle | memhandle | No copy and no extra internal allocation (recommended). |
| memhandle | raw | Extra buffers are allocated at run time; expected performance impact for that inference. |