QNN HTP Qmem Graph

Currently, QnnGraph supports inference with both raw buffers and memhandles. Raw buffers are not accessible from the DSP side, so at graph creation QNN HTP reserves extra RPCMem buffers into which it copies raw input buffers before each inference (and from which it copies output buffers back after inference). If the client uses memhandles (handles to buffers allocated with rpcmem_alloc, which are accessible from the DSP side), no copy is needed and those internal buffers are not used. Qmem Graph allows the client to pass a hint at graph preparation time that this use case will use RPCMem buffers, so there is no need to allocate the extra internal RPCMem memory. If the client passes the Qmem Graph hint at graph creation time but still provides raw buffers at inference time, QNN HTP will allocate the extra buffers at run time, with an expected performance impact for that inference.
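
To illustrate the memhandle path the hint is designed for, the sketch below allocates a DSP-accessible buffer with rpcmem_alloc and registers it with the context via QnnMem_register to obtain a Qnn_MemHandle_t. This is a minimal sketch, not code from this documentation: the registerRpcBuffer helper name, the FLOAT_32 data type, and the dims/rank parameters are illustrative assumptions, and the Qnn_MemDescriptor_t field usage should be checked against QnnMem.h in your SDK version; error handling is omitted.

// Minimal sketch: allocate an RPCMem (ion) buffer and register it with QNN
// so it can be passed as a memhandle at inference time. Assumes a valid
// Qnn_ContextHandle_t and the Hexagon SDK rpcmem API.
#include "QnnMem.h"
#include "QnnTypes.h"
#include "rpcmem.h"

Qnn_MemHandle_t registerRpcBuffer(Qnn_ContextHandle_t context, int numBytes,
                                  uint32_t* dims, uint32_t rank) {
  // Allocate a buffer that is accessible from the DSP side.
  void* buf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, numBytes);
  int fd    = rpcmem_to_fd(buf);

  // Describe the buffer to QNN; field names as understood from QnnMem.h.
  Qnn_MemDescriptor_t desc = QNN_MEM_DESCRIPTOR_INIT;
  desc.memShape   = {rank, dims, nullptr};
  desc.dataType   = QNN_DATATYPE_FLOAT_32;  // assumed tensor data type
  desc.memType    = QNN_MEM_TYPE_ION;
  desc.ionInfo.fd = fd;

  // Register the descriptor with the context to obtain a memhandle.
  Qnn_MemHandle_t memHandle = nullptr;
  QnnMem_register(context, &desc, 1 /*numDescriptors*/, &memHandle);
  return memHandle;
}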

Online Prepare

Preparation

When doing online prepare, the hint (the IO tensor memory type) that tells the QnnHtp backend to reduce memory allocation is embedded in the model .so file. This information is passed in through QnnTensor_createGraphTensor and QnnGraph_addNode.

// Example graph. For details, please refer to the Sample App.
Qnn_GraphHandle_t graph;
// IO tensors
Qnn_Tensor_t inputTensor;
Qnn_Tensor_t outputTensor;
// Set up common settings for the tensors ......
/* There are 2 specific settings for shared buffer:
 * 1. memType should be QNN_TENSORMEMTYPE_MEMHANDLE;
 * 2. the union member memHandle should be used instead of clientBuf, and it
 *    should be set to nullptr.
 */
inputTensor.v1.memType    = QNN_TENSORMEMTYPE_MEMHANDLE;
inputTensor.v1.memHandle  = nullptr;
outputTensor.v1.memType   = QNN_TENSORMEMTYPE_MEMHANDLE;
outputTensor.v1.memHandle = nullptr;
QnnTensor_createGraphTensor(graph, &inputTensor);
QnnTensor_createGraphTensor(graph, &outputTensor);

// Create a Qnn_OpConfig_t with the IO tensors just created
Qnn_OpConfig_t opConfig;
QnnGraph_addNode(graph, opConfig);

Execution

Please refer to: QNN HTP Shared Buffer Tutorial
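
The tutorial is the authoritative reference; the snippet below is only a minimal sketch of what execution with shared buffers can look like. It assumes the hypothetical registerRpcBuffer helper from the sketch above, the IO tensors created during preparation, a finalized graph, and illustrative size/shape variables.

// Minimal execution sketch: attach registered memhandles to the IO tensors
// and run the graph. Error handling omitted.
Qnn_MemHandle_t inputHandle  = registerRpcBuffer(context, inputBytes, inputDims, inputRank);
Qnn_MemHandle_t outputHandle = registerRpcBuffer(context, outputBytes, outputDims, outputRank);

inputTensor.v1.memType    = QNN_TENSORMEMTYPE_MEMHANDLE;
inputTensor.v1.memHandle  = inputHandle;
outputTensor.v1.memType   = QNN_TENSORMEMTYPE_MEMHANDLE;
outputTensor.v1.memHandle = outputHandle;

// Fill the RPCMem input buffer with data here, then run inference.
QnnGraph_execute(graph, &inputTensor, 1 /*numInputs*/, &outputTensor, 1 /*numOutputs*/,
                 nullptr /*profile*/, nullptr /*signal*/);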

Offline Prepare

When generating serialized.bin, it is recommended to use the option --input_output_tensor_mem_type memhandle to reduce the memory footprint. With this option, qnn-context-binary-generator changes the IO tensor memory type to memhandle. When the QnnHtp backend loads serialized.bin, it can skip memory allocation for the IO tensors, since the user has indicated the intent to use shared buffers during execution. Skipping this option will not impact inference performance; it only increases the memory footprint.

Preparation

# Prerequisites: model.so, qnn-context-binary-generator, QnnHtp backend .so library

./qnn-context-binary-generator --model libqnn_model.so --backend libQnnHtp.so \
    --binary_file qnngraph.serialized --output_dir output \
    --input_output_tensor_mem_type memhandle

# qnngraph.serialized.bin is generated and saved at output/qnngraph.serialized.bin
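
As a hedged sketch (not taken from the tutorial), loading the generated binary could look like the following. The binaryBuffer/binarySize variables, the backend and device handles, and the graph name "qnn_model" are illustrative assumptions, and in a real client these calls are typically made through the QNN interface function table.

// Minimal sketch: load output/qnngraph.serialized.bin into a context and
// retrieve the graph prepared offline. Assumes the binary file has already
// been read into (binaryBuffer, binarySize). Error handling omitted.
Qnn_ContextHandle_t context = nullptr;
QnnContext_createFromBinary(backend, device, nullptr /*config*/,
                            binaryBuffer, binarySize, &context, nullptr /*profile*/);

// Retrieve the graph by the name used when the model was built (illustrative).
Qnn_GraphHandle_t graph = nullptr;
QnnGraph_retrieve(context, "qnn_model", &graph);

// Because the binary was generated with --input_output_tensor_mem_type memhandle,
// the backend skips allocating internal IO buffers; pass memhandle tensors at execution.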

Execution

Please refer to: QNN HTP Shared Buffer Tutorial

Mismatched mem_type during preparation and execution

The behavior for each combination of preparation and execution mem_type is summarized below.

Preparation: raw, Execution: raw

  • QNN HTP allocates memory for the IO buffers.

  • HTP copies inputs and outputs at each inference.

Preparation: raw, Execution: memhandle

  • QNN HTP allocates memory for the IO buffers.

  • Data copies are avoided.

Preparation: memhandle, Execution: raw

  • QNN HTP does not allocate memory for the IO buffers during preparation.

  • QNN HTP allocates memory for the IO buffers during the first inference (raw buffers passed in during execution), so the first inference time is impacted by the memory allocation.

  • HTP copies inputs and outputs at each inference.

Preparation: memhandle, Execution: memhandle

  • QNN HTP does not allocate memory for the IO buffers during preparation.

  • Data copies are avoided.