QNN GPU QnnMem API Tutorial¶
Introduction¶
This tutorial demonstrates the usage of the QnnMem API with the QNN GPU backend. This feature enables data sharing between processing domains.
The shared memory types of the QnnMem API supported by the QNN GPU backend are as follows:
| Qnn_MemDescriptor_t Type | QnnGpu_MemType_t Type | Descriptor |
|---|---|---|
| QNN_MEM_TYPE_CUSTOM | QNN_GPU_MEM_OPENCL | OpenCL buffer |
Note
This tutorial focuses only on OpenCL buffer usage with the QnnMem API on the QNN GPU backend. Some prerequisites in the SDK example code are not discussed in detail here; refer to the corresponding sections of the QNN documentation, or to the SampleApp.
SampleApp documentation: Sample App Tutorial
SampleApp code: ${QNN_SDK_ROOT}/examples/QNN/SampleApp
Using QNN_MEM_TYPE_CUSTOM with the QNN API¶
The following documentation demonstrates the custom memory type feature of the QnnMem API for the QNN GPU backend, which allows users to allocate, manage, and register their own OpenCL buffers for input and output tensors. This achieves zero-copy: data no longer needs to be copied between host memory and GPU memory, which improves inference time.
The following is a representation of using OpenCL buffers, where each tensor has its own OpenCL buffer and memory handle.
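This per-tensor pairing (one OpenCL buffer plus one registered memory handle per tensor) can be modeled as a small bookkeeping struct. The sketch below is illustrative only and uses opaque placeholder types (`ClBufferHandle`, `MemHandle`) rather than the real `cl::Buffer*` and `Qnn_MemHandle_t`, since it assumes no SDK or OpenCL headers:

```cpp
#include <cstddef>

// Opaque placeholders standing in for cl::Buffer* and Qnn_MemHandle_t;
// in real code, include the OpenCL C++ bindings and QnnMem.h instead.
using ClBufferHandle = void*;
using MemHandle = void*;

// Each tensor owns exactly one OpenCL buffer and one registered handle.
struct TensorSharedMem {
  ClBufferHandle clBuffer = nullptr;  // backing OpenCL buffer
  MemHandle memHandle = nullptr;      // handle returned by QnnMem_register
  std::size_t byteSize = 0;           // buffer size in bytes
};
```

Keeping the buffer and its handle together makes it easier to deregister and free both when the tensor is torn down.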
Below is an example implementation of this API. It assumes prior familiarity with OpenCL and its APIs.
1 // QnnInterface_t is defined in ${QNN_SDK_ROOT}/include/QNN/QnnInterface.h
2 QnnInterface_t qnnInterface;
3 // Init qnn interface ......
4 // See ${QNN_SDK_ROOT}/examples/QNN/SampleApp code
5
6 // Qnn_Tensor_t is defined in ${QNN_SDK_ROOT}/include/QNN/QnnTypes.h
7 Qnn_Tensor_t outputTensor;
8 // Set up common settings for outputTensor ......
9 /* There are 2 specific settings for a custom QnnMem buffer:
10 * 1. memType should be QNN_TENSORMEMTYPE_MEMHANDLE; (line 41)
11 * 2. the union member memHandle should be used instead of clientBuf, and it
12 * should be set to nullptr. (line 42)
13 */
14
15
16 // Allocate a host buffer for the output tensor with dimensions {64, 128}
17 size_t bufferSize = 64 * 128 * sizeof(float);
18 auto outputTensorBuffer = (float*)malloc(bufferSize);
19
20 // Create an OpenCL context
21 auto clContext = ...;
22 const cl_mem_flags memFlags = CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR; // USE_HOST_PTR since a host pointer is supplied
23 cl_int clStatus;
24 auto outputTensorCLBuffer = new cl::Buffer(*clContext, memFlags, bufferSize, outputTensorBuffer, &clStatus);
25 if (clStatus != CL_SUCCESS) {
26 // Handle error
27 }
28
29 // Fill in Qnn_MemDescriptor_t and register the buffer with QNN
30 // Qnn_MemDescriptor_t is defined in ${QNN_SDK_ROOT}/include/QNN/QnnMem.h
31 Qnn_MemDescriptor_t memDescriptor = QNN_MEM_DESCRIPTOR_INIT;
32 memDescriptor.memShape = {outputTensor.rank, outputTensor.dimensions, nullptr};
33 memDescriptor.dataType = outputTensor.dataType;
34 memDescriptor.memType = QNN_MEM_TYPE_CUSTOM;
35
36 // Fill in QnnGpu_MemInfoCustom_t and attach it to the Qnn_MemDescriptor_t
37 QnnGpu_MemInfoCustom_t customInfo = QNN_GPU_MEM_INFO_CUSTOM_INIT;
38 customInfo.memType = QNN_GPU_MEM_OPENCL;
39 customInfo.buffer = reinterpret_cast<QnnGpuMem_Buffer_t>((*outputTensorCLBuffer)()); // register the underlying cl_mem, not the host pointer
40 memDescriptor.customInfo = customInfo;
41 outputTensor.memType = QNN_TENSORMEMTYPE_MEMHANDLE;
42 outputTensor.memHandle = nullptr;
43
44 Qnn_ContextHandle_t context; // Must obtain a QNN context handle before QnnMem_register()
45 // To obtain QNN context handle:
46 // For online prepare, refer to ${QNN_SDK_ROOT}/docs/general/sample_app.html#create-context
47 // For offline prepare, refer to ${QNN_SDK_ROOT}/docs/general/sample_app.html#load-context-from-a-cached-binary
48 Qnn_MemHandle_t memHandles[1];
49 auto result = QnnMem_register(context, &memDescriptor, 1u, memHandles);
50 if (QNN_SUCCESS != result) {
51 // handle errors
52 }
53
54 /**
55 * At this point, allocation and registration of the OpenCL buffer is complete.
56 * On the user side, the buffer contents can be manipulated through outputTensorBuffer.
57 */
58
59 // Load the input data into outputTensorBuffer ......
60
61 // Execute QNN graph with input tensor and output tensor ......
62
63 // Read back the output data, for example:
64 auto openCLCommandQueue = ...; // Get cl::CommandQueue instance
65 auto mappedPtr =
66 reinterpret_cast<float*>(openCLCommandQueue->enqueueMapBuffer(*outputTensorCLBuffer,
67 CL_TRUE,
68 CL_MAP_READ,
69 0,
70 bufferSize,
71 nullptr,
72 nullptr,
73 &clStatus));
74 if (clStatus != CL_SUCCESS) {
75 // handle error
76 }
77
78 // Access contents of mappedPtr, e.g.
79 std::vector<float> contents(mappedPtr, mappedPtr + bufferSize / sizeof(float));
80 for (size_t i = 0u; i < contents.size(); i++) {
81 // Read data
82 }
83
84 // On completion, unmap mappedPtr
85 clStatus = openCLCommandQueue->enqueueUnmapMemObject(*outputTensorCLBuffer, mappedPtr, nullptr, nullptr);
86 if (clStatus != CL_SUCCESS) {
87 // handle error
88 }
89
90 // Deregister the memory handle once the buffer is no longer in use
91 result = QnnMem_deregister(memHandles, 1u);
92 if (QNN_SUCCESS != result) {
93 // handle errors
94 }
95
96 // Deallocate memory
97 delete outputTensorCLBuffer;
98 free(outputTensorBuffer);