GPU¶
This section provides information about the QNN GPU backend.
API Specializations¶
This section contains information related to API specializations for the GPU backend. All QNN GPU
backend specializations are available under the ${QNN_SDK_ROOT}/include/QNN/GPU/ directory.
The current version of the QNN GPU backend API is:
- QNN_GPU_API_VERSION_MAJOR 3
- QNN_GPU_API_VERSION_MINOR 11
- QNN_GPU_API_VERSION_PATCH 0
Operation Limitations¶
QNN GPU operation limitations are documented in GPU Backend Op Definition Supplement.
Kernel Persistence¶
The QNN GPU backend supports two kernel persistence strategies held within a QNN Context: in-memory and on-disk. The in-memory persistence is referred to as the kernel registry and the on-disk persistence as the kernel repository. Both mechanisms allow kernels to be re-used to reduce model initialization time. The following outlines how to use these features through a simple use case.
A user creates a new QNN GPU Context by calling
QnnContext_create with a custom config
setting providing a valid kernelRepoDir. Let’s assume this
path is ${QNN_GPU_KERNEL_REPO}. Assume that there is no existing on-disk repository at this path; in that case,
no kernels will be deserialized and the in-memory registry will start empty. When an on-disk repository does exist,
kernels originating from the built-in qti.aisw op package are deserialized during
QnnContext_create, while kernels originating from
another op package are deserialized when that op package is registered via
QnnBackend_registerOpPackage.
A user creates model A and finalizes it. Suppose that model A comprises kernels 1, 2, and 3. These kernels are created from scratch and added to the in-memory kernel registry. A user creates model B and finalizes it. Suppose that model B comprises kernels 3 and 4. Kernel 3 will be recovered from the in-memory kernel registry and kernel 4 will be created from scratch and added to the registry.
The user now calls QnnContext_free. Since
a valid kernel repo path was provided, the QNN GPU Context will serialize in-memory kernels and, for each op package,
write them to ${QNN_GPU_KERNEL_REPO}/gpukernelcache.${OP_PKG_NAME} where OP_PKG_NAME is the op package
packageName.
If the user creates another QNN GPU Context specifying the same kernel repo path, these kernels will be deserialized as outlined above and added to the in-memory kernel registry. If the user now creates model A or B, all kernels will be ready for creation via the in-memory registry, greatly reducing initialization time.
Note that an op package provides a kernelRepoHash to the Context. If the QNN GPU Context detects that an on-disk kernel repository was generated by an op package of the same name, but with a different kernelRepoHash, the on-disk repository will be automatically invalidated. This ensures that kernel version mismatches do not occur.
Also note that these QNN GPU kernel persistence features are separate from the QNN context cache feature (see QnnContext_getBinary). A QNN GPU context cache will store everything needed to re-create a context, including kernels.
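When driving the backend through qnn-net-run, the kernel repository directory can also be supplied via the GPU backend-extensions config file (see the QNN GPU Backend Extensions section below). A minimal sketch, where /data/local/tmp/gpu_kernel_repo is a hypothetical path with read/write permissions on the target:

```json
{
  "kernel_repo_path": "/data/local/tmp/gpu_kernel_repo"
}
```

On the first run the repository is created and populated when the context is freed; subsequent runs pointing at the same path deserialize the cached kernels, reducing initialization time as described above.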
Precision Mode¶
The QNN GPU backend offers four precision modes via the QNN graph custom config feature (see QnnGpuGraph_CustomConfig_t and QnnGpu_Precision_t). These modes are:
- QNN_GPU_PRECISION_FP32 (FP32 mode): converts NATIVE tensor data types to FP32 and selects kernels that use an FP32 accumulator. FP32 mode offers the best accuracy at the expense of performance.
- QNN_GPU_PRECISION_FP16 (FP16 mode): converts NATIVE tensor data types to FP16 and selects kernels that use an FP16 accumulator where possible. FP16 mode offers the best performance at the expense of accuracy.
- QNN_GPU_PRECISION_HYBRID (hybrid mode): converts NATIVE tensor data types to FP16 and selects kernels that use an FP32 accumulator. Hybrid mode offers a good trade-off between performance and accuracy.
- QNN_GPU_PRECISION_USER_PROVIDED: the default precision mode when a custom config has not been provided. The QNN GPU backend will not optimize NATIVE tensor data types.
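Through qnn-net-run, a precision mode can be requested per graph with the GPU backend-extensions config file (see QNN GPU Backend Extensions below). A minimal sketch; the graph name "model" is a placeholder for whatever name was passed to QnnGraph_create:

```json
{
  "graph_names": ["model"],
  "precision_mode": "hybrid"
}
```

Per the schema dependencies, precision_mode may only be specified together with graph_names.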
Performance Hints¶
The QNN GPU backend offers three performance hints via the QNN context custom config feature (see QnnGpuContext_CustomConfig_t and QnnGpuContext_PerfHint_t). These hints are:
- QNN_GPU_CONTEXT_PERF_HINT_HIGH: maximizes GPU clock frequencies, offering the best inference latency at the expense of power consumption. This is the default.
- QNN_GPU_CONTEXT_PERF_HINT_NORMAL: offers balanced performance dependent upon power management.
- QNN_GPU_CONTEXT_PERF_HINT_LOW: minimizes GPU clock frequencies, offering the lowest power consumption at the expense of inference latency.
Note that performance hints are included in the context cache. However, calls to QnnContext_setConfig can override the cached performance hint setting.
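With qnn-net-run, a hint can be selected through the GPU backend-extensions config file (see QNN GPU Backend Extensions below); since perfHint is a context-level option, no graph_names entry is required by the schema. A minimal sketch favoring power savings:

```json
{
  "perf_hint": "low"
}
```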
Context Configs¶
QnnContext custom configs (QnnGpuContext_CustomConfig_t) and Context Priority (see Qnn_Priority_t and QnnContext_ConfigOption_t) are supported.
Backend Configs¶
QnnBackend custom configs (see QnnGpuBackend_CustomConfig_t and QnnGpuBackend_ConfigOption_t) are supported.
Disabling Optimizations¶
The QNN GPU backend offers three options, each of which disables a corresponding optimization. These options are set via the custom graph config (see QnnGpuGraph_CustomConfig_t).
The QNN GPU backend shares NATIVE tensor memory based upon analysis of the network topology. When disableMemoryOptimizations is non-zero, each tensor in the model is allocated unique memory and sharing is disabled.
The QNN GPU backend fuses compatible operations into one operation to improve QnnGraph_execute performance. When disableNodeOptimizations is non-zero, operations are not fused and are kept separate. qnn-net-run's --debug option also disables operation fusion.
The QNN GPU backend uses queue recording to improve QnnGraph_execute performance. When disableQueueRecording is non-zero, queue recording is disabled.
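For debugging via qnn-net-run, these optimizations can be switched off per graph in the backend-extensions config file (see QNN GPU Backend Extensions below); "model" is a hypothetical graph name:

```json
{
  "graph_names": ["model"],
  "disable_memory_optimizations": true,
  "disable_node_optimizations": true,
  "disable_queue_recording": true
}
```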
QNN GPU Backend Extensions¶
The QNN backend extension feature facilitates use of backend-specific APIs, namely custom configurations. More documentation on backend extensions can be found under qnn-net-run. Note that the scope of QNN backend extensions is limited to qnn-net-run.
GPU backend extensions provide an interface for passing custom options to the GPU backend. A list of graph names is required whenever graph custom config options are specified, as indicated by the dependencies in the schema below; the graph custom config options are applied to each named graph. These options are exercised by providing an extension shared library, libQnnGpuNetRunExtensions.so, and a config file if necessary. The schema for GPU backend extensions, with the various options available in the config, is shown below:
{
"type": "object",
"properties": {
// Corresponds to the graph name provided to QnnGraph_create
"graph_names" : {"type": "array", "items": {"type": "string"}},
// Precision Mode [optional]
// Corresponds to QnnGpuGraph_CustomConfig_t::precisionMode.
"precision_mode": {"type": "string", "enum": ["fp16", "fp32", "hybrid"]},
// Disable Memory Optimizations (e.g. sharing tensor memory) [optional]
// Corresponds to QnnGpuGraph_CustomConfig_t::disableMemoryOptimizations.
"disable_memory_optimizations": {"type": "boolean"},
// Disable Node Optimizations (e.g. node fusion) [optional]
// Corresponds to QnnGpuGraph_CustomConfig_t::disableNodeOptimizations.
"disable_node_optimizations": {"type": "boolean"},
// Kernel Disk Repository Path [optional]
// Corresponds to QnnGpuContext_CustomConfig_t::kernelRepoDir.
// Valid values are any valid path having read/write permissions.
"kernel_repo_path": {"type": "string"},
// Disable Recordable Command Queue [optional]
// Corresponds to QnnGpuGraph_CustomConfig_t::disableQueueRecording.
"disable_queue_recording" : {"type" : "boolean"},
// Context custom config performance hint [optional]
// Corresponds to QnnGpuContext_CustomConfig_t::perfHint.
"perf_hint": {"type": "string", "enum": ["high", "normal", "low"]}
// Weight Sharing [optional]
// Corresponds to QnnGpuGraph_CustomConfig_t::weightSharingEnabled.
"weight_sharing": {"type": "boolean"},
},
"dependencies": {
"precision_mode": ["graph_names"],
"disable_memory_optimizations": ["graph_names"],
"disable_node_optimizations": ["graph_names"],
"disable_queue_recording": ["graph_names"]
}
}
To use backend extension related parameters with qnn-net-run, pass the path to a JSON file via the --config_file argument.
$ qnn-net-run --model <qnn_model_name.so> \
    --backend <path_to_backend_library>/libQnnGpu.so \
    --output_dir <output_dir_for_result> \
    --input_list <path_to_input_list.txt> \
    --perf_profile <performance_mode_to_be_used> \
    --config_file <path_to_JSON_of_backend_extensions>
The config file above, specifying the minimum backend-extensions parameters through JSON, is given below:
{
"backend_extensions" :
{
"shared_library_path" : "path_to_shared_library",
"config_file_path" : "path_to_config_file"
}
}
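For completeness, a sketch of the inner config file referenced by config_file_path, combining context-level and graph-level options from the schema above; the graph name "model" and the repository path are placeholders:

```json
{
  "graph_names": ["model"],
  "precision_mode": "fp16",
  "perf_hint": "high",
  "kernel_repo_path": "/data/local/tmp/gpu_kernel_repo"
}
```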
Custom Profile Reader¶
The qnn-profile-viewer application can accept different readers and writers. The QNN GPU backend offers the libQnnGpuProfilingReader.so library to output profiling data in a JSON format.
Op Package Writing Guidelines¶
Detailed information regarding op package writing will be provided in a future release. In the meantime, please refer to
the op package example which can be found in ${QNN_SDK_ROOT}/examples/QNN/OpPackage/GPU/.
QNN Mem API Tutorial for GPU¶
The QNN GPU backend supports the QnnMem API, which allows user-provided OpenCL buffers to be used for input and output tensors. Allowing users to provide their own OpenCL buffers eliminates the need for data copies between the host CPU and the GPU.
Tuning Mode (Beta)¶
When tuning mode is enabled, all kernels are iteratively profiled and the performance metrics are stored in the performanceCache. The best-performing kernels are then used to generate a contextBinary, yielding a faster, optimized model.
Other Notes¶
- Variable input dimensions (e.g. batch) are currently not supported
- Variable output dimensions are currently not supported
- Signed zero values are supported