HTP

This section provides information specific to QNN HTP backend.

API Specializations

This section contains information related to API specializations for the HTP backend. All QNN HTP backend specializations are available under the <QNN_SDK_ROOT>/include/QNN/HTP/ directory.

The current version of the QNN HTP backend API is:

QNN_HTP_API_VERSION_MAJOR 5
QNN_HTP_API_VERSION_MINOR 39
QNN_HTP_API_VERSION_PATCH 0

Usage Expectations

1. The sequence of calls to QnnGraph_addNode() used to build the QNN model must follow node dependency order: a node may only be added after all nodes producing its inputs have been added.

2. The QnnBackend_registerOpPackage() takes in an optional parameter called ‘target’.

Given below are the acceptable values for ‘target’

  • “CPU” - for both Linux x86 and ARM op packages

  • “HTP” - for op packages that are loaded onto the HTP

  • nullptr

    1. For loading a context binary on ARM - loads the registered HTP op package

    2. For Linux x86 - registers the Linux x86 op package

3. Loading a context binary generated for a different HTP arch may give indeterminate results.
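Expectation 1 above can be illustrated with a short, self-contained sketch (this helper is illustrative only and not part of the QNN API): given each node's input dependencies, a topological sort yields an order in which QnnGraph_addNode() may safely be called, with every producer preceding its consumers.

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical helper (not a QNN API): given each node's input
// dependencies, produce an order in which QnnGraph_addNode() may be
// called so that producers precede consumers (Kahn's topological sort).
// Assumes every node, including pure producers, has an entry in 'deps'.
std::vector<std::string> nodeAdditionOrder(
    const std::map<std::string, std::vector<std::string>>& deps) {
  std::map<std::string, int> indegree;
  std::map<std::string, std::vector<std::string>> consumers;
  for (const auto& [node, inputs] : deps) {
    indegree[node] += 0;  // ensure the node has an indegree entry
    for (const auto& in : inputs) {
      consumers[in].push_back(node);
      ++indegree[node];
    }
  }
  std::queue<std::string> ready;
  for (const auto& [node, d] : indegree)
    if (d == 0) ready.push(node);  // nodes with no pending inputs
  std::vector<std::string> order;
  while (!ready.empty()) {
    std::string n = ready.front();
    ready.pop();
    order.push_back(n);
    for (const auto& c : consumers[n])
      if (--indegree[c] == 0) ready.push(c);
  }
  return order;
}
```

Any order produced this way satisfies the dependency requirement; for a simple chain conv -> relu -> pool, the only valid order adds conv first and pool last.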

QNN HTP Supported Operations

QNN HTP supports running quantized 8-bit and quantized 16-bit networks on all Qualcomm SoCs. The list of operations supported by the QNN HTP Quant runtime can be seen under the Backend Support HTP column in Supported Operations

QNN HTP supports running float32 networks using float16 math on select Qualcomm SoCs. If the QNN SDK supports QNN HTP Float, the list of operations supported by the HTP Float runtime can be seen under the Backend Support HTP_FP16 column in Supported Operations

For a QNN SDK that does support QNN HTP Float, please note that even though HTP and HTP_FP16 are listed under separate columns in Supported Operations, they are a single “logical” backend. On select SoCs, the QNN HTP backend library supports both quantized and float networks. Separate columns are used to distinguish the supported operations lists of the quantized and float QNN HTP runtimes.

QNN HTP Variable Batch

QNN HTP supports a variable batch dimension in a limited manner. The batch dimension at graph execute can be an integer multiple of the respective dimension provided at graph prepare. All input and output tensors must have the same batch multiple. For example, if the tensor dimensions provided at graph prepare are [b,h,w,d], then the graph can be executed with tensors of dimensions [n*b,h,w,d], where n is a positive integer.

QNN HTP Backend API

File QnnHtpDevice.h is the backend specialization header that goes along with File QnnDevice.h. This header file allows clients to configure the QnnDevice to cater to specific use-cases.

Struct QnnHtpGraph_CustomConfig_t is defined in File QnnHtpGraph.h, the backend specialization header that goes with File QnnGraph.h

QNN HTP Device Config Options (QnnHtpDevice_CustomConfig_t)

Option Name

Option Description

Default

When to use

QNN_HTP_DEVICE_CONFIG_OPTION_SOC

Integer value used to identify SoC model

QNN_SOC_MODEL_SM8350

Client can provide socModel to indicate which SoC is targeted

QNN_HTP_DEVICE_CONFIG_OPTION_ARCH

Data structure to configure a device to set the HTP Arch. The driver will use ops that are compatible with this HTP Arch

QNN_HTP_DEVICE_ARCH_NONE

Client can provide as part of the custom config when there are multiple devices in use

QNN_HTP_DEVICE_CONFIG_OPTION_SIGNEDPD

Enables the signed process domain. In order to use this flag, the client also needs to push a signed DSP image to the target

False (Unsigned Process Domain)

Use when the client requires a signed process domain. Check the Hexagon SDK documentation for more detail.

Clients can set the SoC model as shown below. Refer to Qnn_SocModel_t for setting the SoC model.

Note that Qnn_SocModel_t will be deprecated. For setting the SoC model, refer to the Supported Snapdragon Devices

QnnHtpDevice_CustomConfig_t customConfig;
customConfig.option   = QNN_HTP_DEVICE_CONFIG_OPTION_SOC;
customConfig.socModel = QNN_SOC_MODEL_SM8550;
QnnDevice_Config_t devConfig;
devConfig.option = QNN_DEVICE_CONFIG_OPTION_CUSTOM;
devConfig.customConfig = &customConfig;
const QnnDevice_Config_t* pDeviceConfig[] = {&devConfig, NULL};

Clients can set the HTP arch as shown below. Refer to QnnHtpDevice_Arch_t for setting the HTP arch.

QnnHtpDevice_CustomConfig_t customConfig;
customConfig.option    = QNN_HTP_DEVICE_CONFIG_OPTION_ARCH;
customConfig.arch.arch = QNN_HTP_DEVICE_ARCH_V73;
customConfig.arch.deviceId = 0;  // ID of the device to be used; defaults to 0 when a single device is used.
QnnDevice_Config_t devConfig;
devConfig.option = QNN_DEVICE_CONFIG_OPTION_CUSTOM;
devConfig.customConfig = &customConfig;
const QnnDevice_Config_t* pDeviceConfig[] = {&devConfig, NULL};

Client can set signed PD as shown below:

QnnHtpDevice_CustomConfig_t customConfig;
customConfig.option    = QNN_HTP_DEVICE_CONFIG_OPTION_SIGNEDPD;
customConfig.useSignedProcessDomain.useSignedProcessDomain = true;
customConfig.useSignedProcessDomain.deviceId = 0;  // ID of the device to be used; defaults to 0 when a single device is used.
QnnDevice_Config_t devConfig;
devConfig.option = QNN_DEVICE_CONFIG_OPTION_CUSTOM;
devConfig.customConfig = &customConfig;
const QnnDevice_Config_t* pDeviceConfig[] = {&devConfig, NULL};

QNN HTP Context Config Options (QnnHtpContext_CustomConfig_t)

Option Name

Option Description

Default

When to use

QNN_HTP_CONTEXT_CONFIG_OPTION_WEIGHT_SHARING_ENABLED

This feature allows common weights across graphs (max 64) to be shared and stored in a single context binary.

Weight sharing feature is disabled by default. When preparing context binary with multiple graphs that have common weights, these weights will not be shared.

When preparing a context binary with multiple graphs that share common weights, this feature can be utilized to reduce overall memory usage by sharing the common weights. Note that the shared weight blob contains weights used across the entire suite of graphs within a context. As a result, if only a single graph is utilized and the shared weight portion is large, this could adversely impact RAM and ROM. Weight sharing is most beneficial when the graphs have largely similar weights, allowing users to decrease RAM and ROM usage by sharing them. Further optimization in this area could be considered for future releases.

QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_MULTI_CONTEXTS

This enum allows users to associate multiple contexts to a group. When registering the first context to a group, two values must be configured – handle to the first context (it would always be 0 for the first one) and the maximum spill-fill buffer size. Subsequent registration of other contexts would need to pass down the context handle of the first context registered to a group.

Spill fill buffer is not shared across multiple contexts. Each graph creates its own buffer.

When multiple models are executed in sequence, it is possible to reserve a single spill-fill allocation that can be re-used across all the splits. This has the benefit of reducing RAM usage for the application at negligible performance impact. This should only be used with the QnnContext_createFromBinary API.

QNN_HTP_CONTEXT_CONFIG_OPTION_FILE_READ_MEMORY_BUDGET

This feature allows users to configure the read memory budget of the deserialized binary in megabytes (MB). It gives a hint to the backend to load the binary in chunks, instead of loading the entire binary into memory at once.

File read memory budget is turned off by default. In this case, the entire file is placed in memory at once.

With this feature enabled, initialization time might be impacted. Hence, it should only be used if peak memory usage is of concern. Users need to find the proper balance of performance vs. memory for their specific use-case. Additionally, this config should only be used with mmapped model buffers. Using non-mmapped model buffer will result in an undefined behavior.

QNN_HTP_CONTEXT_CONFIG_OPTION_IO_MEM_ESTIMATION

This field enables I/O memory estimation during QnnContext_createFromBinary API when multiple PDs are available. When enabled, it estimates the total size of the I/O tensors required by the context to ensure sufficient space on the PD before deserialization.

I/O memory estimation is turned off by default.

This feature can help with memory registration failures in large models when multiple PDs are available. Enabling this feature increases peak RAM usage during context initialization phase in QnnContext_createFromBinary, but sustained RAM remains unaffected.

QNN_HTP_CONTEXT_CONFIG_OPTION_DSP_MEMORY_PROFILING_ENABLED

This feature allows context-related total DSP heap usage profiling as detailed in the DSP heap profiling section.

DSP heap profiling is turned off by default. In this case, no DSP profiling will occur.

With this feature enabled, profiling data will be captured upon QnnContext_createFromBinary and QnnContext_free.

QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES

This field enables resource sharing across different contexts during the QnnContext_createFromBinaryListAsync API. When enabled, memory optimizations are applied based on the QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES_OPTIMIZATION_TYPE.

Resource sharing is turned off by default.

This feature optimizes runtime HTP virtual address space and/or memory usage. Note: This feature cannot be used together with graph switching.

QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES_OPTIMIZATION_TYPE

This field is supported only when QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES is true; otherwise, it is ignored. Refer here for the available configuration options.

When QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES is true and no input is provided for this field, SEQUENTIAL_WITH_VA_OPTIMIZATION is the default value.

This provides a way to customize resource sharing. More details can be found here.

QNN_HTP_CONTEXT_CONFIG_OPTION_INIT_ACCELERATION

This field enables initialization acceleration during QnnContext_createFromBinary. When enabled, it allows hardware to utilize maximum resources for accelerating model initialization.

Init acceleration is turned OFF by default.

When initializing a graph through QnnContext_createFromBinary, this feature can be utilized to improve initialization time. It may not be effective for small graphs with few ops.

QNN_HTP_CONTEXT_CONFIG_OPTION_SKIP_VALIDATION_ON_BINARY_SECTION

This field indicates that the crc32 check during Lora super adapter apply will be skipped in the QnnContext_applyBinarySection API.

Skip crc32 validation on binary section is turned off by default.

When this feature is enabled, Lora super adapter apply no longer performs the crc32 check for non-base adapters (base adapters never undergo a crc32 check). Therefore, in super adapter use cases, non-base adapter apply time is improved.

QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_CONCURRENT_RESOURCE_SHARING

This field allows users to associate one or more contexts with the same priority to a group to enable concurrent spill-fill and VTCM backup buffer sharing. When registering the first context to a group, two values must be configured – handle to the first context (it would always be 0 for the first one) and the maximum spill-fill buffer size. Subsequent registration of other contexts would need to pass down the context handle of the first context registered to a group. If a group contains only one context, it means all graphs within that context are sharing the same spill-fill and VTCM backup buffers. The maximum spill-fill buffer size is directly set using the configured value provided. However, the maximum VTCM backup buffer size is determined internally based on the device’s capabilities.

Spill-fill and VTCM backup buffers are not shared across multiple contexts. Each graph creates its own buffer.

When multiple models are executed concurrently, it is possible to reserve a single spill-fill and VTCM backup buffer allocation per priority, which could be re-used across all the splits. This has the benefit of reducing RAM usage for the application at negligible performance impact. This should only be used with the QnnContext_createFromBinary API.

QNN_HTP_CONTEXT_CONFIG_OPTION_LORA_WEIGHT_SHARING_ENABLED

This feature extracts the updatable weights across all graphs and stores the largest one as a separate weight blob, shared within a single context binary.

The Lora weight sharing feature is disabled by default. When preparing a context binary with multiple lora graphs, no lora memory will be shared.

When preparing a context binary that includes multiple lora graphs, this feature can be leveraged to reduce overall memory usage by sharing lora memories. It is important to note that the shared lora memory contains only the updatable weights relevant to a single lora graph. To update the contents of the lora memory, users must call the QnnContext_applyBinarySection API after deserialization and before execution. Additionally, for graph switching scenarios, the HTP backend will always re-apply the cached binary section blob prior to execution.

QNN_HTP_CONTEXT_CONFIG_OPTION_PREPARE_ONLY

Enables optimizations regarding memory and performance time when graph preparation is the only task the client wants to perform.

The prepare-only flag is turned off by default.

When clients want to simply prepare a binary, the backend does not need to set up components required for graph execution. With this flag enabled, graph execution cannot be called.

Clients can enable weight sharing as follows:

QnnHtpContext_CustomConfig_t customConfig;
customConfig.option  = QNN_HTP_CONTEXT_CONFIG_OPTION_WEIGHT_SHARING_ENABLED;
customConfig.weightSharingEnabled = true;  // set to false to disable weight sharing
QnnContext_Config_t contextConfig;
contextConfig.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig.customConfig = &customConfig;
const QnnContext_Config_t* pContextConfig[] = {&contextConfig, NULL};

Note

The Weight Sharing feature has certain requirements and limitations:

  1. Only offline prepare on the x86_Linux platform is supported. Online prepare, and offline prepare on other platforms (ARM/x86_Windows), are not supported.

  2. Only Hexagon v73 and onward architectures are supported.

  3. Only supported within a single PD. Sharing across PDs, or across different VTCM sizes and SoCs, is not supported.

  4. Previously generated binaries will not automatically benefit from Weight Sharing. Users must regenerate serialized binaries to benefit from it. Old serialized binaries will still work, without the weight sharing feature.

Clients can set shared spill-fill buffer details for multiple contexts as follows:

Note

This feature is only enabled for an offline prepare use case. Information regarding spill-fill size is written as part of Struct QnnSystemContext_BinaryInfo_t defined in QnnSystemContext.h. The hwInfoBlob field within the struct contains the index of each graph and the respective spill-fill buffer size utilized by that graph, as defined in QnnHtpSystemContext.h.

Users should determine the maximum spill-fill buffer size needed across all the contexts before proceeding to deserialize. There are two ways to achieve this:

  1. Use qnn-context-binary-utility to output binary details in a JSON file. It essentially prints the content of Struct QnnSystemContext_BinaryInfo_t, along with HTP specific content as defined in QnnHtpSystemContext.h. Search for the “spillFillBufferSize” key to figure out the spill fill buffer size required for each of the graphs.

  2. Add checks at runtime. Users can parse the content of the binary from Struct QnnSystemContext_BinaryInfo_t struct along with HTP specific information from QnnHtpSystemContext.h.
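Either way, the value to pass as the maximum spill-fill buffer size is simply the maximum across all graphs of all contexts in the group. A minimal self-contained sketch (illustrative only, not a QNN API), assuming the per-graph "spillFillBufferSize" values have already been collected:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch (not a QNN API): given the per-graph
// spillFillBufferSize values collected for each context (e.g. from the
// qnn-context-binary-utility JSON output), compute the group-wide
// maximum to use as maxSpillFillBuffer.
uint64_t maxSpillFillAcrossContexts(
    const std::vector<std::vector<uint64_t>>& perContextGraphSizes) {
  uint64_t maxSize = 0;
  for (const auto& graphSizes : perContextGraphSizes)
    for (uint64_t s : graphSizes)
      maxSize = std::max(maxSize, s);  // track the largest across all graphs
  return maxSize;
}
```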

// ===== FIRST CONTEXT =====
QnnHtpContext_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_MULTI_CONTEXTS;
QnnHtpContext_GroupRegistration_t groupInfo;
groupInfo.firstGroupHandle     = 0x0;      // New group
groupInfo.maxSpillFillBuffer   = 30081024; // Max spill-fill buffer across contexts. Must be > 0
customConfig.groupRegistration = groupInfo;
QnnContext_Config_t contextConfig;
contextConfig.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig.customConfig = &customConfig;
const QnnContext_Config_t* cfgs[] = {&contextConfig, NULL};
QnnContext_createFromBinary(..., cfgs, ..., &contextHandle, ...);

// ===== SECOND CONTEXT =====
QnnHtpContext_CustomConfig_t customConfig2;
customConfig2.option = QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_MULTI_CONTEXTS;
QnnHtpContext_GroupRegistration_t groupInfo2;
groupInfo2.firstGroupHandle     = contextHandle;  // associated with the above contextHandle
groupInfo2.maxSpillFillBuffer   = 30081024;       // same value as above OR don't set this now
customConfig2.groupRegistration = groupInfo2;
QnnContext_Config_t contextConfig2;
contextConfig2.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig2.customConfig = &customConfig2;
const QnnContext_Config_t* cfgs2[] = {&contextConfig2, NULL};
QnnContext_createFromBinary(..., cfgs2, ..., &contextHandle2, ...);

// ===== THIRD CONTEXT =====
QnnHtpContext_CustomConfig_t customConfig3;
customConfig3.option = QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_MULTI_CONTEXTS;
QnnHtpContext_GroupRegistration_t groupInfo3;
groupInfo3.firstGroupHandle     = contextHandle;  // associated with the above contextHandle
groupInfo3.maxSpillFillBuffer   = 30081024;       // same value as above or don't set this
customConfig3.groupRegistration = groupInfo3;
QnnContext_Config_t contextConfig3;
contextConfig3.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig3.customConfig = &customConfig3;
const QnnContext_Config_t* cfgs3[] = {&contextConfig3, NULL};
QnnContext_createFromBinary(..., cfgs3, ..., &contextHandle3, ...);

Clients can set shared spill-fill and VTCM backup buffers for concurrent resource sharing as follows:

Note

This feature is only supported on Android for the V81 Hexagon architecture, and is only enabled for an offline prepare use case. Information regarding spill-fill size is written as part of Struct QnnSystemContext_BinaryInfo_t defined in QnnSystemContext.h. The hwInfoBlob field within the struct contains the index of each graph and the respective spill-fill buffer size utilized by that graph, as defined in QnnHtpSystemContext.h.

Users should determine the maximum spill-fill buffer size needed across the contexts for each priority before proceeding to deserialize. There are two ways to achieve this:

  1. Use qnn-context-binary-utility to output binary details in a JSON file. It essentially prints the content of Struct QnnSystemContext_BinaryInfo_t, along with HTP specific content as defined in QnnHtpSystemContext.h. Search for the “spillFillBufferSize” key to figure out the spill fill buffer size required for each of the graphs.

  2. Add checks at runtime. Users can parse the content of the binary from Struct QnnSystemContext_BinaryInfo_t struct along with HTP specific information from QnnHtpSystemContext.h.

In addition, the context/graph priorities within each group must be the same, and the priorities cannot be modified by either QnnContext_setConfig() or QnnGraph_setConfig() on the fly. However, if this concurrent feature is not enabled, you may have graphs with different priorities within the same context.

// ===== CONTEXT #1: NEW GROUP =====
QnnHtpContext_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_CONCURRENT_RESOURCE_SHARING;
QnnHtpContext_GroupRegistration_t groupInfo;
groupInfo.firstGroupHandle               = 0x0;      // New group, can be any priority
groupInfo.maxSpillFillBuffer             = 30081024; // Max spill-fill buffer across contexts. Must be > 0
customConfig.concurrentGroupRegistration = groupInfo;
QnnContext_Config_t contextConfig;
contextConfig.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig.customConfig = &customConfig;
const QnnContext_Config_t* cfgs[] = {&contextConfig, NULL};
QnnContext_createFromBinary(..., cfgs, ..., &contextHandle1, ...);

// ===== CONTEXT #2: SAME GROUP AS CONTEXT #1 =====
QnnHtpContext_CustomConfig_t customConfig2;
customConfig2.option = QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_CONCURRENT_RESOURCE_SHARING;
QnnHtpContext_GroupRegistration_t groupInfo2;
// Must be the same priority as CONTEXT #1
groupInfo2.firstGroupHandle               = contextHandle1; // associated with the above contextHandle1
groupInfo2.maxSpillFillBuffer             = 30081024;       // same value as above OR don't set this now
customConfig2.concurrentGroupRegistration = groupInfo2;
QnnContext_Config_t contextConfig2;
contextConfig2.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig2.customConfig = &customConfig2;
const QnnContext_Config_t* cfgs2[] = {&contextConfig2, NULL};
QnnContext_createFromBinary(..., cfgs2, ..., &contextHandle2, ...);

// ===== CONTEXT #3: SAME GROUP AS CONTEXT #1 =====
QnnHtpContext_CustomConfig_t customConfig3;
customConfig3.option = QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_CONCURRENT_RESOURCE_SHARING;
QnnHtpContext_GroupRegistration_t groupInfo3;
// Must be the same priority as CONTEXT #1
groupInfo3.firstGroupHandle               = contextHandle1; // associated with the above contextHandle1
groupInfo3.maxSpillFillBuffer             = 30081024;       // same value as above or don't set this
customConfig3.concurrentGroupRegistration = groupInfo3;
QnnContext_Config_t contextConfig3;
contextConfig3.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig3.customConfig = &customConfig3;
const QnnContext_Config_t* cfgs3[] = {&contextConfig3, NULL};
QnnContext_createFromBinary(..., cfgs3, ..., &contextHandle3, ...);

// ===== CONTEXT #4: NEW GROUP WITH ONLY ONE CONTEXT =====
QnnHtpContext_CustomConfig_t customConfig4;
customConfig4.option = QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_CONCURRENT_RESOURCE_SHARING;
QnnHtpContext_GroupRegistration_t groupInfo4;
groupInfo4.firstGroupHandle               = 0x0;            // New group, can be any priority
groupInfo4.maxSpillFillBuffer             = 30081024;       // Max spill-fill buffer across contexts. Must be > 0
customConfig4.concurrentGroupRegistration = groupInfo4;
QnnContext_Config_t contextConfig4;
contextConfig4.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig4.customConfig = &customConfig4;
const QnnContext_Config_t* cfgs4[] = {&contextConfig4, NULL};
QnnContext_createFromBinary(..., cfgs4, ..., &contextHandle4, ...);

Clients can configure the read memory budget of a serialized binary as follows:

QnnHtpContext_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_CONTEXT_CONFIG_OPTION_FILE_READ_MEMORY_BUDGET;
customConfig.fileReadMemoryBudgetInMb = 25;
QnnContext_Config_t contextConfig;
contextConfig.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig.customConfig = &customConfig;
const QnnContext_Config_t* cfgs[] = {&contextConfig, NULL};
QnnContext_createFromBinary(..., cfgs, ..., &contextHandle, ...);

In the example above, 25 MB chunks are loaded into memory at a time. If the user sets the value to greater than the file size, min(fileSize, fileReadMemoryBudgetInMb) is used. The value should be greater than 0 and less than or equal to the file size. If a value of 0 is passed, this feature is turned off.
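The clamping behavior described above can be sketched as a small self-contained function (illustrative only, not a QNN API):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch (not a QNN API) of the budget clamping described
// above: the effective chunk size is min(fileSize, budget), and a
// budget of 0 disables the feature (the whole file is loaded at once).
uint64_t effectiveReadChunkMb(uint64_t fileSizeMb, uint64_t budgetMb) {
  if (budgetMb == 0) return fileSizeMb;  // feature off: load the entire file
  return std::min(fileSizeMb, budgetMb);
}
```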

Clients can configure I/O memory estimation as follows:

QnnHtpContext_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_CONTEXT_CONFIG_OPTION_IO_MEM_ESTIMATION;
customConfig.ioMemoryEstimation = true;
QnnContext_Config_t contextConfig;
contextConfig.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig.customConfig = &customConfig;
const QnnContext_Config_t* cfgs[] = {&contextConfig, NULL};
QnnContext_createFromBinary(..., cfgs, ..., &contextHandle, ...);

Clients can configure resource sharing and the resource sharing optimization type as follows:

Note

This custom config needs to be set and passed as a group configuration and not as individual context configuration. QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES_OPTIMIZATION_TYPE is applied only when QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES is true; otherwise, it is ignored.

The following table lists available configuration options for QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES_OPTIMIZATION_TYPE.

Option Name

Option Description

SEQUENTIAL_WITH_VA_OPTIMIZATION

  • Graphs have to be executed sequentially, otherwise unexpected system behavior may be observed.

  • Optimizes both HTP Virtual Address (VA) space and runtime memory usage.

  • Ideal for large generative AI workloads with multiple splits.

  • VA optimization is supported only on specific SoCs. If used on an unsupported SoC, the API will return QNN_CONTEXT_ERROR_UNSUPPORTED_FEATURE.

SEQUENTIAL_WITHOUT_VA_OPTIMIZATION

  • Graphs have to be executed sequentially, otherwise unexpected system behavior may be observed.

  • Optimizes runtime memory usage, but without explicit HTP VA space optimization.

  • Suitable for smaller generative AI workloads and traditional AI models.

CONCURRENT_OPTIMIZATION

  • Designed for concurrent graph execution with runtime memory optimization.

  • When enabled, spill-fill and VTCM backup buffers are shared by contexts with the same priorities.

  • This feature is only supported on Android for the V81 Hexagon architecture.

QnnHtpContext_CustomConfig_t customListConfig[2];
customListConfig[0].option = QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES;
customListConfig[0].shareResources = true;
customListConfig[1].option = QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES_OPTIMIZATION_TYPE;
customListConfig[1].shareResOptType = SEQUENTIAL_WITH_VA_OPTIMIZATION;
QnnContext_Config_t contextConfigs[2];
contextConfigs[0].option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfigs[0].customConfig = &customListConfig[0];
contextConfigs[1].option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfigs[1].customConfig = &customListConfig[1];
const QnnContext_Config_t* cfgs[] = {&contextConfigs[0], &contextConfigs[1], NULL};
QnnContext_createFromBinaryListAsync(..., &contextParams, cfgs, ...);

Clients can configure init acceleration as follows:

QnnHtpContext_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_CONTEXT_CONFIG_OPTION_INIT_ACCELERATION;
customConfig.initAcceleration = true;
QnnContext_Config_t contextConfig;
contextConfig.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig.customConfig = &customConfig;
const QnnContext_Config_t* cfgs[] = {&contextConfig, NULL};
QnnContext_createFromBinary(..., cfgs, ..., &contextHandle, ...);

Clients can configure skip validation on binary section as follows:

QnnHtpContext_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_CONTEXT_CONFIG_OPTION_SKIP_VALIDATION_ON_BINARY_SECTION;
customConfig.skipValidationOnBinarySection = true;
QnnContext_Config_t contextConfig;
contextConfig.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig.customConfig = &customConfig;
const QnnContext_Config_t* cfgs[] = {&contextConfig, NULL};
QnnContext_createFromBinary(..., cfgs, ..., &contextHandle, ...);

Clients can enable lora weight sharing as follows:

QnnHtpContext_CustomConfig_t customConfig;
customConfig.option  = QNN_HTP_CONTEXT_CONFIG_OPTION_LORA_WEIGHT_SHARING_ENABLED;
customConfig.loraWeightSharingEnabled = true;  // set to false to disable lora weight sharing
QnnContext_Config_t contextConfig;
contextConfig.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig.customConfig = &customConfig;
const QnnContext_Config_t* pContextConfig[] = {&contextConfig, NULL};

Note

The Lora Weight Sharing feature has certain requirements and limitations:

  1. Only offline prepare on the x86_Linux platform is supported. Online prepare, and offline prepare on other platforms (ARM/x86_Windows), are not supported.

  2. Previously generated binaries will not automatically benefit from Lora Weight Sharing. Users must regenerate serialized binaries to benefit from it. Old serialized binaries will still work, without the lora weight sharing feature.

QNN HTP Graph Config Options (QnnHtpGraph_CustomConfig_t)

Option Name

Option Description

Default

When to use

QNN_HTP_GRAPH_CONFIG_OPTION_OPTIMIZATION

This enum provides different HTP graph optimization options that can be used to finalize the graph for optimum performance.

QNN_HTP_GRAPH_OPTIMIZATION_TYPE_UNKNOWN

Client can provide this option when an optimization is desired for the graph finalize process

QNN_HTP_GRAPH_CONFIG_OPTION_PRECISION

An enum which defines the different precision modes supported by QNN backends

QNN_PRECISION_FLOAT32

Clients provide this when they want a specific math type used in the implementation of an operation

QNN_HTP_GRAPH_CONFIG_OPTION_VTCM_SIZE_IN_MB/QNN_HTP_GRAPH_CONFIG_OPTION_VTCM_SIZE

Used to define the amount of VTCM memory (in MB) to reserve and utilize

4

When a client wants to use a specific amount (<= MAX_SOC_VTCM) or the maximum VTCM amount

  • To use the maximum VTCM amount, set the value to QNN_HTP_GRAPH_CONFIG_OPTION_MAX and specify the target SoC (QNN_HTP_DEVICE_CONFIG_OPTION_SOC).

  • When loading a context binary generated with VTCM = QNN_HTP_GRAPH_CONFIG_OPTION_MAX, if QnnContext_BinaryCompatibilityType_t is set to STRICT, QNN performs the optimality check for the graph VTCM size.

QNN_HTP_GRAPH_CONFIG_OPTION_FOLD_RELU_ACTIVATION_INTO_CONV_OFF

For any graph where a Convolution or Convolution-like operation is followed by Relu or ReluMinMax, the Relu is folded into the Convolution operation

always fold

Clients that cannot guarantee that the quantization parameters of the Relu output exactly reflect the range of the data after the Relu should set this flag. This comes at the cost of performance, but preserves the requested quantization encodings

QNN_HTP_GRAPH_CONFIG_OPTION_SHORT_DEPTH_CONV_ON_HMX_OFF

Run all Convolution operations using HMX instructions

always use HMX instructions

Clients that have graphs where weights are not symmetric and have Convolution with short depths should set this flag to guarantee accurate results

QNN_HTP_GRAPH_CONFIG_OPTION_NUM_HVX_THREADS

Used to define the number of HVX threads to reserve and utilize for a particular graph

4

When a client wants to set aside a specific number of HVX threads for other parallel workloads

QNN_HTP_GRAPH_CONFIG_OPTION_WEIGHTS_PACKING

Used to enable weights packing for a particular graph

false

This feature is currently in experimental beta release, any proposed method of usage and behavior may change in future releases. At graph prepare, enabling this feature will cause 8-bit weights that are in the 4-bit range to be stored in the context binary as packed 4-bit, potentially reducing the context binary size. However, please note that while this may reduce the size of a context binary, it does not guarantee any performance improvements.

QNN_HTP_GRAPH_CONFIG_OPTION_FINALIZE_CONFIG

This option sets the graph finalize level settings.

No explicit setting.

It is used to configure the graph finalize level. More details will be provided in a separate supplementary note.

The QNN_HTP_GRAPH_OPTIMIZATION_TYPE_FINALIZE_OPTIMIZATION_FLAG = 3 configuration takes the QNN_HTP_DEVICE_CONFIG_OPTION_SOC configuration into account when possible. When SoC information is taken into account, the O3 configuration is expected to produce a more optimal graph in most cases, but may result in a less optimal graph in some cases. It may also yield a larger context binary size, and hence a possible degradation in graph loading time.
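As a sketch of how a graph custom config is passed (mirroring the device and context config examples earlier in this section; the field name vtcmSizeInMB and the QNN_GRAPH_CONFIG_OPTION_CUSTOM wrapping are assumptions that should be verified against QnnHtpGraph.h and QnnGraph.h in your SDK version):

```cpp
// Sketch only: set the VTCM size for a graph via the HTP custom config.
// Verify field names against QnnHtpGraph.h for your SDK version.
QnnHtpGraph_CustomConfig_t customConfig;
customConfig.option       = QNN_HTP_GRAPH_CONFIG_OPTION_VTCM_SIZE_IN_MB;
customConfig.vtcmSizeInMB = 8;  // requested VTCM reservation in MB
QnnGraph_Config_t graphConfig;
graphConfig.option       = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
graphConfig.customConfig = &customConfig;
const QnnGraph_Config_t* pGraphConfig[] = {&graphConfig, NULL};
```

The same wrapping pattern applies to the other graph options in the table above (optimization, precision, HVX threads, and so on), with the corresponding option enum and union field.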

Note

It is recommended to refer to the Hexagon SDK documentation before reading the following section, as significant functionality described here inherently uses Hexagon SDK APIs.

If the user specifies both QNN_HTP_DEVICE_CONFIG_OPTION_SOC and QNN_HTP_DEVICE_CONFIG_OPTION_ARCH , the HTP backend driver uses the QNN_HTP_DEVICE_CONFIG_OPTION_SOC configuration and ignores the QNN_HTP_DEVICE_CONFIG_OPTION_ARCH configuration. We recommend using QNN_HTP_DEVICE_CONFIG_OPTION_SOC instead of QNN_HTP_DEVICE_CONFIG_OPTION_ARCH.

Typedef QnnHtpDevice_DeviceInfoExtension_t is a backend-specific subcomponent of Struct QnnDevice_HardwareDeviceInfo_t. Information for these structs is provided by the client for offline operation, and can be populated by a call to QnnDevice_getPlatformInfo()

QNN HTP Device Info Extension Options (QnnHtpDevice_DeviceInfoExtension_t)

Option Name

Option Description

Default

When to use

QNN_HTP_DEVICE_CONFIG_OPTION_PCIE_DEVICE_INFO_EXTENSION

This structure provides info about a device connected over the PCIe bus: VTCM size (in MB), socModel, number of NSPs on the PCIe device, signed PD support, DLBC support, and device architecture

NULL

For online operation, the caller should get this info from QnnDevice_getPlatformInfo. For offline operation, the caller needs to create this structure and fill in the correct information for QnnDevice_create

QNN_HTP_DEVICE_CONFIG_OPTION_ON_CHIP_DEVICE_INFO_EXTENSION

This structure provides info about the NSP device inside the SoC: VTCM size (in MB), socModel, signed PD support, DLBC support, and device architecture

NULL

For online operation, the caller should get this info from QnnDevice_getPlatformInfo. For offline operation, the caller needs to create this structure and fill in the correct information for QnnDevice_create

QNN HTP Performance Infrastructure API

Clients can invoke QnnDevice_getInfrastructure after loading the QNN HTP library and then invoke methods that are available in QnnHtpPerfInfrastructure.h. These APIs allow a client to control the CPU and HTP accelerator's system settings for performance purposes. A few use-cases are:

  1. Set up a voting policy by controlling the core clocks and voltage corners.

  2. Set up DCVS modes to achieve different performance settings as applicable to a use-case.

  3. Set up a specific RPC Control Latency per session to control CPU low-power modes and reduce the impact of CPU wake-up latency on FastRPC. Latency-critical applications are recommended to vote for a value greater than 0 and less than 200 us; applications with moderate latency requirements can vote for more than 200 us, or keep the default of 0 us with no specific setting needed.
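The latency guidance in item 3 above can be summarized with a small hypothetical helper (not part of the QNN API) that classifies an RPC control latency vote in microseconds; the treatment of exactly 200 us is a choice here, since the text leaves that boundary unspecified:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification of an RPC control latency vote (microseconds),
// following the guidance above; not part of the QNN API.
enum class LatencyProfile { Default, LatencyCritical, Moderate };

LatencyProfile classifyRpcControlLatency(uint32_t latencyUs) {
    if (latencyUs == 0)  return LatencyProfile::Default;          // default, no setting needed
    if (latencyUs < 200) return LatencyProfile::LatencyCritical;  // > 0 and < 200 us
    return LatencyProfile::Moderate;                              // 200 us and above
}
```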

Note

Setting the number of threads for the accelerator is not supported. QNN Perf Infrastructure maps directly to FastRPC features on the CPU side, and on the HTP side maps to HAP_Power DCVS v3. For detailed configuration of voltage corners and DCVS modes, please refer to the Hexagon SDK documentation on the HAP_power_set API.

QNN provides an interface through APIs to control DSP core and bus clocks based on power and performance needs. These APIs allow programmers to adjust DSP power usage per the application's power requirements, providing a good balance between power consumption and performance. The Performance Parameters table shows settings for various user-defined performance profiles. These performance parameters are listed in QnnHtpPerfInfrastructure.h and can be used to control performance settings. Usage of these performance parameters is shown below.

Clock Corner Settings - Used to set the bus and core operating corners for performance tuning.

QnnHtpPerfInfrastructure_DcvsV3_t dcvsV3Config;

Bus Parameters - Bus params is used to set the bus clock parameters.

dcvsV3Config.setBusParams            = 1;    //True to consider Bus parameter otherwise False.
dcvsV3Config.busVoltageCornerMin     = DCVS_VOLTAGE_VCORNER_TURBO;
dcvsV3Config.busVoltageCornerTarget  = DCVS_VOLTAGE_VCORNER_TURBO;
dcvsV3Config.busVoltageCornerMax     = DCVS_VOLTAGE_VCORNER_TURBO;

Core Parameters - Core params is used to set the core clock parameters.

dcvsV3Config.setCoreParams            = 1;     //True to consider Core parameter otherwise False.
dcvsV3Config.coreVoltageCornerMin     = DCVS_VOLTAGE_VCORNER_TURBO;
dcvsV3Config.coreVoltageCornerTarget  = DCVS_VOLTAGE_VCORNER_TURBO;
dcvsV3Config.coreVoltageCornerMax     = DCVS_VOLTAGE_VCORNER_TURBO;

DCVS Enable - The setDcvsEnable and dcvsEnable parameters enable the user to vote for DCVS participation.

dcvsV3Config.setDcvsEnable = 1;
dcvsV3Config.dcvsEnable    = 0;   // zero value means to disable dcvs

Sleep Latency - The setSleepLatency and sleepLatency parameters can be used to request a sleep latency in microseconds.

dcvsV3Config.setSleepLatency = 1;
dcvsV3Config.sleepLatency    = 100;   // give sleep latency value, ranges 10-65535 us

Sleep Disable - The setSleepDisable and sleepDisable parameters enable the user to disable sleep (all LPM modes) on the HTP.

dcvsV3Config.setSleepDisable = 1;
dcvsV3Config.sleepDisable    = 1;    // non zero value means disable sleep

Power Mode - The powerMode parameter enables the user to request a particular DCVS mode when setDcvsEnable and dcvsEnable are both set to TRUE.

dcvsV3Config.powerMode = QNN_HTP_PERF_INFRASTRUCTURE_POWERMODE_PERFORMANCE_MODE;

QNN HTP Performance Infrastructure APIs provide an interface for the client to control the performance and system settings of the QNN HTP accelerator.

Create Power Config ID - This API associates a unique client context so that subsequent APIs can refer to the same context using the created ID.

Qnn_ErrorHandle_t createPowerConfigId(uint32_t deviceId, uint32_t coreId, uint32_t* powerConfigId);

//Usage
uint32_t powerConfigId;  // the API below creates the power config ID
uint32_t deviceId = 0;
uint32_t coreId = 0;
sample_app::StatusCode sample_app::QnnSampleApp::createPowerConfigId() {
    QnnDevice_Infrastructure_t deviceInfra = nullptr;
    QnnInterface_t qnnInterface;  // assumed to be populated beforehand (e.g., via QnnInterface_getProviders)
    Qnn_ErrorHandle_t devErr = qnnInterface.QNN_INTERFACE_VER_NAME.deviceGetInfrastructure(&deviceInfra);
    if (devErr != QNN_SUCCESS) {
        QNN_ERROR("device error");
        return StatusCode::FAILURE;
      }
    QnnHtpDevice_Infrastructure_t *htpInfra = static_cast<QnnHtpDevice_Infrastructure_t *>(deviceInfra);
    QnnHtpDevice_PerfInfrastructure_t perfInfra = htpInfra->perfInfra;
    Qnn_ErrorHandle_t perfInfraErr = perfInfra.createPowerConfigId(deviceId, coreId, &powerConfigId);
    if (perfInfraErr != QNN_SUCCESS) {
        QNN_ERROR("createPowerConfigId failed");
        return StatusCode::FAILURE;
      }
    return StatusCode::SUCCESS;
}

Set Power Config - This API allows the client to set up a system power configuration that enables different performance modes. It uses the HAP_power_dcvs_v3_payload struct to configure HAP power parameters; for a detailed description of these parameters, please refer to the Hexagon SDK HAP_power_dcvs_v3_payload documentation. The setPowerConfig example below uses settings that give high performance; users can experiment with different settings according to their requirements.

Qnn_ErrorHandle_t setPowerConfig(uint32_t powerConfigId, const QnnHtpPerfInfrastructure_PowerConfig_t** config);

//Usage
sample_app::StatusCode sample_app::QnnSampleApp::setPowerConfig() {
    QnnDevice_Infrastructure_t deviceInfra = nullptr;
    QnnInterface_t qnnInterface;  // assumed to be populated beforehand (e.g., via QnnInterface_getProviders)
    Qnn_ErrorHandle_t devErr = qnnInterface.QNN_INTERFACE_VER_NAME.deviceGetInfrastructure(&deviceInfra);
    if (devErr != QNN_SUCCESS) {
        QNN_ERROR("device error");
        return StatusCode::FAILURE;
    }
    QnnHtpDevice_Infrastructure_t *htpInfra = static_cast<QnnHtpDevice_Infrastructure_t *>(deviceInfra);
    QnnHtpDevice_PerfInfrastructure_t perfInfra = htpInfra->perfInfra;

    QnnHtpPerfInfrastructure_PowerConfig_t powerConfig;
    memset(&powerConfig, 0, sizeof(powerConfig));
    powerConfig.option                     = QNN_HTP_PERF_INFRASTRUCTURE_POWER_CONFIGOPTION_DCVS_V3;
    powerConfig.dcvsV3Config.dcvsEnable    = 0; //1- To enable Dcvs and consider dcvs power mode, 0- To disable dcvs
    powerConfig.dcvsV3Config.setDcvsEnable = 1;
    powerConfig.dcvsV3Config.contextId     = powerConfigId;  //use the power config id created

    // refer QnnHtpPerfInfrastructure.h
    powerConfig.dcvsV3Config.powerMode       = QNN_HTP_PERF_INFRASTRUCTURE_POWERMODE_PERFORMANCE_MODE;
    powerConfig.dcvsV3Config.setSleepLatency = 1; //True to consider Latency parameter otherwise False
    powerConfig.dcvsV3Config.setBusParams    = 1; //True to consider Bus parameter otherwise False
    powerConfig.dcvsV3Config.setCoreParams   = 1; //True to consider Core parameter otherwise False
    powerConfig.dcvsV3Config.sleepDisable    = 1; //True to disable sleep, False to re-enable sleep
    powerConfig.dcvsV3Config.setSleepDisable = 1; //True to consider sleep disable/enable parameter otherwise False

    //Set Sleep latency parameter
    powerConfig.dcvsV3Config.sleepLatency    =  40; // set dsp sleep latency ranges 10-65535 micro sec, refer hexagon sdk

    //set Bus Clock Parameters (refer QnnHtpPerfInfrastructure.h)
    powerConfig.dcvsV3Config.busVoltageCornerMin     = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;
    powerConfig.dcvsV3Config.busVoltageCornerTarget  = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;
    powerConfig.dcvsV3Config.busVoltageCornerMax     = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;

    //set Core Clock Parameters (refer QnnHtpPerfInfrastructure.h)
    powerConfig.dcvsV3Config.coreVoltageCornerMin    = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;
    powerConfig.dcvsV3Config.coreVoltageCornerTarget = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;
    powerConfig.dcvsV3Config.coreVoltageCornerMax    = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;

    // Set power config with different performance parameters
    const QnnHtpPerfInfrastructure_PowerConfig_t *powerConfigs[] = {&powerConfig, NULL};

    Qnn_ErrorHandle_t perfInfraErr = perfInfra.setPowerConfig(powerConfigId, powerConfigs);
    if (perfInfraErr != QNN_SUCCESS) {
        QNN_ERROR("setPowerConfig failed");
        return StatusCode::FAILURE;
    }
    return StatusCode::SUCCESS;
}

Destroy Power Config ID - This API allows the client to destroy a power configuration ID that was created earlier.

Qnn_ErrorHandle_t destroyPowerConfigId(uint32_t powerConfigId);

//Usage
sample_app::StatusCode sample_app::QnnSampleApp::destroyPowerConfigId() {
    QnnDevice_Infrastructure_t deviceInfra = nullptr;
    QnnInterface_t qnnInterface;  // assumed to be populated beforehand (e.g., via QnnInterface_getProviders)
    Qnn_ErrorHandle_t devErr = qnnInterface.QNN_INTERFACE_VER_NAME.deviceGetInfrastructure(&deviceInfra);
    if (devErr != QNN_SUCCESS) {
        QNN_ERROR("device error");
        return StatusCode::FAILURE;
    }
    QnnHtpDevice_Infrastructure_t *htpInfra = static_cast<QnnHtpDevice_Infrastructure_t *>(deviceInfra);
    QnnHtpDevice_PerfInfrastructure_t perfInfra = htpInfra->perfInfra;

    Qnn_ErrorHandle_t perfInfraErr = perfInfra.destroyPowerConfigId(powerConfigId);
    if (perfInfraErr != QNN_SUCCESS) {
        QNN_ERROR("destroyPowerConfigId failed");
        return StatusCode::FAILURE;
    }
    return StatusCode::SUCCESS;
}

Apart from the above APIs, the user can use RPC polling and RPC control latency for better performance in high-performance modes.

RPC Polling and Latency Settings - The rpcPollingTimeConfig parameter can be used to request an RPC polling time in microseconds, and the rpcControlLatencyConfig parameter can be used to reduce CPU wake-up delays.

sample_app::StatusCode sample_app::QnnSampleApp::setRpcLatencyAndPolling() {
    QnnDevice_Infrastructure_t deviceInfra = nullptr;
    QnnInterface_t qnnInterface;  // assumed to be populated beforehand (e.g., via QnnInterface_getProviders)
    Qnn_ErrorHandle_t devErr = qnnInterface.QNN_INTERFACE_VER_NAME.deviceGetInfrastructure(&deviceInfra);
    if (devErr != QNN_SUCCESS) {
        QNN_ERROR("device error");
        return StatusCode::FAILURE;
      }
    QnnHtpDevice_Infrastructure_t *htpInfra = static_cast<QnnHtpDevice_Infrastructure_t *>(deviceInfra);
    QnnHtpDevice_PerfInfrastructure_t perfInfra = htpInfra->perfInfra;

    // set RPC Control Latency
    QnnHtpPerfInfrastructure_PowerConfig_t rpcControlLatency;            // refer QnnHtpPerfInfrastructure.h
    memset(&rpcControlLatency, 0, sizeof(rpcControlLatency));
    rpcControlLatency.option = QNN_HTP_PERF_INFRASTRUCTURE_POWER_CONFIGOPTION_RPC_CONTROL_LATENCY;
    rpcControlLatency.rpcControlLatencyConfig = 100;         // use rpc control latency recommended 100 us, refer hexagon sdk
    const QnnHtpPerfInfrastructure_PowerConfig_t *powerConfigs1[] = {&rpcControlLatency, NULL};

    Qnn_ErrorHandle_t perfInfraErr = perfInfra.setPowerConfig(powerConfigId, powerConfigs1);  // set RPC latency config on power config ID created
    if (perfInfraErr != QNN_SUCCESS) {
        QNN_ERROR("setPowerConfig failed");
        return StatusCode::FAILURE;
    }

    // set RPC Polling
    QnnHtpPerfInfrastructure_PowerConfig_t rpcPollingTime;   // refer QnnHtpPerfInfrastructure.h
    memset(&rpcPollingTime, 0, sizeof(rpcPollingTime));
    rpcPollingTime.option = QNN_HTP_PERF_INFRASTRUCTURE_POWER_CONFIGOPTION_RPC_POLLING_TIME;
    rpcPollingTime.rpcPollingTimeConfig = 9999;     // use rpc polling time recommended 0-10000 us
    const QnnHtpPerfInfrastructure_PowerConfig_t* powerConfigs2[] = {&rpcPollingTime, NULL};

    perfInfraErr = perfInfra.setPowerConfig(powerConfigId, powerConfigs2); // set RPC polling config on power config ID created
    if (perfInfraErr != QNN_SUCCESS) {
        QNN_ERROR("setPowerConfig failed");
        return StatusCode::FAILURE;
    }
    return StatusCode::SUCCESS;
}

Note

  1. RPC latency and polling are not supported on QNX platforms.

  2. For detailed information on all the above performance setting parameters, refer to the Hexagon SDK documentation.

When RPC polling is enabled, the user may further enable adaptive polling for better performance, especially for large models.

Adaptive Polling Time - The adaptivePollingTimeConfig parameter sets the minimum inference-time threshold that determines whether adaptive polling should be activated. Adaptive polling saves CPU power by skipping unnecessary RPC polling, and saves RPC time by waking the CPU just in time to poll for a very short period.

sample_app::StatusCode sample_app::QnnSampleApp::setAdaptivePollingTime() {
    QnnDevice_Infrastructure_t deviceInfra = nullptr;
    QnnInterface_t qnnInterface;  // assumed to be populated beforehand (e.g., via QnnInterface_getProviders)
    Qnn_ErrorHandle_t devErr = qnnInterface.QNN_INTERFACE_VER_NAME.deviceGetInfrastructure(&deviceInfra);
    if (devErr != QNN_SUCCESS) {
        QNN_ERROR("device error");
        return StatusCode::FAILURE;
      }
    QnnHtpDevice_Infrastructure_t *htpInfra = static_cast<QnnHtpDevice_Infrastructure_t *>(deviceInfra);
    QnnHtpDevice_PerfInfrastructure_t perfInfra = htpInfra->perfInfra;

    // set adaptive polling time
    QnnHtpPerfInfrastructure_PowerConfig_t adaptivePollingTime;  // refer to QnnHtpPerfInfrastructure.h
    memset(&adaptivePollingTime, 0, sizeof(adaptivePollingTime));
    adaptivePollingTime.option = QNN_HTP_PERF_INFRASTRUCTURE_POWER_CONFIGOPTION_ADAPTIVE_POLLING_TIME;
    adaptivePollingTime.adaptivePollingTimeConfig = 1000;
    const QnnHtpPerfInfrastructure_PowerConfig_t *powerConfigs[] = {&adaptivePollingTime, NULL};

    // set adaptive polling time config on power config ID created
    Qnn_ErrorHandle_t perfInfraErr = perfInfra.setPowerConfig(powerConfigId, powerConfigs);
    if (perfInfraErr != QNN_SUCCESS) {
        QNN_ERROR("setPowerConfig failed");
        return StatusCode::FAILURE;
    }
    return StatusCode::SUCCESS;
}

Note

  1. Adaptive Polling can only be activated if RPC Polling has already been enabled

  2. It is not recommended to enable adaptive polling for small models (e.g., < 1 ms inference time)
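The two notes above can be captured in a small hypothetical helper (not part of the QNN API) for deciding whether enabling adaptive polling makes sense:

```cpp
#include <cassert>

// Hypothetical gate for adaptive polling, following the notes above;
// not part of the QNN API. Adaptive polling requires RPC polling to be
// enabled, and is not recommended for small models (< 1 ms inference).
bool shouldEnableAdaptivePolling(bool rpcPollingEnabled, double inferenceTimeMs) {
    if (!rpcPollingEnabled)   return false; // note 1: requires RPC polling
    if (inferenceTimeMs < 1.0) return false; // note 2: skip for small models
    return true;
}
```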

These performance APIs can be used to boost performance. An example application that uses these APIs to improve graph execution performance is shown below.

#include <HTP/QnnHtpPerfInfrastructure.h>
#include <QnnInterface.h>
#include <HTP/QnnHtpDevice.h>

void example_application () {

    -----
    std::unique_ptr<sample_app::QnnSampleApp> app;
    -----
    -----

    app->createPowerConfigId();       // Create power config ID before voting
    app->setRpcLatencyAndPolling();   // Use RPC polling and latency for high performing modes
    app->setPowerConfig();            // Set the different configurations for performance settings

    -----
    app->executeGraphs();             // Execute the graphs
    -----

    app->destroyPowerConfigId();      // Destroy the power config id
    -----
    -----
}

The example app above is purely illustrative. Clients can use their own performance settings in these APIs according to their requirements.

HMX Power Settings

QnnHtpPerfInfrastructure.h allows setting HMX votes manually. To vote manually for both HVX and HMX, the user can send different power configurations (PowerConfigs) as shown below. The API design allows a single call to set all power parameters.

Qnn_ErrorHandle_t setPowerConfig(uint32_t powerConfigId, const QnnHtpPerfInfrastructure_PowerConfig_t** config);

//Usage
sample_app::StatusCode sample_app::QnnSampleApp::setPowerConfig() {
    QnnDevice_Infrastructure_t deviceInfra = nullptr;
    QnnInterface_t qnnInterface;  // assumed to be populated beforehand (e.g., via QnnInterface_getProviders)
    Qnn_ErrorHandle_t devErr = qnnInterface.QNN_INTERFACE_VER_NAME.deviceGetInfrastructure(&deviceInfra);
    if (devErr != QNN_SUCCESS) {
        QNN_ERROR("device error");
        return StatusCode::FAILURE;
    }
    QnnHtpDevice_Infrastructure_t *htpInfra = static_cast<QnnHtpDevice_Infrastructure_t *>(deviceInfra);
    QnnHtpDevice_PerfInfrastructure_t perfInfra = htpInfra->perfInfra;

    -------

    // Initialize the power config and select the voltage corner values for the performance settings
    QnnHtpPerfInfrastructure_PowerConfig_t powerConfig;
    memset(&powerConfig, 0, sizeof(powerConfig));

    powerConfig.option                     = QNN_HTP_PERF_INFRASTRUCTURE_POWER_CONFIGOPTION_DCVS_V3;
    powerConfig.dcvsV3Config.dcvsEnable    = 1; //1- To enable Dcvs and consider dcvs power mode, 0- To disable dcvs
    powerConfig.dcvsV3Config.setDcvsEnable = 1;
    powerConfig.dcvsV3Config.contextId     = powerConfigId;  //use the power config ID created

    // refer QnnHtpPerfInfrastructure.h
    powerConfig.dcvsV3Config.powerMode       = QNN_HTP_PERF_INFRASTRUCTURE_POWERMODE_PERFORMANCE_MODE;
    powerConfig.dcvsV3Config.setSleepLatency = 1; //True to consider Latency parameter
    powerConfig.dcvsV3Config.setBusParams    = 1; //True to consider Bus parameter
    powerConfig.dcvsV3Config.setCoreParams   = 1; //True to consider Core parameter
    powerConfig.dcvsV3Config.sleepDisable    = 1; //True to disable sleep, False to re-enable sleep
    powerConfig.dcvsV3Config.setSleepDisable = 1; //True to consider sleep disable/enable parameter

    //Set Sleep latency parameter
    powerConfig.dcvsV3Config.sleepLatency    =  40; // set dsp sleep latency ranges 10-65535 micro sec, refer hexagon sdk

    //set Bus Clock Parameters (refer QnnHtpPerfInfrastructure.h)
    powerConfig.dcvsV3Config.busVoltageCornerMin     = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;
    powerConfig.dcvsV3Config.busVoltageCornerTarget  = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;
    powerConfig.dcvsV3Config.busVoltageCornerMax     = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;

    //set Core Clock Parameters (refer QnnHtpPerfInfrastructure.h)
    powerConfig.dcvsV3Config.coreVoltageCornerMin    = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;
    powerConfig.dcvsV3Config.coreVoltageCornerTarget = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;
    powerConfig.dcvsV3Config.coreVoltageCornerMax    = DCVS_VOLTAGE_VCORNER_MAX_VOLTAGE_CORNER;

    --------

    QnnHtpPerfInfrastructure_PowerConfig_t powerConfigHMX;
    memset(&powerConfigHMX, 0, sizeof(powerConfigHMX));

    powerConfigHMX.option                     = QNN_HTP_PERF_INFRASTRUCTURE_POWER_CONFIGOPTION_HMX_V2;
    powerConfigHMX.hmxV2Config.hmxPickDefault = 0;                                          // 1- HMX vote scales with the DCVS corner, 0- HMX vote must be specified manually
    powerConfigHMX.hmxV2Config.hmxPerfMode    = QNN_HTP_PERF_INFRASTRUCTURE_CLK_PERF_HIGH;  //select max freq at target voltage corner, refer QnnHtpPerfInfrastructure.h

    //set HMX clock parameters (refer QnnHtpPerfInfrastructure.h)
    powerConfigHMX.hmxV2Config.hmxVoltageCornerMin    = DCVS_EXP_VCORNER_TUR;
    powerConfigHMX.hmxV2Config.hmxVoltageCornerTarget = DCVS_EXP_VCORNER_TUR;
    powerConfigHMX.hmxV2Config.hmxVoltageCornerMax    = DCVS_EXP_VCORNER_TUR;

    const QnnHtpPerfInfrastructure_PowerConfig_t *powerConfigs[] = {&powerConfig, &powerConfigHMX, NULL};
    Qnn_ErrorHandle_t perfInfraErr = perfInfra.setPowerConfig(powerConfigId, powerConfigs);
    if (perfInfraErr != QNN_SUCCESS) {
        QNN_ERROR("setPowerConfig failed");
        return StatusCode::FAILURE;
    }
    return StatusCode::SUCCESS;
}

Note

The API for HMX power settings has a few limitations, as listed below:

  1. Only supports Hexagon v75 and later architectures.

  2. To set the HMX vote, the client must first create a DcvsV3 context ID (powerConfig ID) by calling createPowerConfigId() and use this powerConfig ID to set the HVX vote. The client can then use the same powerConfig ID to request the HMX vote, either in the same call or a different call.

  3. No independent HMX vote is allowed through the QNN API; the client can only vote for HMX when there is an active HVX vote.

  4. If no HVX vote is detected for a powerConfig ID, the HMX vote will be denied with error INVALID_INPUT.

  5. If no HMX vote is provided for a powerConfig ID, the default HMX vote will be applied (see the table below).

  6. Once the client places an explicit HMX vote, it is the client’s responsibility to set hmxPickDefault and make another call to setPowerConfig() if default behavior is desired.

  7. HMX vote change or revert to default can be applied, provided the context ID has a valid HVX vote.

  8. When destroyPowerConfigId() is called with powerConfig ID, all votes associated with that context ID will be removed.

HVX          | HMX                      | HMX Pick Default | Operational Validity
-------------|--------------------------|------------------|---------------------
Vote Applied | Vote Applied             | No               | Valid
No Vote      | Vote Applied             | N/A              | Invalid
Vote Applied | No Vote                  | Yes              | Valid
Vote Removed | Vote Removed (Automatic) | N/A              | Valid
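As a cross-check, the vote-interaction rules from the notes and table above can be encoded as a small hypothetical helper (not part of the QNN API):

```cpp
#include <cassert>

// Hypothetical check of the HVX/HMX vote combinations described above;
// not part of the QNN API. An explicit HMX vote is only valid while an
// active HVX vote exists on the same powerConfig ID; with no explicit
// HMX vote, the default HMX vote applies, which is also valid.
bool isHmxVoteCombinationValid(bool hvxVoteActive, bool hmxVoteRequested) {
    if (hmxVoteRequested && !hvxVoteActive) return false; // denied with INVALID_INPUT
    return true; // explicit HMX with active HVX, or default HMX behavior
}
```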

Use Case Examples

Voting at Every Inference - This case depicts a simple use case of applying a certain performance setting (possibly a higher-performance configuration) before executing an inference request, followed by another performance setting (possibly a lower-performance configuration). The figure below shows the call flow of applying a performance setting at every inference.

[Figure: call flow for setting performance at every inference (htp_other_performance_profile.png)]

Sustain Setting for Multiple Inferences - This case depicts sustaining a performance setting (possibly a higher-performance configuration) across multiple inferences. This can be achieved using a system timer: the client can start a timer for a certain duration (longer than the expected time between successive inferences) after setting the performance vote (possibly higher performance). The vote is reset (possibly to lower performance) either when the timer expires or when the client requests a change to the performance settings. The figure below shows the call flow of sustaining a performance setting across multiple inferences.

[Figure: call flow for sustaining a performance setting across multiple inferences (htp_sustained_performance_profile.png)]

QNN HTP Precision

QNN HTP supports running graphs having a mix of floating-point and fixed-point data types.

QNN HTP can run float32 graphs using float16 math on select Qualcomm SoCs. The client is expected to set up the QNN graph with float32 tensors, and the QNN HTP accelerator will finalize and execute the QNN graph using float16 math.

Note

QNN_HTP_GRAPH_CONFIG_OPTION_PRECISION is deprecated starting from the 2.35 release. If you are using an SDK version >= 2.35, there is no need to set this option.

The QNN HTP backend converts user-provided float32 inputs in QnnGraph_execute() to float16 and executes the graph with float16 math. The final output is provided to the user as float32.

Note

Please note that float32 math is not supported by QNN HTP.

QNN HTP FP16 output difference between SM8550 and SM8650

The outputs of floating-point models on the HTP backend will differ slightly between SM8550 and SM8650. This may lead to a slight accuracy difference between the two, although neither is more accurate than the other. The difference is due to hardware changes that altered the associativity of some computations to achieve higher efficiency.

Note

This same point can also be found in 2.9.1 release notes on the “HTP Float16” slide.

QNN HTP Deep Learning Bandwidth Compression (DLBC)

Deep Learning Bandwidth Compression is a feature that allows inputs to be compressed so that processing bandwidth can be lowered. QNN HTP provides a configuration option for users to turn DLBC on or off through client code like the following:

QnnHtpGraph_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_GRAPH_CONFIG_OPTION_OPTIMIZATION;
customConfig.optimizationOption.type = QNN_HTP_GRAPH_OPTIMIZATION_TYPE_ENABLE_DLBC;
customConfig.optimizationOption.floatValue = 1.0; // set to 0 to turn off

QnnGraph_Config_t graphConfig;
graphConfig.option       = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
graphConfig.customConfig = &customConfig;

const QnnGraph_Config_t* pGraphConfig[] = {&graphConfig, NULL};

For offline preparation with DLBC, the backend-specific config file should specify the following option along with any other desired options:

{
   "graphs": [
       {
         "vtcm_mb": ...,
         "graph_names": [...],
         "dlbc": 1  // set to 1 to turn on
         ...
       }
   ],
   "devices": [
      {
         ...
         ...
      }
   ]
}

A value of 0 turns the feature off, and any floating-point value greater than or equal to 1.0 turns it on. By default, i.e. when the configuration option is not provided, DLBC is disabled.
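The on/off semantics above can be expressed as a small hypothetical helper (not part of the QNN API); values strictly between 0 and 1.0 are not specified by the documentation and are treated as off here:

```cpp
#include <cassert>

// Hypothetical mapping of the DLBC optimizationOption.floatValue to an
// enabled/disabled state, per the text above; not part of the QNN API.
// 0 turns the feature off; any value >= 1.0 turns it on.
bool isDlbcEnabled(float optimizationValue) {
    return optimizationValue >= 1.0f;
}
```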

DLBC also allows weight data to be compressed to lower processing bandwidth. QNN HTP provides a configuration option for clients to enable or disable DLBC weights.

QnnHtpGraph_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_GRAPH_CONFIG_OPTION_OPTIMIZATION;
customConfig.optimizationOption.type = QNN_HTP_GRAPH_OPTIMIZATION_TYPE_ENABLE_DLBC_WEIGHTS;
customConfig.optimizationOption.floatValue = 1.0; // set to 0 to turn off

QnnGraph_Config_t graphConfig;
graphConfig.option       = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
graphConfig.customConfig = &customConfig;

const QnnGraph_Config_t* pGraphConfig[] = {&graphConfig, NULL};

For offline preparation with DLBC, the backend-specific configuration should specify the dlbc_weights option along with any other options.

  • 0 – Disables DLBC weights; default when option is not provided

  • >= 1 – Enables DLBC weights

{
   "graphs": [
     {
       "vtcm_mb":...,
       "graph_names":[...],
       "dlbc_weights": 1      // set to 0 to turn off, dlbc for weights
       ...
     }
   ],
   "devices": [
      {
         ...
         ...
      }
   ]
}

Compression for inputs and weights can be independently set.

Limitations

  • Only supported for offline preparation.

  • Not supported with weight sharing.

  • Not supported with spillfill buffer sharing.

  • The number of graphs supported depends on whether compression is enabled for inputs and weights; graphs supported:

    • 32 – Input OR weight compression

    • 16 – Input AND weight compression

Note

The DLBC Weights with Weight Sharing feature is not supported. Starting with release 2.36, creating binaries with both DLBC Weights and Weight Sharing enabled will not be supported. Any binaries prepared before 2.36 with both features enabled will have DLBC WTS turned off regardless of the settings. To use DLBC WTS with such binaries, re-prepare without Weight Sharing.

QNN HTP - Setting Number of HVX Threads

This option allows the user to set the number of HVX threads for a particular graph. The inference time depends on the number of HVX threads utilized: the more threads used, the lower (i.e. faster) the execution time of a graph.

The number of HVX threads can be configured for both online and offline prepare cases. The value passed in the config during binary blob creation is what gets written into the serialized blob. The number of HVX threads can be re-configured by passing a new config to the QnnGraph_setConfig QNN API.

It is important to note that the number of threads cannot be configured or re-configured after the first execution of that particular graph; it must be done prior to it.

Users can set the custom option as such:

QnnHtpGraph_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_GRAPH_CONFIG_OPTION_NUM_HVX_THREADS;
customConfig.numHvxThreads = 3; // set a number. MAX = number of HVX HW blocks for that SoC

QnnGraph_Config_t graphConfig;
graphConfig.option       = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
graphConfig.customConfig = &customConfig;

const QnnGraph_Config_t* pGraphConfig[] = {&graphConfig, NULL};

The backend-specific config file should specify the following option along with any other desired options. In the case of offline prepare, if the "hvx_threads" option is not provided, a default value of 4 is written to the binary blob. In the case of online prepare, if a config does not set the number of HVX threads, the maximum supported value for that SoC is used during inference.

The config can be used to set the number of HVX threads as follows:

{
   "graphs": [
       {
         "vtcm_mb":...,
         "graph_names":[...],
         "hvx_threads":3     // set a number. MAX = number of HVX HW blocks for that SoC
         ...
       }
   ],
   "devices": [
      {
         ...
         ...
      }
   ]
}
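The defaulting behavior described above (an explicit config wins; otherwise 4 for offline prepare, the SoC maximum for online prepare) can be sketched as a hypothetical helper, not part of the QNN API:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical resolution of the effective HVX thread count, following the
// defaulting rules described above; not part of the QNN API. A configured
// value of 0 is treated here as "not provided".
uint32_t resolveHvxThreads(uint32_t configured, bool offlinePrepare, uint32_t socMax) {
    if (configured != 0) return configured; // explicit config wins
    return offlinePrepare ? 4u : socMax;    // offline default 4; online uses SoC max
}
```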

QNN HTP - Enabling the System Level Cache Allocator

This option allows the user to enable the use of the system level cache (SLC) allocator for a given graph. It helps by saving overall bandwidth for the use case.

Users can set the custom option as such:

QnnHtpGraph_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_GRAPH_CONFIG_OPTION_OPTIMIZATION;
customConfig.optimizationOption.type = QNN_HTP_GRAPH_OPTIMIZATION_TYPE_ENABLE_SLC_ALLOCATOR;
customConfig.optimizationOption.floatValue = 1;

QnnGraph_Config_t graphConfig;
graphConfig.option       = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
graphConfig.customConfig = &customConfig;

const QnnGraph_Config_t* pGraphConfig[] = {&graphConfig, NULL};

The feature is only supported on specific SoCs. By default, the option is turned off.

Config can be used to set the option as such:

{
   "graphs": [
       {
         "vtcm_mb":...,
         "graph_names":[...],
         "slc_alloc_enable":1
         ...
       }
   ],
   "devices": [
      {
       "soc_id": 69, //representing the soc
         ...
         ...
      }
   ]
}

Note

This option can be configured for offline prepare cases. It cannot be modified during inference.

However, it is possible to force the feature to be disabled during execution. Users can set the custom option as such:

QnnHtpGraph_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_GRAPH_CONFIG_OPTION_OPTIMIZATION;
customConfig.optimizationOption.type = QNN_HTP_GRAPH_OPTIMIZATION_TYPE_ENABLE_SLC_ALLOCATOR;
customConfig.optimizationOption.floatValue = 0;

QnnGraph_Config_t graphConfig;
graphConfig.option       = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
graphConfig.customConfig = &customConfig;

const QnnGraph_Config_t* pGraphConfig[] = {&graphConfig, NULL};

Note

To restore the previous state, call the same sequence with a value of 1.

QNN HTP Backend Extensions

The qnn-net-run utility is backend agnostic, meaning it uses only generic QNN APIs. The backend extension feature facilitates the use of backend-specific APIs, namely custom configurations. More documentation on backend extensions can be found under qnn-net-run. Note that the scope of QNN backend extensions is limited to qnn-net-run and qnn-context-binary-generator. HTP Backend Extensions is an interface for providing custom options to the HTP backend. It is also required to enable different performance modes. These options and performance modes can be exercised by providing an extension shared library, libQnnHtpNetRunExtensions.so, and a config file if necessary.

To use backend-extension-related parameters with qnn-net-run, use the --config_file argument and give the path to a JSON file.

$ qnn-net-run --model <qnn_model_name.so> \
              --backend <path_to_model_library>/libQnnHtp.so \
              --output_dir <output_dir_for_result> \
              --input_list <path_to_input_list.txt> \
              --config_file <path to JSON of backend extensions>

A config file with the minimum parameters needed to use backend extensions is shown below:

{
    "backend_extensions" :
    {
        "shared_library_path" : "path_to_shared_library",  // give path to shared extensions library (.so)
        "config_file_path" : "path_to_config_file"         // give path to backend config
    }
}

Users can set the custom options and different performance modes to HTP Backend through the backend config. The various options available in the config are shown below:

{
   "type": "object", "properties": {
     "graphs": {
         "type": "array", "items": {
           "type": "object", "properties": {

             // Corresponds to the graph name provided to QnnGraph_create
             // Used by qnn-net-run during online prepare and qnn-context-binary-generator uses it during offline preparation
             "graph_names": {"type": "array", "items": {"type": "string"}},

             // Provides performance infrastructure configuration options that are memory specific [optional]
             // Used by qnn-net-run during online prepare and qnn-context-binary-generator uses it during offline preparation
             // To use a device's maximum VTCM amount, set the value to 0 (QNN_HTP_GRAPH_CONFIG_OPTION_MAX)
             // and specify the target SoC through the device config.
             "vtcm_mb": {"type": "integer"},

             // Corresponds to the number of HVX threads to use for a particular graph during an inference.
              // Used by qnn-net-run during online prepare and by qnn-context-binary-generator during offline preparation
             "hvx_threads": {"type": "integer"},

             // Set Graph optimization value in range 1 to 3 [optional] [default: 2]
             // 1 = Faster preparation time, less optimal graph, 2 = Longer preparation time, more optimal graph
             // 3 = Longest preparation time, most likely even more optimal graph
              // Used by qnn-net-run during online prepare and by qnn-context-binary-generator during offline preparation
             "O": {"type": "number", "multipleOf": 1},

             // Provide deep learning bandwidth compression value 0 or 1 [optional] [default: 0]
              // Used by qnn-net-run during online prepare and by qnn-context-binary-generator during offline preparation
             "dlbc": {"type": "number", "multipleOf": 1},

             // Specifies whether to enable weights packing [optional] [default: false]
              // Used by qnn-net-run during online prepare and by qnn-context-binary-generator during offline preparation
             "weights_packing": {"type": "boolean"},

             // Specifies the number of cores the graph will use for execution [optional] [default: 1]
             // Used by qnn-context-binary-generator during offline preparation
             "num_cores": {"type": "integer"},

             // Specifies whether to configure short depth convolution for the graph [optional] [default: false]
              // Used by qnn-net-run during online prepare and by qnn-context-binary-generator during offline preparation
             "short_depth_conv_on_hmx_off": {"type": "boolean"},

             // Specifies whether to configure fold relu activation for the graph [optional] [default: false]
              // Used by qnn-net-run during online prepare and by qnn-context-binary-generator during offline preparation
             "fold_relu_activation_into_conv_off": {"type": "boolean"}
           }
         }
     },
     "devices": {
       "type": "array", "items": {
         "type": "object", "properties": {

           // Selection of the device [optional] [default: 0]
           // Used by qnn-net-run
           "device_id": {"type": "integer"},

           // Select the core [optional] [default: 0]
           // Used by qnn-net-run to select among the cores available in a device
           "core_id": {"type": "array", "items": {"type": "integer"}},

           // Select the available core type [optional] [default: 0]
           // Used by qnn-net-run, 0 - NSP, 1 - HPASS
           "core_type": {"type": "integer"},

           // Selection of the SoC [optional] [default: 0]
           // Used by qnn-net-run and qnn-context-binary-generator
           "soc_id": {"type": "integer"},

           // Selection of the SoC model [optional] [default: 0]
           // Used by qnn-net-run and qnn-context-binary-generator
           "soc_model": {"type": "integer"},

           // Set dsp architecture value [optional] [default: NONE]
           // Used by qnn-net-run and qnn-context-binary-generator
           "dsp_arch": {"type": "string"},

           // Specifies the user pd attribute [optional] [default: "unsigned"]
           // Used by qnn-net-run and qnn-context-binary-generator
           "pd_session": {"type": "string"},

            // Used for setting the profiling level to linting [optional] [default: not set]
           // Used by qnn-net-run and qnn-context-binary-generator
           "profiling_level": {"type": "string"},

           // Specifies whether to use null context or not. true means using a unique power context id, and false means using null context.
           // NOTE: This parameter is not supported for v68 onwards
           // Used by qnn-net-run
           "use_client_context": {"type": "boolean"},
           "cores": {
             "type": "array", "items": {
               "type": "object", "properties": {

                 // Provide performance profile [optional] [default: "high_performance"]
                 // Used by qnn-net-run
                 // Note: This perf profile will be overridden by any profiles specified via the command line option --perf-profile
                 "perf_profile": {"type": "string"},

                 // Rpc control latency value in micro second [optional] [default: 100us]
                 // Used by qnn-net-run
                 "rpc_control_latency": {"type": "integer"},

                 // Rpc polling time value in micro second [optional]
                 // [default: 9999 us for burst, high_performance & sustained_high_performance, 0 us for other perf profiles]
                 // Used by qnn-net-run
                 "rpc_polling_time": {"type": "integer"},

                 // Hmx timeout value in micro second [optional] [default: 300000us]
                 // Used by qnn-net-run
                 "hmx_timeout_us": {"type": "integer"},

                 // Adaptive polling time value in micro second [optional] [default: 0 us]
                 // Used by qnn-net-run
                 "adaptive_polling_time": {"type": "integer"}
               }
             }
           }
         }
       }
     },
     "context": {
       "type": "object", "properties": {

         // Used for enabling Weight Sharing [optional] [default: false]
         // Used by qnn-context-binary-generator during offline preparation
         "weight_sharing_enabled": {"type": "boolean"},

         // Used to associate max spill-fill buffer size across multiple contexts within a group [optional] [default: Not Set]
         // Used by qnn-net-run and qnn-throughput-net-run during offline preparation. group_id value must be set to 0 for this option to be used.
         "max_spill_fill_buffer_for_group": {"type": "integer"},

         // Specifies the group id to which contexts can be associated [optional] [default: None]
         // Used by qnn-net-run and qnn-throughput-net-run during offline preparation.
         "group_id": {"type": "integer"},

         // Used to set read memory budget size in Mb [optional] [default: 0]
         // Used by qnn-net-run and qnn-throughput-net-run when using a serialized binary for graph preparation.
          "file_read_memory_budget_in_mb": {"type": "integer"},

          // Used to enable I/O memory estimation [optional] [default: false]
          // Used by qnn-net-run and qnn-throughput-net-run when creating context from a serialized context binary.
          "io_memory_estimation": {"type": "boolean"},

          // Used to enable init acceleration [optional] [default: false]
          // Used by qnn-net-run and qnn-throughput-net-run when creating context from a serialized context binary.
          "init_acceleration": {"type": "boolean"},

         // Used for enabling Lora Weight Sharing [optional] [default: false]
         // Used by qnn-context-binary-generator during offline preparation
         "lora_weight_sharing": {"type": "boolean"}
       }
     },
     "groupContext": {
       "type": "object", "properties": {

         // Used to enable shared resources across different contexts [optional] [default: false]
         // Used by qnn-net-run and qnn-throughput-net-run when creating multiple contexts from a list of serialized context binaries.
         "share_resources": { "type": "boolean"}
       }
     },
     "memory": {
       "type": "object", "properties": {

          // Use multi-tensor shared buffers for input/output [optional] [default: QNN_HTP_MEM_UNDEFINED]; refer to QnnHtpMem_Type_t
         // Used by qnn-net-run and qnn-throughput-net-run
         "mem_type": {"type": "string", "enum": ["shared_buffer"]  }
       }
     }
   }
}
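
Because a mistyped key in the backend config is easy to miss, a quick sanity check against the key names in the schema above can help. This is a hand-rolled illustrative sketch, not an official validator; the key sets below are transcribed from the schema:

```python
# Minimal sanity check of an HTP backend config against the key names in the
# schema above. Illustrative only; covers top-level and device keys.
ALLOWED_TOP_LEVEL = {"graphs", "devices", "context", "groupContext", "memory"}
ALLOWED_DEVICE_KEYS = {"device_id", "core_id", "core_type", "soc_id",
                       "soc_model", "dsp_arch", "pd_session",
                       "profiling_level", "use_client_context", "cores"}

def check_htp_config(config):
    """Return a sorted list of unrecognized keys found in the config."""
    unknown = set(config) - ALLOWED_TOP_LEVEL
    for dev in config.get("devices", []):
        unknown |= set(dev) - ALLOWED_DEVICE_KEYS
    return sorted(unknown)

# A config with a typo'd key is flagged:
bad = {"devices": [{"profiling_level": "linting", "socid": 60}]}
print(check_htp_config(bad))  # → ['socid']
```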

Note

  1. The soc_id parameter will be deprecated. To set the SoC, use the soc_model parameter.

  2. Qnn_SocModel_t will be deprecated. To set the SoC model, refer to the Supported Snapdragon Devices.

  3. fp16_relaxed_precision is deprecated starting from the 2.35.0 release. Going forward, this parameter does not need to be set for floating-point functionality; it is determined based on SoC support.

Backend extension performance modes can be enabled using the perf_profile parameter in the backend config, as shown above. Valid settings are low_balanced, balanced, high_performance, sustained_high_performance, burst, low_power_saver, power_saver, high_power_saver, extreme_power_saver, and system_settings. Note that these performance modes are user selectable, and customers can choose to define their own performance modes according to their needs using QNN APIs.
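
As a small worked example of the defaults documented for rpc_polling_time in the schema above (9999 us for burst, high_performance, and sustained_high_performance; 0 us for the other perf profiles), a hypothetical helper capturing that rule:

```python
# Default RPC polling time per perf profile, per the documented default for
# the "rpc_polling_time" option above. Hypothetical helper, not an SDK API.
POLLING_PROFILES = {"burst", "high_performance", "sustained_high_performance"}

def default_rpc_polling_time(perf_profile):
    """Return the documented default rpc_polling_time (us) for a profile."""
    return 9999 if perf_profile in POLLING_PROFILES else 0

print(default_rpc_polling_time("burst"))        # → 9999
print(default_rpc_polling_time("power_saver"))  # → 0
```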

These performance modes use different configurations of core clocks, bus clocks, DCVS participation algorithms, and sleep latencies. There are three types of voltage corners, TURBO, NOM, and SVS, each of which has different voltage levels. In addition, the MAX and MIN voltage corners set the frequency to the maximum and minimum frequency supported on the target. For further details on performance mode configuration and parameters, refer to the Hexagon SDK documentation. The settings used by the performance modes defined above are shown in the table below:

| Performance mode | sleepLatency | dcvsEnable | RPC Polling | busVCornerMin | busVCornerTarget | busVCornerMax | coreVCornerMin | coreVCornerTarget | coreVCornerMax |
|---|---|---|---|---|---|---|---|---|---|
| BURST | 40 us | False | ON | MAX_VOLTAGE_CORNER | MAX_VOLTAGE_CORNER | MAX_VOLTAGE_CORNER | MAX_VOLTAGE_CORNER | MAX_VOLTAGE_CORNER | MAX_VOLTAGE_CORNER |
| SUSTAINED_HIGH_PERFORMANCE | 100 us | False | ON | TURBO | TURBO | TURBO | TURBO | TURBO | TURBO |
| HIGH_PERFORMANCE | 100 us | False | ON | TURBO | TURBO | TURBO | TURBO | TURBO | TURBO |
| BALANCED | 1000 us | False | OFF | NOM_PLUS | NOM_PLUS | NOM_PLUS | NOM_PLUS | NOM_PLUS | NOM_PLUS |
| LOW_BALANCED | 1000 us | False | OFF | NOM | NOM | NOM | NOM | NOM | NOM |
| HIGH_POWER_SAVER | 1000 us | False | OFF | SVS_PLUS | SVS_PLUS | SVS_PLUS | SVS_PLUS | SVS_PLUS | SVS_PLUS |
| POWER_SAVER | 1000 us | False | OFF | SVS | SVS | SVS | SVS | SVS | SVS |
| LOW_POWER_SAVER | 1000 us | False | OFF | SVS2 | SVS2 | SVS2 | SVS2 | SVS2 | SVS2 |
| EXTREME_POWER_SAVER | 1000 us | False | OFF | DISABLE | DISABLE | DISABLE | DISABLE | DISABLE | DISABLE |
| RELAXED_POWER_STATE* | 2000 us | True | OFF | SVS2 | SVS | SVS | SVS2 | SVS | SVS |
| RELEASED_POWER_STATE* | 65535 us | True | OFF | MIN_VOLTAGE_CORNER | MIN_VOLTAGE_CORNER | MIN_VOLTAGE_CORNER | MIN_VOLTAGE_CORNER | MIN_VOLTAGE_CORNER | MIN_VOLTAGE_CORNER |

Note

RELAXED_POWER_STATE and RELEASED_POWER_STATE are applied internally based on the performance profile to lower the votes. They are not user configurable.

The table above is ordered from highest performance (BURST) to lowest performance (EXTREME_POWER_SAVER). BURST and SUSTAINED_HIGH_PERFORMANCE use a timer during execution that keeps the vote high for all inferences and avoids repeated up-down perf votes until the timeout. They have low sleep latency, RPC polling enabled, and DCVS disabled during execution. Note that DCVS, if enabled, can both increase and decrease the core/bus clock speeds, with the min_corner and max_corner votes used as lower and upper limit thresholds for DCVS. BURST has the highest frequency and sustains high voting, which gives the best performance.

HIGH_PERFORMANCE mode, however, does not sustain votes across multiple inferences; instead it moves to the idle state RELAXED_POWER_STATE between inferences, which reduces CPU power consumption. POWER_SAVER, LOW_POWER_SAVER, and HIGH_POWER_SAVER have low frequencies and high sleep latencies, and move to the idle state RELEASED_POWER_STATE between inferences. EXTREME_POWER_SAVER is the lowest performing mode and saves the most power.

There are three stages to graph execution: INIT, INFERENCE, and DE-INIT. The performance mode defined above is applied to the graph before each of these stages. After each stage completes, the lower votes are applied, i.e. RELAXED_POWER_STATE or RELEASED_POWER_STATE, according to the performance mode selected by the user.

The config below can be used to set the HTP performance profile and RPC polling time:

{
   ...
   "devices": [{
         ...
         "cores":[{
                "perf_profile": "burst",    // use this to set any of the above performance profile
                "rpc_polling_time": 9999,    // use this to set rpc polling, ranges 0-9999 us
                "rpc_control_latency": 100  // use to set rpc control latency
            }]
      }]
}
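
A config like the one above can be assembled programmatically. The sketch below is an illustrative helper (not an SDK API) that also clamps rpc_polling_time to its documented 0-9999 us range:

```python
def make_core_perf_config(perf_profile="burst",
                          rpc_polling_time=9999,
                          rpc_control_latency=100):
    """Build one entry of the "cores" array shown in the config above.

    rpc_polling_time is clamped to the documented 0-9999 us range.
    """
    return {
        "perf_profile": perf_profile,
        "rpc_polling_time": max(0, min(9999, rpc_polling_time)),
        "rpc_control_latency": rpc_control_latency,
    }

# Assemble the full device/cores structure expected by the backend config:
cfg = {"devices": [{"cores": [make_core_perf_config("burst")]}]}
```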

QNN HTP Profiling

Basic Profiling

The basic profiling report for execution provides the graph inference summary on both the host and the accelerator.

The HTP Execute Basic Profiling Events diagram illustrates the basic HTP execute profiling events and how they are measured during the inference.

QNN HTP Execute Basic Profiling Events

DSP heap profiling is available for QnnContext_createFromBinary use cases to monitor total memory use. Currently, the total DSP heap usage metric can be retrieved for the following scenarios:

  • before any contexts are created (when creating the first context),

  • after all contexts are freed (when freeing the last context).

Enabling DSP heap usage profiling for a given context can be achieved by passing the following configuration to the relevant QnnContext_createFromBinary call:

QnnHtpContext_CustomConfig_t customConfig;
customConfig.option = QNN_HTP_CONTEXT_CONFIG_OPTION_DSP_MEMORY_PROFILING_ENABLED;
customConfig.dspMemoryProfilingEnabled = true; // set to false to disable DSP heap profiling

QnnContext_Config_t contextConfig;
contextConfig.option = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig.customConfig = &customConfig;

const QnnContext_Config_t* pDspMemProfilingContextConfig[] = {&contextConfig, NULL};

Total DSP heap usage before any contexts are created: If the aforementioned configuration is enabled for the first context to be created in the relevant QnnContext_createFromBinary call, the total DSP heap usage value can be retrieved from the DSP:before_context_created event defined in the Qnn_ProfileHandle_t instance, as shown below:

Qnn_ProfileHandle_t profileHandle;
QnnProfile_create(QNN_PROFILE_LEVEL_BASIC, &profileHandle);

QnnContext_createFromBinary(..., pDspMemProfilingContextConfig, ..., &contextHandle1, &profileHandle); // first context creation
QnnContext_createFromBinary(..., &contextHandle2, ...);
QnnContext_createFromBinary(..., pDspMemProfilingContextConfig, ..., &contextHandle3, ...);

const QnnProfile_EventId_t* events;
uint32_t numEvents;
QnnProfile_getEvents(profileHandle, &events, &numEvents);

for (uint32_t i = 0u; i < numEvents; ++i) {
   QnnProfile_EventData_t eventData;
   QnnProfile_getEventData(events[i], &eventData);
   if (strcmp(eventData.identifier, "DSP:before_context_created") == 0) {
         uint64_t totalDspHeapUsageBeforeContextCreated = eventData.value;
   }
}

Total DSP heap usage after all contexts are freed: If the aforementioned configuration was enabled for the last context to be freed either in the relevant QnnContext_createFromBinary call or later on with the use of QnnContext_setConfig, the total DSP heap usage value can be retrieved from the DSP:after_context_freed event defined in the Qnn_ProfileHandle_t instance, as shown below:

QnnContext_free(&contextHandle1, ...);
QnnContext_free(&contextHandle2, ...);
QnnContext_free(&contextHandle3, &profileHandle); // last context free, config was enabled for contextHandle3 during QnnContext_createFromBinary

const QnnProfile_EventId_t* events;
uint32_t numEvents;
QnnProfile_getEvents(profileHandle, &events, &numEvents);

for (uint32_t i = 0u; i < numEvents; ++i) {
   QnnProfile_EventData_t eventData;
   QnnProfile_getEventData(events[i], &eventData);
   if (strcmp(eventData.identifier, "DSP:after_context_freed") == 0) {
         uint64_t dspHeapUsageAfterContextFreed = eventData.value;
   }
}

Note

Note that if the configuration was not enabled for the given context, QnnContext_free will not output the DSP heap usage metric.

Note

The DSP heap profiling feature has the following requirements and limitations:

Requirements:

  1. Profiling should be enabled both for the first context to be created and the last context to be freed.

Limitations:

  1. Only supported on Android and QNX platforms.

  2. Enabling this feature might impact initialization and cleanup time.

Detailed and Linting Profiling

The detailed profiling report provides per-op profiling results as cycle counts instead of time in microseconds. There is no direct conversion from cycle counts to microseconds because of the parallelized execution of ops. It is therefore recommended to use the per-layer cycle counts as a reference for relative performance, i.e. to identify which ops take fewer or more cycles to finish execution.

The HTP-specific linting profiling report provides per-op cycle counts on the main thread along with background execution information. On the main thread, each op has to wait for some cycles between the end of the previous op and the start of its own execution. This wait period can be attributed to factors such as scheduling or waiting for background HVX or DMA activity to finish. In the linting profiling report, each op has a cycle count signifying the cycles actually spent executing the op on the main thread, and a “Wait” entry corresponding to the wait period described above.

Aside from these two cycle counts describing main-thread activity, each op has two more entries depicting background activity. The “Overlap” entry denotes the number of cycles spent on at least one background op while the op is executing on the main thread. The “Overlap (wait)” entry is similar to the “Wait” entry, except that the cycles reported correspond to the “Wait” period (i.e. cycles spent on at least one background op while the main thread was waiting). Background ops that are being waited on by main-thread ops are not considered background activity and as such do not contribute to the counts reported by the overlap entries. Each overlap entry is followed by several indented lines (10 maximum) naming the ops that contributed to the respective overlap cycle count. Finally, each op also has a “Resources” entry listing the different resources used by that op.

The HTP-specific linting profiling level can be enabled by specifying --profiling_level=backend when running qnn-net-run, so that the profiling level specified in the backend-specific config file is used. Refer to the documentation for qnn-net-run to learn more about libQnnHtpNetRunExtensions.so and backend-specific config files.
For linting profiling, the backend-specific config file should specify the following option along with any other desired option:

{
   ....
   "devices": [{
         ...
         "profiling_level": "linting",
         "cores": [{
             ...
         }]
   }]
}

The profile outputs generated with this profiling level can be viewed using the qnn-profile-viewer tool with its libQnnHtpProfilingReader.so or libQnnChrometraceProfilingReader.so reader plugin. The libQnnHtpProfilingReader.so reader provides the raw output of every single run, whereas libQnnChrometraceProfilingReader.so provides the average output over all runs. Additionally, a file containing the profiling data in chrometrace format can be generated by specifying an output file with the --output option when running the qnn-profile-viewer tool with the libQnnChrometraceProfilingReader.so reader plugin.
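
Once a chrometrace file has been produced, it can be post-processed with any tooling that understands the standard Chrome trace-event format. The sketch below ranks ops by total duration; the event fields used ("name", "ph", "dur") come from the generic trace-event format, and the trace contents here are synthetic stand-ins rather than real qnn-profile-viewer output:

```python
def top_ops_by_duration(trace, n=3):
    """Rank op names by accumulated duration from a chrometrace dict.

    Only complete events (phase "X") carry a "dur" field, so others
    are ignored.
    """
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    durations = {}
    for ev in events:
        if ev.get("ph") == "X":
            durations[ev["name"]] = durations.get(ev["name"], 0) + ev["dur"]
    return sorted(durations.items(), key=lambda kv: -kv[1])[:n]

# Synthetic trace standing in for a real qnn-profile-viewer output:
trace = {"traceEvents": [
    {"name": "model_sub_sub:OpId_57", "ph": "X", "ts": 0, "dur": 2165162},
    {"name": "model_add_add:OpId_58", "ph": "X", "ts": 10, "dur": 525971},
]}
print(top_ops_by_duration(trace, n=1))
```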

To retrieve linting information from an inference, the following steps are required:

  1. Set $QNN_SDK_ROOT to your desired QNN version

  2. Run “source $QNN_SDK_ROOT/bin/envsetup.sh”

  3. Push the required files to the device
    • $QNN_SDK_ROOT/lib/aarch64-android/libQnnHtpNetRunExtensions.so

    • backend_extension_config.json

    • htp_config.json

  4. Run inference on the device, making sure to add the following parameters: “--profiling_level=backend” and “--config_file=backend_extension_config.json”

  5. Pull the output logs to the Linux host

  6. When using qnn-profile-viewer, make sure to specify the following parameter: “--reader $QNN_SDK_ROOT/lib/x86_64-linux-clang/libQnnHtpProfilingReader.so”

  7. When generating a chrometrace file, make sure to specify the following parameter: “--output ./chrometrace.json”

backend_extension_config.json

{
    "backend_extensions": {
        "shared_library_path" : "./libQnnHtpNetRunExtensions.so",
        "config_file_path" : "./htp_config.json"
    }
}

htp_config.json

{"devices": [ {"profiling_level" : "linting"} ] }

Example Inference Command

./qnn-net-run --retrieve_context sample_model.bin --backend libQnnHtp.so --input_list target_raw_list.txt --config_file backend_extension_config.json --output_dir output_htp --profiling_level backend

Example Profile Viewer Command

$QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-profile-viewer --reader $QNN_SDK_ROOT/lib/x86_64-linux-clang/libQnnHtpProfilingReader.so --input_log ./output/qnn-profiling-data_0.log
$QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-profile-viewer --reader $QNN_SDK_ROOT/lib/x86_64-linux-clang/libQnnChrometraceProfilingReader.so --input_log ./output/qnn-profiling-data_0.log --output ./chromeTrace.json

The Showcase Model 1 diagram illustrates a model with two branches, each performing a couple of convolutions before their results are used in a sub operation.

QNN HTP Profiling Showcase Model 1

The linting profiling output for this model is given below:

Execute Stats (Average):
------------------------
Total Inference Time:
---------------------
   NetRun:  16792 us
   Backend (RPC (execute) time): 16242  us
   Backend (QNN accelerator (execute) time): 15190  us
   Backend (Num times yield occured): 0  count
   Backend (Time for initial VTCM acquire): 0  us
   Backend (Time for HVX + HMX power on and acquire): 0  us
   Backend (Accelerator (critical path execute) time (cycles)): 4327266  cycles
      Input OpId_2 (cycles): 0  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 0  cycles
            Overlap (wait) time: 0  cycles
            Resources:
      OpId_0 (cycles): 8036  cycles
            Wait (Scheduler) time: 629  cycles
            Overlap time: 4770  cycles
            Overlap (wait) time: 565  cycles
            Resources:
      model_convStart_Conv2D:OpId_21 (cycles): 147075  cycles
            Wait (Scheduler) time: 32  cycles
            Overlap time: 85292  cycles
               model_sub_sub:OpId_57
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 32  cycles
               model_convStart_Conv2D:OpId_21
            Resources: HVX, HMX, DMA
      model_tf_op_layer_stride_stride:OpId_24 (cycles): 146494  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 70807  cycles
               model_add_add:OpId_58
               Output OpId_3
               model_convStart_Conv2D:OpId_21
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 0  cycles
            Resources: HVX
      model_convLeft1_Conv2D:OpId_34 (cycles): 288249  cycles
            Wait (Scheduler) time: 425  cycles
            Overlap time: 195988  cycles
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 304  cycles
               Output OpId_3
               model_add_add:OpId_58
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_convRight1_Conv2D:OpId_41 (cycles): 220391  cycles
            Wait (Scheduler) time: 803  cycles
            Overlap time: 135268  cycles
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 557  cycles
               Output OpId_3
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_convRight2_Conv2D:OpId_48 (cycles): 181016  cycles
            Wait (Scheduler) time: 1090  cycles
            Overlap time: 69323  cycles
               model_sub_sub:OpId_57
               model_convStart_Conv2D:OpId_21
               Output OpId_3
               model_add_add:OpId_58
            Overlap (wait) time: 489  cycles
               model_sub_sub:OpId_57
               model_convStart_Conv2D:OpId_21
               Output OpId_3
               model_add_add:OpId_58
            Resources: HMX, DMA
      model_convLeft2_Conv2D:OpId_55 (cycles): 233736  cycles
            Wait (Scheduler) time: 1059  cycles
            Overlap time: 93020  cycles
               model_sub_sub:OpId_57
               model_convStart_Conv2D:OpId_21
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 464  cycles
               model_sub_sub:OpId_57
               model_convStart_Conv2D:OpId_21
               Output OpId_3
               model_add_add:OpId_58
            Resources: HMX, DMA
      model_sub_sub:OpId_57 (cycles): 2165162  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 465046  cycles
               model_sub_sub:OpId_57
               Output OpId_3
               model_add_add:OpId_58
               model_convStart_Conv2D:OpId_21
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 0  cycles
            Resources: HVX
      model_add_add:OpId_58 (cycles): 525971  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 481468  cycles
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
               Output OpId_3
               model_add_add:OpId_58
            Overlap (wait) time: 0  cycles
            Resources: HVX
      Output OpId_3 (cycles): 407091  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 115120  cycles
            Overlap (wait) time: 0  cycles
            Resources: HVX

The linting profiling chrometrace output for this model is given below:

QNN HTP Profiling Showcase Model 1 Chrometrace

From the output, it is evident that the sub op (OpId_57) is the most significant contributor to the total execution time, at around 50%. This op also does not have significant parallel op execution: its Overlap time is 465046 cycles, which is about 21.5% of its total execution time, indicating that this op is a good bottleneck to optimize. We can design an equivalent model, as shown in the Showcase Model 1 Optimized diagram, that merges the two branches and replaces the sub op with a convolution whose weights are manually designed so that it performs the same task as the sub op.
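
The percentages quoted above follow directly from the cycle counts in the report; a quick arithmetic check:

```python
# Bottleneck figures taken verbatim from the linting report above.
total_critical_path = 4327266      # accelerator critical-path execute cycles
sub_op_cycles = 2165162            # model_sub_sub:OpId_57
sub_op_overlap = 465046            # its Overlap time

share_of_total = sub_op_cycles / total_critical_path    # ≈ 0.50
overlap_share = sub_op_overlap / sub_op_cycles          # ≈ 0.215

print(f"sub op: {share_of_total:.0%} of critical path, "
      f"{overlap_share:.1%} overlapped")
# → sub op: 50% of critical path, 21.5% overlapped
```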

QNN HTP Profiling Showcase Model 1 Optimized

The linting profiling output for this optimized model is given below:

Execute Stats (Average):
------------------------
Total Inference Time:
---------------------
   NetRun:  11884 us
   Backend (RPC (execute) time): 11525  us
   Backend (QNN accelerator (execute) time): 10481  us
   Backend (Num times yield occured): 0  count
   Backend (Time for initial VTCM acquire): 0  us
   Backend (Time for HVX + HMX power on and acquire): 0  us
   Backend (Accelerator (critical path execute) time (cycles)): 1374349  cycles
      Input OpId_2 (cycles): 0  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 0  cycles
            Overlap (wait) time: 0  cycles
            Resources:
      OpId_0 (cycles): 3500  cycles
            Wait (Scheduler) time: 1284  cycles
            Overlap time: 3221  cycles
            Overlap (wait) time: 1268  cycles
            Resources:
      model_convStart_Conv2D:OpId_21 (cycles): 487448  cycles
            Wait (Scheduler) time: 32  cycles
            Overlap time: 475888  cycles
               Output OpId_3
               model_add_add:OpId_50
               model_tf_op_layer_stride_1_stride_1:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 32  cycles
               model_convStart_Conv2D:OpId_21
            Resources: HVX, HMX, DMA
      model_tf_op_layer_stride_1_stride_1:OpId_24 (cycles): 10422  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 10075  cycles
               model_convStart_Conv2D:OpId_21
               model_tf_op_layer_stride_1_stride_1:OpId_24
            Overlap (wait) time: 0  cycles
            Resources: HVX
      model_convCombined1_Conv2D:OpId_34 (cycles): 337711  cycles
            Wait (Scheduler) time: 82  cycles
            Overlap time: 307394  cycles
               Output OpId_3
               model_tf_op_layer_stride_1_stride_1:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 50  cycles
               Output OpId_3
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_convCombined2_Conv2D:OpId_41 (cycles): 295022  cycles
            Wait (Scheduler) time: 1184  cycles
            Overlap time: 286062  cycles
               model_add_add:OpId_50
               Output OpId_3
               model_convStart_Conv2D:OpId_21
               model_tf_op_layer_stride_1_stride_1:OpId_24
            Overlap (wait) time: 1140  cycles
               model_add_add:OpId_50
               Output OpId_3
               model_convStart_Conv2D:OpId_21
               model_tf_op_layer_stride_1_stride_1:OpId_24
            Resources: HMX, DMA
      model_subConv_Conv2D:OpId_48 (cycles): 48720  cycles
            Wait (Scheduler) time: 1186  cycles
            Overlap time: 46686  cycles
               model_add_add:OpId_50
               model_tf_op_layer_stride_1_stride_1:OpId_24
               Output OpId_3
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 1142  cycles
               model_add_add:OpId_50
               Output OpId_3
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_add_add:OpId_50 (cycles): 110698  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 108524  cycles
               model_add_add:OpId_50
               Output OpId_3
               model_convStart_Conv2D:OpId_21
               model_tf_op_layer_stride_1_stride_1:OpId_24
            Overlap (wait) time: 0  cycles
            Resources: HVX
      Output OpId_3 (cycles): 77054  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 75438  cycles
            Overlap (wait) time: 0  cycles
            Resources: HVX

The total execution time has decreased significantly as a result of removing the sub op. All the ops also have a significant amount of parallel op execution, as evidenced by their respective Overlap time numbers, indicating good optimization. The Showcase Model 2 diagram illustrates a model similar to the one in the Showcase Model 1 diagram; the difference is that a div op takes the place of the problematic sub op.

QNN HTP Profiling Showcase Model 2

The linting profiling output for this model is given below:

Execute Stats (Average):
------------------------
Total Inference Time:
---------------------
   NetRun:  19353 us
   Backend (RPC (execute) time): 18679  us
   Backend (QNN accelerator (execute) time): 17700  us
   Backend (Num times yield occured): 0  count
   Backend (Time for initial VTCM acquire): 0  us
   Backend (Time for HVX + HMX power on and acquire): 0  us
   Backend (Accelerator (critical path execute) time (cycles)): 7866535  cycles
      Input OpId_2 (cycles): 0  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 0  cycles
            Overlap (wait) time: 0  cycles
            Resources:
      OpId_0 (cycles): 8657  cycles
            Wait (Scheduler) time: 782  cycles
            Overlap time: 5155  cycles
            Overlap (wait) time: 717  cycles
            Resources:
      model_convStart_Conv2D:OpId_21 (cycles): 148293  cycles
            Wait (Scheduler) time: 34  cycles
            Overlap time: 86500  cycles
               model_tf_op_layer_RealDiv_RealDiv:OpId_57
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 34  cycles
               model_convStart_Conv2D:OpId_21
            Resources: HVX, HMX, DMA
      model_tf_op_layer_stride_stride:OpId_24 (cycles): 145084  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 70877  cycles
               model_convStart_Conv2D:OpId_21
               model_add_add:OpId_58
               Output OpId_3
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 0  cycles
            Resources: HVX
      model_convLeft1_Conv2D:OpId_34 (cycles): 285476  cycles
            Wait (Scheduler) time: 431  cycles
            Overlap time: 196212  cycles
               Output OpId_3
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 318  cycles
               Output OpId_3
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_convRight1_Conv2D:OpId_41 (cycles): 219298  cycles
            Wait (Scheduler) time: 804  cycles
            Overlap time: 134711  cycles
               Output OpId_3
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 558  cycles
               Output OpId_3
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_convRight2_Conv2D:OpId_48 (cycles): 181198  cycles
            Wait (Scheduler) time: 1083  cycles
            Overlap time: 68306  cycles
               model_tf_op_layer_RealDiv_RealDiv:OpId_57
               Output OpId_3
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 476  cycles
               model_tf_op_layer_RealDiv_RealDiv:OpId_57
               Output OpId_3
            Resources: HMX, DMA
      model_convLeft2_Conv2D:OpId_55 (cycles): 233731  cycles
            Wait (Scheduler) time: 1055  cycles
            Overlap time: 91960  cycles
               model_tf_op_layer_RealDiv_RealDiv:OpId_57
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 447  cycles
               model_tf_op_layer_RealDiv_RealDiv:OpId_57
               Output OpId_3
            Resources: HMX, DMA
      model_tf_op_layer_RealDiv_RealDiv:OpId_57 (cycles): 5344081  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 528123  cycles
               model_tf_op_layer_RealDiv_RealDiv:OpId_57
               Output OpId_3
               model_add_add:OpId_58
               model_convStart_Conv2D:OpId_21
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 0  cycles
            Resources: HVX
      model_add_add:OpId_58 (cycles): 525199  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 481084  cycles
               model_convStart_Conv2D:OpId_21
               model_tf_op_layer_stride_stride:OpId_24
               Output OpId_3
               model_add_add:OpId_58
            Overlap (wait) time: 0  cycles
            Resources: HVX
      Output OpId_3 (cycles): 771320  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 115729  cycles
            Overlap (wait) time: 0  cycles
            Resources: HVX

Again, the bottleneck for this graph can be identified by examining the main and background utilization of each op. In this case, the div op is the major contributor to the overall graph execution time, taking up 5344081 cycles - about 68% of the total execution time. Only about 10% of this op's execution has parallel background activity, which again indicates a good potential for performance gain through optimization. Replacing the div op with a mul op is a suggested optimization strategy found in the best practices guidelines. The linting profiler output for the graph optimized with a mul op instead of a div op is given below:

Execute Stats (Average):
------------------------
Total Inference Time:
---------------------
   NetRun:  15755 us
   Backend (RPC (execute) time): 15274  us
   Backend (QNN accelerator (execute) time): 14108  us
   Backend (Num times yield occured): 0  count
   Backend (Time for initial VTCM acquire): 0  us
   Backend (Time for HVX + HMX power on and acquire): 0  us
   Backend (Accelerator (critical path execute) time (cycles)): 2741387  cycles
      Input OpId_2 (cycles): 0  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 0  cycles
            Overlap (wait) time: 0  cycles
            Resources:
      OpId_0 (cycles): 8067  cycles
            Wait (Scheduler) time: 735  cycles
            Overlap time: 4781  cycles
            Overlap (wait) time: 669  cycles
            Resources:
      model_convStart_Conv2D:OpId_21 (cycles): 147478  cycles
            Wait (Scheduler) time: 32  cycles
            Overlap time: 86319  cycles
               model_multiply_mul:OpId_57
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 32  cycles
               model_convStart_Conv2D:OpId_21
            Resources: HVX, HMX, DMA
      model_tf_op_layer_stride_stride:OpId_24 (cycles): 145396  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 70208  cycles
               model_convStart_Conv2D:OpId_21
               model_add_add:OpId_58
               Output OpId_3
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 0  cycles
            Resources: HVX
      model_convLeft1_Conv2D:OpId_34 (cycles): 287130  cycles
            Wait (Scheduler) time: 430  cycles
            Overlap time: 198222  cycles
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 308  cycles
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_convRight1_Conv2D:OpId_41 (cycles): 219409  cycles
            Wait (Scheduler) time: 806  cycles
            Overlap time: 135286  cycles
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 558  cycles
               Output OpId_3
               model_tf_op_layer_stride_stride:OpId_24
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_convRight2_Conv2D:OpId_48 (cycles): 181465  cycles
            Wait (Scheduler) time: 1068  cycles
            Overlap time: 69160  cycles
               model_multiply_mul:OpId_57
               model_convStart_Conv2D:OpId_21
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 467  cycles
               model_multiply_mul:OpId_57
               model_convStart_Conv2D:OpId_21
               Output OpId_3
               model_add_add:OpId_58
            Resources: HMX, DMA
      model_convLeft2_Conv2D:OpId_55 (cycles): 233619  cycles
            Wait (Scheduler) time: 1055  cycles
            Overlap time: 92740  cycles
               model_multiply_mul:OpId_57
               model_convStart_Conv2D:OpId_21
               Output OpId_3
               model_add_add:OpId_58
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 445  cycles
               model_multiply_mul:OpId_57
               model_convStart_Conv2D:OpId_21
               Output OpId_3
               model_add_add:OpId_58
            Resources: HMX, DMA
      model_multiply_mul:OpId_57 (cycles): 737978  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 437784  cycles
               model_multiply_mul:OpId_57
               Output OpId_3
               model_add_add:OpId_58
               model_convStart_Conv2D:OpId_21
               model_tf_op_layer_stride_stride:OpId_24
            Overlap (wait) time: 0  cycles
            Resources: HVX
      model_add_add:OpId_58 (cycles): 527450  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 481714  cycles
               model_convStart_Conv2D:OpId_21
               model_tf_op_layer_stride_stride:OpId_24
               Output OpId_3
               model_add_add:OpId_58
            Overlap (wait) time: 0  cycles
            Resources: HVX
      Output OpId_3 (cycles): 249264  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 117890  cycles
            Overlap (wait) time: 0  cycles
            Resources: HVX

There is a noticeable reduction in the total graph execute time, and the ops also have better background utilization, indicating better optimization than before. Next, the Showcase Model 3 diagram illustrates a model that is similar to the one in the Showcase Model 1 Optimized diagram. The difference is that the ReLU ops have been replaced with PReLU ops.

Showcase Model 3

QNN HTP Profiling Showcase Model 3

The linting profiler output for this model is given below:

Execute Stats (Average):
------------------------
Total Inference Time:
---------------------
   NetRun:  15368 us
   Backend (RPC (execute) time): 15033  us
   Backend (QNN accelerator (execute) time): 13863  us
   Backend (Num times yield occured): 0  count
   Backend (Time for initial VTCM acquire): 0  us
   Backend (Time for HVX + HMX power on and acquire): 0  us
   Backend (Accelerator (critical path execute) time (cycles)): 2789467  cycles
      Input OpId_2 (cycles): 0  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 0  cycles
            Overlap (wait) time: 0  cycles
            Resources:
      OpId_0 (cycles): 3411  cycles
            Wait (Scheduler) time: 1226  cycles
            Overlap time: 3173  cycles
            Overlap (wait) time: 1194  cycles
            Resources:
      model_convStart_Conv2D:OpId_21 (cycles): 589431  cycles
            Wait (Scheduler) time: 957  cycles
            Overlap time: 41199  cycles
               Output OpId_3
               model_add_add:OpId_54
               model_preluCombined1_add:OpId_37
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 72  cycles
               Output OpId_3
               model_convStart_Conv2D:OpId_21
            Resources: HVX, HMX, DMA
      model_tf_op_layer_stride_1_stride_1:OpId_24 (cycles): 0  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 0  cycles
            Overlap (wait) time: 0  cycles
            Resources:
      model_convCombined1_Conv2D:OpId_34 (cycles): 165119  cycles
            Wait (Scheduler) time: 1089  cycles
            Overlap time: 155164  cycles
               model_preluCombined1_add:OpId_37
               Output OpId_3
               model_add_add:OpId_54
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 977  cycles
               model_preluCombined1_add:OpId_37
               Output OpId_3
               model_add_add:OpId_54
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_preluCombined1_add:OpId_37 (cycles): 27315  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 9431  cycles
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 0  cycles
            Resources: HVX
      model_convCombined2_Conv2D:OpId_43 (cycles): 805490  cycles
            Wait (Scheduler) time: 81  cycles
            Overlap time: 251743  cycles
               model_add_add:OpId_54
               Output OpId_3
               model_preluCombined1_add:OpId_37
               model_preluCombined2_add:OpId_46
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 62  cycles
               Output OpId_3
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_preluCombined2_add:OpId_46 (cycles): 0  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 0  cycles
            Overlap (wait) time: 0  cycles
            Resources: HVX
      model_subConv_Conv2D:OpId_52 (cycles): 666721  cycles
            Wait (Scheduler) time: 34  cycles
            Overlap time: 180805  cycles
               model_add_add:OpId_54
               Output OpId_3
               model_convStart_Conv2D:OpId_21
               model_preluCombined2_add:OpId_46
            Overlap (wait) time: 13  cycles
               model_convStart_Conv2D:OpId_21
            Resources: HMX, DMA
      model_add_add:OpId_54 (cycles): 62806  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 57481  cycles
               model_add_add:OpId_54
               Output OpId_3
               model_preluCombined1_add:OpId_37
               model_preluCombined2_add:OpId_46
               model_convStart_Conv2D:OpId_21
            Overlap (wait) time: 0  cycles
            Resources: HVX
      Output OpId_3 (cycles): 465781  cycles
            Wait (Scheduler) time: 0  cycles
            Overlap time: 430560  cycles
            Overlap (wait) time: 0  cycles
            Resources: HVX

The usual sign indicating bottlenecks is present here as well: there are multiple ops with little parallel execution. PReLU ops are among the background ops executing alongside these ops, and the best practices guidelines suggest replacing PReLU ops with ReLU ops. Making that change gives us the same model as the one shown in the Showcase Model 1 Optimized diagram, which, as explained before, is much better optimized.
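The bottleneck analysis used in these showcases boils down to two ratios: an op's share of the critical-path cycles, and the fraction of its own cycles that overlap with background work. A minimal sketch in Python (the helper name is illustrative, not part of any QNN tool), using the div op numbers from Showcase Model 2:

```python
def analyze_op(op_cycles: int, total_cycles: int, overlap_cycles: int):
    """Return an op's share of total execution and its overlapped fraction."""
    share = op_cycles / total_cycles
    overlap_frac = overlap_cycles / op_cycles if op_cycles else 0.0
    return share, overlap_frac

# Numbers taken from the Showcase Model 2 profile above (the div op).
share, overlap = analyze_op(op_cycles=5_344_081,
                            total_cycles=7_866_535,
                            overlap_cycles=528_123)
print(f"div op: {share:.0%} of execution, {overlap:.0%} overlapped")
# prints: div op: 68% of execution, 10% overlapped
```

A high share with a low overlapped fraction, as here, marks an op worth optimizing first.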

QNN HTP Optrace Profiling

Optrace High-level Operation

HTP Optrace Block Diagram illustrates the high-level operation of Optrace profiling.

HTP Optrace Block Diagram

HTP Optrace Block Diagram

HTP Optrace Tooling Call Flow Diagram illustrates the basic Optrace profiling events and how they are captured.

HTP Optrace Tooling Call Flow Diagram

HTP Optrace Tooling Call Flow Diagram

Detailed Command Line Usage Guide

To enable and use Optrace profiling, we need to specify additional parameters during model preparation, model execution, and post-processing. The following sections give a detailed usage guide for the command line parameters needed, with examples.

Converter

For the converter, you can optionally set these two parameters:

  • --export_format dlc

  • --enable_framework_trace

These options are specific to generating a DLC file, which makes the chrometrace aware of the source framework op names as well as the QNN Op types they get converted into. It is recommended that you enable these options for additional context whenever possible.

Preparation (Context Binary Generation)

For context binary generation, we require two additional parameters:

  • --profiling_level detailed

  • --profiling_option optrace

With these parameters, qnn-context-binary-generator outputs an extra file in its current working directory: the schematic binary.

Sample Command Line Below:

qnn-context-binary-generator --profiling_level detailed --profiling_option optrace --backend [SDK_PATH]/lib/x86_64-linux-clang/libQnnHtp.so --model [MODEL].so --config_file HtpConfigFile.json --output_dir [OUTPUT_DIR] --binary_file [CONTEXT_BIN]

The above command produces the following outputs:

  • [CONTEXT_BIN] - this is the context binary file containing the model

  • [MODEL]_schematic.bin - this is the schematic file (required for chrometrace generation)

Execution (Net Run)

For qnn-net-run, we require two additional parameters:

  • --profiling_level detailed

  • --profiling_option optrace

This will allow for profiling events to be embedded into the output profiling log.

Sample Command Line Below:

qnn-net-run --profiling_level detailed --profiling_option optrace --output_data_type float_and_native --retrieve_context [CONTEXT_BIN] --backend libQnnHtp.so --input_list ./inputs/input_list.txt --output_dir . --log_level info

The above command produces the following outputs:

  • qnn-profiling-data.log - this is the log data containing the optrace information that will be further parsed in the post process step (required for chrometrace generation)

  • The normal outputs of model execution

Post Process (Chrometrace Generation)

During the post-process phase, we now have the required data from the two steps above to generate a chrometrace.

  • [MODEL]_schematic.bin - this is the schematic we receive from the context binary generation step

  • qnn-profiling-data.log - this is the detailed profiling data we receive from the execution step

To generate the chrometrace, we run qnn-profile-viewer on the host with the libQnnHtpOptraceProfilingReader reader library.

Sample Command Line Below:

qnn-profile-viewer --config [PATH_TO]/config.json --reader [SDK_PATH]/lib/[TARGET]/libQnnHtpOptraceProfilingReader.so --input_log ./qnn-profiling-data.log --schematic ./[MODEL]_schematic.bin --output ./chrometrace.json

The config.json file provides parameters beyond the ones covered in the command line arguments, such as the following:

  • enable_input_output_flow_events - Adds flow events to the chrometrace.json, showing input-output dependencies between operations. Requires using the legacy UI to open chrometrace.

  • enable_sequencer_flow_events - Adds flow events to the chrometrace.json, showing ordering dependencies between operations, imposed by the sequencer. Requires using the legacy UI to open chrometrace.

  • htp_json - Dumps a [NAME]_htp.json file containing the topology and op-by-op information about the HTP graph. Default is on.

  • runtrace - Adds Runtrace execution and preemption events (if available) at the bottom of each core in the output chrometrace. Default is on.

  • memory_info - Adds memory bandwidth and allocation graphs (if available) at the bottom of each core in the output chrometrace. Default is on.

  • traceback - Adds trace back to source framework in the output chrometrace. Default is on.

  • qhas_schema - Dumps a qhas_schema.json that can be used to validate the QHAS json file. Default is off.

  • qhas_json - Dumps a [model]_qnn_htp_analysis_summary.json. Default is off.

Sample config.json below, with all available boolean parameters:

{
   "features":
   {
      "enable_input_output_flow_events": true,
      "enable_sequencer_flow_events": true,
      "htp_json": true,
      "runtrace": true,
      "memory_info": true,
      "traceback": true,
      "qhas_schema": true,
      "qhas_json": true
   }
}

The above command produces the following outputs:

  • chrometrace.json - the chrometrace output that can be opened with either the Perfetto Trace Visualizer or with chrome://tracing

  • chrometrace_qnn_htp_analysis_summary.html - the QHAS HTML report
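If you generate the configuration programmatically, the sample config.json above can be written with a few lines of Python (the flag set simply mirrors the sample; enable only the features you need):

```python
import json

# Feature flags from the sample optrace config above; all are booleans.
features = {
    "enable_input_output_flow_events": True,
    "enable_sequencer_flow_events": True,
    "htp_json": True,
    "runtrace": True,
    "memory_info": True,
    "traceback": True,
    "qhas_schema": True,
    "qhas_json": True,
}

# Write the config in the shape qnn-profile-viewer expects.
with open("config.json", "w") as f:
    json.dump({"features": features}, f, indent=3)
```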

Note

A number of ops in HTP get classified as “System Service”. These are not ops associated with any specific operation performed in the base neural network. Each System Service category is briefly explained below:

  • DramToTcm: Loads data from DRAM into VTCM.

  • TcmToDram: Writes data from VTCM into DRAM.

  • Sync Op: Used to provide an ordering for HVX ops to resolve dependencies between ops.

  • DmaCheckpointSet: A producer writing to memory for a future consumer to use sets this checkpoint when it is finished writing.

  • DmaCheckpointWait: A consumer that waits for DmaCheckpointSet.

  • BlockZapOp: When a tensor’s data is not a multiple of block-size, this operation “pads” the blocks with a specified zero-value.

  • SystemService: Not associated with any specific parent op, it prefetches chunks of data into L2 cache for ops to use.

If you want additional context such as framework names and QNN Op types, provide the DLC file with the following parameter:

  • --dlc ./[MODEL].dlc

This DLC file is generated within the converter as mentioned above.

Optionally, you can generate a chrometrace for a profile submodule: you specify two QNN node names, and the generated chrometrace represents only the subnetwork contained between the two. To use this, supply the following two parameters:

  • --zoom_start - this is the starting QNN node name for the submodule.

  • --zoom_end - this is the ending QNN node name for the submodule.

The parameters --zoom_start and --zoom_end can accept framework node names (e.g. ONNX op names) if the --dlc parameter is set as described earlier. When --dlc is set, the program automatically detects whether each provided name matches a framework name or a QNN name; no further context is required.

HTP Graph Topology and per-Op Information in Netron

By enabling the htp_json parameter in the config.json above, a [NAME]_htp.json file containing the topology and op-by-op information about the HTP graph will be dumped. This JSON file can be viewed directly in Netron. This feature can be used in conjunction with the node zooming feature above.

HTP Graph Topology in Netron demonstrates viewing [NAME]_htp.json in Netron.

HTP Graph Topology in Netron

HTP Graph Topology in Netron

HTP Graph per-Op Information in Netron demonstrates the ability to click on an HTP node and view Op properties.

HTP Graph per-Op Information in Netron

HTP Graph per-Op Information in Netron

Memory Bandwidth and Allocation graphs

By enabling the memory_info parameter in the config.json above, memory bandwidth and allocation graphs (if available) will be displayed at the bottom of each core in the output chrometrace. The graphs below titled “VTCM”/”DRAM” “read”/”write” display instantaneous bandwidth, while the “VTCM alloc” graphs display current memory allocation at that point.

HTP Optrace Multicore Memory Graphs demonstrates the per-core memory bandwidth and allocation graphs.

HTP Optrace Multicore Memory Graphs

HTP Optrace Multicore Memory Graphs

Gzip Compression of Chrometrace

This feature can automatically compress output chrometraces into the gzip format, saving a substantial amount of disk space.

On a sample model, the size of the chrometrace output was ~1300KB before compression and ~60KB after compression (ratio of 0.05).

There are 3 ways to enable gzip compression:

  • Append .gz to the output filename command line argument for qnn-profile-viewer

  • Add the --gzip command line flag for qnn-profile-viewer

  • Enable the gzip parameter in the optrace config.json file

Gzip compression will be turned on if any one of these options is set.
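The favorable compression ratios come from how repetitive chrometrace JSON is; this can be reproduced with Python's standard gzip module (the payload below is synthetic, not real optrace output):

```python
import gzip
import json

# Build a repetitive JSON payload that resembles a chrometrace event list.
events = [{"name": "model_add_add:OpId_58", "ph": "X", "ts": i, "dur": 5}
          for i in range(10_000)]
raw = json.dumps({"traceEvents": events}).encode("utf-8")

# gzip.compress round-trips losslessly via gzip.decompress.
compressed = gzip.compress(raw)
print(f"{len(raw)} -> {len(compressed)} bytes "
      f"(ratio {len(compressed) / len(raw):.2f})")
```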

Note

Command line parameters will be subject to change once multi-graph support is enabled.

Runtrace Execution and Preemption Events

By enabling the runtrace parameter in the config.json above, Runtrace execution and preemption events (if available) will be displayed at the bottom of each core in the output chrometrace.

The graphs below titled “<GRAPH_NAME> Runtraces” show the duration of the entire inference as well as other information such as the time it took to acquire physical resources (execution events). The graphs below titled “<GRAPH_NAME> Yields” show the time it took to save, re-acquire, and restore VTCM memory when yielding to a higher priority thread (preemption events).

HTP Optrace Runtrace Graph demonstrates the Runtrace graph execution events for one graph.

HTP Optrace Runtrace Graph

HTP Optrace Runtrace Graph

QNN HTP Analysis Summary (QHAS)

From the steps above for running optrace profiling, we can see that a QHAS HTML Report is generated by qnn-profile-viewer as part of the existing flow (no extra parameters required). Unlike the chrometrace which visually depicts the data from the HTP ops, the QHAS HTML Report summarizes this data into a report with analysis.

QHAS HTML Report Example illustrates the layout of the QHAS Report with the “HTP Overall Summary” section expanded. The other sections will expand once clicked.

QHAS HTML Report Example

QHAS HTML Report Example

Additionally, a column within a section can be sorted by clicking on the sort icon. A pie chart can also be generated for a column, illustrating the value of each row relative to the column total, by clicking on the chart icon.

In the QHAS HTML Report Sorting and Plotting Example, we are sorting the “Cycles” column under the “QNN Op Types” section in descending order, and are displaying a pie chart for it. This pie chart visually represents the fraction of the total cycle count that each op type uses.

QHAS HTML Report Sorting and Plotting Example

QHAS HTML Report Sorting and Plotting Example

In the QHAS HTML Report Filtering Example, we are filtering the columns under “HTP Op” and “QNN Op” by the keywords “conv” and “batch”, respectively. Only rows that contain the filter keyword will be shown. Filter keywords in multiple columns can be set simultaneously.

QHAS HTML Report Filtering Example

QHAS HTML Report Filtering Example

Note

The “Dominant Path” section of QHAS shows the highest-priority HTP op at each point in the timeline. The priority order is as follows:

  • HMX Op

  • HVX Op

  • Ops performing DMA Reads

  • Ops performing DMA Writes

  • DmaCheckpointSet and DmaCheckpointWait (as explained in System Services above)

  • SyncOp (as explained in System Services above)

Note

If you enable the profile submodule feature above, QHAS will also only show an analysis for the nodes contained within the subnetwork.

Note

The QHAS feature is still in Beta, so it is subject to change in future SDK versions.

QNN Context Binary size

The QNN Context Binary is used by QNN to execute the neural network. After graph preparation, the ‘QNN Context Binary’ contains the information and optimizations needed for faster inference of the model. The ‘QNN Context Binary’ is larger than the QNN model for the following reasons:

  • Number of Operations: HTP tries to run as many operations as possible in parallel. To fit into the VTCM, heavy operations are split into smaller operations. This often increases the number of operations that must be present in the Context Binary, which in turn increases its size. For example, if each op takes 40 bytes of the Context Binary, then 30 operations require 1.2 KB while 300,000 operations require 12 MB. The figure below shows a large Conv operation which needs to be broken up into smaller operations.

    ../../_static/resources/multi_thread_and_multi_batch/large_conv_op.jpg
  • Sequencing and Data Paging: As the number of operations increases, the Context Binary must also store information about the sequence of operations and about data paging (which data needs to be written to DDR and which needs to be brought back into VTCM during execution). This information also contributes to the size of the prepared Context Binary.

  • Constant Data in Graph: The QNN Context Binary contains all constant data of the graph. The constant data consists of, among other things, convolution filters. These filters could be padded in the QNN context binary to represent internal HTP format for performance efficiency, causing additional size increase for the QNN context binary.
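The size estimate from the first bullet above can be checked with simple arithmetic (the 40-byte per-op figure is the illustrative number from the example, not a fixed constant of the binary format):

```python
BYTES_PER_OP = 40  # illustrative per-op footprint from the example above

def context_binary_op_bytes(num_ops: int) -> int:
    """Approximate per-op metadata size contributed to the context binary."""
    return num_ops * BYTES_PER_OP

print(context_binary_op_bytes(30))       # 1200 bytes, i.e. 1.2 KB
print(context_binary_op_bytes(300_000))  # 12000000 bytes, i.e. 12 MB
```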

Recommendations for Network Design

The HTP supports A8, A16, and FP16 activations. Generally, the accuracy and the power and energy requirements follow the order A8 < A16 < FP16. Therefore, to minimize power, one should first try the A8 mode and check the accuracy of the results. If the accuracy is not sufficient, try A16 mode; if even that does not achieve the desired accuracy, move to FP16 mode.

The following sections cover some of the best practices in graph design that allows for the optimal use of HTP hardware from a performance and accuracy perspective.

It is recommended to always use symmetric quantization of weights when quantizing the model to obtain the best accuracy on HTP-based targets. Activation data is recommended to be quantized asymmetrically.

It is recommended to use quantization-aware training as much as possible to improve the accuracy of models, especially in the case of high-resolution image transformation models. Using quantization-aware training may remove the need for 16-bit activations and allow the use of 8-bit activations, which improves both performance and power. When using quantization-aware training, keep in mind the following:

  1. Comparing, in the original framework, the outputs of the original float model against those of the model with fakequant nodes helps determine the quality of the quantization-aware-trained model

  2. Ensure there are fakequant nodes for all layers/kernels

A16W16 (INT16 weights along with UINT16 activations) is supported for several Convolution-type operations. This is commonly used in image enhancement networks, but it is available to other types of use cases as well.

This feature is enabled only on select SoCs. To further enhance accuracy, a per-channel quantization method will be added in the future.

Expectations when comparing A16W16 models with A16W8 and FP16 models are as follows:
  1. A16W16 models are expected to achieve better accuracy than A16W8 models with Post-Training quantization.

  2. A16W16 models are expected to achieve better power efficiency than FP16 models while maintaining a similar accuracy result.

List of Convolution-type operations supported for A16W16:
  1. Conv2d

  2. DepthConv2d

  3. TransposeConv2D

  4. FullyConnected

  5. Matmul

  6. Batchnorm

  7. LayerNorm

All weights/filters need to be symmetrically quantized. For Matmul, input A must be asymmetrically quantized and input B must be symmetrically quantized. Please refer to the HTP Backend Op Definition Supplement for details.

Limitation: Due to a Hexagon hardware limitation, INT16 weights are restricted to the range -0x8000 to 0x7F7F instead of the full 16-bit range -0x8000 to 0x7FFF. Please add --restrict_quantization_steps “-0x8000 0x7F7F” to the quantizer options when using A16W16.
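The restricted range can be expressed as a simple clamp applied to quantized weight values; a sketch (the helper is illustrative, not part of the QNN quantizer):

```python
# Hexagon restriction: INT16 weights use [-0x8000, 0x7F7F] rather than the
# full INT16 range [-0x8000, 0x7FFF].
W_MIN, W_MAX = -0x8000, 0x7F7F

def clamp_int16_weight(w: int) -> int:
    """Clamp a quantized weight value into the HTP-supported INT16 range."""
    return max(W_MIN, min(W_MAX, w))

print(clamp_int16_weight(0x7FFF))   # 32639 (0x7F7F)
print(clamp_int16_weight(-0x9000))  # -32768 (-0x8000)
```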

Yielding and Pre-Emption

Yielding and pre-emption is a cooperative, user-based implementation of context switching. The following document aims to help the user understand different concurrency scenarios and their expected behavior.

Byte-Granularity VTCM

Starting with hexagon-v81, QNN HTP supports byte-granularity VTCM size requests, allowing the amount of VTCM reserved and utilized by a graph to be specified in bytes (previously only MB-granularity requests were supported). This flexibility enables more efficient use of VTCM Sharing, since overmaps can be specified precisely to the requirements of the secondary entity, thereby maximizing the VTCM available to QNN graphs.

The byte-granularity VTCM size request can be configured in the QNN HTP Backend Extensions via the new parameter "vtcm_size", similar to the existing "vtcm_mb". The following example configuration illustrates the use of the new parameter:

{
"graphs": [
    {
        "vtcm_size": 8323072,
        "graph_names": [...],
        ...
    }
],
"devices": [{ ... }]
}

Note

The new byte-granularity and existing MB-granularity APIs are both supported simultaneously, with neither taking precedence. Multiple VTCM size requests made through the different APIs behave similarly to multiple requests made through the existing API.

Warning

While the API supports byte-granularity, the minimum granularity may have higher hardware-specific constraints. Presently, requests need to be a multiple of 64KB, although this requirement may be relaxed in the future.

VTCM Sharing

Starting with hexagon-v73, it is possible for other threads in the same process to share VTCM resources with QNN HTP using the procedure described in the following pages:

SubSystem Restart (SSR)

A QNN HTP backend-specific feature that allows the cDSP subsystem to automatically restart an invalidated connection after a crash. Details are provided in the following page:

Qmem Graph (shared_buffer only graph)

A QNN HTP backend-specific feature that allows users to employ data buffers for shared access between processing domains in the QNN HTP backend. Using shared buffers can eliminate data copies between client code on the host CPU and the HTP accelerator.

HTP Session & Artifact Usage Guidelines

Supported Library Use

QNN only supports one set of libraries per-process. These libraries must be of the same SDK version and have matching Hexagon architecture versions. For illustrative purposes, V73 Hexagon architecture libraries will be used in the diagrams; the same guidelines apply for non-V73 artifacts as well. The supported library layout for QNN is displayed below.

../../_static/resources/qnn_htp_supported_library_usage.png

To prevent incorrect QNN library layouts, Qualcomm recommends the following:

  • One copy of each library should be present for a single process (backend, stub, skel, etc).

  • The backend library (libQnnHtp.so) should be explicitly loaded with dlopen rather than being dynamically linked as a dependency.

  • During the loading of Stub.so and Prepare.so, QNN first searches for them in the directory of libQnnHtp.so. If they are not found there, it searches LD_LIBRARY_PATH.

  • Libraries should be in the same directory as one another (the skel is an exception to this so long as the ADSP_LIBRARY_PATH is correctly set to find the library).

  • Do not rename libraries to load multiple copies as this is not supported.
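The guidelines above can be sketched as a shell layout. This is a hypothetical example: the directory names are placeholders, and the library files are mocked with empty files purely to illustrate where each artifact goes:

```shell
# Hypothetical on-device layout following the guidelines above.
# APP_DIR is an example path, not a requirement.
APP_DIR=/tmp/qnn_demo
mkdir -p "$APP_DIR/hexagon-v73"

# One copy of each library, all in the same directory
# (mocked here with empty files for illustration):
touch "$APP_DIR/libQnnHtp.so" "$APP_DIR/libQnnHtpV73Stub.so"

# The skel is the exception: it may live in a separate directory as long
# as ADSP_LIBRARY_PATH is set so it can be found.
touch "$APP_DIR/hexagon-v73/libQnnHtpV73Skel.so"
export ADSP_LIBRARY_PATH="$APP_DIR/hexagon-v73"
export LD_LIBRARY_PATH="$APP_DIR:$LD_LIBRARY_PATH"
echo "layout ready under $APP_DIR"
```

The application would then dlopen "$APP_DIR/libQnnHtp.so" explicitly rather than linking it as a dependency.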

Unsupported Library Use

QNN does not support multiple copies of the QNN backend library (libQnnHtp.so) being accessed in a single process. Two different layouts are depicted below where multiple backend libraries are present on device.

../../_static/resources/qnn_htp_unsupported_library_usage.png

For two backend libraries to be loaded, either a second copy of the library is explicitly loaded from a directory separate from the first library's location, or a duplicate filesystem is created during process execution (adb remount for Android targets). In either case, the two artifact layouts shown above are not supported. If these layouts are detected at runtime, QNN_COMMON_ERROR_LOADING_BINARIES will be returned from the following APIs:

  • QnnBackend_registerOpPackage

  • QnnBackend_validateOpConfig

  • QnnContext_create

  • QnnDevice_create

  • QnnGraph_create

Graph Switching (Beta)

Note

This feature is currently in experimental beta release, any proposed method of usage and behavior may change in future releases.

This feature is intended only for clients that require a further reduction in RAM at the cost of execution speed. The feature currently has the limitations stated below:

  • This feature does not support concurrent graph execution between the switching graphs.

  • This feature cannot be used together with the QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES config.

  • Memory is reduced at the cost of slower first execution after graph switching.

This feature lazily loads models when they need to execute. Multiple graphs can be enabled but left in an unloaded state, with only one graph (e.g., graph1) kept fully loaded and ready to execute in a low-memory mode, reducing sustained memory usage. When an unloaded graph (e.g., graph2) needs to execute, graph switching (lazily unloading graph1 and loading graph2) takes place automatically if this feature is enabled. The amount of memory saved depends roughly on the total size of the models in the unloaded state.

To enable this feature in code: (Beta)
QnnContext_Config_t* isPersistentBinaryConfig = new QnnContext_Config_t;
isPersistentBinaryConfig->option = QnnContext_ConfigOption_t::QNN_CONTEXT_CONFIG_PERSISTENT_BINARY;
isPersistentBinaryConfig->isPersistentBinary = 1;

QnnContext_Config_t* memoryLimitHintConfig = new QnnContext_Config_t;
memoryLimitHintConfig->option = QnnContext_ConfigOption_t::QNN_CONTEXT_CONFIG_MEMORY_LIMIT_HINT;
memoryLimitHintConfig->memoryLimitHintConfig = 1;   // any non-zero value

QnnContext_Config_t* cfgs[] = {isPersistentBinaryConfig, memoryLimitHintConfig, NULL};
QnnContext_createFromBinary(..., cfgs, ..., &contextHandle, ...);
Example in backend extension config: (Beta)
{
  "backend_extensions" : {...},
  "context_configs" :
    {
      "is_persistent_binary" : boolean_value,
      "spill_fill_buffer" : int64_value,
      "weights_buffer" : int64_value,
      "memory_limit_hint" : uint64_value,
      "enable_graphs" : ["<graph_name_1>", "<graph_name_2>", ...]
    },
  "graph_configs" : [{...}]
}
  • is_persistent_binary: default false; required to be true to use the graph switching feature.

  • spill_fill_buffer: This field places the spill-fill data in a buffer shared between the client and backend. The default is 0.

    • A value greater than 0 allocates the exact spill-fill size given by the user.

    • A value of 0 disables this feature.

    • A value of -1 allocates the spill-fill size determined by QNN.

  • weights_buffer: This field places the weights in a buffer shared between the client and backend. The default is 0.

    • A value greater than 0 allocates the exact buffer size given by the user.

    • A value of 0 disables this feature.

    • A value of -1 allocates the weight size determined by QNN.

  • memory_limit_hint: default 0; set to any non-zero value to enter low-memory mode with graph switching enabled.

  • enable_graphs (optional): the names of the graphs that will be enabled. When the graph switching feature is enabled, only the first graph in the enable_graphs list is loaded. When this field is omitted in graph switching mode, all graphs in the serialized binary are enabled, and only the first graph in the serialized binary is loaded. Users can strategically order the graphs in this field to control which graph is loaded first.

Note

  • When is_persistent_binary is enabled, it is advised that the client use mmap to map the binary file passed to the QnnContext_createFromBinary API call. Once the API call has finished, it is also advised that the client use techniques such as madvise() to free up some of the memory held by the mapping. Otherwise, the client may experience high sustained memory usage due to holding onto the persistent binary. For the same reason, a non-mmapped implementation strategy is not recommended.

  • The client is responsible for keeping this mmapped buffer alive for the lifetime of the context so that graph reloading can happen. The client must not unmap the fd until QnnContext_free is invoked. Freeing/unmapping the buffer prematurely will result in undefined behavior.

  • The client is also responsible for freeing the mmapped buffer when destroying the context; otherwise a memory leak may occur.

  • For the memory_limit_hint, any non-zero value will enable graph switching. Values greater than zero only indicate the low-memory mode; the specific memory_limit_hint value does not affect the graph switching behaviour.

  • Client is responsible for creating and freeing buffer used to store spill fill buffer and weights buffer.

Multi-Graph Switching (Beta)

The Multi-Graph Switching feature enhances the existing graph switching, allowing multiple graphs to be loaded and retained in memory at once. It improves the execution speed compared to traditional graph switching, which unloads the current graph before loading a new one. This feature balances memory usage and execution efficiency.

To enable Multi-Graph Switching, users must configure “graph_retention_order”, an ordered list of graph names that defines their retention priority. Along with this, users must set memory_limit_hint > 0 and is_persistent_binary = true, as in traditional graph switching.

Example in backend extension config: (Beta)
{
  "backend_extensions" : {...},
  "context_configs" :
    {
      "is_persistent_binary" : boolean_value,
      "memory_limit_hint" : uint64_value,
      "enable_graphs" : ["<graph_name_1>", "<graph_name_2>", ...],
      "graph_retention_order": ["<graph_name_1>", "<graph_name_2>", ...]
    },
  "graph_configs" : [{...}]
}
  • graph_retention_order: This field defines an ordered list of graph names to be retained in memory. Graphs earlier in the list have higher retention priority. When the Multi-Graph Switching feature is enabled, all of the graphs in this list are preloaded, provided they all fit into the current PD’s virtual address space. Any graph not in this list is lazily loaded at execution. Additionally, if memory is limited, the retained graphs may be unloaded, starting with the least important graphs (those lower in the retention list). If graph_retention_order is configured without memory_limit_hint, the retention feature is disabled. If other graph switching configurations are set and this field is left empty, traditional graph switching is triggered.

  • is_persistent_binary: This field is required to be set to true for using the Multi-Graph Switching feature. The default is false.

  • memory_limit_hint: This field can be set to any non-zero value to enter low memory mode to enable multi-graph switching. The default is 0.

  • enable_graphs (optional): This field sets the names of the graphs that will be enabled. For Multi-Graph Switching, the graph names in the retention list should be in the enabled state, as these are loaded during deserialization. When this field is omitted in graph switching mode, all graphs in the serialized binary are enabled, but only the graphs in the retention list are loaded.

Example: Multi-Graph Switching with Retention Order

Assume a binary contains four graphs: A, B, C, and D - each of equal size. graph_retention_order: [“A”, “B”], memory_limit_hint > 0 and is_persistent_binary = true. Additionally, consider only two graphs can fit at a time in the PD limit.

Case 1: Execution order A, B, A, B

In multi-graph switching, all graphs from the retention list are loaded during deserialization as long as they fit into the memory. Here graph A and B can both fit, hence are loaded and ready to execute.

  1. When graph A comes for execution, it is already loaded and ready to execute.

  2. When graph B comes for execution, it is already loaded and ready to execute.

  3. When graph A comes for execution again, it is already loaded and ready to execute.

  4. When graph B comes for execution again, it is already loaded and ready to execute.

With Multi-Graph Switching, more graphs can remain loaded in memory. Here graphs A and B are retained in memory thereby, reducing the reload overhead compared to traditional graph switching.

Case 2: Execution order A, B, C, D

In multi-graph switching, all graphs from the retention list are loaded during deserialization as long as they fit into the memory. Here graph A and B can both fit, hence are loaded and ready to execute.

  1. When graph A comes for execution, it is already loaded and ready to execute.

  2. When graph B comes for execution, it is already loaded and ready to execute.

  3. When graph C comes for execution, it is loaded with graph switching. It will check if the existing graphs A and B along with graph C can fit into the memory. It will also check if any non-retained graphs can be removed. Currently no other non-retained graph is loaded and given our assumption, all three graphs cannot fit into memory. It will unload the least important retained graph B. If graph A and C can fit, C is loaded and executed. Otherwise it will continue unloading graph A for graph C to fit. Here A and C can fit hence C is loaded and executed.

  4. When graph D comes for execution, it is loaded with graph switching. Graph C is unloaded as this is not part of the retention list. If graph A and D can fit, D is loaded and executed. Otherwise it will continue unloading graph A for graph D to fit. Here both A and D can fit, hence D is loaded and executed.

The execution order in this example demonstrates that retained graphs may be unloaded if memory pressure occurs. Thus the number of graphs actually retained depends on the number of graphs in the retention list, graph sizes, graph execution order, and the underlying PD limit.

Case 3: Execution order A, B, C, A, B, D

In multi-graph switching, all graphs from the retention list are loaded during deserialization as long as they fit into the memory. Here graph A and B can both fit, hence are loaded and ready to execute.

  1. When graph A comes for execution, it is already loaded and ready to execute.

  2. When graph B comes for execution, it is already loaded and ready to execute.

  3. When graph C comes for execution, it is loaded with graph switching. It will check if the existing graphs A and B along with graph C can fit into the memory. It will also check if any non-retained graphs can be removed. Currently no other non-retained graph is loaded and given our assumption, all three graphs cannot fit into memory. It will unload the least important retained graph B. If graph A and C can fit, C is loaded and executed. Otherwise it will continue unloading graph A for graph C to fit. Here A and C can fit hence C is loaded and executed.

  4. When graph A comes for execution, it is already loaded and ready to execute. The non-retained graph C remains loaded until next graph switching is triggered. At the time of the next graph switching, graph C is a candidate to be unloaded as it is not part of the retention list.

  5. When graph B comes for execution, it is loaded with graph switching. It will also be marked for retention. However, if at any point, memory pressure occurs, it may be unloaded as seen in step 3.

  6. When graph D comes for execution, it is loaded with graph switching. Graph C is unloaded as this is not part of the retention list. If graph A and D can fit, D is loaded and executed. Otherwise it will continue unloading the least important retained graph A for graph D to fit. Here both A and D can fit, hence D is loaded and executed.

The execution order in this example demonstrates that any retained graph may get unloaded and require reloading with switching. Additionally, any graph that is not part of the retention list gets unloaded when multi-graph switching occurs.

It is important to note that graph_retention_order is used to preload graphs during deserialization. Users must ensure that all graphs listed in graph_retention_order fit within memory. In the above examples, if the list included more retention graphs than the memory allowed (e.g., [“A”, “B”, “C”] when only two can fit), the configuration would need to be adjusted so that all of the listed graphs can be loaded at once.

Note

This feature is currently in experimental beta release and has the following limitations/constraints:

  • Graphs from the graph_retention_order list are preloaded based on the PD’s virtual address space limit. Users need to configure this list to ensure all graphs can be preloaded. If all graphs from this list cannot be deserialized on any of the PDs then this list needs to be reconfigured taking into account the graph’s memory usage.

  • The number of retained graphs is highly dependent on the number of graphs in the retention order, execution order, PD limit, and the actual memory usage of the graphs. It may not retain all the graphs from the provided list.

  • At any point, memory pressures may cause any of the retained graphs to be unloaded.

  • During multi-graph switching, graphs not in the retention list and not currently executing will be unloaded during switching.

  • If the graph retention order is empty, by default traditional graph switching will be triggered.

  • Compared to traditional graph switching, the RAM usage will be higher because more graphs stay loaded.

  • For the memory_limit_hint, any non-zero value will enable graph switching. Values greater than zero will only indicate the low memory mode. Any specific memory_limit_hint value will not affect the graph switching behaviour.

  • Multi-graph switching does not support concurrent graph execution between the switching graphs.

  • Multi-graph switching cannot be used together with the QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES configuration.

Benefits of batch inference and multi-threaded inference

Multi-threading hides the cost of CPU/HTP communication (RPC) time. In single-threaded inference, the HTP hardware is not utilized efficiently, as shown in the example below.

../../_static/resources/multi_thread_and_multi_batch/single_vs_multi_thread.jpg

Compare this to multi-threaded inference: here, the RPC time of the first inference is masked by the inference on the second thread, so the HTP hardware spends less time idle. Similarly, using multiple inferences per batch improves the reuse of weights memory, avoiding reloading of the weights. Note that activation size is a factor that affects the benefit of using batches.

  • With low resolution models, weights and activations fit in VTCM. In this case, using batched activations reuses the weights for all the activations, thus increasing the efficiency per batch.

  • With large-resolution models, the activations take up more VTCM space, so the weights are reused for a smaller number of activations. This reduces the reuse efficiency of weights per batch.

Example:

With SNPE 2.9, SM8550 (Kailua.LA.1.1-00190 GKI.HY11-1) shows approximately 1.8x benefit for resnet50 when using snpe-parallel-run.

  • Baseline: Using 1 batch and 1 thread achieves approximately 1500 inf/s

  • Batching and multithreading: Using batches of 5 and 2 threads achieves approximately 2800 inf/s

Hexagon NPU Runtime Driver (Windows Only)

Hexagon NPU Runtime Driver (HNRD) is available for Windows-based Snapdragon® X Series Platforms. HNRD is designed to be forward and backward compatible with Qualcomm AI Stack SDKs. With HNRD in the system, applications have the option to unbundle the QNN HTP platform-dependent libraries, making these applications portable across older and newer Windows platforms. HNRD is packaged and distributed independently from the QNN SDK; it is currently packaged with the device BSP, which OEMs use to pre-install HNRD on their devices.

Switching Between Traditional and HNRD Paths

../../_static/resources/hexagon_npu_runtime_driver.jpg
Pre-driver:
  • Traditional path only; applications build with the QNN SDK and bundle the QNN HTP platform dependent libraries

Post-driver:
  • Traditional and HNRD paths available; the application can choose which path to use

  • Traditional path; applications build with the QNN SDK and bundle the QNN HTP platform dependent libraries with the application (Default – same as pre-driver)

  • HNRD path; applications build with the QNN SDK, but utilize the platform dependent libraries installed on the system

  • Note: there is no difference in the build step between traditional and HNRD paths. In addition, the bundling and choosing between traditional and HNRD paths is applicable to both QNN and SNPE

In other words, if an application bundles the QNN HTP platform dependent libraries (i.e., QnnHtpV73Stub.dll and QnnHtpV73Skel.so), it will default to choose the traditional path. Otherwise, if the platform dependent libraries are not bundled, it will fall back to choose the HNRD path.

The following sample logs illustrate the HNRD path being applied:

0.0ms [WARNING] QnnDsp <W> Traditional path not available. Switching to user driver path
0.0ms [WARNING] QnnDsp <W> HTP user driver is loaded. Switched to user driver path

Compatibility Support

The minimum QNN and SNPE version is 2.22.2. Applications can be built with older or newer versions of the QNN SDK and they will still work on the device. Depending on the HNRD version installed on the system, new features may not be supported.

Note that traditional and HNRD paths can co-exist on the platform and each application independently selects whether to use traditional or HNRD paths.

Context Binary Management

When utilizing context caching with HNRD, special attention is required for managing the context binaries. When using the HNRD path, online prepare and context binary loading are done by HNRD, since they are platform-dependent. A context binary generated by one version of HNRD might not run on an older HNRD version, or might not utilize all software/hardware capabilities of a newer HNRD version. The compatibility of saved context binaries with the installed HNRD must be checked every time before loading, since the HNRD installed on a device can be upgraded (or downgraded) at any time.

QnnContext_createFromBinary() checks the compatibility automatically before loading the context binaries. Setting QnnContext_BinaryCompatibilityType_t can control whether to fail sub-optimal context binaries during compatibility check. QnnContext_validateBinary does a similar check as QnnContext_createFromBinary() except it won’t create the context.

If a context binary fails to pass QnnContext_createFromBinary() or QnnContext_validateBinary, performing online prepare to create another valid context binary can solve the problem. To reduce the latency impact of online prepare, one can continue execution with the original sub-optimal context binary while doing online prepare in a background thread and switch to the new context binary once online prepare is done. Or one can first do a fast prepare (e.g., by setting QNN_HTP_GRAPH_OPTIMIZATION_TYPE_FINALIZE_OPTIMIZATION_FLAG to 1) to execute with a sub-optimal context, and then do a slow prepare (e.g., by setting QNN_HTP_GRAPH_OPTIMIZATION_TYPE_FINALIZE_OPTIMIZATION_FLAG to 3) in a background thread.

Note that online prepare may not succeed. This can happen when a model converted by a newer SDK version uses features that are not supported by the HNRD. In such a case, upgrading the HNRD is required. The following example shows how to handle context binary management.

Qnn_ContextHandle_t context;                     // The context used to do inference
std::future<Qnn_ContextHandle_t> futureContext;  // The context being created in background

// Create from binary and set compatibility mode to strict to check sub-optimality.
Qnn_ErrorHandle_t result =
    doCreateFromBinary(QNN_CONTEXT_BINARY_COMPATIBILITY_STRICT, &context);

if (result != QNN_SUCCESS) {
  if (result == QNN_CONTEXT_ERROR_BINARY_SUBOPTIMAL) {
    // The context binary is valid but sub-optimal.
    // Continue execution with this context binary.
    // Set compatibility mode to permissive to bypass the sub-optimality check.
    doCreateFromBinary(QNN_CONTEXT_BINARY_COMPATIBILITY_PERMISSIVE, &context);
  } else if (result == QNN_CONTEXT_ERROR_CREATE_FROM_BINARY) {
    // The context binary cannot run.
    // Do a fast prepare with optimization level 1 without saving the context binary to file.
    context = doOnlinePrepare(HTP_OPTIMIZATION_LEVEL_1, NO_SAVE_CONTEXT);

    // This assumes nullptr is returned if online prepare fails.
    if (context == nullptr) {
      // If online prepare fails, one possible reason could be that HNRD is too old
      // for the graph. In such a case, prompt the user to upgrade HNRD.
      message(ERROR, "The HNRD is too old. Please install the latest HNRD.");
    }
  } else {
    throw std::runtime_error("Skip handling of other errors.");
  }

  // Now we have a sub-optimal context in the variable "context".
  // Do online prepare in another thread to produce the optimal context and save it.
  futureContext = std::async(std::launch::async, doOnlinePrepare, HTP_OPTIMIZATION_LEVEL_3,
                             SAVE_CONTEXT);
}

// Do inference.
while (waitInputData()) {
  // Switch to the optimal context if prepare is done.
  if (futureContext.valid() &&
      futureContext.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
    context = futureContext.get();
  }

  doInference(context);
}

// Clean up.
doFreeContext(context);

QnnContext_createFromBinaryListAsync API

This API provides a method for asynchronously initializing multiple context binaries (models) in a single API call. It is currently only supported on Mobile and Windows platforms.

It offers the following primary features:

Notifications and Handles:

  • For each graph within a context, and for all contexts, a notification will be sent after they are initialized along with the initialization status.

  • These notifications may arrive before or after the API returns.

  • The order of the notification depends on the initialization time of each context and is not guaranteed to follow any sequence.

  • For a context with multiple graphs, there will be a notification for each graph and an additional one for the context.

  • Valid graph and context handles will be sent back to the client through these notifications. Clients can then freely use these handles.

Guidelines and Limitations of using this API

  • It is highly advisable to use a single thread to call this API and no other QNN API should be called in parallel.

    • Multiple application threads trying to initialize multiple use cases in parallel is not fully supported and can result in Input/Output Memory allocation or mapping failures during graph execution or during QnnMem_register() call.

    • We internally parallelize model initialization (the driver decides at runtime whether to do it on CPU, HTP, or both), fully utilizing the backend to minimize the initialization time. Therefore, calling any other QNN API in parallel can result in over-subscription and degrade performance, which is counterproductive.

  • Clients should not use the context configuration option QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_MULTI_CONTEXTS or QNN_HTP_CONTEXT_CONFIG_OPTION_REGISTER_CONCURRENT_RESOURCE_SHARING to register multiple contexts, as they are explicitly designed for the QnnContext_createFromBinary() API.

  • When the HTP custom context configuration (QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES) is enabled, it is recommended that clients use I/O buffers that utilize the Multi-Tensor shared buffer. This method maps a group of tensors to a single shared buffer, optimizing both space and initialization time. Additionally, when using this option, the batch size of the graph should be set to 1 to prevent reallocation of the shared buffer.

Usage Diagram:

Refer to the following diagram for a visual representation of the API usage.

Context Create From Binary List Async Callflow Diagram

Multiple cDSP Sessions

Note

  • This feature is often referred to as Multi-PD, a term that can be misleading as PD (process domain) technically denotes a different transport session.

  • Please refer to Hexagon SDK Documentation for more information.

  • Certain Snapdragon SoCs are capable of creating and opening multiple sessions within cDSP from a single CPU application. The number of concurrent sessions that can be supported per CPU application process, as well as the total number of sessions that can be supported within cDSP, are dependent on the hardware configuration of the SoC.

  • cDSP is a 32-bit processor, and each session supports a virtual address space of 4 GB; however, only 3.75 GB of this space is usable.

  • The maximum amount of memory per CPU process is constrained by the lesser of the total free RAM available and the product of the number of sessions supported per CPU process and 3.75 GB.

Model Loading Across Multiple Sessions

Here are some key aspects:

  • When loading the offline-prepared model (via QnnContext_createFromBinary(), QnnContext_createFromBinaryWithSignal(), or QnnContext_createFromBinaryListAsync()), the backend will attempt to deserialize the model on the first available session. If the attempt fails due to memory availability, the backend will attempt to deserialize the model on the next available session.

  • The initialization time for a model that is deserialized on a session other than the first one will be slightly higher. This is because the backend tries to load the model on each of the available sessions sequentially.

  • A single context binary cannot be split across multiple sessions. Therefore, it is expected that the client will split the model before running QNN converter tools. Optimum split points are decided based on the network topology, precision used (a8w4, a16w4, a8w8, a16w8, etc.), and the maximum virtual address space per session.

  • Although the maximum usable virtual address space available per session is 3.75 GB, it is recommended that clients limit the heap required for a single context binary to be under 3 GB.

  • The decision for which session to use for a context binary is determined by the backend. There is no API to specify the session while loading a context. The sessions used are transparent to the client application.

The following example illustrates how multi-sessions work. The algorithm may vary slightly based on the API used to initialize the models; however, the backend generally tries to pack the models efficiently onto a single session.

  • Consider an ideal scenario with no fragmentation, where the memory required for model deserialization is exactly equal to the context binary size. Assume we have five (5) models with sizes 1.5 GB, 1.5 GB, 1 GB, 1 GB, and 250 MB, loaded sequentially in this order.

  • The first model’s deserialization attempt will happen on session #0. Assuming session #0 has nothing loaded right now, it will get deserialized successfully.

  • The second model will also get deserialized on session #0 as there is still space available on it. At this point, session #0 now has 3 GB of virtual space occupied.

  • The third model’s deserialization attempt will happen on session #0 first. This will be unsuccessful as there is not enough space available on it. The backend will then create a new session (say session #1). Session #1 has 3.75 GB available, and hence the third model will be deserialized successfully on it.

  • The fourth model’s deserialization attempt will be made on session #0 first. This will be unsuccessful for the same reason as the third model. The next deserialization attempt will be made on session #1. This will be successful as only 1 GB was occupied on session #1.

  • The fifth model’s deserialization attempt will be made on session #0. As the model requires 250 MB and we have that much space available on session #0, the model will be successfully deserialized on it.

Please refer to the diagram below for an illustration of this example:

Multi PD illustration

Guidelines and Limitations for I/O Buffer Allocation/Mapping Failures:

As described earlier, QNN HTP runtime uses an Efficient Packing mechanism to deserialize context binaries on different sessions. This means that if a context binary can fit on the current session, then the backend will deserialize the context binary on that session. If it does not fit, then it will try on the next session.

Typically in use-cases where shared buffers are used for inputs and outputs (I/O), clients register externally allocated memory with the backend through the QnnMem_register API (avoids time spent on memory copy). When the models are large (LLMs/LVMs), their I/O also tend to be larger. In those cases, clients might encounter memory registration failures. This happens due to the virtual address space being occupied enough that there is no space available for the I/O to be mapped. Although this is predominant in shared buffer use cases, it can happen without those as well. Note that the I/O buffers specific to a context need to be mapped to the same session where that context is deserialized.

Consider the following example:

  • There are five (5) context binaries out of which the first four (4) context binaries can fit into the first session (session #0), and for the fifth context binary, the backend spawns a new session (say session #1) and deserializes it there.

  • After deserializing all the context binaries, the client application is trying to register I/O buffers for each of the contexts and one of the mappings fails due to unavailability of the space on session #0.

  • Had the fourth context not been deserialized on session #0 (with the fifth context then landing on session #2), session #0 would have been less tightly packed, leaving space for those buffers.

Please refer to the diagram below for an illustration of this scenario:

I/O Buffer Registration Failure

Recommendations to address the issue:

  • Use the custom context configuration (QNN_HTP_CONTEXT_CONFIG_OPTION_IO_MEM_ESTIMATION) to make sure the space is available on the session for I/O buffers. This does have a limitation that models can only be initialized sequentially, and right after the model is initialized, the I/O buffers for that context should be registered. This also increases peak RAM during model initialization; however, sustained RAM usage would be identical to the case where this option is disabled.

  • Another potential way to alleviate the issue is to introduce a dummy memory registration call. If a client runs into a memory registration failure issue, they can register a dummy buffer that can force the context to deserialize on to the next session, thereby bypassing the issue. Clients would be expected to free the dummy registered memory after the initialization of all contexts is completed.

  • Use the QnnContext_createFromBinaryListAsync API to initialize large-model use cases. This API can be enabled with virtual address space optimization, with which the backend internally ensures enough space is available for I/O to be mapped later. After all models are initialized, which can be done through a single API call, I/O can be safely mapped to the PD.
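The first recommendation above uses a custom HTP context config. A minimal sketch of passing it at context creation follows; it assumes the QNN SDK headers, and the option and field names follow QnnHtpContext.h and QnnContext.h but should be verified against your SDK version:

```cpp
#include "HTP/QnnHtpContext.h"
#include "QnnContext.h"

// Custom HTP config: ask the backend to reserve session space for I/O buffers.
QnnHtpContext_CustomConfig_t ioMemConfig;
ioMemConfig.option          = QNN_HTP_CONTEXT_CONFIG_OPTION_IO_MEM_ESTIMATION;
ioMemConfig.ioMemEstimation = true;

// Wrap the custom config in a generic context config.
QnnContext_Config_t contextConfig;
contextConfig.option       = QNN_CONTEXT_CONFIG_OPTION_CUSTOM;
contextConfig.customConfig = &ioMemConfig;

const QnnContext_Config_t* contextConfigs[] = {&contextConfig, nullptr};
// Pass contextConfigs to QnnContext_createFromBinary. With this option,
// contexts must be initialized sequentially, and the I/O buffers for each
// context should be registered immediately after it is created.
```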

Init Cancellation

The QnnContext_createFromBinaryWithSignal API allows users to interrupt the context creation process using two types of signals: abort or timeout.

  • Abort Signal: This signal enables users to interrupt the API based on their requirement.

  • Timeout Signal: This signal will automatically interrupt the API after a predefined timeout period.

For more details regarding signal configuration, refer to QnnSignal.h.

Init cancellation with timeout signal illustration
Init cancellation with abort signal illustration

Graph Priority

Setting Graph Priority involves assigning priorities to both HMX and HVX threads during graph processing.

Priority Behavior

When a priority level X is specified (assuming Y is the corresponding HTP thread priority; refer to the table below for the mapping):

  • HMX thread(s) are assigned priority Y.

  • HVX thread(s) are assigned priority Y + 1.

NOTE: A higher numeric value indicates a lower execution priority.

QNN to HTP Thread Priority Mapping

The table below outlines how QNN priority levels map to HTP thread priorities:

HTP Thread Priority Derivation

QNN Priority Level                                 HMX Thread(s) Priority   HVX Thread(s) Priority
QNN_PRIORITY_LOW (0)                               0xC5                     0xC6
QNN_PRIORITY_NORMAL / QNN_PRIORITY_DEFAULT (100)   0xC0                     0xC1
QNN_PRIORITY_NORMAL_HIGH (150)                     0xBD                     0xBE
QNN_PRIORITY_HIGH (200)                            0xBB                     0xBC

For more details on priority levels, refer to the Hexagon SDK documentation.
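The mapping in the table can be expressed as a small lookup. Here `htpThreadPriorities` is an illustrative helper, not a QNN API, and the rounding-down behavior for values between the defined levels is an assumption:

```cpp
#include <cstdint>
#include <utility>

// Given a QNN priority level, return the (HMX, HVX) thread priorities from
// the table above: HMX threads get Y and HVX threads get Y + 1, where a
// higher numeric value means a lower execution priority.
std::pair<uint8_t, uint8_t> htpThreadPriorities(uint32_t qnnPriority) {
  uint8_t hmx;
  if (qnnPriority >= 200)      hmx = 0xBB;  // QNN_PRIORITY_HIGH
  else if (qnnPriority >= 150) hmx = 0xBD;  // QNN_PRIORITY_NORMAL_HIGH
  else if (qnnPriority >= 100) hmx = 0xC0;  // QNN_PRIORITY_NORMAL / DEFAULT
  else                         hmx = 0xC5;  // QNN_PRIORITY_LOW
  return {hmx, static_cast<uint8_t>(hmx + 1)};
}
```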

Setting Graph Priority

The HTP backend supports setting graph priority by passing a QnnGraph_Config_t structure with the QNN_GRAPH_CONFIG_OPTION_PRIORITY option when calling QnnGraph_create.

The graph priority is set to QNN_PRIORITY_DEFAULT by default if no configuration options are provided.

Clients may also modify the priority of an existing graph using QnnGraph_setConfig. The use of QNN_PRIORITY_NORMAL_HIGH and QNN_PRIORITY_HIGH is restricted and not available to general developers.

The HTP backend also allows clients to modify graph priorities using QnnContext_setConfig by passing a QnnContext_Config_t structure with QNN_CONTEXT_CONFIG_OPTION_PRIORITY.

When this is called, the priorities of all graphs in the context are updated.
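Setting the priority at graph creation can be sketched as follows. This is a hedged config fragment assuming the QNN SDK headers; verify the structure and macro names against QnnGraph.h in your SDK:

```cpp
#include "QnnGraph.h"
#include "QnnTypes.h"

// Graph config selecting the priority option.
QnnGraph_Config_t priorityConfig = QNN_GRAPH_CONFIG_INIT;
priorityConfig.option   = QNN_GRAPH_CONFIG_OPTION_PRIORITY;
priorityConfig.priority = QNN_PRIORITY_NORMAL;  // the default when no config is given

// Null-terminated config list passed at graph creation, e.g.:
const QnnGraph_Config_t* graphConfigs[] = {&priorityConfig, nullptr};
// qnnInterface.graphCreate(contextHandle, "myGraph", graphConfigs, &graphHandle);
```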

LLM native KVcache

The native KVcache feature is designed to optimize the performance of LLMs executed on the HTP backend.

The HTP backend transforms the input KV tensors to HMX format for HMX operations. This transformation is expensive, and the overhead increases with longer context lengths. Native KVcache avoids this costly transformation by keeping input KV tensors in the HMX layout.

This feature is typically integrated into an LLM pipeline, where the KV management module is responsible for updating the KVcache in the HMX layout format.

Benefits

  • Improved TTFT and Token Rate, especially for large context lengths

  • Reduced Power Consumption, particularly beneficial for large context lengths.

Constraints

  • Applies only to GenAI LLM models with KVcache mode

  • The KVcache tensors must be uint8 and symmetrically quantized

  • KVcache must be either fully in native format or not at all. Partial use, where some decoder layers use native KVcache while others do not, is not supported.

  • Context lengths must be a multiple of 256, and attention head_dim must be a multiple of 64

  • The operator for concatenating new and old KV in attention must be ScatterElement instead of Concat; the supported single-head structures are illustrated in the graph below.

  • For ARN where N is a multiple of 32, the KV input must be in native format. The KV output can be either native or flat. Using native KV output can accelerate the KV update during TTFT, compared to using flat KV output.

  • For ARN where N is not a multiple of 32, AR1/4/8/16 are typically used. Please ensure that: 1) native KV input is used; 2) flat KV output is used; 3) scatter_index <= (context length - roundup(ARN, 32)). In this case, the client is responsible for converting the flat output to the native input format.

Native KVcache graph pattern illustration
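The scatter_index bound in the last constraint can be checked with simple arithmetic. Here `roundUp32` and `scatterIndexValid` are illustrative helpers, assuming the context length is at least roundup(ARN, 32):

```cpp
#include <cstdint>

// Round ARN up to the next multiple of 32.
uint32_t roundUp32(uint32_t n) { return (n + 31u) & ~31u; }

// Check scatter_index <= (context length - roundup(ARN, 32)).
// Assumes contextLength >= roundUp32(arN), per the constraint above.
bool scatterIndexValid(uint32_t scatterIndex, uint32_t contextLength, uint32_t arN) {
  return scatterIndex <= contextLength - roundUp32(arN);
}
```

For example, with AR8 and a context length of 4096, scatter_index may be at most 4096 - 32 = 4064.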

How to enable

To enable and use native KVcache, please ensure the input model meets all the constraints. Users can refer to the QC notebook to export a model that supports native KVcache. Then, when running qnn-context-binary-generator, specify the dataFormat for the KV cache I/O tensors as QNN_TENSOR_DATA_FORMAT_HMX_WEIGHT_LAYOUT by passing the path to a JSON file via the --data_format_config argument:

$ qnn-context-binary-generator --data_format_config DataFormatFile.json \
                               --backend [SDK_PATH]/lib/x86_64-linux-clang/libQnnHtp.so \
                               --dlc_path [MODEL].dlc \
                               --model libQnnModelDlc.so \
                               --output_dir [OUTPUT_DIR] \
                               --binary_file [CONTEXT_BIN] \
                               --config_file HtpConfigFile.json

A sample DataFormatFile.json that enables both native KV input and native KV output. All KV tensors must be included:

{
   "graphs": [
      {
         "graph_name": [...],
         "tensors": [
            {
               "tensor_name": "past_key_0_in",
               "dataFormat": "QNN_TENSOR_DATA_FORMAT_HMX_WEIGHT_LAYOUT"
            },
            ...
            {
               "tensor_name": "past_key_0_out",
               "dataFormat": "QNN_TENSOR_DATA_FORMAT_HMX_WEIGHT_LAYOUT"
            },
            ...
         ]
      }
   ]
}

A sample DataFormatFile.json that enables native KV input only:

{
   "graphs": [
      {
         "graph_name": [...],
         "tensors": [
            {
               "tensor_name": "past_key_0_in",
               "dataFormat": "QNN_TENSOR_DATA_FORMAT_HMX_WEIGHT_LAYOUT"
            },
            ...
         ]
      }
   ]
}

Note

Currently QNN_TENSOR_DATA_FORMAT_HMX_WEIGHT_LAYOUT is only supported in native KVcache scenario.

HMX_WEIGHT_LAYOUT explanation

Assume the flat shape of a single head KV is [DIN, DOUT], where DIN represents the number of input channels and DOUT represents the number of output channels. The transformation from flat to HMX_WEIGHT layout is

[DIN, DOUT] -> DOUT/KV_TILE_SIZE * [DIN, KV_TILE_SIZE] -> DOUT/KV_TILE_SIZE * [DIN/32, KV_TILE_SIZE/32, 8:DIN, 32:KV_TILE_SIZE, 4:DIN] -> [DOUT/KV_TILE_SIZE, DIN*KV_TILE_SIZE/1024, 1024]

KV_TILE_SIZE is fixed in this SDK version: K_TILE_SIZE = 256 and V_TILE_SIZE = 64.
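The shape arithmetic above can be sketched as follows. Here `hmxWeightShape` is an illustrative helper that computes only the resulting 3-D shape, not the element permutation:

```cpp
#include <array>
#include <cstdint>

// Flat [DIN, DOUT] maps to [DOUT/TILE, DIN*TILE/1024, 1024], with
// TILE = 256 for K tensors and TILE = 64 for V tensors. The total element
// count DIN * DOUT is preserved by the transformation.
std::array<uint64_t, 3> hmxWeightShape(uint64_t din, uint64_t dout, uint64_t tile) {
  return {dout / tile, din * tile / 1024, 1024};
}
```

For example, a K tensor with DIN = 128 and DOUT = 1024 maps to [4, 32, 1024], and the corresponding V tensor maps to [16, 8, 1024]; both hold the original 128 * 1024 elements.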

Other Limitations

  • Cannot support v68 targets or lower

  • The maximum context length that can be supported is limited by the fixed VTCM size. For example, 2M VTCM won’t support 16K context length or above.

MaskedSoftmax

The MaskedSoftmax feature is designed to optimize the accuracy and performance of LLMs executed on the HTP backend. MaskedSoftmax replaces the Softmax(Add(In, Mask)) structure in the attention block of LLMs. MaskedSoftmax uses the attention_mask tensor to mask out invalid tokens in the softmax operation directly, instead of using an Add operation.

Benefits

  • Improved LLM accuracy

  • Improved TTFT and Token Rate, especially when valid tokens account for a small part of large context lengths

Constraints

  • Applies only to uint16 and uint8 quantization for now

How to enable

To enable MaskedSoftmax, model structure must be one of the following three patterns.

MaskedSoftmax model structure illustration

Note

  • The three structures above differ in the datatype of attention_mask and in how attention_mask is treated as the condition of the Where op. In patterns A and B, the attention_mask is quantized to uint16; in pattern C, it is quantized to uint8. In pattern A, entries with zero value in the attention_mask are kept for the softmax operation, and entries with non-zero value are masked out. In patterns B and C, entries with zero value in the attention_mask are masked out, and entries with non-zero value are kept.

  • The Equal op in pattern A or NotEqual op in pattern B following attention_mask is used to convert the attention_mask to a mask tensor with 0 and 1, and the mask tensor is used as the condition of Where op.

  • The Where op is used to mask out the invalid entries in the input tensor, and the output of the Where op is the input tensor of the Softmax op. If the condition is false, the corresponding entry in the input tensor is masked out and MaskedSoftmax outputs 0 for that entry. If the condition is true, the corresponding entry is kept and participates in the softmax operation.

  • The Add(ReduceMin(Input), B) structure is used to avoid quantization/dequantization loss. To make sure that the output of softmax in the invalid entries is 0, the parameter B of Add op is currently constrained to be less than or equal to -20.0.
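The masking semantics described above can be sketched as a reference implementation in plain C++. Here `maskedSoftmax` is illustrative only, operating on float values rather than quantized tensors:

```cpp
#include <cmath>
#include <vector>

// Entries whose mask condition is false are excluded from the softmax and
// produce 0, instead of being pushed down by adding a large negative bias.
std::vector<float> maskedSoftmax(const std::vector<float>& in,
                                 const std::vector<bool>& keep) {
  std::vector<float> out(in.size(), 0.0f);
  // Max over kept entries, for numerical stability.
  float maxVal = -INFINITY;
  for (size_t i = 0; i < in.size(); ++i)
    if (keep[i] && in[i] > maxVal) maxVal = in[i];
  float sum = 0.0f;
  for (size_t i = 0; i < in.size(); ++i) {
    if (!keep[i]) continue;
    out[i] = std::exp(in[i] - maxVal);
    sum += out[i];
  }
  // Normalize kept entries; masked-out entries stay exactly 0.
  for (size_t i = 0; i < in.size(); ++i) out[i] = keep[i] ? out[i] / sum : 0.0f;
  return out;
}
```

For example, with four equal logits and only the first two tokens valid, the result is 0.5 for each valid position and exactly 0 for the masked positions.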

QnnContext_createFromBinaryWithCallback API

This API provides a callback-based registration mechanism for client-defined buffer allocators and data loading strategies, facilitating synchronous context creation. The client-defined callback is used to override the default buffer allocation and data loading behavior of QnnContext_createFromBinary, allowing for customized and flexible context binary loading.

The Qnn_ContextBinaryCallback_t* callback parameter is extended to support external buffer allocation/deallocation and custom data loading workflows. It enables clients to define how context binary data is provided and released during context creation. This structure primarily includes the following client-defined callback functions:

  • Qnn_ContextBinaryDmaDataProviderFn_t / Qnn_ContextBinaryRawDataProviderFn_t

  • Qnn_ContextBinaryDmaDataReleaseFn_t / Qnn_ContextBinaryRawDataReleaseFn_t

In each DataProviderFn_t callback, the client must ensure that the buffer for the requested data section has been properly allocated and the corresponding data has been fully loaded by the time the callback returns. Therefore, both buffer allocation and data loading schemes must be provided together to ensure correct and complete context creation behavior. For detailed definitions, refer to the Qnn_ContextBinaryCallback_t structure in QnnContext.h.

Key Features:

  • Customized buffer management: Allows clients to manage external buffers and perform customized data loading strategies for context data (e.g., weights).

  • Initialization time optimization: When using Qnn_ContextBinaryDmaBufferCallback_t, clients can leverage direct I/O scheme to load context binary data directly into external DMA buffers. This avoids runtime copy from user-space to backend buffer and reduces memory and time overhead. Performance gain depends on the client’s callback implementation.

Note

In the Qnn_ContextBinaryCallback_t definition, two callback types are introduced:
  • Qnn_ContextBinaryRawBufferCallback_t: For allocating standard raw buffers together with loading corresponding data. These buffers are typically user-space buffers not directly shared with the backend.

  • Qnn_ContextBinaryDmaBufferCallback_t: For allocating DMA buffers together with loading corresponding data. These buffers can be directly shared with the backend, enabling zero-copy context data loading. A typical use case is allocating and loading model weights buffers.

Currently, only Qnn_ContextBinaryDmaBufferCallback_t is supported. The raw buffer callback type is reserved for future extensibility. The DMA buffer callback is currently only used for model weights allocation and data loading, referred to as the external weights-loaded buffer feature.

Guidelines and limitations for using this API:

  • It is highly recommended that the callback implementation uses a direct I/O scheme in combination with Qnn_ContextBinaryDmaBufferCallback_t to achieve the zero-copy context binary data loading.

  • Proper data alignment support was introduced in the new context binary data format starting from qairt-2.37. Therefore, when using the direct I/O scheme, the context binary must be generated using qairt-2.37 or later.

  • This API is extended based on the QnnContext_createFromBinary API and does not support QnnContext_createFromBinaryListAsync. As a result, features associated with the ListAsync API—such as QNN_HTP_CONTEXT_CONFIG_OPTION_SHARE_RESOURCES and VA reservation for shared resources—are not supported.

  • It is not supported together with the graph switch feature.

  • It is not supported together with the udma64 feature, which can be disabled during the graph preparation phase.

  • It is not supported together with the QNN_CONTEXT_CONFIG_OPTION_DEFER_GRAPH_INIT.

  • It is not supported together with the securePD model protection feature.

  • Currently it is only supported and validated on Android platform. Other platforms may require additional adaptation and validation.

High-Level Usage Pipeline:

Refer to the following diagram for a visual representation of the API usage.

Context Create From Binary With Callback Callflow Diagram

Steps to use the QnnContext_createFromBinaryWithCallback API:

1. Define Callback Functions

Implement the logic for buffer allocation together with context binary data loading.

Users have to ensure that the following requirements are met for external buffers:

  • When using Qnn_ContextBinaryDmaBufferCallback_t, they have to be DMA buffers.

  • The buffer’s start address must be aligned to at least 4KB (page alignment). It is highly recommended that the valid data begins exactly at the start of the external buffer (i.e., dataStartOffset should ideally be 0). This recommendation is based on two key considerations.
    • Starting from qairt-2.37, the context binary data format has been optimized to ensure that the main data section is 4KB-aligned relative to the file start, enabling efficient offset-based reads.

    • The backend currently has limitations in handling arbitrary dataStartOffset value during the mapping phase.

  • They must be at least the required size specified in the request parameter passed through the callback.

  • For external DMA buffers, the FDs returned by the dataProvider callback must always be distinct—each invocation must return a different FD.

  • After data loading, any modification to the external buffer should only be induced by QNN, otherwise behavior is undefined.

  • They must not be deallocated until QNN explicitly invokes the dataRelease callback.

2. Create Context

Use the QnnContext_createFromBinaryWithCallback API to create the context.

  • The dataProvider and dataRelease callbacks are registered with the context.

  • During context creation, QNN will invoke the registered dataProvider callback to allocate external buffers and load the required data. The callback might be invoked multiple times, once for each data section being loaded (e.g., shared weights and non-shared weights are handled separately).

  • Once the data is loaded (i.e., the callback returns), QNN will handle the mapping of external buffers to ensure they are shared with the backend.

  • During context release, the dataRelease callback will be triggered to properly release the external buffers.

Code example:

HTP external weights-loaded buffer example
// QnnInterface_t is defined in ${QNN_SDK_ROOT}/include/QNN/QnnInterface.h
QnnInterface_t qnnInterface;
// Init qnn interface ......
// See ${QNN_SDK_ROOT}/examples/QNN/SampleApp code

// Step 1. Define the DMA callback functions
Qnn_ErrorHandle_t dmaDataProviderFn(Qnn_ContextBinaryDataRequest_t req,
                                    Qnn_ContextBinaryDmaDataResponse_t* dmaDataResponse,
                                    void* notifyParam) {
   // Implement buffer allocation and data loading processes
   Qnn_ErrorHandle_t err = QNN_SUCCESS;

   // notifyParam can be used to pass a custom instance for identifying which model to load.
   auto* pair = reinterpret_cast<std::pair<CustomClass*, uint32_t>*>(notifyParam);
   CustomClass* customInstance = pair->first;
   uint32_t contextId          = pair->second;

   if (req.size == 0) {
     // handle error
     return QNN_CONTEXT_ERROR_INVALID_ARGUMENT;
   }

   // Allocate the DMA buffer
   const uint64_t alignOptimizedBufferSize = getAlignedSizeInBytes(PAGE_ALIGNED_SIZE, req.size);

   BufferInfo bufferInfo;
   err = customInstance->directIOScheme->allocateDmaBuffer(
       customInstance->m_filePath[contextId], req.offset, alignOptimizedBufferSize, &bufferInfo);
   if (bufferInfo.addr == nullptr) {
      // handle error
      return QNN_CONTEXT_ERROR_MEM_ALLOC;
   }

   dmaDataResponse->dmaBuffer.data  = bufferInfo.addr;
   dmaDataResponse->dmaBuffer.fd    = bufferInfo.dma_fd;
   dmaDataResponse->dataStartOffset = bufferInfo.paddingSize;
   dmaDataResponse->alignedSize     = bufferInfo.alignedSize;

   // Load the data into the DMA buffer
   err = customInstance->directIOScheme->storeBufferData(bufferInfo);
   if (err != QNN_SUCCESS) {
      // handle error
      return QNN_CONTEXT_ERROR_MEM_ALLOC;
   }

   return err;
}

Qnn_ErrorHandle_t dmaDataReleaseFn(Qnn_ContextBinaryDmaDataMem_t dmaDataMem,
                                   void* notifyParam) {
   // Implement buffer release process
   Qnn_ErrorHandle_t err = QNN_SUCCESS;

   // Free the DMA buffer
   auto* pair = reinterpret_cast<std::pair<CustomClass*, uint32_t>*>(notifyParam);
   CustomClass* customInstance = pair->first;
   err = customInstance->directIOScheme->deallocateDmaBuffer(dmaDataMem);
   if (err != QNN_SUCCESS) {
      // handle error
   }

   return err;
}

// Step 2. Create the context with the QnnContext_createFromBinaryWithCallback API
// Here 'this' is assumed to point to the client's CustomClass instance.
std::pair<CustomClass*, uint32_t>* notifyParam =
         new std::pair<CustomClass*, uint32_t>(this, static_cast<uint32_t>(contextIdx));

Qnn_ContextBinaryCallback_t callback {
   .type = QNN_CONTEXT_CALLBACK_DMA_BUFFER,
   .dmaBufferCallback      = Qnn_ContextBinaryDmaBufferCallback_t {
      QNN_CONTEXT_CALLBACK_DMA_BUFFER_VERSION_1,
      Qnn_ContextBinaryDmaBufferCallbackV1_t {dmaDataProviderFn,
                                              dmaDataReleaseFn,
                                              static_cast<void*>(notifyParam)}
     }
};

Qnn_ContextHandle_t context;
Qnn_ErrorHandle_t error = m_qnnFunctionPointers.qnnInterface.contextCreateFromBinaryWithCallback(
   backend,
   device,
   config,
   &callback,
   binaryBuffer,
   binaryBufferSize,
   &context,
   profile,
   signal);

if (error != QNN_SUCCESS) {
   // handle the error
}

// Execute graph
// See ${QNN_SDK_ROOT}/examples/QNN/SampleApp code

// Free context
// See ${QNN_SDK_ROOT}/examples/QNN/SampleApp code for details
if (QNN_CONTEXT_NO_ERROR != m_qnnFunctionPointers.qnnInterface.contextFree(context, profileBackendHandle)) {
   // handle error
}