Offline Graph Caching for DSP Runtime on HTP
The Qualcomm® Neural Processing SDK DSP runtime on targets with a Hexagon Tensor Processor (HTP) supports an offline graph caching feature, which prepares the backend graph on a Linux x86-64 host. At execution time the cache is loaded directly on the device, reducing model initialization time.
The workflow for Qualcomm® Neural Processing SDK users is:
model conversions using snpe-<framework>-to-dlc
model quantization using snpe-dlc-quant
model offline graph cache preparation using snpe-dlc-graph-prepare
model execution on target using snpe-net-run or custom application
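As a sketch, the four steps above can be strung together as follows. The model name, input lists, and target SoC (sm8750) are placeholders, and the ONNX converter flags are an assumption; the commands are echoed as a dry run, since the snpe-* tools themselves ship with the SDK and run on a Linux x86-64 host.

```shell
# Dry-run sketch of the four-step workflow above. File names, the ONNX
# converter flags, and the target SoC are hypothetical placeholders.
set -eu
{
  echo "snpe-onnx-to-dlc --input_network inception_v3.onnx --output_path inception_v3.dlc"
  echo "snpe-dlc-quant --input_dlc inception_v3.dlc --input_list image_file_list.txt --output_dlc inception_v3_quantized.dlc"
  echo "snpe-dlc-graph-prepare --input_dlc inception_v3_quantized.dlc --output_dlc inception_v3_quantized_cache.dlc --htp_socs sm8750"
  echo "snpe-net-run --container inception_v3_quantized_cache.dlc --input_list target_input_list.txt"
} | tee workflow_commands.txt
```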
DLC quantization here is a two-step process: first the model is quantized, then the offline cache is generated. The snpe-dlc-graph-prepare tool generates a DLC cache blob for the Qualcomm® Neural Processing SDK HTP runtime after the DLC has been quantized by the snpe-dlc-quant tool. snpe-dlc-graph-prepare can also be used with a float model to generate a cache for the HTP FP16 runtime.
For example, the following commands convert an Inception v3 DLC file into a quantized Inception v3 DLC file, generate the HTP graph cache, and store it in the DLC.
snpe-dlc-quant --input_dlc inception_v3.dlc --input_list image_file_list.txt --output_dlc inception_v3_quantized.dlc
snpe-dlc-graph-prepare --input_dlc inception_v3_quantized.dlc --output_dlc inception_v3_quantized_cache.dlc --htp_socs sm8750
Running snpe-dlc-graph-prepare generates the HTP graph for the provided model and adds the generated cache records to the DLC. If the HTP compiler cannot process or compile any section of the network, snpe-dlc-graph-prepare issues an error message.
snpe-dlc-graph-prepare makes it possible to quickly re-prepare the offline cache for a graph, using the same or a different version of the Qualcomm® Neural Processing SDK, without redoing the quantization step, which can take significant time for large input datasets.
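For instance, re-targeting the already-quantized DLC from the earlier example at a different SoC (sm8550 here, chosen as an assumption) only requires re-running the graph prepare step. The command is echoed as a dry run:

```shell
# Re-prepare the cache for a different (hypothetical) target SoC from
# the already-quantized DLC; quantization is not repeated.
set -eu
echo "snpe-dlc-graph-prepare --input_dlc inception_v3_quantized.dlc" \
     "--output_dlc inception_v3_quantized_cache_sm8550.dlc --htp_socs sm8550" \
  | tee reprepare_command.txt
```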
Similarly, the snpe-dlc-quantize tool provides the --enable_htp option to generate a DLC cache blob for the Qualcomm® Neural Processing SDK HTP runtime as part of the quantization process.
For example, the following command converts an Inception v3 DLC file into a quantized Inception v3 DLC file, generates the HTP graph cache, and stores it in the DLC.
snpe-dlc-quantize --input_dlc inception_v3.dlc --input_list image_file_list.txt --output_dlc inception_v3_quantized.dlc --enable_htp --htp_socs sm8750
Notes:
The offline prepared graph cache and the SNPE object at runtime must specify the same graph outputs. If the graph outputs specified at runtime do not match those in the prepared graph, the prepared graph is considered invalid and is ignored: the cache blob in the DLC is rejected and graph preparation is instead performed at runtime (called online prepare), with a corresponding increase in init time.
In order to enable CPU fallback for offline prepare, the DSP subnet that precedes the CPU subnet must have all output tensors that feed subsequent subnets marked as graph outputs.
The graph outputs are specified to the snpe-dlc-graph-prepare tool either as:
Output Layer Names (--set_output_layers), in which case all output tensors of those layers are considered graph outputs, or as
Output Tensor Names (--set_output_tensors), if not all outputs of a layer are to be considered graph outputs
For example, an intermediate layer may have one (or more) output tensors that should be considered graph outputs. By default, the tool chooses all output tensors of the last layer in the serialized graph.
Graph outputs can also be specified via an optional input_list passed to snpe-dlc-graph-prepare. To specify Output Layer Names, add a special line starting with "#" to the input_list file, listing the layer name(s):
#<output_op_name>[<space><output_op_name>]*
Alternatively, to specify Output Tensor Names, add a special line starting with "%" to the input_list file, listing the output tensor name(s):
%<output_tensor_name>[<space><output_tensor_name>]*
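A minimal input_list sketch using the "%" form, where the output tensor name (probs) and raw input paths are hypothetical placeholders for your model:

```shell
# Write a minimal input_list for snpe-dlc-graph-prepare. The output
# tensor name and raw file paths are hypothetical placeholders.
set -eu
cat > input_list.txt <<'EOF'
%probs
inputs/img0.raw
inputs/img1.raw
EOF
cat input_list.txt
```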
To specify the Output Tensor Names at runtime:
For snpe-net-run, pass the names using the --set_output_tensors argument on the command line. Syntax: [ --set_output_tensors=<val> ], where <val> is a comma-separated list of tensors to be output after execution.
When using the API, call Snpe_SNPEBuilder_SetOutputTensors() (C API) / SNPEBuilder::setOutputTensors() (C++ API) to specify the same output tensor names.
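For example, a hypothetical snpe-net-run invocation naming two output tensors (probs and softmax_out are placeholders), echoed as a dry run:

```shell
# Dry-run sketch: name the same output tensors at runtime that the
# cache was prepared with (tensor and file names are hypothetical).
set -eu
echo "snpe-net-run --container inception_v3_quantized_cache.dlc" \
     "--input_list target_input_list.txt --set_output_tensors=probs,softmax_out" \
  | tee netrun_command.txt
```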
A cache record created for a particular SoC can run on another SoC. Such interoperability is governed by the VTCM size and the DSP architecture of the preparing and running SoCs. HTP offline cache compatibility follows these empirical rules:
A cache generated for a newer DSP architecture cannot run on an SoC with an older DSP architecture. For example, a cache record generated for a v69 device (such as sm8450) will not run on a v68 device (such as sm8350), even if the cache was prepared with 2MB VTCM.
For the same DSP architecture, a cache prepared for one SoC can run on another SoC if the VTCM size it was prepared with is less than or equal to the VTCM size of the running SoC.
A cache generated for a v68 or v69 device will not run on a v73 device.
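The three rules above can be sketched as a small compatibility check. cache_compatible is a hypothetical helper, not an SDK function; the DSP architecture is given as 68/69/73 and VTCM sizes in MB.

```shell
# Hypothetical helper encoding the empirical rules above; not part of
# the SDK. Arch is 68/69/73, VTCM sizes are in MB.
set -eu
cache_compatible() {
  prep_arch=$1; prep_vtcm=$2; run_arch=$3; run_vtcm=$4
  [ "$run_arch" -ge "$prep_arch" ] || return 1   # newer-arch cache cannot run on an older arch
  if [ "$run_arch" -eq 73 ] && [ "$prep_arch" -lt 73 ]; then
    return 1                                     # v68/v69 caches do not run on v73
  fi
  [ "$prep_vtcm" -le "$run_vtcm" ]               # prepared VTCM must fit the running SoC
}
{
  cache_compatible 69 2 68 4 && echo "v69 cache on v68: yes" || echo "v69 cache on v68: no"
  cache_compatible 68 2 68 8 && echo "v68 2MB cache on v68 8MB: yes" || echo "v68 2MB cache on v68 8MB: no"
  cache_compatible 68 2 73 8 && echo "v68 cache on v73: yes" || echo "v68 cache on v73: no"
} > compat_results.txt
cat compat_results.txt
```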
The --optimization_level command line option of the snpe-dlc-graph-prepare tool has inherent tradeoffs and some non-deterministic behavior:
The default optimization level is 2. Higher optimization levels incur longer offline prepare time but yield a more optimal graph, and hence faster execution, for most graphs.
Level 3 should produce a more optimal graph in most cases, but can also result in a less optimal graph in some cases.
Level 3 may yield a larger HTP offline cache record, which can degrade initialization time.
Offline graph prepare using snpe-dlc-quantize will be deprecated in the future; currently, snpe-dlc-quantize is kept to support the legacy workflow. It is recommended to migrate to snpe-dlc-graph-prepare for offline HTP graph cache blob preparation.
Note that Output Tensor Names are not supported on the AIP runtime for legacy HTA SoCs.
It is possible to cache resized networks by using the --input_name and --input_dimensions arguments or the Snpe_SNPEBuilder_SetInputDimensions API. Cache records are sensitive to the set of input dimensions they were prepared with. Multiple cache records with the same record identifier may coexist if they were prepared with differing input dimensions. During execution, a cache record is used only if the input dimensions at execution match those used to prepare it. This also applies to online prepare, via both the net-run arguments (--input_name and --input_dimensions) and the input tensor resize API (Snpe_SNPEBuilder_SetInputDimensions).
For example, assume a hypothetical network with one input whose original dimensions are 1x3x4x5. If the user resizes this input to 2x3x4x5 during cache preparation and attempts to subsequently run inference without also resizing that input to 2x3x4x5, then this otherwise compatible cache record will be rejected on the grounds of mismatching input dimensions.
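The selection behavior can be sketched with a hypothetical check: a record is selected only on an exact dimension match, otherwise execution falls back to online prepare.

```shell
# Hypothetical sketch of cache-record selection by input dimensions;
# not SDK code. The record was prepared with the input resized to 2x3x4x5.
set -eu
prepared_dims="2x3x4x5"
record_usable() { [ "$1" = "$prepared_dims" ]; }
{
  record_usable "2x3x4x5" && echo "2x3x4x5: cache used" || echo "2x3x4x5: rejected"
  record_usable "1x3x4x5" && echo "1x3x4x5: cache used" || echo "1x3x4x5: rejected, falls back to online prepare"
} > dims_results.txt
cat dims_results.txt
```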