Acceleration Support¶
Qualcomm® AI Engine Direct Delegate provides acceleration on Qualcomm platforms using the Qualcomm® AI Engine Direct SDK. The following sections describe the operators and features Qualcomm® AI Engine Direct Delegate supports.
GPU¶
In general, the Qualcomm® AI Engine Direct GPU backend supports float32 and float16 operators and
activations. The compute precision used by the accelerator core for activations and
operators is configurable; see the TfLiteQnnDelegateGpuBackendOptions structure.
For TFLite operator support, see Supported Operators. The library for
this backend is libQnnGpu.so.
There is one option to set the accelerator's performance mode; see
TfLiteQnnDelegateGpuBackendOptions for all of the option enums.
The performance levels, ordered from high to low, are
High > Normal > Low. Default mode follows the GPU backend's default setting.
HTP¶
In general, the Qualcomm® AI Engine Direct HTP backend supports quantized 8-bit fixed-point activations and operators. For TFLite operator support, see Supported Operators; in addition, the table under Operator Restrictions lists per-operator restrictions. The libraries for this backend are libQnnHtp.so, libQnnHtpPrepare.so, libQnnHtp*Stub.so, and libQnnHtp*Skel.so.
There is one option to set the accelerator's performance mode; see
TfLiteQnnDelegateHtpBackendOptions for all of the option enums.
When performance_mode is set, the delegate votes for the given
performance level at initialization. Users can change the vote on the fly
through the C APIs; see C Interface. Alternatively, the delegate supports
another strategy, kHtpPerfCtrlAuto, which votes automatically during
inference and returns to a relaxed vote after the inference has completed.
The performance levels, ordered from high to low, are:
Burst > Sustained High Performance > High Performance > Balanced > Low Balanced
> High Power Saver > Power Saver > Low Power Saver.
Power consumption follows the opposite order. The one exception is
Default mode, which means no input vote from the client;
HTP automatically votes for the optimal modes itself.
| Performance mode | Corner on up vote |
|---|---|
| Default | |
| Sustained High Performance | |
| Burst | |
| High Performance | |
| Power Saver | |
| Low Power Saver | |
| High Power Saver | |
| Low Balanced | |
| Balanced | |
On certain SoCs, the Qualcomm® AI Engine Direct HTP backend supports 16-bit floating-point precision.
This can be enabled by setting the
TfLiteQnnDelegateHtpBackendOptions.precision option to
TfLiteQnnDelegateHtpPrecision.kHtpFp16, and it requires the
.tflite model to have tensors with floating-point precision. Note that fp32 models
can still be delegated, but the underlying math is performed in 16-bit precision.
kHtpFp16 is supported only on a limited set of chips; at this moment, Snapdragon 8 Gen 1 and newer Snapdragon 8 generations support it.
DSP¶
The Qualcomm® AI Engine Direct DSP backend supports legacy chipsets with the Hexagon DSP hardware, as opposed to the newer HTP hardware. The DSP backend only supports quantized uint8 activations and operators, and Qualcomm® AI Engine Direct Delegate supports only the V66 generation of the Qualcomm® AI Engine Direct DSP. For TFLite operator support, see Supported Operators. The libraries for this backend are libQnnDspV66Stub.so and libQnnDspV66Skel.so.
For dsp_performance_mode, the voting options and performance ordering are the same as for htp_performance_mode; see HTP.
Note that the DSP backend reports per-operator profiling events in cycles, whereas the other backends may report the units as time measurements (microseconds).
Supported Operators¶
The following table shows the supported TFLite operators. See Operator Restrictions for limitations and restrictions of operators supported by this delegate.
| Operators |
|---|
| Abs |
| Add |
| AddN |
| ArgMax |
| ArgMin |
| AveragePool2d |
| BatchMatMul |
| BatchToSpaceNd |
| Broadcast_to |
| Cast |
| Ceil |
| Concatenation |
| Conv2d |
| Conv3d |
| Conv3dTranspose |
| Cos |
| Cumsum |
| DepthToSpace |
| DepthwiseConv2d |
| Dequantize |
| DetectionPostprocess |
| Div |
| Elu |
| Exp |
| EmbeddingLookup |
| ExpandDims |
| Equal |
| Floor |
| FullyConnected |
| Gather |
| GatherNd |
| Gelu |
| Greater |
| GreaterEqual |
| HardSwish |
| L2Normalization |
| L2Pool2d |
| LeakyRelu |
| Less |
| LessEqual |
| LocalResponseNormalization |
| Log |
| LogicalAnd |
| LogicalNot |
| LogicalOr |
| Logistic |
| LogSoftmax |
| Lstm |
| MaxPool2d |
| Maximum |
| Mean |
| Minimum |
| MirrorPad |
| Mul |
| Neg |
| NotEqual |
| OneHot |
| Pack |
| Pad |
| PadV2 |
| Pow |
| Prelu |
| Quantize |
| ReduceMax |
| ReduceMin |
| ReduceProd |
| Relu |
| Relu0To1 |
| Relu6 |
| ReluN1To1 |
| Reshape |
| ResizeBilinear |
| ResizeNearestNeighbor |
| ReverseV2 |
| Round |
| Rsqrt |
| ScatterNd |
| SegmentSum |
| Select |
| SelectV2 |
| Sin |
| Slice |
| Softmax |
| SpaceToBatchNd |
| SpaceToDepth |
| Split |
| SplitV |
| Sqrt |
| Square |
| SquaredDifference |
| Squeeze |
| StridedSlice |
| Sub |
| Sum |
| Tanh |
| Tile |
| TopkV2 |
| Transpose |
| TransposeConv |
| Unpack |
Operator Restrictions¶
The following table lists any operator restrictions imposed by the delegate. All other operator restrictions are determined at runtime by the Qualcomm® AI Engine Direct backend. See the Qualcomm® AI Engine Direct SDK documentation for backend specific limitations and restrictions.
| Operators | Restriction |
|---|---|
| AddN | |
| ArgMax | |
| ArgMin | |
| BatchToSpaceNd | |
| Ceil | |
| Conv3d | |
| Conv3dTranspose | |
| Cos | |
| Cumsum | |
| Elu | |
| ExpandDims | |
| GatherNd | |
| L2Pool2d | |
| LeakyRelu | |
| Mean | |
| OneHot | |
| Pad | |
| PadV2 | |
| Relu0To1 | |
| ReverseV2 | |
| Round | |
| ScatterNd | |
| SegmentSum | |
| Slice | |
| SpaceToBatchNd | |
| Split | |
| StridedSlice | |
| Sum | |
| Tile | |
| Transpose | |