Acceleration Support

Qualcomm® AI Engine Direct Delegate provides acceleration on Qualcomm platforms using the Qualcomm® AI Engine Direct SDK. The following sections describe the operators and features Qualcomm® AI Engine Direct Delegate supports.

GPU

In general, the Qualcomm® AI Engine Direct GPU backend supports float32 and float16 operators and activations. The activation and operator compute precisions used by the accelerator core can be configured through the TfLiteQnnDelegateGpuBackendOptions structure. For TFLite operator support, see Supported Operators. The library for this backend is libQnnGpu.so.

There is an option to set the accelerator’s performance mode; see TfLiteQnnDelegateGpuBackendOptions for all the option enums. The performance levels, from high to low, are: High > Normal > Low. Default mode follows the GPU backend’s default setting.
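As a sketch of how these options might be wired up (the header name, option fields, and enumerator spellings below are assumptions based on the structures named above, not verified API; consult the delegate header shipped with your SDK):

```cpp
// Hypothetical sketch: select the GPU backend and set its precision and
// performance mode. Field and enumerator names are assumed, not verified.
#include "QnnTFLiteDelegate.h"  // assumed header name

TfLiteDelegate* MakeGpuDelegate() {
  // Start from the delegate's default options.
  TfLiteQnnDelegateOptions options = TfLiteQnnDelegateOptionsDefault();
  options.backend_type = kGpuBackend;               // backed by libQnnGpu.so
  options.gpu_options.precision = kGpuFp16;         // fp16 compute (assumed name)
  options.gpu_options.performance_mode = kGpuHigh;  // High > Normal > Low
  return TfLiteQnnDelegateCreate(&options);
}
```

The returned delegate is then passed to the TFLite interpreter (for example via ModifyGraphWithDelegate) like any other delegate.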

HTP

In general, the Qualcomm® AI Engine Direct HTP backend supports quantized 8-bit fixed-point activations and operators. For TFLite operator support, see Supported Operators; in addition, the following table lists operator restrictions. The libraries for this backend are libQnnHtp.so, libQnnHtpPrepare.so, libQnnHtp*Stub.so, and libQnnHtp*Skel.so.

There is an option to set the accelerator’s performance mode; see TfLiteQnnDelegateHtpBackendOptions for all the option enums. When performance_mode is set, the delegate votes for the given performance level at initialization. Users can change the vote on the fly through the C APIs; see C Interface. Alternatively, the delegate supports a strategy called kHtpPerfCtrlAuto, which votes automatically during each inference and returns to a relaxed vote after the inference completes. The performance levels, from high to low, are: Burst > Sustained High Performance > High Performance > Balanced > Low Balanced > High Power Saver > Power Saver > Low Power Saver. Power consumption follows the opposite order. The one exception is Default mode, which submits no vote from the client and lets HTP choose the optimal modes automatically.

Performance mode

Corner voted for

Burst

  • Votes for the TURBO Plus corner.

Sustained High Performance

  • Sustains a TURBO corner vote.

High Performance

  • Votes for the TURBO corner.

Balanced

  • Votes for the NOMINAL Plus corner.

Low Balanced

  • Votes for the NOMINAL corner.

High Power Saver

  • Votes for the SVS Plus corner.

Power Saver

  • Votes for the SVS corner.

Low Power Saver

  • Votes for the SVS2 corner.

Default

  • Does not perform any specific voting.

On certain SoCs, the Qualcomm® AI Engine Direct HTP backend supports 16-bit floating-point precision. Enable it by setting the TfLiteQnnDelegateHtpBackendOptions.precision option to TfLiteQnnDelegateHtpPrecision.kHtpFp16. This requires the .tflite model to have floating-point tensors. Note that fp32 models can still be delegated, but the underlying math is performed in 16-bit precision.

Note that kHtpFp16 is supported only on a limited set of chips. Currently, Snapdragon 8 Gen 1 and newer Snapdragon 8-series chipsets support kHtpFp16.
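A minimal sketch of enabling fp16 on HTP, assuming the option and enumerator names given above (the surrounding function and header names are illustrative, not verified API):

```cpp
// Hypothetical sketch: run a float .tflite model on HTP with fp16 math.
#include "QnnTFLiteDelegate.h"  // assumed header name

TfLiteDelegate* MakeHtpFp16Delegate() {
  TfLiteQnnDelegateOptions options = TfLiteQnnDelegateOptionsDefault();
  options.backend_type = kHtpBackend;        // backed by libQnnHtp.so
  options.htp_options.precision = kHtpFp16;  // fp32 tensors, 16-bit math
  return TfLiteQnnDelegateCreate(&options);
}
```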

DSP

The Qualcomm® AI Engine Direct DSP backend supports legacy chipsets with Hexagon DSP hardware, as opposed to the newer HTP hardware. The DSP backend supports only quantized uint8 activations and operators, and the delegate supports only the V66 generation of the DSP. For TFLite operator support, see Supported Operators. The libraries for this backend are libQnnDspV66Stub.so and libQnnDspV66Skel.so.

dsp_performance_mode uses the same voting options and performance order as htp_performance_mode; see HTP.

Note that the DSP backend reports per-operator profiling events in clock cycles, whereas the other backends report time measurements (microseconds).

Supported Operators

The following table shows the supported TFLite operators. See Operator Restrictions for limitations and restrictions of operators supported by this delegate.

Operators

Abs

Add

AddN

ArgMax

ArgMin

AveragePool2d

BatchMatMul

BatchToSpaceNd

Broadcast_to

Cast

Ceil

Concatenation

Conv2d

Conv3d

Conv3dTranspose

Cos

Cumsum

DepthToSpace

DepthwiseConv2d

Dequantize

DetectionPostprocess

Div

Elu

Exp

EmbeddingLookup

ExpandDims

Equal

Floor

FullyConnected

Gather

GatherNd

Gelu

Greater

GreaterEqual

HardSwish

L2Normalization

L2Pool2d

LeakyRelu

Less

LessEqual

LocalResponseNormalization

Log

LogicalAnd

LogicalNot

LogicalOr

Logistic

LogSoftmax

Lstm

MaxPool2d

Maximum

Mean

Minimum

MirrorPad

Mul

Neg

NotEqual

OneHot

Pack

Pad

Padv2

Pow

Prelu

Quantize

ReduceMax

ReduceMin

ReduceProd

Relu

Relu0To1

Relu6

ReluN1To1

Reshape

ResizeBilinear

ResizeNearestNeighbor

ReverseV2

Round

Rsqrt

ScatterNd

SegmentSum

Select

SelectV2

Sin

Slice

Softmax

SpaceToBatchNd

SpaceToDepth

Split

SplitV

Sqrt

Square

SquaredDifference

Squeeze

StridedSlice

Sub

Sum

Tanh

Tile

TopkV2

Transpose

TransposeConv

Unpack

Operator Restrictions

The following table lists any operator restrictions imposed by the delegate. All other operator restrictions are determined at runtime by the Qualcomm® AI Engine Direct backend. See the Qualcomm® AI Engine Direct SDK documentation for backend specific limitations and restrictions.

Operators

Restriction

AddN

  • Inputs can only be float32 or int32

ArgMax

  • axis tensor must be constant

ArgMin

  • axis tensor must be constant

BatchToSpaceNd

  • block shape tensor must be constant

  • crops tensor must be constant

Ceil

  • in[0]: currently supported only by the HTP FP16 backend

Conv3d

  • currently supported only by the HTP FP16 backend

  • input, in[0]: float32

  • filter, in[1]: float32

  • bias, in[2]: same as input type

Conv3dTranspose

  • currently supported only by the HTP FP16 backend

  • filter, in[1]: must be constant

  • bias, in[2]: must be given

Cos

  • in[0]: supports float32

Cumsum

  • currently supported only by the HTP backend

Elu

  • alpha=1, in[0]: only float32 and int8 supported

ExpandDims

  • axis tensor must be constant

GatherNd

  • currently supported only by the HTP/DSP backends

  • params, in[0]: int32/uint8/int8

  • indices, in[1]: int32

L2Pool2d

  • in[0]: only float32 supported

LeakyRelu

  • alpha range must be in [0, 1]

Mean

  • axis tensor must be constant

OneHot

  • currently supported only by the HTP/HTP FP16 backends

  • in[0]: int32, must be in range [0, depth-1]

  • out[0], on_value, off_value: float32; uint8/int8 must be quantized

  • depth, on_value, and off_value must be static tensors

Pad

  • paddings tensor must be constant

  • constant values tensor must be constant

PadV2

  • paddings tensor must be constant

  • constant values tensor must be constant

  • in[0]: supports uint8, int8

Relu0To1

  • in[0]: supports uint8, int8, and float32

ReverseV2

  • axis must be INT32 constant tensor with only 1 element.

Round

  • input, in[0]: float32 (limited by the TFLite op definition); supported by the GPU/GPU FP16 backends

ScatterNd

  • currently supported only by the HTP/HTP FP16 backends

  • indices, in[0]: int32

  • updates, in[1]: float32/uint32

  • shape, in[2]: int32, must be static (constant)

SegmentSum

  • currently supported only by the HTP/HTP FP16 backends

  • input, in[0]: float32/int32, must be static (constant)

  • segment ids, in[1]: int32, must be a 1D tensor and static (constant)

Slice

  • begin and size tensors must be constant

SpaceToBatchNd

  • block shape tensor must be constant

  • paddings tensor must be constant

Split

  • axis tensor must be constant

StridedSlice

  • begin, end, and strides tensors must be constant

  • ellipsis is not supported

Sum

  • axis tensor must be constant

Tile

  • axis tensor must be constant

Transpose

  • axis tensor must be constant