QNN HTP Op Package - Relu Op Example

Overview

This document outlines how to write ops in a QNN HTP op package using a basic Relu op as an example. It walks through the procedures for writing op implementations, registering ops, defining optimization rules, and registering optimization rules. The source code for this example is located at examples/OpPackage/HTP/ExampleOpPackageRelu.cpp in the QNN SDK.

For detailed descriptions of writing op implementations, defining optimization rules, and specifying op parameter orders, please read implementing_ops.html. In addition, optimization_grammar.html provides more information on defining optimization rules.

Writing Relu Op

ExampleOpPackageRelu.cpp contains a standard Relu op, a variation of the Relu op called ReluMinMax which clips data to a specified range, and another variation called ReluTableGen which can be used with tableLookup. The standard Relu op demonstrates the basics of a reference op implementation, for example, tensor reading and writing. The ReluMinMax op provides an optimized implementation that relies heavily on HVX. ReluTableGen serves as a faster alternative to the Relu op; it leverages a lookup table to achieve fast lookups. In addition, many optimization rules are associated with Relu; these rules convert, split, and optimize the graph around Relu so that the best performance can be achieved.

This document focuses only on the standard relu op and some of its basic optimization rules.

Op Implementation Function

/*
 * @brief                  implementation of relu op
 *
 * @param[out] out         output HTP tensor
 *
 * @param[in] in           input HTP tensor
 *
 * @return GraphStatus     error code
 */
template <typename T_Ttype>
GraphStatus reluImpl(T_Ttype &out, const T_Ttype &in);

This example uses a function template, and the template parameter takes an HTP core tensor type. The op implementation function's parameter list consists of a series of HTP core tensors in the following order: outputs, inputs, parameters. Input tensors and parameter tensors shall be marked as const. Please note that, in implementation functions, there is no separation between input tensors and parameters. Also, both QNN scalar and tensor parameters are converted into HTP core tensors. In addition, HTP core tensors are always 4-dimensional, and the layout is always bhwc. QNN tensors with lower dimensions are backfilled into 4-dimensional HTP core tensors. Op implementation functions shall return GraphStatus, which is an enum defined in include/HTP/core/graph_status.h in the QNN SDK.
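The backfilling of lower-rank QNN tensors can be illustrated with a small stand-alone sketch. Note this is an illustration only: the pad4 helper below is invented for this document, not an SDK API, and it assumes the missing leading dimensions are filled with 1.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an SDK function) illustrating how a
// lower-rank shape could be backfilled into a fixed 4-dimensional
// (b, h, w, c) shape by padding the leading dimensions with 1.
static std::array<std::size_t, 4> pad4(const std::vector<std::size_t> &dims) {
  std::array<std::size_t, 4> out = {1, 1, 1, 1};
  // Copy the given dimensions into the trailing slots.
  std::size_t offset = 4 - dims.size();
  for (std::size_t i = 0; i < dims.size(); ++i) out[offset + i] = dims[i];
  return out;
}
```

For instance, under this assumption a rank-2 shape [5, 3] would become [1, 1, 5, 3].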

HTP Core Tensor Types

HTP core has a base tensor type Tensor and a number of ConcreteTensor types. ConcreteTensor types are derived from the base Tensor, and each ConcreteTensor type has a fixed rank, memory layout, and data type. The base Tensor can be used in generic op implementations and serves as a fallback option. ConcreteTensor types can be used to specialize op implementations for faster performance.

For a list of HTP core tensor types supported in op package, please refer to AllTensors defined in include/HTP/core/template_help_tensor_ext.h in QNN SDK. For details about HTP core tensors’ usage and their accessor functions, please refer to include/HTP/core/tensor.h.

More descriptions about HTP memory layouts and tensors can be found in tensors_and_memory_layout.html.

Relu Op Implementation

The functionality of relu op is as follows:

f(x) = max(x, 0)

The implementation is as follows:

template <typename T_Ttype>
GraphStatus reluImpl(T_Ttype &out, const T_Ttype &in) {
  out.set_dims(in);  // sets output tensor dimensions to match the input tensor
  // loop through each input and output tensor element via the four dimensions
  for (Idx b = 0; b < in.dim(0); b++) {
    for (Idx h = 0; h < in.dim(1); h++) {
      for (Idx w = 0; w < in.dim(2); w++) {
        for (Idx d = 0; d < in.dim(3); d++) {
          // read the input tensor element located at coordinates (b, h, w, d)
          float inval     = in(b, h, w, d);
          // assign max(inval, 0) to the output tensor at coordinates (b, h, w, d)
          out(b, h, w, d) = fmaxf(inval, 0.0f);
        }
      }
    }
  }
  return GraphStatus::Success;
}
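To see the loop structure in action outside the SDK, the same logic can be exercised against a toy stand-in for the HTP tensor interface. The Toy4dTensor type below is invented for illustration and mimics only the accessors the loop uses (set_dims, dim, and element access); it is not part of the QNN SDK.

```cpp
#include <cmath>
#include <vector>

using Idx = long;  // stand-in for the SDK's index type

// Minimal toy tensor (NOT an SDK type) mimicking the accessors used
// by reluImpl: set_dims(), dim(i), and operator()(b, h, w, d).
struct Toy4dTensor {
  Idx dims[4];
  std::vector<float> data;
  Toy4dTensor(Idx b, Idx h, Idx w, Idx d)
      : dims{b, h, w, d},
        data(static_cast<std::size_t>(b * h * w * d), 0.0f) {}
  void set_dims(const Toy4dTensor &other) {
    for (int i = 0; i < 4; ++i) dims[i] = other.dims[i];
    data.assign(other.data.size(), 0.0f);
  }
  Idx dim(int i) const { return dims[i]; }
  // Flat row-major (b, h, w, d) indexing.
  float &operator()(Idx b, Idx h, Idx w, Idx d) {
    return data[static_cast<std::size_t>(
        ((b * dims[1] + h) * dims[2] + w) * dims[3] + d)];
  }
  float operator()(Idx b, Idx h, Idx w, Idx d) const {
    return data[static_cast<std::size_t>(
        ((b * dims[1] + h) * dims[2] + w) * dims[3] + d)];
  }
};

// Same loop structure as reluImpl above; returns 0 on success.
template <typename T_Ttype>
int toyReluImpl(T_Ttype &out, const T_Ttype &in) {
  out.set_dims(in);
  for (Idx b = 0; b < in.dim(0); b++)
    for (Idx h = 0; h < in.dim(1); h++)
      for (Idx w = 0; w < in.dim(2); w++)
        for (Idx d = 0; d < in.dim(3); d++)
          out(b, h, w, d) = fmaxf(in(b, h, w, d), 0.0f);
  return 0;
}
```

Negative elements come out as zero and non-negative elements pass through unchanged, matching f(x) = max(x, 0).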

Op Registration

Op implementation functions need to be registered with an op name, op cost and flags. Op registration can be achieved using HTP core macros listed below, and these macros should be placed in global scope in individual op implementation source files.

Method 1

Registration with default cost value (i.e. GLACIAL) and default flag (Flags::RESOURCE_HVX)

Syntax

/*
 * F  - op implementation function
 *
 * OP - op name
 */
DEF_PACKAGE_OP(F,OP)

Example

DEF_PACKAGE_OP((reluImpl<Tensor>), "Relu")

Method 2

Registration with a user-specified cost value and flags.

Syntax

/*
 * F    - op implementation function
 *
 * OP   - op name
 *
 * COST - pre-defined cost value names, one of GLACIAL, SNAIL, FAST, FREE
 *        (listed in descending order of value).
 *        Op implementation with relatively lower cost will be chosen given all
 *        other criteria are met.
 *
 * ...  - zero or more flags; available flags include IS_CONST, INHIBIT_CONST_PROP,
 *        RESOURCE_HVX.
 *        IS_CONST marks that an op should be treated as a constant op.
 *        INHIBIT_CONST_PROP marks that an op should not participate in constant propagation.
 *        RESOURCE_HVX marks that this op will use HVX resources.
 */
DEF_PACKAGE_OP_AND_COST_AND_FLAGS(F,OP,COST,...)

Example

DEF_PACKAGE_OP_AND_COST_AND_FLAGS((reluImpl<PlainFloatTensor>), "Relu", SNAIL, Flags::RESOURCE_HVX)

Method 3

Registration with a user-specified cost function and flags (not shown in the relu op example).

Syntax

/*
 * F      - op implementation function
 *
 * OP     - op name
 *
 * COST_F - user defined cost function
 *          cost function pointer type: typedef float (*cost_function) (const Op * op);
 *          Op implementation with relatively lower cost will be chosen given all
 *          other criteria are met.
 *
 * ...    - zero or more flags; available flags include IS_CONST, INHIBIT_CONST_PROP,
 *          RESOURCE_HVX.
 *          IS_CONST marks that an op should be treated as a constant op.
 *          INHIBIT_CONST_PROP marks that an op should not participate in constant propagation.
 *          RESOURCE_HVX marks that this op will use HVX resources.
 */
DEF_PACKAGE_OP_AND_COST_F_AND_FLAGS(F,OP,COST_F,...)

Example

float reluCost(const Op *op) {
  // can use some properties of an op to determine cost
  return 0.0;
}

DEF_PACKAGE_OP_AND_COST_F_AND_FLAGS((reluImpl<PlainFloatTensor>), "Relu", reluCost, Flags::RESOURCE_HVX)

Defining Optimization Rules

Optimization rules are intended for graph-level transformations and are applied in passes during graph preparation in QNN context finalization. The Relu optimization rules contain examples of splitting an op into smaller chunks, moving data to and from VTCM, converting one op to a more optimized op, and more. These rules transform the graph around Relu in order to get the best performance.

Optimization rules can be defined with an HTP core macro listed below, and this macro should be placed in global scope in individual op implementation source files.

Syntax

/*
 * PRIORITY       - optimization pass priority, smaller number means getting applied earlier
 *                  predefined values include EARLY(2000), MIDDLE(3000), LATE(4000)
 *
 * MATCHCODE      - matching pattern for transformation to occur
 *
 * CONSTRAINTCODE - constraints which limits the conditions for transformation to occur
 *
 * REPLACECODE    - transformed pattern which replaces the original matching pattern
 */
DEF_PACKAGE_OPTIMIZATION(PRIORITY,MATCHCODE,CONSTRAINTCODE,REPLACECODE)

Example

DEF_PACKAGE_OPTIMIZATION(
    EARLY,
    Op("Relu", "X"),
    IS_QUANT_TYPE("X"),
    Op("ReluMinMax", "X", gen_ConstScalar_f32(0.0f), gen_ConstScalar_f32(INF)))

In this example, the optimization rule is applied at the EARLY optimization pass during graph finalization. The pattern this rule matches is a “Relu” op with one input, which we temporarily call “X”. The constraint for this rule is that the input data must be quantized. If, at the EARLY optimization pass, there is a “Relu” op with one quantized input in the graph, this “Relu” op will be converted to a “ReluMinMax” op with three inputs: the first input is “X”, and the next two inputs are 0.0f and float32 infinity.

HTP core provides some replacement functions and constraint macros for op packages to use. For more information about optimization rules, please refer to optimization_grammar.html.

Next Steps

This is a basic yet helpful example that outlines how to write an op.

Please continue to read implementing_ops.html, and read the Relu, Max Pool, and Softmax example files to view the code in more detail.