Implementing Ops¶
Overview¶
This document describes implementing ops in the context of a QNN HTP op package in detail. The following sections are covered:
Writing a New Op¶
The internal representation of a QNN graph is a sequence of OpDef
nodes that represent the entire graph. Some OpDef nodes represent
graph calculations, with one or more inputs; the operation is determined
by the name given to the OpDef. An OpDef class expresses the
definition of an operation and lists the inputs to the op (InputDef)
and the output characteristics of the op (OutputDef).
An OpDef can also represent const data (parameters,
weights, etc.) or shapes (which carry no data but still need a
means to be expressible as a shape).
The graph preparation phase starts after the graph construction phase.
During the prepare stage, a set of optimization rules is applied based
on priority, causing transformations to be made to the graph: parts
of the graph are removed and replaced by other arrangements of OpDefs.
The rules are organized into passes. Each pass attempts to match a
sequence of nodes and applies the transformation on the sequence if the
constraint is satisfied.
Users need to set a priority number for each optimization rule to indicate
the order in which optimizations should be applied. All the OpDefs in the
graph are looped through and checked against the optimization rules. The
loop ends when none of the OpDefs in the graph can be optimized by any
rule in the current pass. Adjustments and replacements to an OpDef are
applied only when possible.
Steps for Implementing an Op¶
Writing an op is a five-step process; the final two steps are optional.
Step 1: Op Implementation¶
User needs to implement the functionality of the op. Here is a simple example of element-wise addition:
GraphStatus elementwise_add(Tensor &out, const Tensor &in_0, const Tensor &in_1)
{
    out.set_dims(in_0);
    auto [b_in, h_in, w_in, d_in] = in_0.dims();
    for (Idx b = 0; b < b_in; b++) {
        for (Idx h = 0; h < h_in; h++) {
            for (Idx w = 0; w < w_in; w++) {
                for (Idx d = 0; d < d_in; d++) {
                    out(b, h, w, d) = in_0(b, h, w, d) + in_1(b, h, w, d);
                }
            }
        }
    }
    return GraphStatus::Success;
}
This is just a basic reference op implementation - more improvements are provided later in the doc. Let’s look at some of the details.
The op implementation function parameter list consists of a series of
HTP core tensors in the following order: outputs, inputs, parameters. Input tensors
and parameter tensors shall be marked as const. Please note that in implementation
functions there is no separation between input tensors and parameters; they are
both considered inputs in HTP core. Also, both
QNN scalar and tensor parameters are converted into HTP core tensors. In addition,
HTP core tensors are always 4-dimensional, and the layout is always bhwc. QNN tensors
with lower dimensions are backfilled into 4-dimensional HTP core tensors. Op
implementation functions shall return GraphStatus, which is an enum defined in
include/HTP/core/graph_status.h in the QNN SDK.
HTP core has a base tensor type Tensor and a number of ConcreteTensor types.
ConcreteTensor types are derived from the base Tensor, and each ConcreteTensor
type has a fixed rank, memory layout, and data type. The base Tensor can be used
in generic op implementations and serves as a fallback option. ConcreteTensor
types can be used to specialize op implementations for better performance.
In this implementation there are no parameters, and the generic Tensor is used for
both inputs and the output. Users can access elements from these generic tensors using
parentheses. Regardless of the underlying types of the tensors, the element access
interface type is float.
There is a downside to this generic approach: it is slow! Every element access needs to call some functions to find a location, and another set of functions to decode/encode the value.
Fortunately, if more visibility is provided to the compiler about the nature of the tensors, this overhead can be greatly reduced.
GraphStatus elementwise_add_faster(PlainFloatTensor &out, const PlainFloatTensor &in_0, const PlainFloatTensor &in_1)
{
    out.set_dims(in_0);
    auto [b_in, h_in, w_in, d_in] = in_0.dims();
    for (Idx b = 0; b < b_in; b++) {
        for (Idx h = 0; h < h_in; h++) {
            for (Idx w = 0; w < w_in; w++) {
                for (Idx d = 0; d < d_in; d++) {
                    out(b, h, w, d) = in_0(b, h, w, d) + in_1(b, h, w, d);
                }
            }
        }
    }
    return GraphStatus::Success;
}
Here is a slightly optimized version using the ConcreteTensor type
PlainFloatTensor, a tensor holding float
values with a flat memory layout. The compiler now has the visibility to
eliminate the function calls for accessing and decoding each element,
so this implementation is more efficient than the previous
one.
For a list of HTP core tensor types and their accessor functions, please refer to include/HTP/core/tensor.h in QNN SDK.
More descriptions about HTP memory layouts and tensors can be found in tensors_and_memory_layout.html.
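The access-overhead difference between generic and concrete tensors can be sketched in stand-alone C++. The types below are illustrative stand-ins, not the SDK's Tensor and PlainFloatTensor:

```cpp
#include <cstddef>

// Stand-in for a generic tensor: every element access goes through a
// virtual call, which the compiler cannot inline or vectorize across.
struct GenericTensor {
    virtual float get(std::size_t i) const = 0;
    virtual void  set(std::size_t i, float v) = 0;
    virtual ~GenericTensor() = default;
};

// Stand-in for a concrete flat float tensor: direct, inlinable indexing.
struct FlatFloatTensor : GenericTensor {
    float data[16] = {};
    float get(std::size_t i) const override { return data[i]; }
    void  set(std::size_t i, float v) override { data[i] = v; }
    float &operator[](std::size_t i) { return data[i]; }
    float  operator[](std::size_t i) const { return data[i]; }
};

// Generic add: each iteration pays several indirect calls.
void add_generic(GenericTensor &out, const GenericTensor &a,
                 const GenericTensor &b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) out.set(i, a.get(i) + b.get(i));
}

// Concrete add: the compiler sees flat memory and can vectorize the loop.
void add_concrete(FlatFloatTensor &out, const FlatFloatTensor &a,
                  const FlatFloatTensor &b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
```

Both functions compute the same result; the concrete version simply removes the indirection, which mirrors the Tensor vs PlainFloatTensor trade-off above.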
How does the infrastructure choose which op implementation to select? In this case, if
the input and output tensors of the node of this current op type are
all PlainFloatTensor, then both implementations can work, and the
implementation registered with the relatively lower cost will be selected.
If any of the input or output tensors of the node of this current
op type are not PlainFloatTensor, then the first implementation with
the generic Tensor type is the fallback option.
It is important to understand that reference ops can be called when the operands do not match the optimized implementation. Generic implementations can be desirable or can be implemented as always failing to catch problems in an input graph or with the optimization process.
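As a sketch of the "always failing" style of generic fallback, the block below uses stand-in GraphStatus values; the real enum lives in include/HTP/core/graph_status.h:

```cpp
// Stand-in for the SDK's GraphStatus enum (see
// include/HTP/core/graph_status.h for the real definition).
enum class GraphStatus { Success = 0, ErrorFatal = 1 };

// A deliberately failing generic fallback: if the optimizer ever selects
// this implementation, the graph contains an operand combination that the
// specialized implementations do not cover, so preparation should fail
// loudly instead of silently running a slow reference path.
template <typename TType>
GraphStatus elementwise_add_trap(TType & /*out*/, const TType & /*a*/,
                                 const TType & /*b*/)
{
    return GraphStatus::ErrorFatal;
}
```

Registering such a trap with a high cost keeps it from ever being preferred over a real implementation while still catching coverage gaps in the optimization process.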
The elementwise_add and elementwise_add_faster implementations are
nearly identical. These functions can be refactored
into one templatized function:
template<typename TType>
GraphStatus elementwise_add(TType &out, const TType &in_0, const TType &in_1)
{
    out.set_dims(in_0);
    auto [b_in, h_in, w_in, d_in] = in_0.dims();
    for (Idx b = 0; b < b_in; b++) {
        for (Idx h = 0; h < h_in; h++) {
            for (Idx w = 0; w < w_in; w++) {
                for (Idx d = 0; d < d_in; d++) {
                    out(b, h, w, d) = in_0(b, h, w, d) + in_1(b, h, w, d);
                }
            }
        }
    }
    return GraphStatus::Success;
}
To further optimize any op, HVX can be used to achieve parallel calculations.
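HVX intrinsics are target-specific, but a common prerequisite for vectorization (by HVX or by a compiler's auto-vectorizer) is giving the kernel one contiguous inner loop. A stand-alone sketch of the elementwise add flattened over a flat bhwc buffer (plain C++, not SDK code):

```cpp
#include <cstddef>

// Elementwise add over a flat bhwc buffer: a single loop over the total
// element count instead of four nested loops. One long contiguous loop is
// the shape that vector units (such as HVX) and auto-vectorizers handle
// best, since every iteration touches adjacent memory.
void elementwise_add_flat(float *out, const float *a, const float *b,
                          std::size_t batches, std::size_t height,
                          std::size_t width, std::size_t depth)
{
    const std::size_t total = batches * height * width * depth;
    for (std::size_t i = 0; i < total; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

From this form, the loop body can be rewritten with HVX vector intrinsics to process many elements per instruction.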
Step 2: Op Registration¶
Op implementation functions need to be registered with an op name, op cost and flags. Op registration can be achieved using HTP core macros listed below, and these macros should be placed in global scope in individual op implementation source files.
Method 1¶
Registration with default cost value (i.e. GLACIAL) and default flag (Flags::RESOURCE_HVX)
Syntax
/*
* F - op implementation function
*
* OP - op name
*/
DEF_PACKAGE_OP(F,OP)
Example
DEF_PACKAGE_OP(elementwise_add<Tensor>, "Add")
Method 2¶
Registration with user specified cost value and flags.
Syntax
/*
* F - op implementation function
*
* OP - op name
*
* COST - pre-defined cost value names, one of GLACIAL, SNAIL, FAST, FREE
* (listed in descending order of value).
* Op implementation with relatively lower cost will be chosen given all
* other criteria are met.
*
* ... - zero or more flags, available flags include IS_CONST, INHIBIT_CONST_PROP,
* RESOURCE_HVX.
IS_CONST is used to mark that an op should be treated as a constant op.
INHIBIT_CONST_PROP marks that an op should not participate in constant propagation.
RESOURCE_HVX marks that this op will use HVX resources.
*/
DEF_PACKAGE_OP_AND_COST_AND_FLAGS(F,OP,COST,...)
Example
DEF_PACKAGE_OP_AND_COST_AND_FLAGS (
elementwise_add<PlainFloatTensor>,
"Add",
SNAIL,
RESOURCE_HVX)
Method 3¶
Registration with user specified cost function and flags.
Syntax
/*
* F - op implementation function
*
* OP - op name
*
* COST_F - user defined cost function
* cost function pointer type: typedef float (*cost_function) (const Op * op);
* Op implementation with relatively lower cost will be chosen given all
* other criteria are met.
*
* ... - zero or more flags, available flags include IS_CONST, INHIBIT_CONST_PROP,
* RESOURCE_HVX.
IS_CONST is used to mark that an op should be treated as a constant op.
INHIBIT_CONST_PROP marks that an op should not participate in constant propagation.
RESOURCE_HVX marks that this op will use HVX resources.
*/
DEF_PACKAGE_OP_AND_COST_F_AND_FLAGS(F,OP,COST_F,...)
Example
float elementAddCost(const Op *op) {
// can use some properties of an op to determine cost
return 0.0;
}
DEF_PACKAGE_OP_AND_COST_F_AND_FLAGS (
elementwise_add<PlainFloatTensor>,
"Add",
elementAddCost,
RESOURCE_HVX)
Step 3: Specify the DEF_TENSOR_PROPERTIES for operators¶
Decision-making on the layout and memory placement of tensors can be centralized by specifying the requirements and constraints for operators in DEF_TENSOR_PROPERTIES. Here is an example:
DEF_TENSOR_PROPERTIES(Op("Argmax", "in", "axis"),
Flat("*", "axis"),
MainMemory("..."))
In this example, the first literal, “Argmax”, is the name of the operator and the remaining literals provide local names for the tensor inputs.
“*” refers to the (first) output tensor.
“in” identifies the first input tensor; it is not mentioned in any layout constraint and so may have either flat or crouton layout.
“axis” identifies the parameter tensor and is required to be in flat layout.
The ellipsis refers to “all tensors not yet constrained”, so in this case all tensors are constrained to be in main memory.
Constraint terms:¶
Flat: flat layout
Crouton: crouton layout
Tcm: in TCM
MainMemory: in main memory.
These are constraints, not exhaustive assignments: if a tensor is not mentioned for some property, the system will assign that property for it.
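Following the same pattern, a plausible (illustrative, not taken from the SDK) DEF_TENSOR_PROPERTIES for the elementwise Add op implemented earlier could pin everything to flat layout in main memory:

```cpp
// Illustrative only: both inputs and the output of "Add" in flat layout,
// and all tensors placed in main memory.
DEF_TENSOR_PROPERTIES(Op("Add", "in_0", "in_1"),
                      Flat("*", "in_0", "in_1"),
                      MainMemory("..."))
```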
Before and after TCM Migration¶
// DEF_OPT rules we have in the old way
DEF_PACKAGE_OPTIMIZATION(LATE+900,
Op("FastExampleOp", "in_0", "in_1", "in_2"),
OK,
Op("FastExampleOp",
Op(FROM_DEFAULT_PACKAGE("crouton_to_vtcm"), Op(FROM_DEFAULT_PACKAGE("ForceFormat_Crouton"), "in_0")),
Op(FROM_DEFAULT_PACKAGE("flat_to_vtcm"), Op(FROM_DEFAULT_PACKAGE("ForceFormat_Flat"),"in_1")),
"in_2"
)
)
Now instead of using flat/crouton_to_vtcm, we use DEF_TENSOR_PROPERTIES to control layout and placement
//The new way
DEF_TENSOR_PROPERTIES(Op("FastExampleOp","in0","in1","in2"),
Flat("*","in1","in2"),
Crouton("in0"),
MainMemory("in2"),
                      Tcm("in0", "in1"))
in0 should be in crouton layout and TCM memory
in1 should be in flat layout and TCM memory
To facilitate migration to the new way, a helper script is provided under examples/customer_migration_tool/customer_migration.py in the QNN SDK. More examples can be found under examples/QNN/OpPackage/HTP.
Step 4: Op Parameter Order Specification - Optional¶
At the QNN level, some ops use parameters in addition to inputs, and there might be more than one parameter. Parameters are constants provided as part of Qnn_OpConfig_t, and Qnn_Param_t has a name field associated with it. Due to the nature of the HTP op implementation function interface, the parameters are differentiated based on order rather than names. To allow QNN users to use op parameter names during QnnGraph_addNode function calls, the HTP backend allows op package writers to specify op parameter orders as well as default values for any ops. The HTP backend re-arranges the op parameters used in QnnGraph_addNode based on the order listed. If an op does not have an op parameter order specification, no re-arrangement occurs in QnnGraph_addNode.
This is not applicable to the elementwise_add op mentioned above.
Op parameter order can be specified using HTP core macro listed below, and this macro should be placed in global scope in individual op implementation source files.
Syntax
/*
* OP - op name
*
* PARAM - parameter name
*
* MANDATORY - boolean, whether this parameter is required to be provided at Qnn_addNode
*
* DEFAULT - default parameter value as Qnn_Param_t*, used when MANDATORY is false.
* If provided as Qnn_Param_t*, DEFAULT will be used for graph construction
* when this parameter is not provided at Qnn_addNode.
* If provided as nullptr, graph construction will skip this parameter when
* this parameter is not provided at Qnn_addNode.
*/
DEF_PACKAGE_PARAM_ORDER(OP,PARAM1,MANDATORY1,DEFAULT1,PARAM2,MANDATORY2,DEFAULT2...)
This macro is used once per op, and it takes any number of parameters. If an op has a parameter order definition, any parameter passed into Qnn_addNode with an unlisted name will be discarded. If two or more op packages with the same package name are registered, they cannot list conflicting parameter orders.
Example
static Qnn_Scalar_t sg_opParamDefault1Scalar{.dataType = QNN_DATATYPE_FLOAT_32, .floatValue = 6.0};
static Qnn_Param_t sg_opParamDefault1{.paramType = QNN_PARAMTYPE_SCALAR,
.scalarParam = sg_opParamDefault1Scalar};
DEF_PACKAGE_PARAM_ORDER("paramOrderDemoOp",
"MinusVal",
false,
&sg_opParamDefault1,
"AxisVal",
true,
nullptr,
"AddVal",
true,
nullptr,
"OptionalParam",
false,
nullptr)
This example defines op parameter order for op paramOrderDemoOp from the current
op package. It expects four parameters in the order of “MinusVal”, “AxisVal”, “AddVal”,
“OptionalParam”. “MinusVal” is an optional parameter with a default scalar parameter
value defined in sg_opParamDefault1. “AxisVal” and “AddVal” are mandatory parameters.
“OptionalParam” is optional and will be skipped if not provided in Qnn_addNode.
Step 5: Optimization Rule Definition - Optional¶
Once the basic functionality of the op is complete, the user might want to specify rules for graph-level transformations to implement things like tiling strategies, or to manipulate the data to simplify execution. The transformations must be applied in a way that preserves the correctness of the output.
Optimization rules can be defined using HTP core macro listed below, and this macro should be placed in global scope in individual op implementation source files.
Syntax
/*
* PRIORITY - unsigned integer value, used for indicating optimization pass number,
* smaller number indicates earlier optimization pass.
* Predefined values include EARLY(2000), MIDDLE(3000), LATE(4000).
*
* MATCHCODE - subgraph matching pattern which this optimization rule should apply on
*
* CONSTRAINTCODE - constraints applied to the match pattern
*
* REPLACECODE - new subgraph pattern which should replace the matching pattern if the
* constraints are met
*/
DEF_PACKAGE_OPTIMIZATION(PRIORITY,MATCHCODE,CONSTRAINTCODE,REPLACECODE)
Example
DEF_PACKAGE_OPTIMIZATION(
EARLY,
Op("Add","X","B"),
AND(EQ(RANK_OF("X"),4),EQ(RANK_OF("B"),4),EQ(DIM_DEPTH("X"),DIM_DEPTH("B")),
EQ(DIM_WIDTH("B"),1),EQ(DIM_HEIGHT("B"),1),EQ(DIM_BATCHES("B"),1)),
Op("BiasAdd","X","B")
)
This rule is ordered EARLY, a priority value that places it early in
the optimization process. EARLY, MIDDLE, and LATE are defined to
help order rules globally. Users may wish to use
EARLY, EARLY+1, EARLY+2, etc. to order optimizations.
This matches the pattern of an op with the operation string Add with
two inputs. The constraint ensures that the inputs are 4D, that the
last dimensions match between the two inputs, and that the other
dimensions in the B input are 1.
If the constraint passes, the original op is replaced with a new op,
with the same inputs and same output specifications as the original
output but with the operation string replaced with BiasAdd.
Note that the strings supplied as inputs in the match are usable during
constraint and replacement patterns to indicate whatever was matched.
Additionally, there is a special string "*" which indicates the
entire match. If a placeholder string occurs more than once in a match,
it must be the same in all places.
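As a hypothetical illustration of a repeated placeholder (the "Double" op name is invented for this example):

```cpp
// "X" appears twice in the match pattern, so this rule only fires when
// both inputs of Add are the very same tensor; the replacement rewrites
// it as a single hypothetical "Double" op.
DEF_PACKAGE_OPTIMIZATION(
    MIDDLE,
    Op("Add", "X", "X"),
    OK,
    Op("Double", "X")
)
```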
Op specifiers may be used more than once in a match or replacement pattern; this will match or generate more than one op, with the expected dependencies between them. For example:
DEF_PACKAGE_OPTIMIZATION(
EARLY+1,
Op("BiasAdd",Op("Conv2d_valid","Activations","Weights","Stride"),"Bias"),
OK,
Op("ConvLayer_valid","Activations","Weights","Stride","Bias")
)
This rule will match the sequence of Conv2d_valid followed by
BiasAdd and replace it with a new op called ConvLayer_valid with four
inputs.
Cross-Package Optimization
Cross-package optimization is allowed and supported. That means any op package can define optimization
rules which involve ops from other op packages. By default, all the op
names used in matching patterns and replacement patterns are assigned
the package name associated with the current package. In scenarios where an op
from a different op package shall be used in the matching pattern and/or
replacement pattern, users can explicitly use the format packageName::opName
in places where an op name is expected. If users want to use any HTP native ops,
the FROM_DEFAULT_PACKAGE(OPNAME) macro can be used to indicate that. For example,
DEF_PACKAGE_OPTIMIZATION(
EARLY+1,
Op("BiasAdd",Op("OpPackageNo2::Conv2d_valid","Activations","Weights","Stride"),"Bias"),
OK,
Op(FROM_DEFAULT_PACKAGE("ConvLayer_valid"),"Activations","Weights","Stride","Bias")
)
This modifies the previous optimization rule to match the op Conv2d_valid from
a package named OpPackageNo2, and it modifies the replacement pattern to use
an HTP native ConvLayer_valid op.
Other common default package ops examples: please read QNN HTP Op Package - Common Default Package Ops Usage Examples.
More Complex Optimization Rule Example
User can even take this further and create a more complex optimization rule. For example:
DEF_PACKAGE_OPTIMIZATION(
LATE,
Op("ConvLayer_valid","Act","Weights","Stride","Bias"),
AND(IS_QUINT8("Act"), // the constraint
IS_QUINT8("Weights"),
EQ(DIM_HEIGHT("Stride"),2),
EQ(DIM_WIDTH("Stride"),2),
LT(int(DIM_DEPTH("Act")),4),
GT(int(DIM_NFILTS("Weights")),31)),
Op("ConvLayer_valid", // the replacement rule
WITH_TYPE("Act",
WITH_SIZE(
gen_Shape(
DIM_BATCHES("Act"),
ADD(1,DIV(SUB(DIM_HEIGHT("Act"),DIM_FILTHEIGHT("Weights")),2)),
ADD(1,DIV(SUB(DIM_WIDTH("Act"),DIM_FILTWIDTH("Weights")),2)),
ROUNDUP(MUL(DIM_FILTHEIGHT("Weights"),DIM_FILTWIDTH("Weights"),DIM_FILTDEPTH("Weights")),32)
),
Op("ConvLayer.opt.im2col_stride2","Act",gen_ShapeOf("Weights"))
)
),
WITH_TYPE("Weights",
WITH_SIZE(
gen_Shape(
1,
1,
ROUNDUP(MUL(DIM_FILTHEIGHT("Weights"),DIM_FILTWIDTH("Weights"),DIM_FILTDEPTH("Weights")),32),
DIM_NFILTS("Weights")),
Op("ConvLayer.opt.weights_for_im2col","Weights")
)
),
gen_Shape(1,1,1,1),
"Bias"
)
)
This has a simple match pattern (“ConvLayer_valid” with 4
inputs) but there is a constraint which must be met before the
replacement rule is applied:
Act and Weights inputs must both be of datatype Quint8
Stride must have dimensions of 2x2
Act depth must be < 4
Weights must have DIM_NFILTS > 31 (meaning its depth > 31)
The replacement pattern for the optimization above generates the op
Op( "ConvLayer_valid", <<new_act>>, <<new_weights>>, <<new_stride>>, "Bias")
where <<new_act>>, <<new_weights>>, <<new_stride>> are constructed as
below:
<<new_stride>> is just a [1x1x1x1] shape, produced by gen_Shape(1,1,1,1).
<<new_act>> is made by applying the original “Act” input to an op Op("ConvLayer.opt.im2col_stride2","Act",gen_ShapeOf("Weights")). In other words, a new op ConvLayer.opt.im2col_stride2 is inserted and “Act” becomes its first input; this op rearranges the data so that the equivalent convolution can be done with a point-wise convolution. It reduces the height and width dimensions by 2 and increases the depth. Because of WITH_TYPE and WITH_SIZE:
the output type for <<new_act>> is the same as the original Act output type
the shape for <<new_act>> is according to the constructed shape gen_Shape(DIM_BATCHES("Act"), ADD(1,DIV(SUB(DIM_HEIGHT("Act"),DIM_FILTHEIGHT("Weights")),2)), ADD(1,DIV(SUB(DIM_WIDTH("Act"),DIM_FILTWIDTH("Weights")),2)), ROUNDUP(MUL(DIM_FILTHEIGHT("Weights"),DIM_FILTWIDTH("Weights"),DIM_FILTDEPTH("Weights")),32))
Note that gen_ShapeOf("Weights") looks at the output shape of the Weights input and creates a constant shape object of the same shape.
<<new_weights>> is similarly made by applying the original Weights input to an op Op("ConvLayer.opt.weights_for_im2col","Weights"); its output shape is calculated as [1,1,d_in,d_out], where d_in is the same as the output depth of ConvLayer.opt.im2col_stride2, and d_out is the same as the original output depth of the weights.
Starting from a simple element-wise add, it is possible to generate many complex ops.
Considerations for Transformations¶
By default, replacement ops are created with the same output parameters (shape and quantization information) as the entire rule. This is not always appropriate, especially if the user is manipulating part of an input sequence.
WITH_SIZE(Size, Replacement) generates Replacement with the same size as Size.
WITH_TYPE(Type, Replacement) generates Replacement with the same type as Type.
For example, if someone is trying to manipulate a parameter of an input, they might have a rule like:
DEF_PACKAGE_OPTIMIZATION(
EARLY,
Op("MyOp","Input0","Input1"),
OK,
Op("MyOp.real","Input0",
WITH_SIZE("Input1",
WITH_TYPE("Input1",
Op("MyOp.AdjustInput","Input1"))))
)
This would replace the sequence MyOp(A,B) with
MyOp.real(A,MyOp.AdjustInput(B)), but would keep the size and type
of the second op the same as the B input.
To generate a shape or constant scalar op, there are some helper replacement patterns:
gen_Shape(A,B,C,D) generates a 4D shape.
gen_ConstScalar_f32(val) generates a constant float scalar with value val.
gen_ConstScalar_i32(val) generates a constant integer scalar with value val.
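As a hedged illustration of these helpers in a replacement pattern (the op names are hypothetical):

```cpp
// Hypothetical rule: rewrite "MyScale" so the scale factor becomes an
// explicit constant float scalar input built with gen_ConstScalar_f32.
DEF_PACKAGE_OPTIMIZATION(
    MIDDLE,
    Op("MyScale", "In"),
    OK,
    Op("MyScale.withConst", "In", gen_ConstScalar_f32(2.0f))
)
```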
Tiling as Graph Transformations¶
Breaking some ops into smaller ops helps with practicality, and those ops can in turn be broken into still smaller ops until locality is achieved.
For tiling, it is common to want to replace an op with the concatenation
of a set of smaller ops. To facilitate this, the AUTOSPLIT
replacement pattern helper will set this up. AUTOSPLIT takes as
parameters:
The output dimension to split on
A variable to hold information about the splitting process for a replacement
The size of the split
The replacement pattern
For example,
DEF_PACKAGE_OPTIMIZATION(
EARLY+4,
Op("MaxPool_valid","Act","W","S"),
GT(DIM_DEPTH("*"), 32),
AUTOSPLIT(3,"I",32,Op("MaxPool_valid",TYPICAL_SLICE("Act","I"),"W","S"))
)
This will match a MaxPool_valid op with three inputs, enforce that
the number of output channels is greater than 32, and then split
along the output channels into some number of replacements, each with at
most 32 channels. The replacement pattern is a MaxPool_valid op
where the input is replaced with a slice of input with the helper
TYPICAL_SLICE, which takes a slice of the specified input. All the
replacement ops are then concatenated together along the split dimension
automatically. If our input/output is 64 channels, the replacement
would look like:
Concat(3,
       MaxPool_valid(
           Slice("Act", /* Slice control values here */),
           "W","S"),
       MaxPool_valid(
           Slice("Act", /* Slice control values here */),
           "W","S"))
If a typical slice doesn’t match what is needed, use the
AUTOSPLIT_SHAPEFN_APPLY helper to apply a user-specified function to
generate the shape needed.
Breaking down ops to smaller ops helps the framework to be able to reduce the memory footprint (by more quickly eliminating temporary results), as well as increase parallelism by enabling ops to run in parallel.
It’s important to note that when the graph transformations are applied, manipulations happen to the same graph, converting one valid graph to another.
Tips for Optimization¶
More fine-grained tiling opens up more opportunities for parallelism and finer-grained data management; however, it increases the amount of metadata and per-op overhead.
It is recommended that the graph always remain correct. If there is a rule that is optional (for example, tiling when the op handles arbitrary sizes), use the same op name, but if there is a change in the behavior, it is best to change the name. This way an implementation of the original name or an implementation of the new op name has well-defined behavior.
Apart from all this, below is a set of graph-level optimizations that are done by the core framework:
Constant Propagation: If there is a chain of ops whose data are all constant, it is evaluated when the graph is prepared.
Dead Code Elimination: Code that is not used during execution is deleted.
Common Sub-expression Elimination: Replaces identical expressions with a single variable holding the computed value.