Other Performance and Energy Guidelines

Guidelines Regarding Datatypes

  • Fixed-point computations consume less energy than floating-point at a given bit-width. Therefore, always use quantized models (preferably 8-bit) if they achieve sufficient accuracy. For more information about quantization, see Quantization.

  • MAC energy scales roughly quadratically with operand bit-width (it is proportional to the product of the activation bit-width and the weight bit-width). Therefore, prefer quantized 8-bit activations.

  • Quantization also reduces memory energy consumption, which is proportional to the total number of bits transferred.

  • Lower bit-width is also an effective way to reduce model size.

  • Use per-channel scaling to offset the loss of dynamic range caused by fixed-point representation; this effectively becomes block floating point.

  • When converting between 8-bit and 16-bit fixed point, keep the following in mind:

    • When expanding from 8-bit to 16-bit, place the 8-bit value in the MSBs and zero-fill the LSBs. Similarly, when narrowing from 16-bit to 8-bit, keep the MSBs and discard the LSBs.

    • When converting from 8-bit to 16-bit, set the step size and offset of the 16-bit layer to exactly 256 times the corresponding values of the 8-bit layer.

    • When converting from 16-bit to 8-bit, set the step size and offset of the 8-bit layer to exactly 1/256 times the corresponding values of the 16-bit layer.
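
A minimal sketch of the MSB-alignment rule above, using pure-Python integer codes (the function names are illustrative, not part of any SDK). Shifting the integer code left by 8 bits corresponds to the factor-of-256 adjustment of the quantization parameters described above.

```python
def expand_8_to_16(q8: int) -> int:
    # Place the 8-bit value in the MSBs and zero-fill the LSBs.
    return (q8 & 0xFF) << 8

def narrow_16_to_8(q16: int) -> int:
    # Keep the MSBs; the low 8 bits are discarded.
    return (q16 >> 8) & 0xFF
```

Note that narrowing after expanding is lossless, while the reverse direction discards the low 8 bits of precision.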

Positioning Batch Dimension

  • When defining the shape of activations, always make the batch the first dimension. For example, use [b,1,1,d] instead of [1,1,b,d]. This keeps the position of the batch consistent throughout the network, reducing unnecessary transformations and improving parallelization and hardware utilization.
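
As a quick NumPy illustration (shapes chosen arbitrarily): if the batch is defined in an inner dimension, moving it to the front requires a transpose, i.e. real data movement that a batch-first layout avoids entirely.

```python
import numpy as np

b, d = 4, 8
x = np.arange(b * d, dtype=np.float32).reshape(1, 1, b, d)  # batch in dim 2

# Moving the batch to the front requires a transpose; defining the
# tensor as [b, 1, 1, d] from the start avoids this data movement.
x_batch_first = np.transpose(x, (2, 0, 1, 3))
print(x_batch_first.shape)  # (4, 1, 1, 8)
```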

Balancing Spatial dimensions

  • Whenever possible, define activations with balanced spatial dimensions (height and width). For example, [6,30,50,256] is preferable to [6,1,1500,256]. Many operations, such as elementwise ops, activation functions, and softmax, are spatially independent: they produce the same mathematical result regardless of how the height and width dimensions are factored. Balancing the height and width dimensions lets the HTP hardware achieve better memory and hardware utilization. This is especially recommended when a network has a long sequence of spatially independent ops, such as those often found in attention blocks.
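
The spatial-independence claim above can be checked directly in NumPy (the shapes are taken from the example; the softmax helper is written here for illustration): factoring the 1500-element spatial extent into 30×50 leaves a per-channel softmax unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
flat = rng.standard_normal((6, 1, 1500, 256)).astype(np.float32)
balanced = flat.reshape(6, 30, 50, 256)  # same data, balanced H and W

# Softmax over the channel axis is spatially independent: the
# factoring of H and W does not change the result.
out_flat = softmax(flat)
out_balanced = softmax(balanced)
assert np.allclose(out_flat, out_balanced.reshape(6, 1, 1500, 256))
```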

Size of Width and Height dimensions

  • Choose width and height dimensions that are powers of two to better align with the hardware.
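
A small helper (illustrative, not from any SDK) for rounding a spatial dimension up to the nearest power of two, e.g. when choosing a padded activation size:

```python
def next_pow2(n: int) -> int:
    # Smallest power of two >= n, for n >= 1.
    return 1 << (n - 1).bit_length()

# e.g. pad H/W of 30 and 50 up to 32 and 64 respectively.
```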

Use Batches

  • Tile large images (with halos) to produce batches, at least for the subnetwork where tiles don’t interact. The tiled batches can then be converted to channels, and the convolution that follows can be converted to a group convolution. This makes layers with few channels look like more efficient layers with channel groups.
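
The tiling step can be sketched in NumPy as follows (a simplified illustration: the function name is hypothetical, the tile size is assumed to divide the spatial dimensions evenly, and the halo is realized by zero padding at the borders):

```python
import numpy as np

def tile_with_halo(img, tile, halo):
    """Split an NHWC image (N=1) into overlapping tiles stacked on the batch dim."""
    _, h, w, c = img.shape
    # Zero-pad so border tiles also get a full halo.
    padded = np.pad(img, ((0, 0), (halo, halo), (halo, halo), (0, 0)))
    tiles = []
    for i in range(0, h, tile):
        for j in range(0, w, tile):
            tiles.append(padded[0,
                                i:i + tile + 2 * halo,
                                j:j + tile + 2 * halo, :])
    # Shape: (num_tiles, tile + 2*halo, tile + 2*halo, c)
    return np.stack(tiles)

img = np.zeros((1, 64, 64, 3), dtype=np.float32)
batched = tile_with_halo(img, tile=32, halo=2)
print(batched.shape)  # (4, 36, 36, 3)
```

The resulting batch dimension can then be folded into channels so that the following convolution runs as a group convolution, as described above.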

Choice of Machine Learning Platforms and Tools

  • We recommend the TensorFlow platform over others such as PyTorch, for the following reasons:

    • TensorFlow’s data format is the same as QNN’s, namely NHWC, whereas PyTorch uses NCHW. QNN models generated from PyTorch will therefore contain additional Transpose ops to switch between the data formats, which affects performance negatively.

    • Graphs produced by starting from PyTorch, converting to ONNX, and then converting to QNN are not optimal for QNN HTP performance.

  • Use the onnx-simplifier and (TensorFlow) transform tools to simplify graphs. A graph may contain various redundancies, and these tools help remove them.
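
The layout mismatch described above can be demonstrated in NumPy (tensor shape chosen arbitrarily): converting an NCHW tensor to NHWC and back requires two transposes, which is exactly the data movement a converted PyTorch graph pays for around layout-sensitive ops.

```python
import numpy as np

# QNN expects NHWC; PyTorch tensors are NCHW. A converted graph
# inserts transposes like these, which cost real data movement.
nchw = np.random.rand(1, 3, 224, 224).astype(np.float32)

nhwc = np.transpose(nchw, (0, 2, 3, 1))  # NCHW -> NHWC
back = np.transpose(nhwc, (0, 3, 1, 2))  # NHWC -> NCHW
assert back.shape == (1, 3, 224, 224)
assert np.array_equal(nchw, back)
```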

Miscellaneous Guidelines

  • Avoid global operations (e.g., global pooling) when they provide only small accuracy benefits. They create memory bottlenecks and inhibit tile-based/depth-first memory savings.

  • Zeros generally reduce the energy of multiply and memory operations, but they do not help significantly unless they recur in the same positions. For example, 50% zeros uniformly distributed over one of the multiplicands reduces power by up to 15% relative to the peak full-capacity, fully random data power. ReLU is a major source of zeros in activations and is recommended for this reason. The pattern of beneficial zeros is hardware dependent. Reducing bit-width tends to be a bigger lever for reducing model size and energy at a given accuracy level.

  • Use the ReLU activation function instead of LeakyReLU/PReLU.

  • When dividing by a constant, multiply by its reciprocal instead of dividing.

  • Use quantized inputs (tf8/tf16) where possible to avoid quantizing inputs during inference.

  • Reshape and other data movement operations, such as Transpose, are not typically free and might involve substantial overhead. If you are using Reshape to implement some novel functionality (such as filtering in a new dimension), consider using an alternative fused framework op or writing your own op rather than using Reshape to build the op out of existing operations.

  • The performance of an op in a shallow test graph is not indicative of its performance in a full graph.
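
The multiply-by-reciprocal guideline above can be sketched as follows (the constant and function name are illustrative): the reciprocal is computed once, e.g. at model-build time, so inference pays only for a multiply.

```python
C = 7.0
INV_C = 1.0 / C  # computed once, ahead of time

def scale(x: float) -> float:
    # Multiplication is cheaper than division on most hardware.
    return x * INV_C
```

Note that because the reciprocal is rounded to a finite-precision float, the result can differ from true division by a few ULPs, which is usually acceptable for quantized inference.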