Quantized 16-bit activations (A16) vs FP16 and Activation Fusion: performance and power differences

As mentioned previously in Recommendations for Network Design, quantized 16-bit (A16) activations generally achieve higher performance and lower power consumption than FP16. To get the most benefit from 16-bit activations, it is critical to maximize hardware utilization; we recommend designing the network to take maximum advantage of HMX. Some of these practices are outlined below with examples.

Minimize data movement operations

Non-computational operations that move or permute data in some way are expected to perform similarly in FP16 and A16, since the data size is the same. If a network is heavily dominated by these operations, the inferences/sec and power benefits of A16 over FP16 may be insignificant.

These operations include (but are not limited to):
  • Reshape

  • Transpose

  • Depth to Space

  • Space to Depth

  • Concat

Therefore, it is recommended to avoid these operations whenever feasible, and to consider approaches that minimize the use of data movers.
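As a rough illustration (using NumPy as a stand-in for the graph operations; the shapes and block size are hypothetical), the sketch below shows that these operations only relayout the same bytes and perform no arithmetic, which is why a smaller compute datatype does not speed them up:

```python
import numpy as np

# A small NHWC activation tensor (batch=2, height=4, width=4, channels=8).
x = np.arange(2 * 4 * 4 * 8, dtype=np.float16).reshape(2, 4, 4, 8)

# Pure data-movement ops: no arithmetic, only a relayout of the same data,
# so A16 offers no compute advantage over FP16 here.
y = x.reshape(2, 16, 8)                  # Reshape
z = x.transpose(0, 3, 1, 2)              # Transpose (NHWC -> NCHW)
c = np.concatenate([x, x], axis=-1)      # Concat along channels

# Space-to-depth with block size 2: moves 2x2 spatial blocks into channels.
b = 2
s2d = (x.reshape(2, 4 // b, b, 4 // b, b, 8)
        .transpose(0, 1, 3, 2, 4, 5)
        .reshape(2, 4 // b, 4 // b, b * b * 8))
```

Every output above contains exactly the original elements in a different layout, which is the defining property of a data mover.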

Activation Fusion

The latest hardware can fuse some non-trivial activation functions (i.e., non-Relu) into the preceding convolution. Fusing PRelu and HardSwish is currently supported in all precisions, namely A8, A16, and FP16. For A8 and A16 precisions, Relu and ReluMinMax activations have the best performance; for FP16, Relu has the best performance. Many of the networks provided in A16 make heavy use of PRelu and will benefit from this added support. However, there are some caveats, outlined below, and networks should be designed to make the best use of this feature:

  • For activation fusion to take place, the output of the convolution preceding the activation must not be used elsewhere in the network.

Note

Simple activation functions like Relu are not subject to the above constraint, so we still recommend using Relu over PRelu/HardSwish whenever this can be achieved.

The following table shows the support for various activation fusions with the preceding Conv2D or DepthWiseConv2D:

Activation  | Op Fusion (A8/A16)      | Efficiency (A8/A16) | Op Fusion (FP16)        | Efficiency (FP16)
----------- | ----------------------- | ------------------- | ----------------------- | -----------------
Relu        | Yes                     | Best                | Yes                     | Best
ReluMinMax  | Yes                     | Best                | Supported on 8550/8650  | Good
HardSwish   | Supported on 8550/8650  | Best                | Supported on 8550/8650  | Good
LeakyRelu   | Supported on 8550/8650  | Best                | Supported on 8550/8650  | Good
Prelu       | Supported on 8550/8650  | Best                | Supported on 8550/8650  | Good
Elu         | No                      | N/A                 | No                      | N/A
Gelu        | No                      | N/A                 | No                      | N/A
Sigmoid     | No                      | N/A                 | No                      | N/A
Tanh        | No                      | N/A                 | No                      | N/A

Example: Network Topology

When fusing an activation, if the results of the preceding convolution are needed elsewhere in the network, the fusion is not performed, since the convolution would otherwise have to be computed twice. This pattern is quite common in many networks and prevents activation fusion.

../../_static/resources/htp_guidelines_fig_11.png

Figure 1

From Fig. 1, one can see that the results of the convolution feed both a PRelu (highlighted in yellow) and an Add (the bottom AddV2), which prevents activation fusion. It is therefore recommended to optimize the network topology so that activation fusion is always possible.

Example: Replacing Add with Convolution to Maximize Activation Fusion

Activation fusion is currently supported only for convolution-type ops; it does not apply to other ops such as Add or Mul. If such an elementwise op is followed by a non-trivial activation, it may be beneficial to replace it with a convolution, which allows the activation to be fused. For example, in Fig. 2 the first Add can be replaced with a convolution, allowing the following PRelu to be fused into it. However, the restriction still applies that the output of the replacement convolution cannot be used elsewhere, so some patterns, like the one shown in Fig. 3, may not be fused.

../../_static/resources/htp_guidelines_fig_12.png

Figure 2

../../_static/resources/htp_guidelines_fig_13.png

Figure 3
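One way to see that an elementwise Add can be expressed as a convolution is the identity Add(a, b) = Conv1x1(Concat(a, b), [I; I]): stacking two identity matrices as the 1x1 kernel makes the convolution sum the two channel groups. The sketch below verifies this numerically in NumPy (the shapes are hypothetical, and this is only a mathematical illustration; the Concat it introduces is itself a data mover, so the actual graph transformation involves trade-offs):

```python
import numpy as np

rng = np.random.default_rng(1)
C = 8
a = rng.standard_normal((1, 4, 4, C)).astype(np.float32)
b = rng.standard_normal((1, 4, 4, C)).astype(np.float32)

# 1x1 conv kernel of shape (2C, C): two stacked identities, so the conv
# adds the first C channels to the second C channels.
w = np.concatenate([np.eye(C), np.eye(C)], axis=0).astype(np.float32)

# Add expressed as Concat + 1x1 conv; a following PRelu could now fuse
# into this convolution instead of trailing a bare Add.
added_via_conv = np.concatenate([a, b], axis=-1) @ w

assert np.allclose(added_via_conv, a + b, atol=1e-5)
```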