Avoid converting PyTorch models to ONNX first

It has been observed that converting PyTorch models to ONNX first and then converting the ONNX model to QNN can introduce patterns in the graph that negatively affect performance. Examples are given in the following sections.

Changing the rank of the tensor

The PyTorch->ONNX->QNN conversion can change the ranks of tensors in a way that prevents optimization of certain patterns which rely on a specific sequence of ops and specific ranks of their inputs. For example, consider the two patterns in Figure 1. The entire pattern on the left is a layer norm and is optimized successfully. Note that the channel dimension is the last dimension and the Mul and Add ops at the end have scalar values. Now compare this to the pattern on the right: the channel dimension is the second dimension, and the scale and bias values (the Mul and Add at the end) are tensors of rank 1. Optimization of the latter case is not supported yet and can lead to significant performance bottlenecks.
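A minimal sketch of the two source-level patterns that can produce these graphs (module names and shapes here are illustrative, not part of any converter API). The first variant permutes channels to the last dimension before `nn.LayerNorm`, which tends to export as the fusable pattern on the left; the second hand-rolls normalization over the second (channel) dimension with rank-1 scale and bias tensors, matching the unsupported pattern on the right:

```python
import torch
import torch.nn as nn

class ChannelsLastNorm(nn.Module):
    """Normalize over the last (channel) dimension. The exported
    Sub/Div/Mul/Add sequence operates on the last axis and can be
    recognized as a single layer norm."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):             # x: (N, C, H, W)
        x = x.permute(0, 2, 3, 1)     # move channels to the last dim
        x = self.norm(x)              # normalize over the last dim
        return x.permute(0, 3, 1, 2)  # restore NCHW for downstream ops

class ChannelsSecondNorm(nn.Module):
    """Hand-rolled normalization over dim=1. The exported graph applies
    rank-1 scale/bias tensors on the second dimension -- the pattern
    whose fusion is not yet supported."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(channels))
        self.bias = nn.Parameter(torch.zeros(channels))
        self.eps = eps

    def forward(self, x):             # x: (N, C, H, W)
        mu = x.mean(dim=1, keepdim=True)
        var = x.var(dim=1, keepdim=True, unbiased=False)
        x = (x - mu) / torch.sqrt(var + self.eps)
        return x * self.weight[:, None, None] + self.bias[:, None, None]
```

Both modules compute the same normalization; only the first reliably exports into the shape the optimizer expects.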

../../_static/resources/htp_guidelines_fig_18.png

Figure 1

Introducing avoidable reshape-transpose-slice sequences

The following types of patterns (Figures 2 and 3) can be introduced during the PyTorch->ONNX->QNN conversion. The Slice ops introduced in Figure 2 serve no purpose and make it difficult to optimize the multiplication op. These patterns can be avoided entirely during conversion by folding the multiplication into the convolution at the beginning of the graph. Additionally, the Reshape and Transpose ops in both figures are very expensive, because Permute ops are added before and after each Reshape to correct the layout to NHWC format. Removing these artifacts from the converted graph can improve inference time by about 10% in many cases.
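The folding mentioned above can be sketched as follows. This is an illustrative helper (the function name and the assumption of a per-output-channel `scale` are hypothetical, not a converter API): a `Conv -> Mul` pair collapses into a single Conv by scaling the weights and bias ahead of export, so no separate Mul (or the Slice ops around it) appears in the converted graph:

```python
import torch
import torch.nn as nn

def fold_mul_into_conv(conv: nn.Conv2d, scale: torch.Tensor) -> nn.Conv2d:
    """Fold a per-output-channel multiplication that follows a Conv2d
    into the Conv2d itself: scale * (W*x + b) == (scale*W)*x + scale*b.
    `scale` has shape (out_channels,). Illustrative only."""
    folded = nn.Conv2d(conv.in_channels, conv.out_channels,
                       conv.kernel_size, conv.stride,
                       conv.padding, conv.dilation, conv.groups,
                       bias=conv.bias is not None)
    with torch.no_grad():
        # Weight shape is (out_channels, in_channels/groups, kH, kW),
        # so broadcasting scale over dim 0 scales each output channel.
        folded.weight.copy_(conv.weight * scale[:, None, None, None])
        if conv.bias is not None:
            folded.bias.copy_(conv.bias * scale)
    return folded
```

Exporting the folded Conv instead of the Conv followed by a Mul keeps the converted graph free of the avoidable Slice/Mul pattern.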

../../_static/resources/htp_guidelines_fig_14.png

Figure 2

../../_static/resources/htp_guidelines_fig_16.png

Figure 3

Adding a Reshape op as the first node of the graph

A Reshape op as the first node of a graph consumes a lot of cycles; see Figure 4. Additionally, as mentioned previously, Permute ops are added before and after the Reshape, further degrading performance. This can degrade the total inference time of a graph by 5-10%.
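One way to avoid the leading Reshape is to move the flatten out of the model and into host-side preprocessing, so the exported graph starts directly with a compute op. A minimal sketch (module names and shapes are illustrative):

```python
import torch
import torch.nn as nn

class FlattenFirst(nn.Module):
    """Anti-pattern: the exported graph begins with a Reshape,
    which gets wrapped in extra Permute ops on the HTP backend."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3 * 32 * 32, 10)

    def forward(self, x):                       # x: (N, 3, 32, 32)
        return self.fc(x.view(x.size(0), -1))   # Reshape is node 1

class PreFlattened(nn.Module):
    """Preferred: the model takes already-flattened input, so no
    Reshape is exported; flatten in host-side preprocessing instead."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3 * 32 * 32, 10)

    def forward(self, x):                       # x: (N, 3*32*32)
        return self.fc(x)
```

The two models are numerically identical; only the placement of the flatten (inside the graph vs. in preprocessing) differs.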

../../_static/resources/htp_guidelines_fig_15.png

Figure 4