Use Space-to-depth transformation where possible

Convolution

If a convolution's input activation has few channels, we can apply a space-to-depth transformation to the activations and keep the convolution equivalent by applying the corresponding space-to-depth transformation to the weights. For example, if the original activations are 1024x1024x8, we can run them through a 2x2 space-to-depth transformation. The resulting 512x512x32 activations map better onto the hardware. We then need to modify the weights to match the new data arrangement. If the original weights were 3x3x8x8 (h*w*n_i*n_o), we apply a similar space-to-depth transformation, replicating the weights four times, once for each output position within a 2x2 block. Because a 3x3 filter does not divide evenly into 2x2 blocks, we round the filter size up to cover all overlap, so in this case we end up with weights sized 2x2x32x32.
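The activation rearrangement itself is a pure reindexing. A minimal NumPy sketch, assuming an HxWxC (channels-last) layout; the function name and block size are illustrative:

```python
import numpy as np

def space_to_depth(x, block=2):
    """Rearrange an HxWxC tensor into (H/block)x(W/block)x(block*block*C).

    Each block x block spatial patch is flattened into the channel axis,
    so e.g. a 1024x1024x8 activation with block=2 becomes 512x512x32.
    """
    h, w, c = x.shape
    assert h % block == 0 and w % block == 0
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)  # bring the two intra-block axes together
    return x.reshape(h // block, w // block, block * block * c)

# Small stand-in for the 1024x1024x8 example: 4x4x2 -> 2x2x8.
x = np.arange(4 * 4 * 2).reshape(4, 4, 2)
y = space_to_depth(x)
```

Each output position now holds one full 2x2 patch of the original image, flattened into channels, which is what the transformed weights must be arranged to consume.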

../../_static/resources/htp_guidelines_fig_9.png

Transformations to convolution to increase channel use

When several convolutions are sequential, back-to-back depth-to-space and space-to-depth pairs cancel and can be eliminated. Also, space-to-depth followed by convolution implements convolution with stride, and convolution followed by depth-to-space implements transpose convolution with stride. For some models, the space-to-depth transformation and smaller filter size can be folded into the original model and trained with that structure in place. Otherwise, the transformation described above yields results identical to the original convolution.
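The strided-convolution identity is easy to check numerically: a 2x2 convolution with stride 2 equals space-to-depth with block 2 followed by a 1x1 (pointwise) convolution whose weights are the original 2x2xCixCo filter flattened to (4*Ci)xCo. A minimal NumPy sketch, assuming HWC layout; the shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3))      # HWC input
w = rng.standard_normal((2, 2, 3, 5))   # h, w, c_in, c_out filter

# Direct 2x2 convolution with stride 2.
direct = np.empty((2, 2, 5))
for i in range(2):
    for j in range(2):
        patch = x[2 * i:2 * i + 2, 2 * j:2 * j + 2, :]  # 2x2x3 window
        direct[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))

# Space-to-depth (block 2) followed by a 1x1 convolution.
z = x.reshape(2, 2, 2, 2, 3).transpose(0, 2, 1, 3, 4).reshape(2, 2, 12)
pointwise = z @ w.reshape(12, 5)        # a 1x1 conv is a matmul over channels
```

The weight flattening works because NumPy's row-major reshape enumerates the (h, w, c_in) filter taps in the same order that space-to-depth packs the corresponding input values into channels.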

Elementwise Operations

If both inputs of an elementwise operator have had the same space-to-depth transformation applied to them, the operator can be applied directly to the transformed tensors; the result is the space-to-depth transformation of the original output.
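This works because space-to-depth only moves elements, so any elementwise operation commutes with it. A quick NumPy check, with shapes chosen for illustration:

```python
import numpy as np

def space_to_depth(x, block=2):
    # HxWxC -> (H/block)x(W/block)x(block*block*C), pure reindexing.
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // block, w // block,
                                              block * block * c)

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 4, 8))
b = rng.standard_normal((4, 4, 8))

# Elementwise add commutes with the rearrangement.
lhs = space_to_depth(a) + space_to_depth(b)
rhs = space_to_depth(a + b)
```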

Reshaping Data

It is common to see low-channel convolutions near the output of some networks, particularly ones that do image processing. For an image stored with few channels, the data arrangement is identical before and after a width-to-depth transformation. Consider using a width-to-depth transformation to increase the channel count, then reinterpreting the output as the desired original wider, low-channel image. For example, a 1024x1024x3 output could be computed as 1024x256x12 and then reinterpreted as 1024x1024x3.
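Because an HWC tensor is stored row-major, width-to-depth is a pure reshape: the bytes never move, only the shape metadata changes. A minimal NumPy sketch of the round trip, using an 8x8x3 tensor as a small stand-in for the 1024x1024x3 example:

```python
import numpy as np

out = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)

# Width-to-depth with factor 4: same memory, viewed with more channels.
wide = out.reshape(8, 2, 12)       # stand-in for computing at 1024x256x12

# The final result is the same buffer viewed at the original shape.
restored = wide.reshape(8, 8, 3)
```

Since the layouts match byte-for-byte, the "transformation" at the network boundary costs nothing; only the interior convolutions see the higher channel count.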