Reducing TCM requirements for performance and functionality

The firmware divides work into smaller pieces (tiles) and attempts to schedule and allocate those smaller work items in a way that improves performance and minimizes traffic to external memory. Processing horizontal strips of activation data is typically the most efficient execution pattern, so we try to execute in that format as much as possible. However, for very large activations, the amount of memory required to process a horizontal strip of data can be substantial. This is especially true on devices with smaller TCM sizes.

Reducing the required input and output sizes

To compute a block of output data, we need one or more blocks of input data. Striding, dilation, padding, and filter sizes all increase the amount of input data required. By making the required input data smaller, you can reduce the pressure on TCM. At present, this is especially true for the width and depth dimensions; reducing width and depth can have a very strong impact on TCM usage requirements.

Padding choices

When a convolution is padded by a small amount to produce the same output size as input size, we typically require three tiles of input data to be available: the data above, the data below, and the data in the same location. If there is a large sequence of convolutions, consider applying a large padding early in the series of convolutions and then using “VALID” (zero padding amount) convolutions for the subsequent operations. Unpadded convolutions typically need two tiles of input data instead of three.
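The trade-off can be reasoned about with simple arithmetic. The sketch below (a hypothetical helper, assuming filters of odd height) computes how much padding to apply up front so that a chain of “VALID” convolutions still preserves the activation height:

```python
def pad_once_amount(num_convs, filter_height=3):
    """Rows of padding (top and bottom) to apply before a chain of
    num_convs unpadded ("VALID") convolutions so the final output
    height matches the original input height. Each unpadded conv
    shrinks the height by (filter_height - 1) rows in total."""
    return num_convs * (filter_height - 1) // 2
```

For instance, four unpadded 3x3 convolutions shrink the height by 8 rows in total, so padding by 4 rows on each side up front preserves the final output size while letting each convolution keep two input tiles resident instead of three.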

Reduce precision

16-bit activation data takes twice the amount of memory that 8-bit activation data does. Consider trying smaller data types if possible.

Reducing TCM pressure throughout the network

While designing a network, also consider the activation sizes throughout the body of the network, not just the input/output sizes of the network. If activations in the body of the network are large enough to not fit in TCM, or create significant TCM pressure, the network may fail to prepare and/or suffer a significant performance impact. Even a network with a low input resolution can have TCM size issues if it has significantly large activations in the body of the network. We continue to improve the engine’s ability to handle larger activations.

Generic rule of thumb guidelines for various op activations to fit in TCM

Weights are also stored in TCM and take up TCM space. The generic rule-of-thumb guidelines below, based on activation sizes, can be used to determine sizing for different TCM configurations.

  • Activation widths are rounded up to the nearest multiple of 8 (uint8) / 4 (uint16)

  • Activation depths are rounded up to the nearest multiple of 32 for both uint8 and uint16

  • Activation heights are rounded up to the nearest multiple of 8 for both uint8 and uint16
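As an illustration, the rounding rules above can be captured in a small helper. The names `round_up` and `padded_dims` are hypothetical for this sketch, not firmware APIs:

```python
def round_up(value, multiple):
    """Round value up to the nearest multiple (ceiling division)."""
    return -(-value // multiple) * multiple

def padded_dims(width, height, depth, dtype="uint8"):
    """Apply the activation rounding rules above to (width, height, depth)."""
    width_mult = 8 if dtype == "uint8" else 4  # width: 8 (uint8) / 4 (uint16)
    return (round_up(width, width_mult),
            round_up(height, 8),               # height: multiple of 8 for both
            round_up(depth, 32))               # depth: multiple of 32 for both
```

For example, a 65x9x33 uint8 activation is treated as 72x16x64 for TCM sizing purposes.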

The following equations can be used to get an approximate idea of TCM fit.

  • Unary Elementwise Ops
    • ACTIVATION_WIDTH*256*elsize*2 <= TCM_SIZE/2

    • ACTIVATION_WIDTH needs to be rounded up to the nearest multiple of 8 (uint8) / 4 (uint16)

    • If the LHS above is > TCM_SIZE/2 but less than TCM_SIZE, it might still fit, but performance will be adversely affected

  • Binary Elementwise Ops
    • ACTIVATION_WIDTH*256*elsize*3 <= TCM_SIZE/2

    • ACTIVATION_WIDTH needs to be rounded up to nearest multiple of 8 for uint8 and 4 for uint16

    • If the LHS above is > TCM_SIZE/2 but less than TCM_SIZE, it might still fit, but performance may be adversely affected
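A minimal sketch of the two elementwise checks above. The function name and the 8 MiB TCM in the example below are assumptions; the factor of 256 and the rounding rules come from this section:

```python
def round_up(value, multiple):
    return -(-value // multiple) * multiple

def elementwise_fits(width, elsize, tcm_size, num_inputs=1):
    """Check the unary (num_inputs=1) / binary (num_inputs=2) elementwise
    rule: rounded_width * 256 * elsize * (num_inputs + 1) <= TCM_SIZE/2."""
    width = round_up(width, 8 if elsize == 1 else 4)  # 8 (uint8) / 4 (uint16)
    footprint = width * 256 * elsize * (num_inputs + 1)
    return footprint <= tcm_size // 2
```

With an assumed 8 MiB TCM, a unary op on a 1024-wide uint8 activation fits comfortably, while a binary op on an 8192-wide uint16 activation does not.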

  • Convolution
    • ACTIVATION_INPUT_WIDTH * ACTIVATION_INPUT_DEPTH * FILTER_HEIGHT * tiling_dimension + FILTER_WIDTH * FILTER_HEIGHT * FILTER_DEPTH * number of channels + ACTIVATION_OUTPUT_WIDTH * tiling_dimension <= TCM_SIZE/2

    • ACTIVATION_INPUT_WIDTH and ACTIVATION_OUTPUT_WIDTH need to be rounded up to the nearest multiple of 8 (uint8) / 4 (uint16)

    • ACTIVATION_INPUT_DEPTH needs to be rounded up to the nearest multiple of 32

    • FILTER_DEPTH needs to be rounded up to nearest multiple of 32

    • Stride-2 and dilated convolutions may have higher TCM requirements

    • tiling_dimension: 8 (the 8-row tile height in the tile dimensions WxHxD = 8x8x32)

    • number of channels: 32 (32 channels as minimum chunk for the output)
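The convolution rule can likewise be sketched as a function. This is an illustrative approximation; the formula as given counts elements without an explicit elsize term, so uint8 is assumed here:

```python
def round_up(value, multiple):
    return -(-value // multiple) * multiple

TILING_DIMENSION = 8   # 8-row tile height (tile is WxHxD = 8x8x32)
NUM_CHANNELS = 32      # minimum output-channel chunk

def conv_fits(in_width, in_depth, out_width,
              filter_w, filter_h, filter_d, tcm_size):
    """Approximate the convolution TCM-fit rule above (uint8 assumed)."""
    in_width = round_up(in_width, 8)
    out_width = round_up(out_width, 8)
    in_depth = round_up(in_depth, 32)
    filter_d = round_up(filter_d, 32)
    footprint = (in_width * in_depth * filter_h * TILING_DIMENSION
                 + filter_w * filter_h * filter_d * NUM_CHANNELS
                 + out_width * TILING_DIMENSION)
    return footprint <= tcm_size // 2
```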

  • GlobalAvgPool
    • ACTIVATION_WIDTH * 256 * elsize <= TCM_SIZE/2

    • ACTIVATION_WIDTH needs to be rounded up to the nearest multiple of 8 (uint8) / 4 (uint16)

  • Concat
    • If the dimension along the concatenation axis of every input except the last is a multiple of

      • 8, if the axis being concatenated on is height,

      • 8 (uint8) / 4 (uint16), if the axis being concatenated on is width, or

      • 32, if the axis being concatenated on is depth,

      then only the concat’s output counts, rounded up to 8 (uint8) / 4 (uint16) in width, 8 in height, and 32 in depth.

    • If the above constraint is not met, then the footprint is the concat’s output rounded up to 8 (uint8) / 4 (uint16) in width, 8 in height, and 32 in depth, plus the sum of all the concat’s inputs, each rounded up to 8 (uint8) / 4 (uint16) in width, 8 in height, and 32 in depth.
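The concat rule can be sketched as follows (uint8 assumed, so the width multiple is 8; `concat_footprint` is an illustrative name, not a firmware API):

```python
def round_up(value, multiple):
    return -(-value // multiple) * multiple

ALIGN = {"width": 8, "height": 8, "depth": 32}  # uint8 rounding multiples

def rounded_volume(width, height, depth):
    """Volume of an activation after the per-dimension rounding rules."""
    return (round_up(width, ALIGN["width"])
            * round_up(height, ALIGN["height"])
            * round_up(depth, ALIGN["depth"]))

def concat_footprint(inputs, axis):
    """inputs: list of (width, height, depth); axis: 'width'/'height'/'depth'.
    If every input but the last is aligned on the concat axis, only the
    rounded-up output counts; otherwise the rounded-up inputs count too."""
    idx = {"width": 0, "height": 1, "depth": 2}[axis]
    out = list(inputs[0])
    out[idx] = sum(shape[idx] for shape in inputs)
    size = rounded_volume(*out)
    if any(shape[idx] % ALIGN[axis] for shape in inputs[:-1]):
        size += sum(rounded_volume(*shape) for shape in inputs)
    return size
```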

  • NMS (Non-Maximum Suppression)
    • Needs all the data to fit in <= TCM_SIZE/2

  • TopK
    • For TopK Accuracy classification score computation, the following must be taken into consideration:

      • AXIS_DIM*256*elsize*2 <= TCM_SIZE/2