Tensors and Memory Layout

Details about Memory Layout

Memory layouts define how a Tensor's data is laid out in memory in HTP Core. There are many different memory layouts: d32 layout, crouton layout, flat layout, and specific layouts for weights in convolutions. A memory layout has a rank, an optional set of fixed sizes describing how dimensions are chunked, and an order in which those chunks are laid out beside each other.

Examples

Flat Layout

FlatMemoryLayout<4> can be thought of as ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0>

  • 4, the zeroth parameter, is the rank of the layout

  • The rest of the parameters are done in (dimension,size) pairs, and it’s easiest to explain the pairs right-to-left:

  • 3,0 means “all the rest of dimension 3”

  • 2,0 means “all the rest of dimension 2”

  • 1,0 means “all the rest of dimension 1”

  • 0,0 means “all the rest of dimension 0”

What the above explanation means is that dimension 3 is the fastest-moving dimension, dimension 2 is the second fastest, then 1, then 0. If the tensor's dimensions are 2x3x5x30, then the data is laid out like the following:

(0,0,0,0), (0,0,0,1) ... (0,0,0,29),
(0,0,1,0), (0,0,1,1) ... (0,0,1,29),
...
(0,0,4,0), (0,0,4,1), ... (0,0,4,29),
(0,1,0,0), (0,1,0,1), ... (0,1,0,29),
(0,1,1,0), (0,1,1,1), ... (0,1,1,29),
...
(0,2,1,0), (0,2,1,1), ... (0,2,1,29),
...
(0,2,4,0), (0,2,4,1), ... (0,2,4,29),
(1,0,0,0), (1,0,0,1), ... (1,0,0,29),
...
(1,2,4,0), (1,2,4,1), ... (1,2,4,29),

Most commonly, rank-4 tensors with “NHWC” format are used:

  • dimension 3 is depth or channels

  • dimension 2 is width

  • dimension 1 is height

  • dimension 0 is batches

So in the above example, the numbers indicate: (batch index, height index, width index, depth index). Ellipses elide runs of data; newlines are only there for readability. The entire flat layout is contiguous in memory.

If the user wants the dimensions to mean “NHWC”, but really wants the data laid out in memory as “NCHW”, the Memory Layout can express this: ChunkedMemoryLayout<4, 0,0, 3,0, 1,0, 2,0> represents it.

By changing the MemoryLayout, the user can change how data is organized in memory without changing how ops using the basic tensor interfaces work, while the C++ infrastructure guarantees type safety (so that the user doesn't feed NHWC data to an op expecting NCHW format, for example).

Crouton Layout

While the flat memory format is good for interaction with other environments, the user might want memory in a format highly amenable to how the hardware works with it. This requires making the data more uniform in size and ensuring that the data being run through a computation together is contiguous in memory.

Crouton layout R4CroutonLayout is ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,8, 3,32>

  • 4, the zeroth parameter, is the rank of the layout

  • The rest of the parameters are done in (dimension,size) pairs, and it’s easiest to explain the pairs right-to-left:

  • 3,32 means 32 contiguous elements in dimension 3

  • 2,8 means 8 contiguous chunks of everything to the right in dimension 2

  • 1,8 means 8 contiguous chunks of everything to the right in dimension 1

  • 3,0 means “all the rest of dimension 3”

  • 2,0 means “all the rest of dimension 2”

  • 1,0 means “all the rest of dimension 1”

  • 0,0 means “all the rest of dimension 0”

The numbers 1,8, 2,8, 3,32 mean that the croutons have a chunk size of 1x8x8x32. If a dimension is smaller than its chunk size, it is padded up to it.

For example, if the tensor dimension is 1x3x5x30, the data gets padded to 1x8x8x32, and then the data is laid out like the following:

(0,0,0,0), (0,0,0,1), ... (0,0,0,29),  (0,0,0,30), (0,0,0,31),

(0,0,0,0) to (0,0,0,29) is valid data; (0,0,0,30) and (0,0,0,31) are pad introduced by the memory layout

(0,0,1,0), (0,0,1,1), ... (0,0,1,29), (0,0,1,30), (0,0,1,31),
...
(0,0,4,0), (0,0,4,1), ... (0,0,4,29), (0,0,4,30), (0,0,4,31),

(0,0,4,0) to (0,0,4,29) is valid data; (0,0,4,30), (0,0,4,31), and (0,0,5,0) ... (0,0,7,31) are pad introduced by the memory layout

(0,0,5,0), (0,0,5,1), ... (0,0,5,31),
...
(0,0,7,0), (0,0,7,1), ... (0,0,7,31),
(0,1,0,0), (0,1,0,1), ... (0,1,0,31),
(0,1,1,0), (0,1,1,1), ... (0,1,1,31),
...
(0,2,1,0), (0,2,1,1), ... (0,2,1,31),
...
(0,2,4,0), (0,2,4,1), ... (0,2,4,29), (0,2,4,30), (0,2,4,31)

(0,2,4,0) to (0,2,4,29) is valid data; (0,2,4,30) ... (0,2,7,31) and (0,3,0,0) ... (0,7,7,31) are pad introduced by the memory layout

This explains what 1,8, 2,8, 3,32 means: it describes how data is laid out in fixed-size chunks. However, the order of those chunks in memory also needs to be determined.

Similar to the FlatMemoryLayout above, the user can define an arbitrary order for those chunks. That is what the 0-sized dimensions in the MemoryLayout mean.

So given the example here, where the ordering is 0,0, 1,0, 2,0, 3,0, the chunks along dimension 3 are ordered together first, followed by all the chunks required to cover dimension 2, and so on.

If there is a 2x9x20x50 tensor, for example, it gets padded to 2x16x24x64. It would go in memory:

(0,0,0,0) ... (0,0,0,31)
(0,0,1,0) ... (0,0,1,31)
...
(0,0,7,0) ... (0,0,7,31)
(0,1,0,0) ... (0,1,7,31)
...
(0,7,7,0) ... (0,7,7,31) > end of chunk


(0,0,0,32) ... (0,0,0,63) > start of chunk
(0,0,1,32) ... (0,0,1,63)
...
(0,0,7,32) ... (0,0,7,63)
(0,1,0,32) ... (0,1,7,63)
...
(0,7,7,32) ... (0,7,7,63) > end of chunk, finished traversing all 64 in dimension 3

(0,0,8,0) ...  (0,0,8,31)
(0,0,9,0) ...  (0,0,9,31)
...
(0,0,15,0) ... (0,0,15,31)
(0,1,8,0) ...  (0,1,15,31)
...
(0,7,8,0) ...  (0,7,15,31)

(0,0,8,32) ... (0,0,15,63)
...
(0,7,8,32) ... (0,7,15,63)
(0,0,16,0) ... (0,0,23,63)
...
(0,7,16,32) ... (0,7,23,63) > finished traversing all 24 in dimension 2

(0,8,0,0) ... (0,8,23,63) (0,15,0,0) ... (0,15,23,63) > finished traversing all 16 in dimension 1
(1,0,0,0) ... (1,8,23,63) (1,15,0,0) ... (1,15,23,63) > end of memory layout

Note that the FlatMemoryLayout is just the special case of ChunkedMemoryLayout where the Chunk Size is the minimal one (1 element in every dimension).

Practical tips working with Crouton:

  • chunks are not consecutive in memory (i.e. there are gaps in memory between chunks)

  • usually use get_raw(first element's idx in chunk) to retrieve the starting memory location of such a chunk

  • operations such as aligned copy can’t go across chunks

  • crouton padding is automatic and is 31 (not 0)

  • user padding needs to be explicitly set to quantized 0 (or another specified value)

A more complicated example of a Memory Layout

For convolution, the weight layout is ChunkedMemoryLayout<4, 3,0, 2,0, 0,0, 1,0, 2,8, 3,32, 2,4>. In HTP Core, weight dimension 0 is considered to be filter height, dimension 1 the filter width, dimension 2 matches the number of input channels, and dimension 3 is the number of output channels.

So in the case below, ChunkedMemoryLayout<4, 3,0, 2,0, 0,0, 1,0, 2,8, 3,32, 2,4>, it means:

  • 4, the zeroth parameter, is the rank of the layout

  • The rest of the parameters are done in (dimension,size) pairs, and it’s easiest to explain the pairs right-to-left:

  • 2,4 means 4 contiguous elements in dimension 2 (which matches the input depth)

  • 3,32 means 32 contiguous chunks of everything to the right in dimension 3 (the output depth)

  • 2,8 means 8 contiguous chunks of everything to the right in dimension 2

  • 1,0 means “all the rest of dimension 1”

  • 0,0 means “all the rest of dimension 0”

  • 2,0 means “all the rest of dimension 2”

  • 3,0 means “all the rest of dimension 3”

So if there is a 3x3x32x32 filter, the memory is laid out as follows:

(0,0,0,0), (0,0,1,0), (0,0,2,0), (0,0,3,0),
(0,0,0,1), (0,0,1,1), (0,0,2,1), (0,0,3,1),
(0,0,0,2) ...                    (0,0,3,31),
(0,0,4,0), (0,0,5,0), (0,0,6,0), (0,0,7,0),
(0,0,4,1), (0,0,5,1), (0,0,6,1), (0,0,7,1),
(0,0,4,2) ... (0,0,7,31), ...   (0,0,31,31),
(0,1,0,0), (0,1,1,0), ...       (0,1,31,31),
...
(0,2,0,0), ...                  (0,2,31,31),
(1,0,0,0), ...                  (1,0,31,31)

So the rightmost fixed-size pairs define the block size of a “chunk”. Those chunks are then ordered in the way the computation wants. The 3,0, 2,0, 0,0, 1,0 means that if there are more output channels (dimension 3) or more input channels (dimension 2), those come after the group of blocks x width x height, and that the input channels (dimension 2) are more contiguous than the output channels (dimension 3). “0” here means “not a fixed size, just all the rest of this dimension”.

So a 3x3x32x50 tensor is just fine. It gets padded to 3x3x32x64, and the format says to do (0,0,0,0)...(2,2,31,31) and then (0,0,0,32)...(2,2,31,63).

For even more clarity in this example, if there is a 3x3x64x96 tensor, it would go in memory:

(0,0,0,0)...(2,2,31,31),
(0,0,32,0)...(2,2,63,31),
(0,0,0,32)...(2,2,31,63),
(0,0,32,32)...(2,2,63,63),
(0,0,0,64)...(2,2,31,95),
(0,0,32,64)...(2,2,63,95),

This is because dimension 2 is “more major” (faster-moving) than dimension 3.

Different memory layouts

Based on the above description, here is a quick summary of how each of these layouts is defined:

Type                    Memory Layout

R4FlatMemoryLayout      FlatMemoryLayout<4>
R4NCHWMemoryLayout      ChunkedMemoryLayout<4, 0,0, 3,0, 2,0, 1,0>
R4Depth32MemoryLayout   ChunkedMemoryLayout<4, 0,0, 1,0, 3,0, 2,0, 2,4, 3,32>
R4CroutonLayout         ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,8, 3,32>
R4Crouton4x1Layout      ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,2, 3,32, 2,4>
R4Crouton2x2Layout      ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,4, 2,4, 3,32, 1,2, 2,2>
R4Crouton2Layout        ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,2, 3,32, 2,2>

Details about Tensors

Concrete tensors (real ones) have an underlying type, an interface, padding, and a memory layout.

The underlying type is just that: what kind of data is actually kept in the tensor.

The interface is the way information is encoded/decoded into the underlying data type. For example, PlainInterface just returns the value, but ScaleOffsetInterface applies the offset and scale value (for quantized types).

One of the more interesting parts of a tensor is the Memory Layout. In HTP Core, the user can have arbitrary memory layouts.

Memory Layouts have a set of fixed sizes, which define the size of each chunk. They also have an ordering in which those chunks are arranged to fill out the entire tensor.

Let’s look at some example Crouton formats:

ChunkedMemoryLayout<
    /* RANK */ 4,
    /* Least Major: Batch Dim, all the rest */ 0,0,
    /* Next least major: height, all the rest */ 1,0,
    /* Next least major: width, all the rest */ 2,0,
    /* Next least major: depth, all the rest */ 3,0,
    /* 8 rows high */ 1,8,
    /* 8 columns wide */ 2,8,
    /* 32 channels deep */ 3,32> ChannelMajorCrouton;

ChunkedMemoryLayout<
    /* RANK */ 4,
    /* Least Major: Batch Dim, all the rest */ 0,0,
    /* Next least major: height, all the rest */ 1,0,
    /* Next least major: width, all the rest */ 2,0,
    /* Next least major: depth, all the rest */ 3,0,
    /* 4 high */ 1,4,
    /* 4 wide */ 2,4,
    /* 32 channels deep */ 3,32,
    /* 2 rows */ 1,2,
    /* 2 cols */ 2,2> SpatialXYMajor;

ChunkedMemoryLayout<
    /* RANK */ 4,
    /* Least Major: Batch Dim, all the rest */ 0,0,
    /* Next least major: height, all the rest */ 1,0,
    /* Next least major: width, all the rest */ 2,0,
    /* Next least major: depth, all the rest */ 3,0,
    /* 8 high */ 1,8,
    /* 2 wide */ 2,2,
    /* 32 channels deep */ 3,32,
    /* 4 cols */ 2,4> SpatialXMajor;

The infrastructure supports the use of all of these formats. Generic ops can use any of the formats indicated here.