Tensors and Memory Layout¶
Details about Memory Layout¶
Memory layouts describe how a Tensor's data is arranged in memory in HTP
Core. There are many different memory layouts: the d32 layout, the
crouton layout, the flat layout, and specific layouts for convolution
weights. A memory layout has a rank, an optional specification of how
dimensions should be chunked into fixed-size blocks, and an order in
which those chunks are laid out beside each other.
Examples¶
Flat Layout¶
FlatMemoryLayout<4> can be thought of as
ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0>

- 4, the zeroth parameter, is the rank of the layout.
- The rest of the parameters come in (dimension, size) pairs, and it's easiest to explain the pairs right-to-left:
  - 3,0 means "all the rest of dimension 3"
  - 2,0 means "all the rest of dimension 2"
  - 1,0 means "all the rest of dimension 1"
  - 0,0 means "all the rest of dimension 0"
What the above explanation means is that dimension 3 is the fastest
moving dimension, dimension 2 is the second fastest, then 1, then
0. If the tensor's dimensions are 2x3x5x30, the data is laid
out like the following:
(0,0,0,0), (0,0,0,1) ... (0,0,0,29),
(0,0,1,0), (0,0,1,1) ... (0,0,1,29),
...
(0,0,4,0), (0,0,4,1), ... (0,0,4,29),
(0,1,0,0), (0,1,0,1), ... (0,1,0,29),
(0,1,1,0), (0,1,1,1), ... (0,1,1,29),
...
(0,2,1,0), (0,2,1,1), ... (0,2,1,29),
...
(0,2,4,0), (0,2,4,1), ... (0,2,4,29),
(1,0,0,0), (1,0,0,1), ... (1,0,0,29),
...
(1,2,4,0), (1,2,4,1), ... (1,2,4,29),
Most commonly, rank-4 tensors with "NHWC" format are used:

- dimension 3 is depth or channels
- dimension 2 is width
- dimension 1 is height
- dimension 0 is batches
So in the above example, the numbers indicate
(batch index, height index, width index, depth index). Ellipses
indicate elided elements, and newlines are only there for readability;
the data is contiguous throughout.
If the user wants the dimensions to mean "NHWC" format, but really
wants the data laid out in memory as "NCHW", the Memory Layout can
do this: ChunkedMemoryLayout<4, 0,0, 3,0, 1,0, 2,0> is a way to
represent this.
By changing the MemoryLayout, the user can change how data is organized
in memory without changing how ops using the basic tensor interfaces
work, while having the C++ infrastructure guarantee type safety
(so that users don't feed NHWC data to an op expecting NCHW
format, for example).
Crouton Layout¶
While the flat memory format is good for interaction with other environments, users might prefer memory to be in a format highly amenable to how the hardware works with it. This requires making the data more uniform in size and ensuring that the data being run through a computation together is contiguous in memory.
Crouton layout R4CroutonLayout is
ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,8, 3,32>

- 4, the zeroth parameter, is the rank of the layout.
- The rest of the parameters come in (dimension, size) pairs, and it's easiest to explain the pairs right-to-left:
  - 3,32 means 32 elements in dimension 3
  - 2,8 means 8 contiguous chunks of everything to the right in dimension 2
  - 1,8 means 8 contiguous chunks of everything to the right in dimension 1
  - 3,0 means "all the rest of dimension 3"
  - 2,0 means "all the rest of dimension 2"
  - 1,0 means "all the rest of dimension 1"
  - 0,0 means "all the rest of dimension 0"
The numbers 1,8, 2,8, 3,32 mean that the croutons have a chunk size
of 1x8x8x32. If a dimension is not a multiple of its chunk size (8 or
32 here), it is padded up to the next multiple.
For example, if the tensor dimensions are 1x3x5x30, the data gets
padded to 1x8x8x32, and then the data is laid out like the
following:
(0,0,0,0), (0,0,0,1), ... (0,0,0,29), (0,0,0,30), (0,0,0,31),
(0,0,0,0) to (0,0,0,29) is valid data; (0,0,0,30) and (0,0,0,31) are pad introduced by the memory layout
(0,0,1,0), (0,0,1,1), ... (0,0,1,29), (0,0,1,30), (0,0,1,31),
...
(0,0,4,0), (0,0,4,1), ... (0,0,4,29), (0,0,4,30), (0,0,4,31),
(0,0,4,0) to (0,0,4,29) is valid data; (0,0,4,30), (0,0,4,31), and (0,0,5,0) ... (0,0,7,31) are pad introduced by the memory layout
(0,0,5,0), (0,0,5,1), ... (0,0,5,31),
...
(0,0,7,0), (0,0,7,1), ... (0,0,7,31),
(0,1,0,0), (0,1,0,1), ... (0,1,0,31),
(0,1,1,0), (0,1,1,1), ... (0,1,1,31),
...
(0,2,1,0), (0,2,1,1), ... (0,2,1,31),
...
(0,2,4,0), (0,2,4,1), ... (0,2,4,29), (0,2,4,30), (0,2,4,31)
(0,2,4,0) to (0,2,4,29) is valid data; (0,2,4,30) ... (0,2,7,31) and (0,3,0,0) ... (0,7,7,31) are pad introduced by the memory layout
This explains what 1,8, 2,8, 3,32 means: it describes how data is laid
out in fixed-size chunks. However, the order of those chunks in memory
also needs to be determined.
Similar to the FlatMemoryLayout above, the user can define an arbitrary
order for those chunks. That is what the 0-sized dimensions in the
MemoryLayout mean.
So in the example here, where the ordering is 0,0, 1,0, 2,0, 3,0,
chunks along dimension 3 are ordered together first, followed by all
the chunks required to cover dimension 2, and so on.
If there is a 2x9x20x50 tensor, for example, it gets padded to
2x16x24x64. It would go in memory:
(0,0,0,0) ... (0,0,0,31)
(0,0,1,0) ... (0,0,1,31)
...
(0,0,7,0) ... (0,0,7,31)
(0,1,0,0) ... (0,1,7,31)
...
(0,7,7,0) ... (0,7,7,31) > end of chunk
(0,0,0,32) ... (0,0,0,63) > start of chunk
(0,0,1,32) ... (0,0,1,63)
...
(0,0,7,32) ... (0,0,7,63)
(0,1,0,32) ... (0,1,7,63)
...
(0,7,7,32) ... (0,7,7,63) > end of chunk, finished traversing all 64 in dimension 3
(0,0,8,0) ... (0,0,8,31)
(0,0,9,0) ... (0,0,9,31)
...
(0,0,15,0) ... (0,0,15,31)
(0,1,8,0) ... (0,1,15,31)
...
(0,7,8,0) ... (0,7,15,31)
(0,0,8,32) ... (0,0,15,63)
...
(0,7,8,32) ... (0,7,15,63)
(0,0,16,0) ... (0,0,23,63)
...
(0,7,16,32) ... (0,7,23,63) > finished traversing all 24 in dimension 2
(0,8,0,0) ... (0,8,23,63), ..., (0,15,0,0) ... (0,15,23,63) > finished traversing all 16 in dimension 1
(1,0,0,0) ... (1,8,23,63), ..., (1,15,0,0) ... (1,15,23,63) > end of memory layout
Note that the FlatMemoryLayout is just the special case of
ChunkedMemoryLayout where the Chunk Size is the minimal one (1
element in every dimension).
Practical tips for working with Crouton¶

- Chunks are not consecutive in memory (i.e. there is a gap in memory between each chunk).
- Usually use get_raw(first element's idx in chunk) to retrieve the start memory location of such a chunk.
- Operations such as aligned copies can't go across chunks.
- Crouton padding is automatic and is 31 (not 0).
- User padding needs to be explicitly set to quantized 0 (or another specified value).
A more complicated example of Memory Layout¶
For convolution, the weight layout is
ChunkedMemoryLayout<4, 3,0, 2,0, 0,0, 1,0, 2,8, 3,32, 2,4>. In HTP
Core, weight dimension 0 is considered to be the filter height,
dimension 1 the filter width, dimension 2 matches the number
of input channels, and dimension 3 is the number of output
channels.
So in the case of
ChunkedMemoryLayout<4, 3,0, 2,0, 0,0, 1,0, 2,8, 3,32, 2,4>, it
means:

- 4, the zeroth parameter, is the rank of the layout.
- The rest of the parameters come in (dimension, size) pairs, and it's easiest to explain the pairs right-to-left:
  - 2,4 means 4 contiguous elements in dimension 2 (which matches the input depth)
  - 3,32 means 32 contiguous chunks of everything to the right in dimension 3 (the output depth)
  - 2,8 means 8 contiguous chunks of everything to the right in dimension 2
  - 1,0 means "all the rest of dimension 1"
  - 0,0 means "all the rest of dimension 0"
  - 2,0 means "all the rest of dimension 2"
  - 3,0 means "all the rest of dimension 3"
So if there is a 3x3x32x32 filter, the memory is laid out as
follows:
(0,0,0,0), (0,0,1,0), (0,0,2,0), (0,0,3,0),
(0,0,0,1), (0,0,1,1), (0,0,2,1), (0,0,3,1),
(0,0,0,2) ... (0,0,3,31),
(0,0,4,0), (0,0,5,0), (0,0,6,0), (0,0,7,0),
(0,0,4,1), (0,0,5,1), (0,0,6,1), (0,0,7,1),
(0,0,4,2) ... (0,0,7,31), ... (0,0,31,31),
(0,1,0,0), (0,1,1,0), ... (0,1,31,31),
...
(0,2,0,0), ... (0,2,31,31),
(1,0,0,0), ... (1,0,31,31)
So the rightmost fixed-size pairs indicate the block size of a
"chunk". Those chunks are then ordered in the desired way for
computation. The 3,0, 2,0, 0,0, 1,0 just means that if there are
more output channels (dimension 3) or more input channels (dimension
2), those come after the group of blocks x width x height, and
that the input channels (dim 2) are more contiguous than the output
channels (dim 3). "0" here means "not a fixed size, just all the
rest of this dimension".
So a 3x3x32x50 tensor is just fine. It will get padded to
3x3x32x64, and the format says to lay out (0,0,0,0)...(2,2,31,31) and
then (0,0,0,32)...(2,2,31,63).
For even more clarity, if there is a 3x3x64x96 tensor,
it would go in memory:
(0,0,0,0)...(2,2,31,31),
(0,0,32,0)...(2,2,63,31),
(0,0,0,32)...(2,2,31,63),
(0,0,32,32)...(2,2,63,63),
(0,0,0,64)...(2,2,31,95),
(0,0,32,64)...(2,2,63,95),
Because dimension 2 is “more major” than dimension 3.
Different memory layouts¶
Based on the above description, here is a quick summary of how each of these layouts is arranged in memory:
| Type | Memory Layout |
|---|---|
| FlatMemoryLayout<4> | ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0> |
| R4CroutonLayout | ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,8, 3,32> |
| ChannelMajorCrouton | ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,8, 2,8, 3,32> |
| SpatialXYMajor | ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,4, 2,4, 3,32, 1,2, 2,2> |
| SpatialXMajor | ChunkedMemoryLayout<4, 0,0, 1,0, 2,0, 3,0, 1,4, 2,2, 3,32, 2,4> |
| Convolution weights | ChunkedMemoryLayout<4, 3,0, 2,0, 0,0, 1,0, 2,8, 3,32, 2,4> |
Details about Tensors¶
Concrete tensors (real ones) have things like an underlying type, interface, padding, and memory layout.
The underlying type is just that: what kind of data is actually kept in the tensor.
The interface is the way information is encoded/decoded into the
underlying data type. For example, PlainInterface just returns the
value, but ScaleOffsetInterface applies the offset and scale value
(for quantized types).
One of the more interesting parts of a tensor is the Memory Layout. In HTP Core, users can have arbitrary memory layouts.
Memory Layouts have a set of fixed sizes, which define the size of each chunk. They also define the ordering in which those chunks are arranged to fill out the entire tensor.
Let’s look at some example Crouton formats:
ChunkedMemoryLayout<
/* RANK */ 4,
/* Least Major: Batch Dim, all the rest */ 0,0,
/* Next least major: height, all the rest */ 1,0,
/* Next least major: width, all the rest */ 2,0,
/* Next least major: depth, all the rest */ 3,0,
/* 8 rows high */ 1,8,
/* 8 columns wide */ 2,8,
/* 32 channels deep */ 3,32> ChannelMajorCrouton;
ChunkedMemoryLayout<
/* RANK */ 4,
/* Least Major: Batch Dim, all the rest */ 0,0,
/* Next least major: height, all the rest */ 1,0,
/* Next least major: width, all the rest */ 2,0,
/* Next least major: depth, all the rest */ 3,0,
/* 4 high */ 1,4,
/* 4 wide */ 2,4,
/* 32 channels deep */ 3,32,
/* 2 rows */ 1,2,
/* 2 cols */ 2,2> SpatialXYMajor;
ChunkedMemoryLayout<
/* RANK */ 4,
/* Least Major: Batch Dim, all the rest */ 0,0,
/* Next least major: height, all the rest */ 1,0,
/* Next least major: width, all the rest */ 2,0,
/* Next least major: depth, all the rest */ 3,0,
/* 4 high */ 1,4,
/* 2 wide */ 2,2,
/* 32 channels deep */ 3,32,
/* 4 cols */ 2,4> SpatialXMajor;
The infrastructure supports the use of all of these formats. Generic ops can use any of the formats indicated here.