Scheduling and Allocation
QNN HTP achieves high performance in part through parallelism and careful resource utilization. The initial graph is constructed through a series of append_node calls, after which the graph goes through the prepare phase. With cost and dependency information, an execution ordering can be determined algorithmically.
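As a minimal illustration of deriving a valid execution order from dependency information, the sketch below applies Kahn's algorithm to a small dependency graph. The node ids and edge-list representation are hypothetical, not the QNN HTP data structures:

```python
from collections import deque

def topological_order(num_nodes, edges):
    """Return one valid execution order for a DAG.

    edges: list of (producer, consumer) pairs, meaning the consumer
    depends on data output by the producer.
    """
    succs = {n: [] for n in range(num_nodes)}
    indegree = {n: 0 for n in range(num_nodes)}
    for src, dst in edges:
        succs[src].append(dst)
        indegree[dst] += 1

    # Kahn's algorithm: repeatedly emit a node with no pending inputs.
    ready = deque(n for n in range(num_nodes) if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in succs[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    return order

# A small diamond-shaped graph: 0 feeds 1 and 2, both feed 3.
order = topological_order(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
```

Any order satisfying the edges is valid here; which of the valid orders is chosen is where cost information comes in, as described below.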
In QNN HTP, both scheduling and allocation are done in the
Graph::prepare() stage. As an overview, the following occurs with
regard to scheduling and allocation:
Memory blocks are registered with the allocator.
During prepare, before scheduling and allocation, all blocks of data are registered with the allocator, informing the allocator of their memory type and minimum size and alignment requirements. The two types of memory blocks are Plain and TCM. TCM here refers to the VTCM.
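A registration step of this kind might look as follows. The Block and Allocator classes and the register method are illustrative names for this sketch, not the actual QNN HTP API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Block:
    name: str
    mem_type: str   # "Plain" (regular memory) or "TCM" (i.e. VTCM)
    size: int       # minimum size in bytes
    align: int      # minimum alignment in bytes

@dataclass
class Allocator:
    blocks: list = field(default_factory=list)

    def register(self, block):
        # Registration only records requirements; offsets are assigned
        # later, after spill/fill insertion is known.
        assert block.mem_type in ("Plain", "TCM")
        assert block.align > 0 and block.align & (block.align - 1) == 0
        self.blocks.append(block)

alloc = Allocator()
alloc.register(Block("weights", "Plain", size=4096, align=64))
alloc.register(Block("act0", "TCM", size=2048, align=128))
tcm_blocks = [b for b in alloc.blocks if b.mem_type == "TCM"]
```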
Pre-Scheduler fits as much data into TCM as possible.
At this point, the scheduler tries to develop a topological ordering which reduces TCM usage by iteratively partitioning the graph at low-TCM boundaries. This outputs a “runlist”.
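A greatly simplified model of TCM-aware ordering is sketched below. The greedy cost model and the tcm_delta bookkeeping are assumptions for illustration only; the actual pre-scheduler partitions the graph at low-TCM boundaries rather than picking ops one at a time:

```python
def tcm_aware_order(deps, tcm_delta):
    """Greedy topological ordering that prefers low-TCM states.

    deps:      dict node -> set of nodes it depends on.
    tcm_delta: dict node -> change in live TCM bytes when the node
               runs (positive when it produces TCM data, negative
               when its inputs can be freed).
    Among the nodes whose inputs are ready, pick the one that leaves
    live TCM usage lowest; ties broken by node id for determinism.
    """
    pending = {n: set(d) for n, d in deps.items()}
    done, order, live = set(), [], 0
    while pending:
        ready = [n for n, d in pending.items() if d <= done]
        n = min(ready, key=lambda n: (live + tcm_delta[n], n))
        order.append(n)
        live += tcm_delta[n]
        done.add(n)
        del pending[n]
    return order

# After node 0, nodes 1 and 2 are both ready; node 2 frees TCM while
# node 1 grows it, so the greedy pass runs node 2 first.
runlist = tcm_aware_order(
    deps={0: set(), 1: {0}, 2: {0}, 3: {1, 2}},
    tcm_delta={0: 100, 1: 200, 2: -100, 3: 0},
)
```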
Spill/fill nodes are inserted where necessary.
Based on the runlist output by the pre-scheduler, the spill pass adds up the requested VTCM at each op. It is possible for the requested VTCM at an op to be much higher than what is input and output by the op itself, since other blocks of data might still be in VTCM that were output earlier and not used as input until later. To reduce the required VTCM usage across ranges of ops, the spill pass inserts spill and fill ops that copy data out of VTCM temporarily to make room for other data and then copy it back in before it is needed later.
Some ops are split into launch-wait pairs.
Some ops have the ability to be run using background resources. Here, those ops are split into pairs: one node that launches the operation onto a background resource and one that waits for its completion, so that no other op can start before its inputs are ready.
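The split itself is mechanical, as this sketch shows. The op names and the background_ops set are made up for illustration:

```python
def split_background_ops(runlist, background_ops):
    """Replace each op that can run on a background resource with a
    (launch, wait) pair: launch starts the op asynchronously, and
    wait blocks until it completes so consumers never see partial
    inputs.  A later pass can then hoist the launch earlier or sink
    the wait later to overlap the op with foreground work."""
    out = []
    for op in runlist:
        if op in background_ops:
            out.append(("launch", op))
            out.append(("wait", op))
        else:
            out.append(("run", op))
    return out

split = split_background_ops(
    ["conv", "dma_copy", "relu"], background_ops={"dma_copy"}
)
```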
Offsets are allocated for blocks that reside in VTCM.
The allocator takes in the modified runlist after spills and fills have been inserted. Using the requirements for each block of data that were registered earlier, the allocator assigns offsets to each TCM block within VTCM. If two blocks of data do not have to be in VTCM at the same time, then the allocator might assign offsets to those two blocks of data such that their address ranges overlap. This can cause the situation where two ops that could have been rearranged in any order can no longer be swapped, because doing so would cause some blocks of data that were allocated in an overlapping manner to be needed in VTCM at the same time. The allocator tries to reduce this situation where it can, since these new restrictions can constrain available parallelism.
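The lifetime-aware overlap can be sketched with a simple linear-scan placement. The block tuples and the scan policy are assumptions for illustration, not the actual allocator:

```python
def assign_offsets(blocks):
    """Linear-scan style VTCM offset assignment.

    blocks: list of (name, size, first_use, last_use), with uses
    given as runlist indices.  Two blocks whose live ranges never
    overlap in time may be given overlapping address ranges; blocks
    live at the same time must not overlap in addresses.
    """
    placed = []   # (offset, size, first_use, last_use)
    offsets = {}
    for name, size, first, last in sorted(blocks, key=lambda b: b[2]):
        offset = 0
        # Bump past every placed block that overlaps both in lifetime
        # and in the candidate address range (scanned by offset).
        for o, s, f, l in sorted(placed):
            if f <= last and first <= l and o < offset + size and offset < o + s:
                offset = o + s
        placed.append((offset, size, first, last))
        offsets[name] = offset
    return offsets

# "x" and "y" are never live together, so they can share offset 0;
# "z" is live alongside both and must be placed above them.
offs = assign_offsets([
    ("x", 128, 0, 2),
    ("y", 128, 3, 5),
    ("z", 64, 1, 4),
])
```

Note the side effect described above: because "x" and "y" share addresses, any transformation that made them live at the same time is now illegal, even if no data dependency forbids it.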
Ops are re-scheduled to maximize parallelism.
The final scheduler moves some ops earlier or later to increase parallelism while respecting dependencies within the allocated graph. This pass takes in the existing runlist and outputs a new runlist that has been optimized for parallelism. Because the final scheduler runs after allocation has been performed, it must obey the restrictions the allocator introduced by allocating some blocks at overlapping address ranges within VTCM.
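The payoff of this pass can be modeled with an as-soon-as-possible timing sketch. This is a toy model with unlimited engines, not the actual scheduler; ops whose start/finish intervals overlap run in parallel, and the deps input must include the allocator-introduced orderings as well as data dependencies:

```python
def schedule_cycles(runlist, deps, dur):
    """Greedy ASAP schedule: each op starts once all its predecessors
    (data deps plus allocator-introduced orderings) have finished.
    Returns op -> start cycle."""
    start, finish = {}, {}
    for op in runlist:
        t = max((finish[p] for p in deps.get(op, ())), default=0)
        start[op] = t
        finish[op] = t + dur[op]
    return start

# A 3-cycle background "dma" op only depends on "a", so it overlaps
# with "b"; "c" must wait for both "b" and the dma to finish.
start = schedule_cycles(
    runlist=["a", "dma", "b", "c"],
    deps={"dma": {"a"}, "b": {"a"}, "c": {"b", "dma"}},
    dur={"a": 1, "dma": 3, "b": 1, "c": 1},
)
```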