Scheduling and Allocation

QNN HTP targets high performance through parallelism and careful resource utilization. The initial graph is constructed through a series of append_node calls, after which the graph goes through a prepare phase. With cost and dependency information available, an execution ordering can be determined algorithmically.
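As a rough illustration of this construction model, the sketch below uses a hypothetical `Graph` class (the names here are illustrative, not the actual QNN HTP API): `append_node` records ops and their producer dependencies, and the later prepare phase consumes this dependency information.

```python
# Hypothetical sketch of graph construction; Graph/append_node here are
# illustrative stand-ins, not the real QNN HTP API.
class Graph:
    def __init__(self):
        self.nodes = []   # ops in append order
        self.deps = {}    # op name -> names of producer ops

    def append_node(self, name, inputs=()):
        # Record the op and which earlier ops produce its inputs.
        self.nodes.append(name)
        self.deps[name] = list(inputs)

g = Graph()
g.append_node("conv0")
g.append_node("relu0", inputs=["conv0"])
```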

In QNN HTP, both scheduling and allocation are performed in the Graph::prepare() stage. As an overview, scheduling and allocation proceed as follows:

  1. Memory blocks are registered with the allocator.

    • During prepare, before scheduling and allocation, all blocks of data are registered with the allocator, informing it of their memory type, minimum size, and alignment requirements. The two types of memory blocks are Plain and TCM; TCM here refers to the VTCM.
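A minimal sketch of this registration step, assuming a hypothetical `Allocator` interface (names and the padding policy are illustrative, not the actual QNN HTP API):

```python
from dataclasses import dataclass

PLAIN, TCM = "plain", "tcm"   # TCM blocks live in VTCM

@dataclass
class Block:
    name: str
    mem_type: str   # PLAIN or TCM
    size: int       # minimum size in bytes
    align: int      # alignment requirement in bytes

class Allocator:
    def __init__(self):
        self.blocks = []

    def register(self, name, mem_type, size, align):
        # Round the minimum size up to the alignment so that later
        # offset assignment can pack blocks back to back.
        padded = (size + align - 1) // align * align
        blk = Block(name, mem_type, padded, align)
        self.blocks.append(blk)
        return blk

alloc = Allocator()
alloc.register("weights0", TCM, 1000, 128)   # padded up to 1024
alloc.register("act0", PLAIN, 4096, 64)
```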

  2. Pre-Scheduler fits as much data into TCM as possible.

    • At this point, the scheduler tries to develop a topological ordering that reduces TCM usage by iteratively partitioning the graph at low-TCM boundaries. The output of this step is a “runlist”.
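The flavor of this step can be sketched with a simple greedy variant (this is not the actual QNN HTP algorithm, which partitions at low-TCM boundaries): a Kahn-style topological sort that, among ready ops, prefers the op with the smallest net increase in TCM footprint, keeping the high-water mark low.

```python
from collections import defaultdict

def schedule(ops, deps, tcm_delta):
    """ops: op names; deps: {op: [producers]};
    tcm_delta: {op: net change in TCM bytes when the op runs}.
    Returns a runlist (topological order biased toward low TCM usage)."""
    indeg = {op: len(deps.get(op, [])) for op in ops}
    consumers = defaultdict(list)
    for op, prods in deps.items():
        for p in prods:
            consumers[p].append(op)
    ready = [op for op in ops if indeg[op] == 0]
    runlist = []
    while ready:
        # Prefer ops that free TCM (most negative delta).
        ready.sort(key=lambda op: tcm_delta.get(op, 0))
        op = ready.pop(0)
        runlist.append(op)
        for c in consumers[op]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return runlist
```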

  3. Spill/fill nodes are inserted where necessary.

    • Based on the runlist output by the pre-scheduler, the spill pass adds up the requested VTCM at each op. The VTCM requested at an op can be much higher than what the op itself inputs and outputs, because other blocks of data may still be resident in VTCM: they were output earlier but are not used as input until later. To reduce the required VTCM across ranges of ops, the spill pass inserts spill and fill ops that temporarily copy data out of VTCM to make room for other data, then copy it back in before it is needed.
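A simplified version of such a pass might look like the following (an illustrative sketch, not the real implementation): walk the runlist tracking which blocks are live in VTCM; when the next op would exceed the budget, spill live blocks it does not use, and insert a fill before any op whose input was spilled earlier.

```python
def insert_spills(runlist, uses, outs, size, budget):
    """uses/outs: {op: [block names]}; size: {block: bytes}.
    Returns a new runlist of ("op"|"spill"|"fill", name) entries."""
    live = {}            # block -> bytes currently resident in VTCM
    spilled = set()
    new_list = []
    for op in runlist:
        # Fill back any spilled inputs before the op runs.
        for b in uses[op]:
            if b in spilled:
                new_list.append(("fill", b))
                spilled.discard(b)
                live[b] = size[b]
        need = sum(size[b] for b in outs[op] if b not in live)
        # Spill blocks not used by this op until everything fits.
        for b in list(live):
            if sum(live.values()) + need <= budget:
                break
            if b not in uses[op]:
                new_list.append(("spill", b))
                spilled.add(b)
                del live[b]
        for b in outs[op]:
            live[b] = size[b]
        new_list.append(("op", op))
    return new_list
```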

  4. Some ops are split into launch-wait pairs.

    • Some ops can be run using background resources. Those ops are split into a pair: one op that launches the operation onto a background resource, and one that waits for its completion so that no dependent op starts before its inputs are ready.
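One way to sketch this transformation (illustrative only; the real pass is internal to QNN HTP) is to replace each background-capable op with a launch entry, and to place the matching wait just before the first op that consumes its result, so foreground work can overlap with the background op:

```python
def split_launch_wait(runlist, background, consumers):
    """background: set of ops that can run on a background resource;
    consumers: {op: first op in runlist that uses its outputs}."""
    out = []
    pending = {}   # background op -> consumer its wait must precede
    for op in runlist:
        # Emit any wait whose consumer is about to run.
        for bg, consumer in list(pending.items()):
            if consumer == op:
                out.append(("wait", bg))
                del pending[bg]
        if op in background:
            out.append(("launch", op))
            pending[op] = consumers[op]
        else:
            out.append(("run", op))
    # Any remaining waits go at the end.
    for bg in pending:
        out.append(("wait", bg))
    return out
```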

  5. Offsets are allocated for blocks that reside in VTCM.

    • The allocator takes in the modified runlist after spills and fills have been inserted. Using the requirements registered earlier for each block of data, the allocator assigns each TCM block an offset within VTCM. If two blocks of data do not have to be in VTCM at the same time, the allocator may assign them offsets whose address ranges overlap. This can mean that two ops which could previously have run in either order can no longer be swapped, because swapping would require two overlapping-allocated blocks to be in VTCM simultaneously. The allocator tries to minimize such cases, since these new restrictions can constrain available parallelism.
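The lifetime-overlap idea can be sketched as a simple interval-based assignment (an illustrative sketch, not the actual allocator): each block has a live range over runlist positions, and blocks whose live ranges do not overlap may be given overlapping address ranges.

```python
def assign_offsets(blocks):
    """blocks: list of (name, size, first_use, last_use) tuples, where
    first_use/last_use are runlist positions. Returns {name: offset}."""
    placed = []   # (offset, size, first, last)
    offsets = {}
    for name, size, first, last in sorted(blocks, key=lambda b: b[2]):
        # Address ranges of blocks whose lifetimes overlap this one.
        busy = sorted((off, off + sz) for off, sz, f, l in placed
                      if not (last < f or l < first))
        off = 0
        for lo, hi in busy:
            if off + size <= lo:
                break            # fits in the gap below this range
            off = max(off, hi)   # otherwise move past it
        offsets[name] = off
        placed.append((off, size, first, last))
    return offsets
```

Here C reuses A's offset because their lifetimes are disjoint, while B, which is live alongside both, gets a non-overlapping range.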

  6. Ops are re-scheduled to maximize parallelism.

    • The final scheduler moves some ops earlier or later to increase parallelism while respecting dependencies within the allocated graph. This pass takes in the existing runlist and outputs a new runlist optimized for parallelism. Because the final scheduler runs after allocation, it must obey the restrictions the allocator introduced by placing some blocks at overlapping address ranges within VTCM.
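One ingredient of such a pass can be sketched as follows (illustrative only; the real scheduler must also honor the allocator's overlap restrictions): hoist each launch op earlier past ops it does not depend on, widening the gap between launch and wait so more foreground work overlaps with the background computation.

```python
def hoist_launches(runlist, deps):
    """deps: {op: set of ops that must run before it}.
    Moves each launch op as early as its dependencies allow."""
    out = list(runlist)
    for i, op in enumerate(out):
        if not op.startswith("launch"):
            continue
        j = i
        # Bubble the launch op earlier past independent ops.
        while j > 0 and out[j - 1] not in deps.get(op, set()):
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out
```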