Tutorial - Running Inference Using Shared Memory

Qualcomm® AI Engine Direct Delegate provides APIs for users to allocate specified tensors, usually graph inputs and outputs, on shared memory. This avoids the cost of copying large tensors between TFLite CPU buffers and Qualcomm® AI Engine Direct, and can accelerate inference.

This feature is currently supported only with the HTP backend.

Users are responsible for managing shared memory resources themselves. See TfLiteQnnDelegateAllocCustomMem and TfLiteQnnDelegateFreeCustomMem for more information.

The TFLite interpreter provides the SetCustomAllocationForTensor API to set a custom memory allocation for a given tensor. Call AllocateTensors after setting the custom allocation to ensure there are no invalid or insufficient buffers.

A fully delegated model with large graph inputs/outputs benefits the most.

Workflow of using shared memory

To create an application using shared memory, we prescribe the pattern below:

  1. Request enough memory space on shared memory

  2. Set the custom allocation tensor info

  3. Assign a custom memory allocation for the given tensor

  4. Free the allocated memory at the end

Step 1: Request enough memory space on shared memory

void* custom_ptr = TfLiteQnnDelegateAllocCustomMem(num_bytes, tflite::kDefaultTensorAlignment);

num_bytes: The allocation size in bytes; it must match or exceed the tensor's byte size.

tflite::kDefaultTensorAlignment: TfLite default alignment.

custom_ptr: Pointer to the shared buffer on success; NULL on failure.

Step 2: Set the custom allocation tensor info

TfLiteCustomAllocation tensor_alloc = {custom_ptr, num_bytes};

Wrap the shared buffer and tensor bytes together as a TfLiteCustomAllocation.

Step 3: Assign a custom memory allocation for the given tensor

interpreter_->SetCustomAllocationForTensor(tensor_idx, tensor_alloc);

tensor_idx: Tensor index

tensor_alloc: TfLiteCustomAllocation

Step 4: Free the allocated memory at the end

TfLiteQnnDelegateFreeCustomMem(custom_ptr);

custom_ptr: Allocated shared buffer pointer.

A Running Example of using shared memory

This tutorial demonstrates how to run a model using shared memory.

#include "QNN/TFLiteDelegate/QnnTFLiteDelegate.h"

// Setup interpreter with .tflite model.

// Create QNN Delegate options structure.
TfLiteQnnDelegateOptions options = TfLiteQnnDelegateOptionsDefault();

// Set the mandatory backend_type option as HTP.
options.backend_type = kHtpBackend;

// Instantiate delegate. Must not be freed until interpreter is freed.
// Please use QNN Delegate interface rather than external delegate interface.
TfLiteDelegate* delegate = TfLiteQnnDelegateCreate(&options);

// Allocate enough memory space on shared memory
void* custom_ptr = TfLiteQnnDelegateAllocCustomMem(num_bytes, tflite::kDefaultTensorAlignment);

// Assign (or reassign) the custom memory allocation for the given tensor,
// then re-allocate tensors.
TfLiteCustomAllocation tensor_alloc = {custom_ptr, num_bytes};
interpreter_->SetCustomAllocationForTensor(tensor_idx, tensor_alloc);
interpreter_->AllocateTensors();

// Register QNN Delegate with TfLite interpreter to automatically delegate nodes.
interpreter_->ModifyGraphWithDelegate(delegate);

// Perform inference with interpreter as usual.
interpreter_->Invoke();

// User is responsible to free the allocated memory.
TfLiteQnnDelegateFreeCustomMem(custom_ptr);

// Delete delegate after interpreter no longer needed.
TfLiteQnnDelegateDelete(delegate);

The output should look like:

INFO: Initialized TensorFlow Lite runtime.
INFO: TfLiteQnnDelegate delegate: 128 nodes delegated out of 128 nodes with 1 partitions.

INFO: Replacing 128 node(s) with delegate (TfLiteQnnDelegate) node, yielding 1 partitions.
INFO: Tensor 0 is successfully registered to shared memory.
INFO: Tensor 319 is successfully registered to shared memory.