Tutorial - Running Inference Using Shared Memory¶
The Qualcomm® AI Engine Direct Delegate provides APIs that let users allocate specified tensors, usually graph inputs and outputs, on shared memory. This avoids copying large tensors between the TFLite CPU and Qualcomm® AI Engine Direct, and can accelerate inference.
This feature currently works only with the HTP backend.
Users are responsible for managing the shared memory resources themselves.
See TfLiteQnnDelegateAllocCustomMem and TfLiteQnnDelegateFreeCustomMem for more information.
A TFLite interpreter provides the SetCustomAllocationForTensor API to set a custom memory allocation
for a given tensor. Call AllocateTensors after setting the custom allocation to
ensure there are no invalid or insufficient buffers.
A fully delegated model with large graph inputs/outputs benefits the most.
Workflow of using shared memory¶
To create an application that uses shared memory, we prescribe the following pattern:
Step 1: Try to request enough memory space on shared memory¶
void* custom_ptr = TfLiteQnnDelegateAllocCustomMem(num_bytes, tflite::kDefaultTensorAlignment);
num_bytes: The exact, or at least sufficient, number of bytes for the tensor.
tflite::kDefaultTensorAlignment: TfLite default alignment.
custom_ptr: Pointer to the shared buffer on success; NULL on failure.
Step 2: Set the custom allocation tensor info¶
TfLiteCustomAllocation tensor_alloc = {custom_ptr, num_bytes};
Wrap the shared buffer and tensor bytes together as a TfLiteCustomAllocation.
Step 3: Assign a custom memory allocation for the given tensor¶
interpreter_->SetCustomAllocationForTensor(tensor_idx, tensor_alloc);
tensor_idx: Index of the tensor to bind.
tensor_alloc: The TfLiteCustomAllocation created in Step 2.
Step 4: Free the allocated buffer at the end¶
TfLiteQnnDelegateFreeCustomMem(custom_ptr);
custom_ptr: Allocated shared buffer pointer.
A Running Example of using shared memory¶
This tutorial demonstrates how to run a model using shared memory.
#include "QNN/TFLiteDelegate/QnnTFLiteDelegate.h"
// Setup interpreter with .tflite model.
// Create QNN Delegate options structure.
TfLiteQnnDelegateOptions options = TfLiteQnnDelegateOptionsDefault();
// Set the mandatory backend_type option as HTP.
options.backend_type = kHtpBackend;
// Instantiate delegate. Must not be freed until interpreter is freed.
// Please use QNN Delegate interface rather than external delegate interface.
TfLiteDelegate* delegate = TfLiteQnnDelegateCreate(&options);
// Allocate enough memory space on shared memory
void* custom_ptr = TfLiteQnnDelegateAllocCustomMem(num_bytes, tflite::kDefaultTensorAlignment);
// Assigns (or reassigns) a custom memory allocation for the given tensor and re-allocate tensors.
TfLiteCustomAllocation tensor_alloc = {custom_ptr, num_bytes};
interpreter_->SetCustomAllocationForTensor(tensor_idx, tensor_alloc);
interpreter_->AllocateTensors();
// Register QNN Delegate with TfLite interpreter to automatically delegate nodes.
interpreter_->ModifyGraphWithDelegate(delegate);
// Perform inference with interpreter as usual.
interpreter_->Invoke();
// The user is responsible for freeing the allocated memory.
TfLiteQnnDelegateFreeCustomMem(custom_ptr);
// Delete delegate after interpreter no longer needed.
TfLiteQnnDelegateDelete(delegate);
The output should look like:
INFO: Initialized TensorFlow Lite runtime.
INFO: TfLiteQnnDelegate delegate: 128 nodes delegated out of 128 nodes with 1 partitions.
INFO: Replacing 128 node(s) with delegate (TfLiteQnnDelegate) node, yielding 1 partitions.
INFO: Tensor 0 is successfully registered to shared memory.
INFO: Tensor 319 is successfully registered to shared memory.