Tutorial - Profile Custom Models using Qualcomm® AI Engine Direct Delegate¶
Qualcomm® AI Engine Direct provides APIs for users to profile custom models, capturing information such as inference time and per-operator execute time. Note that enabling profiling may reduce inference speed.
Workflow of using Profiler APIs¶
Step 1: Set profiling option
Step 2: Get and save the profiling result
Step 3: Clear the profiling result
Step 4: View the profiling result
Step 1: Set profiling option¶
TfLiteQnnDelegateOptions options = TfLiteQnnDelegateOptionsDefault();
// for basic profiling, please use
options.profiling = kBasicProfiling;
// for detailed profiling, please use
options.profiling = kPerOpProfiling;
In basic profiling mode, the user gets information such as RPC time and execute time for both the prepare stage and the inference stage.
In detailed profiling mode, the user additionally gets the execute time of each operator in the model and the top K operators consuming the most time.
Step 2: Get and save the profiling result¶
After setting the profiling option and invoking the model, the user can get the profiling result with TfLiteQnnDelegateGetProfilingResult.
interpreter_->Invoke();
TfLiteQnnDelegateProfilingResult result = TfLiteQnnDelegateGetProfilingResult(delegate);
Then, the user needs to save the result to a binary file.
std::string output_path = "/path/to/output/profile_result.bin";
std::ofstream outfile(output_path, std::ofstream::binary);
outfile.write(reinterpret_cast<const char*>(result.buffer), result.buffer_length);
outfile.close();
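A truncated or empty file will confuse the viewer later, so it can be worth checking the buffer and the write explicitly. Below is a more defensive sketch of the save step; `SaveProfilingBuffer` is a hypothetical helper name, not part of the delegate API.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: writes a profiling buffer to disk and reports success,
// so a null buffer or a failed write is not silently ignored.
bool SaveProfilingBuffer(const void* buffer, std::size_t length,
                         const std::string& path) {
  if (buffer == nullptr || length == 0) return false;
  std::ofstream outfile(path, std::ofstream::binary);
  if (!outfile.is_open()) return false;
  outfile.write(reinterpret_cast<const char*>(buffer), length);
  return outfile.good();
}
```

Usage with the result from above: `SaveProfilingBuffer(result.buffer, result.buffer_length, output_path);`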
Step 3: Clear the profiling result¶
The user can clear the profiling result whenever needed. The profiling result accumulates until
TfLiteQnnDelegateClearProfilingResult is called.
TfLiteQnnDelegateClearProfilingResult(delegate);
Step 4: View the profiling result¶
Follow the steps in View Profiling result by qtld-profile-viewer to view the binary file saved in Step 2: Get and save the profiling result.
The profiling result will contain the following sections.
Meta data of the profiling result.
[PROFILER API VERSION] : 1.0.0
[QNN API VERSION] : 2.13.0
[QNN DELEGATE API VERSION] : 0.19.0
[LOG DATE] : Tue Aug 29 01:14:49 2023
Graph stats including total graph number and their number of finalize/execute.
---------------Graph Stats----------------
Total Graph #: 2
Graph index: 0
Backend type: QNN_BACKEND_ID_HTP
Number of Finalize: 1
Number of Execute: 5
Finalize Info describes the events run in the finalize (prepare) stage.
----------Finalize Info----------
Graph index: 0
RPC (finalize) time: 2155 us
QNN accelerator (finalize) time: 2055 us
Accelerator (finalize) time: 1969 us
QNN (finalize) time: 239965 us
Execute Info describes the events run in the execute (inference) stage; per-operator information is only shown in detailed mode.
------------Execute Info------------
Graph index: 0
Number of HVX threads used: 4 count
RPC (execute) time: 3309 us
QNN accelerator (execute) time: 3235 us
Num times yield occured: 0 count
Time for initial VTCM acquire: 502 us
Time for HVX + HMX power on and acquire: 4135 us
Accelerator (execute) time (cycles): 1439530 cycles
0.00% | Input OpId_2 (cycles): 0 cycles
0.04% | OpId_0 (cycles): 528 cycles
20.05% | node_id_0_op_type_Quantize_op_count_0:OpId_16 (cycles): 288565 cycles
0.00% | node_id_1_op_type_Pad_op_count_0:OpId_18 (cycles): 0 cycles
7.23% | node_id_2_op_type_Conv2d_op_count_0:OpId_27 (cycles): 104082 cycles
When Finalize/Execute is run more than twice, the user can see a summary in the profiling result.
The first n executes are usually slower; the user can count them as Warmups and exclude them from the Mean & Stdev calculation.
Please refer to View Profiling result by qtld-profile-viewer for how to set Warmups.
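The Mean and Stdev in the summary can be reproduced from the raw per-execute timings once the warmup runs are dropped. The sketch below is illustrative only (it uses the population standard deviation; the viewer's exact formula may differ), and the names are not part of the delegate API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch: mean and (population) standard deviation of
// per-execute timings after excluding the first `warmups` samples.
struct Stats {
  double mean;
  double stdev;
};

Stats SummarizeExcludingWarmups(const std::vector<double>& times_us,
                                std::size_t warmups) {
  Stats s{0.0, 0.0};
  if (warmups >= times_us.size()) return s;  // nothing left to summarize
  const std::size_t n = times_us.size() - warmups;
  double sum = 0.0;
  for (std::size_t i = warmups; i < times_us.size(); ++i) sum += times_us[i];
  s.mean = sum / n;
  double sq = 0.0;
  for (std::size_t i = warmups; i < times_us.size(); ++i) {
    const double d = times_us[i] - s.mean;
    sq += d * d;
  }
  s.stdev = std::sqrt(sq / n);
  return s;
}
```

For example, with timings `{3309, 1500, 1450, 1500, 1450}` us and 1 warmup, the slow first run is excluded and the mean is computed over the remaining four samples.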
------------Summary------------
Graph index:0
Warmup (first 1 events):
RPC (execute) time: 3309.00 us
QNN accelerator (execute) time: 3235.00 us
Time for initial VTCM acquire: 502.00 us
Time for HVX + HMX power on and acquire: 4135.00 us
Accelerator (execute) time: 578.00 us
Accelerator (execute excluding wait) time: 308.00 us
QNN (execute) time: 3584.00 us
Mean:
RPC (execute) time: 1483.50 us
QNN accelerator (execute) time: 1474.00 us
Time for initial VTCM acquire: 112.25 us
Time for HVX + HMX power on and acquire: 26.00 us
Accelerator (execute) time: 303.25 us
Accelerator (execute excluding wait) time: 281.00 us
QNN (execute) time: 1653.75 us
Stdev:
RPC (execute) time: 30.49 us
QNN accelerator (execute) time: 31.32 us
Time for initial VTCM acquire: 14.64 us
Time for HVX + HMX power on and acquire: 8.37 us
Accelerator (execute) time: 9.22 us
Accelerator (execute excluding wait) time: 7.35 us
QNN (execute) time: 37.66 us
Per-op Avg Time can be seen in detailed mode, describing the average time of each operator across multiple executes (inferences).
---------Per-op Avg Time----------
0.00% | Input OpId_2 (cycles): 0.00 cycles
0.04% | OpId_0 (cycles): 190.80 cycles
0.00% | node_id_0_op_type_Cast_op_count_0:OpId_16 (cycles): 0.00 cycles
22.07% | node_id_1_op_type_Conv2d_op_count_0:OpId_23 (cycles): 105402.30 cycles
0.00% | node_id_1_op_type_ReluMinMax_op_count_1:OpId_25 (cycles): 0.00 cycles
1.57% | node_id_2_op_type_DepthWiseConv2d_op_count_0:OpId_34 (cycles): 7503.50 cycles
0.00% | node_id_2_op_type_ReluMinMax_op_count_1:OpId_35 (cycles): 0.00 cycles
0.74% | node_id_3_op_type_Conv2d_op_count_0:OpId_44 (cycles): 3522.60 cycles
1.77% | node_id_4_op_type_Conv2d_op_count_0:OpId_52 (cycles): 8451.30 cycles
0.00% | node_id_4_op_type_ReluMinMax_op_count_1:OpId_54 (cycles): 0.00 cycles
16.28% | node_id_5_op_type_DepthWiseConv2d_op_count_0:OpId_63 (cycles): 77752.50 cycles
0.00% | node_id_5_op_type_ReluMinMax_op_count_1:OpId_64 (cycles): 0.00 cycles
0.51% | node_id_6_op_type_Conv2d_op_count_0:OpId_73 (cycles): 2442.10 cycles
0.78% | node_id_7_op_type_Conv2d_op_count_0:OpId_81 (cycles): 3732.40 cycles
0.00% | node_id_7_op_type_ReluMinMax_op_count_1:OpId_83 (cycles): 0.00 cycles
1.34% | node_id_8_op_type_DepthWiseConv2d_op_count_0:OpId_92 (cycles): 6378.90 cycles
0.00% | node_id_8_op_type_ReluMinMax_op_count_1:OpId_93 (cycles): 0.00 cycles
Top K by Computation Time (K = 10 in this example) shows the K operators with the highest percentage in Per-op Avg Time.
---------Top 10 by Computation Time----------
22.07% | 105402 cycles | node_id_1_op_type_Conv2d_op_count_0:OpId_23 (cycles)
16.28% | 77752.5 cycles | node_id_5_op_type_DepthWiseConv2d_op_count_0:OpId_63 (cycles)
7.89% | 37703.2 cycles | node_id_12_op_type_DepthWiseConv2d_op_count_0:OpId_122 (cycles)
5.38% | 25670.2 cycles | node_id_49_op_type_DepthWiseConv2d_op_count_0:OpId_419 (cycles)
4.38% | 20895.2 cycles | node_id_23_op_type_DepthWiseConv2d_op_count_0:OpId_211 (cycles)
2.75% | 13117.7 cycles | node_id_15_op_type_DepthWiseConv2d_op_count_0:OpId_151 (cycles)
1.83% | 8729.9 cycles | node_id_63_op_type_PoolAvg2d_op_count_0:OpId_534 (cycles)
1.77% | 8451.3 cycles | node_id_4_op_type_Conv2d_op_count_0:OpId_52 (cycles)
1.73% | 8268.2 cycles | node_id_41_op_type_DepthWiseConv2d_op_count_0:OpId_359 (cycles)
1.71% | 8164.3 cycles | node_id_45_op_type_DepthWiseConv2d_op_count_0:OpId_389 (cycles)
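The Top K table above is essentially the per-op averages sorted by cycle count. A minimal, self-contained sketch of that selection (illustrative only, not the viewer's actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative: pick the K ops with the largest average cycle counts,
// as in the "Top K by Computation Time" table.
using OpTime = std::pair<std::string, double>;  // {op name, avg cycles}

std::vector<OpTime> TopKByCycles(std::vector<OpTime> ops, std::size_t k) {
  k = std::min(k, ops.size());
  // Sort only the first k positions, in descending order of cycles.
  std::partial_sort(ops.begin(), ops.begin() + k, ops.end(),
                    [](const OpTime& a, const OpTime& b) {
                      return a.second > b.second;
                    });
  ops.resize(k);
  return ops;
}
```

`std::partial_sort` avoids fully sorting a long operator list when only the top few entries are needed.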
Wait time, Overlap time, and Resources can be seen in linting mode, providing more details of the current execution.
4.54% | node_id_11_op_type_ElementWiseAdd_op_count_0:OpId_101 (cycles): 639291 cycles
Wait (Scheduler) time: 681 cycles
Overlap time: 20042 cycles
node_id_17_op_type_Relu_op_count_2:OpId_143
node_id_13_op_type_Relu_op_count_2:OpId_117
node_id_19_op_type_ElementWiseAdd_op_count_0:OpId_153
node_id_5_op_type_Relu_op_count_2:OpId_65
node_id_7_op_type_ElementWiseAdd_op_count_0:OpId_75
node_id_11_op_type_ElementWiseAdd_op_count_0:OpId_101
node_id_9_op_type_Relu_op_count_2:OpId_91
node_id_3_op_type_Relu_op_count_2:OpId_49
Overlap (wait) time: 663 cycles
node_id_17_op_type_Relu_op_count_2:OpId_143
node_id_13_op_type_Relu_op_count_2:OpId_117
node_id_19_op_type_ElementWiseAdd_op_count_0:OpId_153
node_id_5_op_type_Relu_op_count_2:OpId_65
node_id_7_op_type_ElementWiseAdd_op_count_0:OpId_75
node_id_11_op_type_ElementWiseAdd_op_count_0:OpId_101
node_id_9_op_type_Relu_op_count_2:OpId_91
Resources: HVX, HMX
A Running Example using Profiler APIs¶
#include "QNN/TFLiteDelegate/QnnTFLiteDelegate.h"
// Setup interpreter with .tflite model.
// Create QNN Delegate options structure.
TfLiteQnnDelegateOptions options = TfLiteQnnDelegateOptionsDefault();
// Set profiling options to either kBasicProfiling or kPerOpProfiling
options.profiling = kBasicProfiling;
// Instantiate delegate. Must not be freed until interpreter is freed.
// Please use QNN Delegate interface rather than external delegate interface.
TfLiteDelegate* delegate = TfLiteQnnDelegateCreate(&options);
// Register QNN Delegate with TfLite interpreter to automatically delegate nodes.
interpreter_->ModifyGraphWithDelegate(delegate);
// Perform inference with interpreter as usual.
interpreter_->Invoke();
// Get profling result.
TfLiteQnnDelegateProfilingResult result = TfLiteQnnDelegateGetProfilingResult(delegate);
// Save the result to binary.
std::string output_path = "/path/to/output/profiling_result.bin";
std::ofstream outfile(output_path, std::ofstream::binary);
outfile.write(reinterpret_cast<const char*>(result.buffer), result.buffer_length);
outfile.close();
// Clear profiling result.
TfLiteQnnDelegateClearProfilingResult(delegate);
TfLiteQnnDelegateDelete(delegate);