Benchmarking

Overview

The benchmark shipped in the Qualcomm® AI Engine Direct SDK consists of a set of Python scripts that run a network on a target device and collect performance metrics. It uses executables and libraries found in the SDK package to run a compiled model.so file on the target, using a set of inputs for the network and a file that points to that set of inputs.

The input to the benchmark scripts is a configuration file in JSON format. The SDK ships with a configuration file for running the InceptionV3 model, which is created by following the instructions in the SDK documentation. SDK users are encouraged to create their own configuration files and use the benchmark scripts to collect timing measurements on target devices.

The configuration file allows the user to specify:

  • Name of the benchmark (e.g., InceptionV3)

  • Host path to use for storing results

  • Device paths to use (where to push the necessary files for running the benchmark)

  • Device to run the benchmark on (only one device is supported per run)

  • Hostname/IP of the remote machine to which devices are connected

  • Number of times to repeat the run

  • Model specifics (name, location of model.so, location of inputs, etc.)

  • QNN backend configuration(s) to use (combination of CPU, GPU, and DSP)

  • Measurements to take (“timing”)

  • Profiling level of measurements (“basic” or “detailed”)

Command line parameters

To see all available command line parameters use the “-h” option when running qnn_bench.py.

usage: qnn_bench.py [-h] -c CONFIG_FILE [-o OUTPUT_BASE_DIR_OVERRIDE]
                    [-v DEVICE_ID_OVERRIDE] [-r HOST_NAME]
                    [-t DEVICE_OS_TYPE_OVERRIDE] [-d] [-s SLEEP]
                    [-n ITERATIONS] [-p PERFPROFILE]
                    [--backend_config BACKEND_CONFIG] [-l PROFILINGLEVEL]
                    [-json] [-be BACKEND [BACKEND ...]] [--htp_serialized]
                    [--dsp_type {v65,v66,v68,v69,v73,v75}]
                    [--arm_prepare] [--use_signed_skel] [--discard_output]
                    [--test_duration TEST_DURATION] [--enable_cache]
                    [--shared_buffer] [--clean_artifacts] [--cdsp_id {0,1}]

Run the qnn_bench

required arguments:
  -c CONFIG_FILE, --config_file CONFIG_FILE
                        Path to a valid config file
                        Refer to sample config file config_help.json present at <SDK_ROOT>/benchmarks/QNN/
                        to know details on how to fill parameters in config file

optional arguments:
  -o OUTPUT_BASE_DIR_OVERRIDE, --output_base_dir_override OUTPUT_BASE_DIR_OVERRIDE
                        Sets the output base directory.
  -v DEVICE_ID_OVERRIDE, --device_id_override DEVICE_ID_OVERRIDE
                        Use this device ID instead of the one supplied in config file.
  -r HOST_NAME, --host_name HOST_NAME
                        Hostname/IP of remote machine to which devices are connected.
  -t DEVICE_OS_TYPE_OVERRIDE, --device_os_type_override DEVICE_OS_TYPE_OVERRIDE
                        Specify the target OS type, valid options are
                        ['aarch64-android', 'aarch64-windows-msvc', 'aarch64-qnx',
                        'aarch64-oe-linux-gcc9.3', 'aarch64-oe-linux-gcc8.2']
  -d, --debug           Set to turn on debug log
  -s SLEEP, --sleep SLEEP
                        Set number of seconds to sleep between runs e.g. 20 seconds
  -n ITERATIONS, --iterations ITERATIONS
                        Set the number of iterations to execute for calculating metrics
  -p PERFPROFILE, --perfprofile PERFPROFILE
                        Specify the perf profile to set. Valid settings are
                        low_balanced, balanced, default, high_performance,
                        sustained_high_performance, burst, low_power_saver,
                        power_saver, high_power_saver, system_settings
  --backend_config BACKEND_CONFIG
                        Config file to specify context priority, provide backend-extension parameters, or enable the HTP-specific linting profile
  -l PROFILINGLEVEL, --profilinglevel PROFILINGLEVEL
                        Set the profiling level mode (basic, detailed, backend). Default is basic.
  -json, --generate_json
                        Set to produce json output.
  -be BACKEND [BACKEND ...], --backend BACKEND [BACKEND ...]
                        The backend to use
  --htp_serialized      qnn graph prepare is done on x86 and execute is run on target
  --dsp_type {v65,v66,v68,v69,v73,v75}
                        Specify DSP variant for QNN BM run
  --arm_prepare         qnn graph prepare is done on ARM and execute is run on target
  --use_signed_skel     use signed skels for HTP runs
  --discard_output      To discard writing output tensors after test execution.
  --test_duration TEST_DURATION
                        Specify duration for test execution in seconds
                        Loops over the input_list until this amount of time has transpired
  --enable_cache        To prepare the graph on device first using qnn-context-binary-generator
                        and then execute it using qnn-net-run to accelerate execution. Disabled by default.
  --shared_buffer       Enables usage of shared buffer between application and backend for graph I/O
  --clean_artifacts     Clean the model specific artifacts after inference
  --cdsp_id {0,1}       To specify which cdsp core to use when a SoC has multiple cdsp cores. Defaults to 0.

Running the benchmark

Prerequisites

  • Set up the QNN SDK using the steps provided in Setup.

  • Complete the 'Tutorial Setup' and 'Model Conversion and Build' sections of Converting and executing a CNN model with QNN.

  • (Optional) If the device is connected to a remote machine, the remote adb server setup must be performed by the user.

Running InceptionV3 that is shipped with the SDK

qnn_bench.py is the main benchmark script to measure and report performance statistics. To use it with the InceptionV3 model:

# Running inceptionV3_sample.json (CPU backend)
cd $QNN_SDK_ROOT/benchmarks/QNN
python3.10 qnn_bench.py -c inceptionV3_sample.json

# Running inceptionV3_quantized_sample.json (HTP backend by generating context binary on X86)
cd $QNN_SDK_ROOT/benchmarks/QNN
python3.10 qnn_bench.py -c inceptionV3_quantized_sample.json --dsp_type v68 --htp_serialized
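
The two sample invocations above can also be assembled programmatically. The helper below is an illustrative sketch, not part of the SDK; it only builds the argv list, with flag names taken from the usage text earlier on this page:

```python
# Sketch of a helper that assembles a qnn_bench.py command line.
# Illustrative only; flag names mirror the usage text above.

def build_bench_cmd(config_file, dsp_type=None, htp_serialized=False,
                    iterations=None, profiling_level=None):
    """Return the argv list for a qnn_bench.py invocation."""
    cmd = ["python3.10", "qnn_bench.py", "-c", config_file]
    if dsp_type:
        cmd += ["--dsp_type", dsp_type]     # e.g. v68 for HTP runs
    if htp_serialized:
        cmd.append("--htp_serialized")      # prepare on x86, execute on target
    if iterations is not None:
        cmd += ["-n", str(iterations)]
    if profiling_level:
        cmd += ["-l", profiling_level]      # basic, detailed, or backend
    return cmd

# CPU-backend sample run
print(" ".join(build_bench_cmd("inceptionV3_sample.json")))
# HTP sample run with the context binary generated on x86
print(" ".join(build_bench_cmd("inceptionV3_quantized_sample.json",
                               dsp_type="v68", htp_serialized=True)))
```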

Viewing the results (csv file or JSON file)

All results are stored in the “HostResultDir” directory that is specified in the configuration JSON file. The benchmark creates time-stamped directories for each benchmark run. All timing results are stored in microseconds.

For convenience, a latest_results link is created that always points to the most recent run.

# In inceptionV3_sample.json, "HostResultDir" is set to "inception_v3.repo/results"
cd $QNN_SDK_ROOT/benchmarks/QNN/inception_v3.repo/results
# Notice the time stamped directories and the "latest_results" link.
cd $QNN_SDK_ROOT/benchmarks/QNN/inception_v3.repo/results/latest_results
# Notice the .csv file, open this file in a csv viewer (Excel, LibreOffice Calc)
# Notice the .json file, open the file with any text editor
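
Locating the newest run can also be scripted. The helpers below are an illustrative sketch (names like latest_run_dir are not part of the SDK); they follow the latest_results link when present and otherwise fall back to the most recently modified subdirectory:

```python
import os

# Sketch: resolve the most recent results directory. Follows the
# "latest_results" link when present, otherwise falls back to the
# newest time-stamped subdirectory. Helper names are illustrative.

def latest_run_dir(results_root):
    link = os.path.join(results_root, "latest_results")
    if os.path.exists(link):
        return os.path.realpath(link)
    runs = [os.path.join(results_root, d) for d in os.listdir(results_root)
            if os.path.isdir(os.path.join(results_root, d))]
    if not runs:
        raise FileNotFoundError("no benchmark runs under " + results_root)
    return max(runs, key=os.path.getmtime)   # newest run by modification time

def result_files(run_dir):
    """Return the .csv and .json result files in a run directory."""
    return sorted(f for f in os.listdir(run_dir)
                  if f.endswith((".csv", ".json")))
```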

CSV benchmark results file

The CSV file contains results similar to the example below. Some measurements may not appear in the CSV file. To get all timing information, the profiling level must be set to detailed. By default, the profiling level is set to basic.

../_static/resources/benchmarking_csv_output.png

Section 1: Execution information

This section contains:

  • SDK version used to generate the benchmark run

  • Model name and path to the compiled model file

  • Backends selected for the benchmark

  • Additional execution information

Section 2: Performance metrics

This section contains measurements for model initialization and execution. The profiling level affects how many measurements are collected.

  • Init Stats [NetRun] measures the time taken to build and configure QNN.

  • Finalize Stats [NetRun] measures the time taken by QNN to finalize the graph.

  • De-Init Stats [NetRun] measures the time taken to de-initialize QNN.

  • Total Inference Time [NetRun] measures the entire execution time of one inference pass. This includes any input and output processing, copying of data, etc. This is measured at the start and end of the execute call.
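
Because all values are reported in microseconds, a small helper can summarize a series of per-iteration measurements. The function below is an illustrative sketch, not an SDK utility:

```python
# Sketch: summarize a series of per-iteration timings. The benchmark
# reports all values in microseconds; this converts the aggregate
# statistics to milliseconds for readability. Illustrative helper only.

def summarize_us(samples_us):
    """Return min/avg/max of microsecond samples, in milliseconds."""
    if not samples_us:
        raise ValueError("no samples")
    return {
        "min_ms": min(samples_us) / 1000.0,
        "avg_ms": sum(samples_us) / len(samples_us) / 1000.0,
        "max_ms": max(samples_us) / 1000.0,
    }

# e.g. five Total Inference Time [NetRun] samples in microseconds
print(summarize_us([4200, 4150, 4380, 4220, 4050]))
```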

Section 3: Per-layer detailed performance statistics

This section contains the execution stats of each layer of the neural network model.

Note

This information will be present only if the profiling level is set to detailed.

JSON benchmark results file

The benchmark results published in the CSV file can also be made available in JSON format. The contents are the same as in the CSV file, structured as key-value pairs, which makes parsing the results simpler and more efficient. The JSON file contains results similar to the following example.
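
A short sketch of how such key-value output might be consumed follows; the flatten helper is hypothetical, and the actual key names must be taken from a real results file:

```python
import json

# Sketch: load the JSON results file and flatten nested sections into
# dotted key-value pairs for easy tabulation. The exact key names in
# the results file should be taken from an actual benchmark run.

def flatten(node, prefix=""):
    items = {}
    for key, value in node.items():
        name = key if not prefix else prefix + "." + key
        if isinstance(value, dict):
            items.update(flatten(value, name))
        else:
            items[name] = value
    return items

def load_results(json_path):
    """Parse a results JSON file into a flat key-value mapping."""
    with open(json_path) as fh:
        return flatten(json.load(fh))
```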

../_static/resources/benchmarking_json_output.png

Section 4: Linting Profile Stats

Linting profiling mode is an HTP-exclusive configuration that provides per-op cycle counts on the main thread as well as background execution information. See QNN HTP Profiling for more information.

Running the benchmark with your own network and inputs

Prepare inputs

Before running the benchmark, prepare the following inputs:

  • your_model.so. See QNN Integration Workflow for the model.so creation workflow.

  • A text file listing all of your input data. For an example, see: $QNN_SDK_ROOT/examples/Models/InceptionV3/data/target_raw_list.txt.

  • All of the input data that is listed in the above text file. For an example, refer to the $QNN_SDK_ROOT/examples/Models/InceptionV3/data/cropped directory.

    Note

    target_raw_list.txt must exactly match the structure of your input directory.
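
The list file and the input data can be generated and cross-checked with a short script. The helpers below are illustrative, and the .raw extension is an assumption based on the InceptionV3 example:

```python
import os

# Sketch: generate an input list from a directory of raw input files,
# and check that every entry in an existing list resolves to a file.
# The .raw extension and helper names are assumptions, not SDK rules.

def write_input_list(data_dir, list_path, ext=".raw"):
    """Write one path per line for each matching file; return the names."""
    names = sorted(f for f in os.listdir(data_dir) if f.endswith(ext))
    with open(list_path, "w") as fh:
        for name in names:
            fh.write(os.path.join(data_dir, name) + "\n")
    return names

def validate_input_list(list_path):
    """Return entries in the list that do not exist on disk."""
    with open(list_path) as fh:
        entries = [line.strip() for line in fh if line.strip()]
    return [e for e in entries if not os.path.isfile(e)]
```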

Create a run configuration

Configuration structure

The configuration file is a JSON file with a predefined structure. Refer to $QNN_SDK_ROOT/benchmarks/QNN/inceptionV3_sample.json as an example.

Required fields:

  • Name – Name of the configuration, e.g., InceptionV3.

  • HostRootPath – Top level output folder on the host. This can be an absolute path or a relative path to the current working directory.

  • HostResultDir – Folder on the host where all benchmark results are put. This can be an absolute path or a relative path to the current working directory.

  • DevicePath – Folder on the device where all benchmark-related data and artifacts are put, e.g., /data/local/tmp/qnnbm.repo.

  • Devices – Serial number of the device on which the benchmark runs. Only one device is currently supported.

  • Runs – Number of times that the benchmark runs for each of the “Backend” and “Measurements” run combinations.

  • Model

    • Name – Name of the DNN model, e.g., InceptionV3.

    • qnn_model – Folder where the compiled model.so file is located on the host. This can be an absolute path or a relative path to the current working directory.

    • InputList – Text filepath that lists all of the input data. This can be an absolute path or a relative path to the current working directory.

    • Data – A list of data files or folders that are listed in the InputList file. This can be an absolute path or a relative path to the current working directory. If the path is a folder, all contents of that folder will be pushed to the device.

  • Backends – Possible values are “GPU”, “DSP”, and “CPU”. Any combination of these can be used.

  • Measurements – Possible value is “timing”. Each measurement type is measured separately for each run.

Optional fields:

  • HostName – Hostname/IP of the remote machine to which devices are connected. The default value is ‘localhost’.

  • PerfProfile – Performance mode to enable. The default is ‘high_performance’.

  • ProfilingLevel – Profiling level to enable. The default is ‘basic’.
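
Putting the fields above together, a configuration can be generated programmatically. The sketch below uses the field names documented above, but the placeholder values are hypothetical and the exact structure (including whether Devices is a list or a string) should be verified against inceptionV3_sample.json:

```python
import json

# Sketch: build a minimal benchmark configuration from the fields
# documented above. Paths and the device serial are placeholders;
# verify the exact structure against inceptionV3_sample.json.

config = {
    "Name": "MyModel",
    "HostRootPath": "mymodel.repo",
    "HostResultDir": "mymodel.repo/results",
    "DevicePath": "/data/local/tmp/qnnbm.repo",
    "Devices": ["1234567"],                 # device serial number (placeholder)
    "Runs": 3,
    "Model": {
        "Name": "MyModel",
        "qnn_model": "model/libs",          # folder containing model.so
        "InputList": "data/target_raw_list.txt",
        "Data": ["data/cropped"],
    },
    "Backends": ["CPU", "GPU"],
    "Measurements": ["timing"],
    # Optional fields
    "HostName": "localhost",
    "PerfProfile": "high_performance",
    "ProfilingLevel": "basic",
}

with open("mymodel.json", "w") as fh:
    json.dump(config, fh, indent=4)
```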

Architecture support

Android AArch64 (64-bit) is supported.

Backend and measurement are concatenated to make a full run combination name, e.g.,
“GPU_timing”: GPU backend, timing measurement

Run the benchmark

cd $QNN_SDK_ROOT/benchmarks/QNN
python3.10 qnn_bench.py -c yourmodel.json

The benchmark will perform an md5sum on the host files (those specified in the JSON configuration) and on the device files. Because of the md5sum check, the files needed for running the benchmark must be available on the host.

For any file that exists both on the host and on the device with a mismatched md5, the benchmark copies the file from the host to the target and issues a warning that the local files do not match the device files. This ensures that the results of a benchmark run accurately reflect the files specified in the JSON file.
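
The host-side half of such a check can be sketched as follows; the real benchmark compares against md5sums gathered on the device (e.g., over adb), which is represented here as a plain string argument:

```python
import hashlib

# Sketch of the kind of md5 comparison the benchmark performs between
# host and device copies of a file. The real check obtains the device
# digest over adb; here it is passed in as a string. Illustrative only.

def md5_file(path, chunk_size=65536):
    """Compute the md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_push(host_path, device_md5):
    """True when the host file's md5 differs from the device copy's."""
    return md5_file(host_path) != device_md5
```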

Other options

-v option
Allows user to override the device ID specified in the configuration file, so that the same configuration file
can be used across multiple devices.
-o option
Result output base directory override applies only if relative paths are specified for HostRootPath
and HostResultDir. It allows pooling the output regardless of where the benchmark is run from.
-t option
OS Type override currently supports Android aarch64 (arm64-v8a) devices.
-n option
Allows user to specify the number of times to repeat the runs to calculate the performance metrics.
-p option
Allows user to profile performance in different operating modes.
-l option
Allows user to specify the level of performance profiling.
-json option
Allows user to generate the result in JSON format along with default CSV format.
-be option
Allows user to set the backend to use.
--dsp_type option
Allows user to specify the DSP variant of the device.
--htp_serialized option
Allows user to prepare graph using HTP emulator on x86 and execute on the target.
--shared_buffer option
Specifies using shared buffers for a zero-copy usecase between the application and device/co-processor
associated with the backend.
--arm_prepare option
Allows user to prepare graph on arm and execute on the target.
--backend_config option
Allows user to specify context priority, provide backend-extension parameters, or enable the HTP-specific linting profile.
Reading the results
Open the results (CSV file or JSON file) in the <HostResultDir>/latest_results folder to view
the results. (<HostResultDir> is what is specified in the JSON configuration file.)

Measurement methodology

In all cases, the qnn-net-run executable is used to load a model and run inputs through the model.

Performance (“timing”)

Timing measurements are taken using internal timing utilities inside the QNN libraries. When qnn-net-run is executed, the libraries will log timing information to a file. This file is then parsed offline to retrieve the total inference times and per-layer times.

The total inference times include the per-layer computation times plus overhead, such as data movement between layers and into and out of the backend. The per-layer times are strictly the computation times for each layer. For smaller networks, the overhead can be significant relative to the computation time, particularly when offloading the network to run on GPU or DSP.
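
This relationship can be expressed directly. The helper below is an illustrative sketch that estimates the overhead component from the reported values:

```python
# Sketch: total inference time = sum of per-layer compute times plus
# overhead (data movement between layers and into/out of the backend).
# All values are in microseconds, matching the benchmark output.
# Helper names are illustrative, not SDK utilities.

def overhead_us(total_inference_us, per_layer_us):
    """Overhead not attributable to any layer's compute time."""
    return total_inference_us - sum(per_layer_us)

def overhead_fraction(total_inference_us, per_layer_us):
    """Fraction of the total inference time spent on overhead."""
    return overhead_us(total_inference_us, per_layer_us) / total_inference_us

# e.g. a small network on an offload backend: layers sum to 2800 us
print(overhead_fraction(5000, [900, 1100, 800]))
```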

Further optimizations present on the GPU/DSP may cause layer times to be misattributed in the case of conv-neuron or fc-neuron pairs. When executing on GPU, the total time of the pair is assigned to the conv layer, whereas on DSP it is assigned to the neuron layer.

Note

Detailed and linting profiling levels impact performance.

Benchmark dependencies

Binaries that the benchmark script depends on are listed in the following configuration files (depending on the target architecture, compiler, and STL library):

  • Android 64-bit

    • clang - libc++: bm_utils/qnnbm_artifacts_android_aarch64.json