Benchmarking¶
Overview
The benchmark shipped in the Qualcomm® Neural Processing SDK consists of a set of python scripts that runs a network on a target Android/LinuxEmbedded device and collect performance metrics. It uses executables and libraries found in the SDK package to run a DLC file on target, using a set of inputs for the network, and a file that points to that set of inputs.
The input to the benchmark scripts is a configuration file in JSON format. The SDK ships with a configuration file for running the Inception_v3 model that is created in the Qualcomm® Neural Processing SDK. SDK users are encouraged to create their own configuration files and use the benchmark scripts to run on target to collect timing and memory consumption measurements.
The configuration file allows the user to specify:
Name of the benchmark (i.e., Inception_v3)
Host path to use for storing results
Device paths to use (where to push the necessary files for running the benchmark)
Device to run the benchmark on (only one device is supported per run)
Hostname/IP of remote machine to which devices are connected
Number of times to repeat the run
Model specifics (name, location of dlc, location of inputs)
Qualcomm® Neural Processing SDK runtime configuration(s) to use (combination of CPU, GPU, GPU_FP16 and DSP)
Which measurements to take (“mem” and/or “timing”)
Profiling level of measurements (“off”, “basic”, “moderate” or “detailed”)
Command Line Parameters
To see all of the command line parameters use the “-h” option when running snpe_bench.py
usage: snpe_bench.py [-h] -c CONFIG_FILE [-o OUTPUT_BASE_DIR_OVERRIDE]
[-v DEVICE_ID_OVERRIDE] [-r HOST_NAME] [-a]
[-t DEVICE_OS_TYPE_OVERRIDE] [-d] [-s SLEEP]
[-b USERBUFFER_MODE] [-p PERFPROFILE] [-l PROFILINGLEVEL]
[-json] [-cache]
Run the snpe_bench
required arguments:
-c CONFIG_FILE, --config_file CONFIG_FILE
Path to a valid config file
Refer to sample config file config_help.json for more
detail on how to fill params in config file
optional arguments:
-o OUTPUT_BASE_DIR_OVERRIDE, --output_base_dir_override OUTPUT_BASE_DIR_OVERRIDE
Sets the output base directory.
-v DEVICE_ID_OVERRIDE, --device_id_override DEVICE_ID_OVERRIDE
Use this device ID instead of the one supplied in config
file. Cannot be used with -a
-r HOST_NAME, --host_name HOST_NAME
Hostname/IP of remote machine to which devices are
connected.
-a, --run_on_all_connected_devices_override
Runs on all connected devices, currently only support 1.
Cannot be used with -v
-t DEVICE_OS_TYPE_OVERRIDE, --device_os_type_override DEVICE_OS_TYPE_OVERRIDE
Specify the target OS type, valid options are
['android-aarch64', 'le_oe_gcc8.2', 'le64_oe_gcc8.2']
-d, --debug Set to turn on debug log
-s SLEEP, --sleep SLEEP
Set number of seconds to sleep between runs e.g. 20
seconds
-b USERBUFFER_MODE, --userbuffer_mode USERBUFFER_MODE
[EXPERIMENTAL] Enable user buffer mode, default to
float, can be tf8exact0
-p PERFPROFILE, --perfprofile PERFPROFILE
Set the benchmark operating mode (balanced, default,
sustained_high_performance, high_performance,
power_saver, low_power_saver, high_power_saver,
extreme_power_saver, low_balanced, system_settings)
-l PROFILINGLEVEL, --profilinglevel PROFILINGLEVEL
Set the profiling level mode (off, basic, moderate, detailed, linting).
Default is basic.
-json, --generate_json
Set to produce json output.
-cache, --enable_init_cache
Enable init caching mode to accelerate the network
building process. Defaults to disable.
Running the Benchmark
Prerequisites
The Qualcomm® Neural Processing SDK has been set up following the Qualcomm (R) Neural Processing SDK Setup chapter.
The Tutorials Setup has been completed.
Optional: If the device is connected to remote machine, the remote adb server setup needs to be done by the user.
Running Inceptionv3 that is Shipped with the SDK
snpe_bench.py is the main benchmark script to measure and report performance statistics. Here is how to use it with the Inception_v3 model and data that is created in the SDK.
cd $SNPE_ROOT/benchmarks/SNPE
python3 snpe_bench.py -c inception_v3_sample.json -a
where
-a benchmarks on the device connected
Viewing the Results (csv File or json File)
All results are stored in the “HostResultDir” that is specified in the configuration json file. The benchmark creates time-stamped directories for each benchmark run. All timing results are stored in microseconds.
For your convenience, a “latest_results” link is created that always points to the most recent run.
# In inception_v3_sample.json, "HostResultDir" is set to "inception_v3/results"
cd $SNPE_ROOT/benchmarks/SNPE/inception_v3/results
# Notice the time stamped directories and the "latest_results" link.
cd $SNPE_ROOT/benchmarks/SNPE/inception_v3/results/latest_results
# Notice the .csv file, open this file in a csv viewer (Excel, LibreOffice Calc)
# Notice the .json file, open the file with any text editor
CSV Benchmark Results File
The CSV file contains results similar to the example below. Some measurements may not be apparent in the CSV file. To get all timing information, profiling level needs to be set to detailed. By default, profiling level is basic. Note that colored headings have been added for clarity.
CSV 1: Configuration
This section contains information about:
the SDK version used to generate the benchmark run
model name and path to the DLC file
runtimes selected for the benchmark
etc.
CSV 2: Initialization metrics
This section contains measurements concerning model initialization. Profiling level affects the amount of measurements collected. No metrics are collected for profiling off. Metrics for basic and detailed are shown below.
Profiling level: Basic
Load measures the time required to load the model’s metadata.
Deserialize measures the time required to deserialize the model’s buffers.
Create measures the time spent to create Qualcomm® Neural Processing SDK network(s) and initialize all layers for the given model. Detailed breakdown of create time can be retrieved with profiling level set to detailed.
Init measures the time taken to build and configure Qualcomm® Neural Processing SDK. This time includes time measured Load, Deserialize, and Create.
De-Init measures the time taken to de-initialized Qualcomm® Neural Processing SDK.
Profiling level: Moderate or Detailed
Load measures the time required to load the model’s metadata.
Deserialize measures the time required to deserialize the model’s buffers.
Create measures the time spent to create Qualcomm® Neural Processing SDK network(s) and initialize all layers for the given model. Detailed breakdown of create time can be retrieved with profiling level set to detailed.
Init measures the time taken to build and configure Qualcomm® Neural Processing SDK. This time includes time measured Load, Deserialize, and Create.
De-Init measures the time taken to de-initialized Qualcomm® Neural Processing SDK.
Create Network(s) measures the total time required to create all network(s) for the model. Models that are partitioned will result in multiple networks being created.
RPC Init Time measures the entire time spent on RPC and accelerators used by Qualcomm® Neural Processing SDK. This time includes time measured in Snpe Accelerator Init Time and Accelerator Init Time. Currently only available for DSP and AIP runtime. Will appear as 0 for other runtimes.
Snpe Accelerator Init Time measures the total time spent by Qualcomm® Neural Processing SDK to prepare the data for the accelerator process such as GPU, DSP, AIP. Currently only available for DSP and AIP runtime. Will appear as 0 for other runtimes.
Accelerator Init Time measures the total processing time spent on the accelerator core, which may include different hardware resources. Currently only available for DSP and AIP runtime. Will appear as 0 for other runtimes.
CSV 3: Execution metrics
This section contains measurements concerning the execution of one inference pass of the neural network model. Profiling level affects the amount of measurements collected.
Profiling level: Basic
Total Inference Time measures the entire execution time of one inference pass. This includes any input and output processing, copying of data, etc. This is measured at the start and end of the execute call.
Profiling level: Moderate or Detailed
Total Inference Time measures the entire execution time of one inference pass. This includes any input and output processing, copying of data, etc. This is measured at the start and end of the execute call.
Forward Propogate measures the time spent executing one inference pass excluding processing overheads on one of the accelerator cores. For example, in the case of the GPU this represents the execution time of all the GPU kernels running on the GPU HW.
RPC Execute measures the entire time spent on RPC and accelerators used by Qualcomm® Neural Processing SDK. This time includes time measured in Snpe Accelerator and Accelerator. Currently only available for DSP and AIP runtime. Will appear as 0 for other runtimes.
Snpe Accelerator measures the total time spent by Qualcomm® Neural Processing SDK to setup processing for the accelerator. Currently only available for DSP and AIP runtime. Will appear as 0 for other runtimes.
Accelerator measures the total execution time spent on the accelerator core, which may include different hardware resources. Currently only available for DSP and AIP runtime. Will appear as 0 for other runtimes.
Profiling level: Detailed
Misc Accelerator measures the total execution time spent on the accelerator core for optimization elements not specified by Qualcomm® Neural Processing SDK. For example, in the case of DSP this represents the exection time spent on additional layers added by the accelerator to acheive optimal performance. Currently only available for DSP and AIP runtime. Will appear as 0 for other runtimes.
Profiling level: Linting
Linting profiling mode is an HTP exclusive configuration that provides per op cycle count on the main thread as well as background execution information. See here for more information.
CSV 4: Model Layer Names
This section contains the list of names of each layer of the neural network model.
Note that this information will only be present for DSP/GPU runtime only if the profiling level is set to detailed.
CSV 5: Model Layer Times
Each row in this section represents the execution time of each layer of the neural network model. For each runtime the average, min and max values are represented. Note that the values corresponding to each layer are referring to cycle counts, whereas the values for all other components like init, ForwardPropagate are in microseconds.
Note that this information will only be present for DSP/GPU runtime only if the profiling level is set to detailed.
Refer to Measurement Methodology Measurement Methodology for more background on the numbers.
JSON Benchmark Results File
The benchmark results published in the CSV file can also be made available in JSON format. The contents are the same as in the CSV file, structured as key-value pairs, and will help parsing the results in a simple and efficient manner. The JSON file contains results similar to the example below.
Running the Benchmark with Your Own Network and Inputs
Prepare inputs
Before running the benchmark, a set of inputs need to be ready:
your_model.dlc
A text file listing all of your input data. For an example, see: $SNPE_ROOT/examples/Models/InceptionV3/data/cropped/raw_list.txt
All of the input data that is listed in the above text file. For an example, see directory $SNPE_ROOT/examples/Models/InceptionV3/data/cropped
Note: image_list.txt must exactly match the structure of your input directory.
For more information on preparing your inputs, see: Input Images
Create a Run Configuration
Config structure
The configuration file is a JSON file with a predefined structure. Refer to $SNPE_ROOT/benchmarks/SNPE/inception_v3_sample.json as an example.
All fields are required.
Name: The name of this configuration, e.g., Inception_v3
HostRootPath: The top level output folder on the host. It can be an absolute path or a relative path to the current working directory.
HostResultDir: The folder on the host where all benchmark results are put. It can be an absolute path or a relative path to the current working directory.
DevicePath: The folder on the device where all benchmark-related data and artifacts are put, e.g., /data/local/tmp/snpebm
Devices: The serial number of the device that the benchmark runs on. Only one device is currently supported.
Runs: The number of times that the benchmark runs for each of the “Runtime” and “Measurements” run combinations
Model:
Name: The name of the DNN model, e.g., INCEPTION_V3
Dlc: The folder where the model dlc file is located on the host. It can be an absolute path or a relative path to the current working directory.
InputList: The text file path that lists all of the input data. It can be an absolute path or a relative path to the current working directory.
Data: A list of data files or folders that are listed in InputList file. It can be an absolute path or a relative path to the current working directory. If the path is a folder, all contents of that folder will be pushed to the device.
Runtimes: Possible values are “GPU”, “GPU_FP16”, “DSP” and “CPU”. You can use any combination of these.
Measurements: Possible values are “timing” and “mem”. You can set either one or both of these values. Each measurement type is measured alone for each run.
Optional fields:
CpuFallback: Possible values are true and false. Indicates whether or not the network can fallback to CPU when a layer is not available on a runtime. Default value is false.
HostName: Hostname/IP of remote machine to which devices are connected. Default value is ‘localhost’.
BufferTypes: List of user buffertypes. Possible values are “”(empty string), or any combination of “ub_float” and “ub_tf8”. If “” is given, it executes for all the runtimes given in RunTimes field. If any userbuffer option is given, it executes for all runtimes in RunTimes along with given userbuffer variants. If this field is absent, it is executed for all possible runtimes(default case).
Architecture support
In Android, AARCH 64-bit is supported. In addition to Android, there is limited support for LinuxEmbedded, where only the timing measurement is supported.
Runtime and measurement are concatenated to make a full run combination name, e.g.,
“GPU_timing”: GPU runtime, timing measurement
“CPU_mem”: CPU runtime, memory measurement
Note that for each specified runtime, there are multiple sets of timing measurements which differ only in the tensor format (Using ITensors and Using User Buffers). For example, for the DSP runtime, the following timing measurements are taken:
“DSP_timing”: Timing measurement on DSP runtime using ITensors
“DSP_ub_tf8_timing”: Timing measurement on DSP runtime using UserBuffer with TF8 encoding
“DSP_ub_float_timing”: Timing measurement on DSP runtime using UserBuffer with float encoding
Run the Benchmark
The easiest way to run the benchmark is to specify the -a option, which will run the benchmark on the lone device connected to your computer.
cd $SNPE_ROOT/benchmarks/SNPE
python3 snpe_bench.py -c yourmodel.json -a
The benchmark will do an md5sum on the host files (those specified in the JSON configuration) and on the device files. Because of the md5sum check, files needed for the benchmark run has to be available on the host.
For any file that exists both on host and on the device with mismatch md5, the benchmark will copy the file from host to target and issue a warning message letting you know that the local files do not match the device files. This is done so that you can be sure that the results you get from a benchmark run accurately reflect the files specified in the JSON file.
Other options
Reading the Results
Open the results(CSV file or JSON file) in your <HostResultsDir>/latest_results folder to view your results. (<HostResultsDir> is what you specified in your json configuration file.)
Measurement Methodology
In all cases, the snpe-net-run executable is used to load a model and run inputs through the model.
Performance (“timing”)
Timing measurements are taken using internal timing utilities inside the Qualcomm® Neural Processing SDK libraries. When snpe-net-run is executed, the libraries will log timing info to a file. This file is then parsed offline to retrieve total inference times and per-layer times.
The total inference times include both the per-layer computation times plus overhead such as data movements between layers, as well as into and out of runtimes, whereas the per-layer times are strictly computational times for each layer. For smaller networks the overhead can be quite significant relative to computational time, particularly when offloading the networks to run on GPU or DSP.
As well, further optimizations present on the GPU/DSP may cause layer times to be mis-attributed, in the case of neuron conv-neuron or fc-neuron pairs. When executing on GPU the total time of the pairs would be assigned to convs, whereas for DSP they would be assigned to the neurons.
Note
Detailed and Linting Profiling will cause performance impact.