Tutorial - Use Mixed-Precision Model with Qualcomm® AI Engine Direct Delegate¶
Floating-point models often produce more accurate predictions than quantized models. Quantized models, on the other hand, significantly reduce model size and computational requirements, resulting in lower inference latency than the corresponding floating-point model. To strike a balance between the accuracy of floating-point models and the efficiency of quantized models, we can employ mixed-precision models. This approach offers a compromise, achieving a reasonable level of accuracy while keeping computation efficient.
This tutorial demonstrates how to run a mixed-precision model with the Qualcomm® AI Engine Direct Delegate. Additionally, we present a case study on MobileNet v3 that highlights the benefits of mixed-precision models.
Prerequisites¶
The following list of prerequisites must be met before starting this tutorial:
To generate a mixed-precision model, please refer to the Quantization Debugger tutorial on the TensorFlow website.
Read the Tutorial for qtld-net-run and the Tutorial for benchmark_model to understand how to run inference and benchmarks with the Qualcomm® AI Engine Direct Delegate.
Running a mixed-precision model with qtld-net-run¶
Tutorial for qtld-net-run demonstrates how to run a TFLite model using the Qualcomm® AI Engine Direct Delegate on the HTP backend.
Set `htp_precision=1` when using a mixed-precision model.
$ adb shell "LD_LIBRARY_PATH=/data/local/tmp/qnn_delegate/:$LD_LIBRARY_PATH &&
ADSP_LIBRARY_PATH=/data/local/tmp/qnn_delegate/ &&
/data/local/tmp/qnn_delegate/qtld-net-run \
--model=/data/local/tmp/qnn_delegate/mix_precision_model.tflite \
--input=/data/local/tmp/qnn_delegate/input_list.txt \
--output=/data/local/tmp/qnn_delegate/tensor_dump_output \
--htp_precision=1 \
--backend=htp"
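The command above expects the raw input tensors listed in `input_list.txt`; the exact layout is described in the Tutorial for qtld-net-run. Assuming one little-endian float32 NHWC tensor of shape 1x224x224x3 per line (an assumption for MobileNet v3, not taken from that tutorial), a minimal host-side sketch to generate a dummy input and the list file could look like this (file names are hypothetical):

```python
import array
import os

# Assumed MobileNet v3 input shape: 1x224x224x3, float32.
num_values = 1 * 224 * 224 * 3

os.makedirs("inputs", exist_ok=True)
raw_path = os.path.join("inputs", "input_0.raw")

# Write a dummy all-zero tensor as 4-byte float values.
with open(raw_path, "wb") as f:
    array.array("f", [0.0] * num_values).tofile(f)

# One line per inference; a single raw file feeds the model's only input.
with open("input_list.txt", "w") as f:
    f.write(raw_path + "\n")
```

The generated files would then be pushed to `/data/local/tmp/qnn_delegate/` with `adb push` before running the command above.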
Running a mixed-precision model with benchmark_model¶
Tutorial for benchmark_model demonstrates how to benchmark models running through the Qualcomm® AI Engine Direct Delegate with the TFLite benchmark_model application.
Set `htp_precision:1` in the delegate options when using a mixed-precision model.
$ adb shell 'export LD_LIBRARY_PATH=/data/local/tmp/qnn_delegate/:$LD_LIBRARY_PATH &&
export ADSP_LIBRARY_PATH="/data/local/tmp/qnn_delegate/" &&
/data/local/tmp/qnn_delegate/benchmark_model \
--graph=mix_precision_model.tflite \
--external_delegate_path=/data/local/tmp/qnn_delegate/libQnnTFLiteDelegate.so \
--external_delegate_options="backend_type:htp;htp_precision:1"'
Experiments¶
Here, we run two experiments to show that the mixed-precision model produces more accurate predictions than the fully quantized model and runs with lower latency than the corresponding floating-point model.
Testing environment¶
Here is our testing environment.
- Base model: We download the MobileNet v3 model from TensorFlow Hub.
- Representative dataset: the first 100 images of the TensorFlow dataset imagenet_v2.
- Testing dataset: the first 1000 images of the TensorFlow dataset imagenet_v2.
- Testing device: We test the MobileNet v3 model at different precisions on Snapdragon 8 Gen 1+.
- TensorFlow version: v2.10.0
Testing the top-K accuracy¶
We use top-K accuracy to verify whether the mixed-precision model offers better accuracy than the fully quantized model.
The results show that the mixed-precision model has higher top-K accuracy.
|  | MobileNet v3 (full float) | MobileNet v3 (full quantization) | MobileNet v3 (mixed-precision) |
|---|---|---|---|
| 1000 test images, Top-1 accuracy, running on HTP, delegated by the Qualcomm® AI Engine Direct Delegate | 59.2% | 15.7% | 51.3% |
| 1000 test images, Top-1 accuracy, running on CPU | 59.3% | 19.5% | 53.7% |
| 1000 test images, Top-5 accuracy, running on HTP, delegated by the Qualcomm® AI Engine Direct Delegate | 83.2% | 33.0% | 74.1% |
| 1000 test images, Top-5 accuracy, running on CPU | 83.3% | 37.8% | 76.4% |
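Top-K accuracy counts a prediction as correct when the ground-truth label appears among the K highest-scoring classes. A minimal pure-Python sketch of the metric, using made-up scores rather than the MobileNet v3 outputs above:

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scored classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Indices of the k highest scores for this sample.
        top_k = sorted(range(len(sample_scores)),
                       key=lambda i: sample_scores[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

# Three samples over three classes, with hypothetical scores.
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]

print(top_k_accuracy(scores, labels, 1))  # 1 of 3 top-1 hits
print(top_k_accuracy(scores, labels, 2))  # 2 of 3 top-2 hits
```

In the experiment, the same computation is applied to the model's class scores for each of the 1000 test images, with K = 1 and K = 5.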
Benchmarking the latency¶
We use the TFLite benchmark_model application from this tutorial to check whether the mixed-precision model runs with lower latency than the floating-point model.
The results show that the mixed-precision model runs with lower latency than the corresponding floating-point model.
| Inference timings (in us) | MobileNet v3 (full float) | MobileNet v3 (full quantization) | MobileNet v3 (mixed-precision) |
|---|---|---|---|
| Init | 967562 | 576929 | 606613 |
| Inference (avg) | 4384.77 | 2944.59 | 3148.82 |
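The relative costs can be read off the average timings above; the mixed-precision model recovers most of the quantized model's speedup over float:

```python
# Average inference times from the table above, in microseconds.
float_avg_us = 4384.77
quant_avg_us = 2944.59
mixed_avg_us = 3148.82

# Latency reduction of the mixed-precision model vs. the float model.
reduction_pct = (1 - mixed_avg_us / float_avg_us) * 100
print(f"{reduction_pct:.1f}% lower latency than full float")

# How close mixed precision gets to the fully quantized model.
gap_pct = (mixed_avg_us / quant_avg_us - 1) * 100
print(f"{gap_pct:.1f}% slower than full quantization")
```

That is, the mixed-precision model cuts average latency by roughly 28% relative to full float while staying within about 7% of the fully quantized model.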
Conclusion¶
On this page, we ran inference with the MobileNet v3 model at different precisions. The experiments show that the mixed-precision model has higher top-K accuracy than the fully quantized model and lower inference latency than the corresponding floating-point model.