Converters

This page describes the general conversion process, the expected inputs and generated outputs, and provides examples of usage.

Overview

Qualcomm® AI Engine Direct currently supports converters for four frameworks: TensorFlow, TFLite, PyTorch, and ONNX. Each converter requires, at a minimum, the original framework model as input to generate a Qualcomm® AI Engine Direct model. For additional required inputs, refer to the framework-specific sections below.

The flow for each converter is the same:

Converter Workflow

../_static/resources/qnn_converter_callflow.png

There are four main parts to each converter:

  1. The front-end translation, which converts the original framework model into the common intermediate representation (IR).

  2. The common IR code, which contains graph and IR operation definitions as well as various graph optimizations that can be applied to translated graphs.

  3. The quantizer, which is optionally invoked to quantize the model prior to the final lowering to QNN. See Quantization for more information.

  4. The QNN converter backend, which lowers the IR into the final QnnModel API calls.

All the converters share the same IR code and QNN converter backend. The output of each converter is also the same: a model.cpp file, or a model.cpp/model.bin pair, containing the final converted QNN graph. The converted model.cpp contains two functions, QnnModel_composeGraphs and QnnModel_freeGraphsInfo, which leverage the Tools Utility API described below. Additionally, model_net.json is saved, which is a JSON variant of model.cpp.

QNN Model JSON Format

Note

  • All QNN enum/macro values are resolved in fields.

  • All input/output tensors are stored in the “tensors” config section, and the tensor names are later used to define a node’s inputs/outputs. The only tensor defined in the node config is a tensor parameter.

  • Static input tensor data is not stored in the JSON.

{
  "model.cpp": "<CPP filename goes here>",
  "model.bin": "<BIN filename goes here if applicable else NA>",
  "converter_command": "<command line used goes here>",
  "copyright_str": "<copyright str goes here if applicable else empty string>",
  "op_types": ["list of unique op types found in graph"],
  "Total parameters": "total parameter count in graph (value in MB assuming single precision float)",
  "Total MACs per inference": "total multiply and accumulate count in graph (in M)",
  "graph": {
     "tensors": {
       "<tensor_name>": {
         "id": <generated_id>,
         "type": <tensor_type>,
         "dataFormat": <tensor_memory_layout>,
         "data_type": <tensor_data_type>,
         "quant_params": {
           "definition": <enum_value>,
           "encoding": <enum_value>,
           "scale_offset": {
             "offset": <val>,
             "scale":  <val>
           }
         },
         "current_dims": <list_val>,
         "max_dims": <list_val>,
         "params_count": <val> ("parameter count for node, along with value/total percentage. (only where applicable)")
       },
       "<tensor_name_with_axis_scale_offset_variant>": {
         "id": <generated_id>,
         "type": <tensor_type>,
         "dataFormat": <tensor_memory_layout>,
         "data_type": <tensor_data_type>,
         "quant_params": {
           "definition": <enum_value>,
           "encoding": <enum_value>,
           "axis_scale_offset": {
             "axis": <val>,
             "num_scale_offsets": <val>,
             "scale_offsets": [
               {
                 "scale": <val>,
                 "offset": <val>
               },
               ...
             ]
           }
         },
         "current_dims": <list_val>,
         "max_dims": <list_val>
       },
       ...
    },
    "nodes": {
       "<node_name>": {
         "package": <str_val>,
         "type": <str_val>,
         "tensor_params": {
           "<param_name>": {
             "<tensor_name_*>": {
                 "id": <generated_id>,
                 "type": <tensor_type>,
                 "dataFormat": <tensor_memory_layout>,
                 "data_type": <tensor_data_type>,
                 "quant_params": {
                    "definition": <enum_value>,
                    "encoding": <enum_value>,
                    "scale_offset": {
                      "offset": <val>,
                      "scale":  <val>
                    }
                 },
                 "current_dims": <list_val>,
                 "max_dims": <list_val>,
                 "data": <list_val>
               }
           },
           ...
         },
         "scalar_params": {
           "<param_name>": {
              "param_data_type": <val>
            },
           ...
         },
         "input_names": <list_str_val>,
         "output_names": <list_str_val>,
         "macs_per_inference": <val> ("multiply and accumulate value for node, along with value/total percentage. (only where applicable)")
       },
       ...
    }
  }
}
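As a rough illustration of how this JSON can be consumed, the Python sketch below walks a model_net.json-style dictionary, counts nodes per op type, and lists the tensors that no node produces. The inline sample is a hand-written, hypothetical stand-in that follows the layout above (a real file would be read with json.load), and the helper name summarize_graph is an assumption made for illustration, not part of the SDK.

```python
import json
from collections import Counter

# Hypothetical, hand-written sample following the layout described above.
SAMPLE = json.loads("""
{
  "model.cpp": "model.cpp",
  "model.bin": "NA",
  "op_types": ["Conv2d", "Relu"],
  "graph": {
    "tensors": {
      "input":    {"id": 0, "current_dims": [1, 8, 8, 3]},
      "conv_out": {"id": 1, "current_dims": [1, 8, 8, 4]},
      "output":   {"id": 2, "current_dims": [1, 8, 8, 4]}
    },
    "nodes": {
      "conv_0": {"package": "qti.aisw", "type": "Conv2d",
                 "input_names": ["input"], "output_names": ["conv_out"]},
      "relu_0": {"package": "qti.aisw", "type": "Relu",
                 "input_names": ["conv_out"], "output_names": ["output"]}
    }
  }
}
""")


def summarize_graph(model):
    """Return (op-type histogram, sorted names of tensors no node produces)."""
    graph = model["graph"]
    nodes = graph["nodes"].values()
    op_counts = Counter(node["type"] for node in nodes)
    produced = {name for node in nodes for name in node["output_names"]}
    # Tensors that no node produces: graph inputs, plus any static tensors
    # listed in the "tensors" section.
    unproduced = sorted(set(graph["tensors"]) - produced)
    return op_counts, unproduced


op_counts, unproduced = summarize_graph(SAMPLE)
print(dict(op_counts))
print(unproduced)
```

Because node inputs and outputs are referenced by tensor name, the same traversal can be extended to rebuild the graph connectivity.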

Tools Utility API

The Tools Utility API contains helper modules to generate QNN API calls. The APIs are lightweight wrappers on top of the core QNN API and are intended to reduce the repetitive steps involved in creating QNN graphs.

  • Tools Utility C++ API:

  • QNN Core C API Reference: C

QNN Model Classes

../_static/resources/qnn_model_classes.png
  • QnnModel: This class is analogous to a QnnGraph and its tensors inside a given context. The context shall be provided at initialization and a new QnnGraph will be created within it. For more details on these class APIs please see QnnModel.hpp, QnnWrapperUtils.hpp

  • GraphConfigInfo: This structure is used to pass a list of QNN graph configurations (if applicable) from the client. Refer to the QnnGraph API for details on available graph config options.

  • GraphInfo: This structure is used to communicate constructed graph along with its input and output tensors to the client.

  • QnnModel_composeGraphs: is responsible for constructing QNN graph on the provided QNN backend using the QnnModel class. It will return the constructed graph via graphsInfo.

  • QnnModel_freeGraphsInfo: should only be called once the graph is no longer being used.

For more information on integrating the model into an application, see Integration workflow.

Tensorflow Conversion

QNN, like many other neural network runtime engines, supports both low level operations (like an elementwise multiply) as well as high level operations (like Prelu). TensorFlow on the other hand, generally supports high level operations by representing them as subgraphs of low level operations. To reconcile these differences the converter must sometimes pattern match subgraphs of small operations into larger “layer-like” operations that can be leveraged in QNN.

Pattern Matching

The following are a few examples of pattern matching that occurs in the QNN Tensorflow converter. In each case the pattern generally consists of any operations that fall in between the layer input and output, with additional parameters like weights and biases being absorbed into the final IR op.

../_static/resources/node_labels.png

Convolution example:

../_static/resources/node_conv1.png

Prelu example:

../_static/resources/node_prelu.png

The important thing to remember is that these patterns are hard coded in the converter. Changes to the model that affect the connectivity or order of the operations in these patterns are likely to break the conversion, as the converter will not be able to identify and map the subgraph to the appropriate layer.
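To make the idea concrete, here is a deliberately simplified Python sketch of hard-coded pattern matching. It is not the converter's implementation: the op names, the linear-list graph representation, and the fuse_patterns helper are all assumptions chosen for illustration. The second call shows how an unexpected op inside a pattern prevents the match.

```python
# Toy model: a graph is a linear list of op-type strings. Real TensorFlow
# graphs are DAGs; this sketch only illustrates the matching idea.
PATTERNS = {
    # hypothetical low-level sequence -> hypothetical fused "layer" op
    ("Conv2D", "BiasAdd"): "Convolution",
    ("MatMul", "BiasAdd"): "FullyConnected",
}


def fuse_patterns(ops):
    """Replace each known op sequence with its fused layer op."""
    fused, i = [], 0
    while i < len(ops):
        for pattern, layer in PATTERNS.items():
            if tuple(ops[i:i + len(pattern)]) == pattern:
                fused.append(layer)
                i += len(pattern)
                break
        else:
            # No hard-coded pattern starts here; keep the op unchanged.
            fused.append(ops[i])
            i += 1
    return fused


matched = fuse_patterns(["Conv2D", "BiasAdd", "Relu"])
# An extra op between Conv2D and BiasAdd breaks the hard-coded pattern.
broken = fuse_patterns(["Conv2D", "FakeQuant", "BiasAdd", "Relu"])
print(matched)  # ['Convolution', 'Relu']
print(broken)   # ['Conv2D', 'FakeQuant', 'BiasAdd', 'Relu']
```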

The TF converter also supports propagating quantization aware trained (QAT) model parameters to the final QNN model. This happens automatically during conversion when quantization is invoked. Note that the placement of quantization nodes also determines whether or not they will be propagated. Inserting quantization nodes inside a pattern will cause the pattern matching to break and conversion to fail. The safe place to insert nodes is after “layer-like” layers to capture activation information for a layer. In addition, quantization nodes inserted after weights and biases can capture the quantization information for static parameters.

An example of inserting a quantization node after a Convolution:

../_static/resources/qnn_tf_quant_act.png

See Quantization for more information on initiating quantization as part of the conversion process.

Additional Required Parameters

As Tensorflow graphs often include extraneous nodes that are not required for general inference, the input nodes and dimensions must be provided along with the final output nodes required for inference. The converter then prunes unnecessary nodes from the graph, ensuring a more compact and efficient graph.

To specify the graph’s inputs to the converter, pass the following on the command line:

--input_dim <input_name> <comma separated dims>

To specify the graph’s output nodes simply pass:

--out_node <output_name>

Tensorflow also has multiple input formats, but only frozen graphs (.pb files) or .meta files are supported. Saved training sessions are not supported by the converter.

Notes on Tensorflow 2.x Support

The qnn-tensorflow-converter has been updated to support conversion of Tensorflow 2.3 models. Note that while some TF 1.x models may convert using Tensorflow 2.3 as the conversion framework it is generally recommended to use the same TF version for conversion as was used for training the model. Some older 1.x models may not convert at all using TF 2.3 and a TF 1.x instance may be required for successful conversion.

Note that some options have been updated or added to support Tensorflow 2.x models. The first is a change to support the SavedModel format. Users can provide the directory to the SavedModel files by passing it to the same input_network option:

--input_network <SavedModel path>

Users can optionally pass saved_model_tag to indicate the tag and associated MetaGraph from the SavedModel. The default is “serve”.

--saved_model_tag <tag>

Lastly, a user can select the input and output of the model by using the signature key. The default value is ‘serving_default’.

--saved_model_signature_key <signature_key>

Example

The following is an example of an SSD model which requires one image input, but has 4 output nodes.

qnn-tensorflow-converter --input_network frozen_graph.pb --input_dim Preprocessor/sub 1,300,300,3 --output_path ssd_model.cpp --out_node detection_scores --out_node detection_boxes --out_node detection_classes --out_node Postprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_2/TensorArrayGatherV3 -p "qti.aisw"
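When a model has several inputs or output nodes, the command line can be assembled programmatically. The Python sketch below builds an argument list using only the options described in this section (--input_network, --input_dim, --out_node, --output_path). The helper name build_tf_converter_cmd is hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex


def build_tf_converter_cmd(network, inputs, out_nodes, output_path):
    """Assemble a qnn-tensorflow-converter argument list.

    inputs:    dict mapping input name -> list of dimensions
    out_nodes: list of output node names
    """
    cmd = ["qnn-tensorflow-converter", "--input_network", network]
    for name, dims in inputs.items():
        # Each input is passed with its own --input_dim option.
        cmd += ["--input_dim", name, ",".join(str(d) for d in dims)]
    for node in out_nodes:
        # Each output node is passed with its own --out_node option.
        cmd += ["--out_node", node]
    cmd += ["--output_path", output_path]
    return cmd


cmd = build_tf_converter_cmd(
    "frozen_graph.pb",
    {"Preprocessor/sub": [1, 300, 300, 3]},
    ["detection_scores", "detection_boxes", "detection_classes"],
    "ssd_model.cpp",
)
print(shlex.join(cmd))
```

On a machine with the SDK tools on the PATH, the resulting list could be handed to subprocess.run.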

TFLite Conversion

The qnn-tflite-converter converts a TFLite model to an equivalent QNN representation. It takes as input a .tflite model.

Additional Required Parameters

The TFLite converter needs the names and dimensions of the input nodes to be provided on the command line for the conversion. Each input must be passed individually using the same argument.

To specify the graph’s inputs to the converter, pass the following on the command line:

--input_dim <input_name_1> <comma separated dims> --input_dim <input_name_2> <comma separated dims>

Example

The following is an example of converting an Inception_v3 model, which requires one image input:

qnn-tflite-converter --input_network model.tflite --input_dim "input" 1,299,299,3 --output_path model.cpp

PyTorch Conversion

The qnn-pytorch-converter converts a PyTorch model to an equivalent QNN representation. It takes as input a TorchScript model (.pt).

Additional Required Parameters

The PyTorch converter needs the names and dimensions of the input nodes to be provided on the command line for the conversion. Each input must be passed individually using the same argument.

To specify the graph’s inputs to the converter, pass the following on the command line:

--input_dim <input_name_1> <comma separated dims> --input_dim <input_name_2> <comma separated dims>

Example

The following is an example of converting an Inception_v3 model, which requires one image input:

qnn-pytorch-converter --input_network model.pt --input_dim "input" 1,3,299,299 --output_path model.cpp

Onnx Conversion

The qnn-onnx-converter converts a serialized ONNX model to an equivalent QNN representation. By default, it also runs onnx-simplifier if it is available in the user’s environment (see Setup). However, onnx-simplifier is run by default only if the user has not provided quantization overrides or custom ops, because simplification could squash layers and prevent the custom ops or quantization overrides from being used. If the model contains ONNX functions, the converter always inlines the function nodes. Note: If conversion fails, the ONNX converter supports an additional option, --dry_run, which dumps detailed information about unsupported ops and associated parameters. ONNX conversion currently supports up to ONNX Opset 22.

Supported ONNX Ops

For the complete list of ONNX ops supported by the ONNX converter, see the supported ONNX ops table.

Example

qnn-onnx-converter --input_network model.onnx --output_path model.cpp

Custom Operation Output Shape and Datatype Inference

The QNN converter requires output shapes and datatypes for all operations to be present in the model for successful conversion. Output shapes and datatypes for custom operations can be taken from the model, if present, or inferred using the framework’s shape inference script. When neither is possible, the logic to infer custom operation output shapes and datatypes can be provided to the converter through a shared library compiled with Converter Op Package Generation. The compiled library is passed with the --converter_op_package_lib or -cpl option followed by the absolute path to the compiled library. The converter uses the library to infer the output shapes and datatypes of the custom operations needed for successful model conversion. Multiple libraries must be comma separated.

Note

--converter_op_package_lib or -cpl is an optional argument and should be used when the output shapes and/or output datatypes for custom operations are not present in the model or cannot be inferred from the framework’s shape inference script.

Note

When the output datatypes are present in the model and the --converter_op_package_lib with the logic to populate the output datatypes is passed, output datatypes inferred from the library will be given priority and override the output datatypes inferred from the model.

Example

qnn-onnx-converter --input_network model.onnx --converter_op_package_lib libExampleLibrary.so

Note

  • See Converter Op Package Generation for library generation and compilation instructions.

  • Custom operation output shape inference is only supported for ONNX and PyTorch converters.

  • Tensorflow and TFLite converters do not support custom operation output shape inference.

Custom I/O

Introduction

The Custom I/O feature allows users to provide the desired layout and datatype for the inputs and outputs while loading a network. Instead of compiling the network for the inputs and outputs specified in the model, the network is compiled for the inputs and outputs described in the custom configuration. This feature is useful when the user intends to pre-process the input data (on GPU/CDSP or by any other method) or process it offline (as allowed by MLCommons) and so avoid some steps in the input processing. Users who know the input pre-processing steps can avoid redundant transposes and data-type conversions. Similarly, on the post-processing side, if the model output is fed to the next stage in a pipeline, the desired format and type can be configured as the output of the current stage.

In this section, the term “Model I/O” refers to the input and output datatypes and formats of the original model. The term “Custom I/O” refers to the input and output datatypes and formats desired by the user.

Custom I/O Configuration File

Custom I/O can be applied using a YAML configuration file that contains the following fields for each input and output that needs to be modified.

  • IOName: Name of the input or output present in the model that needs to be loaded as per the custom requirement.

  • Layout: Layout field (optional) has two sub fields: Model and Custom. Model and Custom fields support valid QNN Layout. Accepted values are: NCDHW, NDHWC, NCHW, NHWC, NFC, NCF, NTF, TNF, NF, NC, F, NONTRIVIAL, where, N = Batch, C = Channels, D = Depth, H = Height, W = Width, F = Feature, T = Time

    • Model: Specify the layout of the buffer in the original model. This is equivalent to the --input_layout option and both cannot be used together.

    • Custom: Specify the custom layout desired for the buffer. This field needs to be filled by the user.

  • Datatype: Datatype field (optional) supports float32, float16 and uint8 datatypes.

  • QuantParam: QuantParam field (optional) has three sub fields: Type, Scale and Offset.

    • Type: Set to QNN_DEFINITION_DEFINED (default) if the scale and offset are provided by the user else set to QNN_DEFINITION_UNDEFINED.

    • Scale: Float value for the scale of the buffer as desired by the user.

    • Offset: Integer value for the offset as desired by the user.

Example

Consider an ONNX model with the original model I/O and custom I/O configuration as shown in the table below:

Input/Output Name    Model I/O     Custom I/O
-----------------    ----------    ----------
‘input_0’            float NCHW    uint8 NHWC
‘output_0’           float NHWC    float NCHW

Then, the content of the custom I/O configuration YAML file that should be provided is:

- IOName: input_0
  Layout:
    Model: NCHW
    Custom: NHWC
  Datatype: uint8
  QuantParam:
    Type:
       QNN_DEFINITION_DEFINED
    Scale:
       0.12
    Offset:
       2

- IOName: output_0
  Layout:
    Model: NHWC
    Custom: NCHW

Note:

  • If no change is required for an input or output, it can be skipped in the configuration file.

  • Datatype can be modified using custom I/O feature only if the model input or output datatype is float, float16, int8 or uint8. For other datatypes, ‘Datatype’ field should be skipped in the configuration file.

Usage

The custom IO config YAML file can be provided using the --custom_io option of qnn-onnx-converter. Sample usage is as follows:

$ ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
  --custom_io <path/to/YAML/file> ....

Custom IO Config Template File

The Custom IO Configuration file filled with default values can be obtained using the --dump_custom_io_config_template option of qnn-onnx-converter.

$ ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
  --input_network ${QNN_SDK_ROOT}/examples/Models/InceptionV3/tensorflow/model.onnx \
  --dump_custom_io_config_template <output_folder>/config.yaml

The dumped template file has an entry for each input and output of the model provided. Each field in the template file is filled with the default value obtained from the model for that particular input or output. The template file also has comments describing each field for the user.

Supported Use Cases

  1. Layout conversions of the input and output buffers of the model. Valid layout conversions are inter-conversions between:

    • NCDHW and NDHWC

    • NHWC and NCHW

    • NFC and NCF

    • NTF and TNF

  2. Passing quantized inputs of datatype uint8 or int8 to a non-quantized model. In this case, users must provide the scale and offset for the quantized inputs.

  3. Users can provide custom scale and offset for the inputs and outputs of a quantized model. The scale and offset generated by the quantizer are overridden by those provided by the user in the YAML file.

The user may use the --input_data_type and --output_data_type options of qnn-net-run to provide float or uint8_t type data to model inputs/outputs. Users may pass int8/uint8 data to, and receive it from, the model using the native option. By default, qnn-net-run assumes the data to be of type float32 and, for quantized models, performs quantization at the input and dequantization at the output.
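The layout inter-conversions listed in the first use case are axis permutations. As a minimal sketch, the Python below derives the permutation from the two layout strings and applies it to a shape. The helper names layout_permutation and convert_shape are hypothetical and only illustrate the relationship between the layouts; they are not part of the converter.

```python
def layout_permutation(src, dst):
    """Axis order that rearranges a tensor from layout src to layout dst."""
    if sorted(src) != sorted(dst):
        raise ValueError(f"{src} and {dst} do not name the same axes")
    return [src.index(axis) for axis in dst]


def convert_shape(shape, src, dst):
    """Apply the permutation to a shape given in the src layout."""
    return [shape[i] for i in layout_permutation(src, dst)]


print(layout_permutation("NCHW", "NHWC"))               # [0, 2, 3, 1]
print(convert_shape([1, 3, 299, 299], "NCHW", "NHWC"))  # [1, 299, 299, 3]
print(layout_permutation("NCDHW", "NDHWC"))             # [0, 2, 3, 4, 1]
```

The same permutation list is what a transpose of the actual data buffer would use.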

Limitations

  • Custom IO only supports providing the following datatypes: float32, float16, uint8, int8.

  • If the user needs to pass quantized inputs (i.e. of type int8 or uint8) to a non-quantized model, the scale and offset must be provided by the user in the YAML file. Not providing the scale and offset in this case would throw an error.
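The rules in this section can be checked before invoking the converter. The Python sketch below validates a single configuration entry, given as a dictionary, against the accepted layouts, the valid layout inter-conversions, the datatypes listed under Limitations, and the scale/offset requirement. The function name check_custom_io and the model_is_quantized parameter are assumptions for illustration; the converter's own validation may differ, and reading the YAML file itself (which needs a third-party parser) is left out.

```python
VALID_LAYOUTS = {"NCDHW", "NDHWC", "NCHW", "NHWC", "NFC", "NCF",
                 "NTF", "TNF", "NF", "NC", "F", "NONTRIVIAL"}
CONVERTIBLE = [{"NCDHW", "NDHWC"}, {"NHWC", "NCHW"},
               {"NFC", "NCF"}, {"NTF", "TNF"}]
VALID_DATATYPES = {"float32", "float16", "uint8", "int8"}


def check_custom_io(entry, model_is_quantized=False):
    """Return a list of problems found in one custom I/O entry (a dict)."""
    problems = []
    layout = entry.get("Layout")
    if layout:
        model, custom = layout.get("Model"), layout.get("Custom")
        for value in (model, custom):
            if value not in VALID_LAYOUTS:
                problems.append(f"unknown layout: {value}")
        if model != custom and {model, custom} not in CONVERTIBLE:
            problems.append(f"unsupported conversion: {model} -> {custom}")
    dtype = entry.get("Datatype")
    if dtype is not None and dtype not in VALID_DATATYPES:
        problems.append(f"unsupported datatype: {dtype}")
    quant = entry.get("QuantParam") or {}
    if dtype in ("uint8", "int8") and not model_is_quantized:
        # Quantized data fed to a non-quantized model needs scale and offset.
        if "Scale" not in quant or "Offset" not in quant:
            problems.append("quantized input to a non-quantized model "
                            "needs Scale and Offset")
    return problems


entry = {
    "IOName": "input_0",
    "Layout": {"Model": "NCHW", "Custom": "NHWC"},
    "Datatype": "uint8",
    "QuantParam": {"Type": "QNN_DEFINITION_DEFINED", "Scale": 0.12, "Offset": 2},
}
print(check_custom_io(entry))                                     # []
print(check_custom_io({"IOName": "input_0", "Datatype": "int8"}))
```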

Preserve I/O

Introduction

The Preserve I/O feature allows users to retain the layout and datatype of the inputs and outputs as present in the original ONNX model. By default, QNN converters may change the layout and datatype at the inputs and outputs of the model; preserving them lets the user avoid the pre- or post-processing steps otherwise needed to transform the data.

Usage

The different ways of using this option are as follows:

  1. The user may choose to preserve layouts and datatypes for all IO tensors by just passing the --preserve_io option as follows:

$ ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
  --preserve_io ....

  2. The user may choose to preserve only the layout or only the datatype for all the inputs and outputs of the graph as follows:

$ ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
  --preserve_io layout ....

or,

$ ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
  --preserve_io datatype ....

  3. The user may choose to preserve the layout or datatype for only a few inputs and outputs of the graph as follows:

$ ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
  --preserve_io layout <space separated list of names of inputs and outputs of the graph>....

or,

$ ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
  --preserve_io datatype <space separated list of names of inputs and outputs of the graph>....

  4. The user can pass a combination of --preserve_io layout and --preserve_io datatype as follows:

$ ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter \
  --preserve_io layout <space separated list of names of inputs and outputs of the graph> \
  --preserve_io datatype <space separated list of names of inputs and outputs of the graph> ....

Passing just --preserve_io layout and --preserve_io datatype together is valid and equivalent to passing --preserve_io only. Usage in point 3 cannot be combined with usage in point 1 or point 2 and will result in an error if used together.

Usage in qnn-pytorch-converter

In PyTorch models, tensors may have no names. Input tensors are named by passing -d, but output tensors are named by the converter’s internal logic. To preserve the layout or datatype for only specific output tensors, proceed as follows:

  1. Run a first pass of the converter and use the generated CPP/JSON file to fetch the names of the APP_READ type tensors.

  2. Run a second pass of the converter to preserve the layout or datatype for only the specified IO tensors, passing their names:

$ ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-pytorch-converter \
  --preserve_io layout <space separated list of names of inputs and outputs of the graph>....

or,

$ ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-pytorch-converter \
  --preserve_io datatype <space separated list of names of inputs and outputs of the graph>....
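Fetching the APP_READ tensor names from the first-pass JSON can be scripted. The Python sketch below filters the "tensors" section by its "type" field and builds the --preserve_io arguments for the second pass. It rests on an assumption that should be checked against QnnTypes.h in your SDK: that the resolved "type" value is numeric and that QNN_TENSOR_TYPE_APP_READ equals 1. The sample dictionary and the helper name app_read_tensor_names are hypothetical.

```python
# Assumption: with enum values resolved, "type" holds the numeric tensor type
# and QNN_TENSOR_TYPE_APP_READ is 1. Verify against QnnTypes.h before use.
APP_READ = 1


def app_read_tensor_names(model):
    """Names of tensors whose resolved type equals APP_READ."""
    tensors = model["graph"]["tensors"]
    return [name for name, info in tensors.items() if info["type"] == APP_READ]


# Hypothetical stand-in for json.load() applied to a first-pass model_net.json.
sample = {"graph": {"tensors": {
    "input": {"type": 0},
    "hidden": {"type": 3},
    "output_0": {"type": 1},
    "output_1": {"type": 1},
}}}

names = app_read_tensor_names(sample)
preserve_args = ["--preserve_io", "layout", *names]
print(" ".join(preserve_args))  # --preserve_io layout output_0 output_1
```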

Usage with other converter options

  1. --keep_int64_inputs need not be passed if preserve IO is used to preserve the datatype of such inputs.

  2. --use_native_input_files is set to True in case of quantization if preserve IO is used to preserve the datatypes.

  3. The layout specified using --input_layout is honored.

  4. Using --input_dtype with preserve IO may result in an error in case of datatype mismatch for any IO tensor.

  5. The layouts and datatypes specified using --custom_io get higher precedence over --preserve_io.

Since preserve IO retains the datatypes of IO tensors in the original model, the user must use --use_native_input_files or --native_input_tensor_names with qnn-net-run.

Common Parameters

There are a number of common parameters that can be passed to all the converters. These are described here: Tools

In addition, quantization parameters are also specified at conversion time. For more information refer to the tools document above and to: Quantization

Qairt Converter

The qairt-converter tool converts a model from one of the ONNX, TensorFlow, TFLite, or PyTorch frameworks to a DLC. The DLC contains the model in a Qualcomm graph format to support inference on Qualcomm’s AI accelerator cores. The prefix qairt, for Qualcomm AI Runtime, signifies that this converter can be used with both the Qualcomm Neural Processing SDK API and the Qualcomm AI Engine Direct API. The converter automatically detects the proper framework based on the source model extension.

Supported frameworks and file types are:

Framework     File Type
----------    ---------
Onnx          *.onnx
TensorFlow    *.pb
TFLite        *.tflite
PyTorch       *.pt

Basic Conversion

Basic conversion has only one required argument, --input_network, which is the path to the source framework model. The source model can be either a float model or a quantized model; qairt-converter converts it to the corresponding DLC, retaining the precision and datatype of the tensors. Some frameworks may require additional arguments that are otherwise listed as optional. Check the qairt-converter help text for more details.

../_static/resources/qairt_basic_conversion.png
  • Onnx Conversion

ONNX conversion currently supports up to ONNX Opset 22.

$ qairt-converter --input_network model.onnx
  • Tensorflow Conversion

Tensorflow additionally requires --source_model_input_shape and --out_tensor_node arguments. --source_model_input_shape is for specifying the list of all the input names and dimensions to the network model. --out_tensor_node is for specifying the network model’s output tensor name/s.

$ qairt-converter \
      --input_network inception_v3_2016_08_28_frozen.pb \
      --source_model_input_shape input 1,299,299,3 \
      --out_tensor_node InceptionV3/Predictions/Reshape_1

In the above example, the model inception_v3_2016_08_28_frozen.pb has input named input with dimensions (1,299,299,3), and output tensor with name InceptionV3/Predictions/Reshape_1.

Input/Output Layouts

The default input and output layouts in the converted graph are the same as per the source model. This behavior differs from the legacy converter which would modify the input and (optionally) the output layout to the spatial first format. An example single layer Onnx model (spatial last) is shown below.

../_static/resources/qairt-conversion-layout-comparison.png

Input/Output Customization using YAML

Note

This feature allows users to specify their desired input/output tensor layout for the converted model.

Users can provide a YAML configuration file to simplify using different input and output configurations using the --config command-line option. All configurations in the YAML are optional. If an option is provided in the YAML configuration and an equivalent option is provided on the command line, the command line option takes precedence over the one provided in the configuration file. The YAML configuration schema is shown below.

  • Name: Name of the input or output tensor present in the model that needs to be customized

  • Src Model Parameters

    These are mandatory if a certain equivalent desired configuration is specified.

    • DataType: Data type of the tensor in source model.

    • Layout: Tensor layout in the source model. Valid values are:

      • NCDHW

      • NDHWC

      • NCHW

      • NHWC

      • NFC

      • NCF

      • NTF

      • TNF

      • NF

      • NC

      • F

      where

      • N = Batch

      • C = Channels

      • D = Depth

      • H = Height

      • W = Width

      • F = Feature

      • T = Time

  • Desired Model Parameters

    • DataType: Desired data type of the tensor in the converted model. Valid values are float32, float16, uint8, int8 datatypes.

    • Layout: Desired tensor layout of the converted model. Same valid values as source layout.

    • Shape: Tensor shape/dimension in the converted model. Valid values are comma separated dimension values, i.e., (a,b,c,d).

    • Color Conversion: Tensor color encoding in the converted model. Valid values are BGR, RGB, RGBA, ARGB32, NV21, and NV12.

    • QuantParams: Required when the desired model data type is a quantized data type. Has two subfields: Scale and Offset.

      • Scale: Scale of the buffer as a float value.

      • Offset: Offset value as an integer.

    • Optional: During calls to graph execute, the client can use optional I/O tensors to signal to the backend which tensors are optionally provided or produced. Valid values are True, False.

The --dump_config_template option of qairt-converter saves an IO configuration template file at the specified location for the user to update.

../_static/resources/qairt_io_config.png
$ qairt-converter \
      --input_network model.onnx \
      --dump_config_template <output_folder>/io_config.yaml

This is the sample output of the dumped IO configuration file:

Converted Graph:
- Input Tensors:
- Output Tensors:

Input Tensor Configuration:
  # Input 1
  - Name: 'input'
    Src Model Parameters:
        DataType:
        Layout:
        Shape:
    Desired Model Parameters:
        DataType:
        Layout:
        Color Conversion:
        QuantParams:
          Scale:
          Offset:

Output Tensor Configuration:
  # Output 1
  - Name: 'output'
    Src Model Parameters:
        DataType:
        Layout:
    Desired Model Parameters:
        DataType:
        Layout:
        QuantParams:
          Scale:
          Offset:

Consider a model with the Source model I/O and Desired model I/O configuration as shown in the table below:

Input/Output Name    Source Model I/O       Desired Model I/O
                     (Datatype / Layout)    (Datatype / Layout)
-----------------    -------------------    -------------------
‘input_0’            float32 / NCHW         uint8 / NHWC
‘output_0’           float32 / NCHW         uint8 / NHWC

Here is an example io_config.yaml in which the input and output tensor layouts are converted from NCHW in the source model to NHWC in the converted model, and the datatypes are converted from float32 in the source model to uint8 in the converted model.

Converted Graph:
- Output Tensors: ['output']

Input Tensor Configuration:
  # Input 1
  - Name: 'input'
    Src Model Parameters:
        DataType: float32
        Layout: NCHW
        Shape:
    Desired Model Parameters:
        DataType: uint8
        Layout: NHWC
        Color Conversion:
        QuantParams:
          Scale:
          Offset:

Output Tensor Configuration:
  # Output 1
  - Name: 'output'
    Src Model Parameters:
        DataType: float32
        Layout: NCHW
    Desired Model Parameters:
        DataType: uint8
        Layout: NHWC
        QuantParams:
          Scale:
          Offset:

The configuration file is then passed to the converter with the --config option:

$ qairt-converter \
      --input_network model.onnx \
      --config io_config.yaml

Disconnected Input Preservation

In deep learning framework models, computational graphs often contain multiple inputs. During graph optimization, unused inputs may be removed through techniques like constant folding or dead code elimination. While this improves performance and reduces memory usage, it can interfere with workflows that rely on all inputs of the source framework model being present, especially in scenarios involving inference. To address this, qairt-converter retains all the source framework model inputs, ensuring that all graph inputs remain part of the graph regardless of their usage. This behavior is similar to that of other open-source inference engines. Unused graph input nodes can be removed by passing the --remove_unused_inputs command line argument to qairt-converter.

Retaining unused or disconnected inputs provides the following benefits:
  • Avoid unintended side effects during model conversion.

  • Facilitate debugging and analysis by retaining all original inputs.

The following figure shows a source graph (left) with two inputs, i1 and i2. The input i2 becomes disconnected after conversion, but it is preserved in the converted graph.

../_static/resources/qairt_disconnected_input_nodes.png
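
The idea of a disconnected input can be sketched in a few lines of Python. The graph representation and the helpers find_unused_inputs and select_inputs below are hypothetical and exist only for illustration; they are not the converter's IR or API. The remove_unused parameter mirrors the effect described for --remove_unused_inputs.

```python
# Hypothetical toy graph: not the converter's IR or API.
graph_inputs = ["i1", "i2"]
nodes = [
    {"name": "conv", "inputs": ["i1"], "outputs": ["t0"]},
    {"name": "relu", "inputs": ["t0"], "outputs": ["out"]},
]


def find_unused_inputs(graph_inputs, nodes):
    """Return graph inputs that no node consumes, in their original order."""
    consumed = {name for node in nodes for name in node["inputs"]}
    return [name for name in graph_inputs if name not in consumed]


def select_inputs(graph_inputs, nodes, remove_unused=False):
    """Keep every input by default; drop disconnected ones only on request."""
    if not remove_unused:
        return list(graph_inputs)
    unused = set(find_unused_inputs(graph_inputs, nodes))
    return [name for name in graph_inputs if name not in unused]


print(find_unused_inputs(graph_inputs, nodes))                 # ['i2']
print(select_inputs(graph_inputs, nodes))                      # ['i1', 'i2']
print(select_inputs(graph_inputs, nodes, remove_unused=True))  # ['i1']
```

Keeping i2 by default means an application that supplies both inputs to the source model can supply the same set to the converted model without changes.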

QAT encodings

QAT encodings are quantization-aware training encodings that are present in the source network model. They can appear in the following forms.

  • FakeQuant Nodes: The source network model can contain FakeQuant nodes. These nodes simulate the quantize-dequantize operations and use parameters such as scale and zero-point to map floating-point values to quantized values and back. During conversion these nodes are removed and the corresponding encodings are applied to generate a quantized or mixed precision DLC output.

    ../_static/resources/qairt_fakequant.png
  • Quantization overrides: Output encodings can be associated with the output tensors in the source network model via overrides. The quantization overrides for the tensors (outputs, weights, biases, activations) in the source network model can be provided to qairt-converter as a JSON file using the --quantization_overrides command-line option. When this option is specified, qairt-converter produces a fully quantized or mixed precision graph, depending on the overrides: it applies the encoding overrides, propagates encodings across data-invariant ops, and falls back to a float datatype for tensors without encodings.

  • Quant-Dequant Nodes: The source network model can contain Quant-Dequant (QDQ) nodes. Quant nodes convert floating-point values to lower-precision values, typically integers, to reduce the model's memory footprint and improve inference time. Dequant nodes do the opposite: they convert lower-precision values back to floating-point values to provide higher precision for certain operations. During conversion these nodes are removed and the corresponding encodings are applied to generate a quantized or mixed precision DLC output.

    ../_static/resources/qairt_qdq.png

    Note

    • Inference fails on the CPU and DSP runtimes if the QAT encodings contain 16-bit encodings.
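
The quantize-dequantize round trip that FakeQuant and QDQ nodes simulate can be sketched as follows. This uses a generic affine convention (q = round(x / scale) + zero_point, clamped to an unsigned range) and a hypothetical helper named fake_quantize; the exact scale/offset convention and rounding used by the converter may differ, so treat it as an illustration of the concept only.

```python
# Generic affine quantization sketch; the converter's own convention may differ.

def fake_quantize(x, scale, zero_point, bitwidth=8):
    """Return (quantized integer, dequantized float) for a single value."""
    qmin, qmax = 0, (1 << bitwidth) - 1
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))  # clamp to the representable integer range
    return q, (q - zero_point) * scale


scale, zero_point = 0.5, 128

print(fake_quantize(1.3, scale, zero_point))    # (131, 1.5)
print(fake_quantize(-0.2, scale, zero_point))   # (128, 0.0)
print(fake_quantize(100.0, scale, zero_point))  # (255, 63.5)
```

The first two values show rounding to the nearest grid point, and the last shows saturation: anything beyond the range covered by the scale and zero-point is clamped.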

Float model Usecases

  • Float bitwidth conversions

    Users can convert a float source model between float bitwidths 16 and 32 by passing the --float_bitwidth flag to the qairt-converter tool.

    ../_static/resources/qairt_float_conversion.png

    To convert a source model with all float32 tensors to float16 tensors, use --float_bitwidth 16.

    Note

    • Float bitwidth 32 is the default bitwidth for float source model conversion.

    • Float bitwidth 16 is the default bitwidth for source models with quantization encodings or overrides.

    $ qairt-converter --input_network model.onnx \
          --float_bitwidth 16
    

    To convert a source model with all float16 tensors to float32 tensors, use --float_bitwidth 32.

    $ qairt-converter --input_network model.onnx \
          --float_bitwidth 32
    
  • Float16 Conversion with Float32 bias

    To generate a float16 graph with the bias still in float32, an additional --float_bias_bitwidth 32 flag can be passed.

    $ qairt-converter --input_network model.onnx \
        --float_bitwidth 16 \
        --float_bias_bitwidth 32
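
The numerical effect of choosing float16 versus float32 can be seen with Python's standard struct module, which supports IEEE 754 half precision via the "e" format. This is only a sketch of the rounding involved; the helper names to_float16 and to_float32 and the sample values are illustrative assumptions and say nothing about how the converter stores tensors.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_float32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


weights = [0.1, 1000.3, 1.0]
bias = 0.1

# Weights stored as float16 lose precision, more visibly at larger magnitudes.
print([to_float16(w) for w in weights])  # [0.0999755859375, 1000.5, 1.0]

# A bias kept in float32 stays much closer to the original value.
print(to_float32(bias))                  # 0.10000000149011612
```

This illustrates why keeping the bias in float32 can be useful when the rest of the graph is float16.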
    

Quantization overrides Usecases

  • Float mixed precision conversion

    Users can provide overrides to qairt-converter to convert a floating-point source model to a mixed float precision (float16 and float32) model. For example, if all tensors in the source model have float32 precision and the user wants to change some tensors to float16, the override file should contain the names of those tensors with their type set to float16.

    ../_static/resources/qairt_override_float_mp.png
    $ qairt-converter \
      --input_network model.onnx \
      --quantization_overrides <path to json>/overrides.json
    
  • Quant conversion

    Users can also convert a float or mixed precision source model to a quantized model using quantization overrides. The qairt-converter generates a fully quantized or mixed precision graph based on the overrides provided.

    ../_static/resources/qairt_override_float_mp_quant.png
    $ qairt-converter \
      --input_network model.onnx \
      --quantization_overrides <path to json>/overrides.json
    
  • Overrides to Float conversion

    Users can convert a source model with overrides to float so that it runs on floating-point runtimes, i.e. QNN-GPU and QNN-CPU, using the command-line option --export_format=DLC_STRIP_QUANT.

    Note

    • This might result in loss of accuracy.

    ../_static/resources/qairt_override_strip_quant.png
    $ qairt-converter \
          --input_network model.onnx \
          --quantization_overrides <path to json>/overrides.json \
          --export_format=DLC_STRIP_QUANT
    

Quantized model Usecases

  • Quant model conversion

    Users can convert a quantized model with qairt-converter in a single step.

    ../_static/resources/qairt_quant_conversion.png
    $ qairt-converter \
          --input_network quant_model.onnx
    
  • Quant to Float conversion

    Users can convert a quantized source model to float so that it runs on floating-point runtimes, i.e. QNN-GPU and QNN-CPU, using the command-line option --export_format=DLC_STRIP_QUANT.

    Note

    • This might result in loss of accuracy.

    ../_static/resources/qairt_quant_strip_quant.png
    $ qairt-converter --input_network quant_model.onnx \
        --export_format=DLC_STRIP_QUANT
    

Quant-Dequant(QDQ) model Usecases

  • QDQ model conversion

    Users can convert a Quant-Dequant source model to a quantized model with qairt-converter in a single step.

    ../_static/resources/qairt_qdq_conversion.png
    $ qairt-converter \
          --input_network quant_dequant_model.onnx
    
  • QDQ to Float conversion

    Users can convert a Quant-Dequant source model to float so that it runs on floating-point runtimes, i.e. QNN-GPU and QNN-CPU, using the command-line option --export_format=DLC_STRIP_QUANT.

    ../_static/resources/qairt_qdq_strip_quant.png

    Note

    • This might result in loss of accuracy.

    $ qairt-converter --input_network model.onnx \
        --export_format=DLC_STRIP_QUANT
    

DryRun

Use the --dry_run option to evaluate the model without actually converting any ops. This returns unsupported ops/attributes and unused inputs/outputs.

FAQs

  • How is QAIRT Converter different from Legacy Converters?

    • Single converter vs independent framework converters

      The qairt-converter is a single tool that handles conversion for all supported frameworks, selecting the framework based on the model extension, whereas the legacy converters were separate framework-specific tools.

    • Changed some optional arguments as default behavior

      The default input and output layouts in the converted graph are the same as in the source graph. The legacy ONNX and PyTorch converters may not always retain the input and output layouts from the source graph.

    • Removed deprecated arguments

      Deprecated arguments on the legacy converters are not enabled on the new converter.

    • Renamed some arguments for clarity

      The --input_encoding argument is renamed to --input_color_encoding. Framework-specific arguments now include the framework name. For example, --define_symbol is renamed to --onnx_define_symbol, --show_unconsumed_nodes is renamed to --tf_show_unconsumed_nodes, and --signature_name is renamed to --tflite_signature_name.

    • DLC as the Converter output file format

      The QAIRT Converter uses DLC as its output format. The .cpp/.bin and .json formats produced by the qnn-<framework>-converter tools are not supported by the QAIRT Converter. To generate .cpp/.bin and .json output, continue to use the legacy converters.

    • HTP as Default Backend in QAIRT vs Legacy Converters

      HTP is set as the default backend in the QAIRT converter, which may enable certain HTP-specific behaviors that wouldn’t be triggered by default in legacy converters where the backend is left empty. This difference can affect how some backend-dependent features behave during conversion/quantization.

      • For example, during quantization, an optimization called IntBiasUpdates is applied to the FullyConnected op if the backend is set to HTP in SNPE, whereas it is always applied in QAIRT.

    • Quantizer functionality is separated from Conversion functionality

      • qnn-<framework>-converter invokes the quantizer as part of the converter tool when --input_list or --float_fallback is passed.

      • qairt-quantizer however is a standalone tool for quantization like snpe-dlc-quant.

      • Please refer to qairt-quantizer for more information and usage details.

    • QAIRT Converter preserves the original output order from ONNX models, while legacy converters may reorder outputs.

    To maintain output order in the legacy converter (qnn-onnx-converter), use the --preserve_onnx_output_order flag.

  • Will the Converted model be any different with QAIRT converter compared to Legacy Converter?

    • The result of the QAIRT Converter will be different from the result of Legacy Converters in terms of the input/output layout.

    • Legacy converters will by default modify the input tensors to Spatial First (e.g. NHWC) layout. This means for Frameworks like ONNX, where the predominant layout is Spatial Last (e.g. NCHW), the input/output layout is different between the source model and the converted model.

    • Since the QAIRT Converter preserves the source layouts by default, the QAIRT-converted graphs for many ONNX/PyTorch models will differ from the Legacy-converted graphs.

    • QAIRT Converter preserves the original output order from ONNX models.