Quantization

This page describes the general quantization process and supported algorithms and features.

Overview

Non-quantized model files use 32-bit floating point representations of network parameters. Quantized model files use fixed point representations of network parameters, generally 8-bit weights and 8- or 32-bit biases. The fixed point representation is the same as that used in TensorFlow quantized models.

Choosing Between a Quantized or Non-Quantized Model

  • CPU - Choose a non-quantized model. Quantized models are currently incompatible with the CPU backend.

  • DSP - Choose a quantized model. Quantized models are required when running on the DSP backend.

  • GPU - Choose a non-quantized model. Quantized models are currently incompatible with the GPU backend.

  • HTP - Choose a quantized model. Quantized models are required when running on the HTP backend.

  • HTA - Choose a quantized model. Quantized models are required when running on the HTA backend.

Quantization

This section describes the concepts behind the quantization algorithm used in QNN. These concepts are used by the converters when the developer decides to quantize a graph.

Overview

QNN supports multiple quantization modes. The basics of the quantization, regardless of mode, are described here.

  • Quantization converts floating point data to the TensorFlow-style fixed point format using a provided bit width.

  • The following requirements are satisfied:

    • Full range of input values is covered.

    • Minimum range of 0.0001 is enforced.

    • Floating point zero is exactly representable.

  • Quantization algorithm inputs:

    • Set of floating point values to be quantized.

  • Quantization algorithm outputs:

    • Set of 8-bit fixed point values.

    • Encoding parameters:

      • encoding-min - minimum floating point value representable (by fixed point value 0)

      • encoding-max - maximum floating point value representable (by fixed point value 255)

      • scale - The step size for the given range (max - min) / (2^bw-1)

      • offset - The integer value which exactly represents 0. round(-min/scale)

  • Algorithm

    1. Compute the true range (min, max) of input data.

    2. Compute the encoding-min and encoding-max.

    3. Quantize the input floating point values.

    4. Output:

    • fixed point values

    • encoding-min and encoding-max parameters

    • scale and offset parameters

Details

  1. Compute the true range of the input floating point data.

  • finds the smallest and largest values in the input data

  • represents the true range of the input data

  2. Compute the encoding-min and encoding-max.

  • These parameters are used in the quantization step.

  • These parameters define the range and floating point values that will be representable by the fixed point format.

    • encoding-min: specifies the smallest floating point value that will be represented by the fixed point value of 0

    • encoding-max: specifies the largest floating point value that will be represented by the fixed point value of 255

    • floating point values at every step size, where step size = (encoding-max - encoding-min) / (2^bw-1), will be representable

    • offset where zero is exactly represented

  • encoding-min and encoding-max are first set to the true min and true max computed in the previous step

  • Requirements

    1. Encoding range must be at least 0.0001

    • encoding-max is adjusted to max(true max, true min + 0.0001)

    2. Floating point value of 0 must be exactly representable

    • encoding-min or encoding-max may be further adjusted

  Cases - Handling 0

  1. Inputs are strictly positive

    • the encoding-min is set to 0.0

    • zero floating point value is exactly representable by smallest fixed point value 0

    • e.g. input range = [5.0, 10.0]

      • encoding-min = 0.0, encoding-max = 10.0

  2. Inputs are strictly negative

  • encoding-max is set to 0.0

  • zero floating point value is exactly representable by the largest fixed point value 255

  • e.g. input range = [-20.0, -6.0]

    • encoding-min = -20.0, encoding-max = 0.0

  3. Inputs are both negative and positive

  • encoding-min and encoding-max are slightly shifted to make the floating point zero exactly representable

  • e.g. input range = [-5.1, 5.1]

    • encoding-min and encoding-max are first set to -5.1 and 5.1, respectively

    • encoding range is 10.2 and the step size is 10.2/255 = 0.04

    • zero value is currently not representable. The closest values representable are -0.02 and +0.02 by fixed point values 127 and 128, respectively

    • encoding-min and encoding-max are shifted by -0.02. The new encoding-min is -5.12 and the new encoding-max is 5.08

    • floating point zero is now exactly representable by the fixed point value of 128

  3. Quantize the input floating point values.

  • encoding-min and encoding-max parameters determined in the previous step are used to quantize all the input floating point values to their fixed point representation

  • Quantization formula is:

    • quantized value = round(255 * (floating point value - encoding.min) / (encoding.max - encoding.min))

  • quantized value is also clamped to be within 0 and 2^bw-1

  4. Outputs

  • the fixed point values

  • encoding-min, encoding-max, scale, and offset parameters

Quantization Example

  1. Inputs:

  • input values = [-1.8, -1.0, 0, 0.5]

  • encoding-min is set to -1.8 and encoding-max to 0.5

  • encoding range is 2.3, which is larger than the required 0.0001

  • encoding-min is adjusted to −1.803922 and encoding-max to 0.496078 to make zero exactly representable

  • step size is 0.009020

  • offset is 200

  2. Outputs:

  • quantized values are [0, 89, 200, 255]

Dequantization Example

  1. Inputs:

  • quantized values = [0, 89, 200, 255]

  • encoding-min = −1.803922, encoding-max = 0.496078

  • step size is 0.009020

  • offset is 200

  2. Outputs:

  • dequantized values = [−1.8039, −1.0011, 0.0000, 0.4961]
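The worked quantization and dequantization examples above can be reproduced with a short sketch of the default 8-bit tf-mode algorithm (the function names here are illustrative, not QNN APIs):

```python
def compute_encoding(values, bw=8):
    """Compute TF-style encoding parameters for a list of floats."""
    steps = 2 ** bw - 1
    true_min, true_max = min(values), max(values)
    # Cover the full input range and make sure 0.0 falls inside it.
    emin, emax = min(true_min, 0.0), max(true_max, 0.0)
    # Enforce the minimum range of 0.0001.
    emax = max(emax, emin + 0.0001)
    scale = (emax - emin) / steps
    # Shift the range so floating point zero is exactly representable.
    offset = round(-emin / scale)
    emin = -offset * scale
    emax = emin + steps * scale
    return emin, emax, scale, offset

def quantize(values, emin, scale, bw=8):
    steps = 2 ** bw - 1
    return [min(max(round((v - emin) / scale), 0), steps) for v in values]

def dequantize(qvalues, emin, scale):
    return [emin + q * scale for q in qvalues]

emin, emax, scale, offset = compute_encoding([-1.8, -1.0, 0, 0.5])
print(quantize([-1.8, -1.0, 0, 0.5], emin, scale))  # [0, 89, 200, 255]
```

Running this reproduces the encoding-min/encoding-max of -1.803922/0.496078 and the offset of 200 from the example above.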

Bitwidth Selection

QNN currently supports a default quantization bit width of 8 for both weights and biases. The weight, bias, and activation bit widths, however, can be overridden by passing one of --weight_bw, --bias_bw, and/or --act_bw followed by the bit width. Please see the converter documentation for more details on the command line options.

Packed 4-bit Quantization

In packed 4-bit quantization, two 4-bit quantized values can be stored in a single 8-bit buffer. The lower nibble stores the first value while the higher nibble stores the second value. This can be enabled by providing the --pack_4_bit_weights flag. For the quantized values (10, 4), the unpacked and packed representations are given below.

  • Unpacked = (0000 1010, 0000 0100)

  • Packed = (0100 1010)

In case of per-channel/per-row quantization, the quantized values are packed along each channel/row. For a tensor of size (3,3,3,32) containing 32 output channels and 27 values per channel, the unpacked and packed representation will take the following amount of memory for the 27 quantized values per channel.

  • Unpacked = (3*3*3) = 27 bytes

  • Packed = ceil((3*3*3)/2) = 14 bytes
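The packing scheme above (lower nibble first, odd tails padded) can be sketched as follows; pack_4bit is a hypothetical helper, not a QNN API, and assumes unsigned 4-bit values:

```python
def pack_4bit(values):
    """Pack unsigned 4-bit quantized values into bytes, lower nibble first."""
    packed = bytearray()
    for i in range(0, len(values), 2):
        lo = values[i] & 0xF
        hi = (values[i + 1] & 0xF) if i + 1 < len(values) else 0  # pad odd tail
        packed.append((hi << 4) | lo)
    return bytes(packed)

print(pack_4bit([10, 4]).hex())   # "4a", i.e. 0100 1010
print(len(pack_4bit([0] * 27)))   # 14, matching ceil(27/2)
```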

Note: Packed 4-bit tensors are stored with QNN_DATATYPE_SFIXED_POINT_4/QNN_DATATYPE_UFIXED_POINT_4 datatypes while unpacked 4-bit tensors are stored with QNN_DATATYPE_SFIXED_POINT_8/QNN_DATATYPE_UFIXED_POINT_8 datatypes. Please refer to the backend supplements to find ops which support 4-bit packed tensors.

Quantization Modes

QNN supports four quantization modes: tf, symmetric, enhanced, and adjusted. The primary difference is in how they select the quantization range.

TF

The default mode has been described above, and uses the true min/max of the data being quantized, followed by an adjustment of the range to ensure a minimum range and to ensure 0.0 is exactly quantizable.

Symmetric

Symmetric quantization follows the same basic principles as TF quantization but adjusts the range to be symmetric. It selects a new maximum from the original range such that new_max = max(abs(min), abs(max)) and sets the range to (-new_max, new_max), making it symmetric around 0. This mode is typically used only for weights, as it helps reduce computation overhead at runtime. It is enabled by passing --param_quantizer symmetric to one of the converters.
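A minimal sketch of the symmetric range selection, reusing the unsigned 8-bit grid from the earlier algorithm (function name is illustrative):

```python
def symmetric_encoding(values, bw=8):
    """Select a range symmetric around 0.0, then derive scale and offset."""
    steps = 2 ** bw - 1
    new_max = max(abs(min(values)), abs(max(values)))
    emin, emax = -new_max, new_max
    scale = (emax - emin) / steps
    offset = round(-emin / scale)  # zero maps to the middle of the grid
    return emin, emax, scale, offset

print(symmetric_encoding([-0.5, 2.0])[:2])  # range becomes (-2.0, 2.0)
```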

Enhanced

Enhanced quantization mode (invoked by passing “enhanced” to either the --param_quantizer or --act_quantizer options in one of the converters) uses an algorithm to try to determine a better set of quantization parameters to improve accuracy. The algorithm may pick a different min/max value than the default quantizer, and in some cases it may set the range such that some of the original weights and/or activations cannot fall into that range. However, this range does produce better accuracy than simply using the true min/max. The enhanced quantizer can be enabled independently for weights and activations by appending either “weights” or “activations” after the option.

This is useful for some models where the weights and/or activations may have “long tails”. (Imagine a range with most values between -100 and 1000, but a few values much greater than 1000 or much less than -100.) In some cases these long tails can be ignored and the range -100, 1000 can be used more effectively than the full range.

Enhanced quantizer still enforces a minimum range and ensures 0.0 is exactly quantizable.

TF Adjusted

This mode is used only for quantizing weights to 8-bit fixed point (invoked by passing “adjusted” to either the --param_quantizer or --act_quantizer options in one of the converters). It uses an adjusted min or max of the data being quantized, rather than the true min/max or a min/max that excludes the long tail. This has been verified to provide an accuracy benefit for denoise models specifically. With this quantizer, the max will be expanded or the min decreased if necessary.

Adjusted weights quantizer still enforces a minimum range and ensures 0.0 is exactly quantizable.

Enhanced Quantization Techniques

Quantization can be a difficult problem to solve due to the myriad of training techniques, model architectures, and layer types. To mitigate quantization problems, model preprocessing techniques have been added to the quantizer that may improve quantization performance on models which exhibit sharp drops in accuracy upon quantization.

The primary technique introduced is CLE (Cross Layer Equalization).

CLE works by scaling the convolution weight ranges in the network by making use of a scale-equivariance property of activation functions. In addition, the process absorbs high biases, which may result from weight scaling, from one convolution layer into a subsequent convolution layer.

Enhanced Quantization Techniques: Limitations

In many cases CLE may enable quantized models to return to close to their original floating-point accuracy. There are some caveats/limitations to the current algorithms:

CLE operates on specific patterns of operations that all exist in a single branch (outputs cannot be consumed by more than one op). The matched operation patterns (r=required, o=optional) are:

Conv(r)->Batchnorm(r)->activation(o)->Conv(r)->Batchnorm(r)->activation(o)

Conv(r)->Batchnorm(r)->activation(o)->DepthwiseConv(r)->Batchnorm(r)->activation(o)->Conv(r)->Batchnorm(r)->activation(o)

The CLE algorithm currently only supports Relu activations. Any Relu6 activations will be automatically changed to Relu, and any activations other than these will cause the algorithm to ignore the preceding convolution. Typically the switch from Relu6 to Relu is harmless and does not cause any degradation in accuracy; however, some models may exhibit a slight degradation. In that case, CLE can only recover accuracy to that degraded level, not to the original float accuracy.

CLE requires batchnorms (specifically, detectable batchnorm beta/gamma data) to be present in the original model before conversion to DLC for the complete algorithm to be run and to regain maximum accuracy. For TensorFlow, the beta and gamma can sometimes still be found even with folded batchnorms, as long as the folding didn’t fold the parameters into the convolution’s static weights and bias.

If the required information is not detected, you may see a message like: “Invalid model for HBA quantization algorithm.” This indicates the algorithm will only partially run, and accuracy issues may be present.

To run CLE, simply add the option --algorithms cle to the converter command line.

More information about the algorithms can be found at https://arxiv.org/abs/1906.04721

Quantization Impacts

Quantizing a model and/or running it in a quantized runtime (like the HTP) can affect accuracy. Some models may not work well when quantized and may yield incorrect results. The metrics for measuring the impact of quantization on a classification model are typically “Mean Average Precision”, “Top-1 Error”, and “Top-5 Error”.

Quantization Overrides

If the option --quantization_overrides is provided, the user may supply a JSON file with parameters to use for quantization. These override any quantization data carried over from conversion (e.g. TF fake quantization) or calculated during the normal quantization process. The format is defined per the AIMET specification.

There are two sections in the json, a section for overriding operator output encodings called “activation_encodings” and a section for overriding parameter (weight and bias) encodings called “param_encodings”. Both must be present in the file, but can be empty if no overrides are desired. An example with all of the currently supported options:

{
   "activation_encodings": {
       "Conv1:0": [
           {
               "bitwidth": 8,
               "max": 12.82344407824954,
               "min": 0.0,
               "offset": 0,
               "scale": 0.050288015993135454
           }
       ],
       "input:0": [
           {
               "bitwidth": 8,
               "max": 0.9960872825108046,
               "min": -1.0039304197656937,
               "offset": 127,
               "scale": 0.007843206675594112
           }
       ]
   },
   "param_encodings": {
       "Conv2d/weights": [
           {
               "bitwidth": 8,
               "max": 1.700559472933134,
               "min": -2.1006477158567995,
               "offset": 140,
               "scale": 0.01490669485799974
           }
       ]
   }
}

Note that it is not required to provide scale and offset, but bitwidth, min, and max must be provided. Scale and offset will be recalculated from the provided bitwidth, min, and max parameters regardless of whether they are supplied.
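That recalculation follows the formulas given in the algorithm description earlier (note that offset sign conventions can vary between tools); a sketch checked against the Conv1:0 entry above:

```python
def derive_scale_offset(bitwidth, enc_min, enc_max):
    """Recompute scale and offset from bitwidth/min/max, as the quantizer does."""
    scale = (enc_max - enc_min) / (2 ** bitwidth - 1)
    offset = round(-enc_min / scale)
    return scale, offset

# Values from the "Conv1:0" override entry above.
scale, offset = derive_scale_offset(8, 0.0, 12.82344407824954)
print(round(scale, 9), offset)  # 0.050288016 0
```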

Per-channel Quantization Overrides

Per-channel quantization should be used for tensors that are weight inputs to Conv consumers (Conv2d, Conv3d, TransposeConv2d, DepthwiseConv2d). This section provides examples of manually overriding per-channel encodings for the weight tensors of these Conv-based ops. Per-channel quantization is used when multiple encodings (equal to the number of channels) are provided for a given tensor. Examples for a convolution weight are shown for the following cases.

  • Case 1: Asymmetric encodings without per-channel quantization

{
    "features.9.conv.3.weight": [
        {
            "bitwidth": 8,
            "is_symmetric": "False",
            "max": 3.0387749017453665,
            "min": -2.059169834735364,
            "offset": -103,
            "scale": 0.019991940143061618
        }
    ]
}
  • Case 2: Per-channel quantization encodings with 3 output channels

{
    "features.8.conv.3.weight": [
        {
            "bitwidth": 8,
            "is_symmetric": "True",
            "max": 0.7011175155639648,
            "min": -0.7066381259227362,
            "offset": -128.0,
            "scale": 0.005520610358771377
        },
        {
            "bitwidth": 8,
            "is_symmetric": "True",
            "max": 0.5228064656257629,
            "min": -0.5269230519692729,
            "offset": -128.0,
            "scale": 0.004116586343509945
        },
        {
            "bitwidth": 8,
            "is_symmetric": "True",
            "max": 0.7368279099464417,
            "min": -0.7426297045129491,
            "offset": -128.0,
            "scale": 0.005801794566507415
        }
    ]
}

Note: Per-channel quantization must use a symmetric representation with offset == -2^(bitwidth-1). Per-channel encodings always have is_symmetric = True.
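These constraints can be checked directly against the Case 2 encodings above (a sketch using the first channel's values; the helper name is illustrative):

```python
def check_per_channel(bitwidth, enc_min, enc_max, scale, offset):
    """Verify symmetric per-channel constraints: offset == -2^(bw-1), grid consistency."""
    steps = 2 ** bitwidth - 1
    assert offset == -(2 ** (bitwidth - 1))            # symmetric offset
    assert abs(scale - (enc_max - enc_min) / steps) < 1e-12
    assert abs(enc_min - offset * scale) < 1e-9        # min sits exactly on the grid
    return True

# First channel of "features.8.conv.3.weight" from Case 2.
print(check_per_channel(8, -0.7066381259227362, 0.7011175155639648,
                        0.005520610358771377, -128))   # True
```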

INT32 Overrides

INT32 overrides can also be provided to override an op to run in INT32 precision. To support running an op in INT32 precision, INT32 overrides should be provided for all of its inputs and outputs. For a quantized model, this injects a Dequantize op followed by a Cast (to: INT32) op at the inputs of the op, and a Cast (to: FP32) op followed by a Quantize op at the outputs of the op. The sample graph below shows an op “Op2” whose input and output tensors are overridden to INT32 through external overrides; the second graph is then generated to support the INT32 overrides using Dequantize, Cast (to: INT32), Cast (to: FP32), and Quantize ops.

[Figure: quantization_int32_graph.png]

Note: INT32 overrides are only supported for ops which do not have weights and bias.

Quantizing a Model

To enable quantization, simply pass the option --input_list along with a text file containing raw data inputs to the network. Note that the inputs specified in this file should match exactly with the inputs in the .cpp file generated by conversion. In most cases, these inputs can be obtained directly from the source framework model. However, in rare cases, such as when the inputs are pruned by the converter, these inputs can differ. The format of the file uses a single line for each set of inputs to the network:

<inputFile0>
<inputFile1>
<inputFile2>

If a network contains multiple inputs they are all listed on a single line separated by a space and prefaced with the input name and a “:=”

<inputNameA>:=<inputFile0a> <inputNameB>:=<inputFile0b>
<inputNameA>:=<inputFile1a> <inputNameB>:=<inputFile1b>
<inputNameA>:=<inputFile2a> <inputNameB>:=<inputFile2b>

Examples

For a graph containing a single input, the input text file would contain something like:

/path/to/file/chair.raw
/path/to/file/mongoose.raw
/path/to/file/honeybadger.raw

For a network containing multiple graph inputs:

input_left_eye:=left0.rawtensor input_right_eye:=right0.rawtensor
input_left_eye:=left1.rawtensor input_right_eye:=right1.rawtensor
input_left_eye:=left2.rawtensor input_right_eye:=right2.rawtensor
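Such a list can be generated with a short script; the file names below are hypothetical, mirroring the two-input example above:

```python
from pathlib import Path

# Hypothetical calibration files for a two-input graph.
pairs = [("left0.rawtensor", "right0.rawtensor"),
         ("left1.rawtensor", "right1.rawtensor"),
         ("left2.rawtensor", "right2.rawtensor")]

# One line per input set, "<name>:=<file>" pairs separated by spaces.
lines = [f"input_left_eye:={left} input_right_eye:={right}" for left, right in pairs]
Path("input_list.txt").write_text("\n".join(lines) + "\n")
```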

Mixed Precision and FP16 Support

Mixed Precision enables specifying different bit widths (e.g. INT8 or INT16) or datatypes (integer or floating point) for different ops within the same graph. Data type conversion ops are automatically inserted when activation precision or data type is different between successive ops. Graphs can have a mix of floating-point and fixed-point data types. Each op can have different precision for weights and activations. However, for a particular op, either all inputs, outputs and parameters (weights/biases) will be floating-point or all will be integer type. Please refer to the backend supplements for the supported weight/activation bit widths for a particular op.

FP16 (half-precision) additionally enables converting the entire model to FP16, or selecting between FP16 and FP32 data types for the float ops in mixed precision graphs that contain both floating point and integer ops. The different modes of using mixed precision are described below.

Non-quantized Mode

In this mode no calibration images are given (the --input_list flag is not passed) to the converter. The converted QNN model has only float tensors for both activations and weights.

  • Non-quantized FP16: If --float_bw 16 is added to the command line, all activation and weight/bias tensors are converted to FP16.

  • Non-quantized FP32: If --float_bw is absent from the command line or --float_bw 32 is given, all activation and weight/bias tensors use the FP32 format.

Quantized Mode

In this mode calibration images are given (--input_list is passed) to the converter. The converted QNN model has fixed point tensors for activations and weights.

  • No override: If no --quantization_overrides flag is given with an encoding file, all activations are quantized as per --act_bw (default 8), and parameters are quantized as per --weight_bw/--bias_bw (default 8/8), respectively.

  • Full override: If the --quantization_overrides flag is given along with an encoding file specifying encodings for all ops in the model, the bitwidth will be set as per the JSON for all ops, each defined as integer or float per the encoding file (dtype='int' or dtype='float' in the encoding JSON).

  • Partial override: If the --quantization_overrides flag is given along with an encoding file specifying partial encodings (i.e. encodings are missing for some ops), the following will happen.

    • Layers for which encodings are NOT available in the JSON file are encoded in the same manner as the no-override case, i.e. defined as integer with bitwidth per --act_bw/--weight_bw/--bias_bw (or their default values 8/8/8). For some ops (Conv2d, Conv3d, TransposeConv2d, DepthwiseConv2d, FullyConnected, MatMul), if any of the output/weights/bias are specified as float in the encoding file, all three will be overridden to float. The float bitwidth used will be the same as the float bitwidth of the overriding tensor in the encodings file. The bitwidth of the bias tensor in such cases (when its encodings are absent from the JSON but present for output/weights) can also be controlled manually with the --float_bias_bw (16/32) flag.

    • Layers for which encodings are available in the JSON are encoded in the same manner as the full-override case.

We show a sample JSON for a network with 3 Conv2d ops. The first and third Conv2d ops are INT8 while the second Conv2d op is marked as FP32. Since the FP32 op (namely conv2_1) is sandwiched between two INT8 ops in “activation_encodings”, convert ops will be inserted before and after it. The corresponding weights and biases for conv2_1 are also marked as floating point in the “param_encodings” section of the JSON.

{
   "activation_encodings": {
       "data_0": [
           {
               "bitwidth": 8,
               "dtype": "int"
           }
       ],
       "conv1_1": [
           {
               "bitwidth": 8,
               "dtype": "int"
           }
       ],
       "conv2_1": [
           {
               "bitwidth": 32,
               "dtype": "float"
           }
       ],
       "conv3_1": [
           {
               "bitwidth": 8,
               "dtype": "int"
           }
       ]
   },
   "param_encodings": {
       "conv1_w_0": [
           {
               "bitwidth": 8,
               "dtype": "int"
           }
       ],
       "conv1_b_0": [
           {
               "bitwidth": 8,
               "dtype": "int"
           }
       ],
       "conv2_w_0": [
           {
               "bitwidth": 32,
               "dtype": "float"
           }
       ],
       "conv2_b_0": [
           {
               "bitwidth": 32,
               "dtype": "float"
           }
       ],
       "conv3_w_0": [
           {
               "bitwidth": 8,
               "dtype": "int"
           }
       ],
       "conv3_b_0": [
           {
               "bitwidth": 8,
               "dtype": "int"
           }
       ]
   }
}

The ops that are not present in the JSON will be assumed to be fixed point, and the bit widths will be selected according to --act_bw/--weight_bw/--bias_bw, respectively.

{
   "activation_encodings": {
       "conv2_1": [
           {
               "bitwidth": 32,
               "dtype": "float"
           }
       ]
   },
   "param_encodings": {
       "conv2_w_0": [
           {
               "bitwidth": 32,
               "dtype": "float"
           }
       ],
       "conv2_b_0": [
           {
               "bitwidth": 32,
               "dtype": "float"
           }
       ]
   },
   "version": "0.5.0"
}

The following quantized mixed-precision graph will be generated based on the JSON shown above. Please note that the convert operations are added appropriately to convert between float and int types and vice-versa.

[Figure: qnn_quantization_mp_graph.png]

qairt-quantizer

The qairt-converter tool converts non-quantized models into a non-quantized or quantized DLC file, depending on the overrides provided during the converter step. qairt-quantizer can then be used to quantize any tensors that are missing encodings after the qairt-converter step (filling in the gaps), or to calibrate the provided encodings using a list of images. The qairt-quantizer tool quantizes the model to one of the supported fixed point formats.

For example, the following command will convert an Inception v3 DLC file into a quantized Inception v3 DLC file.

$ qairt-quantizer --input_dlc inception_v3.dlc \
                  --input_list image_file_list.txt \
                  --output_dlc inception_v3_quantized.dlc

To properly calculate the ranges for the quantization parameters, a representative set of input data needs to be passed to qairt-quantizer using the --input_list parameter. The --input_list specifies paths to raw image files to be used for calibration during quantization. For details, refer to the --input_list argument of qnn-net-run for supported input formats (to calculate output activation encoding information for all layers, do not include the line which specifies desired outputs).

The tool requires the batch dimension of the DLC input file to be set to 1 during model conversion. The batch dimension can be changed to a different value for inference, by resizing the network during initialization.

Additional details

  • qairt-quantizer is largely similar to snpe-dlc-quant, with the following differences:

    • qairt-quantizer can now be used to generate encodings using calibration dataset provided via the --input_list flag for the tensors for the following scenarios:

      • Fill in the gaps: if encodings were not specified for all tensors during the qairt-converter step, i.e. for tensors with no override in --quantization_overrides and no source model encodings (QAT).

    • HTP is set as the default backend in the QAIRT quantizer, which may enable certain HTP-specific behaviors that wouldn’t be triggered by default in legacy quantizers where the backend is left empty. This difference can affect how some backend-dependent features behave during conversion/quantization.

      • For example, during quantization, an optimization called IntBiasUpdates is applied to the FullyConnected op if the backend is set to HTP in SNPE, whereas it is always applied in QAIRT.

    • The external overrides and source model encodings (QAT) are now applied during the qairt-converter stage by default. The quantizer options to ignore the overrides and source model encodings, --ignore_encodings (legacy) and --ignore_quantization_overrides, are therefore now no-ops.

    • An alternative is the --export_format=DLC_STRIP_QUANT flag of qairt-converter; when specified, the converter will ignore/remove all encodings in the source model and output a float model, which can then be recalibrated using qairt-quantizer and the --input_list flag.

    • Another alternative for using this feature is the qairt-quantizer options --input_list and --ignore_quantization_overrides in combination, which signals the quantizer to ignore all encodings applied during conversion and generate encodings using the calibration dataset provided via --input_list.

    • The float fallback feature, controlled via the command-line option --enable_float_fallback (present as --float_fallback in legacy quantizers), is also a no-op for qairt-quantizer and can be skipped. Float fallback was added to produce a fully quantized or mixed precision graph by applying encoding overrides or source model encodings, propagating encodings across data-invariant ops, and falling back to the float datatype for tensors missing encodings. To simplify the steps, this is now handled during qairt-converter: qairt-converter applies the overrides and encodings, and tensors that are missing encodings fall back to the default float datatype.

    • To summarize, the qairt-quantizer command-line arguments --ignore_quantization_overrides and --enable_float_fallback are now no-ops; their behavior is applied by default during the qairt-converter step itself.

      Note

      --enable_float_fallback and --input_list are mutually exclusive options; one of them is a mandatory argument for the quantizer.

  • Outputs can be specified for qairt-quantizer by modifying the input_list in the following ways:

    #<output_layer_name>[<space><output_layer_name>]
    %<output_tensor_name>[<space><output_tensor_name>]
    <input_layer_name>:=<input_layer_path>[<space><input_layer_name>:=<input_layer_path>]
    

    Note: Output tensors and layers can be specified individually, but when specifying both, the order shown above must be used.
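Combining the notations above, a hypothetical input_list that requests a layer output and a tensor output in addition to listing calibration inputs might look like (names and paths are illustrative, borrowed from the mixed-precision example earlier):

```
#conv3_1
%conv3_1:0
data_0:=/path/to/img0.raw
data_0:=/path/to/img1.raw
```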

  • qairt-quantizer also supports quantization using AIMET, in place of the default quantizer, when the --use_aimet_quantizer command line option is provided. To use the AIMET quantizer, run the setup script to create an AIMET-specific environment by executing the following command:

    $ source {QNN_SDK_ROOT}/bin/aimet_env_setup.sh --env_path <path where AIMET venv needs to be created> \
                                                   --aimet_sdk_tar <AIMET Torch SDK tarball>
    
  • Advanced AIMET algorithms (AdaRound, AMP, and AutoQuant) are also supported in qairt-quantizer. The user needs to provide a YAML config file through the command line option --config and specify the algorithm (“adaround”, “amp”, or “autoquant”) through --apply_algorithms, along with the --use_aimet_quantizer flag.

  • AdaRound

  • The template for the YAML config file for AdaRound is shown below:

datasets:
    <dataset_name>:
        dataloader_callback: '<path/to/unlabeled/dataloader/callback/function>'
        dataloader_kwargs: {arg1: val, arg2: val2}

adaround:
    dataset: <dataset_name>
    num_batches: 1
  • The required arguments for AdaRound are specified below.
    • dataloader_callback is used to set the path of a callback function which returns an unlabeled dataloader of type torch.utils.data.DataLoader. The data should be in the source network input format.

    • dataloader_kwargs is an optional dictionary through which the user can provide keyword arguments of the above defined callback function.

    • dataset is used to specify the name of the dataset that has been defined above.

    • num_batches is used to specify the number of batches to be used for adaround iteration.

  • Other than the required arguments above, there are a few optional arguments that have default values; the user can specify non-default values through optional_adaround_args in the config file as a keyword dictionary. The supported optional arguments are specified below.

    • default_param_bw: [int] Default bitwidth (4-31) to use for quantizing layer parameters

    • param_bw_override_list: [List of list] Each list is a module and the corresponding parameter bitwidth to be used for that module.

    • ignore_quant_ops_list: [List of str] Ops listed here are skipped during the quantization needed for AdaRounding. Do not specify Conv and Linear modules in this list; doing so will affect accuracy.

    • default_quant_scheme: [str] Quantization scheme. Supported options are post_training_tf or post_training_tf_enhanced

    • default_config_file: [str] Default configuration file path for model quantizers

  • AdaRound can also run in default mode, without a config file, by just passing “adaround” to the command line option --apply_algorithms along with the --use_aimet_quantizer flag. This flow uses the data provided through the input_list option to make rounding decisions.

  • AMP

  • The template for the YAML config file for AMP is shown below:

datasets:
    <dataset_name>:
        dataloader_callback: '<path/to/unlabeled/dataloader/callback/function>'
        dataloader_kwargs: {arg1: val, arg2: val2}

amp:
    dataset: <dataset_name>
    candidates:  [[[8, 'int'], [16, 'int']], [[16, 'float'], [16, 'float']]]
    allowed_accuracy_drop: 0.02
    eval_callback_for_phase2: '<path/to/evaluator/callback/function>'
  • The required arguments for AMP are specified below.
    • dataloader_callback is used to set the path of a callback function which returns a labeled dataloader of type torch.utils.data.DataLoader. The data should be in source network input format.

    • dataloader_kwargs is an optional dictionary through which the user can provide keyword arguments for the callback function defined above.

    • dataset is used to specify the name of the dataset that has been defined above.

    • candidates is a list of lists specifying all possible bitwidth/datatype pairs for activations and parameters.

    • allowed_accuracy_drop is used to specify the maximum allowed drop in accuracy from the FP32 baseline. The Pareto front curve is plotted only up to the point where the allowed accuracy drop is met.

    • eval_callback_for_phase2 is used to set the path of the evaluator function, which takes a predicted batch as the first argument and a ground-truth batch as the second argument and returns the calculated metric as a float value.

    Sample eval callback function for computing top-k accuracy metrics:

    def accuracy(output, target):
        """Computes the accuracy over the k top predictions for the specified values of k"""
        topk = (1,)
        maxk = max(topk)
        batch_size = target.size(0)

        # Indices of the top-k predictions for each sample
        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))

        return res
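Similarly, the labeled dataloader_callback for AMP could be sketched as follows. The module name, function name, and tensor shapes are hypothetical; real evaluation data should be used in practice.

```python
# Hypothetical module (e.g. my_dataloaders.py), referenced in the YAML config as:
#   dataloader_callback: 'my_dataloaders.get_labeled_loader'
import torch
from torch.utils.data import DataLoader, TensorDataset

def get_labeled_loader(batch_size=4, num_samples=16, num_classes=10):
    """Return a labeled dataloader yielding (input, target) pairs in the
    source network's input format."""
    inputs = torch.randn(num_samples, 3, 224, 224)           # placeholder inputs
    targets = torch.randint(0, num_classes, (num_samples,))  # placeholder labels
    return DataLoader(TensorDataset(inputs, targets), batch_size=batch_size)
```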
    
  • In addition to the required arguments above, there are a few optional arguments with preset default values; the user can specify non-default values through optional_amp_args in the amp section as a keyword dictionary. The supported optional arguments are listed below.

    • eval_callback_for_phase1: [str] Path of the eval function, which takes only the model as an argument and returns the calculated metric as a float value. This function is used to measure the sensitivity of each quantizer group during phase 1, which involves finding the accuracy list/sensitivity of each module; therefore, a user might want to run phase 1 with a smaller dataset.

    • clean_start: [bool] If true, any cached information from previous runs will be deleted prior to starting the mixed-precision analysis. If false, prior cached information will be used if applicable.

    • forward_pass_callback: [str] The path of a function which takes only the model as an argument and runs the forward pass on it.

    • use_all_amp_candidates: [bool] Using the “supported_kernels” field in the config file (under defaults and op_type sections), a list of supported candidates can be specified. All the AMP candidates which are passed through the “candidates” field may not be supported based on the data passed through “supported_kernels”. When the field “use_all_amp_candidates” is set to True, the AMP algorithm will ignore the “supported_kernels” in the config file and continue to use all candidates.

    • phase2_reverse: [bool] If set to True, phase 1 of the AMP algorithm (calculating the accuracy list) is unchanged, but phase 2 (generating the Pareto list) is reversed: the algorithm starts with all quantizer groups at the lowest candidate and moves nodes to higher candidates one by one until the target accuracy is met.

    • amp_search_algo: [str] Defines the search algorithm to be used for phase 2 of AMP. Supported algorithms are Binary, Interpolation, and BruteForce.
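Putting this together, an amp section that overrides some optional arguments might look like the sketch below; the values shown are illustrative, not recommendations.

```yaml
amp:
    dataset: <dataset_name>
    candidates: [[[8, 'int'], [16, 'int']], [[16, 'float'], [16, 'float']]]
    allowed_accuracy_drop: 0.02
    eval_callback_for_phase2: '<path/to/evaluator/callback/function>'
    optional_amp_args:
        clean_start: True
        amp_search_algo: 'Binary'
```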

  • AutoQuant

  • The template for the YAML config file for AutoQuant is shown below:

datasets:
    <dataset_name>:
        dataloader_callback: '<path/to/unlabeled/dataloader/callback/function>'
        dataloader_kwargs: {arg1: val, arg2: val2}
    <eval_dataset_name>:
        dataloader_callback: '<path/to/labeled/dataloader/callback/function>'
        dataloader_kwargs: {arg1: val, arg2: val2}

autoquant:
    dataset: <dataset_name>
    eval_callback: "qti.aisw.converters.aimet.aimet_utils.accuracy"
    eval_dataset: <eval_dataset_name>
    allowed_accuracy_drop: 0.07
    amp_candidates: [[[16,'int'],[16,'int']], [[16,'int'],[8,'int']], [[8,'int'],[16,'int']], [[8,'int'],[8,'int']]]
  • The required arguments for AutoQuant are specified below.
    • dataloader_callback is used to set the path of a callback function which returns an unlabeled dataloader of type torch.utils.data.DataLoader. The data should be in source network input format.

    • dataloader_kwargs is an optional dictionary through which the user can provide keyword arguments for the callback function defined above.

    • dataset is used to specify the name of the dataset that has been defined above.

    • eval_callback is used to set the path of the evaluator function, which takes a predicted batch as the first argument and a ground-truth batch as the second argument and returns the calculated metric as a float value.

    • eval_dataset is used to specify the name of the labeled dataset defined above, which is used for model evaluation.

    • allowed_accuracy_drop is used to specify the maximum allowed drop in accuracy from FP32 baseline.

    • amp_candidates is a list of lists specifying all possible bitwidth/datatype pairs for activations and parameters.

  • In addition to the required arguments above, there are a few optional arguments with preset default values; the user can specify non-default values through optional_autoquant_args in the autoquant section as a keyword dictionary.

    • The supported optional arguments for optional_autoquant_args are specified below.
      • param_bw: [int] Parameter bitwidth.

      • output_bw: [int] Output bitwidth.

      • quant_scheme: [str] Quantization scheme

      • rounding_mode: [str] Rounding mode

      • config_file: [str] Path to configuration file for model quantizers

      • cache_id: [str] ID associated with cache results

      • strict_validation: [bool] Flag set to True by default. If False, AutoQuant will proceed with execution and handle errors internally when possible. This may produce unideal or unintuitive results.

    • The supported optional arguments for optional_amp_args are specified below.
      • num_samples_for_phase_1: [int] Number of samples to be used for performance evaluation in AMP phase 1

      • forward_fn: [Callable] Callback function that performs a forward pass given a model and inputs yielded from the data loader. The function expects the model as the first argument and inputs to the model as the second argument.

      • num_samples_for_phase_2: [int] Number of samples to be used for performance evaluation in AMP phase 2

    • In AutoQuant, num_batches is supported as an optional argument, so a non-default value can be provided through optional_adaround_args, along with the other optional AdaRound arguments, in the autoquant section as a keyword dictionary. The argument descriptions can be found above in the AdaRound algorithm section.
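For example, an autoquant section that overrides optional arguments from the different groups described above might look like this sketch; the values are illustrative only.

```yaml
autoquant:
    dataset: <dataset_name>
    eval_callback: "qti.aisw.converters.aimet.aimet_utils.accuracy"
    eval_dataset: <eval_dataset_name>
    allowed_accuracy_drop: 0.07
    amp_candidates: [[[16,'int'],[16,'int']], [[8,'int'],[8,'int']]]
    optional_autoquant_args:
        param_bw: 8
        output_bw: 8
    optional_adaround_args:
        num_batches: 4
```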

Note:
  1. AIMET Torch Tarball naming convention should be as follows - aimetpro-release-<VERSION (optionally with build ID)>.torch-<cpu/gpu>-.*.tar.gz. For example, aimetpro-release-x.xx.x.torch-xxx-release.tar.gz.

  2. Once the setup script is run, ensure that AIMET_ENV_PYTHON environment variable is set to <AIMET virtual environment path>/bin/python

  3. The minimum supported AIMET version is AIMET-1.34.0.