
.. #=============================================================================
   #
   #  Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
   #  All rights reserved.
   #  Confidential and Proprietary - Qualcomm Technologies, Inc.
   #
   #=============================================================================

=========================
Qualcomm AI Runtime SDK
=========================

Qualcomm AI Runtime SDK is also referred to as Qualcomm Neural Network (QNN) in the source code and documentation.
Qualcomm AI Runtime SDK is a software development kit (SDK) for building AI-based applications.
It provides tools and extensible per-accelerator libraries with a uniform API,
enabling flexible integration and efficient execution of machine/deep learning networks on Qualcomm chipsets.

Contents
--------

- Converter tools to translate and optionally quantize source networks into a sequence of QNN API calls
- Per-accelerator backend libraries implementing the QNN API
- OpPackage based backend extensibility
- Test tools to exercise backend libraries and converted networks
- Sample applications, OpPackage examples
- QNN SDK Reference Guide

Dependencies
------------

Point your web browser to ${QNN_SDK_ROOT}/docs/QNN/general/setup.html

=============
Release Notes
=============


2.39.0
======

**9/30/2025**

QNN API version: v2.29.0

Changelog
---------

Features
~~~~~~~~
* API:
*   Genie:
*     Added the `GenieDialog_embeddingTokenQuery` API. {148803}
*     Added the `GenieDialog_setMaxNumTokens` API. {146820}
*   HTP:
*     Added a new HTP-specific property to support a detachable buffers feature. {148227}
*     Enhanced profiling capabilities to expose detailed timing information for each component during the graph preparation phase (`QnnGraph_finalize`). {143804}
*     Implemented a feature allowing read-only weights buffers to be detached and unmapped. {141354}
*     Introduced new APIs and configuration options to support a detachable buffers feature. {143832}
*   SNPE:
*     Added a new builder API for enabling accelerated HTP initialization with a pre-prepared cache: Snpe_SNPEBuilder_SetAcceleratedInit() / SNPEBuilder::setAcceleratedInit(). Support was also added to snpe-net-run, snpe-throughput-net-run, and snpe-parallel-run via the command-line argument --enable_htp_accelerated_init. {149873}
* Docs:
*   Updated documentation for `qairt-accuracy-debugger` to include support for the Windows on Snapdragon (WoS) platform, including updated help sections and sample commands. {149286}
*   Updated the LPAI documentation to include a summary of the required steps for model preparation. {142076}
* Genie:
*   Added new profiling option for collecting detailed trace events. {133638}
*   Added the `GENIE_STATUS_ERROR_CONTEXT_EXCEEDED` error code to provide a specific status when a prompt exceeds the model's context length limit. {145721}
* HTP:
*   Added support for multi-graph switching, which allows multiple graphs to be loaded and retained in memory simultaneously. {139603}
*   Added support for several operator fusion patterns on the HTP backend, including combinations like Conv-Relu and Conv-Batchnorm-HardSwish. {125633}
*   Added support for the BFloat16 data type by including the necessary header and definitions in the HTP backend. {140994}
*   Minor performance improvement for benchmark models. {147751}
* LPAI:
*   Fixed an issue where the quantization process would incorrectly modify the offset specified in a `quant.json` file. {145916}
*   Resolved an accuracy issue with audio context detection models on the LPAI backend. The issue was caused by incorrect bias quantization settings for convolution and GEMM operations. {146710}
* Op:
*   GPU:
*     Added support for QNN_DATATYPE_INT_32 inputs to StridedSlice op. {142629}
*   HTP:
*     Added support for 6D variants of Cast, GatherElements, Pad, and StridedSlice with certain constraints. For GatherElements, input and index shapes must match except along the axis dimension. For Pad, padding is limited to dimensions 5D or smaller. For StridedSlice, slicing is limited to dimensions 5D or smaller, and some axis parameters are not supported. {147157}
*     Enabled support for the SFIXED_POINT_16 data type for the Sqrt Op in QNN HTP Op validation flow. {142710}
* OpDef:
*   Added support for the `RandomUniformLike` Op. This includes the ONNX to QNN IR translation in the converter and the backend implementation. {138616}
*   Updated the NonZero Op definition to clarify that it outputs -1 for padded values in static shapes. Also updated Gather and Scatter Ops to restrict index tensors to non-negative values, allowing -1 only as a sentinel value for indices generated from other Ops. {142505}
* QNN:
*   TFLite Delegate: Added support for the Broadcast_to Op. {149782}
* Tool:
*   Added native support for WoS to the Accuracy Evaluator tool. This includes updates to handle platform-specific file paths and resolves a file permission error in the SQuAD evaluation script on Windows. {136566}
*   Added support for multi-graph switching in `qnn-net-run` and `qnn-throughput-net-run` via the new custom configuration option `graphs_retention_order`. {145979}
*   Enabled support for the Windows on Snapdragon (WoS) platform in the accuracy debugger. Users can now debug models on WoS using both the CLI and Python API interfaces. {147963}
*   Converter:
*     Added reference implementations for static tensor manipulation Ops, including `Add`, `Mul`, `Sub`, `Div`, `Transpose`, and `Reshape`. {133602}
*     Fixed a segmentation fault in `qairt-converter` that occurred during float fallback for models with external data. {147000}
*     Fixed an issue where FP16 constant tensors were not correctly interpreted at the Python layer. {147009}
*     Introduced new flags to provide fine-grained control over the IR optimizer passes. {135982}
*     Removed exception handling for 6D tensors in the converter. {144599}
*     RMSNorm node names now use either the common prefix of all matched nodes in the pattern or, if no common prefix exists, the output buffer name of the pattern. This replaces the previous rms_norm_i naming based on topological order. {146838}
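
The RMSNorm naming rule in the last Converter entry can be sketched in a few lines of Python. This is an illustration only, not the converter's implementation: the function name ``rms_norm_node_name`` is hypothetical, and the sketch assumes that "common prefix" means a plain character-wise prefix of the matched node names.

```python
import os


def rms_norm_node_name(matched_node_names, output_buffer_name):
    """Pick a name for a fused RMSNorm node (hypothetical helper)."""
    # Assumption: the common prefix is computed character by character.
    prefix = os.path.commonprefix(list(matched_node_names))
    # Fall back to the pattern's output buffer name when nothing is shared.
    return prefix if prefix else output_buffer_name


# Nodes that share a scope keep that scope as the fused node's name.
print(rms_norm_node_name(["/layers.0/norm/Pow", "/layers.0/norm/Mul"], "norm_out"))
# Nodes with nothing in common fall back to the output buffer name.
print(rms_norm_node_name(["Pow_12", "Mul_14"], "norm_out"))
```

Unlike the earlier rms_norm_i scheme, a name derived this way does not depend on the topological position of the pattern in the graph.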

Bugs
~~~~
* API:
*   HTA:
*     Resolved an application crash that occurred when calling the QNN API to get the HTA device infrastructure for performance tuning. {146157}
* DLC:
*   Fixed issues within the DLC format when per-channel block quantization is employed on a multi-graph DLC. {138853}
* Genie:
*   Fixed an accuracy bug with cross-layer attention networks when the decoder block is a single context binary. {150908}
*   Fixed an issue that caused incorrect calculation of KV cache tensor sizes on the HTP backend, which could lead to segmentation faults. {148675}
*   Fixed an issue where no output was generated for certain models when the prompt prefill phase required multiple graph executions. {145896}
* GPU:
*   Improved performance by updating heuristics for Pooling and Reduction Ops to better utilize hardware resources, addressing inference time regressions on some models. {147242}
* HTP:
*   Enabled support for using the `ScatterElements` Op within LoRA-updatable models. {147845}
*   Fixed a checksum mismatch error that could occur during graph finalization for models using LoRA. {147901}
*   Fixed a crash that could occur during long-running stress tests involving VTCM sharing. {148064}
*   Fixed a graph finalization failure by adjusting the optimization pass order for certain Ops like Split and Unpack. {141064}
*   Fixed a memory leak that occurred in the HTP backend during repeated inference runs when performance profiling was enabled. {146627}
*   Fixed an issue preventing context binary generation for models using LoRA adapters where a MatMul operation of size 16x16 was present. {149711}
*   Fixed an issue that caused graph finalization failures for certain large models on specific SoCs. {147402}
*   Fixed an issue that caused incorrect error code translation when writing shared weight buffers. {147793}
*   Fixed an issue where applying a LoRA adapter binary would fail for multicore scenarios or float-precision graphs. {149995}
*   Fixed an issue where requesting a signed PD would fail on x86 simulation environments. The configuration is now ignored for x86, as it makes no difference in that context. {145651}
*   Fixed an occasional VTCM memory allocation error that could occur during context binary generation. {145879}
*   Fixed an Op package deregistration failure that could occur in specific multi-core use cases. {143977}
*   Optimized performance for a text encoder model by successfully applying MHA-to-SHA transformations, converting MatMuls to Convolutions, and ensuring correct quantization settings. {136947}
*   Resolved a failure in on-device context binary generation when using custom Ops. {147187}
*   Resolved an error where applying a LoRA adapter failed with the message "Apply cannot happen as context bin did not have serialized bin." {149992}
*   Resolved an issue where using Op packages in multi-threaded applications could cause a `QNN_OP_PACKAGE_ERROR_LIBRARY_ALREADY_INITIALIZED` error, halting execution. {147431}
*   Resolved memory leaks observed under specific stress scenarios. {145181}
* LPAI:
*   Fixed an issue that caused the ADSP driver to fail to load on certain Windows on Snapdragon platforms. {149188}
* Op:
*   CPU:
*     Fixed the Mod Op to align its calculation with the behavior of standard frameworks. {147060}
*     Resolved an issue that caused model failures on the CPU backend when a quantized Div Op encountered a zero-valued divisor. {150630}
* SDK:
*   Optimized specific library functions on Windows by replacing parts of the C++ standard library with native Windows API calls, reducing the overall binary size. {150497}
* SNPE:
*   DSP:
*     Resolved an issue where executing a model with a UDO package on the DSP backend could fail with a `QNN_OP_PACKAGE_ERROR_LIBRARY_ALREADY_INITIALIZED` error. {135967}
* Tool:
*   Fixed an input parsing issue in the ModelModifierArchChecker tool. {144884}
*   Resolved an issue where `qnn-accuracy-debugger` would fail with a `FileNotFoundError` when using a compiled model (`--stage compiled`). {149891}
*   Compiler:
*     Fixed an issue in the context binary generator where a SpaceToDepth Op adjacent to a graph input could cause an error. {147548}
*   Converter:
*     Enabled support for dynamic 16-bit weights by default in `qairt-converter` and `qairt-quantizer`. This resolves an issue where an unnecessary `Convert` Op was inserted for `MatMul` weights, which previously led to increased model size and reduced accuracy. A new `--disable_dynamic_16_bit_weights` flag has been added to revert to 8-bit conversion if needed. {147008}
*     Fixed a bug in the quantizer where node-squashing logic could fail for nodes that were both a graph output and had inputs with multiple consumers. {136028}
*     Fixed a bug that could cause a 'Duplicate buffer name' error during certain graph optimizations. {145690}
*     Fixed a fatal "access violation" exception that occurred when running the ONNX converter on WoS devices. {149750}
*     Fixed an issue with generating quantization encodings for models containing LSTM or GRU layers. {146424}
*     Fixed an issue with handling dynamic inputs for the slope tensor in the PReLU Op. {145599}
*     Fixed an issue with the LoRA model conversion flow where certain graph optimization passes were not being applied consistently. {150868}
*     Fixed incorrect weight broadcasting behavior in the RMSNorm and LayerNorm fusion patterns within the ONNX converter. {124105}
*     Resolved an issue where certain graph optimizations could incorrectly remove a tensor that was also a graph output. {150933}
*   qairt-tool:
*     Added support for the Clip, SpaceToDepth, and Relu Ops in mha2sha-v2. {149759}
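
The CPU Mod Op entry in this list does not say which remainder convention changed. As background only, the two conventions in common use differ for negative operands, as this standard-library Python sketch shows. The helper names are hypothetical, and the sketch makes no claim about which convention the CPU backend follows.

```python
import math


def floor_mod(a, b):
    # The result takes the sign of the divisor (Python's % operator).
    return a % b


def trunc_mod(a, b):
    # The result takes the sign of the dividend (C-style fmod).
    return math.fmod(a, b)


# The two conventions agree for positive operands and differ for negative ones.
print(floor_mod(7, 3), trunc_mod(7, 3))
print(floor_mod(-7, 3), trunc_mod(-7, 3))
```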

Known Issues
~~~~~~~~~~~~
* Models with very large buffers (~1 GB or more) can abort during execution with "Could not create context from binary" due to FastRPC mapping failures {148198}



2.38.0
======

**9/02/2025**

QNN API version: v2.28.1

Changelog
---------

Features
~~~~~~~~
* API:
*   Generalized the `qairt.transform` API to support multiple, interchangeable transformation implementations. {138775}
*   Genie:
*     Added GENIE_NODE_IMAGE_ENCODER_IMAGE_FULL_ATTN_MASK and GENIE_NODE_IMAGE_ENCODER_IMAGE_WINDOW_ATTN_MASK node inputs. {145051}
*   GPU:
*     Added support for the `QNN_GPU_PRECISION_USER_PROVIDED` precision mode to the GPU backend extension API, allowing users to specify custom precision settings for a graph. {142096}
* Genie:
*   Added a source code example for genie-t2e-run to the SDK. {144427}
*   Added embeddingQuery support for offline embeddings in genie-app. {146044}
*   Added engine sharing support for models used across different dialogs, currently available for the HTP backend and applicable to basic and SSD dialogs. {147585}
*   Added support for encoder-decoder models in Gen AI Transformer. {136070}
* HTP:
*   Improved performance and reduced memory usage for certain vision models by removing redundant `space_rearrange` operations from the graph. {141570}
*   Removed the `-ffast-math` compiler flag from the build configuration to prevent potential numerical inconsistencies and improve accuracy alignment for floating-point operations. {139547}
* Op:
*   CPU:
*     Added support for the Logit Op. {136656}
*   GPU:
*     Added support for `INT32` data type inputs to the `ArgMax` Op on the GPU backend. {133989}
*     Added support for the CumulativeSum Op. {38682}
*   HTP:
*     Added backend support for the `STFT` Op. {134956}
*     Added documentation for dynamic dimension constraints in HTP Op definitions. {143878}
*     Added support for Int32 ElementWiseAbs and ElementWiseUnary with Abs operation. {138856}
*     Added support for signed int16 data type in Unpack Op validation. {142708}
*     Enabled support for the 5D Cast Op. {143121}
*     Enabled support for the 5D GatherElements Op with non-zero axis values. {143123}
*     Enabled support for the 5D Pad Op with a constant padding scheme for FP16 and FP32 data types. {143122}
* OpDef:
*   Added Op definition for STFT {134955}
*   Added support for `int32` and `UFIXEDPOINT8` data types for the `RandomUniformLike` Op. {146810}
* QNN:
*   TFLite Delegate: Added support for the Broadcast_to Op. {138848}
*   HTP:
*     Enabled the Quantize and Dequantize Ops between FP32 and QINT16 in the Op validator. {141056}
* SDK:
*   Added a new `RandomUniformLike` Op definition and reference implementation to align with the ONNX specification. {134859}
*   Enhanced OEM control over QNN priority levels, allowing more flexible configuration of graph execution priorities on HTP backend. {126262}
* SNPE:
*   Added documentation for low-level performance APIs under "Tutorials and Examples", "Application Tips" {145899}
* Tool:
*   Introduced a new Network Specialization module and API to programmatically convert and optimize models with multiple graph configurations into a single DLC file. This replaces the previous command-line-only workflow. {108571}
*   Added the ability to debug a specific subgraph by introducing two new command-line options: `--debug_subgraph_inputs` and `--debug_subgraph_outputs`. These options allow specifying the input and output tensors that define the subgraph to be analyzed. {127762}
*   Converter:
*     Added support for the `buffer_padding` parameter in the Buffer Op. {128998}
*     Added support for the `STFT` Op in the ONNX converter. {138613}
*     Added support for the Logit Op. {138107}
*     Added support for the ONNX RandomUniformLike Op. {134348}
*     Added support for the ONNX STFT Op. {134349}
*     Enhanced the converter to automatically apply a float-fallback quantization behavior for models that contain Quantize-Dequantize nodes or are provided with quantization overrides (e.g., for LoRA). {139341}
*     Released the first version (v0.1) of the QAIRT Quantization Specification, which supports schema version 2.0.0 of the quantization overrides file. {114160}
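
Several entries in this list add the Logit Op. For reference, the usual mathematical definition is logit(p) = ln(p / (1 - p)), the inverse of the sigmoid function. The sketch below assumes that definition; it is not taken from the QNN Op definition, and any additional parameters that definition may have are not modelled here.

```python
import math


def logit(p):
    """Standard logit for 0 < p < 1 (assumed definition, not the QNN OpDef)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Logistic sigmoid, the inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


# logit maps the midpoint 0.5 to zero, and sigmoid undoes logit.
print(logit(0.5))
print(sigmoid(logit(0.25)))
```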

Bugs
~~~~
* DSP:
*   Significantly improved performance for models with a batch size greater than one by optimizing the 5D Reshape-Transpose-Gather pattern in the backend. {140837}
* Genie:
*   Added the missing 'type' field to the sampler.json configuration example. {138004}
*   Fixed a regression in Eaglet token generation rate. {145608}
*   Fixed a segmentation fault caused by uninitialized variables. {144692}
*   Fixed a segmentation fault that occurred when running LLM models with the `genie-t2t-run` tool. {147760}
*   Fixed an issue loading lm_head or LoRA adapters on Windows platforms. {143661}
*   Fixed an issue where paused queries with LUT encoder models could not resume. {145135}
*   Fixed an issue where prompt templates were not applied when GenieEmbedding_generate outputs were truncated. {143445}
*   Fixed memory leaks occurring during GenieDialog_applyLora. {136542}
* GPU:
*   Improved inference performance for select models in GPU FP16 mode on certain chipsets. {144204}
* HTP:
*   Added support for casting from `uint8` to `fp16` to resolve an accuracy issue where `uint8` was incorrectly interpreted during a cast to a float type. {135317}
*   Enabled support for asynchronous context initialization in multi-core environments. {138427}
*   Fixed a memory corruption crash that could occur in multi-threaded applications during deinitialization. {144587}
*   Fixed a segmentation fault that occurred when using asynchronous initialization on multi-core HTP configurations. {138335}
*   Fixed an accuracy issue that produced incorrect output when using LPBQ. {146380}
*   Fixed an issue where models would crash or hang on the HTP backend when the inference batch size was greater than one. {144574}
*   Fixed an issue where the `deviceGetPlatformInfo` API returned incorrect SoC information when using the non-RPC path. {141569}
*   Implemented a fix to prevent a CDSP crash when Virtual Address space is exhausted during memory allocation. {145909}
*   Resolved an intermittent failure in asynchronous execution mode that could lead to errors. {138318}
*   Resolved an issue on certain platforms where a failure to lock the HMX context could cause a DMA execution failure. {138289}
*   Resolved execution failures for certain models in Gen AI corner cases. {129730}
*   Significantly improved performance for models using grouped `TransposeConv2d` by enabling an optimization that was previously restricted to operations with zero padding. {143544}
* Op:
*   HTP:
*     Added support for FP32 weight-only quantization in fully connected layers. {131398}
*     Fixed a crash in PoolAvg2d Op when reducing NxM inputs to 1x1 with padding and count_pad=0. {131311}
*     Fixed a crash occurring during GroupNorm fusion. {130501}
*     Fixed a runtime failure during context creation when a `spill_fill_buffer` was configured. {143863}
*     Fixed an accuracy issue in ElementWiseAdd Op when broadcasting a constant zero. {143254}
*     Fixed an accuracy issue in FP16 models caused by a faulty `SlicePad_shape->Transpose` graph optimization rule. {145638}
*     Fixed NullRequant Op registration failure when using w16 and per-channel quantization. {145523}
*     Improved performance of the `ReduceSum` Op for FP16 data types by ensuring a faster, optimized implementation is used. {143158}
*     Resolved a performance regression affecting model execution. {145191}
*     Resolved accuracy issue in Gather Op for depth=1 cases. {134448}
*     Resolved performance regressions for select models. {143809}
* SNPE:
*   Added support for the --optimization_preset option in snpe-dlc-graph-prepare and enabled online preparation via platform options. {135223}
*   Fixed an issue where setting HTP graph optimization levels in online preparation did not support distinct optimization levels for different SNPE instances. {142940}
*   The snpe-dlc-info tool now displays input, output, and unconsumed tensors in topologically sorted order. {146793}
* Tool:
*   Fixed an accuracy regression that could occur in certain models due to an incorrect start index calculation in a transpose operation. {144858}
*   Fixed an issue where block quantized convolution with special dimensions could cause preparation failures. {144994}
*   Resolved an issue where `snpe-parallel-run-cpp` would crash when used with the `--userbuffer_memorymapped` argument. {119102}
*   Converter:
*     Fixed a bug in Expand Op translation caused by incorrect data type population. {141810}
*     Fixed a bug in sink_transpose optimization where a transpose node could be consumed twice by the same node. {140535}
*     Fixed a bug that introduced redundant Convert nodes before LSTM/GRU nodes during mixed precision conversion. {145617}
*     Fixed an axis tracking issue in ONNX PRelu Op that could cause incorrect broadcasting. {142728}
*     Fixed an issue where 0D tensors were incorrectly retained as 1D tensors by propagating scalar tensor information as needed. {141899}
*     Fixed an issue where models with extremely small, near-zero quantization scale values (e.g., 1e-35) would fail during inference on the CPU backend. {127367}
*     Fixed an issue where the --float_bitwidth option could incorrectly update non-quantizable tensors. {145723}
*     Fixed an issue where the second input tensor of MatMul nodes from QDQ models was not correctly quantized. {136049}
*     Fixed an issue with encoding population in LayerNorm pattern matching. {141265}
*     Fixed issue where squashable elementwise operations following convolution operations caused errors when encodings of the convolution’s weights/bias were provided. {85485}
*     Improved validation in Resize optimization to prevent errors when invalid scale values are provided. {138778}
*     Resolved a model conversion failure for large ONNX models caused by excessive memory consumption. {122217}
*     Resolved an issue where recent updates to the model converter caused excessive memory consumption during graph serialization, leading to failures when creating context binaries for large models. {136952}
*     Squashed identity Expand and Tile nodes in the graph to remove redundant operations. {144693}
*     Updated the logic for matching RmsNorm patterns to improve pattern recognition. {146093}



2.37.0
======

**7/31/2025**

QNN API version: v2.28.0

Changelog
---------

Features
~~~~~~~~
* Updated the QNN HTP OpDef supplement documentation to describe the use of the QNN_DEFINITION_IMPL_GENERATED encoding definition. {127977}
* API:
*   GPU:
*     Added support for the Qnn_DeviceHandle_t argument in the QnnContext_create API. {123584}
*     Added support for the Qnn_GlobalConfig API. {135731}
* Genie:
*   Added an async command to genie-app allowing for execution of asynchronous statements. {137243}
*   Added support for non-updatable quantization (NUQ) and grouped LoRA adapters. {138782}
*   Added the cache-groups JSON configuration option allowing for the sliding window attention (SWA) cache management policy. {135552}
*   Introduced the SSD dialog "branch-mode" config option with "top-1" and "all-expand" supported values. {134925}
*   Added Eaglet dialog support for dual head draft models. {134373}
*   API:
*     Added GENIE_NODE_IMAGE_ENCODER_IMAGE_POS_SIN and GENIE_NODE_IMAGE_ENCODER_IMAGE_POS_COS node inputs. {133935}
* HTP:
*   Added support for the LoRA weight-sharing feature by extracting updatable weights across all graphs into a shared blob. {126930}
*   Added support for the QAIRT Block Ops (Stateful LSTM, Stateful GRU, and Buffer Ops) for FP16 precision. {125048}
*   Added support for VA Reservation on Windows platforms. {138341}
*   Added support for the LoRA weight-sharing feature by extracting updatable weights across all graphs into a shared blob. {128558}
* Op:
*   GPU:
*     Added support for the GatherND Op on the GPU backend. {61057}
* OpDef:
*   Added Op definition for IsNaN. {135847}
* QNN:
*   Fixed broken links in the HTML documentation for the SNPE documentation URL "Qualcomm Neural Processing SDK" under Overview -> Integration workflow and in the tutorial for Utilizing DLCs. {143420}
* Tool:
*   LoRA Creator: Added support for any kernel shape for Conv in the LoRA branch. This removes the limitation where only 1x1 Conv was supported. {140575}
*   Converter:
*     Added support for SparseConvolution2D. {118014}
*     Optimized the LoRA Importer for non-updatable quantization (NUQ). {127586}
*     Resolved performance regression on CPU/DSP backends by removing redundant clip operations in the TFLite converter; now, clip is only added when required based on fused_activation_function. {123581}
*   Genie:
*     Added support for GenieEmbedding APIs in genie-app. {123549}

Bugs
~~~~
* Fixed an issue where the RPC memory allocation for a LoRA adapter was wrongly freed when the context had multiple graphs. {138835}
* Fixed an issue where LoRA weight tensor names were not found when graph transformation was involved. {136062}
* Fixed an updatable attribute tracking error for torch models. {145158}
* Added support for Conv2D Ops with the reuse_sparse_indices parameter defined, which prevents prepare/graph finalization failures. {143040}
* QNN Docs: Corrected the HTML docs for the qnn-net-run command-line argument from --output to --output_dir. {144805}
* Tool Update: [Converter]: Fixed performance regressions observed on CPU/DSP backends by removing redundant clip operations in the TFLite converter; clip is now only added when required based on fused_activation_function. {141085}
* SNPE Tools: Fixed snpe/qairt dlc-info to display the correct graph optimization level for HTP cache records generated via the API Snpe_SNPEBuilder_SetInitCacheMode() / SNPEBuilder::setInitCacheMode() or the net-run option --enable_init_cache. {142514}
* CPU:
*   Fixed quantization issues for large models by correcting the softmax Op implementation. {140260}
*   Resolved an issue with axis permutation for BW_AXIS_SCALE_OFFSET quantization encoding in Conv operations. {138266}
* DLC:
*   Fixed a small memory leak in DLC-based initialization in SNPE and QNN.
* Genie:
*   Fixed a crash when running SSD or SPD dialog types on certain Linux platforms. {137954}
*   Fixed an out of bounds read issue observed on uint16 embedding LUTs. {144801}
*   Fixed issue where first context binary split does not contain sufficient information about graph variants to properly initialize the KV$ Manager. {136530}
*   Fixed issue where the draft model EOS token was not set causing an Eaglet initialization failure. {145057}
*   Fixed minor memory leaks. {136813}
*   Fixed segmentation fault when graph switching is enabled along with memory mapping. {143826}
* HTP:
*   Fixed a deadlock issue that could cause the qnn-throughput-net-run application to hang under stress conditions. {142471}
* KI:
*   In the QNN HTP backend, an update to the prepare sequence causes a regression on some specific models. This will be fixed in the next release (2.36). {136438}
* Op:
*   HTP:
*     Optimized qu16 Dequantize op {136231}
*     Optimized the TransposeConv2d-Dequantize pattern near the output. {134467}
*     Optimized the TransposeConv2d-Dequantize pattern near the output. {136219}
*     Reduced preparation time for 5D operations with large batch sizes. {130280}
* SNPE:
*   Fixed a crash in snpe-throughput-net-run when the container argument was not specified before certain optional arguments. {141598}
* Tool:
*   Handled calibration input validation, quantizer params, and input type conversion for the HTP memory pipeline. {138064}
*   Fixed a failure in the memory pipeline when filtered inference schemas were non-sequential. {142391}
*   Ordered ONNX Runtime outputs based on output name to resolve issues in memory pipeline inference. {136967}
*   Removed backend_info from the Quantizer params to resolve an issue in memory pipeline compilation. {136586}
*   Updated how params of the pydantic object are accessed to resolve a preserve_io_datatype issue in the memory pipeline. {144331}
*   Converter:
*     Added support for LayerNorm with multiple normalization dimensions. {137898}
*     Added support for matching new GeLU Op patterns that include Reshape operations to address an issue where semantic search models failed conversion with AutoMHA2SHA. {139465}
*     Fixed a bug in the Conv/MatMul quantizer optimization to ensure safe indexing. {142845}
*     Resolved performance regression on CPU/DSP backends by removing redundant clip operations in the TFLite converter; now, clip is only added when required based on fused_activation_function. {140762}
*     Updated conv node's weight/bias naming during BatchNorm fusion to resolve quantization parameter naming conflicts. {139997}
*     Added support for a new pattern in RMSNorm pattern matching. {134922}
*     Added fix to remove injected ops blocking supergroups {134113}
*     Fixed accuracy drop in models having shared biases {134589}
*     Updated gamma and beta shape of Layernorm Onnx Op {130934}
*     Updated Tensor Name Sanitization Logic in {141135}
*     TFLite:
*       Added support for int64 quantized bias. {140882}
*   Converters:
*     Fixed issue of LayerNorm pattern mismatch. {137459}
*     Supported dynamic bias to ConvOp. {142223}
*   qairt-accuracy-evaluator:
*     Fixed inclusion of converter params in the execution summary. {140752}
*     Limited parallel QNN x86 evaluations to 1. {138075}
*   snpe-net-run:
*     Fixed a dynamic resizing issue in Conv op when using the --input_dimensions option. {142139}
* Tools:
*   Converters:
*     Reduced conversion time for large models with more than 10000 ops. {135822}
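
Several 2.37.0 entries describe adding a clip only when the TFLite ``fused_activation_function`` requires one. The decision can be sketched as a lookup from the TFLite activation name to the clip range it implies. The table and the function name ``clip_bounds_for`` are hypothetical and only illustrate the idea; the converter's actual mapping is not documented in these notes.

```python
# TFLite fused activations and the clip range each one implies.
# None means that no clip needs to be inserted.
_CLIP_BOUNDS = {
    "NONE": None,
    "RELU": (0.0, float("inf")),
    "RELU6": (0.0, 6.0),
    "RELU_N1_TO_1": (-1.0, 1.0),
}


def clip_bounds_for(fused_activation_function):
    """Return (min, max) for the clip to add, or None when no clip is required."""
    # Activations that are not simple ranges (for example TANH) also yield None.
    return _CLIP_BOUNDS.get(fused_activation_function)


for name in ("NONE", "RELU6", "RELU_N1_TO_1"):
    print(name, clip_bounds_for(name))
```

With a rule of this kind, a layer whose fused activation is ``NONE`` gets no extra clip node, which is the redundancy the entries describe removing.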



2.36.0
======

**6/30/2025**

QNN API version: v2.27.0

Changelog
---------

Features
~~~~~~~~
* API:
*   Added LLM support in the Python API. {118016}
*   Added support for quantizer-specific options in the Converter Python API, including parameters for `act_quantizer_schema`, `param_quantizer_schema`, and `target_backend`. These options are now available through the `CalibrationConfig` object, improving feature parity with the command-line interface. {136135}
*   Added support for the Baichuan2-7b model through the high-level Generative AI Python API, enabling both builder and executor workflows. {126702}
*   Added support for the Phi-3.5-mini model through the high-level Generative AI Python API, enabling both builder and executor workflows. {138126}
*   Added support for the Qwen2-7b model through the high-level Generative AI Python API, enabling both builder and executor workflows. {132444}
*   Enabled the generation and consumption of JSON profiling data on Windows platforms. Users can now utilize the profiling capabilities of the Python API on Windows on Snapdragon (WoS) systems. {138647}
*   Introduced a model conversion capability to modify the Auto-Regression (AR) number and Context Length (CL) of ONNX-based language models. This allows for flexible adaptation of models to different deployment requirements. {123570}
*   Genie:
*     Introduced Genie Dialog and Embedding APIs to set and get performance policy. {137070}
*   HTP:
*     Added support for `ContextFinalize` for the HTP backend, enhancing context management capabilities. {136699}
*     Implemented a URI Builder abstraction to simplify the programmatic construction of FastRPC URIs used for opening sessions with the HTP backend. {110797}
* Core:
*   Added custom Op support to the `oe-gcc11.2` and `oe-gcc9.3` toolchains for QNN Op package support on LE targets for HTP. {130471}
* Docs:
*   Updated the LoRAv2 tutorial to indicate support for Windows operating systems in both offline and online workflows. {138772}
* Genie:
*   Added `skip-lora-validation` option to reduce LoRA adapter switch time by allowing skipping of LoRA CRC checks on QnnHtp engines. {134913}
*   Added experimental support for the `arm64x-windows-msvc` platform. {129093}
*   Added support for Non-Updateable Quantization (NUQ) and Grouped LoRA, allowing LoRA adapter groups to share encoding bins and supporting non-updateable quant adapters. {138782}
*   Added support for pausing and resuming active queries using a signal API, introducing an architecture for resuming paused queries in SSD and basic dialogs. {119704}
*   Added support for profiling and logging of GenieEngine APIs, enabling measurement of switch time, creation time, and other metrics. {131908}
*   Added support for repetition penalties in sampling within the Genie Sampler. {118081}
* HTP:
*   Added support for HTP online graph preparation optimization level via platform options. {138420}
*   Added validation to reject Per-Graph-Execution (PGE) configurations that specify incompatible features such as shared spill/fill buffers or VTCM backup sharing. A warning is now issued to prevent these unsupported setups. {128832}
*   Enabled 64-bit UDMA support in QNN HTP, allowing access to memory beyond 4GB for large neural networks, and implemented shared-weights far mapping. {91520}
*   Enabled multi-context spill/fill buffer sharing for QNX. {128061}
*   Enhanced the HTP backend polling mechanism to support separate polling contexts and threads for each execution priority level. This design improves performance and resource management for multithreaded applications that concurrently run graphs with different priorities. {131859}
* LPAI:
*   Added support for LPAI backend RPC mode and `QNN_GRAPH_ERROR_EARLY_TERMINATION` in `qnn-throughput-net-run`. {121599}
* Op:
*   CPU:
*     Added support for Sparse Convolution 2D. {120883}
*     Updated the Cast Op to correctly map NaN (Not a Number) inputs to `True` when casting floating-point values to `BOOL8`, aligning with the ONNX implementation. {136649}
*   HTP:
*     Added support for the MaskedSoftmax Op on the HTP backend for LLM use cases. {110661}
*   LPAI:
*     Added support for the `frame_pad` parameter to the Buffer Op on the LPAI backend. {128999}
* OpDef:
*   Added an optional parameter `reuse_sparse_indices` to the Conv2d Op, with default support for AIC, GPU, HTA, and LPAI backends. {118012}
* SDK:
*   Introduced `QAIRT_SDK_ROOT` as the new primary environment variable for setting the SDK path. The previous `QNN_SDK_ROOT` and `SNPE_ROOT` variables are now deprecated and will be removed in a future release. For backward compatibility, they are currently set based on `QAIRT_SDK_ROOT`. {121206}
* Tool:
*   Enhanced layerwise debugging tools to accept externally provided "golden" reference outputs for comparison. This allows users to supply their own reference data. A new option to disable layout transformation during this process has also been added to accommodate various data sources. {122717}
*   Converter:
*     Added support for the new Einsum equation `nkctv,kvw->nctw`, expanding the range of supported ONNX models. {126231}
*     Added support to serialize disconnected model inputs (dangling inputs) from the source framework into the DLC file. {139058}
*     Defer loading is now enabled by default for the ONNX converter to improve memory usage and processing time. To disable this feature, use the new `--onnx_disable_defer_loading` flag for the QAIRT converter or the `--disable_defer_loading` flag for the QNN/SNPE ONNX converter. {139858}
*     Enabled support for the `--defer_loading` option in the QNN ONNX converter when generating C++/binary outputs. This feature, which was previously unsupported for this output format, helps reduce memory consumption and processing time during conversion. {139859}
*     Removed a limitation in the ONNX converter that previously prevented using defer loading (`--onnx_defer_loading`) and ONNX model simplification in the same conversion. Both features can now be used simultaneously. {116422}
*     ONNX:
*       Added support for the ONNX Size Op, which outputs the total number of elements of an input tensor as an int64 scalar. {138523}

Bugs
~~~~
* API:
*   Fixed a bug in the converter input configuration where the data type of the first input was incorrectly applied to all other inputs. {137113}
*   Fixed a bug in the model-level API where a typo in an internal variable could cause issues with input list file generation. {137830}
*   Fixed an issue in the Quantizer API where parsing an input list file containing comment lines (e.g., lines starting with '%') could fail. {136414}
*   Fixed an issue where the GenAIExecutor would return invalid performance metrics, such as -1 or 0 for timing and tokens per second. {137575}
*   Reduced excessive warning messages generated by `qairt.compile` by correcting an internal log level configuration. {137628}
*   Refactored the Python API to ensure model configuration files (`config.json`) can be loaded correctly using standard methods like `autoconfig.from_pretrained`. {131057}
*   CPU:
*     Fixed an issue where graph composition for the CPU backend would fail with an OpConfig validation error for the Transpose Op, particularly when using the `float_precision=16` conversion option. {138242}
* Core:
*   Improved model initialization time on the HTP backend by optimizing internal system calls during runtime setup. {136899}
* CPU:
*   Fixed an issue where certain models failed during inference due to an invalid layer parameter value resulting from a GroupNorm operation failure. {135924}
* Genie:
*   Fixed a memory leak in the tokenizer implementation observed when running `genie-t2t-run` with the LoRA adapter. {130865}
*   Fixed an issue where LLM inference could produce random or incorrect output. {124867}
*   Fixed LM head execution for split LEQ models during the last iteration of prefill. {139824}
*   Fixed sampling for float16 models, which previously produced nonsensical response text. {134604}
*   Reduced peak RAM by removing unnecessary copies for embedding LUT encoders when running embeddings on CPU, addressing high memory usage for longer prompts. {134506}
*   Resolved a crash in the Genie runtime that occurred when using non-empty stop sequences in a dialogue query. {138311}
* HTA:
*   Fixed a segmentation fault that could occur when executing a cached model on the HTA backend if a subgraph fell back to the DSP backend. {127808}
* HTP:
*   Fixed a performance regression on the HTP backend that affected certain transformer models, including those using masked softmax. {137554}
*   Fixed an accuracy regression for models using the ResizeNearestNeighbour Op. The fix adapts the HTP backend to handle updated quantization parameters resulting from an improved CPU backend implementation of the Op. {116566}
*   Fixed an issue that prevented the DSP driver from loading correctly for multicore execution on Android. {135235}
*   Fixed memory deregistering failures in GenAI use cases by deallocating unused tensor buffers after inference completion in async mode. {129731}
*   Resolved a performance regression on the HTP backend that affected both synchronous and asynchronous inference modes for certain models. {137386}
*   Op:
*     Fixed ElementwiseFloorDiv name mismatch. {135158}
* LPAI:
*   Fixed an accuracy regression for models using asymmetric parameter quantization. A change was introduced to correctly handle the `--param_quantizer_schema` flag, which may require users to update their quantization settings. When a tensor's encoding is symmetric, the quantizer schema must now be set to `unsignedsymmetric` to ensure correct behavior. {138453}
* Op:
*   CPU:
*     Fixed a dynamic bias issue in the DepthwiseConv2d Op that caused a segmentation fault with the QNN CPU backend. {137313}
*     Fixed a memory leak in the Expand Dims Op by ensuring the freeing of space created for axis data. {138049}
*     Fixed an issue by adding INT8 support for GroupNorm Op. {135932}
*   DSP:
*     Fixed a performance regression by preventing an unnecessary Reshape Op from being added by the LogSoftmax implementation when its input and output shapes are identical. {137013}
*   HTP:
*     Added 5D rank constraints for Softmax and Conv Ops, resolving an issue with ExecuTorch QNN Delegate model preparation. {137462}
*     Fixed an accuracy drop in the HTP backend's `GridSample` Op that occurred with multi-batch inputs (batch size > 1). {134663}
*     Fixed an accuracy regression in the HTP backend implementation of the `DepthToSpace` Op. This change restores the behavior to align with previous versions, resolving potential output deviations for models utilizing this operation. {139578}
*     Resolved an accuracy issue where models using the `Concat` Op on the HTP backend could produce different and less accurate results when running without the `--debug` flag in `qnn-net-run`. {134084}
* Tool:
*   Fixed an issue where an incorrect offset was generated during the dequantization of tensors with signed symmetric, per-channel encodings. {137056}
*   Resolved a segmentation fault that could occur in the `qnn-context-binary-generator` tool during the `QnnContext_free` call. {139746}
*   Converter:
*     Added support for GRU Op quantization, specifically enabling quantization for LPAI backend by optimizing static inputs. {126350}
*     Corrected an issue that could lead to accuracy regressions on the LPAI backend for models using 4-bit activation quantization. The SDK now correctly enforces the use of 8-bit activation quantization, as 4-bit is not supported on the LPAI backend. {137976}
*     Enabled `enableQnnQuant` flag for Resize Op in-out optimization, resolving issues with Nearest Neighbor and Bilinear modes. {137641}
*     Fixed a bug in the Converter tool so that input and output tensors are serialized in the correct order in the QNN graph JSON file, matching the IR graph. {118500}
*     Fixed a corner case in the Expand Op pattern matching, specifically resolving an issue in the Squash Tile Unsqueeze optimization that led to incorrect shape inference for multi-consumer cases. {136864}
*     Fixed a log print format issue that affected accuracy when converting LLM models with `maskedsoftmax`. {137471}
*     Fixed an issue where Batch Normalization (BN) scales and offsets were not correctly obtained from QDQ models, ensuring proper application of BN parameter encodings. {129578}
*     Fixed an issue where ONNX Logsoftmax Opset11 would add unnecessary reshapes, leading to extra transpose operations, even when input/output shapes were identical. {137545}
*     Fixed an issue where per-Block/per-Channel encodings were not correctly applied for weights during QAIRT conversion, resolving the inability to quantize DLC with 4-bit BQ weights. {134363}
*     Fixed an issue where using multiple Static Tensor nodes in a single graph would fail due to duplicate output tensor names. {136080}
*     Fixed an issue with merging `Mul` and `Add` operations into `Batchnorm` by correcting pattern definitions and adding validation checks. {136756}
*     Reduced converter memory and time usage by avoiding unnecessary access to tensor weights. {137665}
*     Removed the `beartype` import in the PyTorch converter. {134045}
*     Resolved an issue in the Layout Transform post-optimization where a node could be incorrectly squashed multiple times, causing incorrect broadcasted output shapes for certain `Reshape` and `Transpose` operations. {139382}
*     Updated tensor name sanitization logic to ensure uniqueness and prevent conflicts, resolving issues like "Compose Graph failed: Sigmoid Tensor already exists". {135409}
*     ONNX:
*       Enhanced support for the `If` Op in the ONNX converter to allow subgraphs with multiple outputs. {136721}
*       Resolved a `NameError` in the quantizer tool that occurred due to a missing internal logging function. {140893}
*   qnn-context-binary-generator:
*     Enhanced `qnn-context-binary-generator` to precompute and validate adaptation weight metadata paths, allowing early error detection for erroneous LoRA config contents and avoiding long wait times. {126629}
*   qnn-model-lib-generator:
*     Redirected error logs to `stderr` and all other logs to `stdout`. {135807}
*   Quantizer:
*     Resolved an issue in the quantizer to correctly apply per-channel quantization for grouped ConvTranspose Ops. {136585}

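The `QAIRT_SDK_ROOT` entry under SDK in the 2.39.0 features describes one primary variable with two deprecated fallbacks. The following is a minimal Python sketch of how a user script might resolve the SDK path under that scheme. The helper name `resolve_sdk_root` and the relative order of the two deprecated variables are assumptions made for illustration; they are not part of the SDK.

```python
import os


def resolve_sdk_root(env=None):
    """Return (variable_name, path), preferring QAIRT_SDK_ROOT.

    The fallback order between QNN_SDK_ROOT and SNPE_ROOT is an assumption
    of this sketch; the release notes only state that both are deprecated.
    """
    env = os.environ if env is None else env
    for name in ("QAIRT_SDK_ROOT", "QNN_SDK_ROOT", "SNPE_ROOT"):
        value = env.get(name)
        if value:  # skip unset or empty variables
            return name, value
    raise KeyError("None of QAIRT_SDK_ROOT, QNN_SDK_ROOT, SNPE_ROOT is set")
```

A script could call `resolve_sdk_root()` once at startup and log a notice when the returned name is one of the deprecated variables, which makes the migration visible before those variables are removed.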

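The Einsum equation `nkctv,kvw->nctw` listed under the 2.39.0 Converter features sums over the `k` and `v` indices. The pure-Python sketch below spells out that contraction on nested lists. The function name is hypothetical, and the code illustrates only the meaning of the equation, not how the converter implements it.

```python
from itertools import product


def einsum_nkctv_kvw_nctw(a, b):
    """Compute out[n][c][t][w] = sum over k, v of a[n][k][c][t][v] * b[k][v][w]."""
    # Read the dimension sizes from the nested-list operands.
    n_dim, k_dim = len(a), len(a[0])
    c_dim, t_dim, v_dim = len(a[0][0]), len(a[0][0][0]), len(a[0][0][0][0])
    w_dim = len(b[0][0])
    out = [[[[0] * w_dim for _ in range(t_dim)] for _ in range(c_dim)]
           for _ in range(n_dim)]
    for n, c, t, w in product(range(n_dim), range(c_dim), range(t_dim), range(w_dim)):
        # k and v appear in the inputs but not the output, so they are summed out.
        out[n][c][t][w] = sum(
            a[n][k][c][t][v] * b[k][v][w]
            for k in range(k_dim)
            for v in range(v_dim)
        )
    return out
```

With NumPy installed, the same result is available as `numpy.einsum("nkctv,kvw->nctw", a, b)`; the explicit loops are shown only to make the index pattern visible.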

2.35.0
======

**5/30/2025**

QNN API version: v2.26.0

Changelog
---------

Features
~~~~~~~~
* API:
*   Added LLM support in the Python API. {118016}
*   Genie:
*     Added a data-alignment-size configuration option for dialog and embeddings APIs. {130270}
*     Introduced the GeniePipeline.h and GenieNode.h APIs, providing multimodal support. {123389}
*     Introduced the GenieTokenizer.h API. {126408}
*   HTP:
*     Added support for new memory buffer types (`QNN_HTP_MEM_WEIGHTS_BUFFER` and `QNN_HTP_MEM_SCRATCH_BUFFER`) in the `QnnMem_register` and `QnnMem_deregister` APIs. {121766}
*     Introduced API changes to support external weights and spillfill buffers. {121760}
* Core:
*   Added platform information to the JSON output of the context binary utility. {129905}
* CPU:
*   Added dangling inputs support in Graph. {134280}
*   Added Phi 3 and Phi 3.5 model configurations to the Genie SDK. {134117}
* Docs:
*   Updated QNN/SNPE documentation to include QCS8625 in the list of supported Snapdragon devices. {134450}
* Genie:
*   Added support for use-mmap on Windows platforms. {116519}
*   Enabled support for multi-modal inference with low latency through the Genie pipeline, supporting various input/output modalities and utilizing shared embedding weights. {120507}
*   Removed printing of KPIs to stdout, favoring use of GenieProfile. {123352}
* HTP:
*   Added initial support for multi-core weight sharing during deserialization, including functions to handle VA allocation for weights per core and passing multi-core metadata. {124612}
*   Added multicore weight sharing support during deserialization to map shared weights to different cores without requiring VA reservations. {135411}
*   Added support for configuring extended_udma prepare time. {136435}
*   Added support for measuring end-to-end latency in the runtime. {98570}
*   Added support for the `QNN_HTP_CONTEXT_CONFIG_OPTION_DEFER_GRAPH_INIT` context configuration option to postpone graph-related tasks. {130605}
*   Added support for the `QNN_HTP_CONTEXT_GET_PROP_BUFFER_START_ALIGNMENT` context property to retrieve buffer start alignment. {134678}
*   Added support for the usage of external weights and scratch buffers on the HTP backend. {121767}
*   Added support to save the transport result for multicore transport during async execution. {132146}
*   Enabled support for dynamic input and output resolution for SD3 on the HTP backend. {105781}
*   Enabled the mmap budget feature for WoS to reduce peak RAM usage during context initialization for GenAI use cases. {131070}
*   Extended binary format support for spill/fill to include external buffers. {136017}
*   Implemented buffer size calculations for the HTP backend, including consideration for graph selection and calculation of maximum spill/fill buffer size. {121765}
*   Updated the Throughput Net Run (TNR) application to utilize thread_pool utilities for thread management. {113123}
* Op:
*   CPU:
*     Added dynamic dimension support for AvgPool2D. {126775}
*     Added dynamic dimension support for InstanceNorm Op. {101384}
*     Added support for the 'frame_pad' parameter in Buffer Op. {133242}
*   GPU:
*     Added support for the Cast operation from INT64 to INT32 on Windows. {132750}
*   HTP:
*     Added INT16 support for the ElementWiseAsin Op on the HTP backend. {114479}
*     Added support for the MaskedSoftmax Op on the HTP backend for LLM use cases. {110661}
*     Implemented performance optimizations for the Score Filter and NMS operations on the HTP backend. {134740}
* OpDef:
*   Added Op definition for IsInf. {125370}
* SDK:
*   Added an option to enable optrace profiling in the TNR application. {135588}
*   Enabled SNPE, QNN, and QNN delegate support for the QCM8550 platform. {129533}
* Tool:
*   Converter:
*     Added dynamic weights support for the Deconv Op in TensorFlow models. {109713}
*     Added support for Add, Subtract, Multiply, and Divide operations in Float32 precision for static tensor manipulation within the G2G IR. {125540}
*     Added support for ONNX 1.16.1 in the Ubuntu 20.04 (Focal) environment. {134975}
*     Added support for the Size operation and updated Relu opset versions in the ONNX converter to address unsupported operations in certain models. {133472}
*   Genie:
*     Introduced the genie-app command-line utility. {123548}
*   HTP:
*     Added support for the HTP MCP Binary format in the `QnnHtpBinaryBufferPrinter` tool, enabling proper parsing and printing of MCP binaries. {128507}

Bugs
~~~~
* API:
*   Allowed passing extra arguments through the Python API's `ConverterConfig` to underlying modules. {133985}
*   Fixed an encodings path issue during the build phase with GenAI models using the Python API. {133815}
*   Fixed an issue where quantized and compiled models failed during execution with the Python API when using default `CalibrationConfig` values. {134858}
*   Fixed an issue where the QAIRT Python API failed to load backend libraries (`QnnCpu.dll`/`QnnHtp.dll`) on certain devices. {134461}
*   Fixed an issue with the JSON reader setting in QNN profiling on Windows. {134565}
* Core:
*   Fixed cross SoC compatibility issues caused by unsynchronized GpuInfo fields between SocServer and SocUtility. {135786}
* CPU:
*   Fixed a memory management issue for xnnpack Conv2D nodes. {132710}
*   Fixed an issue where certain models failed during inference due to an invalid layer parameter value resulting from a GroupNorm operation failure. {135924}
* DSP:
*   Fixed a context binary generation issue on OE Linux Platform. {124376}
*   Fixed an issue where `snpe-net-run` failed due to an unavailable runtime. {135399}
*   Fixed inference time regressions observed on HTP_FP16 and HTP backends by propagating DSP architecture characteristics to the HTP core. {133777}
* Genie:
*   Fixed an asynchronous initialization issue for Windows platforms. {135904}
*   Fixed an issue where GenieDialog_save/restore could not be used with GENIE_DIALOG_SENTENCE_REWIND. {135558}
*   Fixed an issue where GenieProfiling data could report invalid initialization time data. {134498}
*   Fixed an issue where stop sequences did not work with GenieDialog_embeddingQuery. {134592}
* GPU:
*   Resolved model verification failures encountered with certain CNN models on the GPU backend, related to Conv Kernel processing. {130041}
* HTP:
*   Adjusted max PD size calculation to correctly account for far weights, resolving an issue with unexpected secondary PD triggers during specific test conditions. {127268}
*   Fixed a crash occurring in multicore graphs due to incorrect identification of spillfill memory pools by the Hexagon NN API. {135543}
*   Fixed a stability issue with Llama 3 3B multicore models by updating the method for setting the mc_spill_fill buffer. {135253}
*   Fixed an issue where `qnn-net-run` failed to open a session due to library loading and device transport instance creation errors. {135028}
*   Fixed an issue where core information was not correctly captured in optrace for multicore execution. {133797}
*   Fixed an out-of-memory issue occurring when running Llama 3 8B models on a single core without splitting. {134696}
*   Fixed async execution failures observed while running certain models in a multicore configuration with shared buffers. {135047}
*   Fixed logic in graph switching to prevent a bug. {133794}
*   Fixed multicore async inference failures, including issues observed with Zero copy. {134701}
*   Improved model execution time performance on SM8750, addressing an issue where the execution time KPI was not being met. {128145}
*   Resolved a graph execution failure issue observed during the async_group_init_llama7b_graph_switch_no_shared_resources test. {126402}
*   Resolved an issue causing incorrect mapping of test failures in nightly reports. {125884}
*   Resolved an issue leading to a "Failed to deregister ion memory with the backend" log message during multi-threaded HTP binary execution with shared buffers. {129716}
*   Resolved differences in adapter switch time between Genie and `qnn-net-run` by addressing issues related to graph switching and power settings. {131776}
* Op:
*   CPU:
*     Fixed an issue by adding INT8 support for GroupNorm Op. {135932}
*     Fixed TransposeConv2d for asymmetric kernels in Float execution. {133778}
*   GPU:
*     Fixed accuracy errors with the ReduceSum operation when used with Image2DArray for non-Mean ops and specific dimensions. {131616}
*     Fixed inference failures in models with Argmax/Argmin Ops. {133052}
*   HTP:
*     Added support for LayerNorm when the constant input is FP16 converted to FP32. {131420}
*     Enabled UINT_8 datatype support for the StridedSlice Op on the HTP backend, resolving model conversion and graph preparation failures. {125597}
*     Fixed accuracy issue for GatherNd Op. {110126}
*     Fixed an accuracy issue with LPBQ convolution for MOE on v73. {133134}
*     Fixed an issue where the Genie output resulted in an infinite loop with WoS by updating the prompt file. {134680}
*     Fixed an issue with high power consumption for DepthwiseConv op with asymmetric stride by optimizing the pattern on the HTP backend. {133635}
*     Improved accuracy of the Swish Op. {133898}
*     Improved performance of the MatMul Op running on HVX. {135210}
*     Improved the performance of the 5D GridSample Op on the HTP backend for W8A16 quantization. {122831}
*     Improved the performance of the GridSample Op on the HTP backend by addressing tiling and scheduling issues. {126462}
* SDK:
*   Fixed an issue where some models failed at the concat operation during graph preparation. {132887}
* Tool:
*   Added a validation check for float fallback to prevent quantizer failures when encodings or calibration lists are not provided. {133463}
*   Added support for the `--onnx_batch` and `--tensorflow_batch` options in Hypertuner after QAIRT converter changes. {131064}
*   Eliminated a misleading warning message "Function not called, PrepareLib isn't loaded!" that would appear when running `qnn-net-run` successfully on HTP. {122382}
*   Fixed an issue where the `is_symmetric` value for 32-bit bias tensors was incorrectly reset during Float Fallback, causing failures when the output DLC was passed back to the quantizer. {135379}
*   Fixed quantizer to insert Convert Op for LayerNorm weights with external encoding. {134466}
*   Resolved an issue where `snpe-dlc-graph-prepare` failed for certain models due to incompatible float bitwidths when QParams were present, particularly in the float fallback path. {130558}
*   accuracy_debugger:
*     Corrected a tensor shape issue for the oneshot algorithm with ONNX batch=1; the onnx_batch override option is no longer accessible. {133915}
*   Converter:
*     Added a fix for a bug in LayerNorm squeeze_axes. {126234}
*     Added a pattern that maps to the Expand Op to reduce inference time. {132363}
*     Added a warning message for the Non-Zero Op when the output shape is dynamic. {126185}
*     Added support for a new einsum equation, expanding the range of supported ONNX models. {133824}
*     Converter-generated FullyConnected Ops now have 2D input and 2D output. {127049}
*     Ensured that `ApplyEncodings` is called by the quantizer when `--use_quantize_v2` is provided internally, even if not on the command line. {133705}
*     Fixed a bug in NonZero Op translation constant folding. {127165}
*     Fixed a bug in the squash_node_into_nn_node optimization. {126354}
*     Fixed a conversion error that occurred when `--float_bitwidth 16` was provided on the command line with existing quantization parameters. {134716}
*     Fixed a corner case in the DCE process in the converter to correctly handle node removal based on the number of consumers of output tensors. {129704}
*     Fixed an error in the squash_node_into_nn_node optimization. {132836}
*     Fixed an issue where output nodes for BatchMatMul and BatchMatMulV2 Ops were missing by adding support to convert them to FullyConnected Op. {127139}
*     Fixed an issue where the converter failed when using the `--desired_input_layout` argument with the new layout transform algorithm by unifying its behavior with `custom_io`. {136144}
*     Fixed an issue with 6D support for Concat and Constant Ops in the frontend, resolving a core dump error during quantization. {117698}
*     Fixed incorrect population of the "is_symmetric" flag, ensuring encodings are dumped correctly. {134673}
*     Fixed an issue observed when several GRU Ops share one initial hidden state, and added a unit test for bidirectional GRU. {91127}
*     Fixed JSON dumping for 4-bit quantized tensors. {133481}
*     Fixed KernelScale expansion for scalars in TFLite DeConv dequantization. {128978}
*     Resolved an accuracy regression issue related to the `squash_batchnorm` optimization in the converter by ensuring the optimization correctly handles encodings. {130130}
*     Skipped adding dummy weights and bias tensors during LayerNorm pattern matching. {128870}
*     ONNX:
*       Added a fix for axis_format handling in matmul_to_fc translation. {118318}
*       Fixed a model conversion issue with the Resize operation in the ONNX converter. {131677}
*       Fixed an ONNX conversion failure for the Sam2 Image Encoder model by addressing layout format issues for Matmul node inputs and outputs. {131098}
*   Op:
*     HTP:
*       Optimized the DepthwiseConv op with asymmetric stride to improve performance for specific models. {132474}
*   qairt-accuracy-evaluator:
*     Removed the preproc-file option from the Accuracy Evaluator CLI as it is no longer valid due to the deprecation of minimal mode. {129278}
*   qnn-onnx-converter:
*     Fixed an issue where static tensor framework trace information was missing for some tensors. {120982}
*   qnn-tensorflow-converter:
*     Added logic to ensure the min-max in TensorFlow FakeQuantPerChannel nodes are symmetric. {118672}
*   quantizer:
*     Fixed an issue with 2-bit weight quantization calculation, resolving incorrect output values. {132048}

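The `qnn-tensorflow-converter` entry in the 2.35.0 fixes above mentions making the min-max of TensorFlow FakeQuantPerChannel nodes symmetric. One common way to do this, sketched below in Python, is to widen each channel's range to the larger of the two absolute bounds. This is an assumed formulation for illustration; the release notes do not specify the exact rule the converter applies, and the function name is hypothetical.

```python
def symmetrize_ranges(mins, maxs):
    """Return per-channel (min, max) pairs widened to be symmetric about zero.

    Assumption of this sketch: the symmetric bound is max(|min|, |max|).
    """
    symmetric = []
    for lo, hi in zip(mins, maxs):
        bound = max(abs(lo), abs(hi))
        symmetric.append((-bound, bound))
    return symmetric
```

For example, a channel with range `(-1.0, 0.5)` becomes `(-1.0, 1.0)` under this rule, so the original range is always contained in the symmetric one.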

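The 2.35.0 Converter features above add the Size operation to the ONNX converter. In ONNX, Size returns the total number of elements of its input tensor as an int64 scalar. A minimal Python sketch of that computation from a tensor shape follows; the function name `onnx_size` is hypothetical and the code is not taken from the converter.

```python
import math


def onnx_size(shape):
    """Return the element count of a tensor with the given shape.

    An empty shape denotes a scalar, which holds exactly one element.
    """
    return math.prod(shape)
```

A shape of `(2, 3, 4)` gives 24, and any shape containing a zero dimension gives 0.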

2.34.0
======

**4/30/2025**

QNN API version: v2.25.0

Changelog
---------

Features
~~~~~~~~
* API:
*   Genie:
*     Added `GenieEngine.h`, `GenieDialog_getEngine`, and `GenieDialog_bindEngine` APIs. {126715}
*     Added the `GenieSampler_registerUserDataCallback` API, which adds a `userData` argument to the sampler custom callback. {130164}
*   SNPE:
*     Added Java API `setUnconsumedTensorsOutput()`, equivalent to the C/C++ builder API `Snpe_SNPEBuilder_SetUnconsumedTensorsAsOutputs()` / `SNPEBuilder::setUnconsumedTensorsAsOutputs()`. {125891}
* CPU:
*   Added axes parameter support in L2Norm. {121463}
*   Added BOOL support in CPU Concat Op. {130940}
* DSP:
*   SNPE:
*     Added the ability to display the exact priority of the HVX thread in the log to help identify potential issues related to HVX concurrency scenarios. {117790}
* Genie:
*   Added a LoRAv3 reference/sample Genie configuration to the SDK examples. {130008}
*   Added KV quantization support for the GenAiTransformer backend. {123438}
*   Added the Eaglet dialog type. {126452}
*   Added token-acceptance-rate to the GenieProfile output for some dialog types. {123350}
*   Introduced a performance optimization where logits are sampled using the native datatype output of the model. {121359}
* HTP:
*   Added support for graph switching with multi-PDs. Maximum support is 2 PDs with 16 GB physical memory; 3 PDs require a minimum of 32 GB physical memory. {119377}
*   Deprecated optrace collection via debug configuration files. Use optrace via profiling instead. {124739}
*   Fixed an issue where the number of items was missing in the multicore callback. {129636}
*   Implemented service call to do dspqueue_close for multicore environments. {126381}
*   Introduced parallel graph execution, enabling concurrent running of multiple graphs on a single HTP core to improve throughput and resource utilization. {89181}
*   Improved performance of the Softmax Op with 32 or fewer channels. {130819}
* Op:
*   GPU:
*     Added support for GridSample Op. {127898}
*   HTP:
*     Optimized DepthwiseConv op performance for an ASR model on SM8750 HTP W8A16. {129860}
*     Optimized DepthWiseConv2d op execution by ensuring it runs on HMX. {128655}
* OpDef:
*   Added dynamic shape support for FullyConnected Op. {116235}
*   Added optional parameter `buffer_padding` to Buffer Op. {125962}
* Tool:
*   QAIRT tools such as `qairt-converter` and `qairt-quantizer` are moving out of Beta status. {132893}
*   Optimized performance of the Restormer model on SM8750 HTP A16W8. {127924}
*   Converter:
*     Added support for BQ and LPBQ in JSON serializer and deserializer. {132650}
*     Added support for quantized DLC files as input to the quantizer module. The quantizer (1) returns directly if all tensors are quantized or overridden float; (2) for a half-quantized DLC, dequantizes the fixed-point tensors back to float before quantization; and (3) quantizes all float tensors. {129135}
*     Added support to trigger Quantizer with float_fallback mode. {129131}
*     Enabled triggering of the Quantizer with float fallback within the converter when quantization overrides are present. The default fallback data type is float16, but float32 can be enabled using the `--float_bitwidth=32` option. {132893}
*     Fixed handling of dynamic input shapes with a more informative error message. {127631}
*     QAIRT Quantizer now skips quantization steps if float_fallback is specified for an input Quant DLC. {130397}
*     Introduced a new Converter argument, `--export_format`, to select the Converter output export format. Accepted values are `DLC_DEFAULT` and `DLC_STRIP_QUANT`. {129132}
*   qairt-converter:
*     ONNX output order is now preserved by default during ONNX to DLC conversion with qairt-converter. {133697}
*     By default, qairt-converter converts a quantized source model into a DLC with quantized tensors. Tensors with missing quantization parameters are converted to the FP16 datatype by default. Such a DLC can fail inference on float runtimes such as QNN-GPU and QNN-CPU because quantization parameters are present.
*     If `--export_format=DLC_STRIP_QUANT` is passed to qairt-converter, the quantization parameters are stripped and the tool produces a DLC with float tensors. Such a DLC can be used for inference on float runtimes such as QNN-GPU and QNN-CPU.
*   qnn-onnx-converter:
*     Added the `--preserve_onnx_output_order` option to maintain ONNX output order in the converted graph. {126070}
*   qairt-quantizer:
*     qairt-quantizer can accept a Quant DLC produced by qairt-converter to convert remaining float tensors into quant types. This requires a calibration data set to be specified through the `--input_list` flag of qairt-quantizer. {132893}
*   qairt-dlc-to-json:
*     Added a new tool, `qairt-dlc-to-json`, that produces a JSON file from a DLC file. {132893}

Bugs
~~~~
* QNN Core:
*   Fixed an issue where QNN Savecontext failed for multiple models on Windows platforms due to the inability to find the graph in the DLC. {130104}
* CPU:
*   Added int32 datatype support for ScatterElements. {126766}
*   Fixed L2Norm to handle multiple axes. {127053}
*   Fixed verifier failures for single-layer resize models on ONNX16 framework. {124524}
*   Implemented deep copy of `opConfig` in CPU to prevent model failures. {128204}
* DSP:
*   Fixed an SNPE inference failure due to QnnContext_createFromBinary failing with a memory allocation error. {127804}
*   Fixed an SNPE inference failure where multiple models failed due to errors obtaining input tensor names. {127809}
*   Fixed inference failures for specific models on HTP due to network partition issues. {131151}
* Genie:
*   Fixed an issue in `genie-t2t-run` where dialog de-initialization data was not saved. {132621}
*   Fixed an issue where GenieEmbedding_generate would return a rank of 0. {131581}
*   Fixed an issue where per-context configurations were incorrectly merged with group context configurations when calling QnnContext_createFromBinaryListAsync, resulting in context creation failures. {128851}
*   Fixed an issue where quantized values may overflow or underflow. {125929}
* GPU:
*   Fixed accuracy error in QnnGpuOperationTestActivationAndroid. {125640}
*   Fixed accuracy error in QnnGpuOperationTestTransposeConvAndroid. {125992}
*   Fixed inference regressions in models having Convolution Op in `gpu_fp16` mode for some devices. {120026}
* HTP:
*   Addressed inference time regressions on multiple chipsets for HTP and HTP_FP16 configurations. {128165}
*   Corrected the TransportResult resize function to properly set the number of cores. {132311}
*   Fixed a crash in libQnnHtp.so that occurred in graph switch scenarios involving spill fill buffer sharing. {131575}
*   Fixed a deadlock in `allocateAndMapPersistentSpillFillBuffer()` that occurred due to locking conflicts. {132488}
*   Fixed a hang issue in GenAI TNR tests when using asynchronous group initialization with weight sharing and spill-fill sharing with weight sharing. {132586}
*   Fixed a HexNN multi-core handling issue that occurred when executing dual graphs. {133236}
*   Fixed a LayerNorm validation failure by checking rank of bias only if it's present in LayerNorm Op. {106186}
*   Fixed a multithreaded concurrency issue with LLM and small models that caused a 'memHandles registration failure'. {131051}
*   Fixed a performance regression for a MobileBERT model that was introduced in a previous release. {132111}
*   Fixed a prepare failure for the L2Norm op with fp16 when the relaxed_precision_flag is not set during converter stage. {129566}
*   Fixed a Windows compatibility issue related to non-shared weight VA reservation. {130567}
*   Fixed an issue where attempting to register the same Op package multiple times resulted in a QNN initialization failure (error code 4005) in non-RPC mode. The QNN runtime now handles multiple registrations of the same Op package gracefully. {127483}
*   Fixed an issue where multiple VA sharing groups caused the error 'Unable to map reserved buffer for non-shared weights'. {131009}
*   Fixed an issue where QNN HTP inference failed during MC detailed profiling. {132564}
*   Fixed an issue where qnn-context-binary-generator would hang, consuming excessive CPU and memory. {126833}
*   Fixed intermittent hangs that occurred during the creation of a context from a binary in concurrent scenarios. {131049}
*   Fixed issue with 64-bit runtime option in the multicore path. {133300}
*   Fixed issue with 64-bit runtime option in the VA reservation path. {133125}
*   Fixed the checker failures related to the OpPackage example by correcting the include path. {130707}
*   Improved performance to address inference time regressions observed on multiple chipsets. {131073}
*   Resolved an issue related to spill-fill buffer sharing, which caused incorrect output. {124544}
*   Resolved an issue with x86_prepare failures during savecontext. High CPU utilization during graph preparation was addressed. {125093}
*   Resolved failures in LoRA v2 test cases due to DSP transport call issues, impacting multi-model context and graph switch scenarios. {130142}
*   Resolved inference time regressions on SM8750. Avoided broadcast overhead on mul_op to improve performance of uint16 elementwise multiplication. {125746}
*   Reverted the enablement of the 64-bit flag to address reported hangs. {130301}
*   Updated PGE support check to use support Features on SoC Model. {127754}
* LPAI:
*   Fixed a failure in LPAI direct mode. {131750}
*   Fixed an issue where LPAI single layer models were failing. {130729}
* Op:
*   DSP:
*     Added LayerNorm support; modified the hard-coded check. {122112}
*   HTP:
*     Added 5D support for float Sigmoid. {128867}
*     Addressed performance issues when converting models with w8a16 compared to w8a8 on SM8350 by optimizing matmul and Gemm OPs. {121404}
*     Fixed a QNN context-binary-generator failure due to a TCM insufficient tile error when processing a custom model. {129510}
*     Fixed context binary generation failures for ArgMin/ArgMax ops due to TCM overflow. {108763}
*     Fixed model validation errors during context saving, specifically addressing issues with the DepthToSpace Op. {131083}
*     Fixed numerical issue for DepthwiseConv2d -> HardSwish in a MobileNetV3 model. {128158}
*     Fixed rank constraints of Op replacement rule. {130194}
*     Fixed ReduceMax FP16 compilation error. {127900}
*     Improved DepthwiseConv2D performance. {126421}
*     Optimized Reshape Ops when PCQ is enabled on constant tensors going into a MatMul Op, improving performance. {130415}
*     Registered QInt16 for Concat Op to resolve graph preparation failures when using QuantInt16 tensors. {125735}
*     Resolved an issue where context binary size calculation failed during graph preparation. {124130}
*     Resolved an on-device hang issue during execution of Dynamic MobileNet V2, specifically during the Transpose Op. {126806}
*     Resolved context binary generation failures for the BevFormer model with AMP encodings. {129991}
*     Signed 16-bit 5D tensors are currently not supported for ElementWiseRsqrt operations on HTP. {122496}
* SDK:
*   `ReleaseNotes.txt` renamed to `QAIRT_ReleaseNotes.txt` and now contains release notes for both Unix and WoS. {127817}
*   Fixed build issues in Qnn SampleApp, Qnn SampleAppAsyncExecution and Qnn SampleAppSharedBuffer. {131442}
*   Removed "pytorch to onnx conversion avoidance suggestions" from QNN SDK Docs. {132125}
*   Updated QNN SDK documentation and dependency check script to include required Ubuntu host version. {131279}
* SNPE:
*   Fixed API `Snpe_SNPEBuilder_SetInitCacheMode()`/`SNPEBuilder::setInitCacheMode()` breakage for non-HTP backends when using the `snpe-net-run` option `--enable_init_cache`. {129545}
*   Fixed the `--enable_init_cache` option (API `SNPEBuilder::setInitCacheMode()`/`Snpe_SNPEBuilder_SetInitCacheMode()`) in `net-run` for AIP runtime. {131929}
* Tool:
*   Converter:
*     Added separate APIs for BQ/LPBQ weight dequantization by decoupling them from the quantizeDequantize function to improve flexibility and enable future enhancements. {132236}
*     Added validation to the layernorm pattern matching to ensure the pattern is valid. {129246}
*     Corrected an issue where qnn-context-binary-generator logged an incorrect QPC path when the --backend_binary option was used. {126169}
*     Corrected the allowed length for pad amounts for 4D tensors in the emitter. {132185}
*     Enabled data invariant optimizations for the Tile Op. If the input of Tile Op is quantized, the input dataType and qInfo are copied to the output. {126372}
*     Fixed a bug in inferOutputShapes of ElementwiseBinary Op where alignChannel was not correctly set based on input tensor dimensions. {128406}
*     Fixed a segfault issue in IrJsonDeserializer during deserialization of newly generated model JSON files. {129816}
*     Fixed an issue in the quantizer that caused conversion failures for models with explicit Quantize/Dequantize ops due to incorrect output tensor datatype. {133050}
*     Fixed an issue where Accuracy Evaluator runs failed at the Netrun stage. {129997}
*     Fixed an issue where context binary generation failed with a 'Graph Finalize failure' when using multi-Qranium pipelined partitioning. {124908}
*     Fixed an issue where FOLD_MULTIPLE_TRANSPOSE was incorrectly pruning graph outputs. {127963}
*     Fixed an issue where qnn-context-binary generation failed for LVM UNet models due to tensor updateability and GroupNorm Op validation errors with the HTP backend. {127887}
*     Fixed an issue where the qnn-context-binary-generator tool failed on Windows-X86 when processing LoRAv3 models. {130894}
*     Fixed Converter failure for Cast Op when using float_fallback. {132918}
*     Fixed index error failure in remove identity optimization. {125867}
*     Fixed issue when folding multiple transposes to retain graph output names. {128685}
*     Fixed issue where bias of FullyConnected layers was incorrectly set to FP32 precision when inputs were INT16 and weights were INT8. {126705}
*     Fixed issue where tensor encodings were lost after quantization in float fallback mode. {132617}
*     Fixed Layout Transform to avoid unintentionally loading deferred weights. {132173}
*     Fixed Quantizer failure when using '--use_native_output_files' flag with models having mixed output types by correctly distinguishing between float and native outputs. {107518}
*     Resolved a serialization issue with MatMul ops involving int16*int16 data types when using dynamic 16-bit weights. {129733}
*     ONNX:
*       Added support for dynamic inputs for Clip Op. {124203}
*       Fixed an issue in the Converter to ensure correct name sanitization following C++ naming conventions. {129356}
*       Fixed axis tracking in ScatterElements. {118614}
*       Fixed issue for reverse GRU Op to ensure the correct order of input names for the first output. {130544}
*       Fixed issue in Expand Translation. {132126}
*       Updated translation for ExpandOp to reduce inference time. {127065}
*   qairt-accuracy-evaluator:
*     Fixed issue where the input list was incorrectly passed to the quantizer. {130537}
*     Added support for the 'algorithms' quantizer parameter in the evaluator and provided input shape to the converter for PyTorch models. {126291}
*   qnn-accuracy-debugger:
*     Enhanced the qnn-accuracy-debugger tool to provide more meaningful metrics for intermediate tensor cosine similarity. {126437}
*   qnn-net-run:
*     Resolved an issue in accuracy evaluator runs where the error "'Namespace' object has no attribute 'preserve_graph_output_order'" was encountered. {132180}
*   qnn-onnx-converter:
*     Aligned the ONNX Resize Op translator's behavior with ONNX definitions. {123092}
*   Quantizer:
*     Fixed incorrect bitwidth assignment for shared static tensors, which could lead to quantization issues. {127777}
*   snpe-architecture-checker:
*     Fixed an issue where snpe-architecture-checker would fail due to an uninitialized variable. {126778}
*   snpe-stress-net-run:
*     Fixed a memory leak issue when loading QNN models. {128498}

Known Issues
~~~~~~~~~~~~
* Tool:
*   Converter:
*     Models with sparse tensors may fail in the conversion step with the error message: "'Tensor is incorrectly sparse', 'Op config was not well formed.'". This will be addressed in an upcoming release. {132699}
*     Models with Conv Ops may fail during context binary generation or graph preparation on HTP with the error message: "'[ERROR] Tensor <#xx>and <#xx> have mismatching datatypes. 0x408 != 0x232.', '[ERROR] Op specific validation failed.',". If the OpConfig validation fails because of a Conv Op (FP16 activation, FP16 weights, FP32 bias) in the converted DLC, specify `--float_bias_bitwidth=16` to avoid the failure. {134712}
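
The workaround for the Conv Op known issue above can be sketched as a small helper. This is a minimal illustration in Python, assuming the flag is passed to qairt-converter and that the two quoted error fragments appear in the captured tool log. The `--input_network` option and the file name are assumptions, and the function names are hypothetical.

```python
from typing import List

# Substrings taken from the error text quoted in the known issue above.
MISMATCH_MARKER = "have mismatching datatypes"
VALIDATION_MARKER = "Op specific validation failed"
WORKAROUND_FLAG = "--float_bias_bitwidth=16"


def needs_fp16_bias(log_text: str) -> bool:
    """Return True when both error fragments appear in the log text."""
    return MISMATCH_MARKER in log_text and VALIDATION_MARKER in log_text


def apply_workaround(converter_cmd: List[str], log_text: str) -> List[str]:
    """Return a copy of the command, with the flag appended when the log matches."""
    if needs_fp16_bias(log_text) and WORKAROUND_FLAG not in converter_cmd:
        return converter_cmd + [WORKAROUND_FLAG]
    return list(converter_cmd)


# Sample log built from the message quoted above; placeholders left as-is.
sample_log = (
    "[ERROR] Tensor <#xx>and <#xx> have mismatching datatypes. 0x408 != 0x232.\n"
    "[ERROR] Op specific validation failed.\n"
)
base_cmd = ["qairt-converter", "--input_network", "model.onnx"]  # assumed options
print(apply_workaround(base_cmd, sample_log))
```

The helper leaves the command unchanged when the log does not contain both fragments, so it can be applied to any conversion retry.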



2.33.0
======

**3/31/2025**

QNN API version: v2.25.0

Changelog
---------

Features
~~~~~~~~
* QNN Core: Implemented netRun directMode for Windows, enabling execution on native aDSP LPAI backend. {113779}
* API:
*   HTP:
*     Introduced QNN_TENSOR_DATA_FORMAT_HMX_WEIGHT_LAYOUT. This new data format allows for reduced latency when transferring I/O tensors to and from the backend accelerator. {118664}
* Genie:
*   Added LoRA adapter application latency metrics to the `GenieProfile` output for performance analysis. {123346}
*   Enabled modification of the sampler type within the `GenieSampler_applyConfig` function to allow more flexible sampler configurations. {127121}
*   Introduced the GenieLog.h API, providing logging capabilities within the Genie library. {98349}
* HTP:
*   Improved performance in scenarios where graph inputs are used by Gather Ops. {129605}
* SDK:
*   Core:
*     Added a new QNN Sample App demonstrating graph selection and switching for multi-graph use cases. Also improved memory handling in exit path for all the sample apps. {124467}
* Tool:
*   Added support for defer loading in case of quantization override. {127344}
*   TNR APP: Enhanced MSE precision to 15 digits and enabled md5sum for result verification. {122723}
*   Converter:
*     ONNX:
*       Added multi-graph deduplication to reduce context binary size and improve memory efficiency for LLMs. {116193}
*       Added support for moving tensors from constant nodes to initializers to improve memory efficiency for LLMs. {112211}
*       Improved handling of external data loading for ONNX model initializers, including support for 0-D tensors. {126087}

Bugs
~~~~
* AIC:
*   Added missing operator support to resolve compilation failures for the Falcon 40b model at the context binary stage. {127360}
*   Fixed an issue causing "Graph Finalize failure" errors in QNN models within the digest. {128459}
* CPU:
*   Fixed an issue where the converter was failing to generate activations for multiple ONNX models leading to a netrun failure. {60913}
*   Fixed inference failures observed with custom models on the CPU backend. {128291}
* Genie:
*   Fixed an issue that resulted in degraded text generation performance after a KV cache rewind. {121634}
*   Fixed incorrect path in HTP BGE-Large SDK docs tutorial. {130311}
* HTP:
*   Fixed accuracy issue with W16A16 Batchnorm. {121288}
*   Fixed an issue in LoRA patching that caused the num_blobs check to fail, preventing successful patching when the quantization parameter blob was missing in the adapter patch. {125972}
*   Fixed heap memory crashes caused by profile sizes. {123330}
*   Resolved issues related to sparse weight compression during context binary generation for DLBC models. {126398}
*   Improved graph spill fill buffer size for graphs containing Add Ops. {122012}
*   Improved performance of Space Rearrange Op by correcting Central Tiling to use HVX implementation of space_rearrange. {126760}
*   Resolved an issue where incorrect VTCM memory was allocated when the `vtcm_size` parameter was set to a negative value. {125808}
*   Resolved performance regressions observed during online graph preparation on the HTP backend when using FP16 precision. {126671}
*   Restored Forced Preemption mechanism functionality by ensuring QNN_ENABLE_HEXAGON_V73 flag is correctly enabled for auto builds. {128242}
*   Updated RPC call status for multi-core in Windows to ensure that errors are correctly propagated. {128467}
*   WoS:
*     Fixed the HTP spill fill buffer by resolving max_size=0 regression in x86_64 system. {125414}
* LPAI:
*   Fixed an issue preventing QNN SampleApp from running LPAI models. {126973}
*   Fixed an issue with accuracy drop for some models after applying overridden relu encoding. {128307}
* Op:
*   HTP:
*     Fixed accuracy failures with HTP FP16 custom MobileNet_v2 mixed precision models due to rounding issues in half-float to signed 16-bit conversion. {121932}
*     Fixed accuracy issues in MirrorPad and EdgePad Ops when input and output quantization configurations differed. {125565}
*     Fixed an issue in patch binary generation for animation and watercolor concurrencies. {126263}
*     Improved performance for long context length models by enabling Native KV Format optimizations when memory handle is used. {124602}
* Tool:
*   Addressed SDXL Models Conversion Failures. {113458}
*   Enabled node sanitization for quant overrides in Accuracy Evaluator to use correct weights. {125574}
*   Ensured ChannelShuffle output transposed to NCHW when --preserve_io layout is enabled. {123922}
*   Fixed an error when max_samples is set to -1. {129254}
*   Fixed conversion issue with LayerNorm Op with AXIS=-3. {126434}
*   Fixed failures in Netrunner stage for models with 'preserve_io_datatype' enabled in the configuration. {127071}
*   Fixed graph execution failure on inference. {126792}
*   Fixed issues preventing AccuracyDebuggerSimulationChecker and AccuracyDebuggerChecker from running correctly with TensorFlow models. {126636}
*   Fixed QAIRT accuracy evaluator failure with mobilenetV3 model. {126296}
*   Implemented enhanced error handling in qnn-net-run to provide more informative error messages to the user. {124486}
*   QNN Debugger: Addressed bugs in add/skip layers options when using online preparation. {128209}
*   Converter:
*     Added support for specifying batch size during model conversion when using the TensorFlow converter. {94854}
*     Corrected the handling of Bias tensor encodings in FC and MatMul Ops. {120720}
*     Corrected the layout override logic for the Select Op. {110665}
*     Fixed a bug in Op sequence matching for the GroupNorm Op. {124757}
*     Fixed an issue in the Converter so that BN squash is disabled when weight/bias overrides are present on the conv node. {124293}
*     Fixed an issue where the conv2d bias received an incorrect scale when input[0] was overridden to float. {116164}
*     Fixed an issue where duplicated parameter tensor names could occur during the IR graph building process due to sanitization. {100554}
*     Fixed an issue where folding transposes would cause graph output names to be lost. {125412}
*     Fixed argument parsing errors in QNN converter that caused failures for pytorch models on cloud when used with QNN-AIC tool. {126985}
*     Fixed the logic of "add_op_to_backend" in QnnCastTranslation. {117228}
*     Resolved an accuracy issue observed with the Mask2former head when using FP16 precision. {123657}
*     Resolved an issue where model input encoding was not correctly derived from Quantize/Dequantize Ops after quantization nodes were removed. {124268}
*     Updated QNN model structure to reduce model size when Expand Ops are present. {122529}
*     ONNX:
*       Added support for three new einsum equations, expanding the range of supported ONNX models. {113767}
*       Fixed a bug in axes format population for the Pool Op. {118828}
*       Fixed an issue where GroupNorm Op was not getting correct gamma and beta tensor values, which may have led to accuracy issues in optimized graphs. {119523}
*       Fixed an issue with the ElementwiseSelect Operator that resulted in incorrect input dimensions. {127541}
*       Fixed graph name inconsistency in qairt lora converter workflow. {104805}
*       Fixed propagation of user encodings in IdentityOp. {126845}
*   MHA2SHA:
*     Updated instructions to resolve save context failures with LLAMA_3.2_3B model when using SHA artifacts. {128705}
*   qairt-accuracy-debugger:
*     Added multi-processing support and fixed bugs related to the petr model architecture in layerwise snooping. {125819}
*     Removed GPU runtime support for layerwise/cumulative layerwise snooping algorithms in Accuracy debugger. {123945}
*     The framework option in the API is no longer needed because the tool now identifies the framework from the model passed in. {124162}
*     Updated Layerwise and Cumulative Layerwise snooping algorithms to work with new quantized encodings. {126445}
*   qnn-accuracy-debugger:
*     Disabled the CPU runtime for layerwise/cumulative layerwise snooping algorithms in the Accuracy debugger, as the CPU runtime does not support mixed precision models. {123938}
*     Fixed issue with --set_output_tensor argument processing in add_layer_types for the inference engine. {107697}
*     Resolved input size mismatch error preventing debugger execution on Bert_Large_BoolMask model. {117930}
*   qnn-context-binary-generator:
*     Fixed context binary generation failure with AIMET encodings when using Per Channel Quantization (PCQ). {119824}
*   Quantizer:
*     Fixed an overflow issue in profiling data casting. {123812}



2.32.0
======

**2/28/2025**

QNN API version: v2.24.0

Changelog
---------

Features
~~~~~~~~
* Genie:
*   Added dialog priority support with GenieDialog_setOemKey and GenieDialog_setPriority. {124438}
*   Added Windows build support for the source code examples. {107778}
*   Reorganized the Genie SDK documentation. {125499}
* HTA:
*   Converter will now translate ResizeBilinear and ResizeNearestNeighbor Ops to Resize Op. {108065}
* HTP:
*   Added FuSa coverage for fp16_l2norm, enhancing code reliability for automotive applications. {125464}
*   Established standard thread group values for QNN. {111627}
*   Implemented support for 50us OP bounding in QNN, enabling HMX Op bounding to control operator execution time. {99668}
*   Native KVcache now provides bit-exact accuracy compared to non-native implementations. {118670}
* Op:
*   GPU:
*     Added support for the ElementWiseSoftplus operation and for ElementWiseNeuron with the SOFTPLUS parameter. {40029}
* QNN:
*   Added an interface to check if tensor get/free callbacks are set. {120293}
*   OpDef:
*     Added dynamic shape support for DepthWiseConv2d op. {116233}
* Tool:
*   accuracy-debugger:
*     Enabled support for pre-quantized models in the accuracy debugger tool. {123644}
*   Converter:
*     Added support for Device NMS (QDetect) in the QNN converter for AIC. {68665}
*     Added support for partial IO for Preserve IO Datatype. The --preserve_io_datatype flag now accepts specific inputs/outputs for datatype preservation. {117580}
*     Enabled QnnIR backend for all toolchains having QNNHTP support. {111592}
*     ONNX:
*       Added LayoutInferer support for CustomOp, enabling layout transformations for custom operations based on user-defined XML op definitions. {104942}
*   qnn-onnx-converter:
*     Added support for BQ encodings in the QNN converter to generate encodings files compatible with downstream tools. {108867}

Bugs
~~~~
* QNN Core: Resolved a segmentation fault that occurred in the Squeeze operation when all input and output dimensions are 1. {122278}
* CPU:
*   Added rounding mode support in MaxPool 2D. {120796}
*   Resolved a segmentation fault that occurred with shared bias models. {120819}
*   Resolved inference time regressions observed on some Qualcomm platforms by addressing frequency scaling issues. {122404}
*   Fixed Finalize failure for ScatterElement. {124109}
* Genie:
*   Fixed a performance regression for kv-share dialogs using the token query API. {125852}
*   Fixed a qnn-genai-transformer-composer failure when preparing LoRA adapters. {125328}
*   Fixed issue where a double free could occur when using backend extensions. {127304}
*   Fixed issue where a Gen AI Transformer dialog attempts to double free memory. {125676}
*   Fixed issue where multi-token stop sequences were not fully omitted in queryCallback and the KV cache. {123501}
*   Fixed issue where SPD token rate is incorrectly reported when the query is aborted. {119574}
*   Fixed issue where tokenizer state is corrupted after a query abort. {121193}
*   Resolved an issue that caused FP16 model validation to fail. {119655}
* GPU:
*   Fixed an issue where multiple models failed on SM4250-IOT with a Graph Execution failure when using the GPU backend. {121468}
*   Fixed an issue where multiple models failed with Graph Execution failure on the GPU backend. {96038}
*   Resolved a BCT verifier accuracy drop found in QNN for one model when using DLCs from an earlier SDK version. {124475}
*   Resolved an issue that caused onnx-custom_esrgan models to fail verification on the GPU backend, improving accuracy for Procyon benchmarks. {124167}
*   Resolved an issue where stress tests were failing on QCS9100 with GPU backend, resulting in the application being killed. {114399}
*   Resolved an issue where stress tests hit segmentation faults on the QNN GPU backend on SXR2330p and SM8750 platforms. {107686}
* HTP:
*   Fixed a memory management issue that caused inference failures due to DMA handle deregistration errors on LE targets. {127040}
*   Fixed a timeout issue in SNPE util. {120943}
*   Fixed an accuracy issue in the Logsoftmax kernel with FP16 data types, improving results when input data has a large range. {123146}
*   Fixed an issue on SM8750 targets where Llama7B models using async group init and VA sharing failed due to problems unmapping from reserved space. {124202}
*   Fixed an issue on SM8750 where async group initialization failed due to a failure to register ion memory with the backend. {124192}
*   Fixed an issue that caused GGUF models to fail during the save context stage. {123937}
*   Fixed an issue where context binary generation was failing for certain models when using quantization overrides. {125743}
*   Fixed failures in LoRA async group init and VA sharing test cases on SM8750 platform caused by context creation issues. {124191}
*   Resolved a memory allocation issue in the LoRA adapter when running in loop mode. {123895}
*   Resolved a memory leak in QnnHtp that occurred only in the HNRD path. {121286}
*   Resolved a performance regression in ORT-QNN_EP on Hamoa by optimizing debug log handling. {126106}
*   Resolved an accuracy compatibility issue between QNN versions 2.25.x and 2.26.x. Models converted with QNN 2.25.x and run with 2.26.x will now correctly handle the 'offset_alt_sign' parameter. {114958}
*   Resolved an accuracy regression in several models that was introduced by a performance optimization. {120633}
*   Resolved an issue that caused power configuration settings to fail in HNRD. {122227}
*   Resolved failures observed in certain custom auto-generated models. {126211}
*   Updated the Mempool queue size to 10log2, resolving random segmentation faults. {122897}
*   Ops:
*     Fixed an accuracy issue with dynamic sigmoid for uint16 data types, specifically when operating with a dynamic depth of 1. {122923}
* Op:
*   CPU:
*     Added support for 5D inputs for the MatMul Op in xnnpack. {119813}
*     Added support for 5D tensors in Elementwise Comparison Ops. {125481}
*     Improved MatMul Op execution time on Windows. {124990}
*   HTP:
*     Improved accuracy for HandTracking models. {118813}
*     Resolved an accuracy issue with the 16-bit rsqrt Op when using a non-zero output offset. {121777}
* QNN:
*   Fixed an issue where compose_graph fails for Split with BOOL_8 input/output. {120284}
*   Fixed an issue where MQ Pipelined Partitioning fails in context binary stage when used with Network specialization for large LLM models. {125162}
*   Reduced verbosity by suppressing 'Bad quantization: zero scale!' log messages, improving terminal readability. {123651}
* SNPE:
*   DSP:
*     Fixed an issue where multi-SNPE instances could not run concurrently on QCS610. {119378}
* Tool:
*   Added support for 'unsignedsymmetric' in Act & Param Quantizer Schema in Pythonic APIs. {119262}
*   Fixed a failure in qairt-converter when using mixed precision quantization_override. {112259}
*   Fixed a minor bug in qnn-accuracy-debugger that caused it to misread the QNX password during remote execution. {117655}
*   Fixed an issue where models were failing with the error 'Graph Finalize failure' or 'Create From Binary failure'. {120821}
*   Fixed an issue where savecontext was failing due to the inability to create specific operations. This issue was caused by graph prepare failures. {119573}
*   Hypertuner is updated to use qairt-converter instead of qnn-converter. {120124}
*   Optimized the performance of mask2former head under fp16, which improves the speed of inpainting tasks. {120937}
*   AIMET:
*     Fixed a circular import issue in Emitter. {126085}
*     Fixed an issue in LWQWrapper for overridden float tensors while filling missing encodings in the embedded IR flow. {125710}
*   Converter:
*     Added support for constant input in BatchNorm Op. {117961}
*     Added support for constant scalar input in the PyIrConstant tensor. {117953}
*     Added support for QAIRT command-line arguments to specify desired input layout and output shape. {106635}
*     Added validation in match_base_layernorm for layernorm pattern matching. {119491}
*     Enabled constant folding for the OneHot Op. {116983}
*     Ensured that the 'axis' attribute defaults to the last dimension for Softmax and LogSoftmax Ops. {118361}
*     Fixed a bug in the squash reshape optimization. {125214}
*     Fixed an issue where bias was not correctly handled in the GEMM Op during model conversion. {120022}
*     Fixed an issue where per-channel quantization for Conv2D bias was not being performed when using the --use_per_channel_quantization flag. {118497}
*     Fixed the broadcasting failure in the squash_eltwise_into_conv optimization method. {119746}
*     Removed dependency on the Python rich library. {124683}
*     Resolved an accuracy issue observed with the Mask2former head when using FP16 precision. {123656}
*     Updated the Onnx Runtime version from Onnx-1.17.1 to Onnx-1.18.0. {125506}
*     Added axis tracking for ExpandOp. {118618}
*     ONNX:
*       Added support for the ONNX IsNaN Op. {115649}
*       Addressed an issue where input preprocessing encodings were being ignored, leading to unquantized nodes in the converted model. {124609}
*       Fixed an issue in constant folding for the Where Op. {122514}
*       Fixed an issue in static alpha conversion for the PReLU Op. {124487}
*       Fixed an issue in the QNN converter related to batchnorm operator. {115134}
*       Fixed an issue where the converter would fail with an UnboundLocalError when processing subgraphs of the BERT Large model with quantization overrides. {110453}
*       Resolved an issue where the converter was adding unnecessary int16 to int8 conversions for reshape ops in float-fallback mode. {126844}
*       Resolved issues in Slice and Concat Op translation. {123355}
*       Added support for ONNX LSTM with pre-quantized weights and biases. {114557}
*       Fixed constant folding logic in ReduceOp translation. {125098}
*     Relay:
*       Corrected an error in TFLite CustomOp conversion and Conv2D dequantization. {120231}
*   qairt-quantizer:
*     Corrected an issue where the quantizer was not properly quantizing conv-relu patterns in certain network configurations. {123638}
*   qnn-onnx-converter:
*     Fixed an issue in the ONNX converter that caused errors when handling specific axis layouts in the Buffer Op. {125932}
*     Fixed an issue where input names were incorrectly propagated to converters for TFLite/TF models. {126321}
*     Improved accuracy for the XLM-RoBERTa model after quantization. {123142}
*     Resolved an issue where the `align_matmul_ranks` parameter was incorrectly set in the optimizer module. {125315}
*   qnn-profile-viewer:
*     Fixed chrometrace generation error for the aarch64 Windows-based QNN profile viewer. {124295}
*   qnn-tools:
*     Corrected an issue where the QAIRT-Accuracy-Debugger was using act_quantizer_schema instead of param_quantizer_schema. {123950}
*     Fixed an error in QAIRT-Accuracy-Debugger that occurred when interrupted, resolving a missing 'logger' attribute. {123944}
*     Fixed an error that occurred when using the memory_efficient option in the QA accuracy debugger, improving resource management during debugging. {124312}
*     Fixed an issue in the QAIRT CLW tool where the same MSE values were being displayed for all layers. {126079}
*     Fixed an issue that caused incorrect detection of conv-bn-relu and related supergroups for non-ONNX frameworks. {124419}
*     Fixed an issue that caused incorrect framework instance loading for the Accuracy Evaluator tool. {123598}
*     Improved memory management in qairt-converter to prevent 'Killed' errors when converting large models. {124197}
*     Improved the error message in QAIRT-Accuracy-Debugger when the engine_path is missing, providing clearer guidance to the user. {123893}
*     Resolved an issue in QA accuracy debugger that caused CUSTOM_OVERRIDE_GENERATION_FAILURE on concat ops due to file name limitations. {124567}
*     Resolved an issue in QA accuracy debugger that caused errors with concat ops due to input mismatch. {124313}
*     Resolved errors related to graph execution and preparation during inference. {122816}
*     Resolved multiple issues in QAIRT-Accuracy-Debugger related to verifiers, file name size, concat ops input and memory efficient operation. {123966}
*     Resolved QAIRT converter failures related to memory allocation errors for certain models. {122818}
*     Updated QAIRT documentation to remove/update unsupported argument options and clarified the tool's current capabilities. {123956}



2.31.0
======

**1/27/2025**

QNN API version: v2.24.0

Changelog
---------

Features
~~~~~~~~
* Genie:
*   Added dialog debug configuration option. {120242}
*   Added GenieDialog_signal API. {108345}
* SDK:
*   Added a comparator utility to the SDK package. {113366}
*   Added a logging utility to the SDK package. {111111}
* Tool:
*   Converter:
*     Updated command-line argument names for QAIRT Converter and Quantizer to ensure consistency. {113895}
*     Enhanced accuracy verification reports by adding visualizations to the HTML summary. {118519}

Bugs
~~~~
* CPU:
*   Fixed a segmentation fault in the MatMul Op when broadcasting {121586}
* GPU:
*   Fixed execution errors seen in models with FullyConnected layer on QCM2290 {121549}
* Genie:
*   Improved numerical stability during embedding requantization in genie-t2t-run. {121964}
*   Fixed an issue with incorrect value data types in the GenieProfile JSON output. {122753}
*   Fixed a crash when running the lookahead decoder dialog while setting up attention masks and position embeddings. {119566}
* HTP:
*   Fixed an issue where the QnnMem_register API incorrectly returned success instead of an error when provided with an invalid file descriptor. {121838}
*   Fixed an issue where model deserialization with weight sharing failed. {115559}
*   Optimized Transpose operation performance by removing unnecessary InputTilingGuards and adjusting tiling behavior to avoid extra operations. These changes address a performance regression observed with certain model configurations. {122718}
*   Fixed race conditions that could occur when loading and unloading the HTP library during concurrent use of online and offline prepared models. {122853}
*   Improved responsiveness of model cancellation during concurrent operations by optimizing context registry. {119785}
*   Fixed an issue where the AlignSliceNoTile operator was not using the correct tiling of its operand, which could lead to incorrect model execution. {121833}
* Op:
*   HTP:
*     Fixed an issue where the validator was using the incorrect offset value for UInt16 input during the Tanh Op, resulting in a unit test failure. {119203}
*     Fixed an issue in the dynamic reshape validation function where a dimension mismatch occurred when the reshape dimensions had a size of 0 {119475}
*     Fixed an issue that prevented correct tiling of merged hardswish and convolution Ops when a Slice_shape Op was also present. {120254}
* QNN:
*   Fixed a verification failure for custom InceptionV4 models with NHWC input format on HTP/CPU/GPU backends by regenerating golden files with the correct input format. {115643}
* SDK:
*   Added support for compiling and executing models larger than 2GB on v73 targets. {117846}
* Tool:
*   Fixed a crash issue in qnn-throughput-net-run when exiting due to an exception. {119822}
*   Converter:
*     Fixed an issue that prevented RMSNorm from being folded when the epsilon tensor had more than one element. {120828}
*     Corrected GRU Op optimization to properly apply encoding information. {118950}
*     Improved handling of empty inputs for GRU and LSTM Ops. {118517}
*     Updated ONNX framework version information in QNN documentation. {121305}
*     Fixed an issue that caused inconsistent input tensor order when using the custom_io option due to the use of std::set. {118065}
*     Fixed an issue where activation encodings were not correctly computed as unsigned symmetric when using the '--act_quantizer_schema unsignedsymmetric' option. {121156}
*     Fixed an issue where ReLU symmetric external encoding was not being applied correctly. {121106}
*     Fixed an issue where Batchnorm weights had an incorrect datatype when using asymmetric overridden encoding. {122617}
*     Fixed an issue with the 'axis' value when merging a reshape_transpose_reshape pattern into a channelshuffle Op. {119408}
*     ONNX:
*       Fixed an issue to handle static first input to Gemm Op. {119669}
*     TFLite:
*       Added support for int64 bias in TFLite Conv2D Op. {117443}
*       Added a pattern to dequantize constant expressions. {119330}
*   qnn-accuracy-debugger:
*     Fixed issue to filter unwanted entries in tensor mapping. {110602}
*   qnn-context-binary-generator:
*     LoRA adapters are now placed in the directory specified by the "--output_dir" option, along with serialized binaries. {112277}
*   qnn-net-run:
*     Fixed the logic for casting input file data from float to unsigned or signed integer types. {114988}
*   qnn-net-run/throughput-net-run:
*     Enhanced console output during execution with more detailed information and improved error reporting. {111802}
*   qnn-tflite-converter:
*     Fixed an issue where the dequantize node had the wrong datatype in quantized TFLite models. {119293}
*   qnn-throughput-net-run:
*     Fixed a memory leak by optimizing memory allocations and deallocations. {120806}
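
The qnn-net-run entry above concerns casting float input data to signed or unsigned integer types. The notes do not state which rounding or overflow rule the tool applies, so the following is only a minimal sketch of one common convention (round to nearest, then saturate to the target range). The function name ``cast_float_to_int`` and its parameters are hypothetical and are not part of any QNN tool.

```python
def cast_float_to_int(values, bits=8, signed=False):
    """Round floats to the nearest integer and saturate to the target range.

    Assumption: saturation (clamping) is the desired overflow behavior.
    This is an illustration only, not the qnn-net-run implementation.
    """
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    # Python's round() rounds halves to even; clamping afterwards makes
    # out-of-range values saturate instead of wrapping around.
    return [min(max(round(v), lo), hi) for v in values]


unsigned_result = cast_float_to_int([-3.7, 0.4, 127.6, 300.0], bits=8, signed=False)
signed_result = cast_float_to_int([-200.0, -3.7, 0.4, 127.6], bits=8, signed=True)
print(unsigned_result)
print(signed_result)
```

Clamping before conversion avoids the wrap-around that a plain truncating cast can produce for negative or oversized values.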

Known Issues
~~~~~~~~~~~~
* HTP:
*   Potential Performance Changes in Certain Models. This release includes key updates to the HTP backend. While these changes are designed to generally improve performance, some models may exhibit altered performance characteristics. We are actively working to fine-tune these updates, and we expect to see gradual improvements to enable consistent performance across all models in upcoming releases. {120672}



2.30.0
======

**12/31/2024**

QNN API version: v2.23.0

Changelog
---------

Features
~~~~~~~~
* CPU:
*   Added support for the Mistral model in Transformer Composer. {108967}
*   Added support for Cerebras models in Transformer Composer. {108973}
*   Added dynamic dims support for Reduce Sum Op {119871}
*   Added bias support for matmul int8. {116143}
*   Added support for the Gemma-2B model in Transformer Composer. {108974}
*   Op:
*     Added dynamic dims support for elementwise unary. {111527}
*     Added Col2Im Op support {94453}
* Genie:
*   Added GenieDialog_setStopSequence API to allow updating the stop sequence configuration between dialog queries. {119225}
*   Added support for dialog custom sampler implementations. {114278}
*   Added GenieProfile.h APIs. {98351}
*   Added GENIE_DIALOG_SENTENCE_REWIND sentence code option. {112310}
* HTP:
*   Updated HTP Backend Extensions to support PGE. {90047}
*   Improved the inference performance of the resize bilinear Op for a specific configuration. {121624}
* Op:
*   HTP:
*     Added support for a8w8 with Lorav2 for specific use cases {119750}
* OpDef:
*   Added support for dynamic shapes in the ExpandDims Op. {106391}
*   Added dynamic shape support for ReduceSum Op. {116505}
*   Added support for negative begin and end index for StridedSlice Op {115216}
*   Added dynamic shape support for Squeeze Op. {106395}
* Tool:
*   Converter:
*     Added support for optional I/O tensors using the custom I/O configuration file by setting 'Optional' to True. {96240}
*     Enhanced ONNX shape inference capabilities in the QNN Converter using symbolic shape inference {68671}
*     Added framework tracing validation to check if all ops and tensors from framework model are traced. {113589}
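
The StridedSlice entry above mentions negative begin and end indices. As a rough illustration, the sketch below assumes the familiar Python/NumPy convention in which a negative index counts back from the end of the dimension. The QNN Op definition is the authority on the actual semantics; the helper names here are hypothetical, and only positive strides are handled.

```python
def normalize_index(index, dim_size):
    """Map a possibly negative begin/end index onto the range [0, dim_size]."""
    if index < 0:
        index += dim_size  # e.g. -1 refers to the last element
    return min(max(index, 0), dim_size)


def strided_slice_1d(data, begin, end, stride=1):
    """Slice a 1-D list with a positive stride after normalizing the indices."""
    if stride <= 0:
        raise ValueError("this sketch only handles positive strides")
    n = len(data)
    b, e = normalize_index(begin, n), normalize_index(end, n)
    return [data[i] for i in range(b, e, stride)]


data = [10, 20, 30, 40, 50]
print(strided_slice_1d(data, -4, -1))    # matches data[-4:-1]
print(strided_slice_1d(data, 1, -1, 2))  # matches data[1:-1:2]
```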

Bugs
~~~~
* CPU:
*   Corrected the output index for the MultiClass NMS Op. {119428}
* GPU:
*   Resolved an issue where setting context priority configuration was not applied correctly when loading a serialized graph. {119800}
* Genie:
*   Resolved an issue where reloading an LLM model in the same process on SM8750 resulted in an NPU memory registration error. {121114}
* HTP:
*   Resolved a memory leak issue during backend initialization. {116963}
*   Resolved an issue that caused residual DMA memory allocation in LLM applications after model execution. {119791}
* Op:
*   HTP:
*     Resolved an issue with GatherOp that caused failures in ORT UTs by correcting data access to use the correct axis. {119202}
*     Resolved a serialization failure issue when using concat with mixed input types in PDP models, caused by unaligned space_rearrange. {118483}
*     Improved the inference performance of the RMSNorm Op. {118204}
*     Resolved an error in the QNN context binary generator when using Depthwise Convolution followed by Pointwise convolution with Lora. {119098}
*     Resolved an accuracy regression issue in the LpNormalization layer on the NPU, which was caused by an overflow/saturation issue. {118927}
*     Improved the accuracy of RMSNorm with depth not divisible by 32 {117009}
*     Resolved an accuracy issue with RMSNorm in w8a8 quantization by adding support for the RMSNorm pattern, which also improved the cosine similarity and MSE/MAE metrics. {115333}
*     Improved the inference performance of the AvgPool Op. {113285}
*     Resolved an accuracy issue with Concat Ops when nested concats are connected to the output of a graph. {95253}
* QNN:
*   Resolved an issue where graph finalization could fail when partitioning models containing unsupported operations, such as GroupQueryAttention, for CPU execution. {118021}
* Tool:
*   Converter:
*     Resolved an accuracy issue observed with the Mask2former head when using FP16 precision. {119782}
*     Converter now correctly preserves input tensor datatypes based on the 'preserve_io datatype' command-line option. {110910}
*     Resolved a converter issue that caused a failure when using float overrides along with float fallback option. {109415}
*     ONNX:
*       Resolved an issue in framework tracing that caused some tensors to be missed. {114454}
*     TensorFlow:
*       Resolved an issue in GenericBatchNorm node fusion. {109556}



2.29.0
======

**11/30/2024**

QNN API version: v2.22.0

Changelog
---------

Features
~~~~~~~~
* Added 16KB alignment support for Android libraries in QAIRT SDK to enhance memory management. {118369}
* Genie:
*   Added support for multistream embedding to text dialog. {116357}
* Op:
*   HTP:
*     Added support for fp16 ResizeTrilinear Op. {90078}
* OpDef:
*   Added support for dynamic shapes in the ExpandDims Op. {106391}

Bugs
~~~~
* Timestamps will now be included in logs generated by the QNN backend. {107477}
* Genie:
*   Fixed handling of rope-theta and rope-scaling configuration {118935}
*   Improved prompt processing time for SSD dialogs. {118412}
* HTP:
*   Fixed stability issue in context management with mutex protection for thread group. {115281}
*   Fixed an accuracy issue that occurred for Gridsample5d when c_in = 1 {117159}
*   Resolved a race condition during thread group creation, preventing thread exhaustion under heavy system load. {113867}
*   Reverted a change that caused an accuracy issue with quantized 16-bit LayerNorm. {114681}
*   Fixed an accuracy issue related to the transpose convolution operation. {99685}
*   Fixed the TCM migration logic to ensure tensor properties are correctly propagated from the producer op to the consumer op. {107562}
* Op:
*   HTP:
*     Fixed a performance regression in StridedSlice Op {111501}
*     Fixed LayerNorm accuracy regression on DML after reshape optimization. {114361}
* Tool:
*   Converter:
*     Improved handling of the 16-bit BatchNorm quantization case. {112194}
*     Fixed accuracy issue related to models with Reshape to 6D Ops {112510}
*     Fixed a segmentation fault in the nice-vit model conversion process, enhancing stability during model conversions. {116716}
*     Fixed the ONNX converter's incorrect quantization setting for the third input of the ScatterElements Op. {52434}
*     Fixed an issue in the validation process for dynamic shaped ONNX models {114349}
*     Updated the axis tracking logic for the RoiAlign Op. {111504}
*     Fixed an issue in the Converter to ensure correct assignment of the graph.preserve_io_datatype_passed and graph.preserve_io_datatype parameters. {108568}
*     Fixed a bug in the quantization of Elementwise Binary Ops when the output is non-quantizable and one input has a quantized data type while the other input is float32. {111864}
*     Int64 inputs are now mapped to int32 inputs without inserting an extra cast. {114926}
*     Fixed a bug in the ElementwiseProduct optimization. {118492}
*     ONNX:
*       Fixed a segmentation fault issue that occurred during conversion of certain models. {111793}
*   Quantizer:
*     Added a check to honor the asymmetric 16-bit override for RMSNorm Op, ensuring it remains asymmetric instead of being modified to symmetric. This change improves accuracy compared to the simulation that generated the overrides. {117381}
*   qnn-accuracy-debugger:
*     Fixed a bug in qnn-accuracy-debugger when sanitizing tensor names to follow the converter's node naming conventions. {114802}
*   qnn-net-run:
*     Fixed an uninitialized variable issue so that dspfreq is set to the highest level. {117573}

Known Issues
~~~~~~~~~~~~
* GPU:
*   Inference failures observed in models with BatchNorm operations when using large dimensions on specific target devices. {113878}



2.28.0
======

**10/31/2024**

QNN API version: v2.21.0

Changelog
---------

Features
~~~~~~~~
* QNN Core: QNN context binary inspection structures for binary info and graph have been updated to version 3. Applications using QnnSystemContext_getBinaryInfo() and QnnSystemContext_getMetadata() must check the version field before unpacking the structures for the specific version. {105510}
* CPU:
*   Updated the MatMul Op to support dynamic dimensions. {99967}
* Genie:
*   Added RopeScaling config in Genie. {114027}
*   Added SSD support for GenieDialog_embeddingQuery dialogs. {112355}
*   Added alibi and absolute positional encoding support in Genie. {111986}
*   Enabled Embedding API Support {110544}
*   Added GenieDialog_save and GenieDialog_restore. {101610}
* Tool:
*   Quantizer:
*     Added support for Int32 Quantization Override {102811}
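
The QNN Core entry above asks applications to check the version field before unpacking the binary info structures. The pattern can be sketched as follows. This is an illustration of the version-check idea only: the classes ``BinaryInfoV1``, ``BinaryInfoV3``, ``BinaryInfo`` and the function ``count_graphs`` are hypothetical stand-ins, and the real structures and field names are defined in the QNN System headers.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for versioned payloads; the real structures
# have different names and fields.
@dataclass
class BinaryInfoV1:
    num_graphs: int


@dataclass
class BinaryInfoV3:
    num_graphs: int
    graph_names: list


@dataclass
class BinaryInfo:
    version: int     # discriminator that must be checked first
    payload: object  # only meaningful for the matching version


def count_graphs(info):
    """Read a field only after confirming which layout the payload uses."""
    layouts = {1: BinaryInfoV1, 3: BinaryInfoV3}
    expected = layouts.get(info.version)
    if expected is None or not isinstance(info.payload, expected):
        # Refuse to guess: an unknown version may have a different layout.
        raise ValueError(f"unsupported binary info version: {info.version}")
    return info.payload.num_graphs


info = BinaryInfo(version=3, payload=BinaryInfoV3(num_graphs=2, graph_names=["a", "b"]))
print(count_graphs(info))
```

Rejecting unknown versions explicitly keeps an application from misreading a newer layout as an older one.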

Bugs
~~~~
* Genie:
*   Fixed memory leak in model loading when setting use-mmap to false in genie config. {112619}
*   Fixed a memory leak issue that occurred when using GenieDialog_free {113599}
*   Fixed a stability issue during repeated LoRA adapter application and query operations {113058}
* HTP:
*   Fixed a context generation failure for a customer-specific prediction model in FP16. {111220}
*   Fixed an accuracy issue for a customer-specific model with the Conv2D Op in FP16 mode. {109130}
*   Fixed power config ID leak when using SNPE and QNN together that was causing stability issues. {112976}
* KI:
*   DLBC weights are not supported on mobile platforms. {98793}
* Op:
*   HTP:
*     Optimized performance for some 5D transposes. {110124}
*     Fixed a bug in Gather Ops causing failure when using negative indices. {101700}
* OpDef:
*   Fixed StridedSlice OpValidation for dynamic shape {113974}
* Tool:
*   Converter:
*     Fixed RMSNorm fusion for models where the topological order of nodes differs from their sequential order. {114174}
*     Fixed a bug that prevented the larger LayerNorm pattern from matching, causing a smaller RMSNorm pattern to be matched instead. {114000}
*     Fixed a shape mismatch error for the Concat Op that occurred under specific conditions involving continuous Concat Ops and nontrivial layouts. {106276}
*     ONNX:
*       Fixed model conversion failures for models with Concat and GridSample operations with varying input layouts. {83315}
*     Relay:
*       Added support for Quantized BatchMatmul Op in TFLite Converter. {103242}
* Tools:
*   Fixed GatherElement accuracy issue when the input index includes negative numbers. {108251}




2.27.0
======

**9/30/2024**

QNN API version: v2.20.0


Changelog
---------

Features
~~~~~~~~
* Op:
*   HTP:
*     Improved the accuracy of the INT4 quantized MatMul operation. {102436}
* OpDef:
*   Added dynamic shape support for ElementWiseNeuron op. {98225}
* SDK:
*   License:
*     Separated the license restrictions section into two parts: one for general restrictions and another for a prohibited items list. {110808}
* Tool:
*   Converter:
*     Added converter support for the QLinearMatMul Op. {108360}
*     Added converter support for the QLinearConv Op. {108055}
*     Added support for generating a model summary {68668}
*     ONNX:
*       Added support for pattern matching matmul with bias in MHA to SHA conversion {108280}
*       Added translation for the If Op. {106903}
*   onnx-simplifier:
*     Added the following HTP-specific post-quantization adaptations:
*     1. Output transposed keycache: Avoids repetitive transpose of key state tensors.
*     2. Output new key value only: Reduces memory traffic. {100771}
*     API:
*       Added optional arguments to the simplify API. {108865}
*       Added an optional `debug` argument to the simplify API. {111632}

Bugs
~~~~
* HTP:
*   Fixed a bug that caused unexpected DMA buffer size increase when loading multiple QNN models. {108359}
*   Fixed issue with preparing Sigmoid Op when depth is set to 1 {111805}
*   Fixed a failure that occurred during HTP Op validation for tensor parameters. {108651}
*   Fixed an issue where FP16 was not supported during online prepare in some corner cases. {108611}
*   Fixed an issue where the QCM6490 platform was unable to enter sleep mode after model execution using the HTP runtime. {107650}
*   Introduced the 'weights_packing' custom graph configuration to reduce the context binary size of the UNet model. {109039}
*   Enabled weight sharing across 64 graphs (from 32) {110963}
*   Fixed issue with loading HNRD in browser sandbox mode. {111276}
* Op:
*   CPU:
*     Added 6D elementwise Ops. {90840}
*   HTP:
*     Improved support for 4D gridsample with large height. {109185}
*     Fixed an accuracy issue for Quantize -> Dequantize sequences with zero-point uint8 quantization. {106398}
*     Improved softmax performance by optimizing the reshape rule. {105726}
*     Fixed a performance issue in Depthwise Convolution when it is the first layer of the model and quantized to int8. {105848}
*     Fixed an accuracy issue for a specific floating-point convolution configuration. {103137}
*     Fixed bugs that prevented the Conv3D and ConvTranspose3D operations from working correctly in the QNN EP. {99686}
* OpDef:
*   Fixed a bug in the L2Norm operation that prevented models using a tensor for the "axes" parameter from being converted correctly. {103018}
* Tool:
*   Converter:
*     Fixed GELU fusion for models where the topological order of nodes differs from their sequential order. {107651}
*     Fixed an issue where L2Norm had the wrong axis after sequence matching. {106280}
*     Fixed a bug where quantization overrides for LSTM/GRU Ops were not propagated correctly during Op expansion. {103668}
*     Fixed an issue where the conv+bn fusion was not being disabled when the conv node was the graph output. {107011}
*     Fixed a bug in validation of dynamic shaped ONNX models. {97108}
*     Fixed the issue of input axis format for the groupnorm pattern. {101786}
*     Added support for 6D ReshapeOp, ElementwiseUnaryOp, ElementwiseOp, ReduceOp, GatherOp, and TileOp. {102597}
*     Fixed an issue where the Topk Op's K value was invalid. {108738}
*     Enabled a new pattern for fusing GroupNorm when the input to the pattern is 4D and the output is 3D. {111503}
*     ONNX:
*       Fixed Einsum accuracy issue. {106632}
*   Quantizer:
*     Fixed a bug in per-channel bias with float_fallback. {105658}
*   qnn-accuracy-debugger:
*     Resolved subgraph extraction failures affecting certain models. {102417}

Known Issues
~~~~~~~~~~~~
* Tool:
*   Converter:
*     ONNX:
*       Shape mismatch errors might occur in models that have consecutive Concat operations where at least one input buffer is nontrivial, and in models that contain the specific sequence Reshape -> Transpose -> Reshape. {83315}



2.26.0
======

**8/31/2024**

QNN API version: v2.19.0


Changelog
---------

Features
~~~~~~~~
* DSP:
*   Upgraded Hexnnv2 to DSPCore1.53.0 {106900}
*   Added default support for new LSTM optional inputs and parameters. {84456}
* LPAI:
*   Added Graph SetConfig and GetProperty functions. {108224}
* Op:
*   GPU:
*     Added support for ScatterElements op. {105612}
*     Added support for ElementWiseSign and the SIGN parameter in ElementWiseUnary op. {96752}
* OpDef:
*   Added dynamic shape support for Softmax op. {98596}
*   Added dynamic shape support for Transpose op. {96464}
* Tool:
*   Converter:
*     Added support for the antialias attribute of the ONNX Resize operator with linear interpolation mode. This is currently supported only with 4D inputs. {91793}
*     ONNX:
*       Added functionality to output only the last logit. {100770}
*       Enabled support for additional Einsum equations. {106670}
*       Enhanced model conversion efficiency by eliminating superfluous Transpose nodes around Elementwise Op {106452}
*       Added Conversion support for "largest" attribute in TopK Op {98063}
*   qairt-accuracy-debugger:
*     Added documentation for the tool. {101018}
*   qnn-net-run:
*     Defined return codes for improved error handling and debugging. {51791}

Bugs
~~~~
* Improved convolution performance when the output step size is very small. {105298}
* API:
*   HTP:
*     Fixed a missing nullptr check for optional tensors in the QnnBackend_validateOpConfig API. {107560}
* Core:
*   Improved memory-mapped user buffer registration API to handle duplicate address/offset gracefully, particularly in recurrent networks. {106104}
*   Fixed validation in TransposeConv3D Op definition for scenarios without bias. {105775}
* DSP:
*   Fixed a graph finalization failure for ElementWiseNeuron. {102401}
* HTP:
*   Prevented setting of certain context configs for the QnnContext_createFromBinaryListAsync API. {108200}
*   Reduced RPC delay during mapping and un-mapping to reserved space for I/O, improving performance. {104994}
*   Fixed an issue with unaligned space_rearrange. {102552}
*   Improved split lm_head layer performance by optimizing convolution. {101471}
*   Fixed a failure in HTP op validation for tensor parameters. {108651}
*   Fixed GroupNorm Ops to handle optional input tensor default values. {103928}
* MHA2SHA:
*   Enhanced LoRA capture with stricter conditional checking {106718}
* Op:
*   CPU:
*     Added 6D elementwise Ops. {90840}
*   HTP:
*     Added bool8 support for Tile Op. {105915}
*     Fixed a failure during context binary generation for Image Embedding models {106954}
*     Optimized the performance of A16W16 MatMul. {104441}
* QNN:
*   Fixed an accuracy issue with the Mul Op in FP16 that affected some models {102413}
* SDK:
*   Updated the Python dependency script to support both Python 3.8 and Python 3.10. {107182}
* Tool:
*   Expanded subprocess timeout in accuracy debugger, facilitating process completion for larger models. {108194}
*   Converter:
*     Fixed a bug in applying quantization overrides when a RMSNorm pattern is folded into RMSNorm QNN Operator {108587}
*     Added support for string datatype in customop {90245}
*     Mapped the cast op to constant op in case of static input to the cast op. {105496}
*     Fixed a converter failure due to a segfault in onnx simplifier {108191}
*     ONNX:
*       Fixed conversion failure due to axis tracking for specific models with qairt-converter. {106296}
*     TFLite:
*       Fixed multiple Converter and Quantizer issues for the FullyConnected Op in QNN TFLite Converter {102333}
*   Quantizer:
*     Fixed a bug in propagating quantization encodings around reshape ops inserted during optimizations {105883}
*     Qairt:
*       Fixed a bug in applying quantization overrides for static input tensors of data invariant operators {108077}
*   qairt-accuracy-debugger:
*     Fixed an issue while passing device ID to the debugger for AIC runtime. {104795}
*     Fixed an issue when using --add_layer_outputs with qnn as the executor type. {107151}
*   qnn-accuracy-debugger:
*     Fixed an issue where the quant_checker was failing for the BERT_Large_Packed_Compressed_Mask model. {103062}
*     Fixed an issue that prevented the generation of the layerwise.csv file for Densenet169 and ViT models. {102761}
*     Fixed an issue where the float_bias_bitwidth parameter was not being properly passed to the converter for fp16 precision. {106483}

Known Issues
~~~~~~~~~~~~
* Tool:
*   Converter:
*     ONNX:
*       Fixed the axis tracking logic for multiple-input ops like Concat and Elementwise_binary/Elementwise_ternary. Known issues due to this fix: (1) a shape mismatch in the Concat op when several continuous Concat ops can be folded into one and at least one of the Concat op's input buffers is nontrivial; (2) a shape mismatch when there is a node sequence (Reshape(4D->6D) -> Transpose -> Reshape(6D->4D)) that can be merged into a DepthToSpace op. {83315}



2.25.0
======

**7/31/2024**

QNN API version: v2.18.0


Changelog
---------

Features
~~~~~~~~
* CPU:
*   Add support for RMSNorm op {96059}
* GPU:
*   Support QNN_MEM_TYPE_DMA_BUF memory type {87377}
* Op:
*   GPU:
*     Add support for RMSNorm op. {96640}
* OpDef:
*   Added Op definition for RmsNorm. {96058}
* Tool:
*   Converter:
*     Optimized the implementation of expand LSTM Op structure in the converter. {88467}
*     Added fix to remove identity patterns emerging from a sequence of Reshape and Transpose ops {100733}
*   Quantizer:
*     Fixed accuracy drop at output of Cast Op (INT32 -> uFxp8) by inserting Quantize (FP32 -> uFxp8) Op after Cast (INT -> FP32). {92014}
*   qnn-net-run:
*     Added support for creating input and output tensors with DMA Buffer memory. {88150}
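
Several entries above add RMSNorm support. For readers unfamiliar with the operation, the sketch below shows the commonly used formulation, in which each element is divided by the root mean square of the input and then scaled by a learned weight. It is a plain-Python reference under stated assumptions: the parameter names ``gamma`` and ``epsilon`` are illustrative, normalization is over a single 1-D vector, and the QNN Op definition may specify additional inputs or axes.

```python
import math


def rms_norm(x, gamma, epsilon=1e-6):
    """Reference RMSNorm over a 1-D vector: x / sqrt(mean(x^2) + eps) * gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + epsilon)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


out = rms_norm([3.0, 4.0], gamma=[1.0, 1.0], epsilon=0.0)
print(out)
```

Unlike LayerNorm, this formulation does not subtract the mean, which is why converters treat the two patterns separately.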

Bugs
~~~~
* Fixed dynamic convolution accuracy issue by optimizing the rules. {98776}
* Fixed fp16 convolution accuracy issue by optimizing the rules to let it enter im2col impl instead of reference code. {98313}
* Improved VGG model performance at the cost of increased init/deinit time. {98714}
* API:
*   Corrected a syntax error in the QNN_HTP_CONTEXT_CUSTOM_CONFIG_INIT macro. {105497}
*   HTP:
*     Fixed API Compliance failure for unmapped memhandle, ensuring proper memory mapping. {104185}
* CPU:
*   Add int8 support for elementwise Elu {96855}
*   Fix maxPool2D parameter selection {100268}
* GPU:
*   Allow context priority config option to be set while loading context binary. {97780}
* HTA:
*   Add validator to filter unsupported Elementwise Op parameters. {102580}
* HTP:
*   Fixed issue where initial batch size was overwritten to 0 in old libQnnHtp.dll and new Windows driver use case. {104118}
*   Disabled the compress_weights graph option, which was previously enabled by default. {101594}
*   Removed an unnecessary warning log. {104062}
* Op:
*   HTP:
*     Fixed an issue in the support update that caused the loss of functionality for converting data types from float16 to int32 and float16 to float32, impacting data type conversions in certain operations. {103148}
*     Fixed accuracy bug with StridedSlice where height and width are sliced. {100958}
* Tool:
*   Converter:
*     Added an input length conditional judgment before checking the second and third inputs of GroupNorm op. {90777}
*     Fixed broadcasting error for constant input to Quantize/Dequantize Linear ONNX Ops, ensuring correct input handling. {101731}
*     Fixed issue where some models with LSTM and NTF format input failed to convert. {99233}
*     Fixed input shape mismatch issue of LayerNorm op. 1) adjust_layernorm_buffers: Change data_axis_formats[0] if input[0] buffer.axis_format changes. 2) axes_to_spatial_first_order: Use data_axis_formats as a reference instead of output_axis_format. {92998}
*     Fixed an issue in the calculation of padding for deconv {103543}
*     ONNX:
*       Mapped RMSNorm pattern in ONNX networks to a QNN RMSNorm Op {104312}
*       Added fix for name conflict in naming policy {90651}
*     Relay:
*       Added new Op Support for BatchToSpace and SpaceToBatch Ops to the TFLite Converter. {100933}
*   qnn-accuracy-debugger:
*     Enabled CPU runtime in inference engine and handled architecture for PyTorch. {97304}
*   qnn-tensorflow-converter:
*     Fixed batchnorm sequence matching issue to align with instance norm/layer norm. {100590}
*   qnn-throughput-net-run:
*     Fixed error message related to opening QcSoCServiceUtils.dll in Android builds {100959}



2.24.0
======

**6/30/2024**

QNN API version: v2.17.0


Changelog
---------

Features
~~~~~~~~
* Added onnx-simplifier and onnx-runtime versions to sdk.yaml {95974}
* Improved performance of FP16 NMS for single batch and single class. {93288}
* API:
*   Added QnnContext_createFromBinaryListAsync. {94719}
*   Introduced two new APIs and five new tensor types for updating the data of static tensors and the quantization encodings of activation tensors. {90410}
* BatchToSpace:
*   CPU:
*     Added support for the optional crop parameter. {77952}
* CPU:
*   Add NodeFusion for Elementwise Neuron {99755}
*   Fix integer rounding in crop_and_resize op {100490}
* GPU:
*   Improve memory footprint of a finalized graph. {25508}
*   Enabled GPU Runtime for Windows platform on Hamoa. {100913}
* Op:
*   HTP:
*     Updated Gather op to support per-channel quantized tensor {90075}
* Tool:
*   Converters:
*     ONNX:
*       Enabled support for GroupNormalization opset version 18. {99796}

Bugs
~~~~
* Corrected behaviour of QUInt16 LayerNorm operation when the Gamma tensor uses QUInt16 datatype. {100283}
* Resolved memory violations in the kernel classes in QNN GPU. {98721}
* CPU:
*   Made destroy sequence thread safe {98101}
* GPU:
*   Fixed inference failures in models with the ReduceMean Op. {95811}
* HTP:
*   Fixed bugs related to VA reservation {101238}
*   Fixed issue during process-exit stage {99620}
*   Fixed accuracy issue of fp16 depthwise convs and TransConv2d for v73 and above {96108}
* KI:
*   OpDef:
*     HTP:
*       Bug in HTP GatherElement datatype support {100515}
* Op:
*   CPU:
*     Support S_FIXED_32 bias in batchnorm op {82940}
*   HTP:
*     Fixed CreateSparse op config validation failure {100298}
*     Optimized the init time of FP16 Conv with large height. {97885}
*     Optimized the inference time of FP16 Conv with large height. {88937}
* SDK:
*   Fix HNRD failure with graph containing null type tensor {97758}
*   Add version information to libraries and executable files {97358}
* Tool:
*   Converter:
*     ONNX:
*       Promoted 0D to 1D to fix conversion issue in Squeeze op {87224}
*       Fixed conversion issue for TransposeConv1d op {87204}
*       Added support for negative paddings using CropAndResize op. {90661}
*       Fixed issue in ThresholdedRelu op {100605}
*       Added fix in transpose squashing. {96277}
*       Added fix in mixed precision quantization. {90671}
*       Added fix in reshape folding. {95929}
*       Fixed an incorrect mapping of the RMSNorm pattern to the LayerNorm QNN Op. {101236}
*   Converters:
*     Fixed regression due to redundant transpose ops introduced during graph optimizations. {95994}
*     Relay:
*       Fixed a TFLite conversion failure when populating quantization encodings for the L2Norm Op. {98699}
*       Fixed a TFLite conversion failure when populating quantization encodings for the Softmax and DepthwiseConv Ops. {98564}
*   qairt-converter:
*     Added support for boolean and int64 in Dump IO Config Template {97578}
*   qairt-quantizer:
*     Added support for Unsigned Symmetric in Param and Act Quantizer Schema Options {98393}
*   qnn-accuracy-debugger:
*     Fixed issue to handle qnn list format for multiple inputs in tool {98895}
*     Fixed issue to filter unwanted entries in tensor mapping. {85953}
*   snpe-accuracy-debugger:
*     Fixed issue related to wrong variable. {99205}
* Tools:
*   Converters:
*     Fixed an issue in Converter to allow for the Graph input datatype to be correctly updated to FP16 from FP32. Converter is expected to generate FP32 graph {94136}
*     Fixed squash_identity logic for Python IR graph where all the consumer nodes of a parent node of squashed node will be updated with correct output buffer name. {98891}
*   Quantizer:
*     Added fix for Segmentation Fault issue when using algorithms cle flag {99230}
*   qnn-accuracy-debugger:
*     Handled argument checks for snooping {97297}



2.23.0
======

**5/31/2024**

QNN API version: v2.16.0


Changelog
---------

Features
~~~~~~~~
* Added documentation on the usage of the QnnMem API for the QNN GPU backend. {92973}
* API:
*   Added QNN_PROFILE_EVENTUNIT_NONE. {94459}
* CPU:
*   Added 32-bit bias support for InstanceNorm Op {96361}
*   Add 32bit bias support for quantized models for InstanceNorm, LayerNorm and BatchNorm Op {94777}
* HTP:
*   Optimized performance on Lenovo's VAE encoder and decoder {94425}
*   Added INT64 support for Cast op {64595}
*   Fix for I/O memory registration failure {93736}
*   Optimize performance of some Gen AI models {94422}
* Op:
*   CPU:
*     Added 32-bit bias support in BatchNorm {96379}
*     Added 32-bit bias support in LayerNorm {96373}
*     Add support for Buffer Op {88410}
*     DeformConv2D op support {46880}
*   GPU:
*     Support ElementwiseFloorDiv op and ElementWiseBinary Op with FloorDiv param. {63986}
*   HTP:
*     Updated Convert op to support QUINT8 per-width quantized -> QUINT16 per-tensor quantized {90076}
*     Updated Gather op to support per-channel quantized tensor {90075}
* OpDef:
*   Added op definition for CombinedNMS. {92037}
*   Updated Buffer Op definition for multi-frame support. {96065}
*   HTP:
*     Updated op definition for ElementwiseUnary Sin and Cos to support FP {96306}
* SDK:
*   Added support for DLC in QNN SDK for windows. {95862}
*   Updated the default QNX logger in QNN. {80637}
*   Introduction of the Genie SDK add-on which replaces the QNN Gen AI Transformer SDK add-on. Please see the ${SDK_ROOT}/doc/Genie/ SDK documentation for more details. {99680}
* Tool:
*   Converter:
*     - Update clear help message for argument "--enable_framework_trace".
*     - Disable framework trace for converters other than the ONNX converter. {94461}
*     - Add implementation for framework op tracking for graph quantization optimization stage {94372}
*     ONNX:
*       Added Unit Tests for ThresholdedRelu op {92470}
*   qnn-accuracy-debugger:
*     Added user documentation for quant checker {85830}
*   qnn-context-binary-utility:
*     Added support to write all quantization parameters into json file. {80026}
* [Core]:
*   Added support for traceinfo in dlc {94379}

Bugs
~~~~
* Fixed documentation bug: C API reference not properly hooked up in table of contents. {98950}
* Fix GridSample not fitting in TCM. {97245}
* HTP:
*   Resolve the potential memory leaks in termination stage {96986}
*   Fixed memory leak that occurs during detailed profiling {98212}
*   Fixed memory leak in user driver {95881}
* Op:
*   CPU:
*     Set default params in Matmul {97239}
*     Fixed zero division in Rsqrt due to quantization of small float values {92092}
* OpDef:
*   HTP:
*     Updated op definition for ElementwiseUnary Abs to support 5D FP and Quant. {94800}
* SDK:
*   Add Python 3.10 support for check-python-dependency on Windows {96535}
* TOOLS:
*   CONVERTERS:
*     Fix the small bug in transpose axis format {97027}
* Tool:
*   Converter:
*     ONNX:
*       Fixed conditions for computing pad sizes {95930}
*   Converters:
*     Fixed issue observed with Matmul optimization when input buffer axis format matches op data axis format. {96047}
*     Relay:
*       Fixed a tflite conversion failure by adding dequantize reduce pattern pass {94022}
*   qnn-net-run:
*     Fixed aborting of tool when it is run on device with 4 or less cores. {98372}
*     Disable optimizations on iterator variable to populate input tensors correctly. {96382}
* Tools:
*   Converter:
*     Fixed the small bug in op graph optimization {96941}
*   Converters:
*     Onnx:
*       Fixed bug in PreLU Op translation when alpha is shared by multiple Ops {97107}



2.22.0
======

**4/30/2024**

QNN API version: v2.15.0


Changelog
---------

Features
~~~~~~~~
* CPU:
*   Fixed memory leaks and heap buffer overflows in QNN CPU {76027}
* Op:
*   CPU:
*     Fixed BatchSplit to take numRois instead of total boxes in AxisAlignedBboxTransform. {92569}
*     Add support for ReduceSumSquare op. {91308}
*   GPU:
*     Support QNN_DATATYPE_UFIXED_POINT_4 in ElementwiseSelect op. {89047}
*     Support Concat op with input rank = 5 and axis =1. {90877}
*     Support QNN_DATATYPE_UFIXED_POINT_4 static inputs to BinaryElementwise op {89055}
*     Support QNN_DATATYPE_UFIXED_POINT_4 static inputs to Gather op. {89052}
*   HTP:
*     Improved dma efficiency for convolution in some large models. {94172}
* OpDef:
*   Added 0D support for ElementWiseBinary ops. {89707}
*   Added op definition for ReduceSumSquare. {91309}
*   Added 0D support for Reshape ops. {70572}
* QNN:
*   Update list of supported chipsets {87619}
* SDK:
*   Merged qnn.yaml into sdk.yaml. {75630}
* Tool:
*   qnn-context-binary-utility:
*     Added support for Qnn_TensorV2_t. {91872}
*   qnn-model-lib-generator:
*     Consolidated qnn-model-lib-generator scripts into one Python implementation. {64215}
*   qnn-net-run:
*     Added client profiling level to capture application-only profiling data. {91537}
*   qnn-platform-validator:
*     Added Windows support. {91558}

Bugs
~~~~
* Enhanced accuracy for adaptations for visual attention layers in LVM models. {92163}
* CPU:
*   Fixed memory leaks in SNPE gtestDnnRuntime due to QNN CPU. {91857}
*   Update avx2 and fma support for x86 {86736}
* GPU:
*   Disabled binary elementwise fusion for logical operations as the feature is not currently supported. {93543}
*   Fixed possible out of bounds memory access in various Ops {90947}
*   Fixed crash seen on 8650 in FP16 mode {90724}
* HTP:
*   Fix for context binary creation failure for specific backend {89904}
*   Fixed issue with graph preparation using fp16 ops in the ARM64X library. {95062}
*   Addressed SSR (SubSystem Reset) occurring between init and execute. {91821}
* KI:
*   A TFLite pre-quantized model with quantize/dequantize as the last node will use the second-to-last tensor name as the graph output name {90973}
* Op:
*   CPU:
*     Fixed LayerNorm op heap overflow {90736}
*     Fixed ReluMinMax for ElementWiseNeuron Op {95213}
*     Fixed ReluMinMax for ElementWiseNeuron Op {93983}
*   GPU:
*     Fix bug in Split op having 5D inputs. {93523}
*     Fix accuracy issues seen in some ReduceMean configurations {71736}
*   HTP:
*     Support 5D Prelu. {92186}
*     Fix Input Parameter not found issue for MatMul {95170}
*     Fix prepare failure related to 5D concat Op {90885}
*     Optimized SlicePadShape for FP32. {88759}
*     Fix accuracy issue of StridedSlice op when vtcm size is set to 8 {89662}
*     Fixed HardSigmoid QU8 issue. {90260}
* Tool:
*   Converter:
*     Fix incorrect output name issue when TFLite pre-quantize model has quantize/dequantize as last node {90973}
*     Fixed bug to allow Mul Add to be fused as Batchnorm when preceded by Conv {94502}
*     ONNX:
*       Updated the ThresholdedRelu expansion. {94315}
*       Fixed duplicate buffer issue and fixed axis tracking issue. {87113}
*     TFlite:
*       Fixed unstable results when specifying multiple out nodes {92630}
*   qnn-net-run:
*     Fix crash in qnn-net-run when graph execution is skipped using "__" as input for "-input_list". {94500}
*   qnn-onnx-converter:
*     Fixed issue observed with is_static attribute from ONNX Frontend Translation for ScatterNd, ScatterElements, GatherND Ops. {92920}
*   quantizer:
*     Fixed issue observed when the bias of a conv op needs to be per-channel quantized in mixed-precision mode. {80334}
* Tools:
*   Converter:
*     Set the reduction attribute to none when the attribute is not available in the original graph {95874}
*     Added Div op support in the TensorFlow converter. {95855}
*     Added ScatterNd and GatherNd support in the TensorFlow converter {91897}
*     Fixed a small bug in ONNX Softmax translation {93028}



2.21.0
======

**3/29/2024**

QNN API version: v2.15.0


Changelog
---------

Features
~~~~~~~~
* API:
*   Added QnnContext_createFromBinaryWithSignal API {86294}
*   Clarified QnnProperty capability descriptions. {90915}
*   Added QNN_MEM_TYPE_DMA_BUF QnnMem type. {88940}
* GPU:
*   Added QnnMem API support for the QNN GPU backend. {10291}
* Op:
*   CPU:
*     Added support for optional param time_major and support for multi time-step input and output in LSTM Op. {78820}
*     Added support for 4D input in DistributeFPN {91960}
*     Added support for CreateSparse op {53230}
*     Added support for SparseToDense op {68806}
*     Added support for GetSparseIndices op {53231}
*     Added support for GetSparseValues op {53232}
*   GPU:
*     Support QNN_DATATYPE_BOOL_8 inputs/outputs in Reshape op. {88979}
*     Support QNN_DATATYPE_BOOL_8 outputs in Cast operation. {88978}
*     Support QNN_DATATYPE_INT_32 datatype in Concat op {89045}
*     Support ScatterND operation {88977}
*     Support QNN_DATATYPE_INT_32 in BinaryElementwise op {89049}
*   HTP:
*     Added support for ElementWiseBinary {57626}
* OpDef:
*   Added optional reset input to GRU Op. {90828}
*   Updated mask tensor input description for the MaskedSoftmax op. {88412}
*   Added QNN_DATATYPE_SFIXED_POINT_4 and QNN_DATATYPE_UFIXED_POINT_4 support for Quantize and Dequantize ops. {91536}
*   Added Op definition for Buffer. {72273}
* SDK:
*   Added supported capabilities table to the SDK documentation. {52457}
* Tool:
*   Converters:
*     Added support for sparse tensors {88641}
*   Quantizer:
*     Added a new standalone qairt-quantizer tool equivalent to snpe-dlc-quant. This new tool takes a float DLC and produces a Quantized or Mixed Precision DLC. {90514}
*   qnn-accuracy-debugger:
*     Enabled new quantization options for accuracy debugger {92083}
*     Added the ability for users to specify which plots to generate in quant_checker {88818}
*   qnn-accuracy-evaluator:
*     Added plugins for external use {74447}
*     - Replaced keyword 'platform' with 'inference_schema'
*     - Added support for providing CLI args to context-binary generator and netrun in model config
*     - Refactored providing backend extension params under single subsection instead of under 'compiler_params' and 'runtime_params' subsections {79259}
*   qnn-model-lib-generator:
*     Added support for aarch64-windows-msvc. {87847}
*   qnn-net-run:
*     Support configuration for profile max events {74652}
*     Added retrieve_context_timeout option. {88667}
*     Added support for 0-D graph input/output tensors. {44309}
*     Add configuration options to specify maximum number of tasks that run in parallel when graphs are executed asynchronously. {81908}
*     Added --validate_binary option to validate a context binary before deserialization. {87353}
*   qnn-onnx-converter:
*     - Added --validate_models flag to enable validation of optimized onnx model against original onnx model. {68698}
*   qnn-tensorflow-converter:
*     - Added --validate_models flag to enable validation of optimized tensorflow model against original tensorflow model. {68698}
*   snpe-accuracy-debugger:
*     Added the ability for users to specify which plots to generate in quant_checker {88818}

Bugs
~~~~
* CPU:
*   Fixed memory leak in CPU BE {91859}
* GPU:
*   Resolved stability issues in QNN GPU in Multi-threaded runs. {85184}
* HTP:
*   Improved multi DSP PD handling {90712}
*   Fixed a bug causing execution failure with a previously generated context binary related to NonMaxSuppression op. {88740}
*   Optimize some segmentation model performance {87696}
* LPAI:
*   Fixed compiler version mismatch check mechanism. {89532}
* Op:
*   CPU:
*     Support S_FIXED_32 bias in batchnorm op {82940}
*     Added FP16 datatype support for Cast op {88450}
*   GPU:
*     Fix bug in Cast op. {90901}
*     Resolve accuracy errors in BinaryElementwise ops. {89993}
*     Resolved GPU inference failures. {90369}
*     Fixed accuracy issues in UnPack operation {91514}
*     Fix inference failures in models having consecutive BinaryElementwise ops {89258}
*   HTP:
*     Repair graph finalize for fp models given small vtcm size. {88988}
*     Fix accuracy bug with graph optimization related to Cast and Quantize Ops {86947}
* SDK:
*   Fixed incorrect path referenced in QNN_README.txt {89309}
* Tool:
*   Fix missing modeltools on Windows platform. {90879}
*   Accuracy-Debugger:
*     Fixed issue with debugging single layer model which doesn't have weights. {90492}
*   Converter:
*     Added support for new layernorm op sequence for multiple LLM models. {83558}
*     Added condition to skip layernorm mapping if encodings provided is incorrect. {90285}
*     Onnx:
*       Add support for GlobalPool3D {87216}
*       Fixed an error due to serialization when trying to convert models larger than 2GB. {87479}
*       Add post Reshape op for Matmul op when one of its inputs is unsqueezed. {87217}
*       Support FP16 model conversion. {83579}
*       Insertion logic of duplicate buffer for PRelu op is corrected. {92747}
*   Converters:
*     Added support for weight tensor sharing across multiple Prelu nodes {80946}
*   qnn-accuracy-debugger:
*     Fixed issue in debugger execution with backend config enabled for Auto devices. {90151}
*     Refactored the argument parsing and validation strategy {86189}
*     Added --help argument validation. {91029}
*     Fixed issue in debugging models with custom OP for Auto devices. {91502}
*     Fixed context-binary-generator and Net-runner failures for WoS with backend configs {92340}
*     Refactored the argument list passed to inference_engine through layerwise algorithms {91902}
*     Handled extra newlines in input_list.txt in quant_checker {90704}
*   qnn-accuracy-evaluator:
*     Added support to include user provided netrun params while building netrun command for running on target {90993}
*     Set a default value for dsp_arch when running on target if one is not provided {91841}
*     Add support for new converter params {89386}
*   snpe-accuracy-debugger:
*     Refactored the argument parsing and validation strategy {86189}



2.20.0
======

**2/29/2024**

QNN API version: v2.14.0


Changelog
---------

Features
~~~~~~~~
* API:
*   Introduced Qnn_TensorV2_t. Tensor V2 adds API support for sparse tensors, dynamically shaped tensors, and graph execution early termination. This is an ABI backwards-incompatible change; clients must recompile their applications and model.so libraries. {84712}
* CPU:
*   0D tensor support. {44307}
* Op:
*   CPU:
*     Added support for optional param time_major in GRU Op. {79656}
*   GPU:
*     Support GroupNorm operation. {87375}
*   HTP:
*     Added support for ElementWiseNeuron {57619}
*     Added support for Xor operation. {66124}
* OpDef:
*   Added support for single batch input and output in DistributeFPNProposals op. {84672}
*   Added sparsity support to Relu and Batchnorm Ops. {88095}
* SDK:
*   Python dependency installer script outputs summary table displaying recommended and installed versions of each python dependency {85974}
*   Added SDK documentation for Qnn_TensorV2_t under API->Usage Guidelines. {88096}
*   Added ArgMax custom op example for CPU, GPU, HTP and DSP backends. {87132}
* Tool:
*   Converters:
*     Added Converter support for MaskedSoftmax Operator. {76262}
*     Added HardSigmoid support in the ONNX converter {52388}
*   qnn-context-binary-utility:
*     Added support for Qnn_TensorMemType_t. {87597}
*   qnn-net-run:
*     Allow --dlc_option in qnn-net-run and qnn-context-binary-generator to take in multiple DLCs as a comma-separated list. {86257}
*   qnn-profile-viewer:
*     Improve output of execute queue wait stats. {82530}
*   qnn-pytorch-converter:
*     Enabled preserve_io feature. {75965}
*     Added support for aten::upsample_linear1d in the PyTorch converter {71465}
* Tools:
*   TFLite Converter: Added l2_normalize support {49366}
*   Converters:
*     Added Masked Softmax Optimization
*     - This feature enables the pass that creates a MaskedSoftmax Op and rewrites the graph to include this Op. This is mainly found and applicable for NLP models.
*     - Added --apply_masked_softmax option to enable the pass. It accepts the values "compressed" and "uncompressed".
*     - Added --packed_masked_softmax_inputs option to obtain the packed input tensor name in case of Compressed MaskedSoftmax Op.
*     - Added --packed_max_seq option to obtain number of sequences to be packed in the given input tensor. Applicable for Compressed MaskedSoftmax Op. {68666}
*   Quantizer:
*     Added unsignedsymmetric quantization schema support {87310}
*     - Added --act_quantizer_calibration, --param_quantizer_calibration, --act_quantizer_schema, --param_quantizer_schema and --percentile_calibration_value options.
*     - Added new calibration methods - mse, entropy, percentile, sqnr and min-max.
*     - Added support to set/override default quantization schema. Supported options are symmetric, asymmetric. {80662}
*   snpe-dlc-quantize:
*     Added --act_quantizer_calibration, --param_quantizer_calibration, --act_quantizer_schema, --param_quantizer_schema and --percentile_calibration_value command line options. {87311}
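
The symmetric and asymmetric quantization schemas named in the Quantizer entries above differ in how a tensor's observed range is mapped onto the integer grid. The following is a generic, minimal sketch of that difference in plain Python. It is not the SDK's implementation, and the function names (``asymmetric_encoding``, ``symmetric_encoding``, ``quantize``, ``dequantize``) and the 8-bit default are assumptions made for illustration only.

```python
# Illustrative only: these helpers are hypothetical and not part of the QNN SDK.

def asymmetric_encoding(t_min, t_max, bits=8):
    """Return (scale, zero_point) for an unsigned grid covering [t_min, t_max].

    The range is widened to include 0.0 so that zero is exactly representable.
    """
    t_min, t_max = min(t_min, 0.0), max(t_max, 0.0)
    levels = 2 ** bits - 1
    scale = (t_max - t_min) / levels or 1.0  # guard against an all-zero range
    zero_point = round(-t_min / scale)
    return scale, zero_point


def symmetric_encoding(t_min, t_max, bits=8):
    """Return (scale, zero_point) for a signed grid centred on zero."""
    bound = max(abs(t_min), abs(t_max))
    q_max = 2 ** (bits - 1) - 1
    scale = bound / q_max or 1.0  # guard against an all-zero range
    return scale, 0


def quantize(x, scale, zero_point, q_min, q_max):
    """Map a float onto the integer grid and clamp it to [q_min, q_max]."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Map an integer grid value back to a float."""
    return (q - zero_point) * scale


if __name__ == "__main__":
    a_scale, a_zp = asymmetric_encoding(-1.0, 3.0)
    s_scale, s_zp = symmetric_encoding(-1.0, 3.0)
    print("asymmetric:", a_scale, a_zp)
    print("symmetric: ", s_scale, s_zp)
```

For a range of [-1.0, 3.0], the asymmetric mapping spreads the whole 8-bit grid over the observed range, while the symmetric mapping leaves part of the negative half unused in exchange for a zero offset.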

Bugs
~~~~
* Graph no longer contains the DSP_ARCH setting but inherits it from Device instead. {83364}
* CPU:
*   Padding CPU Native Tensor only for XNNPACK {87444}
*   Fixed the crash which is observed when Camera starts. {87845}
*   Updated default value of sample parameter in qnn-genai-transformer-composer to generate consistent output {89499}
* HTP:
*   Fixed leakage occurring during context binary data creation. {87606}
*   Fixed execution failures associated with detailed and linting profiling levels. {88269}
*   Fixed execution failures associated with detailed and linting profiling levels. {89951}
*   Fixed accuracy issue with specific padding cases. {88205}
*   Fixed prepare failure for some models {87744}
*   Fixed leaks that happened in specific cases during online prepare. {87492}
*   Fixed bug for some shared buffer use cases {87995}
* Op:
*   GPU:
*     Improved accuracy in models having Softmax op with channel dimensions > 16384 in GPU_FP16 precision. {85957}
*     Fix memory access bug in Reshape Op {88082}
*   HTP:
*     Fix accuracy issue in fp16 GroupNorm by correctly handling large height/width {88852}
*     Improve performance for specific LSTM op configurations. {84874}
* SDK:
*   Eliminate the redundant .cpp and .h files located in the share/qnn/converter directory {75106}
* Tool:
*   qnn-accuracy-debugger:
*     Fixed failure in propagating model inputs for auto platform. {89330}
*   quantizer:
*     Fixed an issue where overridden encodings for bias were not applied. {81827}
* Tools:
*   PyTorch Converter: Change default layout of PadOp in PyTorch converter from NCHW to NHWC {56384}
*   Accuracy Debugger: Support the debugging on dspv68 {86768}
*   Accuracy debugger: Made --default_verifier argument case insensitive. {85848}
*   Converters:
*     Pytorch:
*       Fixed an issue with reading and applying quantization data from fakequant nodes in Pytorch networks {68003}



2.19.0
======

**1/31/2024**

QNN API version: v2.13.0


Changelog
---------

Features
~~~~~~~~
* API:
    - Added QnnContext_validateBinary API {86244}
   HTP:
    - Added QNN_HTP_GRAPH_CONFIG_OPTION_MAX to config VTCM size in QnnHtpGraph.h {86240}
* CPU:
   Op:
    - Add Masked Softmax support {68601}
* HTP:
    - Increased max supported rank to 5D for in[0] and out[0] of Convert Op {86809}
    - Optimization for GenAI for reducing LLM memory footprint. {83356}
   Op:
    - Optimization for GenAI to reduce memory footprint. {81973}
* SDK:
    - Added support for GenAiTransformer add-on package (EXPERIMENTAL). Enables running LLM/LLaMA models on CPU. {85246}
* Tool:
   qnn-net-run:
    - Added new command line options graph_profiling_start_delay and graph_num_profiling_executions. {81294}
* Tools:
   Converters:
    Onnx:
     - Reduced peak memory utilization for Onnx converter by sharing static tensors between Onnx model and IR graph. {85401}

Bugs
~~~~
* API:
   DSP:
    - Removed QnnDspError.h from SDK header {86313}
* HTP:
    - Fixed offline preparation freeze issue on mixed precision quantized model. {82990}
    - Fixed performance regression on several models, improving O2 performance to better than qaisw-2.18.0 {85193}
    - Fixed Op validation failure for 16bit dynamic matmul impacting some models {86153}
* Op:
   CPU:
    - Added support for batching in gather_nd op {81216}
    - Support for negative indices in gather op {59239}
    - Fix memory accumulation in Conv2D prepare {86267}
   GPU:
    - Fix ArgMax/ArgMin accuracy bug with UINT dataType {86854}
   HTP:
    - Fix accuracy regression issue in l2norm op {87332}
    - Fix accuracy regression issue in instance norm op {87074}
* OpDef:
    - Fix issue where constraint for split_index parameter for Split op allowed creation of empty output tensors. {84750}
* SDK:
    - Fixed issue in SDK Docs API usage guidelines where QNN_GET_ERROR_HANDLE was erroneously referenced. {87112}
* Tool:
   qnn-context-binary-generator:
    - Enable memory optimization for context binary generation from DLCs when input/output types are specified as memhandles {87629}
   qnn-onnx-converter:
    - Fixed op name not present issue: if the framework-level Op name is not present, an autogenerated Op name is used instead. {82132}
   snpe-accuracy-debugger:
    - Enabled HTP support for --compiler_config argument. {87368}
* Tools:
    - Fixed bug to correctly convert shared static tensor to FP16 {81475}
   Converter:
    - Support optional initial_h and initial_c in Onnx bidirectional LSTM {86935}
    TFlite:
     - Fixed data type mismatch issue for TFLite pre-quantized model {73432}
   Converters:
    - Fixed support for assigning input dtype in PyTorch converter {64335}
    - Support custom relay op with singleton pattern to fix duplicate registration error {78913}



2.18.0
======

**1/5/2024**

QNN API version: v2.12.0


Changelog
---------

Features
~~~~~~~~
* API:
    - Allow QnnGraph_finalize for deserialized graphs created via QnnContext_createFromBinary. {83408}
    - Introduced the QnnGraph_getProperty API. {83279}
    - Introduced the QnnGraph_prepareExecutionEnvironment and QnnGraph_releaseExecutionEnvironment APIs. {81912}
    - Added QNN_CONTEXT_CONFIG_BINARY_COMPATIBILITY context config and QNN_CONTEXT_ERROR_BINARY_SUBOPTIMAL context error code. {83460}
* Core:
    - Support Windows x86 FP16 offline cache generation of QNN and SNPE {83318}
    - Support FP16 online prepare inference on Hamoa of QNN and SNPE {83318}
* HTP:
    - Performance optimizations for various ops {74631}
    - Add A16W16 opvalidator support {81375}
* Op:
   CPU:
    - Added support for negative indices in gather op {79305}
   GPU:
    - Extending LayerNorm axes functional support allowing normalization across non-channel axis (batch, height, width) and allowing batch != 1 {35550}
* OpDef:
    - Added support for optional param time_major in GRU Op. {79655}
    - Added support for negative index values in Gather Op. {79304}
    - Added support for optional param time_major and support for multi time-step input and output in LSTM Op. {74284}
    - Added op definition for MaskedSoftmax {65770}
* SDK:
    - Add ARM64EC python extension modules for WoS {77740}
    - Add native ARM64 snpe-dlc-quant {77740}
    - Modify lib/python structure to organize python extension modules by platform {77740}
    - Updated documentation for HTP linting profiling example command. {80028}
    - Update QNN Documentation for PyTorch Custom Op {77637}
    - Add supported SOC table to SDK documentation. {68121}
* Tool:
   Converters:
    - Support group_norm in pytorch converter {60263}
    - Add Xor support in onnx, relay and tensorflow converters. {66128}
    - Removed check to convert fp32 tensor to fp16 to handle case when bias is set to fp32 using float_bias_bw flag {70549}
    - Enhanced the graph optimizations for the ONNX framework by integrating transformations such as node cleanup, removing unused inputs, and removing zero-dim initializers. {68670}
    - Added support for ONNX and TensorFlow model loaders in QNN-SDK, which provide consistent APIs to query model properties such as the model's input names, output names, and node information. {75444}
    - Added support for converting ONNX models that contain Einsum nodes, whose equations are expressed in textual format. {68690}
    - Updated squashing logic to avoid removing model outputs {84793}
    - Added support for simplification, shape inference and other optimizations of 2GB+ ONNX models in QNN-SDK. All the APIs are consistent across different ONNX versions (i.e. onnx-1.6, onnx-1.11). {68674}
    - Added support for low level APIs which allows easy traversal of graph and modification of graph in QNN Converter. {68673}
   qnn-accuracy-evaluator:
    - use_memory_plugins flag introduced to enable memory plugin based evaluation.
    - Add memory plugins required for mobilenet evaluation {77768}
   qnn-hypertuner:
    - Enabled tuning in the hypertuner using a software backend known as "Hextimate". Currently experimental and only available in QNN-SDK for Auto. {81080}
   qnn-net-run:
    - Introduced new option "--platform_options", which is used to pass platform config options when creating the backend handle. {66044}
    - Introduced use_mmap option, which enables users to pass the context binary data to the backend using memory-mapped I/O buffers instead of raw buffers. {80741}
   snpe-accuracy-debugger:
    - Added new "Tensor inspection" feature. This feature compares given target outputs with reference outputs. {84190}
    - Added new "Compare Encodings" feature. This feature extracts encodings from a given SNPE DLC file, compares it with the given AIMET encodings, and outputs an Excel sheet highlighting mismatches. {83436}

Bugs
~~~~
* CPU:
    - Fixed default value of mode param in Space to Depth {78536}
    - Fixed syntax error causing Memory issues in L2norm Op {83411}
    - Increased precision of softmax output to fix regressed models with large number of softmax ops {82831}
* GPU:
    - Fixed accuracy issues in Concat op in GPU_HYBRID mode. {82938}
    - Fix consecutive BinaryElementWise corner case graph failures {81037}
* HTP:
    - Introduced encapsulation for the prepare library for the purposes of thread-safe access in order to resolve several issues related to concurrency. {84681}
    - Fixed the async execution failure that depends on QnnSignal {81368}
    - Fixed a run-time crash where system attempted to double free resources. Only occurred in process teardown after a graph failed to be created. {84535}
    - Fixed multi-thread power voting problem on V68 platform {81807}
    - Fix potential mutex deadlock in QNN HTP SSR routine {84392}
* SDK:
    - Fix qnn-netron broken link in SDK tools documentation. {85851}
* Tool:
   Converter:
    - Convert 6D transpose to fewer rank to bypass backend limitation {81289}
    - Fix Onnx Converter DequantizeLinear when input is constant {83724}
    Onnx:
     - Enforce h/c input buffers of LSTM to be NONTRIVIAL {33599}
   Converters:
    - Fixed the input and output axis formats for transpose identity for NFC case. {80430}
    - Updated the cast squash to only squash into the next node when a next node exists.
    - When custom_io is used for input/output layout but the input/output axis format is set to NONTRIVIAL, the original axis format from the user-provided custom_io YAML is trusted and the permute is injected accordingly. {73366}
   Quantizer:
    - Added change to set the offset of Convert Op output tensor to 0 when the mode selected is Symmetric {82104}
   qnn-accuracy-evaluator:
    - Add support for "use_per_row_quantization" to take multiple values using '|' in inference_schema {84187}
* Tools:
   Converter:
    - Fix the issue in which backward LSTM is translated to forward LSTM with input_names reversed, but leaving direction flag backward {84422}
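
The Converter entry above about converting a 6D transpose to a lower rank does not say how the rank is reduced. One generic technique, shown here as an assumption rather than as the converter's actual logic, is to merge input axes that stay adjacent and in order under the permutation. The helper name ``collapse_transpose`` is hypothetical.

```python
# Illustrative only: a generic rank-reduction sketch, not the converter's code.

def collapse_transpose(shape, perm):
    """Merge axes that remain adjacent and in order under `perm`.

    Returns (new_shape, new_perm) such that reshaping the input to new_shape,
    transposing with new_perm, and reshaping to the original output shape is
    equivalent to the original transpose.
    """
    # Group runs of consecutive input axes in the permutation, e.g.
    # (0, 1, 4, 5, 2, 3) -> [[0, 1], [4, 5], [2, 3]] (groups in output order).
    groups = []
    for axis in perm:
        if groups and axis == groups[-1][-1] + 1:
            groups[-1].append(axis)
        else:
            groups.append([axis])

    # Order the groups by their position in the input tensor.
    input_order = sorted(range(len(groups)), key=lambda g: groups[g][0])
    input_pos = {g: i for i, g in enumerate(input_order)}

    # Each merged input axis has the product of its group's extents.
    new_shape = []
    for g in input_order:
        size = 1
        for axis in groups[g]:
            size *= shape[axis]
        new_shape.append(size)

    # The reduced permutation lists, in output order, each group's input position.
    new_perm = [input_pos[g] for g in range(len(groups))]
    return new_shape, new_perm


if __name__ == "__main__":
    print(collapse_transpose((2, 3, 4, 5, 6, 7), (0, 1, 4, 5, 2, 3)))
```

A permutation in which no two axes stay adjacent, such as a full reversal, cannot be reduced this way and keeps its original rank.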



2.17.0
======

**11/30/2023**

QNN API version: v2.11.0


Changelog
---------

Features
~~~~~~~~
* GPU:
    - Extending support for serialization/deserialization of > 2GB Context Blobs using 64-bit offset flatbuffers {76075}
* API:
    - Introduced QNN_GRAPH_CONFIG_OPTION_SET_PROFILING_STATE and QNN_GRAPH_CONFIG_OPTION_SET_PROFILING_NUM_EXECUTIONS graph configuration options. {78532}
    - Added QNN_PROPERTY_MEMORY_SUPPORT_MEM_TYPE_ION and QNN_PROPERTY_MEMORY_SUPPORT_MEM_TYPE_CUSTOM capabilities. {68442}
    - Introduced the QnnError.h API. {76270}
* CPU:
    - Improved CPU performance on Windows targets {79302}
    - Optimized native memory utilization. {69880}
* HTP:
    - Enable MonacoAU {82903}
    - Improved a16w4 kernel selection, improving performance and power on LLaMA style networks. {82977}
    - Improved runtime memory utilization notably reflected for LLaMA style networks. {82977}
    - Updated backend extensions config - changed graph object to graph array to allow different graphs to have different sets of properties. {77487}
    - Optimized elementwise multiply, min/max and leakyRelu after concat. {83385}
    - Performance optimization related to Swish operation {81207}
    - Added default support for new LSTM params in HTP Core {79211}
    Made Op Package interface file changes.
      - Removed REGISTER_PACKAGE_OPS and REGISTER_PACKAGE_OPTIMIZATIONS in Init function
      - Added new unified core init macro INIT_PKG_CORE_INIT_FUNC() {74824}
* LPAI:
    - Added support for multiple model generator versions. {82754}
* SDK:
    - Added source framework and Android NDK version info to sdk.yaml. {81491}
* Tool:
    Converter:
      - Fixed wrong pattern matching for ReluOp issue. {67474}
    Converters:
      - The --float_fallback option sets operators that don't have encodings in the quantization_override file to FP16. {64837}
      - Warnings are raised when the --float_fallback option is used with the --quantization_overrides option for operators that are missing encodings. {65645}
      - Added support for LeakyRelu Op {78486}
    qnn-accuracy-debugger:
      - Added option, --golden_output_reference_directory, to allow user to provide golden reference output. {80201}
      - Added new "Compare Encodings" feature. This feature extracts encodings from a given QNN net JSON file, compares it with the given AIMET encodings, and outputs an Excel sheet highlighting mismatches. {79392}
      - Added new "Tensor inspection" feature. This feature compares given target outputs with reference outputs. {81337}
      - Added layerwise snooping feature option, which extracts single nodes/supergroups one by one and creates a subgraph to compile/run on target using the golden reference output of the previous node as its input. The subgraph’s output is then compared with the golden reference. {81812}
    qnn-accuracy-evaluator:
      - Enabled support for CPU and GPU backends for aarch64-android target {81174}
* Tools:
    Pytorch converter:
      - Added support for OneHot op. {58290}
    Converter:
      - Fix param name parsing issue in pytorch converter {74576}
    Converters:
      - Updated algorithm to fix Tensor Layout from Constant operator when it is located ahead of Concat Operator {78590}

Bugs
~~~~
* CPU:
    - Fix node fusion of opPackage node with builtin node {79966}
* GPU:
    - Fix bug in LayerNorm and InstanceNorm Ops where full float kernels were being enqueued for HYBRID mode {82594}
    - Fixed graph failures seen in ElementWiseBinary operations on non-PT devices {80207}
    - Fix de-init failure for multithreaded use cases {81396}
* HTP:
    - Minimize the performance regression on onnx11_custom_ear_23_uc.v.1454.1.0_06412396_video_seg_w8_a8 {78449}
    - Fixed potential segmentation fault for non 8-bytes aligned weights {83164}
    - Reduced peak memory consumption when loading context binary with shared weight buffer {82544}
    - Fixed a performance issue in SpaceToDepth. {81572}
    - Fixed the SoC name string in HexNN {83789}
    - Introduced encapsulation for the prepare library for the purposes of thread-safe access in order to resolve several issues related to concurrency. {80886}
    - Fixed 16bit convolution accuracy regression issue on >=v73 hexagon architectures observed post 2.13 release by correcting kernel selection. This may result in some inference speed regression, but will be on parity with 2.13 release. {80794}
* Op:
    CPU:
      - Added 5D support for Softmax Op. {82074}
    DSP:
      - Add re_quant nodes for concat5D_d32 inputs {78037}
      - Fix issue for custom_shape_error models {71777}
    HTP:
      - Fixed a fp16 elementwise add accuracy problem. {78516}
* SDK:
    - Added multi-thread support for Prepare library unloading. {73396}
* Tool:
    ONNX Converter:
      - Fixed SqueezeOp negative axes issue. {56069}
    Converter:
      Onnx:
        - Fix the problem of GRU not outputting the hidden layer {78599}
    SampleApp:
      - Fix issue where SampleApp incorrectly checked batch size. {80016}
      - Fix issue where SampleApp required input list tensor ordering. {80018}
    qnn-accuracy-debugger:
      - Fixed issue observed when generating mapping between qnn and framework node names. {82916}
    qnn-accuracy-evaluator:
      - Updated inference schema naming to include converter param 'use_per_channel_quantization' and support its multiple values {83055}
* Tools:
    Converter:
      - Fix tensorflow strided_slice conversion for out of range start/end {81917}
    Converters:
      - Updated algorithm for assigning Tensor Layout which removes the need for using "--input_layout" overrides in some models {76267}
      - Fixed performance regression caused by failure to squash Batchnorm Op in certain cases {81127}
      - Fixed issue where conversion may fail for networks having Pool with pad values. {80186}
      Onnx:
        - Added support for converting static inputs to Expand Operator and fixed a bug in Reshape Operator due to mismatch of Numpy and ONNX Opdef. {76889}



2.16.0
======

**10/31/2023**

QNN API version: v2.10.0


Changelog
---------

Features
~~~~~~~~
* GPU:
   - Expand graph node optimizations to select consecutive ElementWise operations {75912}
   OP:
    - Support Concat op with 5D inputs and axis >=2 {76217}
    - Support 5D inputs to StridedSlice op. {76572}
    - Added support for Elementwise Xor {66127}
    - Add support for ElementWiseNeuron op {57617}
* Tool:
    - Added HTP backend config support for "weight_sharing_enabled" flag in qnn-net-run and throughput-net-run.
   qnn-net-run:
    - Added support in the qnn-net-run backend extension input config.json to create tensors that share the same buffer with different offsets. {79698}
    - Added new options in config.json to configure context creation, allowing the user to selectively enable graphs by graph name. {79728}
    - Updated to accept a DLC path as --dlc_path argument in conjunction with libQnnModelDlc.so as the --model argument to compose and execute models from DLCs.
    - Optimize qnn-net-run to minimize the number of I/O tensor allocations. {78520}
   qnn-context-binary-generator:
    - Added a new option "input_output_tensor_mem_type", which sets the mem_type of the I/O tensors during the graph compose phase. {78937}
   qnn-accuracy-evaluator:
    - Replace platform with inference-schema in CLI args
    - Replace platform with inference_schema in model config {78745}
   qnn-context-binary-generator:
    - Updated to accept a DLC path as --dlc_path argument in conjunction with libQnnModelDlc.so as the --model argument to generate context binaries from DLCs. {80411}
   Quantizer:
    - Fixed an issue by not converting Cast to Convert if next op is float {77215}
    - Added a Quantizer pass to make static inputs of Elementwise Op float if the output is overridden to float. {76670}
   PyTorch Converter:
    - Add support for custom op in QNN product {44164}
   Converters:
    - Resolved node name collisions appearing in qnn-model-lib-generator {73648}
    - Added rectangular SpaceToDepth op support to handle the SpaceToDepth pattern in PyTorch models. {68739}
    - Added broadcast support for layernorm op weights and bias {71154}
    - Added batch_norm ND support in tflite/pytorch converter. {52396}
    Onnx:
      - Fixed conversion failure for gather op with scalar indices {79170}
* HTP:
   - Added spill-fill buffer sharing across multiple contexts {79452}
   - Added weight sharing feature. When similar graphs contain common weights, the weight sharing feature can help reduce RAM and ROM usage. {78155}
   - Added support for offset based shared buffers {78968}
   - Added FP16 support for TopK op {78517}
   - Enabled A16W16 (quantized 16 bit weights) support {78667}
   - Improved performance in some networks by propagating height1_sequence at softmax in earlier opt_phase. {77941}
   - Enabled asynchronous execution for QNX platform with V73 hexagon accelerator architecture
   - qnn-net-run executes the graph asynchronously by default for QNX with V73 hexagon accelerator architecture. To execute a graph synchronously, "--synchronous" argument needs to be explicitly passed when running qnn-net-run.
   - Performance can vary because of system load and operations such as file IO, memory read/write, etc. Clients can profile performance by setting options such as "--max_input_cache_tensor_sets" and "--keep_num_outputs" with "qnn-net-run" {65255}
   API:
    - Added support for enableGraphs context config. Added support for weight_sharing_enabled HTP backend CustomConfig.
    - Added a custom memory type for offset based shared buffers {78775}
* API:
   - Added a note that when QnnGraph_executeAsync fails it does not call the notify function. {71543}
   - Added QNN_DATATYPE_SFIXED_POINT_4 and QNN_DATATYPE_UFIXED_POINT_4 data types {72276}
   - Added QNN_CONTEXT_CONFIG_ENABLE_GRAPHS, QNN_CONTEXT_CONFIG_MEMORY_LIMIT_HINT, and QNN_CONTEXT_CONFIG_PERSISTENT_BINARY context configuration options. {79316}
* SDK:
   - Added an example config for enabling weight sharing in the HTP backend config
   - Introduced libQnnModelDlc.so utility library to support QNN graph composition from a DLC.
   - Added support for SoC sm8650 {79382}
   - Added support for Compute SoC: SC8380XP
   - Added Windows arm64x binaries {79888}
   - Added ARM64X support information to SDK documentation {81641}
* CPU:
   - Added signed fixed point 32 datatype support for Dequantize op {78057}
   - Fix segfault with multiple batch in AxisAlignedBboxTransform {77116}
   - Added support for bool datatype in ScatterNd op {77896}
* LPAI:
   - Add support for ElementWiseNeuron op with sub operations: Gelu, HardSwish, Relu, ReluMinMax, Sigmoid, Tanh {58656}
* Enabled FP16 support on WaipioLE {80556}

Bugs
~~~~~~~~
* OpDef:
   - Fixed index value constraint range for in[1] in Gather Op. {77681}
   - Fixed QNN_OP_GRID_SAMPLE_PADDING_MODE_REFLECTION example in GridSample Op. {78587}
* HTP:
   - Move the rule of "adding transpose before formatweights" earlier. {77500}
   - Optimized layernorm op implementation. Removed 6 redundant transposes. {69878}
   - Fix QNN Example Op Package Compiling issue with unused cost function {78245}
   - Fixed performance in some models when per row quantization is used {79240}
   - Fix profiling regression {75583}
   - Fix some overhead issues due to previous profiling changes {75596}
   - Fixed an issue with incorrect profiling updates during asynchronous execution. {80867}
   - Added Hexagon arch configuration code to the QNX driver to address a device creation failure. {80278}
   - Fixed issue with offline preparation of context binary for 4MB and 2MB targets {74807}
   - Fix LLM 1B prepare issue {80175}
   - Fixed a graph prepare failure in minimax_op due to VTCM oversize issue {75879}
   - Fixed offline context binary creation issue in some networks due to inconsistent vtcm tensors for binary ops. {78070}
   - Fixed a crash which occurs in multi-thread use case. {79035}
   - Now returns the correct error code when context configs are not set properly {81001}
   - Update internal free context sequence to fix performance hit. {77527}
   - Fixed a deadlock by replacing an active waiting mechanism with a semaphore. {77271}
* Tools:
   PyTorch Converter:
     - Fixed parameter quantization override {77739}
   Converters:
     - Resolved node name collisions appearing in qnn-model-lib-generator  {76244}
     - Updated the validation to see if the weights of FC and BN are eligible for optimization of BN into FC. {77909}
     - Updated the injection of pre/post reshapes of FC conditionally. {79558}
   qnn-throughput-net-run:
     - Fixed a redundant resource move in asynchronous execution control that caused a crash in some scenarios {80023}
   qnn-context-binary-generator:
     - Users can give the original graph name in input config.json and application will sanitize it before further use to align with graph name after conversion. {63869}
   qnn-net-run:
     - Users can give the original graph name in input config.json and application will sanitize it before further use to align with graph name after conversion. {63869}
   qnn-accuracy-evaluator:
     - Fixed parsing and handling of params from config file to converter command. {77315}
     - Introduced model simplification step before node name sanitation to fix node name mismatch. {76813}
     - Fixed intermediate cleanup of artifacts {77957}
     - Added "simplify_model" flag to enable/disable model simplification of ONNX models {80466}
   Converters:
     - Fixed a conversion failure for Networks with CumSum Op  {77027}
     - Fixed a conversion failure when folding Concat Ops {76310}
     - Fixed a conversion failure for Networks having Layernorm with keepdims=false {77577}
     - Fixed a conversion failure for Networks having Layernorm on Width axis {77578}
     - Fixed a conversion failure for Networks having Layernorm on Width axis {77290}
     Onnx:
       - Fixed a conversion failure when Onnx inferShape API returns an empty graph {73297}
       - Add support for Gather Op with negative indices {77438}
     Pytorch:
       - Fixed an issue with applying overrides {78266}
* GPU:
   - Fix graph finalize failure seen with some Conv2d operations {78167}
* CPU:
   - Added support for int32 hidden_state_offset parameter in LSTM op {76999}
* SDK:
   - Added offline graph prepare support for QCS6490 and QCS8550 targets {76547}
* DSP:
   - Fixed graph prepare failure for quantized LSTM Op {66132}


2.15.0
======

**9/29/2023**

QNN API version: v2.9.0


Changelog
---------
Features
~~~~~~~~
* HTP:
    - Added support for ElementWiseBinary
    - Introduced backward incompatible changes to HTP core API for custom op development. See the Op Package Migration Guide for more information.
    - Enabled support for 5D split ops
    - Enabled support for context binaries larger than 2 GB
    - Fixed oppackageManager cleanup crash for online prepare
    - Removed hard check for API version backward compatibility in custom op package. Added forward compatibility check for API version in custom op package
* OpDef:
    - Clarified behavior with regards to how the parameters normalize and centered affect glimpse window in ExtractGlimpse.
    - Added support for cubic as an interpolation mode in the Resize Op.
    - Added If op definition.
* API:
    - Clarified QnnSignal behavior when used with QnnGraph_executeAsync.
    - Added QNN_PROPERTY_BACKEND_SUPPORT_COMPOSITION capability
    - Added QNN_DATATYPE_FLOAT_64 data type.
    - Added QNN_PROPERTY_TENSOR_SUPPORT_CONTEXT_TENSORS capability
    - Added QnnBackend_setConfig API.
    - Deprecated QNN_TENSOR_ERROR_ALREADY_EXISTS and QNN_TENSOR_ERROR_NAME_HASH_COLLISION error codes. QnnTensor_createContextTensor and QnnTensor_createGraphTensor will no longer generate them.
    - Added QNN_PROFILE_CONFIG_OPTION_ENABLE_OPTRACE and QNN_PROFILE_EVENTTYPE_TRACE.
    - Added QnnGraph_createSubgraph.
* Core:
    - Added PSNPE CAPI based sample app
* Tool:
   qnn-op-package-generator:
    - Added -DPREPARE_DISABLED to the HEXAGON_CXX_FLAGS variable in the auto-generated Makefile.
   ONNX Converter:
    - Added support for start, end attributes in Shape op
    - Added support for coordinate_transformation_mode attribute in RoiAlign op
* CPU:
    - INT4 support for dequantization op.
    - INT8 support for LE targets
   Op:
    - Added INT8 support for CRD mode in SpaceToDepth
    - Added support for 4D GatherElement
    - Added support INT32 for Elementwise Min/Max
    - Added support for cubic interpolation mode in Resize Op.
    - Added support for GroupNorm
* Tools:
   Converters:
    - The --arch_checker option will be deprecated by 2.17 and transition to a standalone qnn-architecture-checker tool.
    Onnx:
     - Added support for converting Resize Bicubic Op to QNN
   Quantizer:
    - Removed the deprecated Algorithm "bc" from the Quantizer arguments and documentation
   qnn-architecture-checker:
    - Added standalone architecture checker tool. Added modify option to apply modifications to models.
   qnn-net-run and qnn-context-binary-generator:
    - Add profiling_option option.
   Accuracy Evaluator:
    - Enabled htp_mcp backend
   Accuracy Debugger:
    - Added support for json output format
* DSP:
   Op:
    - Support Elementwise XOR

Bugs
~~~~
* HTP:
    - Added width tiling of fp16 instancenorm
    - Fixed multithreading map access out-of-range crash
    - Fixed device registration failure during power config for non-RPC use case on v68 devices
    - Graph weights access performance optimization
    - Fixed performance regressions observed in select op in fp16
    - Fixed accuracy regression issue in some models caused by Conv+Prelu fusion optimization
   Op:
    - Fix accuracy issue on some AvgPools
* Conv Udo example is fixed on PT and non PT builds
* SDK:
    - Fixed broken links in PSNPE C API documentation
    - Fixed breaking dependency installation for scipy and numpy version for check-python-dependency.
    - Fixed loadqnn tutorial error.
    - Fix to enable SM7550 SOC
    - Make the use of env var HEXAGON_SDK consistent in SecurePD Add-on
    - Fixed deregister issue for LoadQNN TA
* Tools:
   Converters:
    - Fixed multi-batch conversion failure on SSD models
    - Fix some issues of gather op and ScatterND op
    Pytorch:
     - Enabled a pass for Common Subexpression Elimination that fixes an issue where the same Static tensor will be copied with different tensor names
    Onnx:
     - Added support for converting static inputs to Pow Operator
    TF:
     - Added a check to prevent matching Mul + Add to Batchnorm if the datatype of input is not float. Also fixed a bug where static tensors were always created using float32 dtype.
   quantizer:
    - Fixed a bug with the MatMul bitwidth when overridden
    - Fix the issue where different multiplicative factors were used when converting encodings from 8 -> 16 and 16 -> 8
   Quantization checker:
    - Added a missing argument to method call for dynamic input dimensions.
* KI:
   Tools:
    Converters:
     Onnx:
      - Conversion may fail with error message 'Failed QNN validation for layernorm_2' for networks containing Layernorm pattern when NONTRIVIAL layout is specified in converter command
    Quantizer:
     - Quantizer fails for some Mixed Precision models with an error "RuntimeError: Invalid QnnModel constructed". This is a known issue where Convolution weights & bias get different float bitwidth assigned. As a workaround, set overrides for both the weights & bias tensors.
* HTA:
    - Added support of Pooling 16bit for large dimensions
* Core:
    - snpe-parallel-run fixed for --userbuffer_memorymapped for WoS
    - Fixed --debug not emitting intermediate tensors for offline cache based execution
    - Memory Mapped Userbuffer Sample App - added error handling for incompatible data types.
    - Fixed CAPI MemoryMappedUserBuffer Sample App for multi-byte data types
    - Added PSNPE API documentation for CAPI
    - Fixed size limitation in deserialization of large DLCs (like quantized llama_2B)
* GPU:
    - Fixed init time regressions seen in some models
* CPU:
    - Added bool support in op package
   Op:
    - Added support in ElementWiseSign for multiple input datatype
* DSP:
   - Fixed concat max tensor number issue
   Op:
    - Optimize the reciprocal op implementation
* Tool:
   ONNX converter:
    - Fixed issue causing "TopK op has no attribute axis" error
* HEXAGON_SDK_ROOT must be set to hexagon-sdk-5.4.0 and HEXAGON_TOOLS_ROOT must be set to 8.7.03 for customers generating UDO with PT builds
* Fixed failure in GRU model with snpe-net-run
* Fixed UDO conversion issue observed on some SoCs.
* Fixed bugs in inception_v3 example and it is functional with all runtimes


2.14.0
======

**8/31/2023**

QNN API version: v2.8.0


Changelog
---------

Features
~~~~~~~~
* Tools:
   Converter:
     - Allow only output tensors in the source model to be marked as QNN_TENSOR_TYPE_APP_READ. All other tensors with zero consumers will change from being APP_READ to NATIVE
     - Tensor with no consumers and not an actual graph output will be set to NATIVE for QNN Onnx Converter
     - Update tvm version to support pytorch 1.13 version
     - Updated SDK documentation to reflect Custom Op requirements
     - Replaced Cast Op with Convert op when input is boolean
     - Added PyTorch Conv1d/Conv3d Op support
     - Added support for fallback dtype
     - Added a Graph pass that matches Space2Depth Op (CRD & DCR) from Reshape - Transpose - Reshape pattern
   Onnx converter:
     - Fix the type bug of dequantize and remove the disconnect node before optimization.
     - Added support for BoxWithNMSLimit.
     - Added default attribute perm for Transpose Op
     - Added support for TransposeConv3d
     - Added negative max_output_boxes_per_class parameter support for NonMaxSuppression.
     - Added the quant/dequant encoding to the input for input->quantize->dequantize patterns
     - Added support for GenerateProposals.
     - Added support to convert function nodes. The converter always does inlining of function nodes
     - Added support for BBoxTransform
     - Added support for RoIAlign
   qnn-context-binary-generator:
     - Added profiling_level option
     - Added set_output_tensors option
   qnn-net-run:
     - Added context configuration option for async execution queue depth
     - Added set_output_tensors option
   Quantizer:
     - Made optimizations for operations having same quantization parameters for inputs and outputs
     - Updated SDK documentation for option --restrict_quantization_steps
   Pytorch converter:
     - Added dry_run option to Relay based conversion
   qnn-context-binary-utility:
     - Initial release.
* SDK:
   - Added sdk.yaml and qnn.yaml SDK informational files.
   - Support Windows 11 x86 Host
   - Removed all hexagon-v65 related artifacts
   - Added android artifact to Windows SDK
   - check-python-dependency now requires the user to activate a Python virtual environment before executing the script
   - Added Softmax examples
   - Add Converters and offline prepare tools support on x86_64 Windows.
* OpDef:
   - Added optional parameter mode to SpaceToDepth
   - Updated BatchPermutation Op to use shape of in[1] to determine batch dimension of out[0]. Relaxed constraint on index values of in[1].
   - Updated constraint of out[0] index values to be based on FPN levels for DistributeFpnProposals
* API:
   - Introduced QnnProfile_ExtendedEventData_t and QnnProfile_getExtendedEventData to support binary large object data.
   - Added the QNN_DATATYPE_STRING data type for scalars.
   - Added QnnProfile_setConfig and QNN_PROFILE_CONFIG_OPTION_CUSTOM and QNN_PROFILE_CONFIG_OPTION_MAX_EVENTS configuration options.
* DSP:
   - Added support for ExtractPatches.
   OP:
    - Added support for GatherElements
    - Added support for HardSigmoid
    - Added support for ElementWiseBinary
    - Added support for ElementWiseNeuron
    - Added support for ElementWiseUnary
* HTP:
   - Enabled v73 QEMU driver
   - Added support for "hestimate" (execution estimates) information, provided during offline graph prepare/finalize.
   - Enabled HTP online prepare for aarch64-oe-linux-gcc9.3 target
   - Enabled online prepare feature for aarch64-ubuntu-gcc9.4 target
* CPU:
   - Add Int8 support for QNN CPU OpPackage
   - Add Int64 support for Gather Op
   - Update the depth_to_space logic for asymmetric block dims
   OP:
    - Added support for GroupNorm
    - Added support for GRU
* GPU:
   - Added support for CRD mode for DepthToSpace
   - Added support for BOOL_8 inputs to Cast operation
   OP:
    - Support ElementWiseUnary Op.
    - Support ElementWiseBinary Op
   API:
    - Add QnnGpu_MemoryLayout_t enum to QnnGpuOpPackage.h

Bugs
~~~~
* HTP:
   - Add constraint when moving flat slicepad_shape from/to vtcm
   - Fixed issue with some models when preparing for FP16
   - Fix accuracy issues on certain models with avgpool 3D
   - Fixed "Stub lib id mismatch" failure when backend is loaded concurrently with SNPE on different threads
   - Added support for ReduceSum rank 5
   - Improved performance of ResizeTrilinear uint8 Op
   - Fixed issue when Logger is initialized after multiple backend initializations in different threads
   - Fixed graph prepare issue due to "ReduceMean"(RMSNorm) VTCM oversize
   - Display progress bar during online prepare stage on Windows
   - Fixed issue with accuracy drop on softmax+matmul
   - Fixed failure in Conv op creation related to weights to vtcm operation
   - Fixed FP16 related error due to model serializer change
   OP:
     - Fixed a vtcm overflow problem of large input batch reduce min/max op, and a vtcm allocation bug of multiply op when one of the inputs is a scalar.
     - Added uint8 support for maxpool w77s44p00
     - Fixed a vtcm overflow problem of padding the graph input, and fixed fp16 mul padded input error.
* GPU:
   - Fix bug in custom OpPackage example to allow only valid kernels to be passed to Backend
* DSP:
   - Fix for scale-range changing to support higher accuracy
   - Fix PRelu accuracy issue
* SDK:
   - Fixed issue where the SecurePD loader and QNN example did not print logs
* Tools:
   Onnx converter:
     - Added support for TransposeConv3d
     - Added support for converting static inputs to several Unary Elementwise Operators
     - External quantization overrides are not applied to MatMul ops; QNN quantizer-generated encodings are used instead. An accuracy drop can be observed for networks having MatMul ops.
     - Fixed issue with hardswish related optimization
     - Fixed issue with axes_to_spatial_first_order optimization in Elementwise Ops
   Converters:
     - Remove transposing weights multiple times in Lstm and add dynamic input for Gemm.
     - Disabling optimization of sequence when encodings are present
     - Resolved OpValidation error related to LayerNorm Op caused due to the unsqueezed Gamma/Beta tensor being > 1D rank
     - Disabling squashing of Mul+Add into BN when encodings are present
   quantizer:
     - Avoid changing the activation bitwidth according to the weight/bias bitwidth
     - Fixed issue with large scale values being produced in some models starting with FC layer
   check-python-dependency:
     - Fixed Numpy and Scipy dependency issue


2.13.0
======

**7/31/2023**

QNN API version: v2.7.2


Changelog
---------

Features
~~~~~~~~
* Tools:
   Converter:
     - Changed the logic for converting 1D ops into 2D ops to expand along the H dimension instead of the W dimension.
     - Changed the translation of FloorDiv operator to ElementWiseDivide if the datatype of input is Int32.
     - GRU weights are shared across time unrolling step.
     - Added support for Float32 bias in Float16 execution
    Core:
     - User will be able to skip graph execution when there are multiple graphs present in a context.
* DSP:
    Op:
      - Added support for logSoftmax
* GPU:
   - Performance improvement on Kodiak and Cedros devices.
    Op:
      - Support broadcasting in ElementwiseSelect op.
      - Support 3D inputs in LayerNorm op.
      - Support broadcasting of batch dimensions in MatMul op.
      - Support reduction along batch for 4D inputs in Reduce Op.
      - Support QNN_DATATYPE_INT_64 input datatype to Cast op.
      - Support inputs with rank < 4 and batch > 1 for rank=4 for LayerNorm op.
* HTP:
     - Introduce O3 Optimization.
     - Added support for 16 bit activations to ElementWiseSquaredDifference
    Op:
      - Added support for uint8 window7x7 stride3x3 maxpool ops.
      - Added support for GroupNorm
* SDK:
   - Added libQnnSystem.so for Hexagon targets.
   - Updated Pandas version in check-python-dependency script to 1.1.5
* OpDef:
   - Added op definition for Conv1D
   - Added op definition for TransposeConv1D
   - Added op definition for ElementWiseXor
   - Added op definition for DepthWiseConv1D
* SNPE SDK:
   - Add HtpPrepare.dll push step for HTP online prepare flow of windows tutorial (tutorial_inceptionv3_win).
* QNN SDK:
   - Added an HtpPrepare.so push step to the HTP section of the Android documentation (htp_execution_tutorial_2.rst.in), which previously mentioned only HTP offline prepare.
* API:
   - Added QNN_PROPERTY_GRAPH_SUPPORT_PER_API_PROFILING capability.
   - Added QNN_GRAPH_ERROR_GENERAL error code.
   - Added QnnSystemContext_getMetadata and deprecated QnnSystemContext_getBinaryInfo.
   - Added QNN_SIGNAL_ERROR_INCOMPATIBLE_SIGNAL_TYPE error code and clarified unconfigured QnnSignal behavior.
* Documents:
   - Update latest PyTorch Op support.
* MCP:
   - Combining IO DMA buffers as a perf optimization.
* CPU:
   - Fixed CollectRPNProposal kernel data passed.
* KI:
   - HTA BE support enabled for QRB5165.UBUN.2.0 targets based on GCC9.4 toolchain

Bugs
~~~~
* Tools:
   - Fixed conversion error when bias_add has a bias shape different from the channel shape of the preceding Conv.
   - Fixed a bug where matmul+add was mistakenly optimized when the matmul's dimension is not 2.
    ONNX converter:
     - Support Split op in opset13, and keep the axes format in layernorm if input_buffers axis_format is equal to node.op.data_axis_formats.
     - Fix the issue of onnx split translation.
    Quantizer:
     - Fixed the bug that caused the Static input tensors to use the weight_bw instead of activation bw by default
    Converter:
     - Enabled row wise and 4-bit quantization for MatMul Ops.
     - Fixed an error related to python type signature of c++ Set datastructure caused by python3.8 upgrade.
     - Some models with biasadd having bias tensor shape different than the channel shape of the preceding Conv will see failure during conversion in Opvalidation.
     - Fixed a bug where matmul+add was mistakenly optimized when the matmul's dimension is not 2.
    TF Converter:
     - Support optimized Gelu pattern that contains Mul instead of Realdiv.
     - Added support for conv2d_transpose layer with asymmetric strides
    KI:
     - Quantized models with LSTM Op will fail during inference.
     - Arch_checker will fail with an error related to python type signature of c++ Set datastructure.
* HTP:
   - Fixed vtcm oversize issue for large input node followed by a concat.
   - Add boundary check of gather_element's index generic implementation.
   - Repair bug in ReduceMean optimization during prepare.
   - Fixed issue with some models when preparing for FP16
   - Fixed set context_priority during qnn-throughput-net-run execution
   - Added RESOURCE_HVX flag for custom ops when using default Op registration. This fixes an issue where HVX got stuck during custom Op registration
    Op:
     - Fixed a vtcm overflow problem of large input depth matmul.
* DSP:
    Op:
     - Added support for Reshape from 4D to 5D.
* SDK:
    SampleApp:
     - Fix issue where multi-target op package failed to load.


2.12.0
======

**6/30/2023**

QNN API version: v2.7.1


Changelog
---------
Features
~~~~~~~~
* Saver:
   - Added configuration option to control output filenames.
* OpDef:
   - Added op definition for ElementWiseBinary
   - Added optional parameter aligned to RoiAlign Op.
   - Added optional input batch splits and optional outputs batch splits, keeps, and keeps size to BoxWithNmsLimit Op.
   - Added optional parameter weights and optional output batch splits to AxisAlignedBboxTransform Op.
   - Added optional parameter allow_invalid_roi to RoiAlign Op.
   - Added optional parameter bbox_xform_clip to GenerateProposals Op.
   - Updated out[0] of DistributeFpnProposals to provide a -1 index value for invalid Rois.
   - Added Op definition for GroupNorm.
* Tool:
   - Support qnn-platform-validator on Windows
   qnn-net-run:
     - Added support for execution timeout
     - Support input tensor caching.
   Converter:
     - Added a new transformation to change MatMul into FullyConnected even without Bias.
     - Added a fix to account for the difference in the offset sign and usage when quantizing tensors
     - Modified the output names generated by Pytorch Converter and TFlite Converters
     - Changed the axis tracking behavior to match the TF & Onnx Converters.
     - Added support for new commandline argument to preserve the input layout and datatype as the source framework model
     - Added a new pattern to squash BatchNorm into FC + Reshape.
   Pytorch Converter:
     - Set model default input and output formats as spatial-first format (NHWC).
* GPU:
   OP:
     - Support 3D inputs in InstanceNorm op.
     - Support GELU operation.
* API:
   - Added QNN_GRAPH_ERROR_TIMED_OUT error code
   - Added QNN_COMMON_ERROR_RESOURCE_UNAVAILABLE error code.
* SDK:
   - Removed unused libPlatformValidatorShared.so artifacts.
* CPU:
   - Added depthwise+relu node fusion logic for INT8 ops.
   - Added 6D Support for Elementwise mul
   - Add allow_invalid_roi parameter in RoiAlign
   OP:
     - Added Support for ElementWiseNeuron
* HTP:
   - Added QNN signal timeout feature
   - Added backend extension support for extreme power saver performance profile mode
   - Added support for PD restart using FASTRPC_SESSION_CLOSE
   - Improved model loading times (FR78518)
   - Cleaned up use of QNN_ERROR_UNKNOWN_ERROR return code.
   - Added support for missing ElementWiseUnary operations: Abs, Asin, Atan, Ceil, Cos, Exp, Floor, Log
* DSP:
   - Added support for absolute input value for MultiClassNMS operation.
* HTA:
   - Updated documentation for supported 16bit Ops.

Bugs
~~~~
* GPU:
   OP:
    - Fix bug in Squeeze Op validator which allowed unsupported dimensions
* HTP:
    - Fixed issue where mem grow size could not be set to a smaller value.
    - Fix the scale limit of u8 elementwise addsub.
    - Fix the bug of passing down crouton_from_vtcm in dequantize.
    - Fixed undefined symbol for SecurePD QNN.
    - Improved performance of ElementWiseGreater op.
    - Fixed VTCM oversize issue with Gather op.
    - Fixed issue with serializing SpaceToDepth op.
    - Fixed accuracy failure caused by tile misalignment (8b & 16b difference).
    - Improved model VTCM size dependent preparation robustness for FP16 precision.
* API:
   HTP:
     - Added support for QnnSignal timeout.
* Tools:
   qnn-net-run:
     - Fixed incorrect number of files being saved using --keep_num_outputs arg.
     - Correct the number of outputs generated when executing a static batched model with qnn-net-run in Async mode.
   ONNX converter:
     - Added support for constant data tensor as input to Gather Op when the index tensor is 0D (scalar).
     - Fixed Layernorm float dtype overrides, ensuring all tensors have same data type.
     - Fixed issue with Convert op wrongly inserted after a Dequantize op
   Converters:
     - Fixed issue related to Axes of Bias input to Conv Op
     - Fixed a bug where the inputs to Concat Op have different layouts
   Quantizer:
     - Fixed an error related to locking the WeakPtr associated with the Bias tensor to Convolution Op
     - Fixed an issue that prevented weights & bias inputs of Batchnorm from being set as FP16
* SDK:
   - Fixed SecurePD stack overflow issue
* DSP:
   - Fixed issue where loading a context from binary returned the wrong input/output tensors
* Saver:
   - Increase decimal precision when recording float values.


2.11.0
======

**5/31/2023**

QNN API version: v2.7.0


Changelog
---------

Features
~~~~~~~~
* Op:
    ONNX converter:
      - Added support for Mod
* OpDef:
    - Added op definition for ElementWiseNeuron
* SDK:
    - Added support API table in SDK documentation
    - Removed caffe support from qnn-quantization-checker, qnn-accuracy-evaluator, qnn-netron, and Golden-I
    - Upgraded Linux development host to Ubuntu 20.04 LTS
    - Upgraded Python support to version 3.8
    - Upgraded Android NDK version to 25c
    - Added support for Tensorflow version 2.10.1
    - Added support for ONNX version 1.11.0
    - Updated docs in SecurePD addon to reflect new directory structure
* API:
    - Added QnnSignal timeout configuration
    - Corrected and added some error code returns
    - Added QNN_COMMON_ERROR_INCOMPATIBLE_BINARIES common error code
* HTP:
    - Reject second connection to QNN HTP BE libraries. libQnnHtpPrepare.so, libQnnHtpVXXStub.so, libQnnHtpVXXSkel.so are affected.
    - For x86 offline context binary generation, added a progress animation to indicate that the generator is still running.
    - ElementwiseUnary op support updates
* CPU:
    - INT8 support enabled for LA targets.
    - Removed DetectionOutput clipping
* Tools:
     Converter:
      - TensorFlow: Added support for ExtractPatches.
     TF Converter:
      - Added support for Tensorflow 2.10.1

Bugs
~~~~
* HTA:
    - Fix Concat Accuracy inside HTA Compiler
* HTP:
    - Fixed accuracy bug in Transpose-Reshape-Transpose op chain
    - Fixed DEF_OPTs related to VTCM movement surrounding the "ScatterInverse" op. Previously, the affected model would hit an op creation failure and fail to prepare because a downstream op that requires a TCM tensor type received a non-TCM tensor type.
    - Fix QNN Graph finalize issues for certain models
    - Fix accuracy issue in FP16 layernorm operation
    - Fix graph finalize issues on certain floating point models
* SDK:
    - Fix doc bug for SecurePD QNN.
    - Fixed SecurePD stack overflow issue
* Tools:
    Converter:
      - Updated algorithm to handle axes transformation for Elementwise Ops.
      - Fixed a bug when squashing a Gather Op whose output is the same as its input, which would result in a KeyError.
      - Fixed conversion error when an operator's output is used as both a graph output and a UDO input.
      - Fixed missing graph output when a UDO output is used as both a graph output and the next operator's input.
      - Fixed ScatterElements quantization issue
    ONNX converter:
      - Keep the depthToSpace op's input and output axis format as NSC
* GPU:
    OP:
      - Fixed bug in Concat to change axis param from mandatory to optional.
* DSP:
   - Fixed bug for logger create.
   - Fixed op package generation issue.


2.10.40
=======

**5/10/2023**

QNN API version: v2.6.0


Changelog
---------

Features
~~~~~~~~
* HTP:
   - Set graph priority mappings to legacy pre-qnn-2.8.0 values
   - Added support for the backend platform options configuration
* API:
   - Added platform options backend configuration.
* SDK:
   - Made SDK structure updates related to unified software stack
   - Updated setup scripts and associated documentation
   - Made significant documentation content and style updates
   - Retired support for arm-android and qnn-caffe-converter, removed corresponding artifacts

Bugs
~~~~
* HTP:
   - Fixed an object use-after-free / segfault issue.


2.10.0
======

**4/28/2023**

QNN API version: v2.5.1


Changelog
---------

Features
~~~~~~~~
* OpDef:
    - Added op definition for ElementWiseMod.
    - Added Op definition for ElementWiseAsin.
    - Added op definition for ElementWiseFmod.
* API:
    - Added QNN_COMMON_ERROR_INCOMPATIBLE_BINARIES common error code.
* HTA:
    - Refactored DepthwiseConv2d Op to support padding and dilation parameters.
* HTP:
    - Made stricter constraints for moving indices of ScatterNd into VTCM to address accuracy loss in some models
    - Added ROI Align Op broadcast support
* GPU:
    - Added support for QnnOpPackage_ImplementationV2_0_t.
    Op:
      - Support Pack operation with 1 input.
* DSP:
    - Remove DetectionOutput clipping (#866).
    Op:
      - Support Cast from BOOL_8 to UFIXED_POINT_8.
* CPU:
    - DeformConv2D op support.
    - Added Support for Mod.
    - Added support for ElementWiseUnary.
    - Fix double free in BoxWithNMSLimit due to dynamic output size.
    - Fix double free in GenerateProposal due to dynamic output size.
    - Support optional output in NMS.
    Op:
      - Elementwise Asin support.
* SDK:
    - Updated documentation to separate API and Operations sections.
    - Refine the Example XML OpDef Configs page in documentation
* Tools:
    Quantizer:
      - Cleanup and fixes in LSTM op
    Converter:
      - Support elementwiseAsin op in converter.
      - Add Scatter/ScatterElements support in onnx converter.
      - Allowed multiple outputs in Split-like Ops, provided all have the same data type, to support mixed precision use cases
      - Added a new optimization sequence to convert BatchNorm into FullyConnected when applicable.
      - Add gather_nd support in tflite/pytorch converter.
      - Solve CenterNet conversion error.
    ONNX Converter:
      - Fix conversion issues for GRU op.

Bugs
~~~~
* HTP:
   - Made the constraint for moving ScatterNd indices into VTCM stricter, addressing an issue seen in vivo's model.
   - Support Elementwise Sin/Cos with INT8 precision.
   - Improved batch to space performance in certain configurations.
   - Fixed failure to finalize graph on certain networks.

* DSP:
   - Fixed ElementWiseAdd performance issue.
   - Fixed backend features under multi-threaded conditions.
* Tools:
    - Support for CRNN model
    Converter:
      - Fix quantization override issue for tflite converter.
      - Fix Cast bug and update ArgOp/TransposeOp support.
      - Optimize Gather op's indices_buff in 'remove_identity'.
      - Fixed RoiAlign validator error for certain models.
    Quantizer:
      - Fixed issue with encodings not being consumed properly for PRelu op, due to name mismatches with original model.



2.9.0
======

**3/31/2023**

QNN API version: v2.5.0


Changelog
---------

Features
~~~~~~~~
* OpDef:
    - Added constraint for dilation > 0 in Convolution Ops.
    - Added ElementWiseUnary op definition.
    - Added op definition for NonMaxSuppression.
    - Constrained all ND inputs to have a rank greater than 0.
* CPU:
    - NonMaxSuppression op support
    Op:
      - Transpose Conv 3D support in CPU
* Tool:
    qnn-net-run:
      - Added keep_num_outputs option
      - Added batch_multiplier option
    ONNX converter:
      - Added support for NonMaxSuppression op
* SDK:
    - Added libQnnJsonProfilingReader.so
* HTP:
    - Optimized pad, transpose operations and VTCM utilization for certain network configurations
    - Fix accuracy issue for INT16 Div operation
    - Improve performance for GridSample operation
    Op:
      - Added support for NonMaxSuppression
* GPU:
    - Support context priority config.
    Op:
      - Support QNN_DATATYPE_FLOAT_16 datatype and non-multiple of 4 input size in Lstm op.
* API:
    - Added QNN_PROPERTY_GRAPH_SUPPORT_EXECUTE capability
* DSP:
    Op:
      - Added support for dilated conv3d

Bugs
~~~~
* CPU:
    - Fixed 3D tensor support in InstanceNorm
* HTP:
    - Fixed accuracy issue in ReduceMax op.
    - Fixed an unexpected error reported for certain graphs during execution with detailed profiling.
    - Fixed tensor IDs being cast to a different data type before printing to logs.
    - Fix accuracy bug in 16bit LayerNorm implementation.
    Op:
      - Fix u16 mul crash cases when InA is in 111d format.
* Tool:
    qnn-op-package-generator:
      - Fix CPU OpPackage compilation error seen in 2.8.0
    Quantizer:
      - Fixed per-channel quantization failures caused by incorrect retrieval of static bias input tensors
    Converter:
      - Fixed a bug in the Transpose Op optimization that occurred in some cases.
      - User quantization overrides take precedence over external override JSON file values when generating graph
    Onnx:
      - Models with opset version <=11 with a Softmax on channel dimension and input > 2d may see an error running on 2MB VTCM HTP targets and GPU targets because of a required C*H*W reshape which results in a larger dimension
      - Added support for null tensor handling in Slice Op
* HTA:
    - Added validation for FC dimension. Y cannot be bigger than 1024 due to HTA HW support limitation.
* DSP:
    - Fixed Prelu_v2 regression issue
    - Fixed encoding op for ContextCreateFromBinary
    - Fixed op-package support issue on LE devices
    - Fixed softmax accuracy issue for SNPE2 DSP in dynamic encoding mode

2.8.0
======

**2/28/2023**

QNN API version: v2.4.0


Changelog
---------

Features
~~~~~~~~
* Tools:
    qnn-net-run:
      - Add native_input_tensor_names option to specify native input file data types per input.
    qnn-context-binary-generator:
      - Added support for a context binary with multiple models.
    Quantizer:
      - Added support for quantized LSTMs
      - Added support for infinity
    Converters:
      Onnx:
        - Added support for Sign.
* API:
    - Added new QnnProfile event types to support QnnGraph_executeAsync profiling.
    - Add QnnGraph continuous profiling.
    - Add Qnn_Priority_t QNN_PRIORITY_NORMAL_HIGH.
* HTP:
    - Added a new priority "normal high" which is between normal and high priority levels.
    - Optimized int32 compare operations
    Op:
     - Added support for GridSample.
     - Added support for ElementWiseSign op.
* OpDef:
    - Added UINT32 support for in[1] in Gather op.
    - Added op definition for ElementWiseSign.
    - Clarify DetectionOutput::out[1] and align to backend behaviour.
* CPU:
    - Update BoxWithNMSLimit for static output size
    Op:
     - Add DistributeFPNProposal support
     - Added support for Sign op
     - Added Support for ExtractPatches Op
* DSP:
    - Offline prepare support on Windows QNN DSP
    Op:
     - Transpose5d hookup.
     - EltwiseAdd5D hookup
     - Reshape5D and RoiAlignV2 hookup

Bugs
~~~~
* Tools:
    Converters:
      - Resolved a bug in tracking consumers of a buffer when squashing Identity Op
      - Added the ability to add Bool8 tensor in converted .cpp files as String for QNN Converters
    ONNX Converter:
      - Fixed TransposeOp input axis format NT issue.
    loadqnn:
      - Fixed securepd client reorder option issue
* HTP:
    - Fixed a VTCM overflow that occurred when changing data layout from uint8 flat to uint8 crouton in TCM.
    - Fixed a race-condition in concurrent backend init/deinit calls.
    - Fixed accuracy issue in per-channel quantized DepthWiseConv2d op
    - Fixed issue with FP16 operations in some networks
    - Fixed issue with VTCM overflow in some networks
    - Fixed model preparation issue in some networks due to insufficient TCM size error
    - Fixed performance issue when model prepared with HVX threads higher than available in HW.
    - Fixed batch multiple support.
    - Improved inference time for networks with batch>1.
* DSP:
    - Fixed pad5d regression issue.
    - Fixed model execution issue due to reshape.
* HTA:
    - Added limitation of total Concat channel to 4096 when one of the channels is not aligned by 32.
    - Added validation for FC dimension. Y cannot be bigger than 1024 due to HTA HW support limitation.
* GPU:
    - Improved accuracy in FP16 mode with Kailua.LA.1.0-01005-STD.INT-1 META onwards.
    Op:
     - Support large dimensions in ReduceMean op.
* SDK:
    - Updated documentation for DSP backend.

2.7.0
======

**2/07/2023**

QNN API version: v2.3.2


Changelog
---------

Features
~~~~~~~~
* OpDef:
   - Added op definition for ExtractPatches.
   - Added INT32 support for in[1] in GatherNd op.
* CPU:
   - Fix output dim issue with fully connected op.
   - Added support for Uint32 in Index Tensor of Gather Op.
   OP:
     - PoolMax3D support.
     - Batch Permutation Op support.
     - Add CollectRPNProposals support in CPU.
     - Add support for MatMul bias optional input.
* Tool:
    qnn-net-run:
      - Support symmetric quantization.
      - Add input data type support for QNN_DATATYPE_SFIXED_POINT_8, QNN_DATATYPE_SFIXED_POINT_16, QNN_DATATYPE_SFIXED_POINT_32, and QNN_DATATYPE_UFIXED_POINT_32
      - Introduce use_native_input_files and use_native_output_files options. Deprecate the input_data_type and output_data_type options.
    qnn-context-binary-generator:
      - Add backend_binary option to output the backend specific cache.
    Converters:
      Onnx:
       - Added support for NonZero.
* API:
   - Deprecate Qnn_SocModel_t.
* DSP:
   - Updated enum names in QnnDspGraph_Encoding_t.
   - Added support for securepd on v66 target, subject to supported soc limitations.
   OP:
    - Added 5D support for Concat.
    - Added support for PoolMax3d.
* HTP:
   - Support u16 and fp16 GridSample in HTP.
   - Enable ElementwiseLess operation with INT32 precision.
   - Enable ElementwiseEqual operation with INT32 precision.
   - TopK is now hardware accelerated for K <= 256.
* SDK:
   - Add V66 Secure PD.

Bugs
~~~~
* CPU:
   - Fixed a memory leak in math library.
   - Fix Memory leak observed in QML allocation.
   - Add int32 support for ElementWise Neg.
* HTP:
   - Fixed SoC (mis)detection issue.
   - Fixed fully connected layer performance regression in some cases.
   - Fix potential double unmapping
   - Relax the restriction of slice_shape and conv fusion.
   - Fix missing nullptr check in perfsettings
   - Fixed memory leak that occurred when the log module is initialized multiple times.
   - Fix Graph Finalize issue on some graphs that use ElementwiseSquaredDifference operation.
   - Fix Graph Finalize issue on some graphs that use ReduceMean operation.
   - Fixed memory leak when calling QnnLog_create and QnnLog_free repeatedly.
   - Fixed an issue where, due to the store buffer, memory order was not consistent with program order.
   - Fillmore FP16 test enablement is disabled.
   - Fixed with more tiling rules.
   - Fall back dilated conv to reference implementation if inputs don't fit in VTCM and can't be tiled.
   - Consider padding when doing inplace concat.
* DSP:
   - Fixed context caching by changing add-tensor mechanism.
   - Solve DSP backend accuracy issue introduced by dynamic encoding enablement.
   - Fixed validation failure caused by the DSP backend not supporting the QNN_DATATYPE_UINT_8 datatype as input.
   - Fixed model caching with tensor name for input tensors
   - Fixed undefined symbol in securepd
* HTA:
   - Activated verbose level as HTA level to produce detailed profile information. Execution time will be much slower for bigger graphs.
   - Added validation for unsupported dimensions greater than 4D.
* Tools:
      - Fixed an if check which was missing the len() when checking for number of inputs to Elementwise Ops.
      - Fixed an assumption that Gamma/Beta are the 2nd input when squashing a Layernorm pattern.
      - ONNX Converter now supports GridSample op in SNPE & QNN
    Converters:
       - Fixed a bug in the optimization that merges Matmul + Reshape + Add to FC Op that would incorrectly insert the FC Op before the Constant Bias Op
       - Fixed a couple of bugs in the Converter
      Onnx:
       - Added support to translate GlobalAvgPool1D Op in the Converter.
       - Add a default_attrs param to function extract_attributes to get default attributes if needed.
       - When the x input is constant, allow DequantizeLinear and QuantizeLinear to calculate its tensors.
* Op:
   GPU:
    - Fix graph prepare bug for large dimensions in Softmax op.


2.6.0
======

**12/30/2022**

QNN API version: v2.3.1


Changelog
---------

Features
~~~~~~~~
* OpDef:
   - Added Op definition for DistributeFpnProposals.
   - Added QNN_DATATYPE_INT_32 support for CropAndResize in[2].
   - Added QNN_DATATYPE_INT_32 support for ScatterNd in[1].
   - Added Op definition for Nonzero.
   - Added Op definition for CollectRpnProposals.
   - Added support for broadcasting in ElementWiseLess Op.
* CPU:
   - Added support for 3 dim input in instanceNorm op
   - Added 'Axes' parameter support in L2Norm op
   - Added dynamic tensor support for DepthWiseConv
   - Added support for ScatterElements Op
* HTP:
   - Graph option added to set number of HVX threads.
   - Config option enabled to read and set number of HVX threads using QNN apps.
   - Support v69 and v73 targets with HTP oppackage.
* Tools:
   Onnx converter:
    - Support transposeconv1d, map transposeconv1d to transposeconv2d
   Converters:
    - Changed output datatype of Argmax Op to Int32 from Uint32
* OP:
   CPU:
    - Added support for NonZero op
    - INT32 support for scatterND

Bugs
~~~~
* Tools:
   Tensorflow converter:
      - Fixed bugs in LSTM with stacked cells.
   Onnx converter:
      - Models with opset version <=11 with a Softmax on channel dimension and input > 2d may see an error running on 2MB VTCM HTP targets and GPU targets because of a required C*H*W reshape which results in a larger dimension
      - Support ChannelShuffleOp's quantization encoding inheriting the encoding of the previous node.
* HTP:
   - Improved pytorch op MultiheadAttention performance when batch=1.
   - FP graphs are not supported on select SoCs.
* CPU:
   - Fixed padding parameter calculations in PoolAvg3d op
   - Fixed op validator issue in tile op
   - Fixed failure when adding CropAndResize op to the graph
   - Added dynamic tensor support for DepthWiseConv
* DSP:
   - Fixed multi-thread priority issue
   - Fix for model context binary with tensor name
   - Fixed backend terminate issue in multi-thread test case
   - Fixed RelSdkSymbolVisibilityChecker failure
* SDK:
   - Fixed an issue observed when setting the environment path repeatedly on the Windows platform.
* OP:
   CPU:
    - Crop and resize op Support.


2.5.0
======

**11/30/2022**

QNN API version: v2.3.1


Changelog
---------

Features
~~~~~~~~
* CPU:
    - Added support for dynamic weights for TransposeConv2d.
    - Added support for INT32 in index tensor for Argmax Op.
    - Added INT32 data type support for Pack Op.
    - Add INT32 support for ElementWiseSelect op.
    - Add int32 and uint32 input support for Argmin and Argmax.
    - Added INT32 data type support for index tensors in ArgMin Op.
    - Added INT32 data type support for ElementWiseFloorDiv Op
    - Added support for 3 dim input in instanceNorm op.
* OpDef:
    - Added INT32 support for in[1] in GatherElements op.
    - Added INT32 support for out[0] in Argmax op.
    - Added Op definition for BatchPermutation.
    - Added INT32 support for out[0] in Argmin op.
* HTP:
    - Added a HTP specific profiling level in qnn-net-run.
* Tools:
    - Added qnn-accuracy-evaluator. This tool helps to automatically run different model config setups and compare the output results to get the best setup config. (experimental)
    - Added Architecture Checker tool to QNN SDK. Available as command line option to converters. (experimental)
    - Added qnn-quantization-checker tool to QNN SDK (experimental)
    - Added qnn-netron GUI tool to QNN SDK.
   Converter:
     ONNX:
        - Add ElementWise Softplus support.
* Op:
     HTP:
      - Speed up dynamic depthwise convolution with uint8 weights.

Bugs
~~~~
* HTP:
   - Fix vtcm overflow caused by softmax and onehot which have a large depth.
   - Fixed accuracy regression in a few models using masked-multiplication FP16 Op.
   - Fixed VTCM overflow for TransposeConv2d layers with groups > 1, in depth = out depth, padding = 0, and groups != in depth.
   - Mitigated runtime crash due to potential memory corruption (54195)
   - Repair accuracy bug in element wise operations.
* DSP:
   - Fixed QnnProperty_hasCapability to be callable independent of QnnBackend being created.
   - Cache tensor info on tensor create for use in subsequent APIs.
   - Fixed SoC (mis)detection issue.
   - Fixed issue in QnnContext_setConfig related to setting priority before graph creation.
   - Fixed the calculation of zero point used for dilated convolution with stride greater than 1.
   - Fixed a bug in getting output info from the op config when adding a node in DSP.
* Tool:
   Converter:
      - Fixed bugs when the Select (Where) Op has three inputs.
      ONNX:
        - Allowed constant tensor encodings to be equal to the overridden output tensor encodings when bit width=4.
   qnn-netron:
      - Fixed issue causing differences not being presented properly for some models.
      - Fixed dependency script bug with nodejs installation version mismatch.
   Tensorflow converter:
      - Fixed issues with per-channel quantization of weights: set is_symmetric = true by default, added param "axis" and "is_symmetric" into weight encodings info.
      - Fixed bugs in LSTM with stacked cells.
   Quantizer:
      - Fixed issue with quantization of weights and biases in Conv3d Op due to squashing with Relu.
* HTA:
    - Fixed Reshape op validator to reflect support for only equal Input and Output dimensions.
    - Fixed issue with detailed profiling information not being produced.
* OP:
    GPU:
      - Fixed Convolution Op configuration to resolve accuracy issues.
      - Fix Concat graph finalize failures on Fillmore and Kodiak devices.
      - Fix concat op having input rank = 4 and axis = 0 validation error on low tier devices.


2.4.0
======

**10/31/2022**

QNN API version: v2.3.0


Changelog
---------

Features
~~~~~~~~
* DSP:
    Op:
      - Support broadcasting for ElementWiseSelect.
* CPU:
   - Added support for broadcasting in ElementWiseSelect Op.
   - GridSample op Support.
* Tools:
    qnn-sample-app:
      - Added support for QnnDevice create and free APIs.
    qnn-net-run:
      - Add duration and num_inferences command line options.
      - Add support for int64/uint64 graph input and outputs.
* API:
    - Introduction of the QnnSignal API.
    - Add support for QNN_SOC_MODEL_SM8325.
    - Added QNN_PROPERTY_GRAPH_SUPPORT_FINALIZE_SIGNAL, QNN_PROPERTY_GRAPH_SUPPORT_EXECUTE_SIGNAL, and QNN_PROPERTY_GRAPH_SUPPORT_EXECUTE_ASYNC_SIGNAL capabilities.
* OpDef:
    - Added Op definition for ScatterElements.
    - Added support for broadcasting in ElementwiseSelect Op.
* GPU:
   - Fixed Concat Op configuration and validation logic.

Bugs
~~~~
* GPU:
   - Fixed init time regressions when using kernel cache.
   - Fixed SoC (mis)detection issue.
* OpDef:
   - Remove incorrect shape constraints for Tile out[0] and multiples param.
* HTP:
   - Updated the core code to export an additional symbol to default visibility for op package integration.
* Tools:
    Quantizer:
       - Fixed bug caused by incorrectly added Convert operation for non-quantized data type conversions.
* CPU:
   - Fixed SoC (mis)detection issue.



2.3.0
======

**09/30/2022**

QNN API version: v2.2.0


Changelog
---------

Features
~~~~~~~~
* CPU:
      - Added dynamic tensor support for TransposeConv2D.
    Op:
      - Added support for Shape op.
      - Added support for ConstantOfShape op.
* API:
   - Updated QnnGraph_executeAsync() behavior to block until the execution is enqueued rather than returning early if the queue is full.
   - Clarified behavior with concurrent calls to QnnGraph_execute() and QnnGraph_executeAsync()
   - Introduced a queue depth context config to control the maximum depth of the async execution queue.
   - Remove deprecated QnnGpuBackend_CustomConfig_t from QnnGpuBackend.h
   - Moved default QNN_API definition to QnnCommon.h
* Tools:
    Converters:
      Onnx:
        - Added 5D tensor support for PoolMax3d.
        - Added 5D tensor support for Resize.
        - Added 5D tensor support for PoolAvg3d.
    qnn-net-run:
      - Added support for execution via QnnGraph_executeAsync(). This will be the default mode of execution if supported by a backend.
* HTA:
   - Introduced backend with API 2.x support.
   - Add validation of HW limitation for FC layer.
* DSP:
   - Introduced backend with API 2.x support.
* HTP:
    Op:
      - Added 5D support to ElementWisePower.

Bugs
~~~~
* HTP:
   - Fixed VTCM estimation for axis=3 concat. Input tensors are now also taken into account if the concat is not done in place.
   - Fixed issue with float models containing Reduce Mean op not handling batch > 1 accurately.
   - Bug fix to handle graph finalize issues for certain ML models.
* HTA:
   - Fix wrong return of API error code.
* CPU:
   - Add INT64 support for cast op.
   - Improved CPU BE performance on Windows.
* GPU:
    Op:
     - Fix bug in InstanceNorm validation that fails when passing in normalize_variance param.
     - Fix bug in Tile validator for tiling across batch dimension for input rank >= 4
* Tools:
    Quantizer:
      - Fixed issue observed with int4 weight override support.


2.1.0
======

**08/04/2022**

QNN API version: v2.1.0

- Added QNN_SOC_MODEL_SXR1230P, QNN_SOC_MODEL_SSG2115P, and QNN_SOC_MODEL_SM6450.

Changelog
---------

Features
~~~~~~~~
* OpDef:
    - Added GRU op definition.
* Tools:
    Converters:
      Onnx:
        TensorFlow:
          - Added 5D tensor support for Conv3D.
* DSP:
   Op:
      - Support CastUint32toFloat32.
      - Support FloorDiv.

Bugs
~~~~
* HTP:
   - Updated rules to properly handle a dequantize followed by a quantize operation.
   - Fixed the dequantize followed by slicepad sequence issue.
   
* Tool:
    qnn-throughput-net-run:
      - Fixed potential memory leak issue with profile object allocation.

2.0.0
======

**07/07/2022**

QNN API version: v2.0.0

- QnnInterface:
    - QnnInterface_getProviders function signature update.

- QnnTypes:
    - Qnn_Tensor_t data structure update:
        - Add versioning (i.e. Qnn_TensorV1_t).
        - Add name field. ID field is now backend generated.
        - Consolidate max and current dimensions into one field.
        - INT4 support (see Qnn_BwScaleOffset_t and Qnn_BwAxisScaleOffset_t).
    - Qnn_OpConfig_t data structure update:
        - Add versioning (i.e. Qnn_OpConfigV1_t).
    - Added Qnn_SocModel_t.

- QnnTensor:
    - Qnn_Tensor_t is now an output argument to QnnTensor_createContextTensor and
      QnnTensor_createGraphTensor since the ID is now generated by the backend from the name.
    - Added QNN_TENSOR_ERROR_NAME_HASH_COLLISION error code.

- QnnDevice introduction:
    - Adds multi-core support.

- QnnBackend:
    - Introduce Qnn_BackendHandle_t.
    - These APIs now take a Qnn_BackendHandle_t as an argument:
        - QnnBackend_registerOpPackage
        - QnnBackend_validateOpConfig
        - QnnBackend_registerOpPackag
    - QnnBackend_initialize replaced by QnnBackend_create.
    - QnnBackend_terminate replaced by QnnBackend_free.
    - Added QnnBackend_getSupportedOperations and QnnBackend_OperationName_t.
    - Removed QnnBackend_getPerfInfrastructure (see QnnDevice_getInfrastructure).
    - Added and removed a variety of error codes.

- QnnMem:
    - QnnMem_register now takes a Qnn_ContextHandle_t as an argument.
    - Add backend specific memory registration extensions.

- QnnContext:
    - Increased maximum context binary size to 64-bit.
    - Consolidate QnnContext_createFromBinary and QnnContext_createFromBinaryWithConfig.
    - QnnContext_create and QnnContext_createFromBinary function signature updates:
        - Qnn_BackendHandle_t association.
        - Qnn_DeviceHandle_t association.

- QnnLog:
    - Introduce Qnn_LogHandle_t.
    - QnnLog_setLogLevel now takes a Qnn_LogHandle_t as an argument.
    - QnnLog_initialize replaced by QnnLog_create.
    - QnnLog_terminate replaced by QnnLog_free.
    - Qnn_LogHandle_t is associated to a Qnn_BackendHandle_t in QnnBackend_create.
    - Added and removed a variety of error codes.

- QnnProperty:
    - Removed QnnProperty_get and QnnProperty_free.
    - Removed the following capability keys:
        - QNN_PROPERTY_BACKEND_SUPPORT_BUILD_ID
        - QNN_PROPERTY_BACKEND_SUPPORT_PERF_INFRASTRUCTURE
        - QNN_PROPERTY_BACKEND_SUPPORT_OP_VALIDATION
        - QNN_PROPERTY_CONTEXT_SUPPORT_GET_BINARY
        - QNN_PROPERTY_CONTEXT_SUPPORT_GET_BINARY_SIZE
        - QNN_PROPERTY_CONTEXT_SUPPORT_CREATE_BINARY
    - Added the following capability keys:
        - QNN_PROPERTY_CONTEXT_SUPPORT_CACHING
        - QNN_PROPERTY_GRAPH_SUPPORT_PRIORITY_CONTROL
        - QNN_PROPERTY_GROUP_DEVICE
        - QNN_PROPERTY_DEVICE_SUPPORT_INFRASTRUCTURE
    - Added and removed a variety of error codes.

- QnnGraph:
    - Add priority configuration.
    - Add QnnGraph_setConfig API.

- QnnProfile:
    - QnnProfile_create associated with a Qnn_BackendHandle_t.

- QnnOpPackage:
    - Introduce Qnn_OpPackageHandle_t.
    - Introduce 2.0 interface to the backend.
    - Removed the QNN_OP_PACKAGE_API_VERSION_* macros and replaced them with 
      QNN_OP_PACKAGE_API_VERSION_1_4_0 and QNN_OP_PACKAGE_API_VERSION_2_0_0.

- QnnSystem:
    - QnnSystemInterface_getProviders function signature update.
    - QnnSystemContext_getBinaryInfo function signature update for const output.
    - Added QnnSystemContext_BinaryInfoV2_t to support QnnDevice.

- QnnOpDef:
    - Added op set version.

- Other:
    - Prune header inclusions.

