API overview

The current version of the Qualcomm® AI Engine Direct API is:

QNN_API_VERSION_MAJOR 2
QNN_API_VERSION_MINOR 29
QNN_API_VERSION_PATCH 0

The QNN API is source-level backwards compatible. However, it is not guaranteed to be binary (ABI) backwards compatible.

Components

Clients interact with Qualcomm® AI Engine Direct through the backend library by invoking the QNN API. The QNN API is C-style to facilitate portability across different platforms. The API is organized into components, each prefixed with Qnn, as depicted in the diagram QNN API Components. Each API component has a corresponding header file, e.g., QnnGraph.h.

QNN API Components

[Image: qnn_api_components.png]

The QNN API components are categorized in QNN API Components Summary Table.

QnnBackend (Core; backend specialization: yes)
    Top-level QNN API component. Most QNN APIs require a backend to be initialized first. Provides the QNN OpPackage registry API.

QnnDevice (Core; backend specialization: yes)
    Top-level QNN API component that provides multi-core support. Provides all constructs required to associate desired hardware accelerator resources with the execution of user-composed graphs. A platform is broken down into potentially multiple devices, and devices may have multiple cores. Provides an API for performance control.

QnnContext (Core; backend specialization: yes)
    Provides the execution environment for graphs and operations. Graphs, and the tensors shared between graphs, are created within a context. Context content can be cached in binary form, which can later be used for faster context/graph loading. Provides a configuration option for priority control.

QnnGraph (Core; backend specialization: yes)
    Provides the composable graph API. A graph is created inside a context and is composed of nodes and tensors; nodes are connected by tensors. Once finalized, a graph is ready for execution.

QnnTensor (Core; backend specialization: no)
    Tensors hold either an operation’s static/constant data or input/output activation data. Tensors have either context or graph scope; tensors created with context scope can be used within graphs that belong to the same context.

QnnOpPackage (Core; backend specialization: yes)
    Provides the interface through which the backend uses registered OpPackage libraries.

QnnProfile (Utility; backend specialization: yes)
    Provides means to profile QNN backends to evaluate the performance (memory and timing) of graphs and operations.

QnnLog (Utility; backend specialization: no)
    Provides means for QNN backends to output logging data; can be extended to OpPackages as well. Can be initialized before QnnBackend.

QnnProperty (System; backend specialization: no)
    Provides means for a client to discover the capabilities of a backend. Can be used without QnnBackend initialization.

QnnMem (System; backend specialization: no)
    Provides an API to register externally allocated memory with a backend.

QnnSignal (System; backend specialization: no)
    Provides means to manage signal objects, which are used to control the execution of other components.

Usage sequences

As highlighted in the Overview section, the QNN architecture and unified API is designed with the intent to ease integration into third-party NN frameworks, with the flexibility to support varying use case needs.

The following sections describe typical QNN API interaction workflows.

Basic call flow

The most common use case for clients interacting with the QNN API is to construct a graph representing their network model by adding operation nodes and tensors to connect them. Once the construction is complete, the user can start graph execution by supplying input tensors to the designated input nodes of the graph. Frameworks like SNPE and ANN use this workflow to construct and execute QNN graphs.

The QNN SDK also includes neural network converters that can assist the user in constructing such graphs by translating a source network model into an equivalent QNN representation. The converted representation is essentially a C++ file consisting of invocations of the same QNN APIs that perform the operations described above. Clients can use this converter-based workflow to compile and link the QNN representation of their models into applications. Some tools in the QNN SDK, such as qnn-net-run, use this workflow.

The QNN Basic Call Flow diagram demonstrates this basic scenario.

Note that this illustration does not showcase the usage of all API components described above. For simplicity, the illustration is limited to the loading and execution of an entire graph inside one context and on one backend.

QNN Basic Call Flow

[Image: BasicCallSequence.png]

Initialization and Op Package registration

Applications interacting with the QNN API must first create the backend on which to create contexts and graphs. This is done using QnnBackend_create().

Native operations supported in the QNN SDK are automatically registered with the backend at the time of initialization. If an application wants to use custom Op Packages, it must register them manually using QnnBackend_registerOpPackage().

Context and graph composition

QNN graphs live in QNN contexts that provide an execution environment for their operations, as explained in QNN API Components. An application creates a context using QnnContext_create(). Optionally, it can customize the context using the QnnContext_Config_t argument.

The application then creates an empty graph with desired configuration within the context using QnnGraph_create().

It then translates the source framework model into the QNN graph by adding nodes to the graph and interconnecting them with tensors. The QNN tensors that connect these nodes are created using the QNN APIs in QnnTensor.h. Tensors that connect nodes within a graph are referred to as graph tensors and can be created using QnnTensor_createGraphTensor(). Tensors that are defined in the scope of a QNN context and connect different graphs are referred to as context tensors and can be created using QnnTensor_createContextTensor(). Input tensors that represent static data, such as weights and biases, can be created by supplying the data as part of Qnn_Tensor_t.

QNN tensors can be designated to be of different types based on the purpose that they are used in an application. Graph input tensors are specified with type QNN_TENSOR_TYPE_APP_WRITE. Graph output tensors are specified with type QNN_TENSOR_TYPE_APP_READ. All other intermediate tensors in a graph are specified with type QNN_TENSOR_TYPE_NATIVE, and tensors containing static data are specified with type QNN_TENSOR_TYPE_STATIC. Context tensors that intend to connect two or more graphs are specified with type QNN_TENSOR_TYPE_APP_READWRITE.

QNN prescribes rules and imposes certain restrictions in using these types when creating different types of QNN tensors to safeguard applications from creating them using invalid combinations. Refer to QnnTensor.h for a comprehensive listing of all rules as applicable to tensor creation.

Note

QNN tensors must be created with a name unique in the context. Duplication is not permitted and results in undefined behavior.

Nodes are instances of QNN operations. Each node is created with an operation configuration that defines the type and attributes of the operation the node represents; see Qnn_OpConfig_t for more details. Nodes are added to a graph using QnnGraph_addNode(), which accepts the op configuration as a Qnn_OpConfig_t argument. Nodes should be added to the QNN graph in node dependency order.

Note

There are no QNN APIs to remove nodes and tensors registered with a context or graph.

Graph finalization

After all tensors and nodes have been configured and added to the graph, the application must inform QNN that composition is complete and the graph can be finalized by calling QnnGraph_finalize(). This step allows the backend to perform a series of optimizations, such as collapsing multiple nodes into a single node, to produce a highly performant and efficient graph.

Note

This step of finalizing a graph is mandatory to allow applications to run inference cycles.

Note

A finalized graph cannot be modified any further; no new nodes or tensors can be added to the graph after this step.

Graph execution

After having successfully created and finalized a graph, the application can now start running inference on the backend with the graph execution APIs. Graphs can be executed synchronously using QnnGraph_execute(), or asynchronously using QnnGraph_executeAsync(). The asynchronous version accepts additional arguments to notify the application when execution completes, along with optional notification parameters.

Termination

An application can tear down contexts it has created using QnnContext_free(), which in turn destroys any graphs they hold. Subsequently, an application can free a backend handle with QnnBackend_free(), which invalidates all resources and handles that were associated with the backend while its contexts and graphs were in use.

Context caching

The QNN framework allows applications to cache composed graphs and contexts in binary form for future use. One reason an application may choose to do so is to save the time spent constructing graphs, thereby reducing network initialization time. A backend may take advantage of this approach by allowing applications to compose and cache QNN contexts offline on an x86-based desktop machine, and subsequently load and execute graphs from the cache at runtime on the target device.

The diagram QNN Context Caching Call flow demonstrates the sequence of calls involved in context caching.

QNN Context Caching Call flow

[Image: ContextCaching.png]

A QNN context can be cached in binary form and retrieved from the backend with QnnContext_getBinary(). The application is expected to allocate sufficient memory to hold the context cache; an estimate of the size of the binary can be obtained with QnnContext_getBinarySize(). The layout of the binary buffer is determined by each backend individually and is therefore backend-specific. However, additional metadata describing the contents of the binary buffer can be queried by the user and is common across all backends. It contains, among other things, information about the graphs and their input/output tensors that users can reference later during cache-based graph inference. For a full reference, see QnnSystemContext_BinaryInfo_t.

Note

Tensors are uniquely identified by their IDs. Therefore they remain the same between context caching and runtime cache-based inference. Graph input/output tensor IDs may be obtained from a context cache via QnnSystemContext_BinaryInfo_t.

Note

Context caching succeeds only if entities within the context, such as graphs if they exist, have been properly formed and finalized.

Cache-based execution

The primary objective of this workflow is to allow an application to skip graph composition by loading a precomposed context from the cached binary produced by the step above.

The QNN Cache-based Execution Call flow diagram demonstrates the sequence of steps to execute graphs loaded from a cached context.

QNN Cache-based Execution Call flow

[Image: CacheBasedExecution.png]

An application first loads a context from the cached binary into the QNN backend using QnnContext_createFromBinary(). The backend returns a context handle that can be used to retrieve a graph constructed within that context using QnnGraph_retrieve(). These graphs are guaranteed to have been finalized, by virtue of having been successfully cached in the step above. Once this is done, the application can start forward inference by supplying input tensors, just as in the basic call flow described above.

An application can optionally use additional services provided by the QNN API to inspect the contents of the context binary produced in the step above. This removes the need for applications to cache all context-related metadata themselves during graph preparation. These backend-independent services are made available through the QnnSystem API, exposed through a standalone QnnSystem library. For reference, the API QnnSystemContext_getBinaryInfo() in the diagram above provides context binary information when presented with a serialized binary buffer.

Note

Custom Op Packages registered with a backend do not get cached along with any contexts. They have to be manually registered during the cache retrieval sequence using QnnBackend_registerOpPackage().

The application terminates the constructed context using QnnContext_free(), as before.

Backend API specialization

Some QNN API components provide means for backend specialization through opaque objects for which structure definition must be provided by the backend. The QNN API Components Summary Table indicates API components that provide backend specialization capability.

API specialization is optional, and is subject to the needs and discretion of a backend. Backends publish their specialized headers at <QNN_SDK_ROOT>/include/QNN/<Backend>/ using the following convention:

  • Headers are named as Qnn<Backend><ApiComponent>.h

    • For example, QnnCpuOpPackage.h is the CPU backend’s specialization of the QnnOpPackage.h header, and can be found in the <QNN_SDK_ROOT>/include/QNN/CPU/ folder.

  • Each backend-specialized header includes base API component headers as required. For example,

    • QnnCpuOpPackage.h will #include "QnnOpPackage.h"

  • Clients should include specialized headers with <Backend> in the path.

    • For example, #include "CPU/QnnCpuOpPackage.h"

  • Clients are responsible for casting the correct backend-specific type to the API specialization interface. For example, to specify a custom graph configuration for the DSP backend, a client must cast the QnnDspGraph_CustomConfig_t structure defined in DSP/QnnDspGraph.h to the generic QnnGraph_CustomConfig_t used in QnnGraph_Config_t as part of QnnGraph_create().

Current versions of backend specific APIs are found at the following locations: