Compiling a UDO package
Introduction
This section provides information about compiling UDO packages for all supported runtimes in Qualcomm® Neural Processing SDK.
As explained in Overview of UDO, a set of registration and implementation libraries is collectively referred to as a UDO package. Users have complete control over building these libraries for their desired runtimes using compatible toolchains. Alternatively, the Qualcomm® Neural Processing SDK offers tools and utilities to create and compile a UDO easily. For more information about the tool used to create a UDO package, refer to Creating a UDO package. This section explains UDO package compilation based on the directory structure produced by the package generator.
Implementing a User-defined operation
Fundamentally, a UDO must be developed using the set of APIs defined in the header files located at $SNPE_ROOT/include/SNPE/SnpeUdo/. Each runtime may impose additional requirements and provide options for customizing the implementation to suit that runtime. Details of the UDO APIs can be found in the API documentation at Qualcomm® Neural Processing SDK API. This section assumes that a UDO package was generated using the UDO package generator tool described in Creating a UDO package, which produces a partial implementation skeleton based on the UDO specification configured by the user.
Make Targets for Package Compilation
The UDO package generator tool creates a makefile to compile the package for a specific runtime and target platform combination. The makefile provides a simple interface for compiling on platforms that use make natively or require ndk-build. Using the provided makefile also allows per-library compilation for the various targets.
The general form of each make target is <runtime>_<platform>. Targets of the form <runtime> alone include all possible platform targets for that runtime. For instance, running
make cpu
compiles the CPU libraries for both the x86 and Android (arm64-v8a) platforms. A comprehensive table of the available make targets is presented below.
Note: Use of the makefile is optional and not required to generate libraries.
Note: For all following examples, the displayed artifacts are for arm64-v8a target.
Implementing a UDO for CPU
A CPU UDO implementation library based on the core UDO APIs is required to run a UDO package on the CPU runtime. The UDO package generator tool creates a skeleton containing blank constructs in the required format, but the core logic for creating and executing the operation needs to be filled in by the user. This is done by completing the implementation of the finalize(), execute(), and free() functions in the <OpName>.cpp file generated by the UDO package generator tool.
To achieve good performance and stability, heap memory allocation must be avoided in the completed execute() functions. Heap memory allocation includes, but is not limited to, calling malloc or operator new, constructing STL container objects such as std::vector with the default allocator, and adding items (for example, via std::vector::push_back) to STL container objects with the default allocator. Please check here for more information.
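As an illustration of this rule, the sketch below preallocates all scratch memory once during a setup step (analogous to finalize()) so that the hot path (analogous to execute()) never touches the heap. The SoftmaxScratch type and its method names are hypothetical and are not part of the SNPE UDO API:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical op state: all scratch memory is reserved once, outside the
// hot path, so the per-inference call performs no heap allocation.
struct SoftmaxScratch {
    std::vector<float> exps;  // sized once in prepare(), reused every call

    // Setup step (analogous to finalize()): the only place that allocates.
    void prepare(std::size_t numElements) { exps.resize(numElements); }

    // Hot path (analogous to execute()): reads/writes preallocated memory only.
    void run(const float* in, float* out, std::size_t n) {
        assert(n <= exps.size());
        float maxVal = in[0];
        for (std::size_t i = 1; i < n; ++i)
            if (in[i] > maxVal) maxVal = in[i];
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            exps[i] = std::exp(in[i] - maxVal);  // stable softmax numerator
            sum += exps[i];
        }
        for (std::size_t i = 0; i < n; ++i)
            out[i] = exps[i] / sum;
    }
};
```

The same pattern applies regardless of the operation: any container that would otherwise grow inside execute() should be sized ahead of time in the setup phase.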
Note: An important point to take into account is that the Qualcomm® Neural Processing SDK provides the tensor data corresponding to all inputs and outputs of a UDO not directly but as an opaque pointer. The UDO implementation is expected to obtain the raw tensor pointers using the methods of the CustomOp operation object issued by the Qualcomm® Neural Processing SDK at execution time. The CPU runtime operates only on floating-point activation tensors. Therefore, CPU UDO implementations must receive and produce only floating-point tensors, and the field data_type in the config file must be set to FLOAT_32. All other data types will be ignored. Refer to Defining a UDO for more details.
Compiling and running the UDO package on the host is required for the Qualcomm® Neural Processing SDK model quantization tool, snpe-dlc-quantize. It is necessary to quantize a model using snpe-dlc-quantize in order to run a UDO layer that has at least one non-float input on the DSP.
Compiling a UDO for CPU on host
Steps to compile the CPU UDO implementation library on the host x86 platform are as follows:

Set the environment variable $SNPE_UDO_ROOT:

export SNPE_UDO_ROOT=<absolute_path_to_SnpeUdo_headers_directory>
Run the make instruction below in the UDO package directory to compile the UDO package:
make cpu_x86
The expected artifacts after compiling for Host CPU are
The UDO CPU implementation library: <UDO-Package>/libs/x86-64_linux_clang/libUdo<UDO-Package>ImplCpu.so
The UDO package registration library: <UDO-Package>/libs/x86-64_linux_clang/libUdo<UDO-Package>Reg.so
Note: The command must be run from the package root.
Compiling a UDO for CPU on device
Steps to compile the CPU UDO implementation library on Android platform are as below:
- Set the environment variable
$SNPE_UDO_ROOT.export SNPE_UDO_ROOT=<absolute_path_to_SnpeUdo_headers_directory>
$ANDROID_NDK_ROOTmust be set for the Android NDK build toolchain.export ANDROID_NDK_ROOT=<absolute_path_to_android_ndk_directory>
Run the make instruction below in the UDO package directory to compile the UDO package:
make cpu_android
The shared C++ standard library is required for the NDK-built libraries to run. Make sure libc++_shared.so is present on the device at a location in LD_LIBRARY_PATH.
The expected artifacts after compiling for Android CPU are
The UDO CPU implementation library: <UDO-Package>/libs/arm64-v8a/libUdo<UDO-Package>ImplCpu.so
The UDO package registration library: <UDO-Package>/libs/arm64-v8a/libUdo<UDO-Package>Reg.so
A copy of shared standard C++ library: <UDO-Package>/libs/arm64-v8a/libc++_shared.so
Implementing a UDO for GPU
Similar to the CPU runtime, a GPU UDO implementation library based on the core UDO APIs is required to run a UDO package on the GPU runtime. The UDO package generator tool creates a skeleton containing blank constructs in the required format, but the core logic for creating and executing the operation needs to be filled in by the user. This is done by completing the implementation of the setKernelInfo() and <OpName>Operation() functions, and adding the GPU kernel implementations in the <OpName>.cpp file generated by the UDO package generator tool.
To achieve good performance and stability, heap memory allocation must be avoided in the completed <OpName>Operation() functions. Heap memory allocation includes, but is not limited to, calling malloc or operator new, constructing STL container objects such as std::vector with the default allocator, and adding items (for example, via std::vector::push_back) to STL container objects with the default allocator. Please check here for more information.
Qualcomm® Neural Processing SDK GPU UDO supports 16-bit floating-point activations in the network. Users should expect the input/output OpenCL buffer memory from the Qualcomm® Neural Processing SDK GPU UDO to use 16-bit floating point (OpenCL half) as the storage data format. For increased accuracy, users may choose to implement the internal math operations of the kernel in 32-bit floating point, converting from half precision when reading input buffers and back to half precision when writing output buffers in the UDO kernel.
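To illustrate the storage-versus-compute split described above, the sketch below converts half-precision storage values to 32-bit floats for internal math and back on output. This is a simplified software conversion (normal numbers only, with the mantissa truncated rather than rounded); a real kernel would use the GPU's native half support:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Simplified fp32 <-> fp16 (IEEE binary16) conversion for normal numbers.
// Small values flush to zero and the mantissa is truncated, not rounded.
static uint16_t floatToHalf(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof(x));
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    int32_t exp = (int32_t)((x >> 23) & 0xFFu) - 127 + 15;  // rebias 8 -> 5 bits
    uint16_t mant = (uint16_t)((x >> 13) & 0x3FFu);         // keep top 10 bits
    if (exp <= 0) return sign;                 // underflow: flush to zero
    if (exp >= 31) return sign | 0x7C00u;      // overflow: infinity
    return sign | (uint16_t)(exp << 10) | mant;
}

static float halfToFloat(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t x = (exp == 0) ? sign  // zero/denormal -> signed zero
                            : sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &x, sizeof(f));
    return f;
}
```

With helpers like these, a kernel would read inputs via halfToFloat, accumulate in 32-bit floats, and write outputs via floatToHalf.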
Note: The Qualcomm® Neural Processing SDK provides the tensor data corresponding to all inputs and outputs of a UDO not directly but as an opaque pointer. The UDO implementation is expected to convert it to Qnn_Tensor_t, which holds the OpenCL memory pointer for the tensor.
Compiling a UDO for GPU on device
Steps to compile the GPU UDO implementation library for the Android platform are as follows:

Set the environment variable $SNPE_UDO_ROOT:

export SNPE_UDO_ROOT=<absolute_path_to_SnpeUdo_headers_directory>

$ANDROID_NDK_ROOT must be set for the Android NDK build toolchain:

export ANDROID_NDK_ROOT=<absolute_path_to_android_ndk_directory>

$CL_LIBRARY_PATH must be set to the libOpenCL.so library location:

export CL_LIBRARY_PATH=<absolute_path_to_OpenCL_library>

The OpenCL shared library is not distributed as part of the Qualcomm® Neural Processing SDK.
Run the make instruction below in the UDO package directory to compile the UDO package:
make gpu_android
Note: The shared OpenCL library is target specific. It
should be discoverable in CL_LIBRARY_PATH.
The expected artifacts after compiling for Android GPU are
The UDO GPU implementation library: <UDO-Package>/libs/arm64-v8a/libUdo<UDO-Package>ImplGpu.so
The UDO package registration library: <UDO-Package>/libs/arm64-v8a/libUdo<UDO-Package>Reg.so
A copy of shared standard C++ library: <UDO-Package>/libs/arm64-v8a/libc++_shared.so
Implementing a UDO for DSP V65 and V66
The Qualcomm® Neural Processing SDK utilizes the Qualcomm® AI Direct SDK to run UDO layers on the DSP. Therefore, a DSP implementation library based on the Qualcomm® AI Direct SDK APIs is required to run a UDO package on the DSP runtime. The UDO package generator tool creates the template file <OpName>.cpp, and the user needs to implement the execution logic in the <OpName>_executeOp() function in the template file.
To achieve good performance and stability, heap memory allocation must be avoided in the completed <OpName>_executeOp() functions. Heap memory allocation includes, but is not limited to, calling malloc or operator new, constructing STL container objects such as std::vector with the default allocator, and adding items (for example, via std::vector::push_back) to STL container objects with the default allocator. Please check here for more information.
Qualcomm® Neural Processing SDK UDO provides support for multi-threading of the operation using worker threads, for Hexagon Vector Extensions (HVX) code, and for VTCM.
The DSP runtime only propagates unsigned 8-bit activation tensors between the network layers, but it has the ability to de-quantize the data to floating point if required. Users developing DSP kernels can therefore expect either UINT_8 or FLOAT_32 activation tensors in and out of the operation, and can set the field data_type in the config file to either of these two settings. Refer to Defining a UDO for more details.
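The relationship between the UINT_8 tensors the DSP propagates and the float values they represent can be sketched with a common affine quantization scheme, real = scale * (quantized - offset). The helper names and the exact parameter convention below are illustrative, not the SDK's API:

```cpp
#include <cassert>
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative affine quantization parameters: real = scale * (q - offset).
struct QuantParams {
    float scale;
    int offset;
};

// Map a real value to its nearest uint8 code, clamping to [0, 255].
static uint8_t quantizeU8(float real, QuantParams q) {
    int v = (int)std::lround(real / q.scale) + q.offset;
    return (uint8_t)std::min(255, std::max(0, v));
}

// Recover the real value a uint8 code represents.
static float dequantizeU8(uint8_t u, QuantParams q) {
    return q.scale * (float)((int)u - q.offset);
}
```

A kernel that declares UINT_8 activations works directly on the integer codes; one that declares FLOAT_32 lets the runtime perform a conversion like dequantizeU8 before the operation runs.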
Compiling a UDO for DSP V65 and V66 on device
This Qualcomm® Neural Processing SDK release supports building UDO DSP implementation libraries using Hexagon-SDK 3.5.x.
Set the environment variables:

$SNPE_UDO_ROOT:

export SNPE_UDO_ROOT=<absolute_path_to_SnpeUdo_headers_directory>

Hexagon-SDK needs to be installed and set up. For details, follow the setup instructions on the $HEXAGON_SDK_ROOT/docs/readme.html page, where $HEXAGON_SDK_ROOT is the location of the Hexagon-SDK installation. Make sure $HEXAGON_SDK_ROOT is set to use the Hexagon-SDK build toolchain. Also set $HEXAGON_TOOLS_ROOT and $SDK_SETUP_ENV:

export HEXAGON_SDK_ROOT=<path to hexagon sdk installation>
export HEXAGON_TOOLS_ROOT=$HEXAGON_SDK_ROOT/tools/HEXAGON_Tools/8.3.07
export ANDROID_NDK_ROOT=<path to Android NDK installation>
export SDK_SETUP_ENV=Done

$ANDROID_NDK_ROOT must be set for the Android NDK build toolchain:

export ANDROID_NDK_ROOT=<absolute_path_to_android_ndk_directory>
Run the make instruction below in the UDO package directory to compile the UDO DSP implementation library:
make dsp
The expected artifacts after compiling for DSP are
The UDO DSP implementation library: <UDO-Package>/libs/dsp_<dsp_arch_type>/libUdo<UDO-Package>ImplDsp.so
The UDO package registration library: <UDO-Package>/libs/arm64-v8a/libUdo<UDO-Package>Reg.so
Note: The command must be run from the package root. A dsp_v60 folder is created for all DSP architectures lower than v68.
Implementing a UDO for DSP V68 or later
The Qualcomm® Neural Processing SDK utilizes the Qualcomm® AI Direct SDK to run UDO layers on DSP v68 or later. Therefore, a DSP implementation library based on the Qualcomm® AI Direct SDK APIs is required to run a UDO package on the DSP runtime. The UDO package generator tool creates the template file <OpName>ImplLibDsp.cpp, and the user needs to implement the execution logic in the <OpName>Impl() function in the template file.
To achieve good performance and stability, heap memory allocation must be avoided in the completed <OpName>Impl() functions. Heap memory allocation includes, but is not limited to, calling malloc or operator new, constructing STL container objects such as std::vector with the default allocator, and adding items (for example, via std::vector::push_back) to STL container objects with the default allocator. Please check here for more information.
Qualcomm® Neural Processing SDK UDO provides support for Hexagon Vector Extensions (HVX) code and cost-based scheduling.
The DSP runtime propagates unsigned 8-bit or unsigned 16-bit activation tensors between the network layers, but it has the ability to de-quantize the data to floating point if required. Users developing DSP kernels can therefore expect UINT_8, UINT_16, or FLOAT_32 activation tensors in and out of the operation, and can set the field data_type in the config file to one of these three settings. Refer to the Qualcomm® AI Direct SDK for more details.
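As a sketch of how quantization parameters relate to a tensor's float range, the helper below derives a scale and offset from observed min/max values for any unsigned bit width (qmax = 255 for UINT_8, 65535 for UINT_16). The names are illustrative, and this is one common asymmetric scheme rather than the SDK's exact rule:

```cpp
#include <cassert>
#include <cmath>

// Illustrative affine parameters for real = scale * (q - offset),
// with q in [0, qmax].
struct AffineParams {
    float scale;
    int offset;
};

static AffineParams chooseParams(float minVal, float maxVal, int qmax) {
    // Extend the range to include zero so that real 0.0 maps to an
    // exact integer code (important for zero padding).
    if (minVal > 0.0f) minVal = 0.0f;
    if (maxVal < 0.0f) maxVal = 0.0f;
    AffineParams p;
    p.scale = (maxVal - minVal) / (float)qmax;
    if (p.scale == 0.0f) p.scale = 1.0f;  // degenerate all-zero range
    p.offset = (int)std::lround(-minVal / p.scale);
    return p;
}
```

For the same float range, a 16-bit tensor's step size is 256 times smaller than an 8-bit one, which is the accuracy benefit of choosing UINT_16.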
Compiling a UDO for DSP V68 or later on device
This Qualcomm® Neural Processing SDK release supports building UDO DSP implementation libraries using Hexagon-SDK 4.x and Qualcomm® AI Direct SDK.
Set the environment variables:

$SNPE_UDO_ROOT:

export SNPE_UDO_ROOT=<absolute_path_to_SnpeUdo_headers_directory>

Hexagon-SDK 4.0+ needs to be installed and set up. For Hexagon-SDK details, follow the setup instructions on the $HEXAGON_SDK4_ROOT/docs/readme.html page, where $HEXAGON_SDK4_ROOT is the location of the Hexagon-SDK installation. Make sure $HEXAGON_SDK4_ROOT is set to use the Hexagon-SDK build toolchain. Also set $HEXAGON_TOOLS_ROOT and $SDK_SETUP_ENV. Additionally, an extracted Qualcomm® AI Direct SDK is needed for building the libraries (running the Qualcomm® AI Direct SDK setup is not required). For Qualcomm® AI Direct SDK details, refer to the documentation at the $QNN_SDK_ROOT/docs/QNN/index.html page, where $QNN_SDK_ROOT is the location of the Qualcomm® AI Direct SDK installation. Set $QNN_SDK_ROOT to the unzipped Qualcomm® AI Direct SDK location.

export HEXAGON_SDK_ROOT=<path to hexagon sdk installation>
export HEXAGON_SDK4_ROOT=<path to hexagon sdk 4.x installation>
export HEXAGON_TOOLS_ROOT=$HEXAGON_SDK_ROOT/tools/HEXAGON_Tools/8.4.09
export QNN_SDK_ROOT=<path to QNN sdk installation>
export ANDROID_NDK_ROOT=<path to Android NDK installation>
export SDK_SETUP_ENV=Done

$ANDROID_NDK_ROOT must be set for the Android NDK build toolchain:

export ANDROID_NDK_ROOT=<absolute_path_to_android_ndk_directory>
Run the make instruction below in the UDO package directory to compile the UDO DSP implementation library:
make dsp
Run the make instruction below in the UDO package directory to generate a library for offline cache generation:
make dsp_x86 X86_CXX=<path_to_x86_64_clang>
Run the make instruction below in the UDO package directory to generate a library for the Android ARM architecture:
make dsp_aarch64
Note: This should only be run on Linux-based devices, not on Windows-based devices.
The expected artifacts after compiling for DSP are
The UDO DSP implementation library: <UDO-Package>/libs/dsp_v68/libUdo<UDO-Package>ImplDsp.so
The UDO package registration library: <UDO-Package>/libs/arm64-v8a/libUdo<UDO-Package>Reg.so
The expected artifact after compiling for offline cache generation is
The UDO DSP implementation library: <UDO-Package>/libs/x86-64_linux_clang/libUdo<UDO-Package>ImplDsp.so
The expected artifact after compiling for Android ARM architecture is
The UDO DSP implementation library: <UDO-Package>/libs/arm64-v8a/libUdo<UDO-Package>ImplDsp_AltPrep.so
Note: The command must be run from the package root.
Table of Make Targets
| Make Target | Runtime | Platform | Misc. |
|---|---|---|---|
| all | CPU, GPU, DSP | x86, arm64-v8a | |
| all_x86 | CPU | x86 | |
| all_android | CPU, GPU, DSP | arm64-v8a | |
| reg | | x86, arm64-v8a | |
| reg_x86 | | x86 | |
| reg_android | | arm64-v8a | |
| cpu | CPU | x86, arm64-v8a | |
| cpu_x86 | CPU | x86 | Same as all_x86 |
| cpu_android | CPU | arm64-v8a | |
| gpu | GPU | arm64-v8a | |
| gpu_android | GPU | arm64-v8a | Same as gpu |
| dsp | DSP | | |
| dsp_android | DSP | | Same as dsp |
| dsp_x86 | DSP | | |
| dsp_aarch64 | DSP | | |
Note: By default, compiling for a runtime also compiles the corresponding registration library.