Creating a UDO Package
This section describes the process of creating a UDO package from a simple text specification of a user-defined operation using the snpe-udo-package-generator. From the Qualcomm® Neural Processing SDK API standpoint, a UDO package consists of a registration library and one or more implementation libraries. As such, while a user can create a UDO package independent of this prescription, this section describes the process of creating a partially defined UDO package which can be easily implemented and compiled to produce the relevant libraries.
Generating UDO Skeleton Code
To generate a package using Qualcomm® Neural Processing SDK tools, it is necessary to create a UDO configuration describing the operation and the package details. See Defining a UDO Package for more information. Once a configuration has been specified to adequately represent the desired UDO, it can be supplied as an argument to the Qualcomm® Neural Processing SDK UDO package generator tool described in snpe-udo-package-generator. The intention of the tool is to generate partial skeleton code to aid rapid prototyping. This section describes the usage of the package generator tool and the artifacts it generates.
In order to run snpe-udo-package-generator, the user is expected to have followed the setup instructions at Qualcomm® Neural Processing SDK Setup. The tool also depends on the Mako Template Library, available at https://www.makotemplates.org/download.html. Additionally, an extracted Qualcomm® AI Direct SDK is required to generate the skeleton code (running the Qualcomm® AI Direct SDK setup is not necessary). For Qualcomm® AI Direct SDK details, refer to its documentation at $QNN_SDK_ROOT/docs/index.html, where QNN_SDK_ROOT is the location of the Qualcomm® AI Direct SDK installation; set $QNN_SDK_ROOT to the unzipped Qualcomm® AI Direct SDK location. Once setup is complete, the following command can be used to generate a package:
snpe-udo-package-generator -p $SNPE_ROOT/examples/SNPE/NativeCpp/UdoExample/Softmax/config/Softmax_Htp.json -o <my-dir>
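As a sketch of the environment setup described above, the two SDK locations might be exported as follows. The paths here are hypothetical; substitute the actual locations of your installations.

```shell
# Hypothetical installation paths; adjust to your own setup.
export SNPE_ROOT=/opt/snpe-sdk        # Qualcomm Neural Processing SDK root
export QNN_SDK_ROOT=/opt/qnn-sdk      # unzipped Qualcomm AI Direct SDK
```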
The above command creates a UDO package: a directory of skeleton code and build files that can be compiled into stand-alone shared libraries. The config file referenced in UDO Tutorial was used to generate the UDO package contents shown below:
|-- Makefile
|-- common.mk
|-- config
| `-- Softmax_Htp.json
|-- include
| `-- utils
| |-- IUdoOpDefinition.hpp
| |-- UdoMacros.hpp
| `-- UdoUtil.hpp
`-- jni
|-- Android.mk
|-- Application.mk
`-- src
|-- CPU
| |-- Makefile
| |-- makefiles
| | |-- Android.mk
| | |-- Application.mk
| | `-- Makefile.linux-x86_64
| `-- src
| |-- CpuCustomOpPackage.cpp
| |-- SoftmaxUdoPackageInterface.cpp
| |-- ops
| | `-- Softmax.cpp
| `-- utils
| |-- BackendUtils.hpp
| |-- CPU
| | |-- CpuBackendUtils.cpp
| | `-- CpuBackendUtils.hpp
| `-- CustomOpUtils.hpp
|-- DSP_V68
| |-- Makefile
| `-- src
| |-- SoftmaxUdoPackageInterface.cpp
| `-- ops
| `-- Softmax.cpp
|-- GPU
| |-- Makefile
| |-- include
| | |-- GpuCustomOpPackage.hpp
| | `-- Operation.hpp
| |-- makefiles
| | |-- Android.mk
| | `-- Application.mk
| `-- src
| |-- GpuCustomOpPackage.cpp
| |-- SoftmaxUdoPackageInterface.cpp
| `-- ops
| `-- Softmax.cpp
|-- reg
| |-- Makefile
| `-- SoftmaxUdoPackageRegLib.cpp
`-- utils
`-- UdoUtil.cpp
Contents of a UDO package
The package can be compiled using the make build system for a Linux host machine, or the Android NDK build system for an Android device. Briefly, the make system is configured by the top-level Makefile, common.mk, and the individual makefiles in each runtime directory. The Android NDK build system is configured by jni/Android.mk and jni/Application.mk. See Compiling a UDO package for more compilation details.
The config directory contains the JSON configuration used to create the package.
The include directory contains three kinds of files: headers from the Qualcomm® Neural Processing SDK UDO API, header files specific to the UDO package and its operations, and a directory of C++ helper utils which wrap the Qualcomm® Neural Processing SDK UDO API calls. Users should note that the utils API is included simply for convenience in creating implementation source code. The use of the utils is not a prerequisite for constructing or executing a UDO package.
The relevant source files for the package are organized under the jni/src directory. There will be a sub-directory for each core-type specified in the config. The registration (reg) directory contains files necessary to create the registration library, which is generally the point of entry for the Qualcomm® Neural Processing SDK API. There is also source code from the previously mentioned C++ helper utils. In general, users are only expected to edit code contained in runtime-specific or registration directories.
Generated Source Code
This section and the following sub-sections cover the source code generated in a package using the package contents displayed in Generating UDO Skeleton Code. When finalized, a UDO package is expected to contain a registration library and one or more implementation libraries. To produce the registration library, the source code in jni/src/reg is compiled. The implementation library is compiled using source code from each core-type specific directory. Recall that the package created by the tool will still need to be implemented. The following subsections will address the files that need to be implemented. All generated source code will have the tag Auto-generated in the header. The source code is considered partially complete in the generation stage, and it is the user’s responsibility to implement certain files as needed to ensure proper compatibility and functionality with the Qualcomm® Neural Processing SDK API. All code to be implemented will have the tag add code here in the body to indicate that it needs to be implemented. Note that all libraries link against the C++ utils source code.
Completing the Registration Skeleton Code
As mentioned previously, the registration library is created from source code in jni/src/reg. The directory contains a Makefile to compile the package and the package specific file: SoftmaxUdoPackageRegLib.cpp which contains the function symbols that get resolved by the Qualcomm® Neural Processing SDK UDO API when the library is opened. The registration library file contains API calls that provide the Qualcomm® Neural Processing SDK UDO API with information about the nature of the operations in the model, as well as the implementation libraries they belong to.
Completing the Implementation Skeleton Code
The implementation library is created per core-type, from source code that lives under the core-type specific directory within jni/src. Using the CPU runtime as an example, the jni/src/CPU directory contains a Makefile to build the CPU implementation library, a package-specific source file, SoftmaxUdoPackageInterface.cpp, covering all operations to be contained in the library, and a per-operation source file, Softmax.cpp, that should contain the runtime implementation. As in the registration case, the package-specific source file should generally not be edited; it contains methods that return information about the operations contained in the implementation library, as well as methods that act as a layer of indirection above the code ultimately executed in the per-operation file. In the CPU case, the three methods in Softmax.cpp, namely finalize, execute, and free, are the user's responsibility to edit. These methods create the operation, execute its implementation, and free the operation respectively, and as such are completely determined by the user. A sample generated version of the implementation library is included below:
Qnn_ErrorHandle_t execute(CustomOp* operation) {
  /**
   * Add code here
   **/
  return QNN_SUCCESS;
}

Qnn_ErrorHandle_t finalize(const CustomOp* operation) {
  QNN_CUSTOM_BE_ENSURE_EQ(operation->numInput(), 1, QNN_OP_PACKAGE_ERROR_VALIDATION_FAILURE)
  QNN_CUSTOM_BE_ENSURE_EQ(operation->numOutput(), 1, QNN_OP_PACKAGE_ERROR_VALIDATION_FAILURE)
  /**
   * Add code here
   **/
  return QNN_SUCCESS;
}

Qnn_ErrorHandle_t free(CustomOp& operation) {
  /**
   * Add code here
   **/
  return QNN_SUCCESS;
}
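To illustrate the kind of logic a user might add to the generated execute body, here is a minimal, numerically stable softmax sketch. The Tensor and CustomOp types below are simplified stand-ins, not the actual generated package types; a real implementation would return Qnn_ErrorHandle_t and use the package's tensor accessors instead.

```cpp
#include <algorithm>
#include <cassert>  // for the usage example below
#include <cmath>
#include <cstddef>

// Simplified stand-ins for illustration only; the real types come from
// the generated package headers and the Qualcomm AI Direct SDK.
struct Tensor {
    float* data;
    std::size_t size;
};
struct CustomOp {
    Tensor* input;
    Tensor* output;  // assumed pre-allocated by the framework
};

// Numerically stable softmax over the flattened input tensor.
// Note: no heap allocation is performed during execution, in line with
// the guidance on execution functions below.
void softmaxExecute(CustomOp* op) {
    const float* in = op->input->data;
    float* out = op->output->data;
    const std::size_t n = op->input->size;

    const float maxVal = *std::max_element(in, in + n);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::exp(in[i] - maxVal);
        sum += out[i];
    }
    for (std::size_t i = 0; i < n; ++i) {
        out[i] /= sum;
    }
}
```

Subtracting the input maximum before exponentiating avoids overflow for large activation values without changing the result.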
For good performance and stability, heap memory allocation must be avoided in the completed op execution functions, that is, the <op_name>Impl, <op_name>_executeOp, <op_name>Operation, and execute functions for DSP V68 and later, DSP V66 / V65, GPU, and CPU respectively, all of which run during graph execution. Heap memory allocation includes, but is not limited to, calling malloc or operator new, constructing STL container objects such as std::vector with the default allocator, and adding items (for example via std::vector::push_back) to STL containers with the default allocator.
The reason to avoid heap memory allocation is that the time to complete an allocation is unbounded and may vary widely. On DSP and HTP in particular, heap allocation can trigger a CPU request in some cases and significantly impact inference speed. Heap allocation can also fail, returning null pointers or throwing exceptions, in which case there is usually no good way to continue execution. In applications with strict functional-safety requirements, heap memory allocation after initialization is not permitted at all.
If a working buffer is required to carry out the op computation, here are some potential alternatives:

- Construct std::array instead of std::vector for local variables: unlike std::vector, std::array uses stack memory. This works if the maximum memory size can be known in advance and the size is not large.
- Use output tensor space as scratch memory: each execution function has at least one output tensor, and its space can serve as a scratch buffer before the real output data is filled in. Please note that the output tensor space can only be safely written in the execution function which owns the output tensor.
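The first alternative can be sketched as follows. The function and the maximum element count are hypothetical, chosen only to show a stack-allocated scratch buffer in place of a heap-allocated std::vector.

```cpp
#include <array>
#include <cassert>  // for the usage example below
#include <cstddef>

// Hypothetical upper bound on the element count, known at compile time.
constexpr std::size_t kMaxElems = 256;

// Sums the squares of the first `n` inputs using a stack-allocated
// scratch buffer, so no heap allocation occurs during execution.
float sumOfSquares(const float* in, std::size_t n) {
    std::array<float, kMaxElems> scratch{};  // lives on the stack
    float total = 0.0f;
    for (std::size_t i = 0; i < n && i < kMaxElems; ++i) {
        scratch[i] = in[i] * in[i];
        total += scratch[i];
    }
    return total;
}
```

Because std::array has a fixed compile-time size, this approach is only appropriate when the worst-case buffer size is known and small enough to fit comfortably on the stack.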
Notes
In the general case, the package should only require functional edits that enable proper execution. The initial, unimplemented package is guaranteed to compile.
One subtle distinction is that the generated DSP V65 or DSP V66 implementation source code expects one operation per implementation library, while in the CPU, GPU, and DSP V68-or-later cases there may be an arbitrary number of operations in a library.
There are differences between the implementation source files for each runtime. In the GPU case, the execute workflow is already implemented, and the user is only expected to implement the <OpName>Operation and setKernelInfo methods. In contrast to CPU and GPU, DSP uses an API that does not depend on the C++ helper utils discussed in the Generated Source Code section, which means certain helper methods and constructors may not be available in the DSP case. For the DSP case, the user is expected to implement the softmaxImpl method.