Block Op Overview
Machine Learning (ML) compute is typically expressed using operators in frameworks such as PyTorch or ONNX. The following limitations arise when expressing the computational graph and compiling it for a QAIRT backend/device.
The same computation can be expressed using different subgraphs, making it challenging for ML compilers to pattern-match the subgraphs to an efficient kernel for a QAIRT backend.
The operators present in the framework are not sufficient to express the computation efficiently or adequately, leading to an inflated graph or an inability to capture the desired compute at the framework level.
Naively quantizing transformed or lowered operators can result in a loss of precision.
To assist tools in mapping commonly occurring computational graphs to QAIRT backends, the computational graph is packaged as a block op in the source framework as a Python module. Framework-specific QAIRT converters can then detect these block ops and map them appropriately to the QAIRT backends.
Block ops are available as Python modules that can be imported into the source framework to compose models.
Block ops are direct replacements in the model; no retraining is required.
Block ops are similar to ONNX contrib ops.
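As a concrete sketch of the idea, the snippet below packages an RMSNorm-style subgraph as a single PyTorch module. The class name `RMSNormBlock` and its interface are illustrative assumptions, not the actual QAIRT block-op API; the point is that a converter can key on the module type instead of matching the individual primitives.

```python
import torch
import torch.nn as nn

# Hypothetical block op: packages the RMSNorm subgraph (pow, mean,
# rsqrt, mul) as one nn.Module.  A framework-specific converter could
# detect this module type and map the whole block to a single backend
# kernel, rather than pattern-matching the primitive sequence.
# Names here are illustrative, not the actual QAIRT block-op API.
class RMSNormBlock(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # The same primitives a hand-written RMSNorm would use;
        # wrapping them keeps the operator boundary visible.
        var = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(var + self.eps) * self.weight

# Drop-in usage: the block replaces an equivalent hand-rolled
# subgraph in the model without retraining.
x = torch.randn(2, 8)
y = RMSNormBlock(8)(x)
```

Because the block's forward pass is built from ordinary framework primitives, the model still trains and runs unchanged in PyTorch; only the converter treats it as one unit.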
Benefits
Performance: Compilers resort to pattern matching to fuse multiple operators into a single operator to improve performance, but robust pattern matching is hard to achieve. By capturing a subgraph of operators in a block op, model writers can obtain more robust performance improvements.
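To illustrate why module-level matching is more robust than graph-level matching, consider the tanh-approximate GELU: as a flat graph it is a chain of mul, pow, tanh, and add nodes that a compiler must recognize exactly, and any refactor of that chain breaks the match. Wrapped in a block op, the converter only needs to match the module type. `GELUBlock` is a hypothetical name for illustration.

```python
import math
import torch
import torch.nn as nn

# The tanh-approximate GELU as a block op.  At the graph level this
# is a fragile pattern (mul/pow/tanh/add in a specific shape); as a
# module it is matched by type.  GELUBlock is an illustrative name,
# not part of the QAIRT API.
class GELUBlock(nn.Module):
    def forward(self, x):
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))
        return 0.5 * x * (1.0 + torch.tanh(inner))
```

Numerically this matches PyTorch's built-in `F.gelu(x, approximate="tanh")`, so swapping it into a model changes nothing at the framework level while making the fusion opportunity explicit to the converter.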
Functionalities: Many useful functionalities, such as dynamic behavior and statefulness, are currently missing from the frameworks. Block ops allow unique operators to be defined and exposed to frameworks for model writers to use in their models.
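A minimal sketch of statefulness, one of the functionalities mentioned above: a block op that carries state across calls in a registered buffer. Statefulness like this is awkward to express in a flat, stateless operator graph, but as a module the state remains an explicit, inspectable part of the op. The class name and state layout are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Sketch of a stateful block op: a running sum that persists across
# forward calls via a registered buffer.  The name RunningSumBlock
# and its behavior are illustrative assumptions, not a QAIRT op.
class RunningSumBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Persistent, non-trainable state owned by the op itself.
        self.register_buffer("state", torch.zeros(dim))

    def forward(self, x):
        # Accumulate into the buffer and return the running total.
        self.state.add_(x)
        return self.state.clone()
```

Because the state is a registered buffer, it moves with the module across devices and is captured when the module is exported, so a converter that recognizes the block can preserve the stateful semantics.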
Accuracy: Quantizing a heavily lowered graph can cause a loss of precision. Block ops retain operator boundaries in the framework, so quantization can be applied before the operator is lowered. This can help avoid loss of precision in some models.