Overall workflow
Compilation and optimization of models is done ahead of time (AOT), as opposed to just in time (JIT). The idea is that a model is optimized once and run many times. When an optimized model is used for a long time, the relative cost of optimization becomes negligible, even if, in absolute terms, it takes hours or even days.
This approach enables deeper and more complex optimization strategies, leading to better end results. It also enables a simpler and smaller runtime environment, as it does not need a JIT compiler and its dependencies.
The resulting workflow can be separated into three steps.
Preparing the model:
In this step the user converts the model from a framework-specific format such as PyTorch or TensorFlow to the more general ONNX format. The user also creates a model program, which describes the model as well as the datasets and metrics used during optimization and, later, inference.
Optimizing the model:
The model, its program and optional custom data (such as datasets) are passed as inputs to the optimizer through a web interface. After the optimization completes, the optimized model is stored in the persistent storage of the SaaS environment. It can optionally be uploaded to a configured external storage, or built into a Docker image and pushed to a registry.
Executing the optimized model:
Interaction with the model is typically done remotely through a client-server setup, whose protocol is specified by the user.
Details
Converting models
Many models available on Huggingface can be exported to ONNX using Huggingface's Optimum Python package. Custom PyTorch models can easily be exported using PyTorch's torch.onnx.export(...) functionality, as described in the PyTorch documentation.
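For a custom model, the export can be as simple as the following sketch; the model, input shape and file names are illustrative.

```python
# Minimal sketch: exporting a custom PyTorch model to ONNX.
# The model, input shape and file names are illustrative.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
dummy_input = torch.randn(1, 32)  # example input used to trace the model

torch.onnx.export(
    model,
    dummy_input,
    "tiny_classifier.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # allow variable batch size
)
```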
The model program
Besides the ONNX model, the user is expected to provide a so-called model program, which defines everything needed to run and optimize the model, for example:
References to the actual ONNX model files and how to run them.
The datasets to use during optimization, which metrics to target and the acceptable metric value ranges.
The input and output data formats, specified using Protocol Buffers, and mapping code from the client/server protocol data types to the model I/O data types.
Implementing the model program is straightforward, as Inceptron provides documentation, examples, Python interfaces and default implementations of those interfaces. The communication protocol, as well as the server and client implementations, are automatically derived from the user-provided Protocol Buffers.
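As a rough illustration, a model program could look something like the sketch below. The class and method names are hypothetical and do not reflect Inceptron's actual interfaces; they only indicate the kind of information the program ties together.

```python
# Hypothetical model program sketch; class and method names are illustrative,
# not Inceptron's actual API. File names are placeholders.
import numpy as np

class ExampleModelProgram:
    """Ties together the ONNX model, evaluation data, metrics and I/O mapping."""

    model_files = ["tiny_classifier.onnx"]

    def load_validation_data(self):
        # The datasets selected here are the only data exposed to the optimizer.
        return np.load("validation_inputs.npy"), np.load("validation_labels.npy")

    def metric(self, outputs, labels):
        # Target metric with an acceptable range, e.g. accuracy must stay above 0.95.
        accuracy = float((outputs.argmax(axis=-1) == labels).mean())
        return {"accuracy": accuracy, "min_acceptable": 0.95}

    def request_to_input(self, request):
        # Map the protobuf request message to the model's input tensor.
        return np.asarray(request.features, dtype=np.float32).reshape(1, -1)

    def output_to_response(self, logits, response):
        # Map the model output back to the protobuf response message.
        response.predicted_class = int(logits.argmax())
        return response
```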
The optimizer
Inceptron has chosen to provide the optimizer as a SaaS solution, as it contains proprietary IP in the form of compiler passes. During optimization, the end user can choose which validation data sets to use, thus controlling what data is exposed to third parties.
The optimizer is based on Apache TVM, which is a framework for building optimizing ML compilers. Inceptron improves upon TVM's already strong baseline optimizations with custom compiler passes that implement novel optimization and compression techniques.
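For context, a plain TVM compilation of an ONNX model, without any of Inceptron's proprietary passes, looks roughly like this; the file name and input shape are illustrative.

```python
# Plain Apache TVM flow for compiling an ONNX model; Inceptron's proprietary
# passes are not shown. File name and input shape are illustrative.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("tiny_classifier.onnx")
shape_dict = {"input": (1, 32)}

# Import the ONNX graph into Relay, TVM's high-level IR.
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Standard graph-level optimizations; custom passes would be added here.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

# Persist the compiled artifact for the runtime environment.
lib.export_library("tiny_classifier_optimized.so")
```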
The execution environment
The execution environment is fully containerized using Docker. Its image is based on Ubuntu and contains Inceptron's custom TVM runtime. The optimized model files and a gRPC server are either mounted at runtime or built into a custom image based on the generic runtime image.
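A client interacting with such a server could look like the following sketch. The inference_pb2 modules, service name and message fields are hypothetical; in practice they are generated from the user's own Protocol Buffer definitions.

```python
# Hypothetical gRPC client; inference_pb2 / inference_pb2_grpc would be
# generated from the user's Protocol Buffers, and the service and field
# names below are illustrative.
import grpc

import inference_pb2
import inference_pb2_grpc

def main():
    # The server address depends on where the runtime container is deployed.
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = inference_pb2_grpc.InferenceStub(channel)
        request = inference_pb2.PredictRequest(features=[0.1] * 32)
        response = stub.Predict(request)
        print("Predicted class:", response.predicted_class)

if __name__ == "__main__":
    main()
```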