Deploying Models with OctoML¶
Once you have accelerated and packaged a model using either the web interface or Python SDK, deployment of the optimized model to your target hardware is simple.
Python wheel deployment¶
Required Dependencies¶
A Linux x86 machine is required. The following software dependencies are required to run the OctoML-packaged model. For Debian/Ubuntu images you must have:
- ldd --version shows GLIBC >= 2.27 (for CPU packages, GLIBC >= 2.24 should be sufficient)
- sudo apt-get install libssl-dev zlib1g-dev build-essential
- sudo apt-get install libffi-dev
  Note: if you've compiled your own Python distribution (e.g. through pyenv install), you will need to re-compile it after this step.
- Python >= 3.7 (3.7 recommended)
Additional GPU Dependencies (TVM only)¶
If you have packaged your model for a GPU platform, you will need CUDA 11.1 installed (or Vulkan 1.2.135 for Vulkan-based GPU platforms).
For Vulkan-based GPU platforms on Ubuntu, the Vulkan dependency can be installed by running apt install libvulkan-dev.
Note: Currently GPU packaging is only available for TVM but is coming soon for ONNX-RT.
Running Model Inference¶
First, download the Python wheel file produced by the OctoML Platform. In the SDK, this is done with:
wrkflow = modelvar.octomize(PLATFORM)
wrkflow.wait()
wrkflow.save_package(OUTPUT_DIRECTORY)
This will save a file in the form of <package_name>-0.1.0-py3-none-any.whl, where <package_name> will be the specified package name or, if unspecified, a name derived from the name field of your model. For derived package names, note that non-alphanumeric characters will be replaced with underscores (‘_’) and leading/trailing underscores will be stripped. Package names must only contain lower case letters, numbers, and underscores (which may not be repeating, leading, or trailing).
valid: a, my_package_name, my_package_name_2
invalid: a__b, _my_package_name, my_package_name_
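As an illustration only (not the platform's actual implementation), the derivation rule can be sketched in a few lines of Python; collapsing repeated underscores is an assumption implied by the validity rules above:
import re

def derive_package_name(model_name: str) -> str:
    # Replace non-alphanumeric characters with underscores, lower-case the
    # result, then collapse repeated underscores and strip leading/trailing
    # ones so the result satisfies the rules above.
    name = re.sub(r"[^0-9a-zA-Z]", "_", model_name).lower()
    name = re.sub(r"_+", "_", name)
    return name.strip("_")

derive_package_name("My ResNet-50 Model!")  # -> "my_resnet_50_model"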
You can install the wheel into your Python environment using the Python pip command:
$ pip install <package_name>-0.1.0-py3-none-any.whl
Once installed, the model can be imported and invoked from Python as follows:
import <package_name>
import numpy as np
model = <package_name>.OctomizedModel()
Please confirm that input info is correct for the packaged model with:
idict = model.get_input_dict()
print(idict)
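For a typical single-input image model, the printed dictionary might look something like the following; the input name, shape, and dtype shown here are purely hypothetical, so use whatever your own model reports:
# Hypothetical example output:
# {'input_1': {'shape': [1, 3, 224, 224], 'dtype': 'float32'}}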
Now you can provide inputs to run inference on.
Benchmarking¶
To benchmark the model with randomly generated inputs, run:
mean_ms, std_ms = model.benchmark()
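Assuming the returned values are latencies in milliseconds (as the variable names suggest), you can report them directly:
print(f"mean latency: {mean_ms:.2f} ms (std dev: {std_ms:.2f} ms)")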
Inference for single image input¶
If you have a model that takes in an image, you can run something like:
import cv2
image_path = <image path here>
image_size = <image size here> # Please consult `idict` above for value
input_dtype = <input dtype here> # Please consult `idict` above for value
# If the image is in grayscale, instead of the following line, invoke
# img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
img = cv2.imread(image_path)
img = cv2.resize(img, dsize=(image_size, image_size))
# Note that if you provided an RGB image, `img.shape` will look like
# (image_size, image_size, 3). If the `idict` info you printed above indicates that
# your model expects an input of shape (1, 3, image_size, image_size), you
# should uncomment the following transposition to match the input data to the format
# expected by your model.
#
# img = img.transpose((2, 0, 1))
input_nparr = img.astype(input_dtype)
# The next line assumes that the batch dimension for this image is 1
input_nparr = input_nparr.reshape([1, *input_nparr.shape])
# If you provided a grayscale image, `img.shape` will look like
# (image_size, image_size). If the `idict` info you printed above indicates that
# your model expects an input of shape (1, 1, image_size, image_size), ensure that
# you properly resize the data to the expected format as follows:
# input_nparr = input_nparr.reshape([1, 1, image_size, image_size])
Note that you will need to adjust the above code depending on how your model’s inputs were pre-processed for training.
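For example, if your model was trained with the common ImageNet-style preprocessing, you might additionally scale and normalize the pixel values; the statistics below are only illustrative, so substitute whatever your training pipeline used:
# Illustrative ImageNet-style normalization for an NCHW float input.
# Note that cv2.imread loads images in BGR channel order; convert to RGB
# first if your model expects it.
mean = np.array([0.485, 0.456, 0.406], dtype="float32").reshape(1, 3, 1, 1)
std = np.array([0.229, 0.224, 0.225], dtype="float32").reshape(1, 3, 1, 1)
input_nparr = (input_nparr / 255.0 - mean) / std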
Now that you’ve processed your image, you can run your model on the processed inputs with:
outputs = model.run(input_nparr)
If you’d like to run your model with explicit device synchronization (so that outputs are guaranteed to be on the host before you interact with them), with the TVM Python package you can do:
outputs = model.run_with_device_sync(input_nparr)
Device synchronization should already be enabled by default for onnxruntime packages.
At this point please confirm the output is as you would expect. If your model produces multiple outputs, note that the order of outputs is preserved across model formats.
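For instance, if your model is an image classifier with a single logits output, a minimal, hypothetical sanity check of the top prediction could look like this:
first = outputs[0]
# TVM packages return NDArray objects, which need converting to numpy;
# ONNX-RT packages already return numpy arrays.
logits = first.numpy() if hasattr(first, "numpy") else first
print("predicted class index:", int(np.argmax(logits)))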
Inference for multiple inputs¶
This example code runs a model with multiple random NumPy array inputs; please adjust it for your own purposes:
idict = model.get_input_dict()
inputs = []
for iname in idict.keys():
    ishape = idict[iname]["shape"]
    idtype = idict[iname]["dtype"]
    inp = np.random.random(ishape).astype(idtype)
    inputs.append(inp)
# Run the model. If your model has multiple inputs, you may pass multiple inputs
# in the style of *args to the `run` function, or if you prefer, you may provide
# a dict of input name to value in **kwargs style.
outputs = model.run(*inputs)
# Note that the order of outputs is preserved from both ONNX and Relay formats.
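If you prefer the keyword-argument style mentioned in the comments above, a sketch of the equivalent call (built from the same input dictionary) might look like this:
# Build a name -> array mapping and pass it as keyword arguments.
inputs_by_name = {
    iname: np.random.random(spec["shape"]).astype(spec["dtype"])
    for iname, spec in idict.items()
}
outputs = model.run(**inputs_by_name)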
The return type of model.run is List[tvm.runtime.ndarray.NDArray], or List[numpy.ndarray] if you are using ONNX-RT. Documentation for NDArray can be found here.
To access the first output as a numpy object from the inference, you may run:
out = outputs[0].numpy()