Deploying Models with OctoML

Once you have accelerated and packaged a model using either the web interface or Python SDK, deployment of the optimized model to your target hardware is simple.

Python wheel deployment

Required Dependencies

A Linux x86 machine is required. The following software dependencies are required to run the OctoML-packaged model. For Debian/Ubuntu images you must have:

  1. ldd --version shows GLIBC >= 2.27.

  2. sudo apt-get install libssl-dev zlib1g-dev build-essential

  3. sudo apt-get install libffi-dev. Note: if you’ve compiled your own Python distribution (e.g. through pyenv install), you will need to re-compile it after this step.

  4. python >= 3.7 (3.7 recommended)
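
If you'd like a quick sanity check of the Python and glibc versions from within Python itself (an optional convenience sketch, not an OctoML requirement), you can run:

import platform
import sys

# Expect a Python 3.7+ interpreter and glibc 2.27 or newer.
print(sys.version_info)
print(platform.libc_ver())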

Additional GPU Dependencies (TVM only)

If you have packaged your model for a GPU platform, you will need CUDA 11.1 (or Vulkan 1.2.135 for Vulkan based GPU platforms) installed.

For Vulkan based GPU platforms on Ubuntu, the Vulkan dependency can be installed by running apt install libvulkan-dev.

Note: Currently GPU packaging is only available for TVM but is coming soon for ONNX-RT.

Running Model Inference

First, download a Python wheel file produced by the OctoML Platform. In the SDK, this is done with:

package_group = model.create_packages(PLATFORM)
print(package_group.uuid)
package_group.wait()
best_workflow = None
best_mean_latency = float("inf")
for workflow in package_group.workflows:
    metrics = workflow.metrics()
    if metrics.latency_mean_ms < best_mean_latency:
        best_mean_latency = metrics.latency_mean_ms
        best_workflow = workflow
best_workflow.save_package(OUTPUT_DIRECTORY)

This will save a file of the form <package_name>-0.1.0-py3-none-any.whl, where <package_name> is the specified package name or, if unspecified, a name derived from the name field of your model. For derived package names, non-alphanumeric characters are replaced with underscores (‘_’) and leading/trailing underscores are stripped, as sketched below. Package names must only contain lower case letters, numbers, and underscores (which may not be repeating, leading, or trailing).

  • valid: a, my_package_name, my_package_name_2

  • invalid: a__b, _my_package_name, my_package_name_
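
The following is a rough sketch of that derivation (an illustration only, not the platform's actual implementation):

import re

def derive_package_name(model_name: str) -> str:
    # Lowercase, collapse runs of non-alphanumeric characters into single
    # underscores, then strip leading/trailing underscores.
    name = re.sub(r"[^a-z0-9]+", "_", model_name.lower())
    return name.strip("_")

print(derive_package_name("My Model v2!"))  # my_model_v2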

You can install the wheel into your Python environment using the Python pip command:

$ pip install <package_name>-0.1.0-py3-none-any.whl

Once installed, the model can be imported and invoked from Python as follows:

import <package_name>
import numpy as np

model = <package_name>.OctomizedModel()

Please confirm that input info is correct for the packaged model with:

idict = model.get_input_dict()
print(idict)

Now you can provide inputs to run inference on.
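
For example, a minimal sketch that runs inference on a single random input (this assumes a model with one input; substitute real data and consult idict for your model's actual input name, shape, and dtype):

# Hypothetical single-input example based on the `idict` structure above.
iname = next(iter(idict))
input_nparr = np.random.random(idict[iname]["shape"]).astype(idict[iname]["dtype"])
outputs = model.run(input_nparr)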

Benchmarking

To benchmark the model with randomly generated inputs, run:

mean_ms, std_ms = model.benchmark()
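
The returned values are in milliseconds; for example, to report them:

print(f"mean latency: {mean_ms:.2f} ms (std: {std_ms:.2f} ms)")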

Inference for single image input

If you have a model that takes in an image, you can run something like:

import cv2
image_path = <image path here>
image_size = <image size here>       # Please consult `idict` above for value
input_dtype = <input dtype here>     # Please consult `idict` above for value

# If the image is in grayscale, instead of the following line, invoke
# img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
img = cv2.imread(image_path)
img = cv2.resize(img, dsize=((image_size, image_size)))

# Note that if you provided an RGB image, `img.shape` will look like
# (image_size, image_size, 3). If the `idict` info you printed above indicates that
# your model expects an input of shape (1, 3, image_size, image_size), you
# should uncomment the following transposition to match the input data to the format
# expected by your model.
#
# img = img.transpose((2, 0, 1))
input_nparr = np.array(img.tolist()).astype(input_dtype)
# The next line assumes that the batch dimension for this image is 1
input_nparr = input_nparr.reshape([1, *input_nparr.shape])

# If you provided a grayscale image, `img.shape` will look like
# (image_size, image_size). If the `idict` info you printed above indicates that
# your model expects an input of shape (1, 1, image_size, image_size), ensure that
# you properly resize the data to the expected format as follows:
# input_nparr.reshape([1, 1, image_size, image_size])

Note that you will need to adjust the above code depending on how your model’s inputs were pre-processed for training.
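
For example, if your model was trained on inputs scaled to [0, 1] and normalized per channel in NCHW layout (a common convention; the mean/std values below are the usual ImageNet statistics and are only an illustration, so substitute whatever your training pipeline used), the adjustment might look like:

# Hypothetical preprocessing; replace with your model's actual statistics.
input_nparr = input_nparr / 255.0
mean = np.array([0.485, 0.456, 0.406]).reshape(1, 3, 1, 1)
std = np.array([0.229, 0.224, 0.225]).reshape(1, 3, 1, 1)
input_nparr = ((input_nparr - mean) / std).astype(input_dtype)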

Now that you’ve processed your image, you can run your model on the processed inputs with:

outputs = model.run(input_nparr)

If you’d like to run your model with explicit device synchronization (so that outputs are guaranteed to be on the host before you interact with them), with the TVM Python package you can do:

outputs = model.run_with_device_sync(input_nparr)

Device sync should be enabled for onnxruntime by default.

At this point please confirm the output is as you would expect. If your model produces multiple outputs, note that the order of outputs is preserved across model formats.

Inference for multiple inputs

This example code runs a model with multiple random np array inputs – please adjust for your own purposes:

idict = model.get_input_dict()
inputs = []
for iname in idict.keys():
  ishape = idict[iname]["shape"]
  idtype = idict[iname]["dtype"]
  inp = np.random.random(ishape).astype(idtype)
  inputs.append(inp)

# Run the model. If your model has multiple inputs, you may pass multiple inputs
# in the style of *args to the `run` function, or if you prefer, you may provide
# a dict of input name to value in **kwargs style.
outputs = model.run(*inputs)

# Note that the order of outputs is preserved from both ONNX and Relay formats.

The return type of model.run is List[tvm.runtime.ndarray.NDArray], or List[numpy.ndarray] if you are using ONNX-RT. Documentation for NDArray is found here.

To access the first output as a numpy object from the inference, you may run:

out = outputs[0].numpy()
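
If your model has several outputs and you'd like to inspect all of them regardless of backend (a small convenience sketch; TVM returns NDArray objects with a .numpy() method, while ONNX-RT already returns numpy arrays, as noted above):

for i, out in enumerate(outputs):
    arr = out.numpy() if hasattr(out, "numpy") else out
    print(i, arr.shape, arr.dtype)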

Docker tarball deployment

OctoML’s WebUI provides the option to Download Docker tarballs:


This tutorial describes how to build Docker images from these tarballs and push these to AWS ECR.

For information about how to run inference against deployed containers, see Running Inference Against the Triton Server.

Required Dependencies

To follow this tutorial, you’ll need the following:

  • docker installed on the machine

  • docker daemon running

Building a Docker image

If a “Docker” format is available for your packaged model, do the following:

  • select the package

  • download the package

  • extract the tar.gz package

Example

tar -xzvf your_tarball_name_here.tar.gz

Script: build.sh

Included with the tarball is a build script: build.sh

This script ensures the docker image is correctly built for the target hardware.

The following sample is provided to ensure a Docker image is built for an x86 hardware target.

#!/bin/sh

if [ $# -lt 1 ]; then
  echo usage: build.sh name:tag, e.g. ecr.io/octoml/model:v1
  exit 1
fi
docker build --tag $1 --platform linux/x86_64 --file docker_build_context/Dockerfile docker_build_context

To use this script, pass a name:tag argument.

Example:

To build the image yolo:latest, replace name:tag with yolo:latest as seen in the command below.

./build.sh yolo:latest

Executing this command will create a docker image specifically for the target hardware the model was accelerated for.

Docker Image Validation

Confirm the docker image was created correctly by issuing the command: docker images.

If this image was built for a different architecture than your local machine, you won’t be able to run it locally, so do not try to test it on your local machine unless you’re sure the architecture is compatible.

Build your image with a specific registry destination. Below is an example of how to do this with AWS ECR:

  1. push the image to a registry

  2. deploy the image to the preferred hardware

Deploying Built Triton Image to AWS ECR

Log in to the ECR registry using the AWS CLI. Enter the AWS_ACCOUNT_ID and REGION:

aws ecr get-login-password --region [REGION] | docker login --username AWS --password-stdin [AWS_ACCOUNT_ID].dkr.ecr.[REGION].amazonaws.com

Login Succeeded displays in the terminal.

Once you are logged in, specify the repository name and tag to use for the model container image.

Note: If you haven’t already created a destination repository in ECR, you’ll need to do so first. For more information, visit the Amazon ECR Guide.

Create the repository in ECR:

aws ecr create-repository --repository-name yolo --profile [PROFILE]

Before pushing the built image to ECR, tag it locally:

docker tag yolo:latest [AWS_ACCOUNT_ID].dkr.ecr.[REGION].amazonaws.com/yolo:latest

Confirm your image is properly tagged in your local registry:

docker images

Finally, push the image to AWS ECR, and confirm success.

At the time of writing, this image is 12 GB, so the first push will take some time.

docker push [AWS_ACCOUNT_ID].dkr.ecr.[REGION].amazonaws.com/yolo:latest

With the image pushed, confirm it appears in ECR:

aws ecr list-images --repository-name yolo --profile [PROFILE_NAME]

The optimized model container is now in AWS ECR and ready to be pushed wherever you need it!

Next steps:

Deploying Triton Containers from AWS ECR to AWS EKS

This tutorial provides an example of how to deploy an OctoML container in EKS, assuming you have:

  1. already pushed your OctoML model container to ECR (see Deploying Built Triton Image to AWS ECR), and

  2. already created an EKS cluster for your application.

Make sure the cluster provides the instance type your container was accelerated for. For example, if you requested a container for an AWS c5n.xlarge instance, you’ll want to make sure your cluster has c5n.xlarge instances available. This will ensure you get the performance you were expecting.

  • For information about pushing your OctoML model container to AWS ECR, see Using Docker Tarballs

  • For information on creating an EKS cluster, see this guide: Creating an Amazon EKS cluster

First, make sure you have both the AWS command line interface and kubectl installed on your developer machine. Then, set your kubectl context to your AWS EKS cluster.

aws eks update-kubeconfig --name [AWS_CLUSTER_NAME] --region [REGION] --profile [PROFILE]

Make sure your cluster has nodes available to match the hardware you specified when you created your accelerated model container. To do so, create a managed node group of the appropriate instance type using the guide: Creating a managed node group

Once nodes with the relevant instance type are made available to the EKS cluster, the Triton container can be deployed using the Helm Chart.

The following bash script uses this helm chart to create a minimal deployment of Triton in EKS:

#!/bin/sh
aws_account_id=xxxxxxxxxx
REGION=us-west-2
registry=${aws_account_id}.dkr.ecr.${REGION}.amazonaws.com
DOCKER_IMAGE_NAME=triton-example
DOCKER_IMAGE_TAG=v1

cat << EOF > "values.yaml"
imageName: $registry/${DOCKER_IMAGE_NAME}
imageTag: $DOCKER_IMAGE_TAG
imageCredentials:
  registry: $registry
  username: AWS
  password: $(aws ecr get-login-password --region "$REGION")
nodeSelector:
  node.kubernetes.io/instance-type: c5n.xlarge
EOF

## Install helm chart to the EKS cluster via helm
if ! helm repo list | grep octoml-helm-charts; then
  # Substitute the URL of the OctoML helm chart repository here.
  helm repo add octoml-helm-charts <OCTOML_HELM_REPO_URL>
fi

helm repo update octoml-helm-charts

helm install triton-example octoml-helm-charts/octoml-triton -n triton-example --create-namespace --values "values.yaml" --atomic --timeout 7m

As part of the helm chart installation, the following components will be created:

  • A pod, which holds the Triton container itself

  • A deployment, which manages the lifecycle of the Triton pod

  • A service, which exposes the pod to clients outside the cluster

The values.yaml file used in the example above utilizes the following fields:

  • imageName. The name of the image. Since we’re pulling the image from ECR, the registry name is required as a prefix.

  • imageTag. The tag identifying the Triton image to be pulled.

  • imageCredentials. Credentials used by EKS to pull the Triton image into the pod. Note the use of aws ecr get-login-password, which can be invoked from an authenticated command line client to pull down ECR credentials.

  • nodeSelector. A map of node selectors to be used by the Triton pod. Node selectors force a pod to be scheduled on nodes with specific label values. This example leverages the built-in node.kubernetes.io/instance-type label, which is automatically added to all nodes created by EKS, to make sure that the Triton pod lands on a c5n.xlarge instance.

While the above example is strictly functional (to run inference on this and similar deployments, see Remote Inference on Triton Server), certain optional components are missing:

While the Triton inference server in the above example can be accessed either by neighboring pods in the cluster or authenticated clients using kubectl port-forward, it cannot be accessed by an arbitrary anonymous client. To allow for this, consider adding the following to values.yaml:

service:
  type: LoadBalancer

This will provision the Triton service with type LoadBalancer, which will automatically create an AWS elastic load balancer pointing to the Triton pod. To get the DNS name of this load balancer, you can query the service like so:

kubectl get svc -n triton-example
NAME             TYPE           CLUSTER-IP      EXTERNAL-IP                                                              PORT(S)                                        AGE
triton-example   LoadBalancer   172.20.67.101   aad62a015a3a241ef888db24515c4786-974307675.us-west-2.elb.amazonaws.com   8001:32182/TCP,8000:30248/TCP,8002:31026/TCP   2m

Alternatively, if your cluster has an externally-accessible ingress controller you can make Triton accessible through an ingress resource. To do so, add the following to your helm chart values file:

ingress:
  enabled: true
  grpc:
    host: <GRPC_HOSTNAME>
  http:
    host: <HTTP_HOSTNAME>

Doing so will allow helm to create ingress resources for both HTTP and GRPC access to the Triton pods. Assuming that <GRPC_HOSTNAME> and <HTTP_HOSTNAME> both resolve to the IP address of your ingress controller, you should be able to access the inference server this way.

Important: the NGINX Ingress Controller, which is one of the more prevalent ones in use today, does not allow for plaintext GRPC communications. If you plan to use this helm chart with that particular controller, you will have to add a TLS Secret:

ingress:
  enabled: true
  grpc:
    host: <GRPC_HOSTNAME>
  http:
    host: <HTTP_HOSTNAME>
    tls:
      - hosts:
          - <GRPC_HOSTNAME>
        secretName: <TLS_SECRET>

While using node selectors as in the example above guarantees that the Triton pod will land on the correct hardware, it does not guarantee that other resource-hungry pods will not land on the same node. This leads to the busy neighbor problem, wherein the Triton server’s performance is hindered by excess resource contention. This can be prevented through the use of taints and tolerations, which limit which nodes pods are allowed to schedule on. In this example, start by applying the following taint to the c5n.xlarge node pool:

triton-instance-type=c5n.xlarge:NoSchedule

This taint makes it so that all pods which don’t have the corresponding triton-instance-type=c5n.xlarge toleration cannot be scheduled on those nodes. To allow our Triton pods to schedule on these nodes once more, we can add the following to the helm chart, which will configure the relevant toleration:

tolerations:
 - key: triton-instance-type
   value: c5n.xlarge
   operator: Equal
   effect: NoSchedule

This ensures that the Triton pod will perform without significant resource contention.

Next steps:

Remote Inference on Triton Server

This tutorial provides an example of how to run inference against a deployed Triton container in EKS. It is broken into two parts: accessing the Triton service in Kubernetes, and running inferences against the server.

Accessing the Triton Server

For simplicity, we will be working from the example helm chart deployment presented in the latter half of the guide Deploying Triton Containers from AWS ECR to AWS EKS. Three methods will be presented, organized in order of production-readiness:

Method 1: Using kubectl

The first and easiest method to access the deployed Triton container is to use kubectl port-forward, which allows an authenticated Kubernetes client to forward traffic from their local machine to a Kubernetes pod or service. To do so for our example, issue the following command:

kubectl port-forward service/triton-example -n triton-example 8000:8000 8001:8001

This will make the Triton service reachable from both port 8000 (http) and port 8001 (grpc) on localhost. To verify the Triton server is reachable, you can execute the following script from the same machine:

import tritonclient.grpc as grpc_client
import numpy as np
from PIL import Image as PyImage

triton = grpc_client.InferenceServerClient("localhost:8001")

model_name = <MODEL_NAME>

config = triton.get_model_config(model_name=model_name).config
print(config)

Where <MODEL_NAME> is the name of the model accelerated by Octomizer. You can find your model name within the tarball.

Under the directory docker_build_context/octoml/models there is a directory containing your model. The name of the directory is the name of your model.

For example, you may find a directory: docker_build_context/octoml/models/rf_iris

In this case, the name of your model is rf_iris.
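
For instance, a quick way to list the packaged model directories from Python (assuming the tarball was extracted into the current working directory):

import os

# Each subdirectory under octoml/models corresponds to one packaged model.
print(os.listdir("docker_build_context/octoml/models"))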

Method 2: Accessing through a service of type LoadBalancer

If you deployed your helm chart using a service of type LoadBalancer, you can hit the inference service from anywhere using the service’s external IP. For instance, if the external IP is 1.2.3.4, you can invoke the script above with the following line changed:

triton = grpc_client.InferenceServerClient("1.2.3.4:8001")

Method 3: Accessing through an ingress resource

In the previous section, we covered using the provided helm chart to optionally provision ingress resources to open the Triton service to the outside world. Once these are set up, accessing the service is as easy as making the following modification to the code used in the first method:

triton = grpc_client.InferenceServerClient(<GRPC_HOSTNAME>)

Where <GRPC_HOSTNAME> is the previously-chosen hostname for the GRPC service endpoint.

Running Inference Against the Triton Server

The next step is to perform an inference on your model. The model configuration describes the inputs your model expects and the outputs you can expect from your model, including the names, shapes, and datatypes of each input or output. We use this information to construct an inference request using Triton’s gRPC inference client, or an HTTP client, as we will demonstrate later on.

The following is an example model configuration with one input named X, and two outputs named label and probabilities:

name: "rf_iris"
platform: "onnxruntime_onnx"
input [
  {
    name: "X"
    data_type: TYPE_FP64
    dims: -1
    dims: 4
  }
]
output [
  {
    name: "label"
    data_type: TYPE_INT64
    dims: -1
  },
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: -1
    dims: 3
  }
]

To begin performing an inference, first install the following packages:

pip install numpy requests tritonclient[all]

Then, navigate to the model directory relative to your tarball’s base directory, docker_build_context/octoml/models/<model_name>. Here, you’ll find two files, inference_grpc.py and inference_http.py, which demonstrate making inferences over gRPC and over HTTP using the Python requests library, respectively.

Here’s an example of performing an inference using the gRPC client on a model trained against the Iris Data Set.

The model has one input with four features, each a 64-bit floating point number, and two outputs representing the label as an integer and the probabilities of each label as an array of 32-bit floating point numbers.

import numpy
import tritonclient.grpc as grpc_client

url = "localhost:8001"
client = grpc_client.InferenceServerClient(url)

inputs = []
input_name_0 = "X"
input_shape_0 = [1, 4]
input_datatype_0 = numpy.float64
input_data_0 = numpy.ones(input_shape_0, input_datatype_0)
triton_datatype_0 = "FP64"

tensor = grpc_client.InferInput(
    name=input_name_0,
    shape=input_shape_0,
    datatype=triton_datatype_0
)

tensor.set_data_from_numpy(input_data_0)
inputs.append(tensor)

outputs = []
output_name_0 = "label"
tensor = grpc_client.InferRequestedOutput(output_name_0)
outputs.append(tensor)
output_name_1 = "probabilities"
tensor = grpc_client.InferRequestedOutput(output_name_1)
outputs.append(tensor)
inferences = client.infer(
    model_name="rf_iris",
    model_version="1",
    inputs=inputs,
    outputs=outputs
)

print(inferences.as_numpy(name="label"))
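
The second requested output can be retrieved in the same way:

print(inferences.as_numpy(name="probabilities"))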

Here is the same example, but using HTTP and the requests library to perform the inference, as found in inference_http.py:

import json

import numpy
import requests

post_data = {'inputs': []}

input_name_0 = "X"
input_shape_0 = [1, 4]
input_datatype_0 = numpy.float64
input_data_0 = numpy.ones(input_shape_0, input_datatype_0)
triton_datatype_0 = "FP64"

post_data['inputs'].append({
    'name': input_name_0,
    'shape': input_shape_0,
    'datatype': triton_datatype_0,
    'data': input_data_0.tolist()
})

model_name = "rf_iris"
result = requests.post(
    f"http://localhost:8000/v2/models/{model_name}/versions/1/infer",
    data=json.dumps(post_data))
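
The JSON response body should contain an outputs list (per Triton’s HTTP/REST inference protocol); a minimal way to inspect it, assuming the request succeeded, is:

response = result.json()
for output in response["outputs"]:
    print(output["name"], output["datatype"], output["shape"], output["data"])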

C++ Linux shared object deployment (TVM only)

Required Dependencies

A Linux x86 machine is required. The following software dependencies are required to build and run the OctoML-packaged model. For Debian/Ubuntu images you must have:

  1. ldd --version shows GLIBC >= 2.27.

  2. sudo apt-get install libssl-dev zlib1g-dev build-essential libffi-dev

Additional GPU Dependencies

If you have packaged your model for a GPU platform, you will need CUDA 11.1 (or Vulkan 1.2.135 for Vulkan based GPU platforms) installed.

For Vulkan based GPU platforms on Ubuntu, the Vulkan dependency can be installed by running apt install libvulkan-dev.

Note: Currently GPU packaging is only available for TVM but is coming soon for ONNX-RT.

Running Model Inference

First, download the .tar.gz file produced by the OctoML Platform. In the SDK, this is done with:

from octomizer.package_type import PackageType

package_group = model.create_packages(PLATFORM)
print(package_group.uuid)
package_group.wait()
best_workflow = None
best_mean_latency = float("inf")
for workflow in package_group.workflows:
    metrics = workflow.metrics()
    if metrics.latency_mean_ms < best_mean_latency:
        best_mean_latency = metrics.latency_mean_ms
        best_workflow = workflow
best_workflow.save_package(OUTPUT_DIRECTORY, package_type=PackageType.LINUX_SHARED_OBJECT)

This will save a file of the form <package_name>.tar.gz, where <package_name> is the specified package name or, if unspecified, a name derived from the name field of your model. For derived package names, non-alphanumeric characters are replaced with underscores (‘_’) and leading/trailing underscores are stripped. Package names must only contain lower case letters, numbers, and underscores (which may not be repeating, leading, or trailing).

  • valid: a, my_package_name, my_package_name_2

  • invalid: a__b, _my_package_name, my_package_name_

Data is input to and output from the model via tvm.runtime.ndarray.NDArray; documentation for NDArray is found here.

Note that the order of outputs is preserved from both ONNX and Relay formats.

Also note that packaged models will fail to run on machines that already have TVM installed.

Source code for the sample program is provided in the tar file for your convenience. See the provided README.md for more file directory information.

Inference on sample random inputs

The sample program by default runs model inference on random inputs. To check that the model was successfully packaged you may simply run:

$ tar xvf <package_name>.tar.gz
$ cd <package_name>
$ make sample_program
$ LD_LIBRARY_PATH=$(pwd) ./sample_program

Inference for single image input

First, install libopencv. If you’re on Linux, you can do this with sudo apt install libopencv-dev. Otherwise, please follow the install instructions here: https://docs.opencv.org/master/df/d65/tutorial_table_of_content_introduction.html. Remember to run make install after running the cmake build commands if you choose to build from scratch.

Some modifications to the Makefile are necessary to build the sample_program with access to the opencv libraries:

  1. Add -lopencv_core to the sample_program target block’s compilation command.

  2. If you built opencv from scratch, you may additionally need to add something like -I/path/to/opencv/include/dir to the compile command for the sample_program target in the Makefile. For Linux, this will look like -I/usr/local/include/opencv4 (the opencv4 directory contains an opencv2 directory that we will reference later).

Next, you’ll need to make some modifications to the file sample_program.cpp.

  1. Add the header #include <opencv2/opencv.hpp>

  2. To the bottom of the main function, add the following code, which assumes your model takes float type inputs. If you have not modified sample_program.cpp, the addition of the code below will cause model inference to be run twice: the first time on random inputs, the second on an image of your choosing.

std::string image_path = <image path here>;
int64_t image_size = <image size here>;     // Consult the generated code for input_map
                                            // to determine the correct image size

// If the image is in grayscale, instead of the following line, invoke
// cv::Mat image = cv::imread(image_path, cv::IMREAD_GRAYSCALE);
cv::Mat image = cv::imread(image_path);

// Check for failure
if (image.empty()) {
  std::cout << "Unable to find and read image" << std::endl;
  return -1;
}

cv::Mat dst;
cv::resize(image, dst, cv::Size(image_size, image_size));

std::vector<float> data(dst.begin<uint8_t>(), dst.end<uint8_t>());
std::vector<int64_t> shape = input_shape_for_<input name here>;    // Consult the generated code for input name

int64_t num_input_elements = 1;
for (int64_t s : shape) {
  num_input_elements *= s;
}
tvm::runtime::NDArray input_arr = tvm_ndarray_from_vec<float>(
    data,
    shape,
    ctx,
    "float32",
    num_input_elements
);
std::vector<tvm::runtime::NDArray> input_vec = { input_arr };

auto image_outputs = model.run(input_vec);

// You can access the data from the output NDArrays with the following:
for (tvm::runtime::NDArray output : image_outputs) {
  int64_t num_output_elements = 1;
  for (int64_t s : output.Shape()) {
    num_output_elements *= s;
  }

  auto dat = static_cast<float*>(output->data);
  for (int i = 0; i < num_output_elements; i++) {
    std::cout << dat[i] << std::endl;
  }
}