Deploying Models with OctoML¶
Once you have accelerated and packaged a model using either the web interface or Python SDK, deployment of the optimized model to your target hardware is simple.
Python wheel deployment¶
Required Dependencies¶
A Linux x86 machine is required. The following software dependencies are required to run the OctoML-packaged model. For Debian/Ubuntu images you must have:
GLIBC >= 2.27 (check with ldd --version)
sudo apt-get install libssl-dev zlib1g-dev build-essential
sudo apt-get install libffi-dev
Note: if you've compiled your own Python distribution (e.g. through pyenv install), you will need to re-compile it after this step.
Python >= 3.7 (3.7 recommended)
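As a quick sanity check (a convenience only, not an OctoML requirement), you can verify the Python and GLIBC versions from Python itself:
import platform
import sys

print(sys.version_info)      # expect major == 3 and minor >= 7
print(platform.libc_ver())   # expect ('glibc', '2.27') or newer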
Additional GPU Dependencies (TVM only)¶
If you have packaged your model for a GPU platform, you will need CUDA 11.1 (or Vulkan 1.2.135 for Vulkan based GPU platforms) installed.
For Vulkan based GPU platforms on Ubuntu, the Vulkan dependency can be installed by running apt install libvulkan-dev.
Note: Currently GPU packaging is only available for TVM but is coming soon for ONNX-RT.
Running Model Inference¶
First, download a Python wheel file produced by the OctoML Platform. In the SDK, this is done with:
package_group = model.create_packages(PLATFORM)
print(package_group.uuid)
package_group.wait()
best_workflow = None
best_mean_latency = float("inf")
for workflow in package_group.workflows:
    metrics = workflow.metrics()
    if metrics.latency_mean_ms < best_mean_latency:
        best_mean_latency = metrics.latency_mean_ms
        best_workflow = workflow

best_workflow.save_package(OUTPUT_DIRECTORY)
This will save a file in the form of <package_name>-0.1.0-py3-none-any.whl, where <package_name> will be the specified package name or, if unspecified, a name derived from the name field of your model. For derived package names, note that non-alphanumeric characters will be replaced with underscores ('_') and leading/trailing underscores will be stripped. Package names must only contain lower case letters, numbers, and underscores (which may not be repeating, leading, or trailing); see the sketch after the examples below.
Valid: a, my_package_name, my_package_name_2
Invalid: a__b, _my_package_name, my_package_name_
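The following is a rough sketch of the naming rule described above, for illustration only; it is not the platform's actual implementation, and derive_package_name is a hypothetical helper:
import re

def derive_package_name(model_name: str) -> str:
    # Replace runs of non-alphanumeric characters with a single underscore,
    # strip leading/trailing underscores, and lower-case the result.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", model_name)
    return name.strip("_").lower()

print(derive_package_name("My Model v2!"))  # prints: my_model_v2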
You can install the wheel into your Python environment using the Python pip
command:
$ pip install <package_name>-0.1.0-py3-none-any.whl
Once installed, the model can be imported and invoked from Python as follows:
import <package_name>
import numpy as np
model = <package_name>.OctomizedModel()
Please confirm that the input info is correct for the packaged model with:
idict = model.get_input_dict()
print(idict)
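For example, for a hypothetical single-input image model, the printed dictionary might look something like the following (the exact input name, shape, and dtype depend on your model):
{'input_1': {'shape': (1, 3, 224, 224), 'dtype': 'float32'}}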
Now you can provide inputs to run inference on.
Benchmarking¶
To benchmark the model with randomly generated inputs, run:
mean_ms, std_ms = model.benchmark()
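The returned values are the mean and standard deviation of the measured latency in milliseconds (as the names suggest); for example, you can report them with:
print(f"mean latency: {mean_ms:.3f} ms (std: {std_ms:.3f} ms)")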
Inference for single image input¶
If you have a model that takes in an image, you can run something like:
import cv2
image_path = <image path here>
image_size = <image size here> # Please consult `idict` above for value
input_dtype = <input dtype here> # Please consult `idict` above for value
# If the image is in grayscale, instead of the following line, invoke
# img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
img = cv2.imread(image_path)
img = cv2.resize(img, dsize=(image_size, image_size))
# Note that if you provided an RGB image, `img.shape` will look like
# (image_size, image_size, 3). If the `idict` info you printed above indicates that
# your model expects an input of shape (1, 3, image_size, image_size), you
# should uncomment the following transposition to match the input data to the format
# expected by your model.
#
# img = img.transpose((2, 0, 1))
input_nparr = np.array(img.tolist()).astype(input_dtype)
# The next line assumes that the batch dimension for this image is 1
input_nparr = input_nparr.reshape([1, *input_nparr.shape])
# If you provided a grayscale image, `img.shape` will look like
# (image_size, image_size). If the `idict` info you printed above indicates that
# your model expects an input of shape (1, 1, image_size, image_size), ensure that
# you properly resize the data to the expected format as follows:
# input_nparr.reshape([1, 1, image_size, image_size])
Note that you will need to adjust the above code depending on how your model’s inputs were pre-processed for training.
Now that you’ve processed your image, you can run your model on the processed inputs with:
outputs = model.run(input_nparr)
If you’d like to run your model with explicit device synchronization (so that outputs are guaranteed to be on the host before you interact with them), with the TVM Python package you can do:
outputs = model.run_with_device_sync(input_nparr)
Device sync should be enabled for onnxruntime by default.
At this point please confirm the output is as you would expect. If your model produces multiple outputs, note that the order of outputs is preserved across model formats.
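As an illustration only, assuming your model is an image classifier whose first output holds per-class scores, a quick sanity check might look like:
# Works for both TVM NDArray and numpy outputs.
scores = np.asarray(outputs[0].numpy() if hasattr(outputs[0], "numpy") else outputs[0])
predicted_class = int(np.argmax(scores, axis=-1)[0])
print(predicted_class)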
Inference for multiple inputs¶
This example code runs a model with multiple random np array inputs – please adjust for your own purposes:
idict = model.get_input_dict()
inputs = []
for iname in idict.keys():
    ishape = idict[iname]["shape"]
    idtype = idict[iname]["dtype"]
    inp = np.random.random(ishape).astype(idtype)
    inputs.append(inp)
# Run the model. If your model has multiple inputs, you may pass multiple inputs
# in the style of *args to the `run` function, or if you prefer, you may provide
# a dict of input name to value in **kwargs style.
outputs = model.run(*inputs)
# Note that the order of outputs is preserved from both ONNX and Relay formats.
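If you prefer the keyword-argument style mentioned in the comment above, the equivalent call looks like the following (this assumes the names in idict match the input names the runtime expects):
kwargs = {iname: inp for iname, inp in zip(idict.keys(), inputs)}
outputs = model.run(**kwargs)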
The return type of model.run is List[tvm.runtime.ndarray.NDArray], or List[numpy.ndarray] if you are using ONNX-RT. Documentation for NDArray is found here.
To access the first output as a numpy object from the inference, you may run:
out = outputs[0].numpy()
Docker tarball deployment¶
OctoML's WebUI provides the option to Download Docker tarballs.
This tutorial describes how to build Docker images from these tarballs and push them to AWS ECR.
For information about how to run inference against deployed containers, see Running Inference Against the Triton Server.
Required Dependencies¶
To follow this tutorial, you’ll need the following:
docker installed on the machine
docker daemon running
Building a Docker image¶
If a “Docker” format is available for your packaged model, do the following:
select the package
download the package
extract the tar.gz package
Example
tar -xzvf your_tarball_name_here.tar.gz
Script: build.sh
Included with the tarball is a build script: build.sh
This script ensures the docker image is correctly built for the target hardware.
The following sample is provided to ensure a Docker image is built for an x86 hardware target.
#!/bin/sh
if [ $# -lt 1 ]; then
    echo usage: build.sh name:tag, e.g. ecr.io/octoml/model:v1
    exit 1
fi
docker build --tag $1 --platform linux/x86_64 --file docker_build_context/Dockerfile docker_build_context
To use this script, pass the image name and tag in the form name:tag.
Example: to build the image yolo:latest, replace name:tag with yolo:latest as seen in the command below.
./build.sh yolo:latest
Executing this command will create a docker image specifically for the target hardware the model was accelerated for.
Docker Image Validation
Confirm the docker image was created correctly by issuing the command docker images.
If this image was built for a different architecture than your local machine, you won't be able to run it locally, so do not try to test it on your local machine unless you're sure the architecture is compatible.
Build your image with a specific registry destination. Below is an example of how to do this with AWS ECR. The general steps are:
push the image to a registry
deploy the image to the preferred hardware
Deploying Built Triton Image to AWS ECR¶
Login to the ECR registry using the AWS CLI, filling in your AWS_ACCOUNT_ID and REGION:
aws ecr get-login-password --region [REGION] | docker login --username AWS --password-stdin [AWS_ACCOUNT_ID].dkr.ecr.[REGION].amazonaws.com
Login Succeeded displays in the terminal.
Once you are logged in, specify the repository name and tag to use for the model container image.
Note: If you haven't already created a destination repository in ECR, you'll need to do so first. For more information, visit the Amazon ECR Guide.
Create the repository in ECR:
aws ecr create-repository --repository-name yolo --profile [PROFILE]
Before pushing the built image to ECR, tag it locally:
docker tag yolo:latest [AWS_ACCOUNT_ID].dkr.ecr.[REGION].amazonaws.com/yolo:latest
Confirm your image is properly tagged in your local registry:
docker images
Finally, push the image to AWS ECR, and confirm success.
At the time of writing, this image is 12GB, so the first push will take some time.
docker push [AWS_ACCOUNT_ID].dkr.ecr.[REGION].amazonaws.com/yolo:latest
With the image pushed, confirm in ECR.
aws ecr list-images --repository-name yolo --profile [PROFILE_NAME]
The optimized model container is now in AWS ECR and ready to be pushed wherever you need it!
Next steps:
For a tutorial on how to deploy your container to a cluster like AWS Elastic Kubernetes Service or information on how to provision your image in an autoscaling cluster and get it running, see Deploying Triton Containers from AWS ECR to AWS EKS
For information on how to run inference against deployed containers, see Remote Inference on Triton Server
Deploying Triton Containers from AWS ECR to AWS EKS¶
This tutorial provides an example of how to deploy an OctoML container in EKS, assuming you have:
already pushed your OctoML model container to ECR, and
already created an EKS cluster for your application.
For information on how to push your OctoML model container to ECR, see Deploying Built Triton Image to AWS ECR.
For example, if you requested a container for an AWS c5n.xlarge instance, you’ll want to make sure your cluster has c5n.xlarge instances available. This will ensure you get the performance you were expecting.
For information about pushing your OctoML model container to AWS ECR, see Using Docker Tarballs
For information on creating an EKS cluster, see this guide: Creating an Amazon EKS cluster
First, make sure you have both the AWS command line interface and kubectl installed on your developer machine. Then, set your kubectl context to your AWS EKS cluster:
aws eks update-kubeconfig --name [AWS_CLUSTER_NAME] --region [REGION] --profile [PROFILE]
Make sure your cluster has nodes available to match the hardware you specified when you created your accelerated model container. To do so, create a managed node group of the appropriate instance type using the guide: Creating a managed node group
Once nodes with the relevant instance type are made available to the EKS cluster, the Triton container can be deployed using the Helm Chart.
The following bash script uses this helm chart to create a minimal deployment of Triton in EKS:
#!/bin/sh
aws_account_id=xxxxxxxxxx
REGION=us-west-2
registry=${aws_account_id}.dkr.ecr.${REGION}.amazonaws.com
DOCKER_IMAGE_NAME=triton-example
DOCKER_IMAGE_TAG=v1
cat << EOF > "values.yaml"
imageName: $registry/${DOCKER_IMAGE_NAME}
imageTag: $DOCKER_IMAGE_TAG
imageCredentials:
  registry: $registry
  username: AWS
  password: $(aws ecr get-login-password --region "$REGION")
nodeSelector:
  node.kubernetes.io/instance-type: c5n.xlarge
EOF
## Install helm chart to the EKS cluster via helm
if ! helm repo list | grep octoml-cli-tutorials; then
    helm repo add octoml-helm-charts d
fi
helm repo update octoml-cli-tutorials
helm install triton-example octoml-helm-charts/octoml-triton -n triton-example --create-namespace --values "values.yaml" --atomic --timeout 7m
As part of the helm chart installation, the following components will be created:
A pod, which holds the Triton container itself
A deployment, which manages the lifecycle of the Triton pod
A service, which exposes the pod to clients outside the cluster
The values.yaml file used in the example above utilizes the following fields:
imageName. The name of the image; since we're pulling the image from ECR, we will need the registry name as a prefix.
imageTag. The tag identifying the Triton image to be pulled.
imageCredentials. Credentials used by EKS to pull the Triton image into the pod. Note the use of aws ecr get-login-password, which can be invoked from an authenticated command line client to pull down ECR credentials.
nodeSelector. A map of node selectors to be used by the Triton pod. Node selectors force a pod to be scheduled on nodes with specific label values. This example leverages the built-in node.kubernetes.io/instance-type label, which is automatically added to all nodes created by EKS, to make sure that the Triton pod lands on a c5n.xlarge instance.
While the above example is strictly functional (to run inference on this and similar deployments, see Remote Inference on Triton Server), there are certain optional components missing:
While the Triton inference server in the above example can be accessed either by neighboring pods in the cluster or by authenticated clients using kubectl port-forward, it cannot be accessed by an arbitrary anonymous client. To allow for this, consider adding the following to values.yaml:
service:
  type: LoadBalancer
This will provision the Triton service with type LoadBalancer, which will automatically create an AWS elastic load balancer pointing to the Triton pod. To get the DNS name of this load balancer, you can query the service like so:
kubectl get svc -n triton-example
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
triton-example LoadBalancer 172.20.67.101 aad62a015a3a241ef888db24515c4786-974307675.us-west-2.elb.amazonaws.com 8001:32182/TCP,8000:30248/TCP,8002:31026/TCP 2m
Alternatively, if your cluster has an externally-accessible ingress controller you can make Triton accessible through an ingress resource. To do so, add the following to your helm chart values file:
ingress:
  enabled: true
  grpc:
    host: <GRPC_HOSTNAME>
  http:
    host: <HTTP_HOSTNAME>
Doing so will allow helm to create ingress resources for both HTTP and GRPC access to the Triton pods. Assuming that <GRPC_HOSTNAME> and <HTTP_HOSTNAME> both redirect to the IP address of your ingress controller, you should be able to access the inference server this way.
Important: the NGINX Ingress Controller, which is one of the more prevalent ones in use today, does not allow for plaintext GRPC communications. If you plan to use this helm chart with that particular controller, you will have to add a TLS Secret:
ingress:
  enabled: true
  grpc:
    host: <GRPC_HOSTNAME>
  http:
    host: <HTTP_HOSTNAME>
  tls:
    - hosts:
        - <GRPC_HOSTNAME>
      secretName: <TLS_SECRET>
While using node selectors as in the example above guarantees that the Triton pod will land on the correct hardware, this does not guarantee other resource-hungry pods will not land on the same node.
This leads to the busy neighbor problem, wherein the Triton server’s performance is hindered by excess resource contention.
This can be prevented through the use of taints and tolerations , which limit which nodes pods are allowed to schedule on.
In this example, start by applying the following taint to the c5n.xlarge
node pool:
triton-instance-type=c5n.xlarge:NoSchedule
This taint makes it so that pods which don't have the corresponding triton-instance-type=c5n.xlarge toleration cannot be scheduled on those nodes. To allow our Triton pods to schedule on these nodes once more, we can add the following to the helm chart values, which will configure the relevant toleration:
tolerations:
  - key: triton-instance-type
    value: c5n.xlarge
    operator: Equal
    effect: NoSchedule
This ensures that the Triton pod will perform without significant resource contention.
Next steps:
For information on how to run inference against deployed containers, see Running Inference Against the Triton Server
Remote Inference on Triton Server¶
This tutorial provides an example of how to run inference against a deployed Triton container in EKS. It is broken into two parts: accessing the Triton service in Kubernetes, and running inferences against the server.
Accessing the Triton Server
For simplicity, we will be working from the example helm chart deployment presented in the latter half of the guide Deploying Triton Containers from AWS ECR to AWS EKS. Three methods will be presented, organized in order of production-readiness:
Method 1: Using kubectl
The first and easiest method to access the deployed Triton container is to use kubectl port-forward, which allows an authenticated Kubernetes client to forward traffic from their local machine to a Kubernetes pod or service. To do so for our example, issue the following command:
kubectl port-forward service/triton-example -n triton-example 8000:8000 8001:8001
This will make the Triton service reachable from both port 8000 (http) and port 8001 (grpc) on localhost. To verify the Triton server is reachable, you can execute the following script from the same machine:
import tritonclient.grpc as grpc_client
import numpy as np
from PIL import Image as PyImage
triton = grpc_client.InferenceServerClient("localhost:8001")
model_name = <MODEL_NAME>
config = triton.get_model_config(model_name=model_name).config
print(config)
Where <MODEL_NAME> is the name of the model accelerated by the Octomizer. You can find your model name within the tarball: under the directory docker_build_context/octoml/models there is a directory containing your model, and the name of that directory is the name of your model. For example, you may find a directory docker_build_context/octoml/models/rf_iris; in this case, the name of your model is rf_iris.
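If it is more convenient, you can also list the available model directories from Python, using the tarball layout described above:
import os

print(os.listdir("docker_build_context/octoml/models"))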
Method 2: Accessing through a service of type LoadBalancer
If you deployed your helm chart using a service of type LoadBalancer, you can hit the inference service from anywhere using the service's external IP. For instance, if the external IP is 1.2.3.4, you can invoke the script above with the following line changed:
triton = grpc_client.InferenceServerClient("1.2.3.4:8001")
Method 3: Accessing through an ingress resource
In the previous section, we covered using the provided helm chart to optionally provision ingress resources to open the Triton service to the outside world. Once these are set up, accessing the service is as easy as making the following modification to the code used in the first method:
triton = grpc_client.InferenceServerClient(<GRPC_HOSTNAME>)
Where <GRPC_HOSTNAME> is the previously-chosen hostname for the GRPC service endpoint.
Running Inference Against the Triton Server¶
The next step is to perform an inference on your model. The model configuration describes the inputs your model expects and the outputs you can expect from your model, including the names, shapes, and datatypes of each input or output. We use this information to construct an inference request using Triton’s gRPC inference client, or an HTTP client, as we will demonstrate later on.
The following is an example model configuration with one input named X, and two outputs named label and probabilities:
name: "rf_iris"
platform: "onnxruntime_onnx"
input [
  {
    name: "X"
    data_type: TYPE_FP64
    dims: -1
    dims: 4
  }
]
output [
  {
    name: "label"
    data_type: TYPE_INT64
    dims: -1
  },
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: -1
    dims: 3
  }
]
To begin performing an inference, first install the following packages:
pip install numpy requests tritonclient[all]
Then, navigate to the model directory relative to your tarball's base directory, docker_build_context/octoml/models/<model_name>. Here, you'll find two files, inference_grpc.py and inference_http.py, which demonstrate making inferences over gRPC and over HTTP using the Python requests library, respectively.
Here's an example of performing an inference using the gRPC client on a model trained against the Iris Data Set.
The model has one input with four features, each a 64-bit floating point number, and two outputs representing the label as an integer and the probabilities of each label as an array of 32-bit floating point numbers.
import numpy
import tritonclient.grpc as grpc_client
url = "localhost:8001"
client = grpc_client.InferenceServerClient(url)
inputs = []
input_name_0 = "X"
input_shape_0 = [1, 4]
input_datatype_0 = numpy.float64
input_data_0 = numpy.ones(input_shape_0, input_datatype_0)
triton_datatype_0 = "FP64"
tensor = grpc_client.InferInput(
    name=input_name_0,
    shape=input_shape_0,
    datatype=triton_datatype_0
)
tensor.set_data_from_numpy(input_data_0)
inputs.append(tensor)
outputs = []
output_name_0 = "label"
tensor = grpc_client.InferRequestedOutput(output_name_0)
outputs.append(tensor)
output_name_1 = "probabilities"
tensor = grpc_client.InferRequestedOutput(output_name_1)
outputs.append(tensor)
inferences = client.infer(
    model_name="rf_iris",
    model_version="1",
    inputs=inputs,
    outputs=outputs
)
print(inferences.as_numpy(name="label"))
Here is the same example, but using HTTP and the requests library to perform the inference, as found in inference_http.py:
import json
import numpy
import requests
post_data = {'inputs': []}
input_name_0 = "X"
input_shape_0 = [1, 4]
input_datatype_0 = numpy.float64
input_data_0 = numpy.ones(input_shape_0, input_datatype_0)
triton_datatype_0 = "FP64"
post_data['inputs'].append({
    'name': input_name_0,
    'shape': input_shape_0,
    'datatype': triton_datatype_0,
    'data': input_data_0.tolist()
})
model_name = "rf_iris"
result = requests.post("http://localhost:8000/v2/models/rf_iris_to_onnx_onnx/versions/1/infer",
data=json.dumps(post_data))
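The server replies with a JSON body; a minimal way to inspect it is sketched below, assuming the response follows Triton's standard v2 inference protocol (an "outputs" list with name, datatype, shape, and data fields):
result.raise_for_status()
response = result.json()
for output in response["outputs"]:
    print(output["name"], output["datatype"], output["shape"])
    print(output["data"])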