Tutorials

Accelerate a public Keras SavedModel

First make sure you have TensorFlow 2.6 installed and the latest version of OctoML’s SDK:

! pip install tensorflow==2.6
! pip install octomizer-sdk --upgrade

Import libraries:

from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2

import octomizer.client
import octomizer.models.tensorflow_saved_model as tf_model

Now fetch a public Keras SavedModel from Keras’s official model zoo:

model = MobileNetV2(weights='imagenet')

# Calling `save('my_model')` creates a SavedModel folder `my_model`.
model.save("my_model")

# Create a tarball for the model
! tar -czvf my_model.tgz my_model

Upload the Keras SavedModel to OctoML’s Platform:

# Pass your API token below:
client = octomizer.client.OctomizerClient(access_token=MY_ACCESS_TOKEN)

# Upload the model to Octomizer.
model = tf_model.TensorFlowSavedModel(client, name="my_model", model="my_model.tgz")

This model has dynamic inputs, so you’ll need to specify the input shapes before acceleration. Otherwise, you’ll get an error message saying “Dynamic inputs are not supported.”:

# Check the automatically inferred shapes.
inputs = model.get_uploaded_model_variant().inputs
print(inputs)
input_name = list(inputs[0].keys())[0]

# The command above prints a pair of dicts (shapes and dtypes); the shapes dict
# contains an entry like 'input_0:0': [-1, 224, 224, 3].
# The -1 in the first dimension means you need to specify the batch size.

input_shapes = {input_name: [1, 224, 224, 3]} # Notice the -1 has been replaced by 1.
input_dtypes = inputs[1]

Now create packages for the model. By default, the resulting package will be a Python wheel:

package_group = model.create_packages(
    platform="broadwell",
    input_shapes=input_shapes,
    input_dtypes=input_dtypes
)

# Save the package group uuid somewhere so you can use it to access
# benchmark metrics or the resulting packages later.
print(package_group.uuid)

After you receive an email notification about the completion of the acceleration workflow, you can view performance benchmark metrics on the hardware you chose and download a packaged version of the accelerated model, either by visiting the UI or invoking the following code:

# Look up the workflows you previously launched using the group id
package_group = client.get_package_workflow_group("<INSERT GROUP ID>")
assert package_group.done()

best_workflow = None
best_mean_latency = float("inf")

for workflow in package_group.workflows:
    # To view benchmark metrics, either visit the UI or invoke something similar to:
    engine = workflow.proto.benchmark_stage_spec.engine
    metrics = workflow.metrics()
    print(engine)
    print(metrics)
    print("-----------------------------------")

    if metrics.latency_mean_ms < best_mean_latency:
        best_mean_latency = metrics.latency_mean_ms
        best_workflow = workflow

# Save the resulting Python wheel to the current directory.
best_workflow.save_package(".")

Accelerate a custom Keras SavedModel

First make sure you have TensorFlow 2.6 installed and the latest version of OctoML’s SDK:

! pip install tensorflow==2.6
! pip install octomizer-sdk --upgrade

Import libraries:

from tensorflow import keras
import numpy as np
import octomizer.client
import octomizer.models.tensorflow_saved_model as tf_model

Define and save your Keras SavedModel:

def get_model():
    # Create a simple model using Keras.
    inputs = keras.Input(shape=(32,))
    outputs = keras.layers.Dense(1)(inputs)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

model = get_model()

# Train the model.
test_input = np.random.random((128, 32))
test_target = np.random.random((128, 1))
model.fit(test_input, test_target)

# Calling `save('my_model')` creates a SavedModel folder `my_model`.
# Calling keras.models.save_model(model, "my_model") also works.
model.save("my_model")

# Create a tarball for the model
! tar -czvf my_model.tgz my_model

The remaining steps to upload the custom Keras SavedModel, disambiguate inputs, accelerate the model, and view results are the same as the steps described above for public Keras SavedModels.
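
For convenience, here is a condensed sketch of those remaining steps applied to the custom model above (the “broadwell” platform and the batch size of 1 are illustrative choices, not requirements):

# Pass your API token below:
client = octomizer.client.OctomizerClient(access_token=MY_ACCESS_TOKEN)

# Upload the tarball created above.
model = tf_model.TensorFlowSavedModel(client, name="my_model", model="my_model.tgz")

# Disambiguate the dynamic batch dimension: this model takes a single input with 32 features.
inputs = model.get_uploaded_model_variant().inputs
input_name = list(inputs[0].keys())[0]
input_shapes = {input_name: [1, 32]}
input_dtypes = inputs[1]

# Create packages for the model.
package_group = model.create_packages(
    platform="broadwell",
    input_shapes=input_shapes,
    input_dtypes=input_dtypes
)
print(package_group.uuid)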

Accelerate a public TensorFlow GraphDef

First make sure you have TensorFlow 2.6 installed and the latest version of OctoML’s SDK:

! pip install tensorflow==2.6
! pip install octomizer-sdk --upgrade

Import libraries:

import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

import octomizer.client
import octomizer.models.tensorflow_graph_def_model as tf_model

Now fetch a public Keras model from Keras’s official model zoo:

model = MobileNetV2(weights='imagenet')

Convert the model to a GraphDef:

full_model = tf.function(lambda x: model(x))
full_model = full_model.get_concrete_function(x=tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

# Get frozen ConcreteFunction
frozen_func = convert_variables_to_constants_v2(full_model)
frozen_func.graph.as_graph_def()

# Save frozen graph from frozen ConcreteFunction to hard drive
tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                  logdir=".",
                  name="myGraphDef.pb",
                  as_text=False)

Upload the GraphDef to OctoML’s Platform:

# Pass your API token below:
client = octomizer.client.OctomizerClient(access_token=MY_ACCESS_TOKEN)

# Upload the model to Octomizer.
model = tf_model.TensorFlowGraphDefModel(client, name="myGraphDef.pb", model="myGraphDef.pb")

This model has dynamic inputs, so you’ll need to specify the input shapes before acceleration. Otherwise, you’ll get an error message saying “Dynamic inputs are not supported.”:

# Check the automatically inferred shapes.
inputs = model.get_uploaded_model_variant().inputs
print(inputs)
input_name = list(inputs[0].keys())[0]

# The command above prints a pair of dicts (shapes and dtypes); the shapes dict
# contains an entry like 'x:0': [-1, 224, 224, 3].
# The -1 in the first dimension means you need to specify the batch size.

input_shapes = {input_name: [1, 224, 224, 3]} # Notice the -1 has been replaced by 1.
input_dtypes = inputs[1]

The remaining steps to accelerate the model and view results are the same as the steps described above for public Keras SavedModels.
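
As a quick reference, the packaging call is identical to the one shown earlier (the “broadwell” platform is again just an example):

package_group = model.create_packages(
    platform="broadwell",
    input_shapes=input_shapes,
    input_dtypes=input_dtypes
)
print(package_group.uuid)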

Accelerate a custom TensorFlow GraphDef

First make sure you have TensorFlow 2.6 installed and the latest version of OctoML’s SDK:

! pip install tensorflow==2.6
! pip install octomizer-sdk --upgrade

Import libraries:

import tensorflow as tf
from tensorflow import keras
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2
import numpy as np

import octomizer.client
import octomizer.models.tensorflow_graph_def_model as tf_model

Define the model:

def get_model():
    # Create a simple model using Keras.
    inputs = keras.Input(shape=(32,))
    outputs = keras.layers.Dense(1)(inputs)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

model = get_model()

# Train the model.
test_input = np.random.random((128, 32))
test_target = np.random.random((128, 1))
model.fit(test_input, test_target)

The remaining steps to convert the model to a GraphDef, upload the GraphDef to OctoML, disambiguate inputs, accelerate the model, and view results are the same as the steps described above for public GraphDef models.
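
For reference, here is a condensed sketch of those remaining steps applied to the custom model above, mirroring the public GraphDef flow (the file name myCustomGraphDef.pb and the “broadwell” platform are illustrative):

# Convert the trained Keras model to a frozen GraphDef.
full_model = tf.function(lambda x: model(x))
full_model = full_model.get_concrete_function(
    x=tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))
frozen_func = convert_variables_to_constants_v2(full_model)
tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                  logdir=".",
                  name="myCustomGraphDef.pb",
                  as_text=False)

# Upload the GraphDef, disambiguate the batch dimension, and create packages.
client = octomizer.client.OctomizerClient(access_token=MY_ACCESS_TOKEN)
model = tf_model.TensorFlowGraphDefModel(client, name="myCustomGraphDef.pb", model="myCustomGraphDef.pb")
inputs = model.get_uploaded_model_variant().inputs
input_name = list(inputs[0].keys())[0]
package_group = model.create_packages(
    platform="broadwell",
    input_shapes={input_name: [1, 32]},
    input_dtypes=inputs[1]
)
print(package_group.uuid)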

End-to-end PyTorch object detection example using ImageNet

First, make sure you have all necessary libraries installed:

! pip install onnxruntime==1.11.0
! pip install onnx==1.9.0
! pip install torch==1.9.0
! pip install octomizer-sdk --upgrade

Import libraries:

import numpy as np
import octomizer.client
import octomizer.models.torchscript_model as torchscript_model
import torch.onnx
import torchvision
import urllib.request

Download a public PyTorch model:

model = torchvision.models.squeezenet1_0(pretrained=True)

Export the PyTorch model to TorchScript.

There are two methods for exporting models to TorchScript: scripting and tracing. We strongly recommend tracing to maximize your TorchScript models’ compatibility. Below, we provide an example of tracing a SqueezeNet model.

For more information on exporting PyTorch models to TorchScript, see this guide: https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html:

model_name = "squeezenet_from_pt"
torchscript_file = model_name + ".pt"

# A standard ImageNet input has 3 color channels (RGB) and images of dimension 224x224.
# Image values can be randomized to conduct tracing
rand_input = torch.randn(1, 3, 224, 224)

model.eval()

traced_model = torch.jit.trace(model, rand_input)

traced_model.save(f=torchscript_file)
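
For comparison, scripting the same model is a one-liner; we still recommend tracing here, but a minimal scripting sketch looks like this (the output file name is arbitrary):

# Script the model instead of tracing it.
scripted_model = torch.jit.script(model)
scripted_model.save("squeezenet_scripted.pt")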

Next, upload the model to the OctoML platform.

Note: TorchScript models do not include input shape or data type information, so you’ll need to pass those in as dictionaries for each input. Since SqueezeNet only has one input (referred to as “input0” below), these dicts will have only one record for both shape and datatype:

# Pass your API token below:
client = octomizer.client.OctomizerClient(access_token=MY_ACCESS_TOKEN)

# Pass in your model name and a short description. The name is reused later as the
# package/module name, so keep it consistent with the wheel you install below.
model_name = 'squeezenet_from_pt'
description = 'squeezenet first test'

# Pass in the shape and data type for each input
input_shapes = {'input0': [1, 3, 224, 224]}
input_dtype = {'input0': 'float32'}

# Upload the model to the Octomizer
model = torchscript_model.TorchscriptModel(
    client=client,
    name=model_name,
    model=torchscript_file,
    model_input_shapes=input_shapes,
    model_input_dtypes=input_dtype,
    description=description
)

Now accelerate the model. By default, the resulting package will be a Python wheel:

package_group = model.create_packages(platform="broadwell")

# Dynamically shaped models would require `input_shapes` and `input_dtypes`
# as additional parameters in the create_packages() call.

# Save the workflow uuid somewhere so you can use it to access benchmark
# metrics or the resulting package later.
print(package_group.uuid)

# Also save the ``model_name`` you used somewhere because you will need to call
# `import <model_name>` after downloading the resulting package later,
# unless you set a custom package name per the docs for octomizer.model.Model.create_package_workflow_group
print(model_name)

After you receive an email notification about the completion of the acceleration workflow, you can view performance benchmark metrics on the hardware you chose and download a packaged version of the accelerated model, either by visiting the UI or invoking the following code:

# Look up the workflows you previously launched using the group id
package_group = client.get_package_workflow_group("<INSERT GROUP ID>")
assert package_group.done()

best_workflow = None
best_mean_latency = float("inf")

for workflow in package_group.workflows:
    # To view benchmark metrics, either visit the UI or invoke something similar to:
    engine = workflow.proto.benchmark_stage_spec.engine
    metrics = workflow.metrics()
    print(engine)
    print(metrics)
    print("-----------------------------------")

    if metrics.latency_mean_ms < best_mean_latency:
        best_mean_latency = metrics.latency_mean_ms
        best_workflow = workflow

# Save the resulting Python wheel to the current directory.
best_workflow.save_package(".")

Install the wheel generated by OctoML:

! pip install squeezenet_from_pt-0.1.0-py3-none-any.whl

To test the accelerated model, download a publicly available image from PyTorch:

# Download a picture of a Samoyed dog from PyTorch
url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")

urllib.request.urlretrieve(url, filename)


# Use boilerplate image processing code from PyTorch-- see https://pytorch.org/hub/pytorch_vision_squeezenet/
from PIL import Image
from torchvision import transforms

input_image = Image.open(filename)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the PyTorch model

Run the accelerated model on the image:

# The module name defaults to the name of the model, if the model has alphanumeric characters only.
# You could also have customized this package name when creating an acceleration workflow.
import squeezenet_from_pt

best_model = squeezenet_from_pt.OctomizedModel()
outputs = best_model.run(input_batch.numpy()) # Run the accelerated model

# The accelerated model outputs a tvm.nd.NDArray object with shape (1,1000),
# with confidence scores over Imagenet's 1000 classes. You can convert the
# output to a numpy array by calling .numpy().

# Find the index of the ImageNet label predicted with highest probability.
pred = np.argmax(outputs[0].numpy())

# Download ImageNet labels
! wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt

# Read the labels in ImageNet and print out the label predicted with highest probability
with open("imagenet_classes.txt", "r") as f:
    categories = [s.strip() for s in f.readlines()]
    print("Accelerated model detects object: " + categories[pred])

Finally, check that the original, unaccelerated PyTorch model predicts the same label for the given image:

# Run the original PyTorch model
with torch.no_grad():
    orig_out = model(input_batch)

# The original output has unnormalized scores. To get probabilities, you can run a softmax on it.
orig_probs = torch.nn.functional.softmax(orig_out[0], dim=0)

# Read the labels in ImageNet and print out the label predicted with highest probability
with open("imagenet_classes.txt", "r") as f:
    categories = [s.strip() for s in f.readlines()]
    orig_pred = np.argmax(orig_probs.numpy())
    print("Unaccelerated model detects object: " + categories[orig_pred])

End-to-end Transformers question answering example

First install all necessary libraries:

! pip install transformers
! pip install onnxruntime==1.8.0
! pip install octomizer-sdk --upgrade

Then download a large BERT model fine-tuned on the SQuAD question answering dataset and export it to ONNX using the transformers ONNX exporter:

! python -m transformers.onnx --model=bert-large-uncased-whole-word-masking-finetuned-squad --feature=question-answering onnx/

Now upload the model to the OctoML platform:

import octomizer.client
import octomizer.models.onnx_model as onnx_model

# Pass your API token below:
client = octomizer.client.OctomizerClient(access_token="<INSERT ACCESS TOKEN>")

# Upload the ONNX model
MODEL_NAME = "bert_squad_qa"
ONNX_FILE = "onnx/model.onnx"
accel_model = onnx_model.ONNXModel(client, name=MODEL_NAME, model=ONNX_FILE)

# Check the automatically inferred input shapes.
inputs = accel_model.get_uploaded_model_variant().inputs
print(inputs)

The command above prints ({'input_ids': [-1, -1], 'attention_mask': [-1, -1], 'token_type_ids': [-1, -1]}, {'input_ids': 'int64', 'attention_mask': 'int64', 'token_type_ids': 'int64'}).

Notice the input shapes printed above have negative values, which means they are dynamic and need to be disambiguated. For transformer models, the inputs input_ids, attention_mask, and token_type_ids need to have the same shape: [batch_size, maximum_sequence_length].

In this example, we will specify a batch size of 1 and a maximum sequence length of 128. input_ids holds the IDs of the tokens (words or subwords) in the input sequence. attention_mask is a binary tensor indicating which positions are padding so that the model does not attend to them. token_type_ids is a binary mask identifying the two segment types in the input: question or context:

input_shapes = {'input_ids': [1, 128], 'attention_mask': [1, 128], 'token_type_ids': [1, 128]}
input_dtypes = inputs[1] # Use the input data types OctoML automatically inferred

For CPU targets, OctoML delivers the best performance on Transformer-based models via optimized use of ONNX-RT and packaging. For GPU targets, we recommend using TVM for acceleration:

package_group = accel_model.create_packages(
    platform="broadwell",
    input_shapes=input_shapes,
    input_dtypes=input_dtypes
)

# Save the package group uuid somewhere so you can use it to access benchmark metrics or the resulting package later.
print(package_group.uuid)

# Also save the MODEL_NAME you used somewhere because you will need to call `import <MODEL_NAME>` after downloading the resulting package later,
# unless you set a custom package name per the docs for octomizer.model.Model.create_package_workflow_group
print(MODEL_NAME)

After you receive an email notification about the completion of the acceleration workflow, you can view performance benchmark metrics on the hardware you chose and download a packaged version of the accelerated model, either by visiting the UI or invoking the following code:

# Look up the workflows you previously launched using the group id
package_group = client.get_package_workflow_group("<INSERT GROUP ID>")
assert package_group.done()

best_workflow = None
best_mean_latency = float("inf")

for workflow in package_group.workflows:
    # To view benchmark metrics, either visit the UI or invoke something similar to:
    engine = workflow.proto.benchmark_stage_spec.engine
    metrics = workflow.metrics()
    print(engine)
    print(metrics)
    print("-----------------------------------")

    if metrics.latency_mean_ms < best_mean_latency:
        best_mean_latency = metrics.latency_mean_ms
        best_workflow = workflow

# Save the resulting Python wheel to the current directory.
best_workflow.save_package(".")

Install the wheel generated by OctoML:

! pip install bert_squad_qa-0.1.0-py3-none-any.whl

Import the accelerated model:

import bert_squad_qa

best_model = bert_squad_qa.OctomizedModel()

Now set up a sample input for the accelerated model:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
question, context = "What are some example applications of BERT?", "…BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications."
encoded_input = tokenizer.encode_plus(question, context, return_tensors="np")

Run the accelerated model:

start_scores, end_scores = best_model.run(*encoded_input.values())

Now let’s interpret the results:

import numpy as np
input_ids = encoded_input['input_ids']
tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze())

# Find the tokens with the highest `start` and `end` scores.
answer_start = np.argmax(start_scores)
answer_end = np.argmax(end_scores)

# Combine the tokens in the answer and print it out.
answer = ' '.join(tokens[answer_start:answer_end+1])
print(answer)

The question we asked the accelerated model was “What are some example applications of BERT?” The model answered “… bert model can be fine ##tu ##ned with just one additional output layer to create state - of - the - art models for a wide range of tasks , such as question answering and language inference.”
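
As in the PyTorch tutorial above, you can sanity-check this answer against the original, unaccelerated model. One convenient way is the transformers pipeline API (a minimal sketch; note that this downloads the full PyTorch checkpoint, which is large):

from transformers import pipeline

# Load the original fine-tuned BERT model and its tokenizer.
qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

# Reuse the same question and context as above.
original_answer = qa(question=question, context=context)
print("Unaccelerated model answers: " + original_answer["answer"])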

Supported Hardware

To get the supported hardware targets available for acceleration, you can use the following code:

# Pass your API token below:
client = octomizer.client.OctomizerClient(access_token=MY_ACCESS_TOKEN)

# Get the list of available hardware targets
targets = client.get_hardware_targets()
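
To see exactly what comes back, you can print each target in the list, for example:

for target in targets:
    print(target)
    print("-----------------------------------")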

Each hardware target in the list includes information such as the display name, platform, vendor name, number of vCPUs, architecture name, and the supported model runtimes for the target (e.g. TVM, ONNX Runtime, etc.). If we want to create packages for a model on the Intel Cascade Lake target (AWS c5.12xlarge), whose platform name is aws_c5.12xlarge, we set it as the platform parameter in the create_packages function:

# Assuming the model has already been uploaded and its inputs have been inferred
package_group = model.create_packages(platform="aws_c5.12xlarge")

# Save the workflow uuid somewhere so you can use it to access benchmark metrics or the resulting package later.
print(package_group.uuid)

After your package group has finished, the lscpu information will be included in the benchmark results of a workflow:

assert package_group.done()
best_workflow = None
best_mean_latency = float("inf")

for workflow in package_group.workflows:
    metrics = workflow.metrics()
    if metrics.latency_mean_ms < best_mean_latency:
        best_mean_latency = metrics.latency_mean_ms
        best_workflow = workflow

result = best_workflow.result()
print(result.benchmark_result.lscpu_output)

The output will look something like this:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:            7
CPU MHz:             1941.374
BogoMIPS:            5999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

End-to-End: Optimizing PyTorch Model, Packaging it in Triton, and Pushing to AWS ECR

First, make sure you have all necessary libraries installed:

! pip install onnxruntime==1.11.0
! pip install onnx==1.9.0
! pip install torch==1.9.0
! pip install octomizer-sdk --upgrade

Import libraries:

import numpy as np
import octomizer.client
import octomizer.models.torchscript_model as torchscript_model
import torch.onnx
import torchvision
import urllib

Download a public PyTorch model:

model = torchvision.models.squeezenet1_0(pretrained=True)

Export the PyTorch model to TorchScript.

There are two methods for exporting models to TorchScript: scripting and tracing. We strongly recommend tracing to maximize your TorchScript models’ compatibility. Below, we provide an example of tracing a SqueezeNet model.

For more information on exporting PyTorch models to TorchScript, see this guide: https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html:

model_name = "squeezenet_from_pt"
torchscript_file = model_name + ".pt"

# A standard ImageNet input has 3 color channels (RGB) and images of dimension 224x224.
# Image values can be randomized to conduct tracing
rand_input = torch.randn(1, 3, 224, 224)

model.eval()

traced_model = torch.jit.trace(model, rand_input)

traced_model.save(f=torchscript_file)

Next, upload the model to the OctoML platform.

Note: TorchScript models do not include input shape or data type information, so you’ll need to pass those in as dictionaries for each input. Since SqueezeNet only has one input (referred to as “input0” below), these dicts will have only one record for both shape and datatype:

# Pass your API token below:
client = octomizer.client.OctomizerClient(access_token=MY_ACCESS_TOKEN)

# Pass in your model name and a short description. The name is reused later for the
# ECR repository, so keep it consistent with the image you push below.
model_name = 'squeezenet_from_pt'
description = 'squeezenet first test'

# Pass in the shape and data type for each input
input_shapes = {'input0': [1, 3, 224, 224]}
input_dtype = {'input0': 'float32'}

# Upload the model to the Octomizer
model = torchscript_model.TorchscriptModel(
        client=client,
        name=model_name,
        model=torchscript_file,
        model_input_shapes=input_shapes,
        model_input_dtypes=input_dtype,
        description=description
)

Now we’ll accelerate and package the model for an AWS “c5n.xlarge” instance. This step is likely to take several hours:

package_group = model.create_packages(platform="aws_c5n.xlarge")

# Dynamically shaped models would require `input_shapes` and `input_dtypes` as additional parameters in the create_packages() call.

# Save the workflow uuid somewhere so you can use it to access benchmark metrics or the resulting package later.
print(package_group.uuid)

After you receive an email notification about the completion of the acceleration workflow, you can view performance benchmark metrics on the hardware you chose.

You can also see the options available for packaging. In this tutorial, we’re focused on the “docker_build_triton” package:

# Look up the workflows you previously launched using the group id
package_group = client.get_package_workflow_group("<INSERT GROUP ID>")
assert package_group.done()

best_workflow = None
best_mean_latency = float("inf")

for workflow in package_group.workflows:
    # To view benchmark metrics, either visit the UI or invoke something similar to:
    engine = workflow.proto.benchmark_stage_spec.engine
    metrics = workflow.metrics()
    print(engine)
    print(metrics)
    print("-----------------------------------")

    if metrics.latency_mean_ms < best_mean_latency:
        best_mean_latency = metrics.latency_mean_ms
        best_workflow = workflow

# Here we print the package types that are available. Notice that Triton is among them.
packages = best_workflow.proto.status.result.package_result
print(packages)

Once we’ve confirmed we have a Docker package available to us, let’s take a moment to prepare the destination for our package: AWS ECR. For more detail on setting up ECR, check out this guide: https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html

Start by logging in to your ECR registry using the AWS CLI:

aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com

Now that we’re logged in, we need to specify the repository name and tag we’re going to use for the model container image. If you haven’t already created a destination repository in ECR, you’ll need to do so first. Check out the guide here: https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-create.html:

# Create the repository in ECR:
! aws ecr create-repository --repository-name $MODEL_NAME --profile [PROFILE]

# Create a local mirror name for that repository, and also specify the tag
mirror_repository_name = '[AWS_ACCOUNT_ID].dkr.ecr.[REGION].amazonaws.com/' + model_name
tag = 'v1'

print(mirror_repository_name + ":" + tag)

Now we are ready to build an image! To do so, we call a special “docker_build_triton” method on the workflow we just completed, passing in the repository name and tag:

best_workflow.docker_build_triton(mirror_repository_name+":"+tag)

Now that we’ve built the image, let’s confirm it appears in our local Docker images:

docker images

Finally, let’s push that image to AWS ECR and confirm we’ve succeeded. Note that this image is 12 GB, so the first push will take a while:

docker push [AWS_ACCOUNT_ID].dkr.ecr.[REGION].amazonaws.com/squeezenet_from_pt:v1

Now that we’ve pushed the image, let’s confirm it appears in ECR:

aws ecr list-images --repository-name $MODEL_NAME --profile [PROFILE_NAME]

The optimized model container is now in AWS ECR and ready to be pushed wherever you need it!

Convert a classical ML sklearn model to ONNX

To upload your classical ML sklearn model to OctoML for remote hardware benchmarking and seamless deployment, you first need to convert the model to the ONNX format.

First, make sure you have the necessary dependencies installed:

! pip install scikit-learn
! pip install skl2onnx

Now make sure you have a model defined. We’ll use RandomForestClassifier as an example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = RandomForestClassifier()
clr.fit(X_train, y_train)

We are ready to convert the model to ONNX:

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

FILENAME = "Random-Forest-Classifier.onnx"
initial_type = [('float_input', FloatTensorType([None, 4]))]
onx = convert_sklearn(clr, initial_types=initial_type)
with open(FILENAME, "wb") as f:
    f.write(onx.SerializeToString())

That’s it! This flow works for all sklearn models that are convertible to ONNX. If you use custom sklearn pipelines or objects that require more advanced conversion, follow skl2onnx’s conversion documentation.
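
Once converted, the ONNX file can be uploaded to the platform like any other ONNX model, following the BERT example above (a minimal sketch; the model name here is an arbitrary choice):

import octomizer.client
import octomizer.models.onnx_model as onnx_model

# Pass your API token below:
client = octomizer.client.OctomizerClient(access_token=MY_ACCESS_TOKEN)

# Upload the converted ONNX model and check its automatically inferred inputs.
sklearn_onnx_model = onnx_model.ONNXModel(client, name="random_forest_classifier", model=FILENAME)
print(sklearn_onnx_model.get_uploaded_model_variant().inputs)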

[beta] Explore Optimizations and Deploy on ONNX Runtime

OctoML provides an integration between TVM and ONNX Runtime, two of the many acceleration technologies OctoML uses to deliver cost reduction and latency/throughput improvements. Specifically, OctoML converts TVM-optimized models into an ONNX Runtime custom operator. This packaging option is automatically available in the web UI to selected beta users who use the Extended Acceleration mode, but is also available to all platform users via the SDK. In this tutorial, we will show how to create these packages and deploy them.

To create ONNX Runtime custom op packages via the SDK, use the benchmark_tvm_in_onnxruntime flag in ModelVariant.accelerate.
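
In SDK terms, a minimal sketch might look like the following, assuming `model` is a model you have already uploaded as in the tutorials above. Aside from the benchmark_tvm_in_onnxruntime flag named above, the other parameter names and the “broadwell” platform are assumptions; check the SDK reference for ModelVariant.accelerate for the exact signature:

# Hypothetical sketch: only `benchmark_tvm_in_onnxruntime` is documented above;
# the other parameters are assumptions and may differ in your SDK version.
model_variant = model.get_uploaded_model_variant()
workflow = model_variant.accelerate(
    platform="broadwell",
    benchmark_tvm_in_onnxruntime=True,
)
print(workflow.uuid)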

To create ONNX Runtime custom op packages via the web UI, toggle on the “Extended acceleration” button.

Extended acceleration toggle in WebUI

After the model has been tuned and packaged, we can download the package from the WebUI by clicking ONNX Runtime custom op.

Download the custom op package.

Then, to run the package, move it to a new directory on an instance of the hardware target that the model was tuned for. This is important, since some tuned models are hardware-dependent and use specific instructions that may be unsupported on other hardware targets. For example, Intel Macs may not support AVX-512 instructions.

The resulting file structure should look like this:

example
├── resnet50-v1-7_onnx.onnx.tar
└── model/

We can then extract the package:

tar -xvf resnet50-v1-7_onnx.onnx.tar -C model/

Make sure that you have the dependencies installed:

pip3 install onnxruntime onnx

Then we can run the model from the example directory:

import numpy as np
import onnxruntime

# Create ORT inference session
sess_options = onnxruntime.SessionOptions()
sess_options.register_custom_ops_library(
    "./model/custom_resnet50_v1_7_onnx.so"
)

engine = onnxruntime.InferenceSession(
    "./model/resnet50_v1_7_onnx.onnx",
    providers=["CPUExecutionProvider"],
    sess_options=sess_options,
)

# Create input
shape = [1, 3, 224, 224]
dtype = "float32"
input_data = {"data": np.random.uniform(size=shape).astype(dtype)}

# Run inference
output_data = engine.run(output_names=None, input_feed=input_data)
print(output_data)

Note: If you get an Illegal instruction error, make sure that the tuned model is running on the hardware target it was tuned for.