Using the OctoML Platform

This page walks through usage of the OctoML Platform, including an overview of the key concepts.

Service Access

The OctoML Web interface can be accessed at the URL provided to you. Log in to this page using the credentials provided to you.

Once you have accessed the Web interface, you may create an API token to access the API programmatically. Navigate to the Account Settings page to create a token. The contents of the token secret must be stored in the environment variable OCTOMIZER_API_TOKEN for use by the OctoML Python SDK.
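As a minimal sketch, the token can be read from the environment before constructing the SDK client (the placeholder value below is illustrative only; in practice you export the real token secret in your shell first):

```python
import os

# Illustrative placeholder only: in practice, export your real token secret
# in the shell before running, e.g. export OCTOMIZER_API_TOKEN="<token secret>".
os.environ.setdefault("OCTOMIZER_API_TOKEN", "example-token-secret")

# The OctoML Python SDK expects the token in this environment variable.
token = os.environ["OCTOMIZER_API_TOKEN"]
```

The token can then be passed to the SDK client, for example via client.OctomizerClient(access_token=token).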

The API provides both a gRPC and a REST interface, described in OctoML RPC Interface. You may also use the OctoML Python SDK to access the OctoML Platform from Python.

If you have issues logging in, please Contact Support.

System Status

Currently, if OctoML services are unavailable, the application will display error messages to alert you. If you cannot resolve the issue on your own, or need more information about the issue, please Contact Support.

Key Concepts

Before diving into the details of using the OctoML Platform, it’s important to understand a few key concepts.


Projects

A Project is a collection of Models. Projects can be defined by the user to logically group several Models. A Model is said to be detached if it does not belong to any Project.


Models

A Model represents a collection of Model Variants, all of which share the same key properties: input layer shape and type, output layer shape and type, and accuracy. OctoML produces new Model Variants, each with a different computational and memory layout for a specified hardware target, as it optimizes a given Model uploaded by the user. All Model Variants of a single Model retain the same inputs, outputs, and accuracy.

Model Variants

A Model Variant is a single instance of a Model, in a particular format, potentially targeting a specific hardware platform. Each Model has one or more Model Variants, starting with the initial Model Variant uploaded by the user in some format (such as ONNX or TensorFlow). Each Model Variant may exhibit different performance on different hardware targets. All Model Variants of a single Model retain the same inputs, outputs, and accuracy.


Workflows

A Workflow represents a sequence of actions taken by the OctoML Platform, taking a Model Variant as input, and possibly producing a new Model Variant as output. Workflows consist of up to three separate stages, any one of which is optional: Autotuning, Benchmarking, and Packaging.


Autotuning

The Autotuning stage of a Workflow takes a Model Variant as input and performs optimization on the model, based on the input parameters provided by the user. As output, it produces a new, optimized Model Variant.


Benchmarking

The Benchmarking stage of a Workflow measures the performance of a Model Variant on a target hardware platform. It produces a detailed report of model performance.


Packaging

The Packaging stage of a Workflow converts a Model Variant into a downloadable artifact that can be installed and executed. Currently, OctoML supports two packaging formats: a Python wheel that can be installed using pip, and a tar file containing a Linux shared object.

Creating a Project

To create a Project, click on the Create Project button on the OctoML Platform home page. Projects are given a name and description, which may be any string value.

Creating a Model

Like Projects, Models are given a name and description, which may be any string value. Optional labels are a comma-separated list of strings that can be used to filter the list of Models.
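The label-based filtering can be pictured with a hypothetical sketch (the model records below are plain dicts standing in for Model metadata; the comma-separated label format matches the description above):

```python
# Hypothetical Model metadata for illustration only.
models = [
    {"name": "resnet50", "labels": "vision,classification"},
    {"name": "bert-base", "labels": "nlp"},
    {"name": "yolov5", "labels": "vision,detection"},
]

def filter_by_label(models, label):
    """Return the models whose comma-separated label list contains `label`."""
    return [m for m in models if label in m["labels"].split(",")]

vision_models = filter_by_label(models, "vision")
```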

There are two ways to create a Model:

  1. When creating a Project, the Project creation screen prompts you to upload a model. The created Model will then belong to the newly-created Project.

  2. On the OctoML Platform, navigate to the Project that you want the new Model to belong to. Then, click the Add Model button.

Currently, OctoML supports models uploaded in ONNX and TensorFlow (SavedModel and Graph Def) formats. In the future, additional model types will be supported.

OctoML will infer your input types, leaving -1 values for input dimensions that are dynamic. When Octomizing a Model, you will be required to provide the input layer name, input layer type, and input layer shape for all dynamic inputs. The input layer name is the layer name in the provided ONNX file, commonly input or data. The input layer type is the NumPy type name used by the model's input layer; for example, specify float32 for np.float32, or uint8 for np.uint8. The input layer shape is a comma-separated list of layer dimension sizes. For image-based models, these are provided in NCHW format; for example, 1,3,512,512 describes a model with one input image of 3 channels, 512x512 pixels in size.
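A short sketch of how such a shape specification can be interpreted (the helper function is illustrative, not part of the SDK):

```python
def parse_input_shape(spec):
    """Parse a comma-separated shape spec such as '1,3,512,512' into a list
    of dimension sizes. A value of -1 marks a dynamic dimension."""
    return [int(dim) for dim in spec.split(",")]

# NCHW example: batch of 1, 3 channels, 512x512 pixels.
shape = parse_input_shape("1,3,512,512")

# A dynamic batch dimension inferred by OctoML would appear as -1:
dynamic_shape = parse_input_shape("-1,3,224,224")
dynamic_dims = [i for i, d in enumerate(dynamic_shape) if d == -1]
```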

Moving Models to a Project

To move a Model to a Project, click the menu button next to the Model, and choose Move to project from the pop-up menu. Select the desired Project that you want to move the Model to, and click Move.

Model Optimization and Benchmarking

Viewing a Model allows you to optimize the model for a given set of hardware targets. Selecting and optimizing against one or more hardware platforms will create one or more Workflows that perform the steps of autotuning and benchmarking the model.

At any given time, a Workflow may be in one of four states:

  1. The Workflow has been created but has not yet started running. Workflows run automatically when resources are available.

  2. The Workflow is currently in progress. If available, the progress of the Workflow will be shown.

  3. The Workflow has completed successfully.

  4. The Workflow failed during execution. An error message will be shown with details.

An email notification will be sent when a Workflow completes or fails.

Upon completion, each Workflow may create a new Model Variant (one for each selected hardware platform), as well as benchmark results.

Model benchmarking

Model benchmarking occurs by repeatedly running model inference with its associated runtime (e.g., Relay for TVM-generated models, ONNX-Runtime for ONNX models, TensorFlow’s runtime for TensorFlow models, TFLite’s runtime for TFLite models) against the selected hardware target, then producing the mean and standard deviation of the set of those results. Model inference during benchmarking is done using randomized input data.
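The summary statistics described above can be illustrated with Python's statistics module (the latency values below are made-up stand-ins for repeated inference runs; real benchmarks use many more samples):

```python
import statistics

# Illustrative latencies (in milliseconds) standing in for repeated
# inference runs against a hardware target.
latencies_ms = [5.2, 5.0, 5.4, 5.1, 5.3]

mean_ms = statistics.mean(latencies_ms)
std_ms = statistics.stdev(latencies_ms)  # sample standard deviation
```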

To conduct ONNX-RT benchmarking, OctoML currently runs ONNX 1.6 and ONNX-RT 1.4.0, with the default CPU Execution Providers and, for NVIDIA/CUDA targets, runs onnx_tensorrt 7.0 (on top of TensorRT 7.0).

Complete benchmark results can be obtained via the OctoML Python API.

Model Autotuning

Autotuning a model is OctoML’s optimization process. Each autotuning run creates a new Model Variant, which is a logically equivalent model (that is, a model which retains the same accuracy as its original state) with performance tuned for a specific hardware target.

OctoML uses Apache TVM for model optimization and autotuning. TVM has both graph-level and operator-level optimizations; while the graph-level optimizations are deterministic, the operator-level optimizations are not. The autotuning process works at the operator level by exploring the compute and memory resources of the selected hardware target, generating many variations of a compiled operator and benchmarking each of those variations to determine the optimal configuration.

The results of these operator-level optimizations are called logs in OctoML, and are distinct from audit logs or other typical software logs. Autotuning logs represent the specific layout of an operator and the resulting performance, and the combination of these logs describes the optimization of the overall model. When the results of these logs are combined and applied to the model, a new Model Variant is generated.

While the API exposes all Model Variants for a given Model, the Web interface only presents the highest-performing Model Variant for each hardware target to the user.
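The "highest-performing variant per target" selection can be sketched as follows (a hypothetical example: the records are plain dicts and the target names are invented; real results come from the Python API):

```python
# Hypothetical benchmark results for several Model Variants.
variants = [
    {"target": "cpu-target-a", "variant": "v1", "mean_ms": 12.0},
    {"target": "cpu-target-a", "variant": "v2", "mean_ms": 8.5},
    {"target": "gpu-target-b", "variant": "v3", "mean_ms": 3.1},
]

# Keep the lowest-mean-latency variant for each hardware target,
# mirroring what the Web interface displays.
best_per_target = {}
for v in variants:
    current = best_per_target.get(v["target"])
    if current is None or v["mean_ms"] < current["mean_ms"]:
        best_per_target[v["target"]] = v
```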

Performance metrics

Model performance resulting from benchmarking runs is available for every model for which an optimization run has occurred.

The OctoML Python API provides complete access to all benchmarking results for all Model Variants.

Unexpected model performance

Occasionally, Octomized models may show no results, or results far faster or slower than expected (below one millisecond, or above 1,000 milliseconds, for most models).

Because of the evolving nature of model architectures, model operators, compilation framework coverage, and vendor-provided hardware-specific libraries (e.g., CUDA, OpenVINO), some model runs may produce these results unexpectedly.
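A simple sanity check against the rough thresholds mentioned above might look like this (an illustrative helper, not part of the SDK):

```python
def looks_suspicious(mean_ms):
    """Flag a benchmark result that is missing or falls outside the rough
    1 ms to 1,000 ms window most models land in."""
    return mean_ms is None or mean_ms < 1.0 or mean_ms > 1000.0
```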

If this occurs, please Contact Support for assistance.

Model Packaging

In previous versions of the SDK, a package format had to be specified, and each workflow packaged to only a single format.

Since version 0.7.0 of the SDK, models are packaged to all available package formats for a given engine, targeting a specific hardware platform. Users can save a specific package format by using the save_package API. Keep in mind that not all package formats are available for every engine type/hardware target combination.

Please refer to the Python API for further details.

If you need a specific packaging format, such as a C API, gRPC call, or Docker container, please Contact Support.

Managing users as an account administrator

In OctoML, there are two types of users: account administrators and regular users. Account administrators can add new users to an account, modify users' permission settings, and deactivate users in an account. Each account can have multiple administrators. The initial account user automatically receives administrator privileges but can be downgraded to a regular user.

When adding a new user, an account administrator must specify the user's first and last names, email address, and permission settings. Permission settings include (1) whether the user will be an account administrator or a regular user, and (2) whether the user will be able to Octomize models (which incurs cost on the account) or will have read-only account access.

Below is sample code showing how to add users, view a list of existing users, update users' permissions, and deactivate users in an existing account. The keyword arguments passed to add_user and update_user below are illustrative; consult the Python API reference for the exact signatures:

from datetime import datetime

from octomizer import client

# ACCESS_TOKEN holds the API token secret created on the Account Settings page.
client = client.OctomizerClient(access_token=ACCESS_TOKEN)
account_uuid = client.get_current_user().account_uuid

# Add a new user (argument names are illustrative).
new_user = client.add_user(
    given_name="Jane",
    family_name="Doe",
    email="jane.doe@example.com",
    account_uuid=account_uuid,
    is_admin=False,     # regular user, not an account administrator
    can_octomize=True,  # allowed to Octomize models (incurs cost on the account)
)

# List users in your account.
users = client.list_users(account_uuid=account_uuid)
for user in users:
    print(user.uuid, user.email)

# Give an existing user admin permissions.
new_admin_user = client.update_user(
    user_uuid="some_uuid",  # you can find this uuid by listing users in your account.
    is_admin=True,
)

# Remove an existing user's permission to Octomize.
non_octomize_user = client.update_user(
    user_uuid="some_uuid",  # you can find this uuid by listing users in your account.
    can_octomize=False,
)

# Deactivate an existing user.
deactivated_user = client.update_user(
    user_uuid="some_uuid",  # you can find this uuid by listing users in your account.
    active=False,
)

# Get usage data for the active user.
usage_data = client.get_usage(
    start_time=datetime.min,  # defaults to the start of the current month.
    end_time=datetime.max,  # defaults to the end of the current month.
)