Using Octomizer

This page walks through usage of the Octomizer service, including an overview of the key concepts.

Service Access

The Octomizer service Web interface can be accessed at https://app.octoml.ai. Log in to this page using the credentials provided to you.

Once you have accessed the Web interface, you may create an API token to access the API programmatically. Navigate to the Account Settings page to create a token. The contents of the token secret must be stored in the environment variable OCTOMIZER_API_TOKEN for use by the Octomizer Python SDK.

The API is hosted at https://api.octoml.ai. The API provides both a gRPC and REST interface, described in Octomizer RPC Interface. You may also use the Octomizer Python SDK to access the Octomizer from Python.
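
For example, a minimal session with the Python SDK might look like the following sketch. It assumes only what this page already describes: a token stored in the OCTOMIZER_API_TOKEN environment variable, and the OctomizerClient constructor used in the account-administration example later on this page.

import os

from octomizer import client

# Read the API token created on the Account Settings page.
token = os.environ["OCTOMIZER_API_TOKEN"]

# Connect to the Octomizer API (hosted at https://api.octoml.ai).
octomizer_client = client.OctomizerClient(access_token=token)

# Sanity-check the connection by fetching the current user.
print(octomizer_client.get_current_user())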

If you have issues logging in, please Contact Support.

System Status

Currently, if OctoML services are unavailable, the application will display error messages alerting you to the problem. If you cannot resolve the issue on your own, or need more information about it, please Contact Support.

Key Concepts

Before diving into the details of using the Octomizer, it’s important to understand a few key concepts.

Model

A Model represents a collection of Model Variants, all of which share the same key properties: input layer shape and type, output layer shape and type, and accuracy. As it optimizes a given Model uploaded by the user, the Octomizer produces new Model Variants, each with a different computational and memory layout for a specified hardware target.

Model Variants

A Model Variant is a single instance of a Model, in a particular format, potentially targeting a specific hardware platform. Each Model has one or more Model Variants, starting with the initial Model Variant uploaded by the user in some format (such as ONNX or TensorFlow). Each Model Variant may exhibit different performance on different hardware targets. All Model Variants of a single Model retain the same inputs, outputs, and accuracy.

Workflow

A Workflow represents a sequence of actions taken by the Octomizer, taking a Model Variant as input and possibly producing a new Model Variant as output. Workflows consist of up to three separate stages, each of which is optional: Autotuning, Benchmarking, and Packaging.

Autotuning

The Autotuning stage of a Workflow takes a Model Variant as input and performs optimization on the model, based on the input parameters provided by the user. As output, it produces a new, optimized Model Variant.

Benchmarking

The Benchmarking stage of a Workflow measures the performance of a Model Variant on a target hardware platform. It produces a detailed report of model performance.

Packaging

The Packaging stage of a Workflow converts a Model Variant into a downloadable artifact that can be installed and executed. Currently, the Octomizer supports two packaging formats: a Python wheel that can be installed using pip, and a Linux shared object tar file.

Creating a Model

To create and upload a new Model, click on the Add model button on the Octomizer home page. Currently, Octomizer supports models uploaded in ONNX format. In the future, additional model types will be supported.

Models are given a name and description, which may be any string value. Optional labels are a comma-separated list of strings, which can be used for filtering the list of Models.

The Octomizer will infer your input types, leaving -1 values for input dimensions that are dynamic. You will be required to provide the input layer name, input layer type, and input layer shape for all dynamic inputs when Octomizing a Model. The input layer name is the layer name in the provided ONNX file, commonly input or data. The input layer type is the NumPy type name used by the model's input layer; for example, specify float32 for np.float32, or uint8 for np.uint8. The input layer shape is a comma-separated list of layer dimension sizes. For image-based models, these are provided in NCHW format, for example, 1,3,512,512 for a model with one input image of 3 channels of 512x512 pixels in size.
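
As a rough sketch only, uploading a model from Python might look like the following. The ONNXModel class, its module path, and its parameters here are assumptions about the SDK rather than verified names; consult the Octomizer Python SDK reference for the actual interface.

from octomizer import client
from octomizer.models.onnx_model import ONNXModel  # assumed module path

octomizer_client = client.OctomizerClient(access_token="Token")

# Hypothetical upload of an ONNX file. The name and description fields
# mirror the Model properties described above; any dynamic input
# shapes are supplied later, when Octomizing the Model.
model = ONNXModel(
    octomizer_client,
    name="my_model",
    model="my_model.onnx",  # path to the ONNX file on disk
    description="Image model with NCHW input 1,3,512,512",
)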

Model Optimization and Benchmarking

Viewing a Model allows you to optimize it for a given set of hardware targets. Selecting one or more hardware platforms and optimizing against them will create one or more Workflows that perform the autotuning and benchmarking steps on the model.

At any given time, a Workflow may be in one of four states:

Pending

The workflow has been created but has not yet started running. Workflows run automatically when resources are available.

Running

The workflow is currently in progress. If available, the progress of the workflow will be shown.

Completed

The workflow has completed successfully.

Failed

The workflow failed during execution. An error message will be shown with details.

An email notification will be sent when a Workflow completes or fails.

Upon completion, each Workflow may create a new Model Variant (one for each selected hardware platform), as well as benchmark results.
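
Programmatically, kicking off and waiting on such a Workflow might look like the sketch below, continuing the upload example above. The get_uploaded_model_variant, octomize, wait, and completed names, as well as the "broadwell" platform identifier, are assumptions about the SDK, not verified signatures.

# Hypothetical sketch: Octomize the uploaded variant for one hardware
# target and block until the Workflow reaches Completed or Failed.
model_variant = model.get_uploaded_model_variant()  # assumed helper
workflow = model_variant.octomize(platform="broadwell")  # assumed method

workflow.wait()  # assumed; returns when the Workflow finishes
if workflow.completed():  # assumed status check
    print("Workflow completed:", workflow.uuid)
else:
    print("Workflow failed; see the error message for details.")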

Model benchmarking

Model benchmarking repeatedly runs model inference with the Model Variant's associated runtime (e.g., Relay for TVM-generated models, ONNX Runtime for ONNX models) on the selected hardware target, then reports the mean and standard deviation of those measurements. Model inference during benchmarking uses randomized input data.

To conduct ONNX-RT benchmarking, the Octomizer currently runs ONNX 1.6 and ONNX-RT 1.4.0 with the default CPU Execution Providers and, for NVIDIA/CUDA targets, onnx_tensorrt 7.0 (on top of TensorRT 7.0).

Complete benchmark results can be obtained via the Octomizer Python API.
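
As an illustration, retrieving those results might be sketched as follows, continuing the earlier example. The benchmark method and every result field name below are assumptions about the SDK's protobuf layout; treat this as pseudocode to be checked against the Python API reference.

# Hypothetical sketch: benchmark a Model Variant and read back the
# summary statistics described above (mean and standard deviation).
workflow = model_variant.benchmark(platform="broadwell")  # assumed method
workflow.wait()  # assumed

result = workflow.proto.status.result.benchmark_result  # assumed field path
print("mean runtime (ms):", result.runtime_mean_ms)  # assumed field
print("std deviation (ms):", result.runtime_std_ms)  # assumed field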

Model Autotuning

Autotuning a model is the Octomizer’s optimization process. Each autotuning run creates a new Model Variant, which is a logically equivalent model (that is, a model which retains the same accuracy as its original state) with performance tuned for a specific hardware target.

Octomizer uses TVM for model optimization and autotuning. TVM has both graph-level and operator-level optimizations; while the graph-level optimizations are deterministic, the operator-level optimizations are not. The autotuning process works at the operator level by exploring the compute and memory resources of the selected hardware target, generating many variations of a compiled operator and benchmarking each of those variations to determine the optimal configuration.

The results of these operator-level optimizations are called logs in the Octomizer, and are distinct from audit logs or other typical software logs. Autotuning logs represent the specific layout of an operator and the resulting performance, and the combination of these logs describes the optimization of the overall model. When the results of these logs are combined and applied to the model, a new Model Variant is generated.

While the API exposes all Model Variants for a given Model, the Web interface only presents the highest-performing Model Variant for each hardware target to the user.
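
Enumerating every variant from Python might look like the short sketch below, continuing the earlier example; list_model_variants is an assumed method name.

# Hypothetical sketch: list all Model Variants of a Model via the API,
# not just the best performer shown in the Web interface.
for variant in model.list_model_variants():  # assumed method name
    print(variant.uuid)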

Performance metrics

Model performance resulting from benchmarking runs is available for every model for which an optimization run has occurred.

The Octomizer Python API provides complete access to all benchmarking results for all Model Variants.

Unexpected model performance

Occasionally, Octomized models may show no results, or results far faster or slower than expected (below one millisecond, or above 1,000 milliseconds, for most models).

Because of the emerging nature of model architectures, model operators, compilation framework coverage, and vendor-provided hardware-specific libraries (e.g., CUDA, OpenVINO), some model runs may produce these results unexpectedly.

If this occurs, please Contact Support for assistance.

Model Packaging

By default, a packaged model is converted to a Python wheel targeting a specific hardware platform. You may instead specify PackageType.LINUX_SHARED_OBJECT to package the model as a Linux shared object tar file. Please refer to the Python API for further details.
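
A packaging request might be sketched as follows, continuing the earlier example. PackageType.LINUX_SHARED_OBJECT comes from the paragraph above; the module path and the package method are assumptions about the SDK.

from octomizer.package_type import PackageType  # assumed module path

# Hypothetical sketch: package an optimized Model Variant as a Linux
# shared object tar file instead of the default Python wheel.
workflow = model_variant.package(  # assumed method
    platform="broadwell",  # assumed platform identifier
    package_type=PackageType.LINUX_SHARED_OBJECT,
)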

If you need a specific packaging format - as a C API, gRPC call, or Docker container, for instance - please Contact Support.

Managing users as an account administrator

In the Octomizer, there are two types of users: account administrators and regular users. Account administrators can add new users to an account, modify users' permission settings, and deactivate users in an account. Each account can have multiple administrators. The initial account user automatically receives administrator privileges but can later be downgraded to a regular user.

When adding a new user, an account administrator must specify the user's first and last names, email, and permission settings. Permission settings include (1) whether the user will be an account administrator or a regular user, and (2) whether the user will be able to Octomize models (which incurs cost on the account) or will have read-only access to the account.

Below is sample code showing how to add users, view a list of existing users, update users’ permissions, and deactivate users in an existing account:

from datetime import datetime

from octomizer import client

ACCESS_TOKEN = "Token"
NEW_USER_FIRST = "First"
NEW_USER_LAST = "Last"
NEW_USER_EMAIL = "user@user.com"

client = client.OctomizerClient(access_token=ACCESS_TOKEN)
account_uuid = client.get_current_user().account_uuid

# Add a new user.
new_user = client.add_user(
    given_name=NEW_USER_FIRST,
    family_name=NEW_USER_LAST,
    email=NEW_USER_EMAIL,
    account_uuid=account_uuid,
    is_own_account_admin=False,
    can_octomize=True,
)
print(new_user)


# List users in your account.
users = client.list_users(account_uuid=account_uuid)
for user in users:
    print(user)


# Give an existing user admin permissions.
new_admin_user = client.update_user(
    user_uuid="some_uuid",  # you can find this uuid by listing users in your account.
    is_own_account_admin=True,
)
print(new_admin_user)


# Remove an existing user's permissions to octomize.
non_octomize_user = client.update_user(
    user_uuid="some_uuid",  # you can find this uuid by listing users in your account.
    can_octomize=False,
)
print(non_octomize_user)


# Deactivate an existing user.
deactivated_user = client.update_user(
    user_uuid="some_uuid",  # you can find this uuid by listing users in your account.
    active=False,
)
print(deactivated_user)

# Get usage data for the active user.
usage_data = client.get_usage(
    start_time=datetime.min,  # omit to default to the start of the current month
    end_time=datetime.max,  # omit to default to the end of the current month
    account_uuid=account_uuid,
)
print(usage_data)