Serving Machine Learning models in Google Cloud

Part 4: Deploying NVIDIA Triton Inference Server into AI Platform

Kevin Tsai
2 min read · Oct 29, 2020
Image Credit NASA/JPL/USGS

Before we dive into a couple of examples, here is some background information that you may find helpful.

Model artifacts in AI Platform are organized in a model/model version hierarchy, looking something like:

model: customer_propensity
  version: v01
  version: v02
  version: v03
model: inventory_forecast
  version: v02
  version: v03
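
If you want to inspect this hierarchy in your own project, the gcloud CLI can list models and the versions under them. A quick sketch, using the model name and region from the examples in this guide:

# List models deployed to the regional endpoint
gcloud ai-platform models list --region us-central1

# List the versions under a given model
gcloud ai-platform versions list --model customer_propensity --region us-central1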

To create a model version, you first need to create the model where the model version will reside.

gcloud ai-platform models create customer_propensity --region us-central1

Then you can create the model version:

gcloud beta ai-platform versions create v02 \
--model customer_propensity \
--accelerator count=1,type=nvidia-tesla-t4 \
--config config_simple.yaml \
--region us-central1
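
Version creation can take several minutes while AI Platform provisions nodes and copies the model artifacts. A quick way to check progress, assuming the same model, version name, and region as above, is to describe the version and look at its state:

gcloud ai-platform versions describe v02 \
--model customer_propensity \
--region us-central1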

The --config flag points to config_simple.yaml, which contains:

autoScaling:
  minNodes: 1
container:
  args:
  - tritonserver
  - --model-repository=$(AIP_STORAGE_URI)
  env: []
  image: $REGION-docker.pkg.dev/$PROJECT_ID/$REPO/$IMAGE:VERSION
  ports:
  - containerPort: 8000
deploymentUri: $PATH_TO_MODEL_ARTIFACTS
machineType: n1-standard-4
routes:
  health: /v2/models/$MODEL_NAME
  predict: /v2/models/$MODEL_NAME/infer

You will need to provide the following parameters:

  • VERSION_NAME: this is the name of this Model Version, such as v02 in the command above.
  • PATH_TO_MODEL_ARTIFACTS: AI Platform will look to this location to copy the model artifacts into this model version.
  • REGION: region where the container image is located.
  • PROJECT_ID: this is the Project ID where the container image is located.
  • REPO: repository where the container image is located.
  • IMAGE:VERSION: this is the image and version of the container.
  • MODEL_NAME: this must match the name of a model in the model artifacts. In this guide we are setting up a single model, and this tells Triton which model to run prediction requests against.
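
To make the substitutions concrete, here is a hypothetical filled-in config_simple.yaml. The project ID, repository, image tag, bucket path, and model name are illustrative placeholders only; AIP_STORAGE_URI stays as-is because AI Platform resolves it at runtime:

# Hypothetical values; substitute your own project, repository, image, bucket, and model name.
autoScaling:
  minNodes: 1
container:
  args:
  - tritonserver
  - --model-repository=$(AIP_STORAGE_URI)
  env: []
  image: us-central1-docker.pkg.dev/my-project/my-repo/tritonserver:20.10-py3
  ports:
  - containerPort: 8000
deploymentUri: gs://my-bucket/triton/model_repository/
machineType: n1-standard-4
routes:
  health: /v2/models/customer_propensity
  predict: /v2/models/customer_propensity/infer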

Other model version parameters:

  • Triton listens on port 8000 for HTTP requests by default. The containerPort setting tells AI Platform which port to use when communicating with Triton.
  • In this example, we use an n1-standard-4 machine type and one nvidia-tesla-t4 GPU.
  • AIP_STORAGE_URI is an AI Platform-provided environment variable. It points to the location where the model artifacts are copied during model version creation; the content comes from deploymentUri. If your model server needs a location to retrieve model artifacts from, use AIP_STORAGE_URI, not deploymentUri.
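
Once the version is ready, AI Platform forwards prediction requests to the predict route defined above, so the request body follows Triton's v2 HTTP inference protocol. Here is a rough sketch of a request against the regional endpoint; the tensor name, shape, datatype, and data are placeholders for whatever your model actually expects:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{"inputs": [{"name": "input__0", "shape": [1, 4], "datatype": "FP32", "data": [5.1, 3.5, 1.4, 0.2]}]}' \
https://us-central1-ml.googleapis.com/v1/projects/$PROJECT_ID/models/customer_propensity/versions/v02:predict

If the request succeeds, Triton returns its response in the same v2 protocol format, with an outputs array containing the prediction.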

What’s next

Google Cloud Platform has many other tutorials covering a wide range of topics. Try them out here.

Written by Kevin Tsai

Applied AI Engineering at Google
