Serving Machine Learning models in Google Cloud
Part 4: Deploying NVIDIA Triton Inference Server into AI Platform
Oct 29, 2020
Before we let loose on a couple of examples, here is some background information that you may find helpful.
Model artifacts in AI Platform are organized in a model/model version hierarchy, looking something like:
model: customer_propensity
    version: v01
    version: v02
    version: v03
model: inventory_forecast
    version: v02
    version: v03
To create a model version, you first need to create the model in which the version will reside:
gcloud ai-platform models create customer_propensity --region us-central1
Then you can create the model version:
gcloud beta ai-platform versions create v02 \
    --model customer_propensity \
    --accelerator count=1,type=nvidia-tesla-t4 \
    --config config_simple.yaml \
    --region us-central1
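If you would rather script the two steps than type them, the sketch below assembles the same two gcloud commands as argument lists. The helper name build_create_commands is hypothetical; the flags and values are the ones used in this guide. The sketch only prints the commands, and actually running them (for example with subprocess.run) assumes that the gcloud CLI is installed and authenticated.

```python
import shlex
from typing import List


def build_create_commands(model: str, version: str, region: str,
                          config_path: str, accelerator: str) -> List[List[str]]:
    """Return the argv lists for creating a model and then a model version.

    Hypothetical helper: it only assembles the commands, it does not run them.
    """
    create_model = [
        "gcloud", "ai-platform", "models", "create", model,
        "--region", region,
    ]
    create_version = [
        "gcloud", "beta", "ai-platform", "versions", "create", version,
        "--model", model,
        "--accelerator", accelerator,
        "--config", config_path,
        "--region", region,
    ]
    return [create_model, create_version]


commands = build_create_commands(
    model="customer_propensity",
    version="v02",
    region="us-central1",
    config_path="config_simple.yaml",
    accelerator="count=1,type=nvidia-tesla-t4",
)

for argv in commands:
    # Print a copy-pasteable command line. To execute it instead, one option
    # is subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Building argument lists rather than one long string avoids shell quoting problems and makes the model-then-version ordering explicit.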
Here, config_simple.yaml contains:
autoScaling:
  minNodes: 1
container:
  args:
  - tritonserver
  - --model-repository=$(AIP_STORAGE_URI)
  env: []
  image: $REGION-docker.pkg.dev/$PROJECT_ID/$REPO/$IMAGE:VERSION
  ports:
    containerPort: 8000
deploymentUri: $PATH_TO_MODEL_ARTIFACTS
machineType: n1-standard-4
routes:
  health: /v2/models/$MODEL_NAME
  predict: /v2/models/$MODEL_NAME/infer
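The $NAME entries in this file are placeholders that need real values before the file is passed to gcloud. One possible way to fill them in is sketched below. The helper render_config and all example values (project, repository, image tag, bucket) are hypothetical. Note that $(AIP_STORAGE_URI) is deliberately left untouched, because it is not one of your placeholders.

```python
def render_config(template: str, values: dict) -> str:
    """Replace every $KEY in the template with values[KEY]."""
    rendered = template
    # Longest placeholders first, so a short key can never clobber a longer one.
    for key in sorted(values, key=len, reverse=True):
        rendered = rendered.replace("$" + key, values[key])
    return rendered


# Excerpt of config_simple.yaml: only the lines that contain a "$".
# Indentation is omitted here for brevity.
template_excerpt = """\
- --model-repository=$(AIP_STORAGE_URI)
image: $REGION-docker.pkg.dev/$PROJECT_ID/$REPO/$IMAGE:VERSION
deploymentUri: $PATH_TO_MODEL_ARTIFACTS
health: /v2/models/$MODEL_NAME
predict: /v2/models/$MODEL_NAME/infer
"""

# Hypothetical example values; substitute your own.
example_values = {
    "REGION": "us-central1",
    "PROJECT_ID": "my-project",
    "REPO": "my-repo",
    "IMAGE:VERSION": "my-triton-image:v1",
    "MODEL_NAME": "simple",
    "PATH_TO_MODEL_ARTIFACTS": "gs://my-bucket/model_repository",
}

rendered_excerpt = render_config(template_excerpt, example_values)
print(rendered_excerpt)
```

Plain string replacement is used instead of a templating library so that IMAGE:VERSION can be treated as a single placeholder, matching the way it is written in the file.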
You will need to provide the following parameters:
- VERSION_NAME: the name of this model version, such as v02 in the command above.
- PATH_TO_MODEL_ARTIFACTS: the location from which AI Platform copies the model artifacts into this model version.
- REGION: region where the container image is located.
- PROJECT_ID: this is the Project ID where the container image is located.
- REPO: repository where the container image is located.
- IMAGE:VERSION: this is the image and version of the container.
- MODEL_NAME: this must match the name of a model in the model artifacts. In this guide, we are setting up a single model, and this name tells Triton which model should serve prediction requests.
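Because MODEL_NAME ends up in the predict route, it also determines where inference requests are sent. As a rough illustration, the sketch below builds the predict path and a request body in the JSON layout of Triton's v2 HTTP/REST inference protocol. The model name, tensor name, shape, and data are hypothetical and must match your own model's configuration. The sketch also assumes that the request body reaches Triton unchanged.

```python
import json
from typing import Any, Dict, List


def predict_route(model_name: str) -> str:
    """Path of the predict route for a given Triton model name."""
    return f"/v2/models/{model_name}/infer"


def build_infer_request(name: str, shape: List[int], datatype: str,
                        data: List[Any]) -> Dict[str, Any]:
    """Build a single-input request body in the v2 inference JSON layout."""
    return {
        "inputs": [
            {"name": name, "shape": shape, "datatype": datatype, "data": data}
        ]
    }


# Hypothetical model and tensor; replace with the values from your model.
predict_path = predict_route("simple")
request_dict = build_infer_request(
    name="INPUT0", shape=[1, 4], datatype="INT32", data=[1, 2, 3, 4]
)
request_body = json.dumps(request_dict)

print(predict_path)
print(request_body)
```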
Other model version parameters:
- By default, Triton listens on port 8000 for HTTP requests. The containerPort setting tells AI Platform which port to use to communicate with Triton.
- In this example, we use an n1-standard-4 machine with one nvidia-tesla-t4 GPU.
- AIP_STORAGE_URI is an environment variable provided by AI Platform. It is the location to which model artifacts are copied when this model version is created. The content comes from deploymentUri. If your model server needs a location from which to retrieve model artifacts, use AIP_STORAGE_URI, not deploymentUri.
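To make the last point concrete, the sketch below shows how a custom entrypoint could derive the Triton command line from AIP_STORAGE_URI. The function triton_command and the example URI are hypothetical. In this guide the same effect is achieved without any code, through the $(AIP_STORAGE_URI) reference in the container args.

```python
from typing import List, Mapping


def triton_command(env: Mapping[str, str]) -> List[str]:
    """Build the tritonserver argv from the environment (hypothetical helper).

    The model repository comes from AIP_STORAGE_URI, not from deploymentUri.
    """
    storage_uri = env.get("AIP_STORAGE_URI")
    if not storage_uri:
        raise RuntimeError(
            "AIP_STORAGE_URI is not set; it is expected to be provided "
            "by AI Platform when the model version is created."
        )
    return ["tritonserver", f"--model-repository={storage_uri}"]


# Hypothetical value for illustration. Inside a real container you would
# pass os.environ instead of a hand-written dict.
example_env = {"AIP_STORAGE_URI": "gs://example-bucket/example-path"}
example_command = triton_command(example_env)
print(example_command)
```

Failing fast when the variable is missing makes a misconfigured deployment easier to diagnose than a server that starts with an empty model repository.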
What’s next
Google Cloud Platform has many other tutorials covering a wide range of topics.