Serving Machine Learning models in Google Cloud

Part 3: Introducing Custom Container Prediction in AI Platform

Kevin Tsai
2 min read · Oct 29, 2020

In Part 2, we covered model serving options in Google Cloud. Custom Container Prediction is the newest member of the family: a fully managed service with all the advanced trimmings like model monitoring and explainability, yet one that gives you the freedom of serving from a container of your choice.

A custom container image can be as simple as a base image from NVIDIA or Facebook. Ideally, your model pre-/post-processing already lives in a TensorFlow serving graph (the preferred method). All you need to do then is tell AI Platform how to talk to your container, and either point AI Platform at the location of your model artifacts or copy them into your container image yourself.
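
As a rough sketch of what that wiring looks like, the snippet below builds a version-creation request for the AI Platform Prediction REST API with the google-api-python-client library. The project, model, image URI, routes, port, and bucket path are placeholders, and the container/routes field names follow the custom-container beta, so treat this as an illustration rather than a copy-paste recipe.

```python
from googleapiclient import discovery

# Placeholder identifiers -- substitute your own project, model, image, and bucket.
PROJECT = "my-project"
MODEL = "my_model"

VERSION_BODY = {
    "name": "v1",
    "machineType": "n1-standard-4",
    # Where AI Platform looks for model artifacts (omit if they are baked
    # into the image instead).
    "deploymentUri": "gs://my-bucket/models/my_model/",
    # How AI Platform talks to your container: which image to run, which port
    # it listens on, and which routes serve predictions and health checks.
    # These field names follow the custom-container beta and may differ in
    # your API version.
    "container": {
        "image": "gcr.io/my-project/my-model-server:latest",
        "ports": [{"containerPort": 8080}],
    },
    "routes": {
        "predict": "/predict",
        "health": "/health",
    },
}

ml = discovery.build("ml", "v1")
request = ml.projects().models().versions().create(
    parent=f"projects/{PROJECT}/models/{MODEL}",
    body=VERSION_BODY,
)
print(request.execute())
```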

You can structure the container in one of two ways: as a direct model server, or as a model server with a listener in front of it. The first of the two architectures is the direct model server, where AI Platform sends requests straight to the model server's own endpoint. This configuration makes a lot of sense if (a quick local smoke test of such a container follows the list):

  • you are serving PyTorch or another framework that is not natively hosted on AI Platform today, since this lets you choose your own model server, such as TorchServe.
  • you need to configure model server parameters such as dynamic batching.
  • you only need to serve one model in a container.
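
Whichever model server you pick, it helps to smoke-test the container locally before handing it to AI Platform. Below is a minimal check, assuming you started the image with something like docker run -p 8080:8080 and that it exposes /health and /predict routes; adjust both to your server's actual endpoints (for example, TorchServe uses /ping and /predictions/<model>).

```python
import requests

BASE = "http://localhost:8080"  # port published by `docker run -p 8080:8080 ...`

# Health route: AI Platform polls an equivalent route before sending traffic.
assert requests.get(f"{BASE}/health", timeout=5).ok

# Predict route: AI Platform online prediction sends JSON shaped like
# {"instances": [...]}; swap in an instance your model actually accepts.
payload = {"instances": [[1.0, 2.0, 3.0]]}
resp = requests.post(f"{BASE}/predict", json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())
```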

However, as you control the destiny of your container, the second architecture, a model server with a listener, offers much more flexibility for scenarios where (a minimal listener sketch follows the list):

  • you need more complex routing than offered by AI Platform.
  • your pre-/post-processing is outside of your serving graph.
  • your pre-/post-processing is heavy and your latency budget is tight; here Custom Container Prediction offers better performance than AI Platform Custom Prediction Routines.
  • you need to serve more than one model artifact from within the same container.
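
To make the listener pattern concrete, here is a minimal sketch of such a listener written with Flask: it accepts AI Platform-style {"instances": [...]} requests, applies pre-/post-processing outside the serving graph, and forwards to a model server running on another port inside the same container. The model server URL, routes, and transform functions are hypothetical placeholders to adapt to whatever server you actually package.

```python
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical model server running in the same container; adjust the port and
# route to the server you actually package (Triton, TorchServe, TF Serving, ...).
MODEL_SERVER_URL = "http://localhost:8501/v1/models/my_model:predict"


def preprocess(instances):
    # Placeholder: tokenize text, resize images, normalize features, etc.
    return instances


def postprocess(predictions):
    # Placeholder: map class indices to labels, round scores, etc.
    return predictions


@app.route("/health", methods=["GET"])
def health():
    # AI Platform polls this route to decide whether the container is ready.
    return "ok", 200


@app.route("/predict", methods=["POST"])
def predict():
    instances = request.get_json()["instances"]
    model_request = {"instances": preprocess(instances)}
    model_response = requests.post(MODEL_SERVER_URL, json=model_request)
    predictions = model_response.json().get("predictions", [])
    return jsonify({"predictions": postprocess(predictions)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Because the listener owns the routing, the same pattern extends naturally to dispatching requests across multiple model artifacts hosted behind it in the same container.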

To summarize the two: go Direct when a single model server can handle requests as-is, and go Listener when you need custom routing, out-of-graph pre-/post-processing, or multiple models in one container. In general, go Direct unless you have a reason to go Listener.

Now that you have a conceptual understanding of Custom Container Prediction, let’s get hands on.

Next… Part 4: Deploying NVIDIA Triton Inference Server into AI Platform


Kevin Tsai

Solutions Architect at Google Cloud focusing on ML Inference, MLOps, and Distributed Training.