Serving Machine Learning models in Google Cloud

Kevin Tsai
3 min read · Oct 29, 2020
Photo by Jason Leung on Unsplash

This multi-part blog series discusses common options for high-performance ML model inference in Google Cloud, when to use each, and introduces the new member of the AI Platform family: Custom Container Prediction.

This blog series would not have been possible without the following people: Anand Iyer, David Goodwin, Dong Meng, Henry Tappen, Jarek Kazmierczak, Jill Milton, Mikhail Chrestkha, Nikita Namjoshi, Ricky Nguyen, and Robbie Haertel. A sincere thank you for making this possible.

Part 1: Characteristics of a high-performance ML serving platform

A lot has been written about model training, and a lot will continue to be written. Every year there is a new breakthrough model of some kind. Every year, MLPerf lists results from ever-larger clusters and ever-lower training times for ResNet-50, now down to under a minute. This year’s entries include clusters with thousands of accelerators, with the largest consuming as much electricity as over 500 U.S. homes. But hidden in the shadow of training is inference, where the test rigs look like they were assembled from the leftover parts of the training submissions.

Inference is where the rubber meets the road, or more aptly, where your model greets your users. An extra 50 milliseconds of latency and your user is gone, sometimes forever. Can’t scale on Black Friday? You just lost a bunch of sales. Inference works 24x7 but never gets the acknowledgment; it’s the proverbial stepchild from the previous marriage.

Unlike model training, inference commonly happens in real time. There is no “just restart the training job” when something goes wrong. The following are typical requirements of a high-performance model serving platform:

Scalable: Inference workloads in production scale with load. Scaling can happen gradually or suddenly. A production model serving platform needs to scale to meet request demand, even when the required compute resources are orders of magnitude higher than nominal.

Predictable: Applications of model serving can be extremely sensitive to latency, such as transaction fraud detection and ad serving. In some applications, every additional millisecond of response time can mean significant lost revenue. Providing predictable latency is therefore a key requirement for such a model serving platform.
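As a rough illustration (not tied to any particular serving stack), the sketch below measures tail latency against a stand-in `predict` function; in a real system the call would go to an actual model endpoint, and the p95/p99 numbers are what a latency-sensitive application has to keep in check.

```python
import random
import statistics
import time

def predict(request):
    # Stand-in for a call to a real model endpoint; latency is simulated here.
    time.sleep(random.uniform(0.005, 0.030))
    return {"score": 0.5}

def measure_latency(n_requests=200):
    """Collect per-request latencies and report median and tail percentiles."""
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict({"features": [1.0, 2.0, 3.0]})
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    q = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    print(f"p50={q[49]:.1f} ms  p95={q[94]:.1f} ms  p99={q[98]:.1f} ms")

if __name__ == "__main__":
    measure_latency()
```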

Flexible: Data pre- and post-processing range from simple to complex. Preprocessing may be embedded in the model serving graph, or it may be a multi-step process that calls out to external services. A production model serving platform needs to offer the flexibility to include arbitrary pre- and post-processing pipelines.
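As a hypothetical sketch of what that flexibility means in practice, the snippet below wraps a placeholder model call with pre- and post-processing steps; the function and field names are made up for illustration and are not taken from any Google Cloud API.

```python
import json

def preprocess(raw_request):
    """Parse the request and turn raw fields into model-ready features."""
    payload = json.loads(raw_request)
    return [float(x) for x in payload["instances"][0]]

def model_predict(features):
    """Stand-in for the actual model call (e.g. a framework runtime or an RPC)."""
    return sum(features)  # dummy score

def postprocess(score, threshold=0.5):
    """Map the raw model output to a business-level decision."""
    return {"score": score, "label": "positive" if score >= threshold else "negative"}

def handle(raw_request):
    # A flexible serving platform lets this whole pipeline run per request,
    # including calls to external services inside preprocess/postprocess.
    return postprocess(model_predict(preprocess(raw_request)))

print(handle('{"instances": [[0.1, 0.2, 0.3]]}'))
```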

Economical: One easy way to achieve the scalability and predictability requirements is to over-provision capacity. By setting the design capacity to ten or even twenty times the nominal expected load, you can create a large “cushion” for peaks and spikes in workload. However, this wastes resources. A production serving platform should support the judicious use of compute resources via techniques like autoscaling, dynamic batching, model stacking, and model priority.
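To make dynamic batching concrete, here is a toy in-process sketch (names and behavior are illustrative only, not a real serving framework): requests that arrive within a few milliseconds of each other are grouped into a single model call, improving accelerator utilization at a small, bounded latency cost.

```python
import queue
import threading
import time

class DynamicBatcher:
    """Toy dynamic batcher: groups requests arriving within a short window."""

    def __init__(self, max_batch_size=8, max_wait_ms=5):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self._queue = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, features):
        """Called by request handlers; blocks until the batched result is ready."""
        slot = {"features": features, "result": None, "done": threading.Event()}
        self._queue.put(slot)
        slot["done"].wait()
        return slot["result"]

    def _loop(self):
        while True:
            batch = [self._queue.get()]  # block until at least one request arrives
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch_size and time.monotonic() < deadline:
                try:
                    batch.append(self._queue.get(timeout=max(0.0, deadline - time.monotonic())))
                except queue.Empty:
                    break
            outputs = self._model_predict([s["features"] for s in batch])
            for slot, output in zip(batch, outputs):
                slot["result"] = output
                slot["done"].set()

    @staticmethod
    def _model_predict(batch):
        return [sum(features) for features in batch]  # placeholder model

if __name__ == "__main__":
    batcher = DynamicBatcher()
    print(batcher.submit([0.1, 0.2, 0.3]))
```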

Next… Part 2: Choosing an ML serving platform in Google Cloud


Kevin Tsai

Solutions Architect at Google Cloud focusing on ML Inference, MLOps, and Distributed Training.