Resource efficiency is a primary concern in production machine learning systems. This is particularly true for realtime inference, where deployed models need to serve predictions on demand with low latency.

The challenge lies in balancing availability with cost. To maximize availability, you need enough deployed instances to perform inference on all of your models. To minimize costs, you want to deploy as few instances as possible. The tension between these two goals can be frustrating.

We’ve built many features into Cortex, our open source ML deployment platform, to make this challenge easier, including request-based autoscaling, multi-model APIs, and Spot instance support. But we’ve recently developed something new that we believe will make the balance even easier to strike.

Cortex’s upcoming release will officially support multi-model caching, allowing you to run inference on thousands of models from a single deployed instance.
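
To make the idea concrete, here is a minimal, framework-agnostic sketch of what multi-model caching does conceptually: load models on demand, keep at most a fixed number of them in memory, and evict the least recently used model when the limit is reached. This is an illustration of the general technique, not Cortex's implementation; the `ModelCache` class, the `load_model` callable, and the cache size are all hypothetical.

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most `max_models` loaded models in memory, evicting the
    least recently used one when the limit is reached (conceptual sketch)."""

    def __init__(self, max_models, load_model):
        self.max_models = max_models
        self.load_model = load_model  # callable: model_name -> loaded model
        self.cache = OrderedDict()

    def get(self, model_name):
        if model_name in self.cache:
            # Cache hit: mark the model as most recently used.
            self.cache.move_to_end(model_name)
        else:
            if len(self.cache) >= self.max_models:
                # Cache full: evict the least recently used model to free memory.
                self.cache.popitem(last=False)
            # Cache miss: load the requested model on demand.
            self.cache[model_name] = self.load_model(model_name)
        return self.cache[model_name]
```

With a cache like this sitting inside a single API instance, requests for any of thousands of models only pay the load cost when the requested model isn’t already in memory.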

Multi-model caching: a brief introduction

The standard approach to realtime inference, which Cortex uses, is what we call the model-as-microservice paradigm. Essentially, it involves writing an API that runs inference on a model (a predictor) and deploying it as a web service.
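
As a sketch of that paradigm, a predictor is typically a small class that loads a model once at startup and then runs inference per request. The snippet below assumes a Cortex-style Python predictor interface and a hypothetical TensorFlow SavedModel path passed in through `config`; treat it as an illustration rather than a drop-in example.

```python
import tensorflow as tf


class PythonPredictor:
    def __init__(self, config):
        # Load the model once, when the API instance starts up.
        # `config["model_path"]` is a hypothetical key pointing at a SavedModel.
        self.model = tf.keras.models.load_model(config["model_path"])

    def predict(self, payload):
        # Run inference for a single request and return a JSON-serializable result.
        inputs = tf.convert_to_tensor([payload["input"]])
        return self.model.predict(inputs).tolist()
```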
