Serverless GPU-Powered Hosting of Machine Learning Models

A working example with Algorithmia

Motivation

With the rise of MLOps in recent years, running machine learning models for inference has become much easier. Depending on the use case, appropriately optimized deep learning models can even run directly on a mobile device. In client-server and microservice architectures, larger models with high accuracy requirements are usually hosted centrally and queried by downstream services via well-defined interfaces. Tools such as TensorFlow Serving make even these use cases manageable on an appropriately configured server infrastructure.
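To make the "well-defined interface" part concrete, here is a minimal sketch (not taken from my project) of how a downstream service might query a classifier hosted behind TensorFlow Serving's REST API. The host, port, and model name are placeholders.

```python
# Minimal sketch: querying a model behind TensorFlow Serving's REST predict API.
# The URL and model name ("my_classifier") are placeholders, not a real deployment.
import requests

SERVING_URL = "http://tf-serving.internal:8501/v1/models/my_classifier:predict"

def classify(pixels):
    """Send one preprocessed image (nested lists of floats) to the model server."""
    response = requests.post(SERVING_URL, json={"instances": [pixels]}, timeout=10)
    response.raise_for_status()
    # TensorFlow Serving returns {"predictions": [...]}, one entry per instance.
    return response.json()["predictions"][0]
```

The downstream service only needs to know this contract; where and how the model actually runs is the deployment problem discussed below.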

However, from a software engineering perspective, we know how complex a self-managed infrastructure can become. Not surprisingly, serverless offerings from cloud providers are gaining popularity for application development these days. No infrastructure management and pay-as-you-go pricing are the main advantages, which is why I now work almost exclusively with such solutions.

Serverless GPU in 2021

However, when I found myself needing to integrate a rather complex deep learning model for online prediction into such a serverless microservice architecture, I was somewhat surprised. In my use case, the requirement was to process individual requests containing base64-encoded images at irregular intervals (a few seconds to several hours apart) and return the correct class using a self-trained deep learning model. In my opinion, a standard task without deep complexity. Spoiled by Cloud Run, Cloud Functions, AWS Lambda, etc., I naively thought there would be an "enable GPU" checkbox and off we go…
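For illustration, this is roughly what the task boils down to. The sketch below uses a placeholder torchvision model instead of my self-trained one, and the function name is made up:

```python
# Rough sketch of the prediction task: decode a base64-encoded image and return
# a class index. resnet18 is only a stand-in for the self-trained model.
import base64
import io

import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet18(pretrained=True)  # placeholder model
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def predict(image_b64: str) -> int:
    """Decode the base64 payload and return the predicted class index."""
    image = Image.open(io.BytesIO(base64.b64decode(image_b64))).convert("RGB")
    batch = preprocess(image).unsqueeze(0)  # shape: [1, 3, 224, 224]
    with torch.no_grad():
        logits = model(batch)
    return int(logits.argmax(dim=1).item())
```

Wrapped in a small HTTP handler, this is essentially all the service has to do; the hard part turned out to be where to run it.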

Not quite. In fact, finding a truly serverless solution turned out to be non-trivial. As already described here, the classic serverless offerings are designed primarily for CPU workloads. Inference on CPUs alone was out of the question in my case, since the service's latency requirement could not have been met that way.

Google AI Platform Prediction and AWS SageMaker?

Meanwhile, Google (with AI Platform Prediction) and AWS (with SageMaker) offer managed solutions that include inference accelerators for deep learning models. Here is a brief summary of why these services did not meet my requirements (for now).

Starting with AWS SageMaker: the minimum instance count for a real-time endpoint is required to be 1 or higher (https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-prerequisites.html). For many use cases with continuous load this should not be a problem. For my case, however, it would be a waste of resources and does not fully match the pay-as-you-go principle I was looking for.
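For context, here is a sketch (with placeholder endpoint and variant names) of how autoscaling is registered for a SageMaker endpoint via Application Auto Scaling; the point being that the minimum capacity cannot be set to 0:

```python
# Sketch: registering autoscaling for a SageMaker endpoint variant.
# Endpoint/variant names are placeholders; MinCapacity must be >= 1.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-endpoint/variant/AllTraffic",       # placeholder
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,   # scaling to zero instances is not possible this way
    MaxCapacity=4,
)
```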

Google AI Platform Prediction currently only allows GPUs to be used with the TensorFlow SavedModel format. PyTorch models, for example, can only be served from custom containers (currently pre-GA), which do not support GPUs. In addition, Google does allow autoscaling to 0, but if a request triggers your service, you are charged for a minimum of 10 minutes of compute time, even if the request itself took only a fraction of a second (https://cloud.google.com/ai-platform/prediction/pricing).

#inference #serverless #gpu #mlops
