In this article we introduce the Google Cloud service Dataflow and show how it can be used to run predictions on millions of images in a serverless way: no cluster creation, no maintenance, and you pay only for what you use. We start with some context on why we think this matters, give a brief introduction to a couple of concepts, and then move directly into a use case.

Scaling Machine Learning models is hard and expensive, mainly for two reasons: the infrastructure itself is expensive, and a slow experimentation pipeline leaves your ML team constantly waiting for results.

It is easy to underestimate how expensive it is to have an ML team waiting for results, not only because of the time wasted but also because people get frustrated and end up losing motivation.

One way to optimize your ML experimentation process is to build and manage your own infrastructure. This is common at big companies, but it is often prohibitively expensive for smaller ones.

Traditional software development faced similar challenges years ago: companies needed to scale, mainly driven by business needs (slow services, APIs, and websites don't scale). Cloud providers saw an opportunity to offer services that let companies scale without upfront costs. They approached the problem in two main ways: managed services that are easier to operate, and serverless services. The latter were particularly attractive to small companies that didn't have a big DevOps team or the budget for their own infrastructure.

A similar trend is happening in Machine Learning, this time driven by the scarcity and high cost of ML talent. As with software, small companies can't invest huge amounts of money in custom, optimized ML infrastructure: it takes a team of experts to build and maintain it.

Cloud providers saw a similar opportunity to add value by offering serverless solutions for ML. GCP in particular has done this for a while: its AI Platform offers services for each stage of an ML project, and GCP also provides other services that can help solve ML challenges.

What is Dataflow?

Dataflow is a fully managed data processing service that uses Apache Beam as its programming model to define and execute pipelines. Dataflow is one of the many runners that Beam supports.

Apache Beam provides a portable API layer for building sophisticated data-parallel processing pipelines that may be executed across a diversity of execution engines, or runners.

Dataflow in particular has some interesting features:

  • Fully managed
  • Autoscaling
  • Easy implementation
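
To make "easy implementation" concrete, here is a minimal sketch of what launching a Beam pipeline on Dataflow can look like from Python (Beam itself is introduced in the next section). The project ID, region, bucket paths, and the predict function are illustrative placeholders, not references to a real project:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def predict(path):
    # Placeholder for real model inference; returns the path with a dummy label.
    return f"{path},cat"


# The project, region, and bucket values below are illustrative placeholders.
options = PipelineOptions(
    runner="DataflowRunner",             # execute on Dataflow instead of locally
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # GCS path Dataflow uses for staging
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read image paths" >> beam.io.ReadFromText("gs://my-bucket/image_paths.txt")
        | "Run prediction" >> beam.Map(predict)
        | "Write results" >> beam.io.WriteToText("gs://my-bucket/predictions")
    )
```

Note that nothing in the code provisions machines: Dataflow spins up the workers, autoscales them with the size of the input, and tears everything down when the job finishes.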

What is Apache Beam?

As mentioned above, Beam is a programming model for defining and executing processing pipelines. It provides a single interface for building a pipeline that can then be executed in multiple environments (such as Spark, Flink, or Dataflow).
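
As a small illustration of that portability, the sketch below runs a trivial pipeline locally with Beam's DirectRunner; pointing the same pipeline at Dataflow only requires swapping the runner and adding the GCP options shown earlier:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes the pipeline on the local machine, which is handy for
# development; switching to "DataflowRunner" runs the same code on GCP.
with beam.Pipeline(options=PipelineOptions(runner="DirectRunner")) as p:
    (
        p
        | "Create numbers" >> beam.Create([1, 2, 3, 4])
        | "Square" >> beam.Map(lambda x: x * x)
        | "Print" >> beam.Map(print)
    )
```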
