The fully end-to-end example that TensorFlow Extended provides by running `tfx template copy taxi $target-dir` produces 17 files scattered across 5 directories. If you are looking for a smaller, simpler, self-contained example that actually runs on the cloud rather than locally, this is what you are looking for. The required cloud services setup is also covered here.

What’s going to be covered

We are going to generate statistics and a schema for the Chicago taxi trips CSV dataset, which you can find under the `data` directory after running the `tfx template copy taxi` command.
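Under the hood, three standard TFX components are enough for this step. A minimal sketch, assuming a recent TFX release and using the `data` directory from the copied template (the path is a placeholder):

```python
# Hypothetical data root; point this at the directory containing the taxi CSVs.
DATA_ROOT = "data"

def make_schema_components(data_root: str):
    """Wire up the TFX components that produce statistics and a schema.

    CsvExampleGen ingests the CSVs, StatisticsGen computes dataset
    statistics, and SchemaGen infers a schema from those statistics.
    """
    from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen

    example_gen = CsvExampleGen(input_base=data_root)
    statistics_gen = StatisticsGen(examples=example_gen.outputs["examples"])
    schema_gen = SchemaGen(statistics=statistics_gen.outputs["statistics"])
    return [example_gen, statistics_gen, schema_gen]
```

These components are then handed to a TFX pipeline object together with the Beam arguments shown later in the article.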

Generated artifacts such as the data statistics or the schema can then be viewed from a Jupyter notebook, either by connecting to the ML Metadata store or simply by downloading the artifacts from file/binary storage.
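On the notebook side, the artifacts can be loaded and rendered with TensorFlow Data Validation. A sketch, assuming TFDV is installed; the URIs are placeholders, and the exact artifact file names depend on your TFX version:

```python
def show_artifacts(stats_uri: str, schema_uri: str):
    """Load and visualise pipeline artifacts inside a Jupyter notebook."""
    import tensorflow_data_validation as tfdv

    # e.g. gs://<bucket>/<pipeline>/StatisticsGen/statistics/<id>/... (placeholder)
    stats = tfdv.load_statistics(stats_uri)
    # e.g. gs://<bucket>/<pipeline>/SchemaGen/schema/<id>/schema.pbtxt (placeholder)
    schema = tfdv.load_schema_text(schema_uri)

    tfdv.visualize_statistics(stats)  # interactive facets view
    tfdv.display_schema(schema)       # tabular schema view
```

If the artifacts live in a bucket, the `gs://` URIs work directly as long as your notebook environment is authenticated against the project.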

Full code sample at the bottom of the article

Services Used

  • Dataflow as the managed Apache Beam pipeline runner
  • Storage Buckets as a simple (but fast) binary and file storage service
  • (Optional, with diminishing returns) Cloud SQL (MySQL) as the backing storage service for ML Metadata
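For the optional Cloud SQL route, connecting to a MySQL-backed ML Metadata store is a matter of filling in a `ConnectionConfig`. A sketch using the `ml-metadata` package; the host, database name, and credentials are placeholders:

```python
def mlmd_store(host: str, user: str, password: str):
    """Return an ML Metadata store backed by a Cloud SQL (MySQL) instance."""
    from ml_metadata.proto import metadata_store_pb2
    from ml_metadata.metadata_store import metadata_store

    config = metadata_store_pb2.ConnectionConfig()
    config.mysql.host = host          # Cloud SQL instance IP (placeholder)
    config.mysql.port = 3306
    config.mysql.database = "mlmd"    # hypothetical database name
    config.mysql.user = user
    config.mysql.password = password
    return metadata_store.MetadataStore(config)
```

Skipping this service and letting the pipeline fall back to a SQLite file in a bucket works too, which is why the returns are diminishing for a small example.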

The whole pipeline can also run on your local machine, on other cloud providers, or on your own Spark cluster. The same example scales transparently to bigger datasets; the execution process below outlines how that happens.

Execution Process

  1. If running locally, your code is not serialised or sent to the cloud (of course). Otherwise, Beam serialises everything and uploads it to a staging location (typically bucket storage). Check out cloudpickle to get some intuition about how this serialisation is done.
  2. Your cloud runner of choice (ours is Dataflow) checks that all the referenced resources exist and are accessible (for example, the pipeline output location and temporary file storage).
  3. Compute instances are started and your pipeline is executed in a distributed fashion, showing up in the job inspector both while it is running and after it has finished.

It’s a good naming practice to use _/temp_ or _/tmp_ for temporary files and _/staging_ or _/binaries_ for the staging directory.
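Following that convention, the Beam pipeline arguments for a Dataflow run might look like this (the project, region, and bucket names are placeholders you must replace with your own):

```python
# Hypothetical bucket; substitute your own.
BUCKET = "gs://my-tfx-bucket"

def dataflow_beam_args(project: str, region: str, bucket: str) -> list[str]:
    """Build Beam pipeline args for Dataflow, using the /staging and /temp
    naming convention described above."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--staging_location={bucket}/staging",  # serialised code and binaries
        f"--temp_location={bucket}/temp",        # temporary files
    ]

args = dataflow_beam_args("my-gcp-project", "us-central1", BUCKET)
```

Swapping `DataflowRunner` for `DirectRunner` (and dropping the cloud-specific flags) is all it takes to run the same pipeline locally.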


TensorFlow Extended, ML Metadata and Apache Beam on the Cloud