Project Structure: Requirements

  • Enable experimentation with multiple pipelines
  • Support both a local execution mode and a deployment execution mode. This results in two separate run configurations: the first is used for local development and end-to-end testing, the second for running in the cloud.
  • Reuse code across pipeline variants if it makes sense to do so
  • Provide an easy-to-use CLI for executing pipelines with different configurations and data (a minimal entry point is sketched below)

A correct implementation also ensures that tests are easy to incorporate into your workflow.
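To make the CLI and dual-execution-mode requirements concrete, here is a minimal sketch of an entry point that selects a pipeline variant and a runtime mode. The module layout (`pipelines.<name>` exposing a `create_pipeline()` factory) and the flag names are hypothetical conventions, not part of TFX or of the project described in this article.

```python
# run.py -- hypothetical CLI entry point; module and flag names are placeholders.
import argparse
import importlib


def main() -> None:
    parser = argparse.ArgumentParser(description="Run an ML pipeline.")
    parser.add_argument("--pipeline", required=True,
                        help="Pipeline variant to run, e.g. 'my_pipeline'.")
    parser.add_argument("--runner", choices=["local", "kubeflow"], default="local",
                        help="Execution mode: local development or cloud deployment.")
    parser.add_argument("--data-root", required=True,
                        help="Location of the input data.")
    args = parser.parse_args()

    # Assumed convention: every pipeline variant exposes a create_pipeline() factory.
    module = importlib.import_module(f"pipelines.{args.pipeline}")
    tfx_pipeline = module.create_pipeline(data_root=args.data_root)

    if args.runner == "local":
        # Runs the pipeline in-process for development and end-to-end tests.
        from tfx.orchestration.local.local_dag_runner import LocalDagRunner
        LocalDagRunner().run(tfx_pipeline)
    else:
        # Compiles the pipeline into a workflow for a Kubeflow Pipelines installation.
        from tfx.orchestration.kubeflow import kubeflow_dag_runner
        kubeflow_dag_runner.KubeflowDagRunner().run(tfx_pipeline)


if __name__ == "__main__":
    main()
```

Because tests can call `main()` (or `create_pipeline()` directly) with a temporary data root, the same entry point serves development, testing, and deployment.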

Project Structure: Design Decisions

  • Use Python.
  • Use TensorFlow Extended (TFX) as the pipeline framework.

In this article we will demonstrate how to run a TFX pipeline both locally and on a Kubeflow Pipelines installation with minimum hassle.
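As a rough illustration, assuming a recent TFX (1.x-style) API, a single `create_pipeline()` factory can serve both modes; every name, path, and the abbreviated component list below are illustrative, not the article's actual project.

```python
# pipelines/my_pipeline.py -- hypothetical shared pipeline definition (TFX 1.x-style API).
from typing import List, Optional

from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration import metadata, pipeline


def create_pipeline(data_root: str,
                    pipeline_name: str = "my_pipeline",
                    pipeline_root: str = "./tfx_output",
                    metadata_path: str = "./metadata.db",
                    beam_pipeline_args: Optional[List[str]] = None) -> pipeline.Pipeline:
    """Builds one pipeline object that both the local and the Kubeflow runner can execute."""
    example_gen = CsvExampleGen(input_base=data_root)
    statistics_gen = StatisticsGen(examples=example_gen.outputs["examples"])
    # ...Trainer, Evaluator, Pusher, etc. would follow in a real project...

    return pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen],
        # SQLite-backed ML Metadata is mainly useful for local runs; a Kubeflow
        # deployment usually provides its own metadata configuration.
        metadata_connection_config=metadata.sqlite_metadata_connection_config(metadata_path),
        beam_pipeline_args=beam_pipeline_args or [],
    )
```

Keeping the pipeline definition runner-agnostic like this is what makes code reuse across variants and execution modes possible.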

Side Effects Caused By Design Decisions

  • By using TFX, we are going to use TensorFlow. Keep in mind that TensorFlow supports more types of models than just deep neural networks, such as boosted trees.
  • Apache Beam can execute locally, anywhere Kubernetes runs, and on all major public cloud providers. Examples include, but are not limited to, GCP Dataflow and Azure Databricks.
  • Due to Apache Beam, we need to make sure that the project code is easily packageable by Python's sdist for maximum portability. This is reflected in the top-level module structure of the project. (If you use external libraries, be sure to include them by providing the appropriate arguments to Apache Beam, as sketched below. Read more about this in Apache Beam: Managing Python Pipeline Dependencies.)
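For illustration, here is one way the Beam-related arguments could be threaded through the factory above when targeting a remote runner such as Dataflow; the project, region, and bucket values, and the assumption that the project ships a top-level setup.py, are placeholders rather than prescriptions.

```python
# Hypothetical Beam options for a cloud run; all values are placeholders.
beam_pipeline_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/tmp",
    # Package the project (and the external dependencies declared in setup.py)
    # so that remote Beam workers can import it.
    "--setup_file=./setup.py",
]

# Passed straight into the pipeline factory sketched earlier:
# tfx_pipeline = create_pipeline(data_root="gs://my-bucket/data",
#                                beam_pipeline_args=beam_pipeline_args)
```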

[Optional] _Before continuing, take a moment to read about the provided TFX CLI. Currently, it is embarrassingly slow to operate, and the directory structure it produces is much more verbose than it needs to be. It also does not include any notes on reproducibility and code reuse._
