MLOps meets DevOps - Pachyderm: Enables DevOps for data

This is a guest post by Jimmy Whitaker, senior data science evangelist at Pachyderm

In the last few years, DevOps has begun to shift—from a culture of continuous integration and continuous deployment (CI/CD) best practices to leveraging Git-based techniques to manage software deployments. This transition has made software less error prone, more scalable, and has increased collaboration by making DevOps more developer-centric.

At its heart, this simplification lets developers use the same powerful version control system they’re used to with Git as a way to deliver infrastructure as code, documentation, and even Kubernetes configurations. It works by using Git as a single source of truth, letting you build robust declarative apps and infrastructure with ease. But when it comes to machine learning applications, DevOps gets much more complex.

The difference between MLOps and DevOps

The kinds of problems we face in machine learning are fundamentally different than the ones we face in traditional software coding. Functional issues, like race conditions, infinite loops, and buffer overflows, don’t come into play with machine learning models. Instead, errors come from edge cases, lack of data coverage, adversarial assault on the logic of a model, or overfitting. Edge cases are the reason so many organizations are racing to build AI Red Teams to diagnose problems before things go horribly wrong.

It’s simply not enough to port your CI/CD and infrastructure code to machine learning workflows and call it done. Handling this new generation of machine learning operations (MLOps) problems requires a brand new set of tools that focus on the gap between code-focused operations and MLOps. The key difference is data. We need to version our data and datasets in tandem with the code. That means we need tools that specifically focus on data versioning, model training, production monitoring, and many others unique to the challenges of machine learning at scale.

Pachyderm: Enabling DevOps for data

Luckily, we have a strong tool for MLOps that does seamless data version control: Pachyderm. When we link the power of Pachyderm to our declarative Git-based workflows, we can bring the smooth flow of CI/CD to the world of machine learning models.

We’ll use two key tools to combine our existing CI/CD tooling and MLOps:

GitHub Actions builds powerful, flexible automation directly into the developer workflow. Actions can be triggered by nearly any GitHub event, such as pull requests, releases, push events, etc. While the triggered workflow can be almost anything, in the context of this post, it will most commonly be a build, CI/CD step, or release.
Pachyderm delivers data versioning and lineage to our data science project. It provides Git-like functionality, baked right into a version-controlled file system. This gives you the ability to control your binary assets, both big and small, and to create processing pipelines triggered by data changes.

#enterprise #partners #actions #ci/cd #devops #git-based #mlops #pachyderm #workflow

The difference between MLOps and DevOps

Pachyderm: Enabling DevOps for data

github.blog

MLOps meets DevOps - Pachyderm: Enables DevOps for data