Running a Data Replication Pipeline on Kubernetes with Argo and Singer.io

Hundreds of data teams have migrated to the ELT pattern in recent years, leveraging SaaS tools like Stitch or FiveTran to reliably load data into their infrastructure. In my experience, these SaaS offerings are outstanding and can accelerate your pipelining significantly. However, a lot of folks don’t have the budget, have custom applications they need to accommodate, or just love the pain of rolling their own tools.

Our solution is to deploy singer.io taps and targets — Python scripts that can perform data replication between arbitrary sources and destinations. The Singer specification is the foundation for the popular Stitch SaaS, and it is also leveraged by a number of independent consultants and data projects.
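
To make the spec concrete, here is a rough sketch of what a tap looks like under the hood: a script that writes newline-delimited JSON messages to stdout. The stream name, schema, and rows below are made up for illustration; a real tap would pull them from the source system.

```python
# A minimal, hypothetical Singer tap: one SCHEMA message describing a stream,
# one RECORD message per row, and a final STATE message for bookmarking.
# The stream, schema, and rows are illustrative, not from any real source.
import json

STREAM = "orders"

SCHEMA = {
    "type": "SCHEMA",
    "stream": STREAM,
    "key_properties": ["id"],
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "amount": {"type": "number"},
            "updated_at": {"type": "string", "format": "date-time"},
        },
    },
}

ROWS = [
    {"id": 1, "amount": 12.50, "updated_at": "2020-01-01T00:00:00Z"},
    {"id": 2, "amount": 7.25, "updated_at": "2020-01-02T00:00:00Z"},
]


def main():
    # Singer messages are newline-delimited JSON written to stdout.
    print(json.dumps(SCHEMA))
    for row in ROWS:
        print(json.dumps({"type": "RECORD", "stream": STREAM, "record": row}))
    # STATE lets the next run resume from the last replicated row.
    print(json.dumps({"type": "STATE", "value": {STREAM: ROWS[-1]["updated_at"]}}))


if __name__ == "__main__":
    main()
```

Any target that speaks the Singer spec can consume those messages from stdin, which is what makes taps and targets interchangeable.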

Singer pipelines are highly modular. You can pipe any tap to any target to build a data pipeline that fits your needs. Although this makes them a perfect fit for Dockerized pipelines, I struggled to find examples of Singer pipelines deployed via Docker or Kubernetes. Eventually, we put together a pipeline leveraging Argo Workflows and containerized Singer taps and targets.
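
Because each tap writes Singer messages to stdout and each target reads them from stdin, wiring one to the other is essentially a Unix pipe. As a rough sketch, here is the kind of Python wrapper a container entrypoint could use to run that pipe; the tap and target executables and config paths are placeholders, not the exact setup described later in this article.

```python
# A minimal sketch of piping a Singer tap into a target from Python, roughly
# what a containerized pipeline step could run. The executable names and
# config paths below are examples only.
import subprocess


def run_pipeline(tap_cmd, target_cmd):
    """Pipe the tap's stdout (Singer messages) into the target's stdin."""
    tap = subprocess.Popen(tap_cmd, stdout=subprocess.PIPE)
    target = subprocess.Popen(target_cmd, stdin=tap.stdout)
    tap.stdout.close()  # let the tap receive SIGPIPE if the target exits early
    target.communicate()
    if tap.wait() != 0 or target.returncode != 0:
        raise RuntimeError("Singer pipeline failed")


if __name__ == "__main__":
    run_pipeline(
        ["tap-exchangeratesapi", "--config", "tap_config.json"],
        ["target-csv", "--config", "target_config.json"],
    )
```

The rest of this article describes how Argo Workflows takes over that orchestration role, running the containerized taps and targets and handing the output of one to the other.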

Container orchestration without Argo and Kubernetes. Image by CHUTTERSNAP from Unsplash.

This article walks through the workflow at a high level and provides example code to get up and running with shared templates. I assume some familiarity with Docker, Kubernetes, and the Singer specification. Even if you’re new to these technologies, though, I will try to point out helpful resources to get you headed in the right direction.

Why Roll Our Own?

ETL is not the reason that anyone gets into data science or engineering. There is little creativity, lots of maintenance, and no recognition until something goes wrong. Fortunately, SaaS tools like Stitch and FiveTran have pretty much turned data replication into a commodity that small teams can leverage.

The “solved” nature of data replication makes it easier for data scientists to own projects end-to-end, freeing data engineers to think about the “platform” rather than point solutions. (StitchFix has a terrific post on this.) The players in this market recognize that it’s really the stuff “around” the integration scripts that is the differentiator: the Meltano project out of GitLab, for example, has found a niche as a “runner” for integration processes rather than in the data replication logic itself.
