Hundreds of data teams have migrated to the ELT pattern in recent years, leveraging SaaS tools like Stitch or FiveTran to reliably load data into their infrastructure. In my experience, these SaaS offerings are outstanding and can accelerate your pipelining significantly. However, a lot of folks don't have the budget, or have custom applications they need to accommodate, or just love the pain of rolling their own tools.
Our solution is to deploy singer.io taps and targets: Python scripts that can perform data replication between arbitrary sources and destinations. The Singer specification is the foundation for the popular Stitch SaaS, and it is also leveraged by a number of independent consultants and data projects.
Singer pipelines are highly modular: you can pipe any tap to any target to build a data pipeline that fits your needs. Although this modularity makes them a perfect fit for Dockerized pipelines, I struggled to find examples of Singer pipelines deployed via Kubernetes or Docker. Eventually, we put together a pipeline leveraging Argo Workflows and containerized Singer taps and targets.
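The other half of the pipe is the target, which reads Singer messages from stdin and loads RECORDs into the destination. A rough sketch, with the loading step reduced to collecting records in memory (the `run_target` helper is invented for this illustration; with real packages you compose the two halves with a shell pipe such as `tap-exchangeratesapi | target-csv`):

```python
import json
import sys

def run_target(lines):
    # A target consumes newline-delimited Singer messages; here we just
    # collect RECORDs instead of loading them into a real destination.
    records = []
    state = None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.append(message["record"])
        elif message["type"] == "STATE":
            # Real targets echo STATE back so the runner can checkpoint
            # where replication left off.
            state = message["value"]
    return records, state

if __name__ == "__main__":
    print(run_target(sys.stdin))
```

This stdin/stdout contract is what makes the pipelines modular: the tap knows nothing about the destination, the target knows nothing about the source, and a container orchestrator only has to wire one process's output to the other's input.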
This article walks through the workflow at a high level and provides some example code to get up and running with some shared templates. I assume some familiarity with Docker, Kubernetes, and the Singer specification. Even if you're new to these technologies, though, I will try to call out helpful resources to get you pointed in the right direction.
ETL is not the reason that anyone gets into data science or engineering. There is little creativity, lots of maintenance, and no recognition until something goes wrong. Fortunately, SaaS tools like Stitch and FiveTran have pretty much turned data replication into a commodity that small teams can leverage.
The “solved” nature of data replication makes it easier for data scientists to own projects end-to-end, freeing data engineers to think about the “platform” rather than point solutions. (StitchFix has a terrific post on this.) The players in this market recognize that it’s really the stuff “around” the integration scripts that is the differentiator: the Meltano project out of GitLab, for example, has found a niche in being a “runner” for integration processes, rather than in the logic of the data replication itself.