George Koelpin

The State of Open-Source Data Integration and ETL

Open-source data integration is not new. It started 16 years ago with Talend. But since then, the whole industry has changed. The likes of Snowflake, BigQuery, and Redshift have changed how data is hosted, managed, and accessed, while making it easier and a lot cheaper. But the data integration industry has evolved as well.

On one hand, new open-source projects have emerged, such as Singer.io in 2017. Singer made more data integration connectors accessible to more teams, even though using them still required a significant amount of manual work.
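
To make this concrete, here is a minimal sketch of what a Singer-style tap looks like. A tap writes SCHEMA, RECORD, and STATE messages as JSON lines to standard output, and a separate target program consumes them; the `users` stream and its fields below are hypothetical.

```python
import json
import sys

def emit(message):
    # Singer taps communicate by writing one JSON message per line to stdout.
    sys.stdout.write(json.dumps(message) + "\n")

# Describe the stream once, so the downstream target knows what to expect.
emit({
    "type": "SCHEMA",
    "stream": "users",  # hypothetical stream name
    "schema": {"properties": {"id": {"type": "integer"}, "email": {"type": "string"}}},
    "key_properties": ["id"],
})

# Emit the rows themselves. A real tap would pull these from an API or database.
for row in [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]:
    emit({"type": "RECORD", "stream": "users", "record": row})

# Persist progress so the next run can resume incrementally.
emit({"type": "STATE", "value": {"users": {"last_id": 2}}})
```

The manual work mentioned above comes from everything around this protocol: wiring a tap to a target (e.g., `tap-users | target-postgres`), managing state files, and scheduling runs are all left to the user.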

On the other hand, data integration was made accessible to more teams (analysts, scientists, business intelligence teams). Indeed, companies like Fivetran benefited from Snowflake’s rise, empowering non-engineering teams to set up and manage their data integration connectors by themselves, so they can work with the data autonomously.

But even with this progress, a large majority of teams still build their own connectors in-house; the build-vs.-buy decision still leans strongly toward build. That’s why we think it’s time to take a fresh look at the landscape of open-source data integration technologies.

The idea for this article came from an awesome debate on DBT’s Slack last week. The discussion centered on two things:

  • The state of open-source alternatives to Fivetran, and
  • Whether an open-source (OSS) approach is more relevant than a commercial software approach in addressing the data integration problem.

Even Fivetran’s CEO was involved in the debate.

We already synthesized the second point in a previous article. In this article, we want to analyze the first point: the landscape of open-source data integration technologies.

#open source #data science

Uriah Dietrich

What Is ETLT? Merging the Best of ETL and ELT Into a Single ETLT Data Integration Strategy

Data integration solutions typically advocate that one approach – either ETL or ELT – is better than the other. In reality, both ETL (extract, transform, load) and ELT (extract, load, transform) serve indispensable roles in the data integration space:

  • ETL is valuable when it comes to data quality, data security, and data compliance. It can also save money on data warehousing costs. However, ETL is slow when ingesting unstructured data, and it can lack flexibility.
  • ELT is fast when ingesting large amounts of raw, unstructured data. It also brings flexibility to your data integration and data analytics strategies. However, ELT sacrifices data quality, security, and compliance in many cases.

Because ETL and ELT present different strengths and weaknesses, many organizations use a hybrid “ETLT” approach to get the best of both worlds. In this guide, we’ll help you understand the “why, what, and how” of ETLT, so you can determine if it’s right for your use case.
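
As an illustration, here is a minimal, hypothetical sketch of the ETLT pattern in Python. The pre-load transform handles only what must happen before data lands in the warehouse (here, hashing email addresses for compliance), while the heavier modeling runs in the warehouse as SQL after loading; `load_rows` and `run_in_warehouse` are stand-ins for whatever client your warehouse provides.

```python
import hashlib

def extract():
    # Stand-in for pulling raw rows from a source API or database.
    return [{"user_id": 1, "email": "a@example.com", "amount": 42.0}]

def light_transform(rows):
    # First "T": only what must happen before load, e.g. masking PII
    # so raw email addresses never land in the warehouse.
    for row in rows:
        row["email"] = hashlib.sha256(row["email"].encode()).hexdigest()
    return rows

def load_rows(table, rows):
    # Stand-in for a warehouse bulk-load call (COPY, insert_rows, etc.).
    print(f"loading {len(rows)} rows into {table}")

def run_in_warehouse(sql):
    # Second "T": the heavy modeling runs inside the warehouse, ELT-style.
    print(f"executing in warehouse: {sql}")

rows = light_transform(extract())
load_rows("raw.payments", rows)
run_in_warehouse(
    "CREATE TABLE analytics.revenue_by_user AS "
    "SELECT user_id, SUM(amount) AS revenue FROM raw.payments GROUP BY user_id"
)
```

The design point is the split: compliance-critical transforms run pre-load (the ETL strength), while everything else is deferred to the warehouse (the ELT strength).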

#data science #data #data security #data integration #etl #data warehouse #data breach #elt #big data

Virgil Hagenes

Data Quality Testing Skills Needed For Data Integration Projects

The impulse to cut project costs is often strong, especially in the final delivery phase of data integration and data migration projects. At this late phase of the project, a common mistake is to delegate testing responsibilities to resources with limited business and data testing skills.

Data integrations are at the core of data warehousing, data migration, data synchronization, and data consolidation projects.

In the past, most data integration projects involved data stored in databases. Today, it’s essential for organizations to also integrate their database or structured data with data from documents, e-mails, log files, websites, social media, audio, and video files.
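
For instance, integrating a log file usually means first coercing its lines into rows that can join the structured data. A minimal sketch, with a hypothetical log format:

```python
import re

# Turn semi-structured log lines into rows that can be loaded alongside
# database-style data. The log format here is hypothetical.
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) user=(?P<user_id>\d+) action=(?P<action>\w+)"
)

def parse_log_line(line):
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None  # None for unparseable lines

rows = [parse_log_line(line)
        for line in ["2020-10-14 09:30:00 INFO user=42 action=login"]]
print(rows)
# [{'ts': '2020-10-14 09:30:00', 'level': 'INFO', 'user_id': '42', 'action': 'login'}]
```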

Using data warehousing as an example, Figure 1 illustrates the primary checkpoints (testing points) in an end-to-end data quality testing process. Shown are points at which data (as it’s extracted, transformed, aggregated, consolidated, etc.) should be verified – that is, extracting source data, transforming source data for loads into target databases, aggregating data for loads into data marts, and more.
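
A typical verification at each checkpoint is a source-to-target reconciliation, comparing row counts and column aggregates before and after a load. Here is a minimal sketch, assuming two DB-API connections and probe queries that each return one `(row_count, amount_total)` row; the table names in the usage comment are hypothetical:

```python
def reconcile(source_conn, target_conn, source_query, target_query):
    """Compare a checkpoint's input and output with count and sum probes."""
    src = source_conn.cursor()
    tgt = target_conn.cursor()
    src.execute(source_query)
    tgt.execute(target_query)
    src_count, src_total = src.fetchone()
    tgt_count, tgt_total = tgt.fetchone()

    errors = []
    if src_count != tgt_count:
        errors.append(f"row count mismatch: {src_count} vs {tgt_count}")
    if src_total != tgt_total:
        errors.append(f"amount total mismatch: {src_total} vs {tgt_total}")
    return errors

# Example probe for the extract -> staging checkpoint (table names hypothetical):
# errors = reconcile(
#     src_db, dwh_db,
#     "SELECT COUNT(*), SUM(amount) FROM orders",
#     "SELECT COUNT(*), SUM(amount) FROM staging.orders",
# )
```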

Only after data owners and all other stakeholders confirm that data integration was successful can the whole process be considered complete and ready for production.

#big data #data integration #data governance #data validation #data accuracy #data warehouse testing #etl testing #data integrations

Wiley Mayer

Solving Data Integration: The Pros and Cons of Open Source and Commercial Software

There was an awesome debate on DBT’s Slack last week discussing mainly two things:

  1. The state of open-source alternatives to Fivetran
  2. Whether an open-source (OSS) approach is more relevant than a commercial software approach in addressing the data integration problem.

If you’re already on DBT’s Slack, here is the thread’s URL. Even Fivetran’s CEO was involved in the debate.

In this article, we want to discuss the second point and go over the different points mentioned by each party. The first point will come in another article. It’s more relevant to discuss whether an OSS approach makes sense before drilling down into the different alternatives.

We’ll go over the main challenges that companies face and see which approach fits best. We’ll call “commercial companies” the ones with a commercial software product, and “OSS companies” the ones with an open-source approach.

1. Having a large number of high-quality, well-maintained, pre-built connectors

This might be the most challenging part for the open-source approach, but there are actually design choices that can make an OSS approach even stronger than a commercial one on this front.

Commercial approach

In this case, a company supports a limited number of connectors (the most used ones) and actively maintains them. The team knows when a schema changes and when a connector needs updating, and can react quickly, provided the organization scales well.

However, the more connectors there are, the more difficult it is for a commercial company to keep the same level of maintenance across all of them. In an ideal world, the organization would grow linearly with the number of connectors. In practice, inefficient processes mean every organization reaches a limit; the more efficient the team, the higher that limit.
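
Much of that maintenance burden is schema drift: every upstream change has to be detected before the connector can be fixed. Here is a minimal, hypothetical sketch of the kind of automated check that helps a team, commercial or open-source, notice drift early; it assumes the source’s current schema can be fetched as a field-to-type mapping:

```python
import json

def detect_schema_drift(connector_name, current_schema, snapshot_path):
    """Flag fields added, removed, or retyped since the last recorded schema.

    `current_schema` is a {field_name: type_name} mapping fetched from the
    source; `snapshot_path` is where the previous run stored its snapshot.
    """
    try:
        with open(snapshot_path) as f:
            known = json.load(f)
    except FileNotFoundError:
        known = {}  # first run: nothing to compare against yet

    added = set(current_schema) - set(known)
    removed = set(known) - set(current_schema)
    retyped = {field for field in set(known) & set(current_schema)
               if known[field] != current_schema[field]}

    if added or removed or retyped:
        print(f"[{connector_name}] schema drift: added={sorted(added)} "
              f"removed={sorted(removed)} retyped={sorted(retyped)}")

    # Record the new snapshot so the next run compares against it.
    with open(snapshot_path, "w") as f:
        json.dump(current_schema, f)
```

Run on a schedule per connector, a check like this turns silent breakage into an actionable alert, which is exactly where maintenance effort scales with connector count.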

#open-source #data-integration #data #big-data #database #business-models #software-development #solving-data-integration