Why I Built an Open-Source Tool for Big Data Testing

I’ve developed an open-source data testing and quality tool called data-flare. It aims to help data engineers and data scientists assure the quality of large datasets using Spark. In this post I’ll share why I wrote this tool, why the existing tools weren’t enough, and how it may be helpful to you.

Who spends their evenings writing a data quality tool?

In any data-driven organisation, data is useless without confidence in its quality. Despite that, there are relatively few tools available to help us keep our data quality high.

What I was looking for was a tool that:

  • Helped me write high-performance checks on the key properties of my data, like the size of my datasets, the percentage of rows that comply with a condition, or the distinct values in my columns
  • Helped me track those key properties over time, so that I can see how my datasets are evolving, and spot problem areas easily
  • Enabled me to write more complex checks on facets of my data that don't fit neatly into a single property, and to compare different datasets with each other
  • Would scale to huge volumes of data

The tools that I found were more limited, constraining me to simpler checks defined in YAML or JSON, or only letting me check simpler properties on a single dataset. I wrote data-flare to fill these gaps and provide a one-stop shop for our data quality needs.
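To make the kinds of properties in the list above concrete, here is a minimal sketch in plain Python. It is not data-flare's API and it does not use Spark: the names `Metrics`, `compute_metrics` and `check`, the thresholds, and the sample `orders` rows are all hypothetical, chosen only to illustrate dataset size, the share of rows meeting a condition, and distinct column values.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Metrics:
    """Hypothetical container for the key properties of one dataset."""
    size: int                        # number of rows
    compliance: float                # fraction of rows that satisfy the condition
    distinct_values: FrozenSet[str]  # distinct values seen in one column


def compute_metrics(rows: Iterable[dict],
                    condition: Callable[[dict], bool],
                    column: str) -> Metrics:
    rows = list(rows)
    size = len(rows)
    passing = sum(1 for row in rows if condition(row))
    # Assumption: an empty dataset is treated as fully compliant.
    compliance = passing / size if size else 1.0
    return Metrics(size, compliance, frozenset(row[column] for row in rows))


def check(metrics: Metrics, min_size: int, min_compliance: float,
          allowed_values: FrozenSet[str]) -> List[str]:
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    if metrics.size < min_size:
        failures.append(f"size {metrics.size} is below {min_size}")
    if metrics.compliance < min_compliance:
        failures.append(
            f"compliance {metrics.compliance:.2f} is below {min_compliance:.2f}")
    unexpected = metrics.distinct_values - allowed_values
    if unexpected:
        failures.append(f"unexpected values: {sorted(unexpected)}")
    return failures


# Made-up sample data: one negative amount and one unexpected status.
orders = [
    {"id": 1, "amount": 10.0, "status": "paid"},
    {"id": 2, "amount": -5.0, "status": "paid"},
    {"id": 3, "amount": 7.5, "status": "refunded"},
    {"id": 4, "amount": 3.0, "status": "unknown"},
]

metrics = compute_metrics(orders, lambda row: row["amount"] >= 0, "status")
failures = check(metrics, min_size=1, min_compliance=0.9,
                 allowed_values=frozenset({"paid", "refunded"}))
for failure in failures:
    print(failure)
```

Keeping the computed metrics separate from the pass/fail decision is a deliberate choice in this sketch: the same `Metrics` record could be stored after every run, which is one simple way to follow a dataset's properties over time.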
