I’ve developed an open-source data testing and a quality tool called data-flare. It aims to help data engineers and data scientists assure the data quality of large datasets using Spark.
I’ve developed an open-source data testing and a quality tool called data-flare. It aims to help data engineers and data scientists assure the data quality of large datasets using Spark. In this post I’ll share why I wrote this tool, why the existing tools weren’t enough, and how this tool may be helpful to you.
In every data-driven organisation, we must always recognise that without confidence in the quality of our data, that data is useless. Despite that there are relatively few tools available to help us ensure our data quality stays high.
What I was looking for was a tool that:
The tools that I found were more limited, constraining me to simpler checks defined in yaml or json, or only letting me check simpler properties on a single dataset. I wrote data-flare to fill in these gaps, and provide a one-stop-shop for our data quality needs.
Let’s talk about Open Data : According to the International Open Data Charter(1), it defines open data as those digital data that are made available with the technical.
Data quality is top of mind for every data professional — and for good reason. Bad data costs companies valuable time, resources, and most of all, revenue.
Hi Folks!! In this blog, we are going to learn how we can integrate Spark Structured Streaming with Kafka and Cassandra to build a simple data pipeline.
Apache Spark is a popular open-source data processing framework. This widely-known big data platform provides several exciting features, such as graph processing, real-time processing, in-memory processing, batch processing and more quickly and easily.
This Edureka "What is Apache Spark?" video will help you to understand the Architecture of Spark in depth. It includes an example where we Understand what is Python and Apache Spark.