Delta lake with Spark: What and Why?

Get to know the storage layer that brings ACID transactions and updates to Spark.

Let me start by introducing two problems that I have dealt with time and again in my experience with Apache Spark:

  1. Overwriting data at the same path, which causes data loss if the job fails mid-write (see the sketch after this list).
  2. Applying updates to existing data.
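
To make the first problem concrete, here is a minimal PySpark sketch of the failure mode; the paths and column names are made up for illustration. A plain `mode("overwrite")` write first deletes whatever is already at the target path, so a crash before the new files are committed leaves the data gone or half-written.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("overwrite-demo").getOrCreate()

# Hypothetical paths, purely for illustration.
source_path = "/data/events_raw"
target_path = "/data/events"

df = spark.read.parquet(source_path).withColumn("processed", lit(True))

# Plain Spark "overwrite" deletes the existing files at target_path before
# the new files are fully written. If the job dies in between, the old
# data is already gone and the new data is incomplete.
df.write.mode("overwrite").parquet(target_path)
```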

Sometimes I solved the above with design changes, sometimes by introducing another layer such as Aerospike, and sometimes by maintaining historical incremental data.

Maintaining historical data is often the quickest fix, but I would rather not deal with historical incremental data unless it is truly required, because (at least for me) it brings the pain of backfills after failures, which may be unlikely but are inevitable.

The above two problems are “problems” because Apache Spark does not support ACID transactions. I know working with transactions was never Spark’s use case (hello, you can’t have everything), but there are scenarios, like my two problems above, where ACID compliance would come in handy.

When I read about Delta Lake and its ACID compliance, I saw it as a possible solution to both problems. Read on to find out how these problems trace back to the lack of ACID guarantees and how Delta Lake can come to the rescue.

What is Delta Lake?

The Delta Lake documentation introduces Delta Lake as:

*Delta Lake is an [open source storage layer](https://github.com/delta-io/delta) that brings reliability to [data lakes](https://databricks.com/discover/data-lakes/introduction). Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.*

Delta Lake key points, each illustrated in the sketch below:

  • Supports ACID transactions
  • Enables time travel
  • Enables UPSERT (via MERGE)
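
As a rough sketch of how these show up in code, the snippet below writes a Delta table, reads an earlier version of it, and merges updates into it. It assumes the `delta-spark` package is available (e.g. via `pip install delta-spark`); the path, table contents, and column names are invented for illustration.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes delta-spark is on the classpath; the session configs below
# register Delta's SQL extension and catalog.
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/data/events_delta"  # hypothetical location

# ACID: the write commits atomically, so readers see either the previous
# version of the table or the new one, never a half-written directory.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
events.write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# UPSERT: MERGE updates matching rows and inserts the rest.
updates = spark.createDataFrame([(2, "purchase"), (3, "view")], ["id", "action"])
(
    DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```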
