Let me start by introducing two problems that I have dealt time and again with my experience with Apache Spark:

  1. Data “overwrite” on the same path causing data loss in case of Job Failure.
  2. Updates in the data.

Sometimes I solved above with Design changes, sometimes with the introduction of another layer like Aerospike, or sometimes by maintaining historical incremental data.

Maintaining historical data is mostly an immediate solution but I don’t really like dealing with historical incremental data if it’s not really required as(at least for me) it introduces the pain of backfill in case of failures which may be unlikely but inevitable.

The above two problems are “problems” because Apache Spark does not really support ACID. I know it was never Spark’s use case to work with transactions(hello, you can’t have everything) but sometimes, there might be a scenario(like my two problems above) where ACID compliance would have come in handy.

When I read about Delta Lake and its ACID compliance, I saw it as one of the possible solutions for my two problems. Please read on to find out how the two problems are related to ACID compliance failure and how delta lake can be seen as a savior?

What is Delta Lake?

Delta Lake Documentation introduces Delta lake as:

Delta Lake_ is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs._

Delta Lake key points:

  • Supports ACID
  • Enables Time travel
  • Enables UPSERT

#apache-spark #big-data #data #delta-lake #data-engineering #data analysis

Delta lake with Spark: What and Why?
1.35 GEEK