Introduction

Enterprises have been spending millions of dollars getting data into data lakes using Apache Spark, with the aspiration of performing machine learning and building recommendation engines, fraud detection, IoT and predictive maintenance solutions, and so on. But the fact is that the majority of these projects fail because they cannot get reliable data.

Challenges with the traditional data lake

  • Failed production jobs leave data in a corrupted state, and recovering it is tedious: we need scripts to clean up the partial output and revert the transaction.
  • Lack of schema enforcement leads to inconsistent, low-quality data.
  • Lack of consistency: when data is read while a concurrent write is in progress, the result is inconsistent until the Parquet files are fully written. When a streaming job performs multiple writes, downstream apps reading that data see inconsistent results because there is no isolation between the writes.

“Delta Lake overcomes the above challenges”

Delta Lake

Databricks open sourced its proprietary storage layer under the name Delta Lake to bring ACID transactions to Apache Spark and big data workloads. Earlier, Delta Lake was available only in Azure/AWS Databricks, where the data was stored on DBFS, which may sit on top of ADLS/S3. Now the Delta format can live on HDFS, ADLS, S3, a local file system, etc. Delta Lake is also compatible with MLflow.
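As a rough illustration, the sketch below (assuming a local Spark session with the delta-spark pip package installed, and a hypothetical /tmp/delta/sample_table path) writes a small DataFrame in Delta format and reads it back; the path could equally be an HDFS, ADLS, or S3 URI.

```python
# Minimal sketch, assuming `pip install delta-spark pyspark` and a local path.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Sample DataFrame with 2 records (column names are illustrative).
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])

# Write in Delta format; the target could be a local, HDFS, ADLS, or S3 path.
table_path = "/tmp/delta/sample_table"  # hypothetical path
df.write.format("delta").mode("overwrite").save(table_path)

# Read it back like any other Spark data source.
spark.read.format("delta").load(table_path).show()
```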

How Does Delta Work?

Delta Lake is based on Parquet; it adds transactional awareness to Parquet using a transaction log that is maintained in an additional folder (_delta_log) under the table directory. Many vendors, such as Informatica and Talend, have embraced Delta and are working on native readers and writers.
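Continuing the sketch above (same hypothetical /tmp/delta/sample_table path), listing the table directory shows the ordinary Parquet data files alongside the _delta_log folder that carries the transaction log:

```python
# Illustrative check of the on-disk layout of the hypothetical Delta table.
import os

table_path = "/tmp/delta/sample_table"

# The table directory holds Parquet data files plus the _delta_log folder.
print(sorted(os.listdir(table_path)))
# e.g. ['_delta_log', 'part-00000-...-c000.snappy.parquet', ...]

# The transaction log itself is a set of versioned JSON commit files.
print(sorted(os.listdir(os.path.join(table_path, "_delta_log"))))
# e.g. ['00000000000000000000.json', ...]
```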

The JSON files under the _delta_log folder hold information such as add/remove actions for Parquet files (for atomicity), stats (for optimized performance and data skipping), partitionBy (for partition pruning), readVersion (for time travel), and commitInfo (for auditing).

Below is the JSON that appears in the Delta transaction log when we write a sample DataFrame with 2 records. Notice that it records statistics such as the min and max values in each file, which helps Delta effectively skip unnecessary data and improves performance.
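Since the log is plain newline-delimited JSON, a small sketch like the one below (again using the hypothetical path from the earlier examples) can print the actions recorded by the first commit; the action names noted in the comments follow the open Delta transaction log format.

```python
# Sketch: inspect the first commit file of the hypothetical Delta table.
import glob
import json

log_files = sorted(glob.glob("/tmp/delta/sample_table/_delta_log/*.json"))

# Each line is one JSON action: commitInfo (audit info), protocol, metaData,
# and add entries whose "stats" field carries numRecords plus per-column
# minValues/maxValues used for data skipping.
with open(log_files[0]) as f:
    for line in f:
        action = json.loads(line)
        print(json.dumps(action, indent=2))
```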
