Enterprises have been spending millions of dollars getting data into data lakes using Apache Spark, with the aspiration of performing machine learning and building recommendation engines, fraud detection, IoT and predictive-maintenance systems, etc. But the fact is that the majority of these projects fail to produce reliable data.
“Delta Lake overcomes the above challenges”
Databricks open-sourced its proprietary storage layer under the name Delta Lake to bring ACID transactions to Apache Spark and big data workloads. Earlier, Delta Lake was available only in Azure/AWS Databricks, where data was stored on DBFS, which may itself sit on top of ADLS/S3. Now the Delta format can reside on HDFS, ADLS, S3, a local file system, etc. Delta Lake is also compatible with MLflow.
How Does Delta Work?
Delta Lake is based on Parquet: it adds transactional awareness to Parquet using a transaction log, maintained in an additional folder (_delta_log) under the table directory. Many vendors, such as Informatica and Talend, have embraced Delta and are working on native readers and writers.
The JSON files under the _delta_log folder hold information such as added/removed Parquet files (for atomicity), stats (for optimized performance and data skipping), partitionBy (for partition pruning), readVersion (for time travel), and commitInfo (for auditing).
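Each commit file in _delta_log (e.g. 00000000000000000000.json) is newline-delimited JSON, one action per line. A minimal sketch of inspecting such a file with only the standard library; the sample commit content below is hand-written for illustration, not captured from a real table:

```python
import json

# Two illustrative actions, one JSON object per line, mimicking a commit file.
# Note the per-file `stats` field is itself stored as a JSON *string*.
commit = """\
{"commitInfo": {"operation": "WRITE", "operationParameters": {"mode": "Overwrite"}}}
{"add": {"path": "part-00000.snappy.parquet", "size": 429, "dataChange": true, "stats": "{\\"numRecords\\": 2, \\"minValues\\": {\\"id\\": 1}, \\"maxValues\\": {\\"id\\": 2}}"}}
"""

actions = [json.loads(line) for line in commit.splitlines()]
adds = [a["add"] for a in actions if "add" in a]     # files added by this commit
stats = json.loads(adds[0]["stats"])                 # decode the nested stats string

print(stats["numRecords"])   # number of rows in this Parquet file
print(stats["minValues"], stats["maxValues"])
```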
Below is the JSON file present in the Delta transaction log after writing a sample DataFrame with 2 records. Notice that it records stats such as min and max values for each file, which lets the engine skip unnecessary data and optimize performance.
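Those min/max stats are what enable data skipping: before reading, the engine compares a query's filter against each file's recorded range and prunes files that cannot contain matching rows. A simplified sketch of that pruning decision; the file list and stats below are invented for illustration:

```python
# Each entry mimics the per-file min/max stats recorded in the transaction log.
files = [
    {"path": "part-00000.parquet", "min_id": 1,  "max_id": 50},
    {"path": "part-00001.parquet", "min_id": 51, "max_id": 100},
]

def files_for_filter(files, lo, hi):
    """Keep only files whose [min_id, max_id] range overlaps the filter [lo, hi]."""
    return [f["path"] for f in files if f["max_id"] >= lo and f["min_id"] <= hi]

# A query filtering on 60 <= id <= 70 only needs to read the second file.
print(files_for_filter(files, 60, 70))
```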
#spark #databricks #delta #delta-lake #big-data #data-analysis