Apache Iceberg: A Different Table Design for Big Data

Like so many tech projects, Apache Iceberg grew out of frustration.

Ryan Blue experienced it while working on data formats at Cloudera.

“We kept seeing problems that were not really at the file level that people were trying to solve at the file level, or, you know, basically trying to work around limitations,” he said.

Those problems included the inability to reliably write to Hive tables, correctness issues and not being able to trust the results from its massively parallel processing database.

When he moved to Netflix, “the problems were still there, only 10 times worse,” he said.

“At Netflix, I spent a couple of years working around those problems and trying to basically patch them or deal with the underlying format. … I describe it as putting Band-Aids over these problems, very different problems here, there. We had a lot of different ones. And we finally just said, ‘You know, we know what the problem is here. It’s that we’re tracking data in our tables the wrong way. We need to fix that and go back to a design that would definitely work.’”

The outgrowth of that frustration is Iceberg, an open table format for huge analytic datasets.

It’s based on an all-or-nothing approach: An operation should complete entirely and commit at one point in time or it should fail and make no changes to the table. Anything in between leaves a lot of clean-up work.

With Hive, he explained, the idea was to keep data in directories and be able to prune out the directories you don’t need. That allows Hive tables to have fast queries on really large amounts of data.

The problem, though, is that what they were doing was trying to keep track of these directories. And that didn’t scale in the end. So they ended up adding a database of those directories. And then you would go find out what files are in those directories when you needed to query the data. That created a problem in which the state of a table is stored in two places in the database that holds the directories and in the file system itself.

“The problem with holding that state in the file system is that you can’t make fine-grained changes to it. You can make fine-grained changes to the set of directories. But you can’t make fine-grained changes to the set of files, which meant that if you wanted to commit new data to two directories at the same time, you can’t do that in a single operation that either succeeds or fails. So that’s the atomicity that we that we want from our tables,” said Blue, project management chair.

Netflix open-sourced the project in 2018 and donated it to the Apache Software Foundation. It emerged from the Incubator as a top-level project last May. Its contributors include AirBnB, Amazon Web Services, Alibaba, Expedia, Dremio and others.

The project consists of a core Java library that tracks table snapshots and metadata. It’s designed to improve on the table layout of Hive, Trino, and Spark as well integrating with new engines such as Flink.

#data #devops #open source #profile #apache iceberg: a different table design for big data #table design

thenewstack.io

Apache Iceberg: A Different Table Design for Big Data