Gerhard Brink

1624785780

Apache Iceberg: A Different Table Design for Big Data

Like so many tech projects, Apache Iceberg grew out of frustration.

Ryan Blue experienced it while working on data formats at Cloudera.

“We kept seeing problems that were not really at the file level that people were trying to solve at the file level, or, you know, basically trying to work around limitations,” he said.

Those problems included the inability to reliably write to Hive tables, correctness issues and not being able to trust the results from Cloudera’s massively parallel processing database.

When he moved to Netflix, “the problems were still there, only 10 times worse,” he said.

“At Netflix, I spent a couple of years working around those problems and trying to basically patch them or deal with the underlying format. … I describe it as putting Band-Aids over these problems, very different problems here, there. We had a lot of different ones. And we finally just said, ‘You know, we know what the problem is here. It’s that we’re tracking data in our tables the wrong way. We need to fix that and go back to a design that would definitely work.’”

The outgrowth of that frustration is Iceberg, an open table format for huge analytic datasets.

It’s based on an all-or-nothing approach: An operation should complete entirely and commit at one point in time or it should fail and make no changes to the table. Anything in between leaves a lot of clean-up work.
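The all-or-nothing idea can be illustrated with a toy model. This is a minimal sketch, not Iceberg's implementation or API: `ToyTable`, `Snapshot` and `write_batch` are hypothetical names, and the in-memory lock stands in for whatever atomic swap a real catalog would provide. The only mutable state is a pointer to the current snapshot, so a write either replaces that pointer once or leaves it alone.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the complete set of data files."""
    snapshot_id: int
    files: frozenset


class ToyTable:
    """Hypothetical table whose only mutable state is the current-snapshot pointer."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = Snapshot(0, frozenset())

    def commit(self, base, new_files):
        """Swap in a new snapshot only if `base` is still current; otherwise change nothing."""
        with self._lock:
            if self.current is not base:
                return False  # another writer committed first; the caller must retry
            self.current = Snapshot(base.snapshot_id + 1,
                                    base.files | frozenset(new_files))
            return True


def write_batch(table, paths, fail_on=None):
    """Stage every file first, then publish all of them in a single commit."""
    base = table.current
    staged = []
    for path in paths:
        if path == fail_on:
            # Simulated failure: nothing has been published yet, so the table is untouched.
            raise IOError(f"simulated write failure for {path}")
        staged.append(path)
    return table.commit(base, staged)
```

If `write_batch` raises partway through, readers still see the previous snapshot, so there is no half-written table state to clean up.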

With Hive, he explained, the idea was to keep data in directories and be able to prune out the directories you don’t need. That allows Hive tables to have fast queries on really large amounts of data.

The problem, though, was keeping track of those directories, which did not scale. So Hive ended up adding a database of the directories, and at query time you would go find out what files were in them. As a result, the state of a table is stored in two places: in the database that holds the directories and in the file system itself.

“The problem with holding that state in the file system is that you can’t make fine-grained changes to it. You can make fine-grained changes to the set of directories. But you can’t make fine-grained changes to the set of files, which meant that if you wanted to commit new data to two directories at the same time, you can’t do that in a single operation that either succeeds or fails. So that’s the atomicity that we want from our tables,” said Blue, project management chair.
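To make the contrast concrete, here is a small sketch of tracking the set of files directly instead of relying on directory listings. It is an illustration under assumptions, not Iceberg's metadata format: `DataFile`, `TableState`, `append` and `plan_scan` are hypothetical names, and the single string partition value is a simplification. Because each table version is one explicit file list, new files for two partitions arrive together in one new version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str  # hypothetical single partition value, e.g. a date


@dataclass(frozen=True)
class TableState:
    """Hypothetical table version: one explicit list of files, no directory listing."""
    version: int
    files: tuple = ()

    def append(self, new_files):
        """Return the next version; files for any number of partitions land together."""
        return TableState(self.version + 1, self.files + tuple(new_files))

    def plan_scan(self, partition):
        """Prune by filtering the tracked files rather than listing directories."""
        return [f.path for f in self.files if f.partition == partition]


v0 = TableState(0)
v1 = v0.append([
    DataFile("data/a.parquet", "2021-01-01"),
    DataFile("data/b.parquet", "2021-01-02"),
])
```

A reader holding `v0` sees neither new file and a reader holding `v1` sees both; no version exists in which only one of the two partitions has been updated.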

Netflix open-sourced the project in 2018 and donated it to the Apache Software Foundation. It emerged from the Incubator as a top-level project in May 2020. Its contributors include Airbnb, Amazon Web Services, Alibaba, Expedia, Dremio and others.

The project consists of a core Java library that tracks table snapshots and metadata. It’s designed to improve on the table layout of Hive, Trino and Spark, as well as to integrate with new engines such as Flink.
