Apache Iceberg - A High-Performance Format for Huge Analytic Tables

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

Background and documentation are available at https://iceberg.apache.org

Status

Iceberg is under active development at the Apache Software Foundation.

The core Java library that tracks table snapshots and metadata is complete, but still evolving. Current work is focused on adding row-level deletes and upserts, and integration work with new engines like Flink and Hive.

The Iceberg format specification is being actively updated and is open for comment. Until the specification is complete and released, it carries no compatibility guarantees. The spec is currently evolving as the Java reference implementation changes.

Java API javadocs are available for the master branch.

Collaboration

Iceberg tracks issues in GitHub and prefers to receive contributions as pull requests.

Community discussions happen primarily on the dev mailing list or on specific issues.

Building

Iceberg is built using Gradle with Java 1.8 or Java 11.

  • To invoke a build and run tests: ./gradlew build
  • To skip tests: ./gradlew build -x test -x integrationTest

Iceberg table support is organized in library modules:

  • iceberg-common contains utility classes used in other modules
  • iceberg-api contains the public Iceberg API
  • iceberg-core contains implementations of the Iceberg API and support for Avro data files; this is what processing engines should depend on (a short usage sketch follows this list)
  • iceberg-parquet is an optional module for working with tables backed by Parquet files
  • iceberg-arrow is an optional module for reading Parquet into Arrow memory
  • iceberg-orc is an optional module for working with tables backed by ORC files
  • iceberg-hive-metastore is an implementation of Iceberg tables backed by the Hive metastore Thrift client
  • iceberg-data is an optional module for working with tables directly from JVM applications
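
For example, a JVM application that depends on iceberg-core can create and load tables directly through the core API. The sketch below is a minimal illustration, not an excerpt from the project docs; the table location /tmp/iceberg/logs is a hypothetical placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;
    import org.apache.iceberg.types.Types;

    public class CreateTableExample {
      public static void main(String[] args) {
        // Define the table schema: a required id and an optional data column
        Schema schema = new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional(2, "data", Types.StringType.get()));

        // Partition the table by a 16-way bucket on id
        PartitionSpec spec = PartitionSpec.builderFor(schema)
            .bucket("id", 16)
            .build();

        // HadoopTables keeps table metadata next to the data files and
        // uses a file system path as the table identifier
        HadoopTables tables = new HadoopTables(new Configuration());
        Table table = tables.create(schema, spec, "/tmp/iceberg/logs");

        System.out.println("Created table at " + table.location());
      }
    }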

Iceberg also has modules for adding Iceberg support to processing engines:

  • iceberg-spark2 is an implementation of Spark's DataSource V2 API in Spark 2.4 for Iceberg (use iceberg-spark-runtime for a shaded version)
  • iceberg-spark3 is an implementation of Spark's DataSource V2 API in Spark 3.0 for Iceberg (use iceberg-spark3-runtime for a shaded version; a Spark usage sketch follows this list)
  • iceberg-flink contains classes for integrating with Apache Flink (use iceberg-flink-runtime for a shaded version)
  • iceberg-mr contains an InputFormat and other classes for integrating with Apache Hive
  • iceberg-pig is an implementation of Pig's LoadFunc API for Iceberg
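
To give a feel for the Spark integration, the sketch below reads an Iceberg table by path with iceberg-spark3-runtime on the job's classpath. This is a hedged example rather than official usage documentation; the table location is a placeholder and catalog configuration is omitted:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadIcebergTable {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-read-example")
            .getOrCreate();

        // Load an Iceberg table by path through the "iceberg" data source
        Dataset<Row> df = spark.read()
            .format("iceberg")
            .load("/tmp/iceberg/logs");  // hypothetical table location

        df.show();
        spark.stop();
      }
    }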

Engine Compatibility

See the Multi-Engine Support page for details on Iceberg compatibility with different Spark, Flink, and Hive versions. For other engines such as Presto or Trino, please visit their websites for Iceberg integration details.

Download details:
Author: apache
Source code: https://github.com/apache/iceberg
License: Apache-2.0 license

#java

iOS App Dev

The Growing High Demand for Data Analytics Skills Globally

Data analytics skills are in hot demand in today's work environment.

There’s no doubt that organizations are going digital and have enormous growth potential. And what is driving this growth? Without a doubt, it’s data, now considered the new oil. In recent years, big data has been helping organizations make more informed decisions and introduce products and services that are widely embraced. If you don’t know how serious organizations really are about big data analytics, take a look at these figures: the worldwide big data analytics market, valued at USD 37.34 billion in 2018, is expected to grow at a CAGR of 12.3% from 2019 to 2027, reaching USD 105.08 billion by 2027.

The case for big data includes the promise of a more stable business climate built on data-driven, safer decisions. There’s essentially no kind of organization that isn’t touched by data analytics. It has already become the foundation of certain industries, such as finance.

Technology professionals with analytics expertise are in high demand as companies search for ways to harness the power of big data. The number of analytics-related job postings on Indeed and Dice has expanded considerably in the last year. This surge reflects the ever-growing number of companies adopting analytics and subsequently searching for analytics experts.

The demand for analytics skills is rising consistently, yet there is an immense shortage on the supply side. This is happening globally and isn’t limited to any one part of the world. Despite big data analytics being a ‘hot’ job, countless positions across the globe remain unfilled because of a shortage of the required expertise.

India currently has the highest concentration of analytics professionals globally. Despite this, the shortage of data analytics professionals is especially acute, and demand is expected to stay high as more global companies outsource their work.

As stated by TROY, data can also help a company manage risk. While Dr. Bohler suggests that predictive analytics can go a long way toward reducing the risk exposure associated with running a global business, for example, he also cautions that it doesn’t eliminate that risk entirely. Dr. Bohler is an assistant professor at TROY University with expertise in data analytics.

“If we just had wonderful data, we could never settle on terrible decisions,” says Dr. Bohler. “Actually, we are never going to have 100% of all the relevant information. Most leaders are likely upbeat enough to settle on a decision with about 80% of the data they need — not exactly that, and it turns into somewhat of a gamble.”

When you have data in hand, making business decisions becomes easier because they are backed by facts. With the data gathered, you can see what works well and what you need to improve or drop entirely.

Jackson Crist

Measuring Crop Health Using Deep Learning – Notes From Tiger Analytics

Agrochemical companies manufacture a range of offerings for yield maximisation, pest resistance, hardiness, water quality and availability and other challenges facing farmers. These companies need to measure the efficacy of their products in real-world conditions, not just controlled experimental environments. Single-crop farms are divided into plots and a specific intervention performed in each. For example, hybrid seeds are sown in one plot while another is treated with fertilisers, and so on. The relative performance of each treatment is assessed by tracking the plants’ health in the plot where that treatment was administered.

The Analytics That Matter

I’ve long been skeptical of quoting global browser usage percentages to justify the use of browser features. It doesn’t matter what global usage of a browser is, other than as nerdy cocktail party fodder. The usage that matters is what users on your site are using, and that can be wildly different from site to site.

That idea of tracking real usage of your actual site has bounced around my head the last few days. And it’s not just “I can’t use CSS grid because IE 11 still has 1.42% of global usage” stuff; it’s about measuring metrics that matter to your site, no matter what they are.

Performance metrics are a big one. When you’re doing performance testing, much of it is what you would call synthetic testing. An automated browser loads your site and tracks what it finds as it loads, like the timing of a thing, the size of assets, the number of assets, etc. Synthetic information like this enters my mind when spending tons of time on performance. “I bet I can get rid of this one extra request,” I think. “I bet I can optimize this asset a little further.” And the performance tools we use report this kind of data to us readily. How big is our JavaScript bundle? What is our “Largest Contentful Paint”? What is our Lighthouse performance score? All things that are related to performance, but aren’t measuring actual users’ experience.

Let that sit for a second.

There are other analytics we can gather on a site, like usage analytics. For example, we might slap Google Analytics on a site, doing nothing but installing the generic snippet. This is going to tell us stuff like what pages are the most popular, how long people spend on the site, and what countries deliver the most traffic. Those are real user analytics, but it’s very generic analytic information.

If you’re hoping for more useful analytics data on your site, you have to think about it a little harder up front. What do you want to know? Maybe you want to know how often people use Feature X. Or you want to know how many files they have uploaded this week. Or how many messages they have sent. Or how many times they have clicked the star button. This is stuff that tells you how your site is doing. Generic analytics tracking won’t do that; you’ll have to write a little JavaScript to capture and report on those things. It takes a little effort to get the analytics you really care about.

Now apply that to performance tooling.

Rather than generic synthetic tests, why not measure things that are actually important to your specific site? One aspect to this is RUM, that is, “Real User Monitoring.” So rather than a single synthetic test being the source of all performance testing on your site, you’re tracking real users actually using the site on their actual devices. That makes a lot of sense to me, but aside from the logic of it, it unlocks some important data.

For example, one of Google’s Core Web Vitals, which are soon to affect the SEO of our pages, is a metric called First Input Delay (FID), and you have to collect data via JavaScript on your page to use it.

Another Core Web Vital is “Largest Contentful Paint,” which is a fascinating attempt at a more meaningful performance metric. Imagine a metric like “start render” or the first page paint. Is that interesting? Sorta. At least it is signaling to the user that something is happening (probably). Yet that first render might not be actually useful content, like the headline and body copy of a news article. So this metric makes a guess at what that useful content probably is and measures that. Very clever.

#article #analytics #google analytics #performance #data analytics

Gerhard Brink

Apache Iceberg: A Different Table Design for Big Data

Like so many tech projects, Apache Iceberg grew out of frustration.

Ryan Blue experienced it while working on data formats at Cloudera.

“We kept seeing problems that were not really at the file level that people were trying to solve at the file level, or, you know, basically trying to work around limitations,” he said.

Those problems included the inability to reliably write to Hive tables, correctness issues, and not being able to trust the results from its massively parallel processing database.

When he moved to Netflix, “the problems were still there, only 10 times worse,” he said.

“At Netflix, I spent a couple of years working around those problems and trying to basically patch them or deal with the underlying format. … I describe it as putting Band-Aids over these problems, very different problems here, there. We had a lot of different ones. And we finally just said, ‘You know, we know what the problem is here. It’s that we’re tracking data in our tables the wrong way. We need to fix that and go back to a design that would definitely work.’”

The outgrowth of that frustration is Iceberg, an open table format for huge analytic datasets.

It’s based on an all-or-nothing approach: An operation should complete entirely and commit at one point in time or it should fail and make no changes to the table. Anything in between leaves a lot of clean-up work.

With Hive, he explained, the idea was to keep data in directories and be able to prune out the directories you don’t need. That allows Hive tables to have fast queries on really large amounts of data.

The problem, though, was that they were trying to keep track of these directories, and that didn’t scale in the end. So they ended up adding a database of those directories, and then you would go find out what files were in those directories when you needed to query the data. That created a problem in which the state of a table is stored in two places: in the database that holds the directories and in the file system itself.

“The problem with holding that state in the file system is that you can’t make fine-grained changes to it. You can make fine-grained changes to the set of directories. But you can’t make fine-grained changes to the set of files, which meant that if you wanted to commit new data to two directories at the same time, you can’t do that in a single operation that either succeeds or fails. So that’s the atomicity that we want from our tables,” said Blue, the project management committee chair.
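
To make that concrete, here is a rough sketch of such an all-or-nothing commit using Iceberg’s core Java API. The data file is assumed to have been written to storage already, and its path, size, and record count are placeholder values:

    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.Table;

    public class AtomicAppendExample {
      static void appendData(Table table) {
        // Describe a Parquet file that already exists in storage;
        // all values here are hypothetical placeholders
        DataFile file = DataFiles.builder(table.spec())
            .withPath("/tmp/iceberg/logs/data/00000-0.parquet")
            .withFormat(FileFormat.PARQUET)
            .withFileSizeInBytes(1024L)
            .withRecordCount(100L)
            .build();

        // The append is staged and then committed as one atomic swap of
        // table metadata: it either fully succeeds or leaves the table
        // unchanged, with nothing in between to clean up
        table.newAppend()
            .appendFile(file)
            .commit();
      }
    }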

Netflix open-sourced the project in 2018 and donated it to the Apache Software Foundation. It emerged from the Incubator as a top-level project last May. Its contributors include Airbnb, Amazon Web Services, Alibaba, Expedia, Dremio and others.

The project consists of a core Java library that tracks table snapshots and metadata. It’s designed to improve on the table layouts of Hive, Trino, and Spark, as well as to integrate with new engines such as Flink.
