Lawrence Lesch

Querybook: A Big Data Querying UI, Combining Collocated Table Metadata

Querybook

Querybook is a Big Data IDE that allows you to discover, create, and share data analyses, queries, and tables. 

Features

  • 📚 Organize analyses with rich text, queries, and charts
  • ✏️ Compose queries with autocompletion and hover tooltips
  • 📈 Use scheduling + charting in DataDocs to build dashboards
  • 🙌 Live query collaborations with others
  • 📝 Add additional documentation to your tables
  • 🧮 Get lineage, sample queries, frequent users, and search ranking based on past query runs

Getting started

Prerequisite

Please install Docker before trying out Querybook.

Quick setup

Pull this repo and run make. Visit https://localhost:10001 when the build completes.

For more details on installation, see the installation guide in the documentation.

Configuration

For infrastructure configuration and general configuration, see the corresponding pages in the documentation.

Supported Integrations

Query Engines

Authentication

  • User/Password
  • OAuth
    • Google Cloud OAuth
    • Okta OAuth
    • GitHub OAuth
  • LDAP

Metastore

Can be used to fetch schema and table information for metadata enrichment.

  • Hive Metastore
  • Sqlalchemy Inspect
  • AWS Glue Data Catalog

Result Storage

Use one of the following to store query results.

  • Database (MySQL, Postgres, etc)
  • S3
  • Google Cloud Storage
  • Local file

Result Export

Upload query results from Querybook to other tools for further analyses.

  • Google Sheets Export
  • Python export

Notification

Get notified upon completion of queries and DataDoc invitations via IM or email.

  • Email
  • Slack

User Interface

Query Editor

Charting

Scheduling

Lineage & Analytics

Contributing Back

See CONTRIBUTING.

Check out the full documentation & feature highlights here.

Download Details:

Author: Pinterest
Source Code: https://github.com/pinterest/querybook 
License: Apache-2.0 license

#typescript #flask #presto #hive #notebook 

Nigel Uys

Tutorial Geospatial Analytics using Presto and Hive

Introduction to Geospatial Analytics

Geospatial analytics deals with data used to locate anything on the globe; from an Uber driver to someone finding their way around a new neighbourhood, everybody uses this data in some way or another. The underlying technology involves GPS (global positioning systems), GIS (geographic information systems), and RS (remote sensing). In this blog we will explore the topic in depth, starting with the basics and then diving into the details.

Why is it important?

Geospatial data is needed for many things and is used daily for various reasons. From an ordinary person's commute to the guidance data in a defence organization's missiles, everything requires it, and it is extracted from many sources: every phone with an active internet connection contributes to geospatial data in some way, and satellites collect more of it every day. Because it is of such great use in everyday life, it deserves a significant amount of attention. It can help in responding to natural hazards and disasters, and in understanding global climate change, wildlife, natural resources, and more. It is also used for satellite imagery, whether for tactical purposes or for weather forecasting. Many tech giants such as Uber use it daily to make everyday life easier. To stand out in the market, a company has to extract this data efficiently and put it to good use.

How to retrieve Geospatial Data?

Various methods can be used, but Presto and Hive are the main tools for extracting and reshaping data that runs to hundreds of petabytes, so it can be used efficiently to make the lives of billions easier. This data is vital because it touches the vast majority of people and is used every second. GIS helps with the collection, storage, manipulation, analysis, and presentation of spatial data. Whatever the situation at the local, regional, or national level, whenever the question "where?" is asked, geospatial data comes into play, and it would not be effective without visualization.

Geospatial Analytics Using Presto

Presto is an open-source distributed SQL query engine used to answer questions over data of any size or type. It can run on Hadoop and supports many non-relational sources as well as Teradata. It queries data where it lives, without moving it to a separate system. Query execution runs in parallel over a memory-based architecture, with most results returning within seconds. Many tech giants use it, and it is a popular choice for interactive queries over data ranging into hundreds of petabytes.

Geospatial Analytics Using Hive

Hive is a data warehouse infrastructure tool for processing structured data, developed on top of the Hadoop distributed file system. It sits on top of Hadoop to summarize Big Data and makes querying and analyzing data accessible.

What is the architecture of Hive?

Hive is an ETL and data warehousing tool built on top of Hadoop. It helps perform operations such as:

  • Analysis of large data sets
  • Data encapsulation
  • Ad-hoc queries

What are its major components?

  1. Client
  2. Services
  3. Processing & Resource Management
  4. Distributed Storage

Hive Clients

Hive supports applications written in languages such as Java, Python, and C++, using Thrift, JDBC, and ODBC drivers, so it is easy to write a Hive client application in the desired language. Its clients fall into three types:

  • Thrift Clients: Apache Hive's server is based on Thrift, so it can serve requests from any language that supports Thrift
  • JDBC Clients: Java applications can connect to Hive using its JDBC driver
  • ODBC Clients: the ODBC driver lets applications that support the ODBC protocol connect to Hive; it uses Thrift to communicate with the Hive server

Hive Services

Hive provides various services:

  1. CLI (Command Line Interface) – the default shell provided by Hive, used to execute queries and commands directly.
  2. Web Interface – a web-based GUI provided by Hive for executing queries and commands.
  3. Server – built on Apache Thrift and also known as the Thrift Server; it allows different clients to submit requests and retrieve the final result.
  4. Driver – receives the queries submitted by clients, then compiles, optimizes, and executes them.

What is the architecture of Presto?

Presto has two central parts: the Coordinator and the Workers. It is an open-source distributed system that runs on multiple machines, and its distributed SQL query engine was built for fast analytic queries. A deployment includes one Coordinator and any number of Workers.

  • Coordinator – receives submitted queries and manages parsing, planning, and scheduling of query processing.
  • Worker – processes the queries; adding more workers gives faster query processing.

What are its key components?

The key components of Presto are:

Coordinator

The Coordinator is the brain of any installation; it manages all the worker nodes for everything related to queries. It gets results from the workers and returns the final output to the client. It communicates with workers and clients via REST.

Worker

Workers execute tasks and process the data. These nodes share data amongst each other and receive work from the Coordinator.

Catalogue

The catalogue contains information related to the data, such as where the data is located, where the schema is located, and which data source it comes from.

Tables and Schemas

These are similar to what they mean in a relational database: a table is a set of rows organized into named columns, and a schema is what you use to group your tables.
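As an illustration (the table and columns here are hypothetical), a Presto query addresses a table by its fully qualified catalog.schema.table name:

-- Query a hypothetical "trips" table in the "default" schema of the "hive" catalog
SELECT vendor_id, trip_distance
FROM hive.default.trips
LIMIT 10;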

Connector

A connector is used to integrate Presto with an external data source.

Stage

To execute a query, Presto breaks it up into a series of stages.

Tasks

Stages are implemented as a series of tasks that can be distributed across the Workers.

Drivers and Operators

Tasks contain one or more parallel drivers, which are sequences of operators working on data in memory. An operator consumes, transforms, and produces data.
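To see how Presto decomposes a query into the stages and tasks described above, you can ask for the distributed plan; this is a minimal sketch using a hypothetical table name:

-- Show the distributed plan, i.e. the fragments (stages) Presto will schedule on workers
EXPLAIN (TYPE DISTRIBUTED)
SELECT customer_id, count(*) AS orders
FROM hive.default.customer_purchases
GROUP BY customer_id;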

What are the deployment strategies?

The deployment strategies for Hive are listed below:

AWS

Amazon EMR can be used to deploy the Hive metastore. Users can choose from the three configurations Amazon offers: embedded, local, or remote. There are two options for creating an external Hive metastore on EMR:

  1. Use the AWS Glue Data Catalog
  2. Use Amazon RDS / Amazon Aurora

Cloud Dataproc

Apache Hive on Cloud Dataproc provides an efficient and flexible setup by storing Hive data in Cloud Storage and hosting the Hive metastore in a MySQL database on Cloud SQL. It offers flexibility and agility by letting users tailor the cluster configuration to specific workloads and scale the cluster as needed, which also helps save cost.

The deployment strategies for Presto

AWS

Amazon EMR lets you quickly spin up a managed cluster with the Presto query engine and run interactive analysis on data stored in Amazon S3. A Presto deployment can be built in the cloud on Amazon Web Services using either Amazon EMR or Amazon Athena.

Cloud Dataproc

A Cloud Dataproc cluster that includes the Presto component can be provisioned easily.

What are the various ways to optimise?

The various ways to optimise are described below:

Hive

  1. Tez Execution Engine – an application framework built on Hadoop YARN; running Hive on Tez instead of MapReduce speeds up query execution.
  2. Usage of a Suitable File Format – choosing an appropriate file format for the data drastically increases query performance; the ORC file format is best suited for this (see the sketch after this list).
  3. Partitioning – by partitioning entries into different datasets, only the required data is read during query execution, making performance more efficient and optimized.
  4. Bucketing – divides the datasets into more manageable parts (buckets); users can also set the size of these buckets.
  5. Vectorization – vectorized query execution improves performance by performing aggregations over batches of 1,024 rows at once instead of a single row at a time.
  6. Cost-Based Optimization (CBO) – performs optimization based on query cost; the CBO parameters must be set at the beginning of the query.
  7. Indexing – indexing speeds up query execution by reducing the time needed to locate data.
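For illustration, here is a hedged HiveQL sketch combining the ORC format, partitioning, and bucketing from the list above (the table and column names are made up, and the session settings should be checked against your Hive version):

-- Hypothetical sales table stored as ORC, partitioned by date and bucketed by customer
CREATE TABLE sales (
  customer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (order_date STRING)            -- only the needed date partitions are scanned
CLUSTERED BY (customer_id) INTO 32 BUCKETS    -- bucketing for more manageable parts
STORED AS ORC;                                -- ORC file format

-- Enable vectorized execution and the cost-based optimizer for the session
SET hive.vectorized.execution.enabled = true;
SET hive.cbo.enable = true;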

Presto

  1. File format – using the ORC file format is best suited for optimizing query execution.
  2. Join distribution and join order can be chosen automatically when the cost-based features are enabled (see the sketch after this list).
  3. The dynamic filtering feature optimizes JOIN queries.
  4. A connector configuration has been added to skip corrupt records in input formats other than ORC, Parquet, and RCFile.
  5. task.max-worker-threads in config.properties controls the number of worker threads per node; by default it is the number of CPU cores times the hyper-threads per core on the worker node.
  6. Splits can be tuned for efficient and optimized query execution in Presto.
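As a hedged illustration of enabling such cost-based behaviour per session (these are commonly documented PrestoDB session properties; verify the names against your version):

-- Let the cost-based optimizer pick the join distribution and join order automatically
SET SESSION join_distribution_type = 'AUTOMATIC';
SET SESSION join_reordering_strategy = 'AUTOMATIC';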

What are the advantages?

The advantages of Hive and Presto are:

Hive

  1. It is a stable query engine and has a large and active community
  2. Its queries are similar to SQL, which makes them easy for RDBMS professionals to understand
  3. It supports ORC, TextFile, RCFile, Avro and Parquet file Formats

Presto 

  1. It supports file formats like ORC, Parquet, and RCFile, eliminating the need for data transformation.
  2. It works well with Amazon S3 queries and storage; it can query petabyte-scale data in mere seconds.
  3. It also has an active community.

Geospatial Analytics Using Presto and Hive

Modelling geospatial data has many complexities. Well-Known Text (WKT) is used to model locations on the map, with geometry types such as points and polygons. In Hive, a spatial library with user-defined functions and SerDes handles spatial processing; by enabling this library, queries can be written in Hive Query Language (HQL), which is close to SQL, so you can avoid complex MapReduce algorithms and stick to a more familiar workflow. A geospatial plugin of this kind runs in production at Uber: more than 90% of all geospatial traffic at Uber completes within 5 minutes, and compared with a brute-force MapReduce execution, Uber's geospatial plugin is more than 50x faster, leading to greater efficiency.
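As a hedged sketch of what such a query can look like (the table, the columns, and the availability of ESRI-style ST_* UDFs are assumptions; register whichever spatial functions your library provides):

-- Count pickups that fall inside a hypothetical neighbourhood polygon (WKT)
SELECT count(*) AS pickups_in_area
FROM trips
WHERE ST_Contains(
        ST_GeomFromText('POLYGON ((-122.5 37.7, -122.5 37.8, -122.3 37.8, -122.3 37.7, -122.5 37.7))'),
        ST_Point(pickup_longitude, pickup_latitude)
      );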

Summing up

Presto has an edge over Hive because it can also process unstructured data, and query processing in Presto is faster than in Hive. Data is collected in humongous amounts daily, and it needs to be extracted efficiently and judiciously to build better software on top of it.

Original article source at: https://www.xenonstack.com/

#analytics #presto #hive #geospatial 

Poppy Cooke

Presto: SQL-on-Anything using Kubernetes


Used by Facebook, Netflix, Twitter, Uber, Lyft, and others, Presto has become a ubiquitous solution for running fast SQL analytics across disparate data sources. Presto is an open source distributed SQL query engine known for its low-latency queries, high concurrency, and ability to query multiple data sources. These data sources may include Ceph, S3, Google Cloud Storage, Azure Storage, HDFS, relational database systems and non-relational systems. Using Presto you can query virtually anything.

In the talk we will focus on what Presto is, its background, and its architecture. In the next part of the talk we will learn about Presto's cloud native capabilities using Red Hat OpenShift and Kubernetes. Kubernetes reduces the burden and complexity of configuring, deploying, managing, and monitoring containerized applications. To achieve these capabilities with Presto, Red Hat and Starburst partnered to provide the Presto Kubernetes Operator and the Presto Container on OpenShift.


#kubernetes #presto #devops #sql 

 


Top 5 Reasons Presto Is the Foundation of the Data Analytics Stack - DZone Big Data

The need for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data continues to grow explosively. Data platform teams are increasingly using the federated SQL query engine PrestoDB to run such analytics for a variety of use cases across a wide range of data lakes and databases in-place, without the need to move data. PrestoDB is hosted by the Linux Foundation’s Presto Foundation and is the same project running at massive scale at Facebook, Uber and Twitter.

Let’s look at some important characteristics of Presto that account for its growing adoption.

**Easier Integration With Ecosystem**

Presto was designed to integrate seamlessly with an existing data ecosystem without requiring any modification to the running systems. It's like turbocharging your existing stack with an additional, faster data-access interface.

Presto provides an additional compute layer for faster analytics. It doesn't store the data, which gives it the massive advantage of being able to scale resources for queries up and down based on demand.

This separation of compute and storage makes the Presto query engine extremely suitable for cloud environments. Most cloud deployments leverage object storage, which is already decoupled from the compute layer, and auto-scale to optimize resource costs.

**Unified SQL Interface**

SQL is by far the oldest and most widely used language for data analysis. Analysts, data engineers, and data scientists use SQL for exploring data, building dashboards, and testing hypotheses with notebooks like Jupyter and Zeppelin, or with BI tools like Tableau, Power BI, and Looker.

Presto is a federated query engine that can query data not just from distributed file systems, but also from other sources such as NoSQL stores like Cassandra and Elasticsearch, relational databases, and even message queues like Kafka.

#big data #analytics #big data analytics #presto #data-science

Lindsey Koepp

Running Presto on Kubernetes With Ahana Cloud and AWS EKS

The need for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data continues to grow exponentially. Data platform teams are increasingly using PrestoDB, a federated SQL query engine, to run such analytics across a wide range of data lakes and databases, in-place, without the need to move data.

In this post, we will explore the following:

  1. The requirements that companies have for self-service ad hoc analytics on data stored in AWS
  2. How Presto, an open source distributed SQL engine, answers many of these requirements
  3. How Ahana Cloud, a Presto-as-a-service built for AWS using Amazon EKS, ECS, and many other Amazon services, enables platform teams to provide self-service analytics for their teams effortlessly.

1. Self-Service SQL Analytics Requirements

As enterprises rely on more AWS services as well as purpose-built databases installed in the cloud, data and metadata are spread very widely. Platform teams have resorted to heavy data pipelining, moving data around across multiple platforms, and in some cases, even creating multiple copies of the data to make the data more accessible. In addition, self-service analytics requires platform engineers to integrate many business intelligence, reporting tools, data science notebooks, and data visualization tools across every data source used.

The obvious downsides are added latency (consumers need to wait longer for those data pipelines to complete and for their tools to be connected) and added costs, since duplicate data consumes additional storage and data movement burns compute cycles. All those platform engineering, management, and monitoring tasks add up. Given the complexity of these activities, platform teams are looking to simplify their approach, and we often see the following requirements from users:

  1. Query data wherever it lives. While some level of data transformation, cleansing, and wrangling will always be required, users want to eliminate pipelines that simply move data around and remove unnecessary duplicates—with the added bonus of reducing data store proliferation.
  2. An ANSI SQL engine that works with the widest possible range of tools and data sources, with pre-integrated connectors that are ready to use.
  3. Ability to query any data in any form, including relational and non-relational sources and object stores, and in any file format, like JSON, Parquet, ORC, RCFile, CSV flat-files and others or Kafka streams.
  4. Low-latency querying for ad-hoc analysis. With the increase in data-driven analysis for making every decision in the enterprise, users are looking for query results in seconds and minutes–not hours.
  5. Ability to deal with data of any size, with practically unlimited scalability.

#sql #aws #kubernetes #data analytics #aws cloud #presto #prestodb


Presto Federated Queries

Introduction

According to The Presto Foundation, Presto (aka PrestoDB), not to be confused with PrestoSQL, is an open-source, distributed, ANSI SQL compliant query engine built for running interactive, ad-hoc analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto is used in production at an immense scale at many well-known organizations, including Facebook, Twitter, Uber, Alibaba, Airbnb, Netflix, Pinterest, Atlassian, Nasdaq, and more.

In the following post, we will gain a better understanding of Presto’s ability to execute federated queries, which join multiple disparate data sources without having to move the data. Additionally, we will explore Apache Hive, the Hive Metastore, the Apache Parquet file format, and the advantages of partitioning data.

Presto on AWS

There are several options to use Presto on AWS. AWS recommends Amazon EMR and Amazon Athena as the best ways to use Presto on the AWS platform. Presto comes pre-installed on EMR, while the Athena query engine is based on Presto 0.172, but has diverged over time, evolving its own set of features. If you need full control, you could deploy and manage your own instance of Presto and associated resources on Amazon EC2, Amazon ECS, or Amazon EKS. Lastly, you might choose to purchase a Presto distribution with commercial support from an AWS Partner, such as Ahana or Starburst. If your organization needs 24x7x365 production-grade support from experienced Presto support engineers, this is an excellent choice.

Federated Queries

In a modern Enterprise, it is rare to find all data living in a single monolithic datastore. Given the multitude of available data sources, internal and external to an organization, and the growing number of purpose-built database engines, effective analytics engines must be able to join and aggregate data across many disparate sources efficiently. AWS defines a federated query as a capability that ‘enables data analysts, engineers, and data scientists to execute SQL queries across data stored in relational, non-relational, object, and custom data sources.’

Presto allows querying data where it lives, including Apache Hive, Thrift, Kafka, Kudu, Cassandra, Elasticsearch, and MongoDB. In fact, there are currently 24 different Presto data source connectors available. With Presto, we can write queries that join multiple disparate data sources, without moving the data. Below is a simple example of a Presto federated query statement, which correlates a customer’s credit rating with their age and gender. The query federates two different data sources, a PostgreSQL database table, postgresql.public.customer, and an Apache Hive Metastore table, hive.default.customer_demographics, whose underlying data resides in Amazon S3.
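The query itself did not survive extraction here; a hedged reconstruction along those lines (the join key and demographic column names are illustrative, not taken from the post) could look like this:

-- Correlate credit rating with age and gender across a PostgreSQL and a Hive table
SELECT cd.cd_credit_rating,
       cd.cd_gender,
       count(*)             AS customers,
       round(avg(c.age), 1) AS avg_age
FROM postgresql.public.customer         AS c
JOIN hive.default.customer_demographics AS cd
  ON cd.cd_demo_sk = c.c_current_cdemo_sk
GROUP BY cd.cd_credit_rating, cd.cd_gender
ORDER BY cd.cd_credit_rating, cd.cd_gender;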

#aws #prestodb #data-analytics #presto #data #data analysis


Save The Date

In this article we will address:

  1. How to generate a table with a range of dates using Presto SQL
  2. Rules of thumb for joining tables that are supposed to complete missing data from one another.

Data integrity is one of the most important things we need to address before performing calculations over our data. Even with all the right intentions, we can sometimes miss that there is an error in our data. This can become very difficult when the mistake is not in the data we have but **in the data we don’t have**.

When performing a calculation that takes into account the number of samples in our data (calculating an average or median value) we need to address rows where the value is NULL or zero.

Let’s assume we run an online shop and want to see the average amount of daily purchases a customer makes over a month. On dates when a customer doesn’t make any purchases, there is no indication of that in our data. If we ignore this issue and calculate the average amount of purchases for each customer, we will get an overestimate.

   customer_id    | order_date | purchase_amount |
10000100005411274 | 2020-04-11 |        1        |
10000100005411274 | 2020-04-16 |        1        |
10000100005411274 | 2020-04-18 |        2        |
10000100005411274 | 2020-04-21 |        2        |
10000100005411274 | 2020-04-24 |        1        |

If we calculate the customer’s daily average purchase amount without looking at our raw data, we would think his average purchase amount is 1.4 (what a customer!).

To overcome this issue, we must generate all the dates in the period and match them to all customers. Via Presto SQL we can do this with a simple query:

SELECT
     CAST(date_column AS DATE) date_column
 FROM
     (VALUES
         (SEQUENCE(date('2020-04-01'),
                   date('2020-04-30'),
                   INTERVAL '1' DAY)
         )
     ) AS t1(date_array)
 CROSS JOIN
     UNNEST(date_array) AS t2(date_column)

Using SEQUENCE, we create an array of the dates in our range and perform a cross join with UNNEST to turn it into rows. The result is a column with a row for each of the different dates.

A quick alternative could be to extract all the different dates regardless of customer from our initial data and store it as a WITH AS statement as well.

Next, we will perform another cross join to match our customers with the different dates, filling in the missing ones:

with all_dates as (
SELECT
     CAST(date_column AS DATE) date_column
 FROM
     (VALUES
         (SEQUENCE(date('2020-04-01'),
                   date('2020-04-30'),
                   INTERVAL '1' DAY)
         )
     ) AS t1(date_array)
 CROSS JOIN
     UNNEST(date_array) AS t2(date_column)
)
select distinct customer_id
               ,date_column as order_date
from customer_purchases
cross join all_dates

Last, we will join the table containing the new customer-date matches back to the initial table with our data.
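A hedged sketch of that final step, assuming the previous result is kept as another WITH clause named customer_dates (the name is illustrative); missing days are treated as zero before averaging:

-- Left join the complete customer/date grid back to the raw purchases
SELECT cd.customer_id,
       avg(coalesce(cp.purchase_amount, 0)) AS avg_daily_purchases
FROM customer_dates AS cd
LEFT JOIN customer_purchases AS cp
  ON cp.customer_id = cd.customer_id
 AND cp.order_date  = cd.order_date
GROUP BY cd.customer_id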

#presto #data #sql #data-engineering #data-science

Eldred Metz

Reducing Large S3 API Costs Using Alluxio

Previous Works

There have been numerous articles and online webinars dealing with the benefits of using Alluxio as an intermediate storage layer between the S3 data storage and the data processing system used for ingestion or retrieval of data (e.g. Spark, Presto).

To name a few use cases:

The main conclusion from these use cases is that using Alluxio as a data orchestration layer has the following benefits:

  • **Lower latency** in data processing pipelines:
    • Co-locating data and computation reduces network traffic
  • **Horizontal scalability** for usage concurrency:
    • The S3 API has limitations on the request rate for a given prefix

All these benefits are critical when deciding on a production-grade data processing architecture; however, one important benefit has so far not been sufficiently highlighted when choosing such an architecture: cost reduction on usage of the S3 API. This will be the focus of this article.

datasapiens

datasapiens is an international startup, based in Prague, that focuses primarily on helping companies to transform their business through data. We offer 3 key products:

#hadoop #apache spark #cloud storage #alluxio #benchmark #data engineering #aws s3 #apache hive #presto #data orchestration

Evil David

How to Load and Query CSV Files in S3 with Presto

It is such a simple and common task in big data that I thought folks must have done this a thousand times, so when a customer asked me this, I went straight to the internet trying to find some good examples to share with the customer. Guess what? I couldn’t find one! So I decided to write one myself.

A typical data ETL flow with Presto and S3 looks like:

  1. Upload CSV files into S3.
  2. Load the CSV files on S3 into Presto.
  3. (optional) Convert to analytics optimised format in Parquet or ORC.
  4. Run complex query against the Parquet or ORC table.

In this blog, I use the New York City 2018 Yellow Taxi Trip Dataset. The dataset has 112 million rows, with 17 columns per row, in CSV format. The total size is 9.8 GB.

Here is some example data:

head -n 3 tlc_yellow_trips_2018.csv

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,05/19/2018 11:51:48 PM,05/20/2018 12:07:31 AM,1,2.01,1,N,48,158,2,11.5,0.5,0.5,0,0,0.3,12.8
1,05/19/2018 11:22:53 PM,05/19/2018 11:35:14 PM,1,1.3,1,N,142,164,2,9,0.5,0.5,0,0,0.3,10.3
1,05/19/2018 11:37:02 PM,05/19/2018 11:52:41 PM,1,2.2,1,N,164,114,1,11,0.5,0.5,3.05,0,0.3,15.35

I assume you have completed a basic Presto and S3 setup. You also need to set up the Hive catalog in Presto for it to query data in S3. If you haven’t, please take a look at my blog Presto with Kubernetes and S3 — Deployment.
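A hedged sketch of steps 2 and 3 from the flow above (the schema name, S3 path, and column subset are illustrative; the Hive connector's CSV format reads every column as VARCHAR, so types are cast during conversion, and header-row handling is omitted for brevity):

-- Step 2: expose the CSV files already uploaded to S3 as a Hive-catalog table
CREATE TABLE hive.nyc.tlc_yellow_trips_2018_csv (
  vendorid              VARCHAR,
  tpep_pickup_datetime  VARCHAR,
  passenger_count       VARCHAR,
  trip_distance         VARCHAR,
  total_amount          VARCHAR
)
WITH (
  format = 'CSV',
  external_location = 's3a://my-bucket/tlc_yellow_trips_2018/'
);

-- Step 3: convert to an analytics-optimised Parquet table with proper types
CREATE TABLE hive.nyc.tlc_yellow_trips_2018
WITH (format = 'PARQUET') AS
SELECT CAST(passenger_count AS INTEGER) AS passenger_count,
       CAST(trip_distance   AS DOUBLE)  AS trip_distance,
       CAST(total_amount    AS DOUBLE)  AS total_amount
FROM hive.nyc.tlc_yellow_trips_2018_csv;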

#csv #sql #s3 #presto #aws
