Presto Federated Queries

Introduction

According to The Presto Foundation, Presto (also known as PrestoDB, and not to be confused with PrestoSQL) is an open-source, distributed, ANSI SQL-compliant query engine built for running interactive, ad hoc analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto is used in production at very large scale at many well-known organizations, including Facebook, Twitter, Uber, Alibaba, Airbnb, Netflix, Pinterest, Atlassian, Nasdaq, and more.

In this post, we will gain a better understanding of Presto’s ability to execute federated queries, which join multiple disparate data sources without having to move the data. Additionally, we will explore Apache Hive, the Hive Metastore, the Apache Parquet file format, and the advantages of partitioning data.

Presto on AWS

There are several ways to run Presto on AWS. AWS recommends Amazon EMR and Amazon Athena as the best ways to use Presto on the AWS platform. Presto comes pre-installed on EMR. The Athena query engine is based on Presto 0.172 but has diverged over time, evolving its own set of features. If you need full control, you could deploy and manage your own instance of Presto and its associated resources on Amazon EC2, Amazon ECS, or Amazon EKS. Lastly, you might choose to purchase a Presto distribution with commercial support from an AWS Partner, such as Ahana or Starburst. This is an excellent choice if your organization needs 24x7x365 production-grade support from experienced Presto support engineers.

Federated Queries

In a modern enterprise, it is rare to find all data living in a single monolithic datastore. Given the multitude of available data sources, both internal and external to an organization, and the growing number of purpose-built database engines, an effective analytics engine must be able to join and aggregate data across many disparate sources efficiently. AWS defines a federated query as a capability that ‘enables data analysts, engineers, and data scientists to execute SQL queries across data stored in relational, non-relational, object, and custom data sources.’

Presto allows querying data where it lives, including in Apache Hive, Thrift, Kafka, Kudu, Cassandra, Elasticsearch, and MongoDB. In fact, there are currently 24 different Presto data source connectors available. With Presto, we can write queries that join multiple disparate data sources without moving the data. Below is a simple example of a Presto federated query statement, which correlates a customer’s credit rating with their age and gender. The query federates two different data sources: a PostgreSQL database table, postgresql.public.customer, and an Apache Hive Metastore table, hive.default.customer_demographics, whose underlying data resides in Amazon S3.
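
The query statement itself is not reproduced in this text, so what follows is only a minimal sketch of what such a federated query could look like. The two fully qualified table names come from the description above. Every column name (c_current_cdemo_sk, c_birth_year, cd_demo_sk, cd_credit_rating, cd_gender) is an assumption borrowed from TPC-DS-style naming and may not match the actual schema.

```sql
-- Hypothetical sketch: table names are from the text, column names are assumed.
-- Joins a PostgreSQL table to a Hive table (data in Amazon S3) in one statement.
SELECT
    cd.cd_credit_rating,                              -- assumed column in the Hive table
    cd.cd_gender,                                     -- assumed column in the Hive table
    year(current_date) - c.c_birth_year AS age,       -- assumed birth-year column in PostgreSQL
    count(*) AS customers
FROM postgresql.public.customer AS c                  -- catalog.schema.table in PostgreSQL
JOIN hive.default.customer_demographics AS cd         -- catalog.schema.table in Hive
    ON c.c_current_cdemo_sk = cd.cd_demo_sk           -- assumed join keys
WHERE c.c_birth_year IS NOT NULL
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
```

The three-part names follow Presto’s catalog.schema.table convention. The leading postgresql and hive prefixes refer to catalogs configured on the Presto cluster, which is what lets a single statement span both sources.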
