It is such a simple and common task in big data that I thought folks must have done it a thousand times, so when a customer asked me about it, I went straight to the internet to find some good examples to share. Guess what? I couldn’t find one! So I decided to write one myself.

A typical data ETL flow with Presto and S3 looks like this (a rough SQL sketch follows the list):

  1. Upload CSV files into S3.
  2. Load the CSV files on S3 into Presto.
  3. (Optional) Convert the data to an analytics-optimised format such as Parquet or ORC.
  4. Run complex queries against the Parquet or ORC table.
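
To make the flow concrete, here is a minimal sketch of steps 2–4 in Presto SQL (step 1 is just a plain S3 upload, for example with the AWS CLI's aws s3 cp). The catalog name hive, the schema nyc_taxi, the table names, and the bucket path s3://my-bucket/... are hypothetical placeholders for your own setup; the full DDL for the taxi CSV table is shown after the data sample below.

-- A schema in the Hive catalog whose data lives on S3 (bucket path is a placeholder)
CREATE SCHEMA hive.nyc_taxi
WITH (location = 's3a://my-bucket/warehouse/');

-- Step 2: expose the uploaded CSV files as an external table named trips_csv
--         (the full DDL for the taxi data is shown after the data sample below)

-- Step 3: convert the CSV table to Parquet with CREATE TABLE ... AS SELECT
--         (SELECT * keeps the varchar columns from the CSV; a typed version
--          that casts columns and drops the header row appears later)
CREATE TABLE hive.nyc_taxi.trips_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM hive.nyc_taxi.trips_csv;

-- Step 4: run analytic queries against the Parquet table
SELECT passenger_count, count(*) AS trips
FROM hive.nyc_taxi.trips_parquet
GROUP BY passenger_count
ORDER BY trips DESC;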

In this blog, I use the New York City 2018 Yellow Taxi Trip dataset. It has 112 million rows, each with 17 columns, in CSV format, and totals 9.8 GB.

Here is some example data:

head -n 3 tlc_yellow_trips_2018.csv

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,05/19/2018 11:51:48 PM,05/20/2018 12:07:31 AM,1,2.01,1,N,48,158,2,11.5,0.5,0.5,0,0,0.3,12.8
1,05/19/2018 11:22:53 PM,05/19/2018 11:35:14 PM,1,1.3,1,N,142,164,2,9,0.5,0.5,0,0,0.3,10.3
1,05/19/2018 11:37:02 PM,05/19/2018 11:52:41 PM,1,2.2,1,N,164,114,1,11,0.5,0.5,3.05,0,0.3,15.35
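
Given that header, the external CSV table and a typed version of the Parquet conversion could look like the sketch below. A few assumptions to note: the files are assumed to sit under a hypothetical path s3://my-bucket/nyc-taxi/csv/, Presto's Hive CSV format reads every column as VARCHAR (so the real types are applied with casts during the conversion), and the header line is filtered out in the conversion query rather than relying on a header-skipping table property, since support for that property varies across Presto versions.

-- External table over the raw CSV files; the CSV format requires all columns to be VARCHAR
CREATE TABLE hive.nyc_taxi.trips_csv (
    vendorid               varchar,
    tpep_pickup_datetime   varchar,
    tpep_dropoff_datetime  varchar,
    passenger_count        varchar,
    trip_distance          varchar,
    ratecodeid             varchar,
    store_and_fwd_flag     varchar,
    pulocationid           varchar,
    dolocationid           varchar,
    payment_type           varchar,
    fare_amount            varchar,
    extra                  varchar,
    mta_tax                varchar,
    tip_amount             varchar,
    tolls_amount           varchar,
    improvement_surcharge  varchar,
    total_amount           varchar
)
WITH (
    format = 'CSV',
    external_location = 's3a://my-bucket/nyc-taxi/csv/'   -- hypothetical bucket and prefix
);

-- Typed Parquet conversion (only a handful of the 17 columns are cast here to keep it short)
CREATE TABLE hive.nyc_taxi.trips_parquet
WITH (format = 'PARQUET')
AS
SELECT
    CAST(vendorid AS integer)                         AS vendorid,
    date_parse(tpep_pickup_datetime, '%m/%d/%Y %r')   AS tpep_pickup_datetime,
    date_parse(tpep_dropoff_datetime, '%m/%d/%Y %r')  AS tpep_dropoff_datetime,
    CAST(passenger_count AS integer)                  AS passenger_count,
    CAST(trip_distance AS double)                     AS trip_distance,
    CAST(fare_amount AS double)                       AS fare_amount,
    CAST(tip_amount AS double)                        AS tip_amount,
    CAST(total_amount AS double)                      AS total_amount
FROM hive.nyc_taxi.trips_csv
WHERE vendorid != 'VendorID';   -- the header line comes back as a data row, so drop it

date_parse uses MySQL-style format strings, and %r matches the 12-hour hh:mm:ss AM/PM timestamps in this dataset.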

I assume you have already completed a basic Presto and S3 setup, and have configured the Hive catalog in Presto so that it can query data in S3. If you haven’t, please take a look at my blog Presto with Kubernetes and S3 — Deployment.
