It is such a simple and common task in big data that I thought folks must have done it a thousand times, so when a customer asked me about it, I went straight to the internet to find some good examples to share. Guess what? I couldn't find one! So I decided to write one myself.
A typical data ETL flow with Presto and S3 looks like this:
In this blog, I use the New York City 2018 Yellow Taxi Trip dataset. It has 112 million rows, each with 17 columns, in CSV format, for a total size of 9.8 GB.
Here is some example data:
head -n 4 tlc_yellow_trips_2018.csv
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,05/19/2018 11:51:48 PM,05/20/2018 12:07:31 AM,1,2.01,1,N,48,158,2,11.5,0.5,0.5,0,0,0.3,12.8
1,05/19/2018 11:22:53 PM,05/19/2018 11:35:14 PM,1,1.3,1,N,142,164,2,9,0.5,0.5,0,0,0.3,10.3
1,05/19/2018 11:37:02 PM,05/19/2018 11:52:41 PM,1,2.2,1,N,164,114,1,11,0.5,0.5,3.05,0,0.3,15.35
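One thing to note: the pickup and drop-off timestamps are plain strings in month/day/year, 12-hour format rather than ISO timestamps, so they cannot be cast to TIMESTAMP directly and will need explicit parsing when the data is queried through Presto later on. A minimal sketch of that conversion with Presto's date_parse function (the literal value is just taken from one of the rows above):

SELECT date_parse('05/19/2018 11:51:48 PM', '%m/%d/%Y %h:%i:%s %p');
-- parses the string into the timestamp 2018-05-19 23:51:48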
I assume you have completed a basic Presto and S3 setup. You also need to set up the Hive catalog in Presto so that it can query data in S3. If you haven’t, please take a look at my blog Presto with Kubernetes and S3 — Deployment.
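For reference, an external table over the raw CSV might be declared roughly as follows. This is only a sketch under my own assumptions: the hive catalog name, the nyc schema, the table name, and the S3 bucket path are placeholders, and the Hive connector's CSV format reads every column as VARCHAR, so the proper types are applied in a later conversion step. Depending on your Presto version, you may also need to skip or filter out the header row.

CREATE SCHEMA IF NOT EXISTS hive.nyc
WITH (location = 's3a://my-bucket/nyc/');

CREATE TABLE hive.nyc.tlc_yellow_trips_2018_csv (
    vendorid              varchar,
    tpep_pickup_datetime  varchar,
    tpep_dropoff_datetime varchar,
    passenger_count       varchar,
    trip_distance         varchar,
    ratecodeid            varchar,
    store_and_fwd_flag    varchar,
    pulocationid          varchar,
    dolocationid          varchar,
    payment_type          varchar,
    fare_amount           varchar,
    extra                 varchar,
    mta_tax               varchar,
    tip_amount            varchar,
    tolls_amount          varchar,
    improvement_surcharge varchar,
    total_amount          varchar
)
WITH (
    format = 'CSV',                                 -- the CSV storage format only supports varchar columns
    external_location = 's3a://my-bucket/nyc/csv/'  -- the S3 folder that holds tlc_yellow_trips_2018.csv
);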