I’ll be the first to admit, I’m basic. Avocado toast, pumpkin spice latté, Christmas-lights-in-July kind of basic. I have an unhealthy Mason-jars-to-food ratio. If Trader Joe’s and Target were heaven, I wouldn’t even be mad (especially the latter; my local one has a Starbucks).
But I’m also a master’s candidate in data science, and what the world didn’t teach me about SQL I decided to learn myself. I recently finished IBM’s _SQL for Data Science _Certificate and wanted to try getting my hands a little dirtier with it.
And then, the holy grail: avocado.csv
(1) Gathering the data
I took the easy road here and scoured Kaggle (in addition to Kaggle Competitions, they also have tons of clean datasets). Voilà, a treasure trove of avocado data by region, size, PLU code, and date.
(2) Create tables to be populated
Before we can load in data, we need to create an empty matrix to populate. Here are a few quirks with SQL regarding this:
Remembering these things from my IBM course, I made a few quick adjustments in Excel to help the data read easier (i.e. I deleted a few title rows, renamed some columns to start with letters instead of numbers, and wrote underscores in place of spaces).
Most importantly, the primary key needs to be a unique identifier, so I also added in a column that counted 1–18,249. For maximum laziness, you can just take my final version of the dataset here.
-- create avocado data table
-- save to two decimal points
DROP TABLE avocados;
CREATE TABLE avocados(
index INT NOT NULL,
date DATE NOT NULL,
avgprice DEC (10, 2) NOT NULL,
totalvolume DEC (10, 2) NOT NULL,
PLU4046 DEC (10, 2) NOT NULL,
PLU4225 DEC (10, 2) NOT NULL,
PLU4770 DEC (10, 2) NOT NULL,
Total_Bags DEC (10, 2) NOT NULL,
Small_Bags DEC (10, 2) NOT NULL,
Large_Bags DEC (10, 2) NOT NULL,
XLarge_Bags DEC (10, 2) NOT NULL,
type VARCHAR(50) NOT NULL,
year INT NOT NULL,
region VARCHAR(50) NOT NULL,
PRIMARY KEY(index)
);
#data-science #ibm #avocado #sql #basic-sql