I’ll be the first to admit, I’m basic. Avocado toast, pumpkin spice latté, Christmas-lights-in-July kind of basic. I have an unhealthy Mason-jars-to-food ratio. If Trader Joe’s and Target were heaven, I wouldn’t even be mad (especially the latter; my local one has a Starbucks).

But I’m also a master’s candidate in data science, and what the world didn’t teach me about SQL I decided to learn myself. I recently finished IBM’s _SQL for Data Science _Certificate and wanted to try getting my hands a little dirtier with it.

And then, the holy grail: avocado.csv

(1) Gathering the data

I took the easy road here and scoured Kaggle (in addition to Kaggle Competitions, they also have tons of clean datasets). Voilà, a treasure trove of avocado data by region, size, PLU code, and date.

(2) Create tables to be populated

Before we can load in data, we need to create an empty matrix to populate. Here are a few quirks with SQL regarding this:

  • One can create column names in mixed case or using reserved keywords (like Date), but we’ll need to reference them with apostrophes (‘Date’) in later use. It’s better to not name columns after reserved keywords.
  • Underscores, letters, and numbers are A.O.K.
  • SQL doesn’t like variable names that _start _with a number or contain _any _special characters (including spaces). Technically, one could use brackets [the date] , but it’s considered better practice to not have these things at all.
  • SQL likes column names to be in the first row ONLY
  • Common variables are int, decimal, date, char(x), and varchar(x). Int will truncate decimals, so I went with decimal.
  • The x in char() denotes the number of characters in every entry (so char(2) would be for a column with state names such as AZ, WI, TX). varchar(x) doesn’t denote character length by byte size, and one character can be more than one byte; pick a value a little larger than your longest string of characters for that row. I used varchar(50).
  • NOT NULL just specifies that the entries in a column must be filled in. An error will pop up if there isn’t anything.

Remembering these things from my IBM course, I made a few quick adjustments in Excel to help the data read easier (i.e. I deleted a few title rows, renamed some columns to start with letters instead of numbers, and wrote underscores in place of spaces).

Most importantly, the primary key needs to be a unique identifier, so I also added in a column that counted 1–18,249. For maximum laziness, you can just take my final version of the dataset here.

-- create avocado data table 
	-- save to two decimal points

	DROP TABLE avocados; 

	CREATE TABLE avocados(
	index INT NOT NULL, 
	date DATE NOT NULL,
	avgprice DEC (10, 2) NOT NULL,
	totalvolume DEC (10, 2) NOT NULL,
	PLU4046 DEC (10, 2) NOT NULL,
	PLU4225 DEC (10, 2) NOT NULL,
	PLU4770 DEC (10, 2) NOT NULL,
	Total_Bags DEC (10, 2) NOT NULL,
	Small_Bags DEC (10, 2) NOT NULL,
	Large_Bags DEC (10, 2) NOT NULL,
	XLarge_Bags DEC (10, 2) NOT NULL,
	type VARCHAR(50) NOT NULL,
	year INT NOT NULL,
	region VARCHAR(50) NOT NULL,
	PRIMARY KEY(index)
	);

#data-science #ibm #avocado #sql #basic-sql

Basic SQL: Ranking Best Cities for Avocado Lovers
1.15 GEEK