In this video we write a Python script to automatically generate a sales dataset. To do this we use the NumPy, Pandas, Calendar, and Datetime libraries. This is the same data we used in my last video, “Solving real world data science problems with python pandas”.

Link to finished code on GitHub:
https://github.com/KeithGalli/Pandas-Data-Science-Tasks/tree/master/Misc

0:00 - Intro & Background Info
1:15 - What we’re creating in this video!
2:03 - Start writing code (generating a simple dataframe & csv)
8:26 - Task: Making our data more realistic, selecting some products with higher probability than others
14:15 - Task: Generate 12 months' worth of data in 12 CSVs (calendar library, f-strings)
18:12 - Make some months have more purchases than others
19:28 - Normal distributions in NumPy
23:43 - Improving speed of our code (making testing easier)
26:41 - Task: Generate random addresses for our data
35:03 - Task: Generate peak times for purchases (datetime library overview)
40:02 - Using timedelta objects to add & subtract time from dates
45:09 - Generate a realistic quantity ordered for each product (using numpy geometric distribution)
49:38 - Make certain items more likely to be sold together & clean up the code a bit

Detailed video description!
We start by creating a simple dataframe and programmatically adding rows of product purchases to it. We use the random library to select these products.
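
As a rough sketch of this first step (not the exact code from the video), the snippet below builds the rows in a loop and uses random.choices to pick products with unequal probability. The product names, prices, weights, and column names are made up for illustration.

```python
import random

import pandas as pd

# Hypothetical catalog: product name -> (price, relative weight).
# A higher weight means the product is picked more often.
products = {
    "Smartphone": (600.00, 10),
    "Charging Cable": (11.95, 30),
    "AA Batteries (4-pack)": (3.84, 30),
    "27in Monitor": (149.99, 5),
}


def generate_orders(n_orders, seed=0):
    """Return a DataFrame with one randomly chosen product per order."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    names = list(products)
    weights = [products[name][1] for name in names]
    rows = []
    for order_id in range(1, n_orders + 1):
        # random.choices returns a list, so take its single element.
        product = rng.choices(names, weights=weights, k=1)[0]
        rows.append(
            {
                "Order ID": order_id,
                "Product": product,
                "Price Each": products[product][0],
            }
        )
    return pd.DataFrame(rows, columns=["Order ID", "Product", "Price Each"])


orders = generate_orders(1000)
# With no path, to_csv returns the CSV text; pass a filename to write a file.
csv_text = orders.to_csv(index=False)
```

Collecting the rows in a list and building the dataframe once at the end is usually much faster than appending to a dataframe row by row.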

We make our data more realistic by utilizing normal distributions and geometric distributions in numpy to spread out the number of purchases we make and the quantity of each item purchased.
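
Here is a minimal sketch of how those two distributions could be used. The mean, standard deviation, and p values are assumptions chosen for illustration, and the sketch uses NumPy's Generator API, which may differ from the calls shown in the video.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducible output


def orders_for_month(mean=12000, std=4000):
    """Draw a month's order count from a normal distribution.

    The mean and std defaults are illustrative assumptions.
    """
    n = int(rng.normal(loc=mean, scale=std))
    return max(n, 0)  # a normal draw can be negative, so clamp at zero


def quantity_ordered(p=0.9):
    """Draw a quantity from a geometric distribution.

    The result is always at least 1, and a high p makes 1 the most
    common value by far, with larger quantities increasingly rare.
    """
    return int(rng.geometric(p))


monthly_counts = [orders_for_month() for _ in range(12)]
quantities = [quantity_ordered() for _ in range(1000)]
```

Lowering p (for example, for cheap items like batteries) makes larger quantities more likely.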

We use the datetime library to allow us to generate thousands of different times for each purchase with the most common times peaking around 12pm and 8pm.
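
One possible way to get that two-peak shape, sketched under a few assumptions: we pick one of the two peaks at random, then shift it by a normally distributed number of minutes using a timedelta. The 50/50 split between peaks and the 180-minute spread are guesses for illustration, not values from the video.

```python
import datetime

import numpy as np

rng = np.random.default_rng(0)  # seeded generator for reproducible output


def random_purchase_time(year, month, day, spread_minutes=180):
    """Return a datetime clustered around 12pm or 8pm on the given date."""
    # Choose which peak this purchase belongs to (assumed 50/50 split).
    peak_hour = 12 if rng.random() < 0.5 else 20
    peak = datetime.datetime(year, month, day, peak_hour, 0)
    # Shift away from the peak by a normally distributed offset.
    offset = float(rng.normal(loc=0, scale=spread_minutes))
    return peak + datetime.timedelta(minutes=offset)


times = [random_purchase_time(2019, 1, 15) for _ in range(2000)]
```

Note that a large offset can push a purchase past midnight into the neighboring day. Depending on what the data is for, that can be left alone or the time can be redrawn.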

We take a list of the most common US street addresses to help us randomly generate an address for each purchase.
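
A small sketch of the idea, assuming the list holds street names: combine a random house number, a random street, and a random city into one string. The street names, cities, and zip codes below are placeholders for illustration and are not the list used in the video.

```python
import random

# Placeholder street names and (city, state, zip) entries for illustration.
street_names = ["Main", "Park", "Oak", "Pine", "Maple", "Cedar", "Elm", "Lake", "Hill"]
cities = [
    ("San Francisco", "CA", "94016"),
    ("Boston", "MA", "02215"),
    ("New York City", "NY", "10001"),
    ("Seattle", "WA", "98101"),
]


def generate_random_address(rng):
    """Build an address like '123 Main St, Boston, MA 02215'."""
    house_number = rng.randint(1, 999)
    street = rng.choice(street_names)
    city, state, zip_code = rng.choice(cities)
    return f"{house_number} {street} St, {city}, {state} {zip_code}"


rng = random.Random(1)  # seeded so the output is reproducible
addresses = [generate_random_address(rng) for _ in range(5)]
```

Keeping the city, state, and zip code together in one tuple means they always stay consistent with each other.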

Subscribe: https://www.youtube.com/channel/UCq6XkhO5SZ66N04IcPbqNcw

#python #numpy #pandas #machine-learning

Generating Mock Data with Python (NumPy, Pandas, & Datetime Libraries)