Instacart Users Segmentation and Market Basket Analysis

Understanding the shopping behaviors of Instacart customers and making efficient recommendations

Covid-19 became a worldwide pandemic in 2020. In response, New Yorkers followed the quarantine policy and kept a social distance from each other. The most popular mode of transportation in NYC is the subway, yet the subway is also one of the riskiest ways to travel during the outbreak, since riding it may increase passengers’ chances of catching Covid-19. Going out to get daily necessities therefore became a headache for New Yorkers, and keeping a social distance inside grocery stores is not practical. During the outbreak, New York City issued a “staying at home” order, which increased the demand for online grocery shopping. Instacart, a grocery delivery platform, experienced rapid growth during the Covid-19 crisis. Its users can stay home, which helps flatten the curve and reduces their own risk of getting the virus.

The primary research goals are to segment users based on time intervals and to build a recommendation system based on users’ product choices. We expect the results to help optimize inventory allocation on the supply side and to increase the probability that customers get essential goods without breaking social distancing rules.

First, let’s explore the data!

The primary data source is Instacart’s 2017 anonymized dataset of customer orders over time (Stanley, 2017). It contains an order file, a product file, an order-product file, an aisles file, and a department file. Each entity in the dataset has an associated unique id.

The order dataset contains the user id, the order id, the day of the week the order was placed (order_dow), the hour of the day the order was placed (order_hour_of_the_day), the days since the last purchase (day_since_prior), and an indicator of which set the order belongs to (eval_set). If an order is a user’s first purchase, the days since the last purchase will be NaN. The department dataset contains a unique department id and the associated department name. The aisles dataset has the aisle id and the aisle name. The product dataset contains the product id, the product name, the aisle id, and the department id.
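To illustrate how these files link together through their ids, here is a minimal sketch in plain Python. The rows are made-up examples rather than records from the Instacart data, and the field names (`product_id`, `product_name`, `aisle_id`, `department_id`) as well as the helper `describe_product` are assumptions based on the description above.

```python
# Toy rows standing in for the aisles, department, and product files.
# All values are invented for illustration only.
aisles = {1: "fresh fruits", 2: "milk"}
departments = {10: "produce", 20: "dairy eggs"}
products = [
    {"product_id": 100, "product_name": "Banana", "aisle_id": 1, "department_id": 10},
    {"product_id": 101, "product_name": "Whole Milk", "aisle_id": 2, "department_id": 20},
]


def describe_product(product):
    """Resolve a product's aisle id and department id to their names."""
    return {
        "product_name": product["product_name"],
        "aisle": aisles[product["aisle_id"]],
        "department": departments[product["department_id"]],
    }


described = [describe_product(p) for p in products]
for row in described:
    print(row)
```

Each product row carries only the ids, so the aisle and department names come from a lookup against the other two files.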

To build the time intervals of user orders, we first divided user orders by day. The data used here is order.csv, column ‘dow’. Figure 1 shows that the most popular days for user orders are days 0 and 1. After reviewing the data documentation, we did not find a definition of days 0 to 6. We believe the two busiest days, 0 and 1, are Sunday and Monday.

Figure 1. User orders by day of the week.
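The day-of-week count described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the order rows are invented, the column is assumed to be named `order_dow`, and `orders_per_day` is a hypothetical helper, not the code used to produce the figure.

```python
from collections import Counter

# Toy order rows; the order_dow values are invented for illustration.
orders = [
    {"order_id": 1, "order_dow": 0},
    {"order_id": 2, "order_dow": 0},
    {"order_id": 3, "order_dow": 0},
    {"order_id": 4, "order_dow": 1},
    {"order_id": 5, "order_dow": 1},
    {"order_id": 6, "order_dow": 3},
    {"order_id": 7, "order_dow": 5},
]


def orders_per_day(rows):
    """Count orders for each day code 0..6, keeping days with no orders as 0."""
    counts = Counter(row["order_dow"] for row in rows)
    return {day: counts.get(day, 0) for day in range(7)}


counts = orders_per_day(orders)

# The two day codes with the most orders, busiest first.
busiest = sorted(counts, key=counts.get, reverse=True)[:2]

print(counts)
print(busiest)
```

The day codes stay as integers here because the dataset does not define which weekday each code stands for.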

