Sparkify is a popular digital music service, similar to Spotify or Pandora, created by Udacity. The goal of the project is to predict which users are at risk of churning, that is, cancelling their service. If we can identify these users before they leave, we can offer them discounts or incentives.
A user can perform different page actions, for example playing a song, liking a song, disliking a song, logging in, logging out, adding a friend or, in the worst case, cancelling the service. What we know about the user is the id, the first and last name, the gender, whether they are logged in or not, whether they are a paid user or not, their location, the timestamp of their registration, and which browser they are using. What we know about the songs they are listening to is the name of the artist and the length of the song.
Image by author on Sparkify data
Part I: Example of a user story
Let’s take a single user to better understand the interactions a user can perform. Let’s investigate three actions of user id 2, Natalee Charles from Raleigh, NC.
#logistic-regression #udacity #spark #data-science #nanodegree
An Absurd Challenge
Today I will show you how to obtain churn predictions before your coffee is ready. Start the coffee machine or fill the French press now, so that by the time the churn predictions arrive you can go through them with the hot coffee you brewed in the meantime.
A Friendly Introduction :)
Let me introduce myself. I am M Ahmed Tayib, working as a data scientist at Gauss Statistical Solutions. I am your friendly neighborhood data scientist who loves coffee and loves an irrelevant challenge like making coffee vs. conducting churn prediction.
First, a definition
Churn is a term/label given to customers who discontinue the services/subscription a company provides. For instance, if a user has not renewed their Spotify subscription for 4 months, Spotify may consider that user churned.
#machine-learning #predictive-analytics #ai #customer-churn #churn
Predictive modeling in data science is used to answer the question “What is going to happen in the future, based on known past behaviors?” Modeling is an essential part of data science, and it is mainly divided into predictive and preventive modeling. Predictive modeling, also known as predictive analytics, is the process of using data and statistical algorithms to predict outcomes with data models. Anything from sports outcomes and television ratings to technological advances and corporate economies can be predicted using these models.
#big data #data science #predictive analytics #predictive analysis #predictive modeling #predictive models
Churn prediction, namely predicting which clients are likely to cancel the service, is one of the most common business applications of machine learning. It is especially important for companies providing streaming services. In this project, an event data set from a fictional music streaming company named Sparkify was analyzed. A tiny subset (128MB) of the full data set (12GB) was first analyzed locally in a Jupyter Notebook with a scalable Spark script, and the whole data set was then analyzed on an AWS EMR cluster. Find the code here.
Let’s first have a look at the data. There are 286500 rows and 18 columns in the mini data set (the big data set has 26259199 rows). The columns and first five rows are shown below.
First five rows of the dataframe
Let’s check for missing values in the data set. The table below reveals a pattern: the “artist”, “length”, and “song” columns all have the same number of missing values, and so do the “firstName”, “gender”, “lastName”, “location”, “registration”, and “userAgent” columns.
Missing values in the dataframe
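The missing-value check described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the Spark code used in the project; the three sample rows and their values are made up, and only the column names come from the text.

```python
from collections import Counter

# Hypothetical event rows; only the column names are taken from the article.
events = [
    {"userId": "2", "page": "NextSong", "artist": "Artist A",
     "song": "Song A", "length": 215.3, "firstName": "Natalee"},
    {"userId": "2", "page": "Home", "artist": None,
     "song": None, "length": None, "firstName": "Natalee"},
    {"userId": "", "page": "Home", "artist": None,
     "song": None, "length": None, "firstName": None},
]

def count_missing(rows):
    """Return {column: number of rows in which the value is None}."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # Adding 0 still registers the column, so complete columns report 0.
            missing[column] += 1 if value is None else 0
    return dict(missing)

missing_counts = count_missing(events)
for column, n in missing_counts.items():
    print(f"{column}: {n}")
```

Columns that share the same count, as “artist”, “song”, and “length” do here, are a hint that their gaps come from the same rows.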
If we look closer at the rows whose “firstName” is missing, we find that their “userId” values are actually empty strings (in the big data set, this was the user with the ID 1261737). There are exactly 8346 such records (778479 rows in the big data set), which I decided to treat as missing values and delete. These are probably visitors who browsed the Sparkify website without registering.
Empty strings in UserId
Number of missing UserId
After deleting the “problematic” userId, there were 255 unique users left (22277 in the big data set).
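As a rough sketch of this cleaning step, the snippet below drops rows whose “userId” is an empty string and counts the remaining unique users. It uses plain Python on a handful of invented rows, so the numbers are illustrative and not those of the Sparkify data; `drop_anonymous` is a hypothetical helper name.

```python
# Hypothetical rows; an empty userId stands for an unregistered visitor.
events = [
    {"userId": "2", "page": "NextSong"},
    {"userId": "", "page": "Home"},
    {"userId": "3", "page": "Login"},
    {"userId": "", "page": "Login"},
]

def drop_anonymous(rows):
    """Keep only rows that carry a non-empty userId."""
    return [row for row in rows if row["userId"] != ""]

registered = drop_anonymous(events)
removed = len(events) - len(registered)
unique_users = {row["userId"] for row in registered}

print(f"removed {removed} rows, {len(unique_users)} unique users left")
```

Comparing the row count before and after the filter is a cheap sanity check that only the empty-string records were dropped.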
Let’s dig further into the remaining missing values. The data is event data, which means every operation of every user is recorded. I hypothesized that the missing values in the “artist” column might be associated with certain actions (pages visited) of the users, so I checked the pages associated with a missing “artist” and compared them with the pages in the complete data. The result: a missing “artist” occurs on every page except “next song”, which means the “artist” (singer of the song) information is recorded only when a user hits “next song”.
Categories in “page” column
If I delete the rows with a “null” artist, there are no missing values left in the data set, and the number of unique users in the clean data set is still 255.
After dealing with missing values, I converted the epoch timestamps into dates and simplified two categorical columns: I extracted only the state from the “location” column and the platform used by the user (stored as “agent”) from the “userAgent” column.
The data cleaning step is now complete, so let’s start exploring the data. As the final purpose is to predict churn, we first need to label the churned users (downgrades were labeled with the same method). I used the “Cancellation Confirmation” events to define churn: users who visited the “Cancellation Confirmation” page were marked as “1”, and those who did not were marked as “0”. Similarly, users who visited the “Downgrade” page at least once were marked as “1”, and those who did not were marked as “0”. The data set, with 278154 rows and the columns shown below, is now ready for some exploratory analysis. Let’s do some comparisons between churned and stayed users.
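The comparison between pages with and without artist information can be sketched with two sets. The rows are invented, and the page labels (“NextSong”, “Thumbs Up”, and so on) are assumptions about how the values are stored in the “page” column.

```python
# Hypothetical rows; page labels are assumed spellings of the "page" values.
events = [
    {"page": "NextSong", "artist": "Artist A"},
    {"page": "Home", "artist": None},
    {"page": "Thumbs Up", "artist": None},
    {"page": "NextSong", "artist": "Artist B"},
    {"page": "Logout", "artist": None},
]

# Every page that occurs in the data.
pages_all = {row["page"] for row in events}
# Pages that occur on rows where the artist is missing.
pages_without_artist = {row["page"] for row in events if row["artist"] is None}
# Pages that never occur with a missing artist.
pages_only_with_artist = pages_all - pages_without_artist

# Keeping only rows with an artist removes the remaining gaps.
songs_only = [row for row in events if row["artist"] is not None]

print(sorted(pages_only_with_artist))
```

If the hypothesis holds, the set difference contains a single page, the one for playing the next song.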
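A minimal sketch of the two simpler transformations, written in plain Python. It assumes the timestamp is stored as milliseconds since the Unix epoch and that the location has the form “City, ST”, as in “Raleigh, NC”; both function names are hypothetical.

```python
from datetime import datetime, timezone

def ts_to_date(ts_ms):
    """Convert an epoch timestamp in milliseconds (assumed unit) to a UTC date string."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")

def extract_state(location):
    """Return the text after the last comma, e.g. 'Raleigh, NC' gives 'NC'."""
    return location.rsplit(",", 1)[-1].strip()

print(ts_to_date(1538352000000))     # 2018-10-01
print(extract_state("Raleigh, NC"))  # NC
```

Splitting on the last comma keeps city names that themselves contain commas intact, although locations spanning several states would need extra handling.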
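The labeling rule above can be sketched as a per-user flag that is set to 1 as soon as the user has at least one event on the given page. The rows below are invented and `flag_users` is a hypothetical helper, written in plain Python rather than Spark.

```python
# Hypothetical event rows for three users.
events = [
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Downgrade"},
    {"userId": "3", "page": "NextSong"},
    {"userId": "3", "page": "Cancellation Confirmation"},
    {"userId": "4", "page": "Home"},
]

def flag_users(rows, page):
    """Return {userId: 1 if the user ever visited `page`, else 0}."""
    flags = {}
    for row in rows:
        hit = 1 if row["page"] == page else 0
        # max() keeps the flag at 1 once a matching event has been seen.
        flags[row["userId"]] = max(flags.get(row["userId"], 0), hit)
    return flags

churn = flag_users(events, "Cancellation Confirmation")
downgrade = flag_users(events, "Downgrade")

# Attach both labels to every event row of the corresponding user.
labelled = [
    dict(row, churn=churn[row["userId"]], downgrade=downgrade[row["userId"]])
    for row in events
]
print(churn, downgrade)
```

Because the label is defined per user and then copied to all of that user's events, rows recorded before the cancellation also carry the churn flag, which is what makes comparisons between churned and stayed users possible.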
#aws #data-science #churn-prediction #spark #pyspark #data analytic
Customer relationship management is one of the determining factors of business health. One of the most important aspects of this relationship is the ability to identify when a customer is likely to cancel a service. For that reason, it is necessary to take initiatives that maximize customer retention.
Therefore, projects that identify customers prone to churn have become a frequent concern for organizations, as the cost of retention is usually lower than the cost of acquisition.
Although it has gained the attention of many companies, there is no magic formula to solve the churn problem. In addition, the solution can have numerous complexities, like identifying the churn reason to apply different retention strategies.
It is essential to observe financial and strategic expenses in order to acquire and retain customers, since for some companies the cost of acquisition may be 5x higher than the cost of retention.
It is important to highlight that the churn increase for a product or service occurs in many ways, such as:
#machine-learning #churn #data-science #churn-prediction #deep learning
Companies usually have a greater focus on customer acquisition and keep retention as a secondary priority. However, it can cost five times more to attract a new customer than it does to retain an existing one. Increasing customer retention rates by 5% can increase profits by 25% to 95%, according to research done by Bain & Company.
Churn is a metric that shows **customers who stop doing business** with a company or a particular service, also known as customer attrition. By following this metric, what most businesses could do was try to understand the reason behind churn numbers and tackle those factors, with reactive action plans.
But what if you could know in advance that a specific customer is likely to leave your business, and have a chance to take proper actions in time to prevent it from happening?
The reasons that lead customers to the cancellation decision can be numerous, coming from poor service quality, delay on customer support, prices, new competitors entering the market, and so on. Usually, there is no single reason, but a combination of events that somehow culminated in customer displeasure.
If your company was not able to identify these signals and take action before the cancel button was clicked, there is no turning back: your customer is already gone. But you still have something valuable: the data. Your customer left very good clues about where you fell short. This data can be a valuable source of meaningful insights and can be used to train customer churn models. Learning from the past and having strategic information at hand to improve future experiences is what machine learning is all about.
#telecommunication #machine-learning #churn-prediction #data-science #churn #deep learning