Eldred Metz


How to predict churns in Sparkify

Sparkify is a digital music service, similar to Spotify or Pandora, created by Udacity. The goal of the project is to predict which users are at risk of churning, that is, cancelling their service. If we can identify these users before they leave, we can offer them discounts or incentives.

A user can perform different page actions, for example playing a song, liking or disliking a song, logging in, logging out, adding a friend or, in the worst case, cancelling the service. What we know about each user is the ID, the first and last name, the gender, whether they are logged in, whether they are a paid user, their location, the timestamp of their registration, and which browser they are using. What we know about the songs they listen to is the name of the artist and the length of the song.

Image by author on Sparkify data

Part I: Example of a user story

Let’s take a single user to better understand the interactions a user can have. Let’s investigate three actions of user ID 2, Natalee Charles from Raleigh, NC.

#logistic-regression #udacity #spark #data-science #nanodegree

Amara Legros


Churn Prediction in 5 minutes

An Absurd Challenge

Today I will show you how to obtain churn predictions before your coffee is ready. Put some coffee in the machine or French press, so that by the time the churn predictions arrive you can go through them with the hot coffee you brewed in the meantime.

A Friendly Introduction :)

Let me introduce myself. I am M Ahmed Tayib, working as a data scientist at Gauss Statistical Solutions. I am your friendly neighborhood data scientist who loves coffee and loves an irrelevant challenge like making coffee vs. conducting churn prediction.

Firstly, A definition

Churn is a term or label given to customers who discontinue the services or subscription a company provides. For instance, if a user has not renewed their Spotify subscription for 4 months, Spotify may consider that user churned.
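The definition above can be sketched as a small rule. This is only an illustration: the function names, the 4-month default, and the choice to count whole calendar months are assumptions for the sketch, not how any real service defines churn.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end (never negative)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # If the day of month has not been reached yet, the last month is incomplete.
    if end.day < start.day:
        months -= 1
    return max(months, 0)


def is_churned(last_renewal: date, today: date, threshold_months: int = 4) -> bool:
    """Hypothetical rule: no renewal for `threshold_months` or more means churned."""
    return months_between(last_renewal, today) >= threshold_months


# A user who last renewed on 15 January, checked on two different days.
renewed = date(2023, 1, 15)
print(is_churned(renewed, date(2023, 5, 14)))  # one day short of four months
print(is_churned(renewed, date(2023, 5, 15)))  # exactly four months
```

The threshold is a business decision, so it is kept as a parameter rather than hard-coded.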

#machine-learning #predictive-analytics #ai #customer-churn #churn

Ian Robinson


Predictive Modeling in Data Science

Predictive modeling is an integral tool used in the data science world — learn the five primary predictive models and how to use them properly.

Predictive modeling in data science is used to answer the question “What is going to happen in the future, based on known past behaviors?” Modeling is an essential part of data science, and it is mainly divided into predictive and preventive modeling. Predictive modeling, also known as predictive analytics, is the process of using data and statistical algorithms to predict outcomes with data models. Anything from sports outcomes and television ratings to technological advances and corporate economies can be predicted using these models.

Top 5 Predictive Models

  1. Classification Model: It is the simplest of all predictive analytics models. It puts data in categories based on its historical data. Classification models are best to answer “yes or no” types of questions.
  2. Clustering Model: This model groups data points into separate groups, based on similar behavior.
  3. Forecast Model: One of the most widely used predictive analytics models. It deals with metric value prediction, and this model can be applied wherever historical numerical data is available.
  4. Outliers Model: This model, as the name suggests, is oriented around exceptional data entries within a dataset. It can identify exceptional figures either by themselves or in concurrence with other numbers and categories.
  5. Time Series Model: This predictive model consists of a series of data points captured using time as the input parameter. It uses the data from previous years to develop a numerical metric and predicts the next three to six weeks of data using that metric.

#big data #data science #predictive analytics #predictive analysis #predictive modeling #predictive models

Churn Prediction on Sparkify Using Spark

My Capstone Project of Udacity Data Scientist Nanodegree

Churn prediction, namely predicting which clients are likely to cancel the service, is one of the most common business applications of machine learning. It is especially important for companies providing streaming services. In this project, an event data set from a fictional music streaming company named Sparkify was analyzed. A tiny subset (128MB) of the full data set (12GB) was first analyzed locally in a Jupyter Notebook with a scalable Spark script, and the whole data set was then analyzed on an AWS EMR cluster. Find the code here.

Data preparation

Let’s first have a look at the data. There were 286500 rows and 18 columns in the mini data set (the big data set had 26259199 rows). The columns and the first five rows are shown below.

First five rows of the dataframe

Let’s check for missing values in the data set. The table below shows a pattern: the “artist”, “length”, and “song” columns all have the same number of missing values, and so do the “firstName”, “gender”, “lastName”, “location”, “registration”, and “userAgent” columns.

Missing values in the dataframe
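A per-column missing-value count like the one described above can be sketched in plain Python. This is a stand-in for the Spark code used in the project, which is not shown here; the function name `count_missing` and the three sample events are made up for illustration.

```python
def count_missing(rows, columns):
    """Count, for each column, how many rows hold None in that column."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            if row.get(col) is None:
                counts[col] += 1
    return counts


# Tiny made-up sample: one song event, one page visit by a registered user,
# and one page visit by a visitor without registration details.
sample_events = [
    {"userId": "2", "page": "NextSong", "artist": "Some Artist",
     "song": "Some Song", "length": 200.0, "firstName": "Natalee"},
    {"userId": "2", "page": "Home", "artist": None,
     "song": None, "length": None, "firstName": "Natalee"},
    {"userId": "", "page": "Home", "artist": None,
     "song": None, "length": None, "firstName": None},
]

missing = count_missing(sample_events, ["artist", "song", "length", "firstName", "userId"])
print(missing)
```

In this sample, “artist”, “song”, and “length” are missing together, which mirrors the grouping seen in the real table. Note that an empty-string userId is not counted as missing by this check.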

If we look closer at the “userId” values whose “firstName” was missing, we find that those “userId” values were actually empty strings (in the big data set it was the user with the ID 1261737). There were exactly 8346 such records (778479 rows in the big data set), which I decided to treat as missing values and deleted. These might come from someone who only visited the Sparkify website without registering.

Empty strings in UserId

Number of missing UserId
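Dropping the events with an empty userId could look roughly like this in plain Python. It is a sketch rather than the project's Spark code; `drop_anonymous` is a hypothetical helper and the sample rows are invented.

```python
def drop_anonymous(rows):
    """Keep only events whose userId is neither None nor an empty string."""
    return [row for row in rows if row.get("userId") not in (None, "")]


# Made-up events: two from a registered user, one with an empty userId.
raw_events = [
    {"userId": "2", "page": "NextSong"},
    {"userId": "", "page": "Home"},
    {"userId": "2", "page": "Logout"},
]

clean_events = drop_anonymous(raw_events)
unique_users = {row["userId"] for row in clean_events}
print(len(raw_events) - len(clean_events), "rows dropped;", len(unique_users), "unique user(s) left")
```

Treating the empty string the same as a missing value matters here, because a plain null check would leave these rows in place.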

After deleting the “problematic” userId, there were 255 unique users left (this number was 22277 for the big data set).

Let’s dig further into the remaining missing values. The data is event data, which means every operation of every single user was recorded. I hypothesized that the missing values in the “artist” column might be associated with certain actions (pages visited) of the users. I therefore checked which “pages” occur together with a missing “artist” and compared them with the “pages” in the complete data. The result: a missing artist occurs with all pages except “next song”, which means the “artist” (singer of the song) is recorded only when a user hits “next song”.

Categories in “page” column
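The comparison of pages with and without an artist can be sketched as follows. This is plain Python standing in for the Spark query; the helper name is hypothetical, and the page labels in the sample (including the spelling "NextSong") are illustrative assumptions.

```python
def pages_by_artist_presence(rows):
    """Return (pages seen with an artist, pages seen without an artist)."""
    with_artist, without_artist = set(), set()
    for row in rows:
        if row.get("artist") is None:
            without_artist.add(row["page"])
        else:
            with_artist.add(row["page"])
    return with_artist, without_artist


# Made-up events for illustration.
events = [
    {"page": "NextSong", "artist": "Some Artist"},
    {"page": "Home", "artist": None},
    {"page": "Thumbs Up", "artist": None},
    {"page": "NextSong", "artist": "Another Artist"},
]

with_artist, without_artist = pages_by_artist_presence(events)
print("pages with an artist:", sorted(with_artist))
print("pages without an artist:", sorted(without_artist))
```

If the two sets do not overlap, as in this sample, removing the rows with a missing artist is the same as keeping only the song-play events.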

If I delete the rows with a “null” artist, there are no missing values left in the data set, and the number of unique users in the clean data set is still 255.

After dealing with missing values, I transformed the timestamp into an epoch date and simplified two categorical columns: I extracted only the state from the “location” column and the platform used by the user (marked as “agent”) from the “userAgent” column.
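A minimal sketch of these three transformations in plain Python follows. Several things are assumptions here and not confirmed by the text: that timestamps are stored as milliseconds since the Unix epoch, that locations look like "Raleigh, NC", and that the platform is the first word inside the first parentheses of the user-agent string. The function names are hypothetical.

```python
import re
from datetime import datetime, timezone


def to_utc_date(ts_ms: int) -> str:
    """Convert an epoch timestamp in milliseconds (assumed unit) to a UTC date string."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")


def extract_state(location: str) -> str:
    """Take the part after the last comma, e.g. 'Raleigh, NC' gives 'NC'."""
    return location.rsplit(",", 1)[-1].strip()


def extract_platform(user_agent: str) -> str:
    """Assumed heuristic: first word inside the first parentheses of the user agent."""
    match = re.search(r"\(([^;)]+)", user_agent)
    return match.group(1).split()[0] if match else "unknown"


print(to_utc_date(1538352000000))
print(extract_state("Raleigh, NC"))
print(extract_platform("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36"))
```

Reducing “location” and “userAgent” to a handful of categories keeps the later feature encoding small, at the cost of discarding city and browser-version detail.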

The data cleaning step is now complete, so let’s start exploring the data. As the final purpose is to predict churn, we first need to label the churned users (downgrades were labeled with the same method). I used the “Cancellation Confirmation” events to define churn: users who visited the “Cancellation Confirmation” page were marked as “1”, and those who did not were marked as “0”. Similarly, users who visited the “Downgrade” page at least once were marked as “1”, and those who did not were marked as “0”. The data set, now with 278154 rows and the columns shown below, is ready for some exploratory analysis. Let’s do some comparisons between churned users and users who stayed.
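The labeling rule described above can be sketched in plain Python. It is an illustration of the rule, not the project's Spark code; `label_users` is a hypothetical helper and the sample events are made up.

```python
def label_users(rows):
    """Flag each user with churn=1 / downgrade=1 if they ever visited the matching page."""
    labels = {}
    for row in rows:
        flags = labels.setdefault(row["userId"], {"churn": 0, "downgrade": 0})
        if row["page"] == "Cancellation Confirmation":
            flags["churn"] = 1
        elif row["page"] == "Downgrade":
            flags["downgrade"] = 1
    return labels


# Made-up events for three users.
events = [
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Downgrade"},
    {"userId": "3", "page": "NextSong"},
    {"userId": "3", "page": "Cancellation Confirmation"},
    {"userId": "4", "page": "Home"},
]

labels = label_users(events)
for user_id, flags in sorted(labels.items()):
    print(user_id, flags)
```

The labels are per user rather than per event, so in a data frame they would be joined back onto every row of that user before comparing churned and retained users.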

#aws #data-science #churn-prediction #spark #pyspark #data analytic

Unraveling churn and its challenges

Relationship management is one of the determining factors in the health of a business. One of the most important parts of this relationship is the ability to identify when a customer is likely to cancel a service. For that reason, it is necessary to take initiatives that maximize customer retention.

Therefore, projects that identify customers prone to churn have become a frequent concern for organizations, as the cost of retention is usually lower than the cost of acquisition.

Although it has gained the attention of many companies, there is no magic formula to solve the churn problem. In addition, the solution can involve numerous complexities, such as identifying the reason for churn in order to apply different retention strategies.


Is the cost of acquiring new customers greater than the cost of retention?

It is essential to observe the financial and strategic expenses of acquiring and retaining customers, since for some companies the cost of acquisition may be 5x higher than the cost of retention.

What type of churn will be treated?

It is important to highlight that churn for a product or service occurs in many ways, such as:

  1. Voluntary: when the customer chooses to cancel the service due to dissatisfaction or preference for a competitor.
  2. Silent: when a customer stops using the service for a long period and it does not generate costs, such as a credit card without monthly fees.
  3. Involuntary: when the consumer does not intend to cancel the service but, through negligence, ends up with the plan not renewed or canceled for irregular use, lack of payment, among others.

#machine-learning #churn #data-science #churn-prediction #deep learning

Customer Churn in Telecom Segment

Companies usually have a greater focus on customer acquisition and keep retention as a secondary priority. However, it can cost five times more to attract a new customer than it does to retain an existing one. Increasing customer retention rates by 5% can increase profits by 25% to 95%, according to research done by Bain & Company.

Churn is a metric that shows customers who stop doing business with a company or a particular service, also known as customer attrition. By following this metric, what most businesses could do was try to understand the reasons behind the churn numbers and tackle those factors with reactive action plans.

But what if you could know in advance that a specific customer is likely to leave your business, and have a chance to take proper actions in time to prevent it from happening?

The reasons that lead customers to cancel can be numerous: poor service quality, delays in customer support, prices, new competitors entering the market, and so on. Usually there is no single reason, but a combination of events that culminates in customer displeasure.

If your company was not able to identify these signals and take action before the click on the cancel button, there is no turning back: your customer is already gone. But you still have something valuable: the data. Your customer left very good clues about where you fell short. These can be a valuable source of meaningful insights and of training data for customer churn models. Learning from the past and having strategic information at hand to improve future experiences is what machine learning is all about.

#telecommunication #machine-learning #churn-prediction #data-science #churn #deep learning