This April, a $1.2 billion Medicare scheme took advantage of hundreds of thousands of seniors in the US. In reality, this is just a small sliver of the billions of dollars healthcare fraud costs both consumers and insurance providers annually.
Healthcare fraud can come from many different directions. Some people might picture the patient who pretends to be injured, but in practice much of the fraud is committed by providers (as in the New York Times article above).
Providers often have financial incentives to perform unnecessary surgeries or to bill for work they never did. This leads to many different flavors of fraud, all of which can be difficult to detect on a claim-by-claim basis.
For example, fraud from healthcare providers can range from billing for services that were never provided to performing procedures that were never needed.
These methods of fraud are often effective for a few reasons. First, there are so many claims that it can be hard for claims processors to discover the fraudulent ones before paying them, so they are often paid out before they get caught. Second, the cost of adjudicating a claim can sometimes be greater than the value of the claim itself.
That makes it difficult for insurance providers to justify spending money on building methods to catch these bad behaviors.
This is why, before investing hundreds of thousands of dollars into your first fraud detection system, you should analyze your claims from multiple directions to get an idea of where fraud could be coming from.
This is where the exploratory data analysis step comes into play.
When you first start to analyze data, your goal will be to get a good sense of the data set.
This is known as exploratory data analysis. In particular, if your company follows the OSEMN (Obtain, Scrub, Explore, Model, and iNterpret) data science process, then this is the E step.
Following this process can also help you clearly communicate your progress to management.
The purpose of this step is to become familiar with the data as well as to drive future analysis. Generally, it involves a combination of checking data sets for skew, looking for trends, making charts, and so on.
It’s not about rigid structure or process; instead, it’s meant to surface possible insights while you work in a flow state.
How you approach this step depends on how you work best. For instance, our preference is to think of questions we want to answer about the data set and then go about answering said questions.
For example, in our analysis today we will be looking at the healthcare fraud data set from Kaggle.com. The data set is focused on identifying which providers are likely to have submitted fraudulent claims.
So our questions will focus on what could support the case of fraud for these providers and on why it is worth it for our business partners to invest in this project.
Here are some example questions:
- Do the patients of potentially fraudulent providers differ in age from the patients of other providers?
- Do potentially fraudulent providers cost more, either per claim or per member per month?
- Are there physicians who submit an unusually large number of claims, overall or per day?
These questions can help frame and guide our analysis so we don’t spend too much time wandering without a purpose.
The purpose of this EDA step is to provide support for later, more in-depth analysis. In a recent post, we discussed the concept of agile data science, not so much as a strict process but as a framework. This is one of those steps where, as you do the analysis, you can bring up interesting points with charts and metrics that might help move your business case along.
Note: In this example, we have already joined all the data sets together for easy use. You can find the code for that here.
For example, the questions above are meant to support the idea that it is worth looking further into potentially fraudulent providers.
So let’s see how they play out. We are running all of this on SaturnCloud.io because it makes it easy to spin up a VM, run the analysis, and share it.
In this first part, we look at age. It’s usually a great place to start because it is a natural dimension along which patterns might show up in the data.
So we can use the histogram function in Pandas to analyze this.
creating_data_sets.py
import numpy as np
import pandas as pd

#flag potentially fraudulent providers with a 1/0 indicator
Train_ProviderWithPatientDetailsdata['PF'] = np.where(Train_ProviderWithPatientDetailsdata['PotentialFraud']=='Yes', 1, 0)
#split the joined data set into fraud and non-fraud subsets (copies avoid SettingWithCopyWarning later)
fraudMask = Train_ProviderWithPatientDetailsdata["PF"]==1
notFraudMask = Train_ProviderWithPatientDetailsdata["PF"]==0
fraud = Train_ProviderWithPatientDetailsdata[fraudMask].copy()
notFraud = Train_ProviderWithPatientDetailsdata[notFraudMask].copy()
#compare the age distributions of the two groups
fraud.hist(column='Age', bins=50)
notFraud.hist(column='Age', bins=50)
However, once we look at the histograms, the distribution breaks down pretty evenly. The age profile is roughly the same across both sets of data, so age alone doesn’t separate the fraudulent providers from the rest.
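To go beyond eyeballing histograms, we can also compare summary statistics or run a quick two-sample test. Below is a minimal sketch (a hypothetical snippet, not part of the original analysis) that assumes the fraud and notFraud data frames from above and uses SciPy’s ks_2samp to check whether the two age distributions actually differ.
age_distribution_check.py
from scipy.stats import ks_2samp
#side-by-side summary statistics for the two groups
print(fraud['Age'].describe())
print(notFraud['Age'].describe())
#two-sample Kolmogorov-Smirnov test; a large p-value means we have no evidence the distributions differ
stat, pvalue = ks_2samp(fraud['Age'].dropna(), notFraud['Age'].dropna())
print(stat, pvalue)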
The next question we wanted to answer was focused on spending. First, we wanted to look at spending in general. Initially, when analyzing the gross amount, nothing sticks out, as seen in the charts below.
bar_plot.py
#creating the non-fraud bar plot
#cast claim start date as a date so we can extract the month value
notFraud['ClaimStartDt'] = pd.to_datetime(notFraud['ClaimStartDt'])
#creating a field that holds the month value
notFraud['cdMonth'] = notFraud['ClaimStartDt'].dt.month
#total deductible paid per month for non-fraud claims
notFraudMonthlySpend = notFraud.groupby(['cdMonth'])['DeductibleAmtPaid'].agg('sum')
notFraudMonthlySpend.plot.bar(title="Total Monthly Spend For Non-Fraud Claims")

#creating the fraud bar plot
fraud['ClaimStartDt'] = pd.to_datetime(fraud['ClaimStartDt'])
fraud['cdMonth'] = fraud['ClaimStartDt'].dt.month
#total deductible paid per month for fraud claims
fraudMonthlySpend = fraud.groupby(['cdMonth'])['DeductibleAmtPaid'].agg('sum')
fraudMonthlySpend.plot.bar(title="Total Monthly Spend For Fraud Claims")
Overall, the months seem to line up, except that the total amounts month over month seem to be much higher on the fraud side. So we wanted to look into this.
A better way to look at this is to normalize by the number of claims. First, let’s look at the average claim cost per month.
fraud_by_Claim.py
#creating bar plots that show the average cost per claim
notFraud['ClaimStartDt'] = pd.to_datetime(notFraud['ClaimStartDt'])
notFraud['cdMonth'] = notFraud['ClaimStartDt'].dt.month
#total deductible paid divided by the number of unique claims in each month
notFraudClaims = notFraud.groupby(['cdMonth'])['DeductibleAmtPaid'].agg('sum')*1.0/notFraud.groupby('cdMonth').ClaimID.nunique()
notFraudClaims.plot.bar(title="Per Claim Costs For Non-Fraud Claims")

fraud['ClaimStartDt'] = pd.to_datetime(fraud['ClaimStartDt'])
fraud['cdMonth'] = fraud['ClaimStartDt'].dt.month
fraudClaims = fraud.groupby(['cdMonth'])['DeductibleAmtPaid'].agg('sum')*1.0/fraud.groupby('cdMonth').ClaimID.nunique()
fraudClaims.plot.bar(title="Per Claim Costs For Fraud Claims")
We can see here that there is a drastic difference in the average cost per claim: providers that are likely to be committing fraud often charge roughly twice what the non-fraud providers charge. (Supporting this would require more analysis into what the underlying claims actually were.)
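To put a rough number on that “two times” observation, we can divide the fraud-side per-claim costs by the non-fraud per-claim costs, month by month. This is a small added sketch that reuses the fraudClaims and notFraudClaims series computed in the snippet above.
claim_cost_ratio.py
#monthly ratio of average fraud claim cost to average non-fraud claim cost
costRatio = fraudClaims / notFraudClaims
print(costRatio)
#average ratio across all months
print(costRatio.mean())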
Another metric widely used in healthcare is PMPM, which stands for Per Member Per Month. It’s a great way to see how much a patient is costing per month.
So instead of looking at the average claim cost, we will look at the average patient cost per month. Technically, we should only count patients who have valid coverage for the month in question.
However, the data set doesn’t give us that specific information, so for now we are using the patient’s ID as a proxy. It’s not perfect, but it is what we will use for now, as seen in the code below.
Fraud_By_Patient.py
#per member per month (PMPM) proxy: deductible paid divided by unique patients in each month
notFraud['ClaimStartDt'] = pd.to_datetime(notFraud['ClaimStartDt'])
notFraud['cdMonth'] = notFraud['ClaimStartDt'].dt.month
notFraudPMPM = notFraud.groupby(['cdMonth'])['DeductibleAmtPaid'].agg('sum')*1.0/notFraud.groupby('cdMonth').BeneID.nunique()
notFraudPMPM.plot.bar(title="Per Patient Costs For Non-Fraud Claims")

fraud['ClaimStartDt'] = pd.to_datetime(fraud['ClaimStartDt'])
fraud['cdMonth'] = fraud['ClaimStartDt'].dt.month
fraudPMPM = fraud.groupby(['cdMonth'])['DeductibleAmtPaid'].agg('sum')*1.0/fraud.groupby('cdMonth').BeneID.nunique()
fraudPMPM.plot.bar(title="Per Patient Costs For Fraud Claims")
Looking at this, you will notice that providers that are likely to submit fraudulent claims also cost roughly twice as much per patient per month as the non-fraudulent providers.
Now, why is it important that we have done this exploratory analysis before diving into model development?
The reason is that this provides a solid business case to sell to your stakeholders for why you would like to invest further in this project.
You already have a business reason that would intrigue any business partner. Based on the monthly spend charts, your company could be saving upwards of $750,000 USD a month, or several million dollars a year, if you were able to crack down on this fraud.
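To be clear, that $750,000 figure is a back-of-the-envelope estimate, and there is more than one way to arrive at a number like it. One rough approach (a sketch, and not necessarily how the figure above was produced) is to ask what the fraud-side claims would have cost if they were paid at the non-fraud average cost per claim, and treat the monthly difference as potential savings. This reuses the notFraudClaims per-claim series from fraud_by_Claim.py and the cdMonth field created earlier.
rough_savings_estimate.py
#number of unique claims per month at the potentially fraudulent providers
fraudClaimCounts = fraud.groupby('cdMonth').ClaimID.nunique()
#what those months would have cost at the non-fraud average cost per claim
expectedSpend = fraudClaimCounts * notFraudClaims
#actual monthly deductible spend at the potentially fraudulent providers
actualSpend = fraud.groupby('cdMonth')['DeductibleAmtPaid'].agg('sum')
#rough potential savings per month
potentialSavings = actualSpend - expectedSpend
print(potentialSavings)
print(potentialSavings.mean())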
In addition, you are already seeing some tendencies of fraudulent providers.
But you can’t stop analyzing the data just yet.
Now, as a data scientist or analyst, you will want further supporting evidence to continue down this avenue.
This means bringing in other angles from the data that further support the point that these providers are costing your insurance company far more than they should.
Here are a few ways we can do so.
Let’s start by looking at the number of claims each physician had in a year, and compare the breakdown for fraudulent vs. non-fraudulent providers.
Top_P_Fraud.py
import seaborn as sns
import matplotlib.pyplot as plt

#top 20 attending physicians by claim count at potentially fraudulent providers
sns.set(rc={'figure.figsize': (12, 8)}, style='white')
ax = sns.countplot(x='AttendingPhysician', hue='PotentialFraud',
                   data=filter_Train_ProviderWithPatientDetailsdata,
                   order=filter_Train_ProviderWithPatientDetailsdata.AttendingPhysician.value_counts().iloc[:20].index)
plt.title('Top-20 Attending Physicians Involved in Healthcare Fraud')
plt.xticks(rotation=90)
plt.savefig('TopAttendingPhysiciansInvolvedInHealthcareFraud')
plt.show()

#top 20 attending physicians by claim count at non-fraudulent providers
sns.set(rc={'figure.figsize': (12, 8)}, style='white')
ax = sns.countplot(x='AttendingPhysician', hue='PotentialFraud', data=notFraud,
                   order=notFraud.AttendingPhysician.value_counts().iloc[:20].index)
plt.title('Top-20 Attending Physicians Involved in Healthcare Non-Fraud')
plt.xticks(rotation=90)
plt.savefig('TopAttendingPhysiciansInvolvedInHealthcareNonFraud')
plt.show()
Top 20 Attending Physicians Involved in Healthcare Fraud vs. Top 20 Attending Physicians Not Involved in Healthcare Fraud
What you will notice is a drastic difference in the number of claims submitted by physicians at providers flagged as likely fraudulent vs. physicians at the non-fraudulent providers.
In addition, physician PHY330576 seems to be submitting a much larger number of claims than even their peers at the fraudulent providers. This would be worth digging into.
There could be a business reason for why this physician provides so many more claims. Perhaps they handle procedures that are very small and easy to do, and that could just be a confounding factor.
Again, it’s hard to say. But this still further supports the idea that fraudulent providers are providing, or claiming to provide, extra services that are not needed.
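If we did want to start digging into PHY330576 specifically, a quick filter on the joined data would show how many claims that physician is attached to and what they look like. This is a hypothetical starting point rather than part of the original analysis; the ClmDiagnosisCode_1 column name is an assumption based on the Kaggle data set and may differ in your join.
physician_drilldown.py
#all claims attributed to the physician in question
phy = Train_ProviderWithPatientDetailsdata[Train_ProviderWithPatientDetailsdata['AttendingPhysician'] == 'PHY330576']
#how many unique claims there are and how much was reimbursed
print(phy.ClaimID.nunique())
print(phy['InscClaimAmtReimbursed'].describe())
#most common primary diagnosis codes (column name assumed)
print(phy['ClmDiagnosisCode_1'].value_counts().head(10))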
Let’s take one last look at this from another angle.
Instead of monthly breakdowns, let’s analyze the number of claims each physician submits per day. If we break those daily claim counts down by whether the provider is flagged as fraudulent or not, what do we find?
greater_3_fraud.py
#counting the number of unique claims per physician per day
notFraudCountdate = notFraud.groupby(['AttendingPhysician','ClaimStartDt']).ClaimID.nunique()
#converting the series to a data frame
notFraudCountdate = notFraudCountdate.to_frame()
#flagging physician-days with more than two claims (i.e., three or more)
is_gt2NotFraud = notFraudCountdate['ClaimID'] > 2
#keeping only those high-volume physician-days
gtNotFraud = notFraudCountdate[is_gt2NotFraud]
gtNotFraud.shape

#repeating the same calculation for the fraud side
fraudCountdate = fraud.groupby(['AttendingPhysician','ClaimStartDt']).ClaimID.nunique().to_frame()
gtFraud = fraudCountdate[fraudCountdate['ClaimID'] > 2]

#share of non-fraud claims that fall on these high-volume physician-days
p = pd.merge(notFraud, gtNotFraud, on=['AttendingPhysician','ClaimStartDt'], how="inner")
p.ClaimID_x.count()/notFraud.ClaimID.count()

#distribution of claims per day for each group
gtNotFraud.hist(column="ClaimID")
gtFraud.hist(column="ClaimID")
Looking at the two charts, we can see that physician-days with three or more claims are far more common at the fraudulent providers than at the non-fraudulent providers.
In addition, when you look further into it, you will find that 15% of claims at fraudulent providers fall on days where a physician submitted three or more claims, compared to 3% at non-fraudulent providers.
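The 15% figure for the fraud side can be checked with the same merge pattern used for the non-fraud claims above. This short added sketch assumes the gtFraud frame built in the previous snippet.
fraud_share.py
#share of fraud-side claims that fall on physician-days with three or more claims
pFraud = pd.merge(fraud, gtFraud, on=['AttendingPhysician','ClaimStartDt'], how="inner")
print(pFraud.ClaimID_x.count()/fraud.ClaimID.count())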
As you can see, the fraudulent providers are submitting far more claims per day.
This is highly suspect and would be a great place to focus further analysis.
From here, your goal as an analyst would be to look at what types of claims make up these days with three or more claims. This might reveal a pattern of behavior.
However, we will stop our analysis for now.
Before we go on, we want to point out a nifty feature that helped us during our analysis. SaturnCloud.io is automatically integrated with Git. This means that, as you are working on answering these various questions, if you accidentally change something in your code and don’t remember what it was, then you can easily roll back. This came in handy because you’re not even seeing all the charts we developed.
Thus, having the ability to roll back and see if there were snippets of code that made more sense was very helpful!
In the original analysis on Kaggle, they tried to develop a model right away without really finding a target population.
Here we have a possible population to target: physicians who submit three or more claims per day. This would again be brought up in a meeting with stakeholders, but it goes to show why EDA is important.
It’s not always about going headfirst into the model. Sometimes it is about first developing solid support for which populations might be worth looking at.
In this case, there is value in analyzing the physicians with three or more claims per day, as that seems to be a distinguishing factor.
From here, you would want to see what procedures or diagnoses are included in these cases, as that might provide further information about what is going on.
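As a rough sketch of that next step, we could take the high-volume fraud-side claims from the pFraud merge above and look at their most common diagnosis and procedure codes. The ClmDiagnosisCode_1 and ClmProcedureCode_1 column names are assumptions based on the Kaggle data set.
high_volume_claim_codes.py
#most common primary diagnosis and procedure codes among fraud-side claims on high-volume days
print(pFraud['ClmDiagnosisCode_1'].value_counts().head(10))
print(pFraud['ClmProcedureCode_1'].value_counts().head(10))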
We do hope this gave you valuable insight into why EDA is important. It helps you get a better understanding of the data while at the same time providing support that you can offer your business partners.
#python #programming #data analysis