LendingClub is the world’s largest peer-to-peer lending platform. Until recently (through the end of 2018), LendingClub published a public dataset of all loans issued since the company’s launch in 2007. I’m accessing the dataset via Kaggle.
import pandas as pd
loans = pd.read_csv(
    "../input/lending-club/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv",
    low_memory=False,
)
loans.shape
(2260701, 151)
The dataset contains 2,260,701 loans and 151 potential variables. My goal is to build a neural network model that predicts the fraction of the expected loan return that a prospective borrower will pay back. Afterward, I’ll create a public API to serve that model.
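To make "fraction of an expected loan return" concrete, here is a rough sketch of one plausible definition: the expected return is the monthly installment times the number of scheduled payments, and the target is total payments received divided by that amount. This definition, the column names (`installment`, `term_months`, `total_received`), and the toy values are all assumptions for illustration, not something the dataset or its dictionary has confirmed at this point.

```python
import pandas as pd

# Toy data with made-up values; the column names are hypothetical
# placeholders, not confirmed names from the LendingClub dataset.
toy_loans = pd.DataFrame(
    {
        "installment": [100.0, 250.0, 300.0],     # monthly payment owed
        "term_months": [36, 36, 60],              # number of scheduled payments
        "total_received": [3600.0, 4500.0, 0.0],  # payments actually made
    }
)

# Expected return if every scheduled payment is made (assumed definition).
expected_return = toy_loans["installment"] * toy_loans["term_months"]

# Fraction of the expected return that was actually paid back.
toy_loans["fraction_recovered"] = toy_loans["total_received"] / expected_return

print(toy_loans["fraction_recovered"].tolist())  # [1.0, 0.5, 0.0]
```

Under this assumed definition, a fully repaid loan scores 1.0, a loan that returned half of its scheduled payments scores 0.5, and a loan with no payments scores 0.0.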
As you may have guessed from the code above, this post is adapted from a Jupyter Notebook. If you’d like to follow along in your own notebook, go ahead and fork mine on Kaggle or GitHub.
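If you are following along on a machine with limited memory, it may help to load only part of the file while exploring. The sketch below shows the idea with `pd.read_csv`'s `usecols` and `nrows` parameters. It reads from a small in-memory stand-in rather than the real CSV, and the column names and values in it are illustrative assumptions.

```python
import io

import pandas as pd

# Stand-in for the real CSV so the sketch runs anywhere; with the Kaggle
# file, pass its path to read_csv instead of this buffer.
csv_text = (
    "id,loan_amnt,term,grade\n"
    "1,1000,36 months,A\n"
    "2,2000,60 months,B\n"
    "3,3000,36 months,C\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["loan_amnt", "term"],  # parse only the columns of interest
    nrows=2,                        # stop after the first two data rows
)

print(sample.shape)  # (2, 2)
```

A sample like this is only suitable for quick exploration. Anything that depends on the full distribution of loans still needs the complete file.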
I’ll first look at the data dictionary (downloaded directly from LendingClub’s website) to get an idea of how to create the desired output variable and which remaining features are available at the point of loan application (to avoid data leakage).
dictionary_df = pd.read_excel("https://resources.lendingclub.com/LCDataDictionary.xlsx")
## Drop blank rows, strip white space, convert to Python dictionary, fix one key name
dictionary_df.dropna(axis="index", inplace=True)
## Series.str.strip avoids DataFrame.applymap, which newer pandas versions deprecate
dictionary_df = dictionary_df.apply(lambda column: column.str.strip())
dictionary_df.set_index("LoanStatNew", inplace=True)
dictionary = dictionary_df["Description"].to_dict()
dictionary["verification_status_joint"] = dictionary.pop("verified_status_joint")
## Print in order of dataset columns (which makes more sense than dictionary's order)
for col in loans.columns:
    print(f"•{col}: {dictionary[col]}")
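The loop above raises a `KeyError` if any dataset column has no matching dictionary key, which is why one key had to be renamed first. A minimal sketch of how such mismatches could be found up front is shown below. The `columns` list and `descriptions` mapping are small hypothetical stand-ins for `loans.columns` and `dictionary`, with made-up description text.

```python
# Hypothetical stand-ins for loans.columns and the description dictionary.
columns = ["loan_amnt", "term", "verification_status_joint"]
descriptions = {
    "loan_amnt": "Amount of the loan",
    "term": "Number of payments",
    "verified_status_joint": "Joint income verification status",
}

# Columns with no matching key would raise KeyError in a direct lookup.
missing = [col for col in columns if col not in descriptions]
print(missing)  # ['verification_status_joint']

# Keys with no matching column hint at renamed or misspelled entries.
unused = [key for key in descriptions if key not in columns]
print(unused)  # ['verified_status_joint']
```

Comparing the two lists side by side makes near-duplicate names like the one above easy to spot before any lookup fails.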