New and creative applications for machine learning are cropping up all over the place. Who knew that agriculturalists are using image recognition to evaluate the health of plants? Or that researchers are able to generate music imitating the styles of masters from Chopin to Charlie Parker? While there's a ton of interest in applying machine learning in new fields, there's no shortage of creativity among analysts solving age-old prediction problems.
This post explores one of the oldest prediction problems: predicting risk on consumer loans.
Predicting Bad Loans
We're going to be using the publicly available LendingClub loan performance dataset. It's a real-world dataset with a nice mix of categorical and continuous variables.
LendingClub makes several datasets available on its website. We're going to use the 2007 to 2011 file (LoanStats3a.csv), and our goal will be to build a web app which can approve and decline new loan applications.
Read and Clean the Data
Let's load the data into R and look at the status column. Notice that there are several different statuses which seem to indicate good repayment behavior and several more that indicate less than perfect repayment behavior.
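Loading and inspecting the file might look like the sketch below. The real file requires a download from LendingClub (and its raw status column is named differently across vintages), so a tiny stand-in data frame is used here to keep the example self-contained; the column values are illustrative.

```r
# In practice: df <- read.csv("LoanStats3a.csv", skip = 1, stringsAsFactors = FALSE)
# A tiny stand-in with the same shape, so the example runs on its own:
df <- data.frame(
  status = c("Fully Paid", "Charged Off", "Current",
             "Late (31-120 days)", "Fully Paid", ""),
  stringsAsFactors = FALSE
)

# Tabulate the repayment statuses to see which values appear
table(df$status)
```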
We're going to approach this as a binary classification problem, so our first step is to decide what statuses we'll consider good and which bad.
I didn't do anything too crazy here. If a loan was ever delinquent or if it is currently active but behind schedule, I considered it "bad". Conversely, I treated other active loans, loans in a grace period, and fully paid off loans to be "good".
Certain statuses were ambiguous, and, without a data dictionary, we need to choose how to deal with them. If it wasn't clear that an applicant actually received a loan (e.g.
df$status==""), I categorized it as NA.
Also, one of the reasons I chose to use the file from 2007 to 2011 was to limit the build sample to mature vintages only. In other words, if we were to look at loans originated in 2012, some would only be part-way through repayment and therefore would appear to be performing better than mature vintages which have had more time to go bad.
A Strategy for Finding Risky Applicants
For feature selection, I kept it quick and dirty. You can use ggplot2 to quickly compare the default rates and the distributions of each variable. From there, visually inspect the distributions and pick out a few variables that appear to have significant differences between the good and bad populations.
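A quick way to do this visual comparison is to overlay density curves by outcome. The sketch below uses simulated data with a hypothetical FICO-score column (not an actual field name from the file) just to show the pattern:

```r
library(ggplot2)

# Toy data standing in for the cleaned loan file: a hypothetical
# credit-score variable plus the good/bad flag derived earlier
set.seed(42)
loans <- data.frame(
  fico   = c(rnorm(500, mean = 720, sd = 30),
             rnorm(500, mean = 680, sd = 30)),
  is_bad = factor(rep(c(0, 1), each = 500))
)

# Overlay the two distributions; variables whose curves separate
# cleanly between good and bad loans are promising features
p <- ggplot(loans, aes(x = fico, fill = is_bad)) +
  geom_density(alpha = 0.4)
p
```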
After getting rid of loans issued after 2012, I was left with approximately 30,000 loan applications. From there I split the data into training (75%) and test (25%) sets.
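A simple index-sampling split does the job here. The data frame below is a stand-in for the cleaned sample, but the split logic is exactly the 75/25 scheme described above:

```r
set.seed(123)

# Stand-in for the cleaned data frame of loan applications
loans <- data.frame(x = rnorm(1000), is_bad = rbinom(1000, 1, 0.15))

# Draw 75% of the row indices for training; the rest are held out
train_idx <- sample(nrow(loans), size = floor(0.75 * nrow(loans)))
train <- loans[train_idx, ]
test  <- loans[-train_idx, ]

c(train = nrow(train), test = nrow(test))  # 750, 250
```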
Random Forest does a pretty outstanding job with most prediction problems (if you're interested, read our post on random forest using Python), so I decided to use R's randomForest package.
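Fitting the model follows the usual randomForest formula interface. The feature names below are illustrative stand-ins, not the actual LendingClub fields; the key detail is that a factor response makes the package fit a classification forest:

```r
library(randomForest)

set.seed(1)
# Illustrative training data; column names are assumptions
train <- data.frame(
  fico   = c(rnorm(300, 720, 30), rnorm(300, 680, 30)),
  dti    = runif(600, 0, 30),
  is_bad = factor(rep(c(0, 1), each = 300))
)

# A factor response makes randomForest fit a classification forest
fit <- randomForest(is_bad ~ fico + dti, data = train, ntree = 200)

# Predicted probability of each class for new applications
probs <- predict(fit, newdata = train, type = "prob")
head(probs)
```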
There really are lots of ways to skin this cat, so you can and should explore a few. Check out this post exploring the best modeling techniques among Kaggle participants in the Give Me Some Credit competition.
I horse-raced Random Forest against several other algorithms, including logistic regression, and it consistently came out ahead.
Deploying to Yhat
So I've got this script on my laptop which is cool in an academic sort of way. But these insights would be more useful in a live application. Let's turn our R script into a routine that can be called via REST.
First, wrap your scoring logic in the model.predict function, and be sure to handle any categorical variables; the levels function is useful for lining new data up with the factor levels the model was trained on. Then run the yhat.deploy function, and your model is deployed and exposed as a RESTful API that you can call from anywhere!
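A minimal sketch of that pattern is below. Beyond the function names the post itself mentions (model.predict, yhat.deploy, levels), the yhatr package, the yhat.config credentials vector, and the "LoanModel" name are assumptions; a plain glm stands in for the trained Random Forest so the scoring function can be tested locally:

```r
# library(yhatr)  # Yhat's R client (assumed)

# An illustrative model: logistic regression on a made-up feature
set.seed(7)
train <- data.frame(
  fico   = rnorm(200, mean = 700, sd = 30),
  is_bad = rbinom(200, 1, 0.2)
)
fit <- glm(is_bad ~ fico, data = train, family = binomial)

# Wrap the scoring logic in model.predict; for categorical inputs you'd
# re-apply the training factor levels here (via factor/levels) so new
# data lines up with what the model saw
model.predict <- function(df) {
  data.frame(prob_bad = predict(fit, newdata = df, type = "response"))
}

# yhat.config <- c(username = "...", apikey = "...")  # credentials (assumed shape)
# yhat.deploy("LoanModel")  # exposes model.predict as a REST endpoint

model.predict(data.frame(fico = c(640, 760)))
```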
Automatically Generate API Docs
Your model is deployed and can be called via REST. Because you might not be the only person using your API (i.e. others on your team might need to call it to make predictions), you probably want to add some documentation around its usage. Yhat can generate a documentation page for you.
First, set up a test case with raw data as it appears in the wild. In other words, take real observations from your data set or generate some realistic sample data. For us, this will be a few raw loan applications. I'm just using the rows we used for training the model.
Pass that data to the yhat.document function along with the model name and the model version you want to document (for us, version 1). Yhat generates HTML docs and returns a URL that looks like this:
Using the test data you provided, Yhat will identify the input parameters that your model expects when making new predictions. From there, it produces a web app that lets you test the model using a UI. Input some test values and click "Go!" to execute the model in real time.
You can visit my app here, or you can use it in the iframe below.
Use the Results dropdown to display the prediction in HTML or JSON. The JSON format is especially helpful to anybody using your model from other software applications.
Further Reading
- Lending Club Stats
- Lending Club Modeling
- Lending Club Loan Analysis: Making Money with Logistic Regression by Dr. Jason Davis
- The Complete Guide to Investor Risks at Lending Club & Prosper by Simon Cunningham
- Lending Club Review - How to Become a Bank
- Big Data + Machine Learning = Scared banks by Jeremy Liew
- Random Forest of 'Give Me Some Credit' Survey Results by Margit Zwemer
- Analysis of Survey Data for the "Give Me Some Credit" Competition Hosted on Kaggle (PDF whitepaper)