Logistic Regression is a statistical technique capable of predicting a binary outcome. It's a well-known strategy, widely used in disciplines ranging from credit and finance to medicine to criminology and other social sciences. Logistic regression is fairly intuitive and very effective; you're likely to find it among the first few chapters of a machine learning or applied statistics book, and its usage is covered by many stats courses.
It's not hard to find quality logistic regression examples using R. This tutorial published by UCLA, for example, is a great resource and one that I've consulted many times. Python is one of the most popular languages for machine learning, and while there are bountiful resources covering topics like Support Vector Machines and text classification using Python, there's far less material on logistic regression.
This is a post about using logistic regression in Python.
We'll use a few libraries in the code samples. Make sure you have these installed before you run through the code on your machine.
numpy: a language extension that defines the numerical array and matrix
pandas: primary package to handle and operate directly on data.
statsmodels: statistics & econometrics package with useful tools for parameter estimation & statistical testing
pylab: for generating plots
Check out our post on Setting Up Scientific Python if you're missing one or more of these.
Example Use Case for Logistic Regression
We'll be using the same dataset as UCLA's Logit Regression in R tutorial to explore logistic regression in Python. Our goal will be to identify the various factors that may influence admission into graduate school.
The dataset contains several columns which we can use as predictor variables:
gre: Graduate Record Exam score
gpa: grade point average
rank or prestige of an applicant's undergraduate alma mater
The fourth column, admit, is our binary target variable. It indicates whether or not a candidate was admitted.
Load the data
Load the data using pandas.read_csv. We now have a DataFrame and can explore the data.
Notice that one of the columns is called "rank". This presents a problem since rank is also the name of a method belonging to pandas DataFrames (rank calculates the ordered rank, 1 through n, of a Series). To make things easier, I renamed the rank column to "prestige".
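A sketch of this step, reading from an inline string so the snippet runs on its own (in practice you'd pass pd.read_csv the dataset's file path or URL; the rows below are just illustrative, in the same shape as the UCLA data):

```python
import pandas as pd
from io import StringIO

# a few rows in the same shape as the UCLA admissions CSV
csv_data = StringIO("""admit,gre,gpa,rank
0,380,3.61,3
1,660,3.67,3
1,800,4.00,1
""")
df = pd.read_csv(csv_data)

# rename 'rank' to 'prestige' so it doesn't shadow DataFrame.rank
df = df.rename(columns={"rank": "prestige"})
print(df.head())
```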
Summary Statistics & Looking at the data
Now that we've got everything loaded into Python and named appropriately, let's take a look at the data. We can use the describe method to give us a summarized view of everything--describe is analogous to summary in R. There's also a function for calculating the standard deviation, std. I've included it here to be consistent with UCLA's tutorial, but the standard deviation is also part of describe's output.
A feature I really like in pandas is crosstab, which makes it really easy to build multidimensional frequency tables (sort of like table in R). You might want to play around with this to look at different cuts of the data.
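A minimal sketch of these calls, using a handful of made-up rows in the same shape as the admissions data so the snippet stands alone:

```python
import pandas as pd

# illustrative rows shaped like the admissions data
df = pd.DataFrame({
    "admit":    [0, 1, 1, 0, 1, 0],
    "gre":      [380, 660, 800, 640, 520, 760],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, 3.00],
    "prestige": [3, 3, 1, 4, 4, 2],
})

# summarized view of each column, analogous to summary() in R
print(df.describe())

# standard deviation of each column (also part of describe's output)
print(df.std())

# frequency table of admission vs. prestige, like table() in R
print(pd.crosstab(df["admit"], df["prestige"], rownames=["admit"]))
```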
Histograms are often one of the most helpful tools you can use during the exploratory phase of any data analysis project. They're normally pretty easy to plot, quick to interpret, and they give you a nice visual representation of your problem.
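Plotting one histogram per column takes a single call (the Agg backend line below is only there so the sketch runs headless; drop it when working interactively):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for non-interactive runs
import pandas as pd
import pylab as pl

# illustrative rows shaped like the admissions data
df = pd.DataFrame({
    "admit":    [0, 1, 1, 0],
    "gre":      [380, 660, 800, 640],
    "gpa":      [3.61, 3.67, 4.00, 3.19],
    "prestige": [3, 3, 1, 4],
})

# one histogram per column
axes = df.hist()
pl.savefig("histograms.png")
```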
pandas gives you a great deal of control over how categorical variables are represented. We're going to dummify the "prestige" column using get_dummies.
get_dummies creates a new DataFrame with binary indicator variables for each category/option in the column specified. In this case, prestige has four levels: 1, 2, 3 and 4 (1 being most prestigious). When we call get_dummies, we get a dataframe with four columns, each of which describes one of those levels.
Once that's done, we merge the new dummy columns into the original dataset and get rid of the prestige column, which we no longer need.
Lastly, we're going to add a constant term for our Logistic Regression. The statsmodels function we're going to be using requires that intercepts/constants are specified explicitly.
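A sketch of the dummify-merge-intercept sequence, again on a few illustrative rows that cover all four prestige levels:

```python
import pandas as pd

# illustrative rows covering all four prestige levels
df = pd.DataFrame({
    "admit":    [0, 1, 1, 0, 1, 0],
    "gre":      [380, 660, 800, 640, 520, 760],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, 3.00],
    "prestige": [3, 3, 1, 4, 2, 2],
})

# one binary indicator column per prestige level: prestige_1 .. prestige_4
dummy_ranks = pd.get_dummies(df["prestige"], prefix="prestige")

# keep the non-categorical columns, join the dummies, drop the original column
cols_to_keep = ["admit", "gre", "gpa"]
data = df[cols_to_keep].join(dummy_ranks)

# statsmodels needs the intercept specified explicitly
data["intercept"] = 1.0
```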
Performing the regression
Actually doing the Logistic Regression is quite simple. Specify the column containing the variable you're trying to predict followed by the columns that the model should use to make the prediction.
In our case we'll be predicting the admit column using gre, gpa, and the prestige dummy variables prestige_2, prestige_3 and prestige_4. We're going to treat prestige_1 as our baseline and exclude it from our fit. This is done to prevent multicollinearity, or the dummy variable trap caused by including a dummy variable for every single category.
Interpreting the results
One of my favorite parts about statsmodels is the summary output it gives. If you're coming from R, I think you'll like the output and find it very familiar.
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  400
Model:                          Logit   Df Residuals:                      394
Method:                           MLE   Df Model:                            5
Date:                Sun, 03 Mar 2013   Pseudo R-squ.:                 0.08292
Time:                        12:34:59   Log-Likelihood:                -229.26
converged:                       True   LL-Null:                       -249.99
                                        LLR p-value:                 7.578e-08
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0023      0.001      2.070      0.038         0.000     0.004
gpa            0.8040      0.332      2.423      0.015         0.154     1.454
prestige_2    -0.6754      0.316     -2.134      0.033        -1.296    -0.055
prestige_3    -1.3402      0.345     -3.881      0.000        -2.017    -0.663
prestige_4    -1.5515      0.418     -3.713      0.000        -2.370    -0.733
intercept     -3.9900      1.140     -3.500      0.000        -6.224    -1.756
==============================================================================
You get a great overview of the coefficients of the model, how well those coefficients fit, the overall fit quality, and several other statistical measures.
The result object also lets you isolate and inspect parts of the model output. The confidence interval gives you an idea of how robust the coefficients of the model are.
In this example, we're very confident that there is an inverse relationship between the probability of being admitted and the prestige of a candidate's undergraduate school.
In other words, the probability of being accepted into a graduate program is higher for students who attended a top-ranked undergraduate college (prestige_1==True) than for students who attended a lower-ranked school with, say, prestige_4==True (remember, a prestige of 1 is the most prestigious and a prestige of 4 is the least prestigious).
Take the exponential of each of the coefficients to generate the odds ratios. This tells you how a 1 unit increase or decrease in a variable affects the odds of being admitted. For example, we can expect the odds of being admitted to decrease by about 50% if the prestige of a school is 2. UCLA gives a more in-depth explanation of the odds ratio here.
We can also do the same calculations using the endpoints of the coefficients' confidence intervals to get a better picture of how uncertainty in the estimates can impact the admission rate.
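Using the coefficients from the summary table above (copied in by hand so the snippet stands alone; on a fitted statsmodels result you'd simply write np.exp(result.params)):

```python
import numpy as np
import pandas as pd

# coefficients taken from the fitted model's summary table
params = pd.Series({
    "gre": 0.0023,
    "gpa": 0.8040,
    "prestige_2": -0.6754,
    "prestige_3": -1.3402,
    "prestige_4": -1.5515,
    "intercept": -3.9900,
})

# odds ratios: exponentiate each coefficient
odds_ratios = np.exp(params)
print(odds_ratios)
```

prestige_2 comes out near 0.51: attending a prestige-2 school roughly halves the odds of admission relative to the prestige-1 baseline.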
Digging a little deeper
As a way of evaluating our classifier, we're going to recreate the dataset with every logical combination of input values. This will allow us to see how the predicted probability of admission increases/decreases across different variables. First we're going to generate the combinations using a helper function called cartesian which I originally found here.
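The helper itself isn't reproduced in this post; a minimal stand-in built on itertools.product behaves the same way for our purposes (one row per combination of the input arrays), though the original is vectorized:

```python
import numpy as np
from itertools import product

# minimal stand-in for the cartesian helper: one row per combination
def cartesian(arrays):
    return np.array(list(product(*arrays)))

combos = cartesian([[1, 2], [10, 20, 30]])
print(combos.shape)  # (6, 2)
```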
We're going to use np.linspace to create a range of values for "gre" and "gpa". This creates a range of linearly spaced values between a specified minimum and maximum value--in our case, the min/max observed values.
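For example (220-800 and 2.26-4.0 being the observed ranges in the admissions data; on your own DataFrame you'd use data["gre"].min() and friends):

```python
import numpy as np

# 10 evenly spaced values spanning each variable's observed range
gres = np.linspace(220, 800, 10)
gpas = np.linspace(2.26, 4.0, 10)
print(gres)
```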
Now that we've generated our predictions, let's make some plots to visualize the results. I created a small helper function called isolate_and_plot which allows you to compare a given variable with the different prestige levels and the mean probability for that combination. To isolate prestige and the other variable, I used a pivot_table, which allows you to easily aggregate the data.
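The aggregation step looks roughly like this, shown on a tiny hypothetical predictions frame (the column name admit_pred is illustrative, not from the original code):

```python
import pandas as pd

# hypothetical predictions: one row per (gre, prestige) combination
pred = pd.DataFrame({
    "gre":        [220.0, 220.0, 800.0, 800.0],
    "prestige":   [1, 4, 1, 4],
    "admit_pred": [0.4, 0.1, 0.7, 0.3],
})

# mean predicted probability for each gre value, broken out by prestige
grouped = pd.pivot_table(pred, values=["admit_pred"],
                         index=["gre", "prestige"], aggfunc="mean")
print(grouped)
```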
The resulting plots show how gre, gpa, and prestige affect the admission levels. You can see how the probability of admission gradually increases as gre and gpa increase, and that the different prestige levels yield drastically different probabilities of admission (particularly for the most/least prestigious schools).
Logistic Regression is an excellent algorithm for classification. Even though some of the sexier, black-box classification algorithms like SVM and RandomForest can perform better in some cases, it's hard to deny the value in knowing exactly what your model is doing. Oftentimes you can get by using RandomForest to select the features of your model and then rebuild the model with Logistic Regression using the best features.