Logistic Regression in Python with statsmodels
Logistic Regression is a relatively simple, powerful, and fast statistical model and an excellent tool for Data Analysis. In this post, we'll look at Logistic Regression in Python with the statsmodels package.
We'll look at how to fit a Logistic Regression to data, inspect the results, and related tasks such as accessing model parameters, calculating odds ratios, and setting reference values.
For this post, I'm going to assume a few things:
- Basic knowledge of Python.
- Familiarity with popular data libraries like Pandas and NumPy.
- An understanding of Logistic Regression and associated statistical modeling terms such as coefficients and parameters.
If you're unfamiliar with Logistic Regression, I highly recommend starting with the Logistic Regression Playlist from StatQuest with Josh Starmer on YouTube.
Overview
What is statsmodels?
statsmodels is a Python package geared towards data exploration with statistical methods. It provides a wide range of statistical tools, integrates with Pandas and NumPy, and supports R-style formula strings for defining models.
Installing
The easiest way to install statsmodels is via pip:
pip install statsmodels
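statsmodels is also available from the conda-forge channel if you manage packages with conda:
conda install -c conda-forge statsmodels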
Logistic Regression with statsmodels
Before starting, it's worth mentioning there are two ways to do Logistic Regression in statsmodels:
- statsmodels.api: The Standard API. Data gets separated into explanatory variables (exog) and a response variable (endog). Specifying a model is done through classes.
- statsmodels.formula.api: The Formula API. It uses R-style formula syntax and dataframes.
For this guide, I've opted to use the Formula API. The Formula API is a more convenient way of building models that abstracts away the boilerplate required by the Standard API.
Under the hood, both the Standard and Formula APIs use the same underlying models.
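To make the difference concrete, here's a minimal sketch of the same one-variable model built with both APIs. It assumes the titanic.csv file used later in this post; the Standard API version hand-builds the dummy coding and intercept that the Formula API handles automatically.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

titanic = pd.read_csv("titanic.csv").dropna(subset=["survived", "sex"])

# Formula API: dummy coding and the intercept come from the formula string
formula_model = smf.logit("survived ~ sex", data=titanic)

# Standard API: build the response (endog) and explanatory (exog) data by hand
exog = sm.add_constant(pd.get_dummies(titanic["sex"], drop_first=True).astype(float))
endog = titanic["survived"]
standard_model = sm.Logit(endog, exog)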
Imports
First we need to import Pandas and the statsmodels Formula API. Convention is to alias statsmodels.formula.api to smf.
import pandas as pd
import statsmodels.formula.api as smf
Load data
Next, we'll load some data. We'll use a subset of the data and drop rows with missing values to keep things simple.
titanic = pd.read_csv("titanic.csv")
titanic = titanic[["survived", "pclass", "sex", "age", "embark_town"]]
titanic = titanic.dropna()
The data we're using is the seaborn version of the Titanic Dataset and can be downloaded here. The seaborn version is a minimal dataset with some pre-processing applied.
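Before fitting anything, a quick shape check confirms what the cleanup left behind (the fit summary below reports 712 observations, so we expect 712 rows):
# Rows and columns remaining after selecting columns and dropping missing values
print(titanic.shape)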
Fitting a Logistic Regression
Fitting is a two-step process. First, we specify a model, then we fit. Typically the fit() call is chained to the model specification.
The string provided to logit, "survived ~ sex + age + embark_town", is called the formula string and defines the model to build.
log_reg = smf.logit("survived ~ sex + age + embark_town", data=titanic).fit()
We read the formula string as "survived given (~) sex and age and embark town". An explanation of formula strings can be found below.
Examining fit results
Lastly, we can inspect the results of the fit using the summary() method. The summary includes information on the fit process as well as the estimated coefficients.
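Calling summary() on the fit result and printing it produces the output below.
print(log_reg.summary())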
Optimization terminated successfully.
Current function value: 0.509889
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: survived No. Observations: 712
Model: Logit Df Residuals: 707
Method: MLE Df Model: 4
Date: Fri, 12 Nov 2021 Pseudo R-squ.: 0.2444
Time: 09:59:50 Log-Likelihood: -363.04
converged: True LL-Null: -480.45
Covariance Type: nonrobust LLR p-value: 1.209e-49
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
Intercept 2.2046 0.322 6.851 0.000 1.574 2.835
sex[T.male] -2.4760 0.191 -12.976 0.000 -2.850 -2.102
embark_town[T.Queenstown] -1.8156 0.535 -3.393 0.001 -2.864 -0.767
embark_town[T.Southampton] -1.0069 0.237 -4.251 0.000 -1.471 -0.543
age -0.0081 0.007 -1.233 0.217 -0.021 0.005
==============================================================================================
The summary() method has some helpful features explored further below.
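A side note: the "Optimization terminated successfully" lines come from the optimizer inside fit(), not from summary(). If you want a quieter fit, the disp argument suppresses them:
log_reg = smf.logit("survived ~ sex + age + embark_town", data=titanic).fit(disp=0)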
In full
Here's our minimal example in full.
import pandas as pd
import statsmodels.formula.api as smf
# Load data
titanic = pd.read_csv("titanic.csv")
titanic = titanic[["survived", "pclass", "sex", "age", "embark_town"]]
titanic = titanic.dropna()
# Define and fit model
log_reg = smf.logit("survived ~ sex + age + embark_town", data=titanic).fit()
# Summary of results
print(log_reg.summary())
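With the model fit, we can also get predicted probabilities back out. For formula-based models, predict() accepts a dataframe with the original column names; here's a small sketch:
# Predicted survival probabilities for the first five passengers
print(log_reg.predict(titanic.head()))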
Advanced Usage
Here we'll look at some of the more advanced features of statsmodels and its Logistic Regression implementation.
Accessing model parameters
In statsmodels, the fit() method returns a Results object. The model coefficients, standard errors, p-values, etc., are all available from this Results object.
Conveniently, these are stored as Pandas Series and DataFrames indexed by parameter name.
# ... imports, load data, etc.
# Define and fit model
log_reg = smf.logit("survived ~ sex + age + embark_town", data=titanic).fit()
# Inspect parameters
print(log_reg.params)
Intercept                     2.204585
sex[T.male]                  -2.475963
embark_town[T.Queenstown]    -1.815589
embark_town[T.Southampton]   -1.006949
age                          -0.008079
dtype: float64
Here are some of the relevant values for a Logistic Regression.
| Attr/func | Description |
|---|---|
| params | Estimated model parameters. Appears as coef when calling summary() on a fitted model. |
| bse | Standard errors of the estimated parameters. |
| tvalues | The test statistics. Appears as the z column when calling summary() on a fitted model. |
| pvalues | The model's p-values. |
| conf_int(alpha) | Method that calculates the confidence intervals for the estimated parameters. To call: model.conf_int(0.05) |
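For example, here's how to pull the standard errors, p-values, and 95% confidence intervals from the fit above:
# Standard errors, p-values, and confidence intervals of the estimates
print(log_reg.bse)
print(log_reg.pvalues)
print(log_reg.conf_int(0.05))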
To see the complete list of available attributes and methods, use Python's built-in dir() function on the fitted model.
print(dir(log_reg))
Calculating Odds Ratios
After fitting a Logistic Regression, you'll likely want to calculate the Odds Ratios of the estimated parameters. As mentioned above, everything we need is available from the Results object that comes from a model fit.
Here we take the estimated parameters and confidence intervals, combine them into a dataframe, and apply NumPy's exp() function to the whole dataframe.
# ... imports, load data, etc.
import numpy as np
# ... Define and fit model
# Combine the estimated parameters and their confidence intervals
conf = log_reg.conf_int()
odds_ratios = pd.DataFrame(
    {
        "OR": log_reg.params,
        "Lower CI": conf[0],
        "Upper CI": conf[1],
    }
)
# Exponentiate the log-odds to get odds ratios
odds_ratios = np.exp(odds_ratios)
print(odds_ratios)
OR Lower CI Upper CI
Intercept 9.066489 4.825321 17.035387
sex[T.male] 0.084082 0.057848 0.122213
embark_town[T.Queenstown] 0.162742 0.057027 0.464428
embark_town[T.Southampton] 0.365332 0.229654 0.581167
age 0.991954 0.979300 1.004771
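Reading the table: the odds ratio for sex[T.male] is roughly 0.084, meaning that, holding age and embarkation town constant, the estimated odds of survival for a male passenger were about 8% of the odds for a female passenger.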
Hat tip to hedz.nz for the inspiration for this approach.
What happens with formula strings? Patsy and design matrices
Most of the models in statsmodels require design matrices. You can think of design matrices as representing data in a way compatible with model building.
When we use the Formula API with a formula string, internally, this formula string is turned into a design matrix by the Patsy library.
We can explore how Patsy transforms the data by using the patsy.dmatrices() function.
import pandas as pd
import patsy
titanic = pd.read_csv("titanic.csv")
y, X = patsy.dmatrices("survived ~ sex + age + embark_town", data=titanic, return_type="dataframe")
Inspecting X, we can see that Patsy converted categorical variables into dummy variables and added a constant Intercept column.
print(X[:5])
Intercept sex[T.male] ... embark_town[T.Southampton] age
0 1.0 1.0 ... 1.0 22.0
1 1.0 0.0 ... 0.0 38.0
2 1.0 0.0 ... 1.0 26.0
3 1.0 0.0 ... 1.0 35.0
4 1.0 1.0 ... 1.0 35.0
Patsy is well worth exploring if you're coming from the R language or need advanced variable treatment; building design matrices is precisely what it's for.
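As a small taste, formulas can also transform variables inline. Here's a sketch using Patsy's built-in standardize() transform alongside the categorical C() treatment:
import patsy
# Standardize age and dummy-code pclass directly in the formula
y, X = patsy.dmatrices(
    "survived ~ standardize(age) + C(pclass)",
    data=titanic,
    return_type="dataframe",
)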
Setting a reference or base level for categorical variables
With Categorical Variables, you'll sometimes want to set the reference category to be a specific value. This can help make the results more interpretable.
In the Titanic Dataset used above, we could examine how likely survival was for first-class passengers relative to third-class. We can do this with Patsy's categorical treatments.
In the Titanic dataset, the pclass column gets interpreted as an integer. We change this by wrapping it in an uppercase C and parentheses: C(pclass).
formula = "survived ~ C(pclass)"log_reg = smf.logit(formula, data=titanic).fit()
==================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 0.6286 0.155 4.061 0.000 0.325 0.932
C(pclass)[T.2] -0.7096 0.217 -3.269 0.001 -1.135 -0.284
C(pclass)[T.3] -1.7844 0.199 -8.987 0.000 -2.174 -1.395
==================================================================================
Notice, though, this only signals to Patsy to treat pclass as categorical. The reference level hasn't changed. To set the reference level, we include a Treatment argument with reference set to the desired value.
formula = "survived ~ C(pclass, Treatment(reference=3))"log_reg = smf.logit(formula, data=titanic).fit()
==========================================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------------------
Intercept -1.1558 0.124 -9.293 0.000 -1.400 -0.912
C(pclass, Treatment(reference=3))[T.1] 1.7844 0.199 8.987 0.000 1.395 2.174
C(pclass, Treatment(reference=3))[T.2] 1.0748 0.197 5.469 0.000 0.690 1.460
==========================================================================================================
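The reference can also be given as a string level. For example, a sketch using the embark_town column from earlier with Southampton as the baseline:
formula = 'survived ~ C(embark_town, Treatment(reference="Southampton"))'
log_reg = smf.logit(formula, data=titanic).fit()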
For more on categorical treatments, see here and here from the Patsy docs.
Customizing the fit summary
After we've fit a model, we'll typically inspect the results by calling summary() on the returned result. Let's look at some of the helpful things this method can do.
Accessing the summary tables
When we print summary(), we see two areas of information: fit details and a table of parameter estimates. We can access these tables from the Summary object's tables attribute.
fit_summary = log_reg.summary().tables[0]
These tables can also be output as LaTeX or HTML with the as_latex_tabular() or as_html() methods.
fit_summary = log_reg.summary().tables[0]
fit_summary.as_html()
Relabel parameter names
To relabel the parameter names, the summary() method provides an xname argument.
xname is a list of labels that will be applied to each row of the summary's coefficient table. The length of xname must match the length of the params attribute of the Results object returned when calling fit().
Default:
print(log_reg.summary())
==================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 0.6286 0.155 4.061 0.000 0.325 0.932
C(pclass)[T.2] -0.7096 0.217 -3.269 0.001 -1.135 -0.284
C(pclass)[T.3] -1.7844 0.199 -8.987 0.000 -2.174 -1.395
==================================================================================
Using xname:
print(log_reg.summary(xname=["1st class", "2nd class", "3rd class"]))
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
1st class 0.6286 0.155 4.061 0.000 0.325 0.932
2nd class -0.7096 0.217 -3.269 0.001 -1.135 -0.284
3rd class -1.7844 0.199 -8.987 0.000 -2.174 -1.395
==============================================================================
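There's a matching yname argument for relabeling the response variable. For example, with the same fitted model:
print(log_reg.summary(yname="survived (0/1)", xname=["1st class", "2nd class", "3rd class"]))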
Output summary to various formats
As you saw above, after fitting the model to the data, we printed the summary() output to examine the details of the model fit.
The summary() method also returns a Summary object with some useful methods for outputting to other formats.
Format | Method | Note |
---|---|---|
LaTeX | log_reg.summary().as_latex() | Merges everything into a single table. The statsmodels documentation recommends using as_latex_tabular() directly on individual summary tables instead. |
HTML | log_reg.summary().as_html() | Outputs an HTML <table>. |
CSV | log_reg.summary().as_csv() | Outputs comma-separated text. |
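Each of these methods returns a string, so saving a summary is just a matter of writing that string to a file. A minimal sketch (the file name is arbitrary):
# Write the HTML version of the fit summary to disk
with open("fit_summary.html", "w") as f:
    f.write(log_reg.summary().as_html())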
Conclusion
In this guide, we looked at how to do Logistic Regression in Python with the statsmodels package. We covered how to fit the model to data, access the estimated parameters, calculate odds ratios, set reference levels for categorical variables, and customize the fit summary. I hope you found it helpful!