Formulas: Fitting models using R-style formulas

Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs:

Loading modules and functions

In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm
/build/statsmodels-JytjB9/statsmodels-0.8.0/.pybuild/cpython3_3.6_statsmodels/build/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Import convention

You can import explicitly from statsmodels.formula.api

In [2]:
from statsmodels.formula.api import ols

Alternatively, you can just use the formula namespace of the main statsmodels.api.

In [3]:
sm.formula.ols
Out[3]:
<bound method Model.from_formula of <class 'statsmodels.regression.linear_model.OLS'>>

Or you can use the following conventioin

In [4]:
import statsmodels.formula.api as smf

These names are just a convenient way to get access to each model's from_formula classmethod. See, for instance

In [5]:
sm.OLS.from_formula
Out[5]:
<bound method Model.from_formula of <class 'statsmodels.regression.linear_model.OLS'>>

All of the lower case models accept formula and data arguments, whereas upper case ones take endog and exog design matrices. formula accepts a string which describes the model in terms of a patsy formula. data takes a pandas data frame or any other data structure that defines a __getitem__ for variable names like a structured array or a dictionary of variables.

dir(sm.formula) will print a list of available models.

Formula-compatible models have the following generic call signature: (formula, data, subset=None, *args, **kwargs)

OLS regression using formulas

To begin, we fit the linear model described on the Getting Started page. Download the data, subset columns, and list-wise delete to remove missing observations:

In [6]:
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
---------------------------------------------------------------------------
gaierror                                  Traceback (most recent call last)
/usr/lib/python3.6/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1317                 h.request(req.get_method(), req.selector, req.data, headers,
-> 1318                           encode_chunked=req.has_header('Transfer-encoding'))
   1319             except OSError as err: # timeout error

/usr/lib/python3.6/http/client.py in request(self, method, url, body, headers, encode_chunked)
   1238         """Send a complete request to the server."""
-> 1239         self._send_request(method, url, body, headers, encode_chunked)
   1240 

/usr/lib/python3.6/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
   1284             body = _encode(body, 'body')
-> 1285         self.endheaders(body, encode_chunked=encode_chunked)
   1286 

/usr/lib/python3.6/http/client.py in endheaders(self, message_body, encode_chunked)
   1233             raise CannotSendHeader()
-> 1234         self._send_output(message_body, encode_chunked=encode_chunked)
   1235 

/usr/lib/python3.6/http/client.py in _send_output(self, message_body, encode_chunked)
   1025         del self._buffer[:]
-> 1026         self.send(msg)
   1027 

/usr/lib/python3.6/http/client.py in send(self, data)
    963             if self.auto_open:
--> 964                 self.connect()
    965             else:

/usr/lib/python3.6/http/client.py in connect(self)
   1391 
-> 1392             super().connect()
   1393 

/usr/lib/python3.6/http/client.py in connect(self)
    935         self.sock = self._create_connection(
--> 936             (self.host,self.port), self.timeout, self.source_address)
    937         self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

/usr/lib/python3.6/socket.py in create_connection(address, timeout, source_address)
    703     err = None
--> 704     for res in getaddrinfo(host, port, 0, SOCK_STREAM):
    705         af, socktype, proto, canonname, sa = res

/usr/lib/python3.6/socket.py in getaddrinfo(host, port, family, type, proto, flags)
    744     addrlist = []
--> 745     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    746         af, socktype, proto, canonname, sa = res

gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
<ipython-input-6-00127debe92e> in <module>()
----> 1 dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)

/build/statsmodels-JytjB9/statsmodels-0.8.0/.pybuild/cpython3_3.6_statsmodels/build/statsmodels/datasets/utils.py in get_rdataset(dataname, package, cache)
    288                      "master/doc/"+package+"/rst/")
    289     cache = _get_cache(cache)
--> 290     data, from_cache = _get_data(data_base_url, dataname, cache)
    291     data = read_csv(data, index_col=0)
    292     data = _maybe_reset_index(data)

/build/statsmodels-JytjB9/statsmodels-0.8.0/.pybuild/cpython3_3.6_statsmodels/build/statsmodels/datasets/utils.py in _get_data(base_url, dataname, cache, extension)
    219     url = base_url + (dataname + ".%s") % extension
    220     try:
--> 221         data, from_cache = _urlopen_cached(url, cache)
    222     except HTTPError as err:
    223         if '404' in str(err):

/build/statsmodels-JytjB9/statsmodels-0.8.0/.pybuild/cpython3_3.6_statsmodels/build/statsmodels/datasets/utils.py in _urlopen_cached(url, cache)
    210     # not using the cache or didn't find it in cache
    211     if not from_cache:
--> 212         data = urlopen(url).read()
    213         if cache is not None:  # then put it in the cache
    214             _cache_it(data, cache_path)

/usr/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    221     else:
    222         opener = _opener
--> 223     return opener.open(url, data, timeout)
    224 
    225 def install_opener(opener):

/usr/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
    524             req = meth(req)
    525 
--> 526         response = self._open(req, data)
    527 
    528         # post-process response

/usr/lib/python3.6/urllib/request.py in _open(self, req, data)
    542         protocol = req.type
    543         result = self._call_chain(self.handle_open, protocol, protocol +
--> 544                                   '_open', req)
    545         if result:
    546             return result

/usr/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    502         for handler in handlers:
    503             func = getattr(handler, meth_name)
--> 504             result = func(*args)
    505             if result is not None:
    506                 return result

/usr/lib/python3.6/urllib/request.py in https_open(self, req)
   1359         def https_open(self, req):
   1360             return self.do_open(http.client.HTTPSConnection, req,
-> 1361                 context=self._context, check_hostname=self._check_hostname)
   1362 
   1363         https_request = AbstractHTTPHandler.do_request_

/usr/lib/python3.6/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1318                           encode_chunked=req.has_header('Transfer-encoding'))
   1319             except OSError as err: # timeout error
-> 1320                 raise URLError(err)
   1321             r = h.getresponse()
   1322         except:

URLError: <urlopen error [Errno -2] Name or service not known>
In [7]:
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-6efe9019dec7> in <module>()
----> 1 df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
      2 df.head()

NameError: name 'dta' is not defined

Fit the model:

In [8]:
mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print(res.summary())
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-e4e393898200> in <module>()
----> 1 mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
      2 res = mod.fit()
      3 print(res.summary())

NameError: name 'df' is not defined

Categorical variables

Looking at the summary printed above, notice that patsy determined that elements of Region were text strings, so it treated Region as a categorical variable. patsy's default is also to include an intercept, so we automatically dropped one of the Region categories.

If Region had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the C() operator:

In [9]:
res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-a108d6d9d250> in <module>()
----> 1 res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
      2 print(res.params)

NameError: name 'df' is not defined

Patsy's mode advanced features for categorical variables are discussed in: Patsy: Contrast Coding Systems for categorical variables

Operators

We have already seen that "~" separates the left-hand side of the model from the right-hand side, and that "+" adds new columns to the design matrix.

Removing variables

The "-" sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by:

In [10]:
res = ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()
print(res.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-5ae2b27f2e83> in <module>()
----> 1 res = ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()
      2 print(res.params)

NameError: name 'df' is not defined

Multiplicative interactions

":" adds a new column to the design matrix with the interaction of the other two columns. "*" will also include the individual columns that were multiplied together:

In [11]:
res1 = ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
res2 = ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
print(res1.params, '\n')
print(res2.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-a6e365e8c2fd> in <module>()
----> 1 res1 = ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
      2 res2 = ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
      3 print(res1.params, '\n')
      4 print(res2.params)

NameError: name 'df' is not defined

Many other things are possible with operators. Please consult the patsy docs to learn more.

Functions

You can apply vectorized functions to the variables in your model:

In [12]:
res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
print(res.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-a5e76c93b840> in <module>()
----> 1 res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
      2 print(res.params)

NameError: name 'df' is not defined

Define a custom function:

In [13]:
def log_plus_1(x):
    return np.log(x) + 1.
res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()
print(res.params)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-53429952a2ef> in <module>()
      1 def log_plus_1(x):
      2     return np.log(x) + 1.
----> 3 res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()
      4 print(res.params)

NameError: name 'df' is not defined

Any function that is in the calling namespace is available to the formula.

Using formulas with models that do not (yet) support them

Even if a given statsmodels function does not support formulas, you can still use patsy's formula language to produce design matrices. Those matrices can then be fed to the fitting function as endog and exog arguments.

To generate numpy arrays:

In [14]:
import patsy
f = 'Lottery ~ Literacy * Wealth'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
print(y[:5])
print(X[:5])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-14-82bb8d9bb9aa> in <module>()
      1 import patsy
      2 f = 'Lottery ~ Literacy * Wealth'
----> 3 y,X = patsy.dmatrices(f, df, return_type='dataframe')
      4 print(y[:5])
      5 print(X[:5])

NameError: name 'df' is not defined

To generate pandas data frames:

In [15]:
f = 'Lottery ~ Literacy * Wealth'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
print(y[:5])
print(X[:5])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-15-c6aa9cb503d4> in <module>()
      1 f = 'Lottery ~ Literacy * Wealth'
----> 2 y,X = patsy.dmatrices(f, df, return_type='dataframe')
      3 print(y[:5])
      4 print(X[:5])

NameError: name 'df' is not defined
In [16]:
print(sm.OLS(y, X).fit().summary())
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-16-47cf6a136406> in <module>()
----> 1 print(sm.OLS(y, X).fit().summary())

NameError: name 'y' is not defined