Patsy: Contrast Coding Systems for categorical variables
Note
This document is based heavily on this excellent resource from UCLA.
A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses.
In fact, dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model’s intercept. By comparison, a set of contrasts for a categorical variable with K levels is a set of K-1 functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. Dummy coding isn’t wrong per se. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients, as in ANOVA. Linear regression models do not assume independence of the coefficients, so dummy coding is often the only coding taught in this context.
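The dependence between the dummy variables and the intercept can be checked numerically. The following is a minimal sketch using NumPy only; the level codes are hypothetical and chosen purely for illustration, not taken from the hsb2 data.

```python
import numpy as np

# Hypothetical zero-based level codes for six observations of a 4-level factor.
codes = np.array([0, 1, 2, 3, 1, 2])
K = 4

# Full set of K indicator (dummy) columns: one column per level.
full_dummies = np.eye(K)[codes]

# Each row has exactly one 1, so the K columns add up to the intercept column.
intercept = np.ones((len(codes), 1))
rows_sum_to_one = np.allclose(full_dummies.sum(axis=1), 1.0)

# Intercept plus all K dummies is rank deficient (one column is redundant).
X_full = np.hstack([intercept, full_dummies])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping the dummy for the reference level restores full column rank.
X_treatment = np.hstack([intercept, full_dummies[:, 1:]])
rank_treatment = np.linalg.matrix_rank(X_treatment)

print(rows_sum_to_one, X_full.shape[1], rank_full)
print(X_treatment.shape[1], rank_treatment)
```

This is why only K-1 dummy columns enter the regression alongside an intercept.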
To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First let’s load the data.
Example Data
In [1]: import pandas
In [2]: url = 'http://www.ats.ucla.edu/stat/data/hsb2.csv'
In [3]: hsb2 = pandas.read_table(url, delimiter=",")
It will be instructive to look at the mean of the dependent variable, write, for each level of race (1 = Hispanic, 2 = Asian, 3 = African American, and 4 = Caucasian).
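One way to compute these level means is a pandas groupby. The sketch below uses a small hypothetical data frame with made-up scores in place of hsb2; only the column names race and write come from the text.

```python
import pandas as pd

# Hypothetical stand-in for hsb2: race codes 1-4 and made-up writing scores.
toy = pd.DataFrame({
    "race": [1, 1, 2, 2, 3, 3, 4, 4],
    "write": [44.0, 48.0, 56.0, 60.0, 46.0, 50.0, 52.0, 56.0],
})

# Mean of the dependent variable within each level of the factor.
level_means = toy.groupby("race")["write"].mean()
print(level_means)
```

With the real data loaded into hsb2, the same call would be hsb2.groupby("race")["write"].mean().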
Treatment (Dummy) Coding
Dummy coding is likely the best-known coding scheme. It compares each level of the categorical variable to a base reference level, and the intercept is the mean of that reference level. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be
In [4]: from patsy.contrasts import Treatment
In [5]: levels = [1,2,3,4]
In [6]: contrast = Treatment(reference=0).code_without_intercept(levels)
In [7]: print(contrast.matrix)
[[0. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Here we used reference=0, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept. To be explicit, let’s look at how this would encode the race variable.
In [8]: contrast.matrix[hsb2.race-1, :][:20]
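The row lookup used here can be sketched with plain NumPy. The race codes below are hypothetical, and the matrix is typed in by hand with the same values as the Treatment matrix printed above.

```python
import numpy as np

# Treatment contrast matrix for 4 levels with the first level as reference
# (same values as the matrix printed above).
contrast_matrix = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Hypothetical race codes (1-4) for five observations.
race = np.array([4, 1, 3, 3, 2])

# Subtracting 1 turns the codes into zero-based row indices, so each
# observation picks up the contrast row of its level.
encoded = contrast_matrix[race - 1, :]
print(encoded)
```

An observation from the reference level gets a row of zeros, and every other level gets a single 1 in its own column.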
This is a bit of a trick, as the race codes conveniently map to zero-based indices once 1 is subtracted. When the levels do not map this way, the conversion happens under the hood, so this indexing won’t work in general, but it is nonetheless a useful exercise to fix ideas. The following illustrates the output using the three contrasts above
In [9]: from statsmodels.formula.api import ols
In [10]: mod = ols("write ~ C(race, Treatment)", data=hsb2)
In [11]: res = mod.fit()
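Because each Treatment column is an indicator for one non-reference level, the fitted intercept should equal the reference-level mean, and each coefficient should equal the difference between a level mean and the reference mean. The following is a minimal sketch of that relationship using NumPy least squares on hypothetical scores (not the hsb2 data, and not the statsmodels fit itself).

```python
import numpy as np

# Hypothetical data: race codes 1-4 and made-up writing scores.
race = np.array([1, 1, 2, 2, 3, 3, 4, 4])
write = np.array([44.0, 48.0, 56.0, 60.0, 46.0, 50.0, 52.0, 56.0])

# Treatment contrast matrix with the first level as the reference.
contrast_matrix = np.vstack([np.zeros((1, 3)), np.eye(3)])

# Design matrix: intercept column plus the three treatment-coded columns.
X = np.column_stack([np.ones(len(race)), contrast_matrix[race - 1]])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, write, rcond=None)

# Level means, for comparison with the fitted coefficients.
means = np.array([write[race == k].mean() for k in (1, 2, 3, 4)])

print(coef)
print(means[0], means[1:] - means[0])
```

The two printed lines should agree up to floating-point error: the intercept matches the mean of level 1, and the remaining coefficients match the mean differences from level 1.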