Contingency tables¶
Statsmodels supports a variety of approaches for analyzing contingency tables, including methods for assessing independence, symmetry, homogeneity, and methods for working with collections of tables from a stratified population.
The methods described here are mainly for two-way tables. Multi-way tables can be analyzed using log-linear models. Statsmodels does not currently have a dedicated API for log-linear modeling, but Poisson regression in statsmodels.genmod.GLM can be used for this purpose.
A contingency table is a multi-way table that describes a data set in which each observation belongs to one category for each of several variables. For example, if there are two variables, one with \(r\) levels and one with \(c\) levels, then we have a \(r \times c\) contingency table. The table can be described in terms of the number of observations that fall into a given cell of the table, e.g. \(T_{ij}\) is the number of observations that have level \(i\) for the first variable and level \(j\) for the second variable. Note that each variable must have a finite number of levels (or categories), which can be either ordered or unordered. In different contexts, the variables defining the axes of a contingency table may be called categorical variables or factor variables. They may be either nominal (if their levels are unordered) or ordinal (if their levels are ordered).
The underlying population for a contingency table is described by a distribution table \(P_{i, j}\). The elements of \(P\) are probabilities, and the sum of all elements in \(P\) is 1. Methods for analyzing contingency tables use the data in \(T\) to learn about properties of \(P\).
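As a minimal illustration of the relationship between \(T\) and \(P\) (with made-up counts), the observed cell proportions \(T / n\) are the natural estimate of the distribution table \(P\):

```python
import numpy as np

# Hypothetical 2x3 table of cell counts T for two categorical variables
T = np.array([[10, 20, 5],
              [15, 30, 20]])

n = T.sum()       # total number of observations
P_hat = T / n     # cell proportions estimate the distribution table P

# The estimated probabilities sum to one by construction
print(P_hat.sum())
```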
statsmodels.stats.Table is the most basic class for working with contingency tables. We can create a Table object directly from any rectangular array-like object containing the contingency table cell counts:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import statsmodels.api as sm
In [4]: df = sm.datasets.get_rdataset("Arthritis", "vcd", cache=True).data
In [5]: tab = pd.crosstab(df['Treatment'], df['Improved'])
In [6]: tab = tab.loc[:, ["None", "Some", "Marked"]]
In [7]: table = sm.stats.Table(tab)
Alternatively, we can pass the raw data and let the Table class construct the array of cell counts for us:
In [8]: table = sm.stats.Table.from_data(df[["Treatment", "Improved"]])
Independence¶
Independence is the property that the row and column factors occur independently. Association is the lack of independence. If the joint distribution is independent, it can be written as the outer product of the row and column marginal distributions:
\(P_{ij} = \left(\sum_k P_{ik}\right) \cdot \left(\sum_k P_{kj}\right) \quad \forall i, j\)
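A quick numerical check of this factorization, using a made-up independent distribution: if \(P\) is the outer product of a row distribution and a column distribution, each cell equals the product of its row and column margins.

```python
import numpy as np

# Hypothetical marginal distributions for the row and column factors
row = np.array([0.3, 0.7])
col = np.array([0.2, 0.5, 0.3])

P = np.outer(row, col)   # an independent joint distribution

# Each cell equals the product of its row and column margins
row_margin = P.sum(axis=1)
col_margin = P.sum(axis=0)
check = np.outer(row_margin, col_margin)
print(np.allclose(P, check))   # True
```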
We can obtain the best-fitting independent distribution for our observed data, and then view residuals which identify particular cells that most strongly violate independence:
In [9]: print(table.table_orig)
Improved   Marked  None  Some
Treatment
Placebo         7    29     7
Treated        21    13     7

In [10]: print(table.fittedvalues)
Improved      Marked  None      Some
Treatment
Placebo    14.333333  21.5  7.166667
Treated    13.666667  20.5  6.833333

In [11]: print(table.resid_pearson)
Improved     Marked      None      Some
Treatment
Placebo   -1.936992  1.617492 -0.062257
Treated    1.983673 -1.656473  0.063758
In this example, compared to a sample from a population in which the rows and columns are independent, we have too many observations in the placebo/no improvement and treatment/marked improvement cells, and too few observations in the placebo/marked improvement and treated/no improvement cells. This reflects the apparent benefits of the treatment.
If the rows and columns of a table are unordered (i.e. are nominal factors), then the most common approach for formally assessing independence is using Pearson’s \(\chi^2\) statistic. It’s often useful to look at the cell-wise contributions to the \(\chi^2\) statistic to see where the evidence for dependence is coming from.
In [12]: rslt = table.test_nominal_association()

In [13]: print(rslt.pvalue)
0.0014626434089526352

In [14]: print(table.chi2_contribs)
Improved     Marked      None      Some
Treatment
Placebo    3.751938  2.616279  0.003876
Treated    3.934959  2.743902  0.004065
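The cell-wise contributions can be reproduced directly from the counts: the expected count for a cell under independence is (row total × column total) / n, and each contribution is \((O - E)^2 / E\). A quick check using the Arthritis counts shown above:

```python
import numpy as np

# Observed counts from table_orig (rows: Placebo, Treated;
# columns: Marked, None, Some)
O = np.array([[7.0, 29.0, 7.0],
              [21.0, 13.0, 7.0]])

n = O.sum()
# Expected counts under independence: outer product of margins over n
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n
contribs = (O - E) ** 2 / E   # cell-wise chi^2 contributions
chi2 = contribs.sum()         # Pearson chi^2 statistic

print(np.round(contribs, 6))
print(round(chi2, 4))
```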
For tables with ordered row and column factors, we can use the linear-by-linear association test to obtain more power against alternative hypotheses that respect the ordering. The test statistic for the linear-by-linear association test is
\(\sum_{i,j} r_i c_j T_{ij}\)
where \(r_i\) and \(c_j\) are row and column scores. Often these scores are set to the sequences 0, 1, …. This gives the ‘Cochran-Armitage trend test’.
In [15]: rslt = table.test_ordinal_association()
In [16]: print(rslt.pvalue)
0.023644578093923983
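To see what the test is built on, here is the raw (unstandardized) form of the statistic with default scores 0, 1, 2 on the ordered categories; note that statsmodels centers and scales this quantity before computing the p-value, so the number below is not the test statistic itself.

```python
import numpy as np

# Counts with ordered columns None < Some < Marked (rows: Placebo, Treated)
T = np.array([[29, 7, 7],
              [13, 7, 21]])

r = np.arange(T.shape[0])   # row scores 0, 1
c = np.arange(T.shape[1])   # column scores 0, 1, 2

# Raw linear-by-linear quantity: sum_ij r_i c_j T_ij
stat = (r[:, None] * c[None, :] * T).sum()
print(stat)
```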
We can assess the association in a \(r\times c\) table by constructing a series of \(2\times 2\) tables and calculating their odds ratios. There are two ways to do this. The local odds ratios construct \(2\times 2\) tables from adjacent row and column categories.
In [17]: print(table.local_oddsratios)
Improved Marked None Some
Treatment
Placebo 0.149425 2.230769 NaN
Treated NaN NaN NaN
In [18]: taloc = sm.stats.Table2x2(np.asarray([[7, 29], [21, 13]]))
In [19]: print(taloc.oddsratio)
0.14942528735632185
In [20]: taloc = sm.stats.Table2x2(np.asarray([[29, 7], [13, 7]]))
In [21]: print(taloc.oddsratio)
2.230769230769231
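Each local odds ratio is simply the cross-product ratio of a 2x2 block of adjacent cells, so the two values above can be checked by hand:

```python
# First 2x2 block: columns Marked/None (rows: Placebo, Treated)
a, b, c, d = 7, 29, 21, 13
or1 = a * d / (b * c)

# Second 2x2 block: columns None/Some
e, f, g, h = 29, 7, 13, 7
or2 = e * h / (f * g)

print(round(or1, 6), round(or2, 6))
```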
The cumulative odds ratios construct \(2\times 2\) tables by dichotomizing the row and column factors at each possible point.
In [22]: print(table.cumulative_oddsratios)
Improved Marked None Some
Treatment
Placebo 0.185185 1.058824 NaN
Treated NaN NaN NaN
In [23]: tab1 = np.asarray([[7, 29 + 7], [21, 13 + 7]])
In [24]: tacum = sm.stats.Table2x2(tab1)
In [25]: print(tacum.oddsratio)
0.18518518518518517
In [26]: tab1 = np.asarray([[7 + 29, 7], [21 + 13, 7]])
In [27]: tacum = sm.stats.Table2x2(tab1)
In [28]: print(tacum.oddsratio)
1.0588235294117647
A mosaic plot is a graphical approach to informally assessing dependence in two-way tables.
from statsmodels.graphics.mosaicplot import mosaic
fig, _ = mosaic(df, ["Treatment", "Improved"])
Symmetry and homogeneity¶
Symmetry is the property that \(P_{i, j} = P_{j, i}\) for every \(i\) and \(j\). Homogeneity is the property that the marginal distribution of the row factor and the column factor are identical, meaning that
\(\sum_j P_{ij} = \sum_j P_{ji} \quad \forall i\)
Note that for these properties to be applicable the table \(P\) (and \(T\)) must be square, and the row and column categories must be identical and must occur in the same order.
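For intuition, the usual test of symmetry (Bowker's test, which is what SquareTable reports by default) compares each pair of off-diagonal cells. A minimal sketch of the statistic with made-up counts:

```python
import numpy as np

# Hypothetical square contingency table (same categories on rows and columns)
T = np.array([[20, 10, 5],
              [6, 30, 9],
              [4, 8, 25]])

# Bowker's statistic: sum over i < j of (T_ij - T_ji)^2 / (T_ij + T_ji)
stat = 0.0
k = T.shape[0]
for i in range(k):
    for j in range(i + 1, k):
        stat += (T[i, j] - T[j, i]) ** 2 / (T[i, j] + T[j, i])

print(round(stat, 6))
```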
To illustrate, we load a data set, create a contingency table, and calculate the row and column margins. The Table class contains methods for analyzing \(r \times c\) contingency tables. The data set loaded below contains assessments of visual acuity in people’s left and right eyes. We first load the data and create a contingency table.
In [29]: df = sm.datasets.get_rdataset("VisualAcuity", "vcd", cache=True).data

In [30]: df = df.loc[df.gender == "female", :]

In [31]: tab = df.set_index(['left', 'right'])

In [32]: del tab["gender"]

In [33]: tab = tab.unstack()

In [34]: tab.columns = tab.columns.get_level_values(1)

In [35]: print(tab)
Next we create a SquareTable object from the contingency table.
In [36]: sqtab = sm.stats.SquareTable(tab)

In [37]: row, col = sqtab.marginal_probabilities

In [38]: print(row)

In [39]: print(col)
The summary method prints results for the symmetry and homogeneity testing procedures.
In [40]: print(sqtab.summary())
If we had the individual case records in a dataframe called data, we could also perform the same analysis by passing the raw data using the SquareTable.from_data class method.
sqtab = sm.stats.SquareTable.from_data(data[['left', 'right']])
print(sqtab.summary())
A single 2x2 table¶
Several methods for working with individual 2x2 tables are provided in the sm.stats.Table2x2 class. The summary method displays several measures of association between the rows and columns of the table.
In [41]: table = np.asarray([[35, 21], [25, 58]])
In [42]: t22 = sm.stats.Table2x2(table)
In [43]: print(t22.summary())
               Estimate   SE    LCB    UCB  p-value
---------------------------------------------------
Odds ratio        3.867        1.890  7.912   0.000
Log odds ratio    1.352 0.365  0.636  2.068   0.000
Risk ratio        2.075        1.411  3.051   0.000
Log risk ratio    0.730 0.197  0.345  1.115   0.000
---------------------------------------------------
Note that the risk ratio is not symmetric so different results will be obtained if the transposed table is analyzed.
In [44]: table = np.asarray([[35, 21], [25, 58]])
In [45]: t22 = sm.stats.Table2x2(table.T)
In [46]: print(t22.summary())
               Estimate   SE    LCB    UCB  p-value
---------------------------------------------------
Odds ratio        3.867        1.890  7.912   0.000
Log odds ratio    1.352 0.365  0.636  2.068   0.000
Risk ratio        2.194        1.436  3.354   0.000
Log risk ratio    0.786 0.216  0.362  1.210   0.000
---------------------------------------------------
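The asymmetry is easy to see from the definition: the risk ratio compares the row-wise proportions of the first column, so transposing the table changes which proportions are compared. Checking the two estimates by hand:

```python
# Table from the example above: rows are groups, first column is the "event"
a, b = 35, 21
c, d = 25, 58

rr = (a / (a + b)) / (c / (c + d))     # risk ratio of the original table
rr_t = (a / (a + c)) / (b / (b + d))   # risk ratio of the transposed table

print(round(rr, 3), round(rr_t, 3))
```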
Stratified 2x2 tables¶
Stratification occurs when we have a collection of contingency tables defined by the same row and column factors. In the example below, we have a collection of 2x2 tables reflecting the joint distribution of smoking and lung cancer in each of several regions of China. It is possible that the tables all have a common odds ratio, even while the marginal probabilities vary among the strata. The ‘Breslow-Day’ procedure tests whether the data are consistent with a common odds ratio. It appears below as the Test of constant OR. The Mantel-Haenszel procedure tests whether this common odds ratio is equal to one. It appears below as the Test of OR=1. It is also possible to estimate the common odds and risk ratios and obtain confidence intervals for them. The summary method displays all of these results. Individual results can be obtained from the class methods and attributes.
In [47]: data = sm.datasets.china_smoking.load()
In [48]: mat = np.asarray(data.data)
In [49]: tables = [np.reshape(list(x)[1:], (2, 2)) for x in mat]
In [50]: st = sm.stats.StratifiedTable(tables)
In [51]: print(st.summary())
                  Estimate   LCB    UCB
-----------------------------------------
Pooled odds          2.174  1.984  2.383
Pooled log odds      0.777  0.685  0.868
Pooled risk ratio    1.519

                 Statistic  P-value
-----------------------------------
Test of OR=1       280.138    0.000
Test constant OR     5.200    0.636
-----------------------
Number of tables     8
Min n              213
Max n             2900
Avg n             1052
Total n           8419
-----------------------
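The Mantel-Haenszel pooled odds ratio has a simple closed form, \(\sum_i (a_i d_i / n_i) / \sum_i (b_i c_i / n_i)\), where each stratum's 2x2 table has cells \([[a_i, b_i], [c_i, d_i]]\) and total \(n_i\). A sketch with two made-up strata:

```python
import numpy as np

# Two hypothetical 2x2 strata with cells [[a, b], [c, d]]
tables = [np.array([[10, 5], [4, 11]]),
          np.array([[8, 7], [3, 12]])]

num = sum(t[0, 0] * t[1, 1] / t.sum() for t in tables)   # sum a_i d_i / n_i
den = sum(t[0, 1] * t[1, 0] / t.sum() for t in tables)   # sum b_i c_i / n_i
or_mh = num / den                                        # pooled odds ratio

print(round(or_mh, 4))
```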
Module Reference¶
Table(table[, shift_zeros])
    Analyses that can be performed on a two-way contingency table.
Table2x2(table[, shift_zeros])
    Analyses that can be performed on a 2x2 contingency table.
SquareTable(table[, shift_zeros])
    Methods for analyzing a square contingency table.
StratifiedTable(tables[, shift_zeros])
    Analyses for a collection of 2x2 contingency tables.
mcnemar(table[, exact, correction])
    McNemar test of homogeneity.
cochrans_q(x[, return_object])
    Cochran’s Q test for identical binomial proportions.