Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
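
Since y depends on a random noise draw, the numbers in the cells below will differ from run to run. A minimal sketch, assuming you want reproducible output: fix NumPy's seed before drawing the noise (the seed value here is arbitrary).

np.random.seed(12345)  # arbitrary seed; any fixed value makes the draws repeatable
y = y_true + sig * np.random.normal(size=nsample)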

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     947.5
Date:                Fri, 08 Mar 2019   Prob (F-statistic):           2.43e-41
Time:                        05:02:05   Log-Likelihood:               -0.63373
No. Observations:                  50   AIC:                             9.267
Df Residuals:                      46   BIC:                             16.92
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.8113      0.087     55.250      0.000       4.636       4.987
x1             0.5328      0.013     39.671      0.000       0.506       0.560
x2             0.5486      0.053     10.391      0.000       0.442       0.655
x3            -0.0222      0.001    -18.802      0.000      -0.025      -0.020
==============================================================================
Omnibus:                        1.373   Durbin-Watson:                   2.057
Prob(Omnibus):                  0.503   Jarque-Bera (JB):                0.643
Skew:                          -0.199   Prob(JB):                        0.725
Kurtosis:                       3.387   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.25701866  4.77903971  5.25789697  5.6636919   5.97731617  6.19359113
  6.32211864  6.38570354  6.41660682  6.45124528  6.52420836  6.66257512
  6.88146534  7.18155527  7.5489666   7.95754661  8.37316495  8.75932122
  9.08314184  9.32078109  9.46133893  9.50865178  9.48066228  9.40647132
  9.32155685  9.26194496  9.25829091  9.33084252  9.48611356  9.71581351
  9.9982097  10.3016979  10.58999484 10.82809941 10.98804328 11.05348598
 11.02239963 10.90740209 10.73368388 10.53486948 10.34749288 10.20499519
 10.1322303  10.14138155 10.22995995 10.38121316 10.56687786 10.75182328
 10.89982317 10.9795076 ]
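
predict returns point predictions only. If you also want standard errors and confidence or prediction intervals, the results object exposes get_prediction; a minimal sketch (summary_frame collects the mean, its standard error, and the interval bounds):

pred = olsres.get_prediction(X)         # in-sample prediction with uncertainty
frame = pred.summary_frame(alpha=0.05)  # mean, mean_se, *_ci_lower/*_ci_upper columns
print(frame.head())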

Create a new sample of explanatory variables Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.95381602 10.78314684 10.48901432 10.12044674  9.74198259  9.41786924
  9.19633299  9.09777161  9.10976017  9.19009318]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
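
To visualize how uncertain the out-of-sample predictions are, one can shade the 95% prediction interval from get_prediction on the same axes. A sketch, assuming the ax from the cell above is still in scope:

prednew = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
ax.fill_between(x1n, prednew["obs_ci_lower"], prednew["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best");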

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any expansion magic from using **2: inside I() the expression is evaluated as plain Python arithmetic.
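
To see what I() protects against, note that in patsy formulas ** acts on model terms rather than as plain exponentiation. A short sketch using patsy directly (the formula backend statsmodels uses) to inspect the design matrix it builds:

from patsy import dmatrix

# Inside I(...) the expression is evaluated as ordinary Python arithmetic,
# so the second column is literally (x1 - 5)**2 for each observation.
print(dmatrix("I((x1 - 5) ** 2)", {"x1": x1}))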

In [8]:
res.params
Out[8]:
Intercept           4.811292
x1                  0.532792
np.sin(x1)          0.548606
I((x1 - 5) ** 2)   -0.022171
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are built for us automatically

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.953816
1    10.783147
2    10.489014
3    10.120447
4     9.741983
5     9.417869
6     9.196333
7     9.097772
8     9.109760
9     9.190093
dtype: float64
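
The same uncertainty machinery works through the formula interface, with the transforms applied to the new data automatically. A minimal sketch:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))  # point predictions plus interval bounds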