Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25  # noise standard deviation
x1 = np.linspace(0, 20, nsample)
# Design matrix: x1, sin(x1), a quadratic in x1, plus a constant
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]  # true coefficients
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     957.1
Date:                Mon, 02 Jul 2018   Prob (F-statistic):           1.93e-41
Time:                        14:17:59   Log-Likelihood:                0.97002
No. Observations:                  50   AIC:                             6.060
Df Residuals:                      46   BIC:                             13.71
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.8680      0.084     57.723      0.000       4.698       5.038
x1             0.5241      0.013     40.294      0.000       0.498       0.550
x2             0.5565      0.051     10.885      0.000       0.454       0.659
x3            -0.0223      0.001    -19.497      0.000      -0.025      -0.020
==============================================================================
Omnibus:                        2.297   Durbin-Watson:                   1.932
Prob(Omnibus):                  0.317   Jarque-Bera (JB):                2.209
Skew:                          -0.467   Prob(JB):                        0.331
Kurtosis:                       2.565   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
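
The quantities in the summary table are also available programmatically from the results object, so there is no need to parse the printed text; for example, the coefficient estimates and their 95% confidence intervals:

print(olsres.params)                 # coefficient estimates
print(olsres.conf_int(alpha=0.05))   # 95% confidence intervals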

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.31133434  4.83331633  5.31158539  5.71581021  6.02660593  6.23871893
  6.36189006  6.4192542   6.44353927  6.47168917  6.53879394  6.67232459
  6.8876197   7.18536535  7.55148234  7.95943942  8.37461228  8.75997223
  9.08216989  9.31701442  9.45344799  9.49536259  9.46096053  9.37976388
  9.28776409  9.22150867  9.21209656  9.28006864  9.43203296  9.659579
  9.94065954 10.24321295 10.53043079 10.76680538 10.92396359 10.98532845
 10.94884273 10.82730691 10.64627611 10.43986145 10.24512572 10.09599401
 10.01767996 10.02254321 10.10805878 10.25723185 10.44139002 10.6248941
 10.77099347 10.8478642 ]
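
As a side note, the fitted model stores its design matrix, so calling predict with no argument returns the same in-sample fitted values:

# predict() without an exog argument reuses the original design matrix,
# so the result matches ypred computed above.
assert np.allclose(olsres.predict(), ypred)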

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.81704419 10.63926166 10.33634226  9.95802388  9.56977909  9.23678515
  9.00796637  8.90401571  8.91232845  8.99008822]
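
Out-of-sample point predictions hide the estimation uncertainty. In recent statsmodels versions, the results object's get_prediction method also returns standard errors and confidence/prediction intervals; a minimal sketch:

# get_prediction returns point predictions together with uncertainty
# estimates; summary_frame collects them in a DataFrame.
pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # 95% intervals
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])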

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula expansion magic applied to **2; the expression inside I() is evaluated as ordinary Python.
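
To see exactly which columns the formula produces, the design matrix can be built directly with patsy, the formula engine that statsmodels uses (a quick illustration, assuming the data dict from above):

import patsy

# Build the design matrix for the same formula and inspect its columns:
# Intercept, x1, np.sin(x1), and I((x1 - 5) ** 2).
dm = patsy.dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data)
print(dm.design_info.column_names)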

In [8]:
res.params
Out[8]:
Intercept           4.867960
x1                  0.524072
np.sin(x1)          0.556546
I((x1 - 5) ** 2)   -0.022265
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.817044
1    10.639262
2    10.336342
3     9.958024
4     9.569779
5     9.236785
6     9.007966
7     8.904016
8     8.912328
9     8.990088
dtype: float64
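
These values match the out-of-sample predictions computed earlier from the manually constructed Xnew, which can be checked directly:

# The formula applies the same transformations to x1n, so the two
# prediction paths agree.
assert np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)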