Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
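
The noise draw above is unseeded, so the exact numbers in the output below will vary from run to run. To reproduce a particular run, one could seed NumPy's global generator before drawing the noise (an illustrative addition, with an arbitrary seed):

np.random.seed(12345)  # any fixed seed, placed before the np.random.normal call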

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                     775.3
Date:                Fri, 16 Aug 2019   Prob (F-statistic):           2.25e-39
Time:                        03:56:30   Log-Likelihood:                -2.3511
No. Observations:                  50   AIC:                             12.70
Df Residuals:                      46   BIC:                             20.35
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.2024      0.090     57.725      0.000       5.021       5.384
x1             0.4719      0.014     33.952      0.000       0.444       0.500
x2             0.4240      0.055      7.760      0.000       0.314       0.534
x3            -0.0175      0.001    -14.351      0.000      -0.020      -0.015
==============================================================================
Omnibus:                        1.296   Durbin-Watson:                   1.869
Prob(Omnibus):                  0.523   Jarque-Bera (JB):                1.095
Skew:                           0.356   Prob(JB):                        0.578
Kurtosis:                       2.865   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
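
Because the data were simulated with known coefficients, we can sanity-check the fit by comparing the estimates against the true beta (a minimal check, reusing the variables defined above):

# The estimates should be close to the betas used in the simulation.
for name, est, true in zip(["const", "x1", "x2", "x3"], olsres.params, beta):
    print("{:>5s}  estimated {: .4f}   true {: .4f}".format(name, est, true))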

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.76455782  5.194036    5.5900289   5.92942963  6.19747048  6.39014918
  6.51488646  6.58930683  6.63834296  6.6901393   6.77142799  6.9031369
  7.09695121  7.35339357  7.66173823  8.00177326  8.34712131  8.66957332
  8.9437232   9.15114199  9.28340565  9.34347917  9.34522934  9.31114657
  9.2686497   9.24558123  9.26563282  9.34445285  9.48707542  9.68709328
  9.92771049 10.18450199 10.42942654 10.6354338  10.78090866 10.85322254
 10.8508082  10.78341724 10.67051797 10.5380968  10.41438873 10.32523845
 10.28985412 10.31765151 10.40670704 10.54407379 10.70790867 10.87106117
 11.00553408 11.08708342]
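
As an aside, predict with no argument falls back to the estimation sample, so the in-sample predictions can also be obtained without passing X explicitly; they coincide with the fittedvalues attribute (a minimal check):

# With no exog given, predict() reuses the design matrix from estimation.
assert np.allclose(olsres.predict(), ypred)
assert np.allclose(olsres.fittedvalues, ypred)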

Create a new sample of explanatory variables Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.09151366 10.98370926 10.78029735 10.51916905 10.25020241 10.02305052
  9.87498469  9.82176916  9.85380144  9.93846335]
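
Point predictions can be supplemented with uncertainty estimates via get_prediction (a sketch; summary_frame tabulates the mean prediction with its confidence interval and the wider interval for a new observation):

# get_prediction carries standard errors along with the point predictions.
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean CI, and observation interval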

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; i.e., we don't want any formula-expansion magic from using **2.
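
Since statsmodels formulas are parsed by patsy, we can check this directly: inside I(...) the ** is plain NumPy exponentiation, and the resulting design-matrix column matches the array built by hand earlier (a minimal check):

import patsy

# Column order: Intercept, x1, np.sin(x1), I((x1 - 5) ** 2)
dm = np.asarray(patsy.dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data))
assert np.allclose(dm[:, 3], (x1 - 5) ** 2)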

In [8]:
res.params
Out[8]:
Intercept           5.202416
x1                  0.471916
np.sin(x1)          0.423986
I((x1 - 5) ** 2)   -0.017514
dtype: float64
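
The formula-based fit reproduces the array-based estimates from above, in the same column order (intercept, x1, sine term, quadratic term):

# Same coefficients as the array interface, up to floating-point noise.
assert np.allclose(res.params.values, olsres.params)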

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are computed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    11.091514
1    10.983709
2    10.780297
3    10.519169
4    10.250202
5    10.023051
6     9.874985
7     9.821769
8     9.853801
9     9.938463
dtype: float64
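
And the formula-based out-of-sample predictions agree with the array-based ones computed earlier (a closing consistency check):

# Formula and array interfaces give identical out-of-sample predictions.
assert np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)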