Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
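
As a small aside (not in the original notebook), seeding NumPy's random number generator before drawing the noise makes the simulated sample, and therefore the numbers printed below, reproducible:

# Optional reproducibility sketch: fix the seed before generating the noise.
# The seed value is arbitrary and purely illustrative.
np.random.seed(12345)
y = y_true + sig * np.random.normal(size=nsample)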

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.989
Model:                            OLS   Adj. R-squared:                  0.988
Method:                 Least Squares   F-statistic:                     1395.
Date:                Thu, 05 Jan 2023   Prob (F-statistic):           3.72e-45
Time:                        20:34:52   Log-Likelihood:                 12.089
No. Observations:                  50   AIC:                            -16.18
Df Residuals:                      46   BIC:                            -8.531
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0938      0.068     75.444      0.000       4.958       5.230
x1             0.4918      0.010     47.229      0.000       0.471       0.513
x2             0.4363      0.041     10.658      0.000       0.354       0.519
x3            -0.0197      0.001    -21.541      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        1.210   Durbin-Watson:                   2.242
Prob(Omnibus):                  0.546   Jarque-Bera (JB):                1.056
Skew:                           0.152   Prob(JB):                        0.590
Kurtosis:                       2.356   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
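
Because the data are simulated, we can sanity-check the fit by comparing the estimated coefficients with the true beta used above (a quick check, not part of the original notebook):

# Estimated coefficients for [const, x1, sin(x1), (x1-5)**2] versus the true
# values used in the simulation; with sig = 0.25 they should be close.
print(olsres.params)
print(beta)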

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.60141583  5.05242307  5.4684162   5.82561773  6.1088313   6.31393849
  6.4485754   6.53087809  6.58650286  6.64441102  6.73211059  6.87113685
  7.07351412  7.33978023  7.65889813  8.01006925  8.36615066  8.69811457
  8.97981751  9.19229564  9.32688042  9.38662287  9.38579229  9.34753182
  9.30005613  9.27201579  9.2877901   9.36348161  9.50427054  9.70356366
  9.94407758 10.20067829 10.44451069 10.64773935 10.7881218  10.85266292
 10.83974998 10.75941763 10.63169921 10.48333535 10.34338056 10.23842964
 10.18824824 10.20252542 10.27928176 10.40519458 10.55778681 10.70911993
 10.83038414 10.8966318 ]
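
For the original design matrix, predict simply reproduces the fitted values stored on the results object, so the following is a minimal consistency check (assuming the standard OLSResults attributes):

# In-sample predictions coincide with the regression's fitted values.
print(np.allclose(ypred, olsres.fittedvalues))  # expected: True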

Create a new sample of explanatory variables Xnew, predict, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.87878765 10.74461563 10.51122546 10.217608    9.91508897  9.65476257
  9.4749819   9.39196871  9.39584182  9.45303634]
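
predict only returns point predictions. If we also want a measure of uncertainty for the out-of-sample values, get_prediction returns an object whose summary_frame contains confidence and prediction intervals (a sketch of the standard statsmodels API; exact column names may differ across versions):

# Point predictions plus 95% confidence and prediction intervals for Xnew.
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))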

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7ff286d5f910>
[Figure: observed data, true curve, and OLS in- and out-of-sample predictions]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2.
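
As an illustration (hypothetical, not from the original notebook), wrapping an expression in I() makes patsy evaluate it as ordinary NumPy arithmetic, so I(x1**2) adds a squared column instead of being parsed with the formula meaning of **:

# Hypothetical example: without I(), ** is the patsy interaction operator,
# so x1**2 would not create a squared regressor.
res_sq = ols("y ~ x1 + I(x1**2)", data=data).fit()
print(res_sq.params)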

[9]:
res.params
[9]:
Intercept           5.093769
x1                  0.491787
np.sin(x1)          0.436292
I((x1 - 5) ** 2)   -0.019694
dtype: float64
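
These are the same estimates as in the array-based fit above; only the labels differ because patsy names the transformed columns. A quick consistency check (not in the original notebook):

# The formula builds the same design matrix, so the estimates match the
# earlier sm.OLS fit up to floating-point noise.
print(np.allclose(res.params.values, olsres.params))  # expected: True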

Now we only have to pass the single variable x1, and we get the transformed right-hand-side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.878788
1    10.744616
2    10.511225
3    10.217608
4     9.915089
5     9.654763
6     9.474982
7     9.391969
8     9.395842
9     9.453036
dtype: float64
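
The same interval machinery is available for the formula-based results; passing only the new x1 values lets patsy rebuild the transformed design matrix internally (a sketch using the standard get_prediction API):

# Out-of-sample confidence and prediction intervals from the formula fit.
pred_new = res.get_prediction(exog=dict(x1=x1n))
print(pred_new.summary_frame(alpha=0.05))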