Prediction (out of sample)
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.989
Model: OLS Adj. R-squared: 0.988
Method: Least Squares F-statistic: 1395.
Date: Thu, 05 Jan 2023 Prob (F-statistic): 3.72e-45
Time: 20:34:52 Log-Likelihood: 12.089
No. Observations: 50 AIC: -16.18
Df Residuals: 46 BIC: -8.531
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.0938 0.068 75.444 0.000 4.958 5.230
x1 0.4918 0.010 47.229 0.000 0.471 0.513
x2 0.4363 0.041 10.658 0.000 0.354 0.519
x3 -0.0197 0.001 -21.541 0.000 -0.022 -0.018
==============================================================================
Omnibus: 1.210 Durbin-Watson: 2.242
Prob(Omnibus): 0.546 Jarque-Bera (JB): 1.056
Skew: 0.152 Prob(JB): 0.590
Kurtosis: 2.356 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
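The coefficient table above can also be read programmatically from the results object rather than from the printed summary. A minimal sketch using the fitted olsres from above:

# Point estimates, standard errors, and 95% confidence intervals
print(olsres.params)
print(olsres.bse)
print(olsres.conf_int(alpha=0.05))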
In-sample prediction
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.60141583 5.05242307 5.4684162 5.82561773 6.1088313 6.31393849
6.4485754 6.53087809 6.58650286 6.64441102 6.73211059 6.87113685
7.07351412 7.33978023 7.65889813 8.01006925 8.36615066 8.69811457
8.97981751 9.19229564 9.32688042 9.38662287 9.38579229 9.34753182
9.30005613 9.27201579 9.2877901 9.36348161 9.50427054 9.70356366
9.94407758 10.20067829 10.44451069 10.64773935 10.7881218 10.85266292
10.83974998 10.75941763 10.63169921 10.48333535 10.34338056 10.23842964
10.18824824 10.20252542 10.27928176 10.40519458 10.55778681 10.70911993
10.83038414 10.8966318 ]
Create a new sample of explanatory variables Xnew, predict and plot
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.87878765 10.74461563 10.51122546 10.217608 9.91508897 9.65476257
9.4749819 9.39196871 9.39584182 9.45303634]
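Point predictions say nothing about their uncertainty, which grows as we extrapolate beyond the observed range of x1. A minimal sketch using get_prediction on the fitted olsres and the new design matrix Xnew to obtain 95% confidence and prediction intervals:

# Mean predictions with confidence and prediction intervals for the new sample
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))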
Plot comparison
[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7ff286d5f910>

Predicting with Formulas
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate the identity transform, i.e., we do not want any expansion magic from using **2 in the formula.
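To see what the formula machinery actually builds, the design matrix can be inspected directly. A sketch assuming patsy, the formula engine used by statsmodels here, is available:

import patsy

# I(...) keeps (x1 - 5) ** 2 as plain arithmetic rather than formula-operator expansion
dm = patsy.dmatrix("x1 + np.sin(x1) + I((x1 - 5) ** 2)", data)
print(dm.design_info.column_names)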
[9]:
res.params
[9]:
Intercept 5.093769
x1 0.491787
np.sin(x1) 0.436292
I((x1 - 5) ** 2) -0.019694
dtype: float64
Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.878788
1 10.744616
2 10.511225
3 10.217608
4 9.915089
5 9.654763
6 9.474982
7 9.391969
8 9.395842
9 9.453036
dtype: float64
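As a quick check, the formula-based predictions agree with the ones computed from the hand-built Xnew earlier (a sketch reusing ynewpred from above):

# The two prediction paths should give the same numbers
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))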