Thursday, November 11, 2010

Cross-validation with one model

This is essentially a repost of Rob J Hyndman's blog post on the relevance of cross-validation for statisticians.


Within this very nice piece, Rob drops this bomb of mathematical knowledge:

It is not necessary to actually fit n separate models when computing the CV statistic for linear models.


Say what?


Here is a broader excerpt and the method itself (after the jump). 




While cross-validation can be computationally expensive in general, it is very easy and fast to compute LOOCV for linear models. A linear model can be written as
\[
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}.
\]
Then
\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}
\]
and the fitted values can be calculated using
\[
\mathbf{\hat{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y},
\]
where \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is known as the "hat-matrix" because it is used to compute \(\mathbf{\hat{Y}}\) ("Y-hat").
If the diagonal values of \(\mathbf{H}\) are denoted by \(h_{1},\dots,h_{n}\), then the cross-validation statistic can be computed using
\[
\text{CV} = \frac{1}{n}\sum_{i=1}^n [e_{i}/(1-h_{i})]^2,
\]
where \(e_{i}\) is the residual obtained from fitting the model to all \(n\) observations. See Christensen's book Plane Answers to Complex Questions for a proof. Thus, it is not necessary to actually fit n separate models when computing the CV statistic for linear models. This remarkable result allows cross-validation to be used while only fitting the model once to all available observations.


Very cool.
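
If you want to convince yourself numerically, here is a minimal sketch in Python with NumPy (the simulated data, variable names, and coefficients are all made up for illustration, not taken from Rob's post): it fits the model once, computes the hat-matrix diagonal, and checks the shortcut CV statistic against the brute-force leave-one-out calculation.

```python
import numpy as np

# Simulated data (purely illustrative): design matrix with an intercept
# column and three covariates, plus a noisy linear response.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Fit the model once on all n observations.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Diagonal of the hat matrix H = X (X'X)^{-1} X'.
h = np.einsum('ij,ij->i', X, X @ np.linalg.inv(X.T @ X))

# Shortcut LOOCV statistic: mean of squared e_i / (1 - h_i).
cv_shortcut = np.mean((residuals / (1 - h)) ** 2)

# Brute-force check: fit n separate models, each leaving one observation out.
cv_naive = 0.0
for i in range(n):
    mask = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    cv_naive += (y[i] - X[i] @ b_i) ** 2
cv_naive /= n

print(cv_shortcut, cv_naive)
```

The two printed values should agree up to floating-point rounding, which is exactly what the identity promises: one fit instead of n.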
