
From: "Lachenbruch, Peter" <Peter.Lachenbruch@oregonstate.edu>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: k-fold cross validation
Date: Fri, 15 Feb 2008 09:51:19 -0800

I prefer bootstrapping myself. One issue with leave-one-out (LOO) is that the 'residuals' are correlated, and with small samples (say n<30 or so) the increase in variance is a problem. The method was first proposed by Quenouille in the 1950s, I used it in the 1960s, and Ned Glick showed in the late 70s that the bootstrap did a better job (i.e., smaller MSE).

If you want to leave out k observations at a time, it may be better to redo these with random sampling of the k observations. David Allen did something akin to this with his PRESS criterion (Prediction Error Sum of Squares). I don't think Stata has this (at least under the name of PRESS), but it may be able to give an equivalent statistic. The trick to all of this is a simple matrix inversion formula, so that one only needs to compute the inverse of X'X once and the rest is multiplications.

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Richard Goldstein
Sent: Friday, February 15, 2008 8:32 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: k-fold cross validation

1. see the jackknife command for the extreme version of this
2. you may prefer to use bootstrap -- see that command

Rich

Nalin Payakachat wrote:
> Hi,
>
> I would like to perform k-fold cross validation using Stata. Here is an
> explanation of k-fold (http://www.cs.cmu.edu/~schneide/tut5/node42.html):
>
> K-fold cross validation is one way to improve over the holdout method. The data
> set is divided into k subsets, and the holdout method is repeated k times. Each
> time, one of the k subsets is used as the test set and the other k-1 subsets are
> put together to form a training set. Then the average error across all k trials
> is computed. The advantage of this method is that it matters less how the data
> gets divided. Every data point gets to be in a test set exactly once, and gets
> to be in a training set k-1 times. The variance of the resulting estimate is
> reduced as k is increased. The disadvantage of this method is that the training
> algorithm has to be rerun from scratch k times, which means it takes k times as
> much computation to make an evaluation. A variant of this method is to randomly
> divide the data into a test and training set k different times. The advantage of
> doing this is that you can independently choose how large each test set is and
> how many trials you average over.
>
> If anybody could help, I would deeply appreciate it.
> Thank you so much.
>
> Nalin

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
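
The two ideas in this thread, k-fold cross validation and the PRESS shortcut, can be illustrated with a small sketch. It is written in Python rather than Stata and is not code from any of the posters. It assumes a simple linear regression with one predictor, for which the leverage has the closed form h_i = 1/n + (x_i - xbar)^2 / Sxx, so no explicit matrix inverse appears. The function names (fit_line, kfold_sse, press) and the data are made up for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def kfold_sse(xs, ys, k, seed=0):
    """Sum of squared out-of-fold prediction errors over k random folds.

    Each observation is held out exactly once; the model is refit
    from scratch on the remaining observations for every fold.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # random fold assignment
    folds = [idx[i::k] for i in range(k)]
    sse = 0.0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sse += sum((ys[i] - (a + b * xs[i])) ** 2 for i in test)
    return sse


def press(xs, ys):
    """PRESS from a single fit: sum of (e_i / (1 - h_i))^2."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = 1.0 / n + (x - x_bar) ** 2 / sxx  # leverage of this observation
        total += ((y - (a + b * x)) / (1.0 - h)) ** 2
    return total


# Made-up illustrative data, not taken from the thread.
xs = [float(i) for i in range(1, 13)]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.3, 19.7, 22.2, 23.9]

print("4-fold CV mean squared error:", kfold_sse(xs, ys, k=4) / len(xs))
print("LOO by refitting n times:    ", kfold_sse(xs, ys, k=len(xs)))
print("PRESS from one fit:          ", press(xs, ys))
```

With k equal to n, every fold holds a single observation, so the brute-force loop is leave-one-out and its total should agree with PRESS up to rounding. That agreement is the point of the shortcut: one fit plus the leverages stands in for n refits. For k smaller than n the folds depend on the random seed, so repeating the split with different seeds gives a feel for how much the estimate varies.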

**References**:

- **st: k-fold cross validation** *From:* Nalin Payakachat <npayakac@purdue.edu>
- **Re: st: k-fold cross validation** *From:* Richard Goldstein <richgold@ix.netcom.com>


© Copyright 1996–2017 StataCorp LLC