[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Cox overfitting test?

From   Maarten buis <>
Subject   Re: st: Cox overfitting test?
Date   Tue, 13 Nov 2007 20:54:23 +0000 (GMT)

--- Andrzej Niemierko <aniemierko@PARTNERS.ORG> wrote:
> Is there a test for overfitting when using Cox proportional hazards
> model? Specifically, I have a small data set with only 52 cases (39
> events) and 12 independent variables. Six of those independent
> variables are jointly significant (all at p<0.001) in the Cox model.
> Although in my case it makes perfect sense that those six variables 
> are associated with the outcome I have a feeling that I am
> overfitting the data at hand. 

The problem with overfitting is that you are fitting such a flexible
model that what you are picking up are the random fluctions due to
sampling and not the systematic part of the fluctuations. This is
problematic because the findings will now be uninformative about the
population or the next sample. 

One way to get a feel for this is to 1) estimate your model in your
data, 2) create a bootstrap sample, and 3) see how the model with the
estimates from the real data would fit in the bootstrap sample. If you
are overfitting, than your estimates in the real data would have little
relevance for the bootstrap data and the fit would be bad. If you
haven't overfitted the data than the fit would cluster around the fit
in your real data. 

Usually I am not such a fan of fit statistics, so I can't remember an
appropriate one for the Cox model, so I improvised by looking at the
mean deviance residual. You might have to change that. Otherwise, the
example below should do what you want: If you are overfitting that the
center of the histogram should be to the right of the vertical line I
added with the -xline()- option (assuming that "more is bad" in your
choice of fit statistic).

*--------------- begin example -------------------
sysuse cancer, clear
stset studytim, failure(died)
xi: stcox age i.drug, mgale(mgale)

tempname memhold
tempfile results
postfile `memhold' msd using `results'
forvalues i = 1/1000 {
	predict dev, deviance
	gen dev2 = dev^2
	sum dev2, meanonly
      post `memhold' (`r(mean)')
postclose `memhold'
predict dev, deviance
gen dev2 = dev^2
sum dev2, meanonly
local observed = r(mean)

use `results', clear
hist msd, xline(`observed') /*
*/ xtitle("mean squared deviance residual")
*---------------------- end example -------------------------
(For more on how to use examples I sent to the Statalist, see )

Hope this helps,

Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands

visiting address:
Buitenveldertselaan 3 (Metropolitan), room Z434

+31 20 5986715

Want ideas for reducing your carbon footprint? Visit Yahoo! For Good
*   For searches and help try:

© Copyright 1996–2017 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index