Statalist



Re: st: Model selection using AIC/BIC and other information criteria


From   Richard Williams <Richard.A.Williams.5@ND.edu>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>, statalist <statalist@hsphsun2.harvard.edu>
Subject   Re: st: Model selection using AIC/BIC and other information criteria
Date   Tue, 23 Jun 2009 21:20:28 -0500

At 06:07 PM 6/23/2009, kokootchke wrote:
Dear all,

I have a model that says that the return or yield spread of a bond issued by a country depends non-linearly on the country's probability of default. If I assume that this probability of default follows a logistic form, I get that the log spread depends linearly on "stuff" which I take to be macroeconomic variables. To choose the best model, I use AIC/BIC.

One interesting thing I observe is that in some cases both AIC and BIC select a model that contains some variable X even when a lot of data points are missing for that variable, which means I lose a lot of observations when I include X.

More specifically, I have:

MODEL 1

regress log_spread a b c X
estat ic

which gives AIC = 915

then,

MODEL 2

regress log_spread a b c
estat ic

which gives AIC = 1500

but the OLS in model 1 uses 1200 observations while the OLS in model 2 uses 2800 observations (because X is missing for 1600 observations)!

You would think that this would be because X is very relevant for explaining the spread, but in fact in some cases this variable is statistically insignificant!

Somebody can correct me if I am wrong, but I don't think it is legitimate to compare BIC and AIC statistics that have been estimated on different samples. I don't think these stats are immune to differences in sample size: both are built from the log likelihood, which is a sum over the cases actually used. And even if they were, the two samples might be very different, e.g. maybe those 1600 missing cases are all bonds from the US.

I'm guessing a fairer comparison would be

nestreg, lr: reg log_spread (a b c) X

The same sample will be used for both regressions and you will get BIC and AIC stats at the end.
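
If you would rather do it by hand, something along these lines should give the same kind of comparison. This is only a sketch using the variable names from the original post; the names insample, m1 and m2 are made up for illustration.

```stata
* Model 1: includes X, so cases with missing X are dropped
regress log_spread a b c X
estimates store m1

* Flag the cases actually used in Model 1
generate byte insample = e(sample)

* Model 2: same cases, X omitted
regress log_spread a b c if insample
estimates store m2

* AIC and BIC for both models, side by side
estimates stats m1 m2
```

Both models are then fit to the same 1200 cases, so the AIC and BIC values refer to the same data.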

I think your bigger concern, though, is losing more than half your cases when you include X. You need to find out why those data are missing and then decide what to do about it.
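
As a rough starting point for looking at the missing data, you could compare the cases with and without X. Again just a sketch: X_miss is a made-up name, and country is a hypothetical identifier variable that may be called something else in your data.

```stata
* How many cases are missing X?
count if missing(X)

* Indicator for missingness on X
generate byte X_miss = missing(X)

* Do cases with and without X differ on the outcome?
ttest log_spread, by(X_miss)

* Hypothetical: country is an assumed identifier variable
tabulate country X_miss
```

If missingness is concentrated in particular countries or is related to the spread itself, the 1200-case sample may not be representative of the full 2800.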


-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
HOME:   (574)289-5227
EMAIL:  Richard.A.Williams.5@ND.Edu
WWW:    http://www.nd.edu/~rwilliam



