# st: RE: Model selection using AIC/BIC and other information criteria

From: jhilbe@aol.com
To: statalist@hsphsun2.harvard.edu
Subject: st: RE: Model selection using AIC/BIC and other information criteria
Date: Wed, 24 Jun 2009 11:47:12 -0400

Stata has two versions of the AIC statistic: one used with -glm- and another with -estat ic-. The -estat ic- version does not adjust the log-likelihood and penalty term by the number of observations in the model, whereas the version used in -glm- does.
```
ESTAT IC

AIC = -2*LL + 2*k  =  -2*(LL - k)

GLM

      -2*LL + 2*k     -2*(LL - k)
AIC = -----------  =  -----------
           n               n
```
where LL is the model log-likelihood and k is the number of predictors (counting the constant; this is the df reported by -estat ic-). The term 2k is a penalty, adjusting for the number of predictors in the model. A larger n inflates -2*LL, and dividing by n turns the statistic into a per-observation contribution to the adjusted -2*LL. That is, the version used in -glm- adjusts for sample size.
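The two formulas are easy to sketch in Python (an illustration only, not the internals of either Stata command; the values of LL, k, and n are taken from the logistic regression run shown further below):

```python
def aic_estat(ll, k):
    """AIC as reported by -estat ic-: -2*LL + 2*k."""
    return -2 * ll + 2 * k

def aic_glm(ll, k, n):
    """AIC as reported by -glm-: (-2*LL + 2*k) / n, a per-observation value."""
    return (-2 * ll + 2 * k) / n

# Values from the logistic regression of foreign on mpg and length (auto data):
# LL = -30.17249165, k = 3 (two predictors plus the constant), n = 74
print(aic_estat(-30.17249165, 3))    # 66.3449833, matching -estat ic-
print(aic_glm(-30.17249165, 3, 74))  # ~ .8965538, matching -glm-
```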
Note that -estat ic- uses a version of the BIC statistic that is based on the LL. The original version, proposed by Raftery in 1986, is based on the deviance. -glm- uses the original version, hence the discrepancy in the displayed values.
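The two BIC versions can likewise be sketched (again an illustration of the formulas, not of either command's implementation; the deviance, residual df, and LL are the ones from the -glm- run below):

```python
from math import log

def bic_raftery(deviance, df_resid, n):
    """Deviance-based BIC (Raftery's original), as displayed by -glm-:
    deviance - (residual df) * ln(n)."""
    return deviance - df_resid * log(n)

def bic_ll(ll, k, n):
    """LL-based BIC, as used by -estat ic-: -2*LL + k*ln(n)."""
    return -2 * ll + k * log(n)

# Values from the logistic regression below (n = 74):
print(bic_raftery(60.3449833, 71, 74))  # ~ -245.2436, matching -glm-
print(bic_ll(-30.17249165, 3, 74))      # ~ 73.25718, matching -estat ic-
```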
Regardless, for several of my publications I developed two programs that calculate the AIC and BIC statistics following a Stata maximum likelihood or -glm- command. Look at the difference between the two versions of AIC when applied to a simple logistic regression:
```
. use auto, clear
(1978 Automobile Data)

. glm foreign mpg length, nolog fam(bin)

Generalized linear models                         No. of obs      =        74
Optimization     : ML                             Residual df     =        71
                                                  Scale parameter =         1
Deviance         =  60.3449833                    (1/df) Deviance =  .8499293
Pearson          =  54.91238538                   (1/df) Pearson  =  .7734139

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =  .8965538
Log likelihood   = -30.17249165                   BIC             = -245.2436

------------------------------------------------------------------------------
             |                 OIM
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.0988457   .0784404    -1.26   0.208    -.2525861    .0548946
      length |  -.1051447   .0295657    -3.56   0.000    -.1630923     -.047197
       _cons |   20.43339   6.700286     3.05   0.002     7.301072    33.56571
------------------------------------------------------------------------------

. estat ic

------------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df        AIC         BIC
-------------+----------------------------------------------------------------
           . |     74           .   -30.17249      3    66.34498    73.25718
------------------------------------------------------------------------------
               Note:  N=Obs used in calculating BIC; see [R] BIC note

. aic
AIC Statistic =   .8965538             AIC*n =  66.344983
BIC Statistic =  -245.2436

. abic
AIC Statistic   =   .8965538           AIC*n      = 66.344986
BIC Statistic   =   .9045494           BIC(Stata) = 73.257179
```
** -aic- calculates both versions of AIC, and the deviance-based BIC. Note that it is consistent with the displayed -glm- values.

** -abic- gives the same two versions of AIC, and the same BIC used by -estat ic-. The BIC on the left-hand side is the one used in the LIMDEP econometric software; it adjusts for sample size as well.
```
. expand 2
(74 observations created)

. glm foreign mpg length, nolog fam(bin)

Generalized linear models                         No. of obs      =       148
Optimization     : ML                             Residual df     =       145
                                                  Scale parameter =         1
Deviance         =  120.6899666                   (1/df) Deviance =  .8323446
Pearson          =  109.8247708                   (1/df) Pearson  =  .7574122

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =  .8560133
Log likelihood   = -60.3449833                    BIC             = -603.9058

------------------------------------------------------------------------------
             |                 OIM
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.0988457   .0554657    -1.78   0.075    -.2075566    .0098651
      length |  -.1051447   .0209061    -5.03   0.000    -.1461198    -.0641695
       _cons |   20.43339   4.737818     4.31   0.000     11.14744    29.71934
------------------------------------------------------------------------------

. estat ic

------------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df        AIC         BIC
-------------+----------------------------------------------------------------
           . |    148           .   -60.34498      3      126.69    135.6816
------------------------------------------------------------------------------
               Note:  N=Obs used in calculating BIC; see [R] BIC note

. aic
AIC Statistic =   .8560133             AIC*n =  126.68997
BIC Statistic =  -603.9058

. abic
AIC Statistic   =   .8560133           AIC*n      = 126.68996
BIC Statistic   =   .8600111           BIC(Stata) = 135.68161
```
Note the enlarged AIC statistic when using -estat ic-, but not when using the AIC displayed by -glm-. Also note the constancy of the LIMDEP BIC statistic when the data were expanded.
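The arithmetic behind this is easy to check: duplicating every observation doubles the log-likelihood at the same parameter estimates, so the -estat ic- AIC roughly doubles while the per-observation -glm- version barely moves. A sketch using the values above:

```python
ll1, n1, k = -30.17249165, 74, 3
ll2, n2 = 2 * ll1, 2 * n1  # duplicating the data doubles LL at the same MLE

def aic_total(ll, k):
    """-estat ic- version: -2*LL + 2*k."""
    return -2 * ll + 2 * k

def aic_perobs(ll, k, n):
    """-glm- version: (-2*LL + 2*k) / n."""
    return (-2 * ll + 2 * k) / n

print(aic_total(ll1, k), aic_total(ll2, k))            # 66.345 vs 126.690: nearly doubles
print(aic_perobs(ll1, k, n1), aic_perobs(ll2, k, n2))  # .8966 vs .8560: nearly constant
```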

By adjusting for the number of observations in the model, the AIC can better be used as a comparative fit statistic, regardless of whether there is a difference in sample sizes. This was the intent of the statistic in the first place.

Also be aware that there have been other versions of the AIC, among them the finite-sample AIC, the Schwarz AIC, and the LIMDEP AIC. Each of these has an explicit adjustment for sample size, unlike the version used in -estat ic-.

I discuss this topic in some detail in my new book, "Logistic Regression Models", and provide a table of degrees of model preference based on the difference in AIC values between two models. The criterion for strength of preference is based on simulation studies. The table is similar to the one developed by Raftery for his original version of BIC.

It must be understood that the penalty and observation corrections are not completely successful in eliminating bias resulting from additional predictors and differences in observations. But having an adjustment for sample size appears to me preferable to having none. Others developing alternatives to the traditional AIC statistics (-estat ic- and -glm-) seem to agree. The primary caveat to be aware of when using the -glm- AIC relates to its use with correlated data. But that's another discussion.

Joseph Hilbe

=========================================

Date: Tue, 23 Jun 2009 22:20:36 -0500
From: Richard Williams <Richard.A.Williams.5@ND.edu>
Subject: RE: st: Model selection using AIC/BIC and other information criteria
At 08:39 PM 6/23/2009, kokootchke wrote:
> Thank you, Richard. This was exactly what I thought... but I
> remember from my metrics classes a long time ago that both AIC and BIC
> depend on N (sample size)... and I confirmed this by simply looking
> at these Wikipedia entries... but, just like you, I also feared
> that, even though both criteria adjust for the sample size, maybe
> you can't compare AICs and BICs when the models use
> different # of observations...
Here is a simple example that shows the sensitivity of BIC and AIC to sample size:
```
. sysuse auto, clear
(1978 Automobile Data)

. quietly reg price mpg trunk weight

. estat ic

------------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df        AIC         BIC
-------------+----------------------------------------------------------------
           . |     74   -695.7129   -682.6073      4    1373.215    1382.431
------------------------------------------------------------------------------
               Note:  N=Obs used in calculating BIC; see [R] BIC note

. expand 2
(74 observations created)

. quietly reg price mpg trunk weight

. estat ic

------------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df        AIC         BIC
-------------+----------------------------------------------------------------
           . |    148   -1391.426   -1365.215      4    2738.429    2750.418
------------------------------------------------------------------------------
               Note:  N=Obs used in calculating BIC; see [R] BIC note
```

So even if data are missing at random on your X variable, the smaller sample size that results from its inclusion will drive down the BIC and AIC stats quite a bit.
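As a quick cross-check of what -estat ic- is computing in those tables (formulas only; this is not how -estat ic- is implemented internally):

```python
from math import log

def estat_aic(ll, df):
    """-estat ic- AIC: -2*LL + 2*df."""
    return -2 * ll + 2 * df

def estat_bic(ll, df, n):
    """-estat ic- BIC: -2*LL + df*ln(N)."""
    return -2 * ll + df * log(n)

# ll(model), df, and Obs from the two -estat ic- tables above:
print(estat_aic(-682.6073, 4), estat_bic(-682.6073, 4, 74))   # ~ 1373.215, 1382.431
print(estat_aic(-1365.215, 4), estat_bic(-1365.215, 4, 148))  # ~ 2738.429, 2750.418
```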

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
HOME:   (574)289-5227
EMAIL:  Richard.A.Williams.5@ND.Edu
WWW:    http://www.nd.edu/~rwilliam

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/