From:    jpitblado@stata.com (Jeff Pitblado, StataCorp LP)
To:      statalist@hsphsun2.harvard.edu
Subject: Re: st: Factor variable notation vs. hand made dummy vars
Date:    Mon, 06 Feb 2012 10:11:58 -0600
Ulrich Kohler <kohler@wzb.eu> is comparing results from -logit- between two
different specifications of what seem to be the same model, but is getting
different results:

> I cannot replicate the model
>
> . sysuse auto, clear
> . tab rep78, gen(d)
> . logit for mpg d2-d5
>
> with factor variable notation. I tried
>
> . logit for mpg ib1.rep78
>
> but results differ. Can anybody explain why?
>
> (Note as an aside that
>
> . logit for mpg d1-d5
>
> reproduces the factor variables solution, but normally I would not
> specify the model this way)

Here is the output from Uli's first model:

***** BEGIN:

. logit for mpg d2-d5

note: d2 != 0 predicts failure perfectly
      d2 dropped and 8 obs not used

Iteration 0:   log likelihood = -39.273156
Iteration 1:   log likelihood = -26.016988
Iteration 2:   log likelihood = -25.527683
Iteration 3:   log likelihood = -25.487362
Iteration 4:   log likelihood = -25.480362
Iteration 5:   log likelihood = -25.478768
Iteration 6:   log likelihood = -25.478391
Iteration 7:   log likelihood = -25.478309
Iteration 8:   log likelihood = -25.478292
Iteration 9:   log likelihood = -25.478288
Iteration 10:  log likelihood = -25.478287

Logistic regression                               Number of obs   =         61
                                                  LR chi2(4)      =      27.59
                                                  Prob > chi2     =     0.0000
Log likelihood = -25.478287                       Pseudo R2       =     0.3513

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1310881   .0707293     1.85   0.064    -.0075387    .2697149
          d2 |          0  (omitted)
          d3 |   14.28187   2465.084     0.01   0.995    -4817.194    4845.758
          d4 |   16.29835   2465.084     0.01   0.995    -4815.177    4847.774
          d5 |   17.41793   2465.084     0.01   0.994    -4814.058    4848.894
       _cons |  -19.14137   2465.084    -0.01   0.994    -4850.618    4812.335
------------------------------------------------------------------------------

***** END:

Technically, this model should not have converged.  The coefficients on the
binary predictors are far too big, and the standard errors do not look
reasonable either.

The problem here is that 'd1' and 'd2' are perfect predictors of 'foreign',
but Uli dropped 'd1' from the list of predictors.

Dropping a level from the indicators of a factor variable is normally a
natural thing to want to do.  One of the levels is going to be omitted
because of collinearity anyway, so by dropping one yourself you can control
which level is treated as the base level for the fitted effects of the
factor variable.

But 'd1' is a perfect predictor, so -logit- would have dropped it along with
'd2' (and the observations they indicate) for that reason, and then found
that it still needed to drop one of the other 'd#' variables because of
collinearity.  However, by not including 'd1' in the list of predictors, the
observations that 'd1' indicates are left in the estimation sample, and
-logit- is unable to identify that it has a collinearity problem.
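As an aside, a quick cross-tabulation shows where the perfect prediction
comes from; the commands below are only a minimal sketch, with the table
itself not reproduced here:

***** BEGIN:

. sysuse auto, clear

. * every observation with rep78==1 or rep78==2 is domestic, so d1 != 0 and
. * d2 != 0 each predict failure (foreign==0) perfectly
. tabulate rep78 foreign

***** END: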
We can prevent this by adding 'd1' back into the list of predictors:

***** BEGIN:

. logit for mpg d1-d5

note: d1 != 0 predicts failure perfectly
      d1 dropped and 2 obs not used
note: d2 != 0 predicts failure perfectly
      d2 dropped and 8 obs not used
note: d5 omitted because of collinearity

Iteration 0:   log likelihood = -38.411464
Iteration 1:   log likelihood = -25.814503
Iteration 2:   log likelihood = -25.480135
Iteration 3:   log likelihood = -25.478287
Iteration 4:   log likelihood = -25.478287

Logistic regression                               Number of obs   =         59
                                                  LR chi2(3)      =      25.87
                                                  Prob > chi2     =     0.0000
Log likelihood = -25.478287                       Pseudo R2       =     0.3367

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1310946    .070733     1.85   0.064    -.0075396    .2697287
          d1 |          0  (omitted)
          d2 |          0  (omitted)
          d3 |  -3.136422   1.044601    -3.00   0.003    -5.183803     -1.08904
          d4 |  -1.119903   .9741478    -1.15   0.250    -3.029198    .7893916
          d5 |          0  (omitted)
       _cons |  -1.723275   1.776453    -0.97   0.332    -5.205059    1.758509
------------------------------------------------------------------------------

***** END:

Uli already mentioned that this specification reproduces the results from the
one using factor variables.

We do not recommend this, but Uli can reproduce the first model specification
using factor variable notation by explicitly specifying the levels of 'rep78'
to use:

***** BEGIN:

. logit for mpg i(2/5).rep78

note: 2.rep78 != 0 predicts failure perfectly
      2.rep78 dropped and 8 obs not used

Iteration 0:   log likelihood = -39.273156
Iteration 1:   log likelihood = -26.016988
Iteration 2:   log likelihood = -25.527683
Iteration 3:   log likelihood = -25.487362
Iteration 4:   log likelihood = -25.480362
Iteration 5:   log likelihood = -25.478768
Iteration 6:   log likelihood = -25.478391
Iteration 7:   log likelihood = -25.478309
Iteration 8:   log likelihood = -25.478292
Iteration 9:   log likelihood = -25.478288
Iteration 10:  log likelihood = -25.478287

Logistic regression                               Number of obs   =         61
                                                  LR chi2(4)      =      27.59
                                                  Prob > chi2     =     0.0000
Log likelihood = -25.478287                       Pseudo R2       =     0.3513

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1310881   .0707293     1.85   0.064    -.0075387    .2697149
             |
       rep78 |
          2  |          0  (empty)
          3  |   14.28187   2465.084     0.01   0.995    -4817.194    4845.758
          4  |   16.29835   2465.084     0.01   0.995    -4815.177    4847.774
          5  |   17.41793   2465.084     0.01   0.994    -4814.058    4848.894
             |
       _cons |  -19.14137   2465.084    -0.01   0.994    -4850.618    4812.335
------------------------------------------------------------------------------

***** END:

--Jeff
jpitblado@stata.com

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/