Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Factor variable notation vs. hand made dummy vars

From	Ulrich Kohler <[email protected]>
To	[email protected]
Subject	Re: st: Factor variable notation vs. hand made dummy vars
Date	Mon, 06 Feb 2012 17:38:27 +0100
Jeff,

thank you very much. The take-away message, then, is:

(1) take care not to use a perfect predictor as the reference
category of a categorical explanatory variable in a logit/probit model

(2) since it is cumbersome to search for perfect predictors before
deciding on the reference category, it is better to use factor-variable
notation.
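
A minimal sketch of both points, assuming the auto data used in this
thread. The cross-tabulation is only one possible way to look for
perfect predictors by hand and is not something the thread prescribes;
the -logit- call is the factor-variable specification quoted below.

```stata
sysuse auto, clear

* Point (1): cross-tabulate the categorical predictor against the outcome.
* A level whose row has an empty cell predicts the outcome perfectly
* and is a poor choice of reference category.
tabulate rep78 foreign

* Point (2): factor-variable notation, as in the specification quoted below.
* Per the thread, this gives the same fit as listing d1-d5 by hand.
logit foreign mpg ib1.rep78
```

With hand-made dummies the same check has to be repeated whenever the
estimation sample changes, which is the cumbersome part.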

Uli


On Monday, 06.02.2012, at 10:11 -0600, Jeff Pitblado, StataCorp LP
wrote:
> Ulrich Kohler <[email protected]> is comparing results from -logit- between two
> different specifications of what seem to be the same model, but is getting
> different results:
> 
> > I cannot replicate the model 
> > 
> > . sysuse auto, clear
> > . tab rep78, gen(d)
> > . logit for mpg d2-d5
> > 
> > with factor variable notation. I tried
> > 
> > . logit for mpg ib1.rep78
> > 
> > but results differ. Can anybody explain why?
> > 
> > (Note as an aside that
> > 
> > . logit for mpg d1-d5
> > 
> > reproduces the factor variables solution, but normally I would not
> > specify the model this way)
> 
> Here is the output from Uli's first model:
> 
> ***** BEGIN:
> . logit for mpg d2-d5
> 
> note: d2 != 0 predicts failure perfectly
>       d2 dropped and 8 obs not used
> 
> Iteration 0:   log likelihood = -39.273156  
> Iteration 1:   log likelihood = -26.016988  
> Iteration 2:   log likelihood = -25.527683  
> Iteration 3:   log likelihood = -25.487362  
> Iteration 4:   log likelihood = -25.480362  
> Iteration 5:   log likelihood = -25.478768  
> Iteration 6:   log likelihood = -25.478391  
> Iteration 7:   log likelihood = -25.478309  
> Iteration 8:   log likelihood = -25.478292  
> Iteration 9:   log likelihood = -25.478288  
> Iteration 10:  log likelihood = -25.478287  
> 
> Logistic regression                               Number of obs   =         61
>                                                   LR chi2(4)      =      27.59
>                                                   Prob > chi2     =     0.0000
> Log likelihood = -25.478287                       Pseudo R2       =     0.3513
> 
> ------------------------------------------------------------------------------
>      foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>          mpg |   .1310881   .0707293     1.85   0.064    -.0075387    .2697149
>           d2 |          0  (omitted)
>           d3 |   14.28187   2465.084     0.01   0.995    -4817.194    4845.758
>           d4 |   16.29835   2465.084     0.01   0.995    -4815.177    4847.774
>           d5 |   17.41793   2465.084     0.01   0.994    -4814.058    4848.894
>        _cons |  -19.14137   2465.084    -0.01   0.994    -4850.618    4812.335
> ------------------------------------------------------------------------------
> ***** END:
> 
> Technically, this model should not have converged.  The coefficients on the
> binary predictors are way too big; the standard errors don't look reasonable
> either.
> 
> The problem here is that 'd1' and 'd2' are perfect predictors for 'foreign', but
> Uli dropped 'd1' from the list of predictors.  Dropping a level from the
> indicators of a factor variable is normally a natural thing to want to do. One
> of the levels is going to be omitted because of collinearity anyway, so by
> dropping you can control which level to treat as the base level for the fitted
> coefficient effects of the factor variable.  But 'd1' is a perfect predictor,
> so -logit- would have dropped it along with 'd2' (and the observations they
> indicate) for that reason and then found that it still needed to drop one of
> the other 'd#' variables because of collinearity.  However by not including
> 'd1' in the list of predictors, the observations that 'd1' indicates are left
> in the estimation sample, and -logit- is unable to identify that it has a
> collinearity problem.
> 
> We can prevent this by adding 'd1' back in to the list of predictors:
> 
> ***** BEGIN:
> . logit for mpg d1-d5
> 
> note: d1 != 0 predicts failure perfectly
>       d1 dropped and 2 obs not used
> 
> note: d2 != 0 predicts failure perfectly
>       d2 dropped and 8 obs not used
> 
> note: d5 omitted because of collinearity
> Iteration 0:   log likelihood = -38.411464  
> Iteration 1:   log likelihood = -25.814503  
> Iteration 2:   log likelihood = -25.480135  
> Iteration 3:   log likelihood = -25.478287  
> Iteration 4:   log likelihood = -25.478287  
> 
> Logistic regression                               Number of obs   =         59
>                                                   LR chi2(3)      =      25.87
>                                                   Prob > chi2     =     0.0000
> Log likelihood = -25.478287                       Pseudo R2       =     0.3367
> 
> ------------------------------------------------------------------------------
>      foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>          mpg |   .1310946    .070733     1.85   0.064    -.0075396    .2697287
>           d1 |          0  (omitted)
>           d2 |          0  (omitted)
>           d3 |  -3.136422   1.044601    -3.00   0.003    -5.183803    -1.08904
>           d4 |  -1.119903   .9741478    -1.15   0.250    -3.029198    .7893916
>           d5 |          0  (omitted)
>        _cons |  -1.723275   1.776453    -0.97   0.332    -5.205059    1.758509
> ------------------------------------------------------------------------------
> ***** END:
> 
> Uli already mentioned that this specification reproduces the results from the
> one using factor variables.
> 
> We do not recommend this, but Uli can reproduce the first model specification
> using factor variables notation by explicitly specifying the levels of 'rep78'
> to use:
> 
> ***** BEGIN:
> . logit for mpg i(2/5).rep78
> 
> note: 2.rep78 != 0 predicts failure perfectly
>       2.rep78 dropped and 8 obs not used
> 
> Iteration 0:   log likelihood = -39.273156  
> Iteration 1:   log likelihood = -26.016988  
> Iteration 2:   log likelihood = -25.527683  
> Iteration 3:   log likelihood = -25.487362  
> Iteration 4:   log likelihood = -25.480362  
> Iteration 5:   log likelihood = -25.478768  
> Iteration 6:   log likelihood = -25.478391  
> Iteration 7:   log likelihood = -25.478309  
> Iteration 8:   log likelihood = -25.478292  
> Iteration 9:   log likelihood = -25.478288  
> Iteration 10:  log likelihood = -25.478287  
> 
> Logistic regression                               Number of obs   =         61
>                                                   LR chi2(4)      =      27.59
>                                                   Prob > chi2     =     0.0000
> Log likelihood = -25.478287                       Pseudo R2       =     0.3513
> 
> ------------------------------------------------------------------------------
>      foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>          mpg |   .1310881   .0707293     1.85   0.064    -.0075387    .2697149
>              |
>        rep78 |
>           2  |          0  (empty)
>           3  |   14.28187   2465.084     0.01   0.995    -4817.194    4845.758
>           4  |   16.29835   2465.084     0.01   0.995    -4815.177    4847.774
>           5  |   17.41793   2465.084     0.01   0.994    -4814.058    4848.894
>              |
>        _cons |  -19.14137   2465.084    -0.01   0.994    -4850.618    4812.335
> ------------------------------------------------------------------------------
> ***** END:
> 
> --Jeff
> [email protected]
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

