

From: Ulrich Kohler <kohler@wzb.eu>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Factor variable notation vs. hand made dummy vars
Date: Mon, 06 Feb 2012 17:38:27 +0100

Jeff, thank you very much.

The take-away message of this, then, is: (1) take care that you do not use a perfect predictor as the reference category of a categorical explanatory variable in a logit/probit model; (2) as it is cumbersome to search for perfect predictors before deciding on the reference category, it is better to use factor-variable notation.

Uli

On Monday, 06.02.2012, at 10:11 -0600, Jeff Pitblado, StataCorp LP wrote:
> Ulrich Kohler <kohler@wzb.eu> is comparing results from -logit- between two
> different specifications of what seem to be the same model, but is getting
> different results:
>
> > I cannot replicate the model
> >
> > . sysuse auto, clear
> > . tab rep78, gen(d)
> > . logit for mpg d2-d5
> >
> > with factor variable notation. I tried
> >
> > . logit for mpg ib1.rep78
> >
> > but results differ. Can anybody explain why?
> >
> > (Note as an aside that
> >
> > . logit for mpg d1-d5
> >
> > reproduces the factor variables solution, but normally I would not
> > specify the model this way)
>
> Here is the output from Uli's first model:
>
> ***** BEGIN:
> . logit for mpg d2-d5
>
> note: d2 != 0 predicts failure perfectly
>       d2 dropped and 8 obs not used
>
> Iteration 0:   log likelihood = -39.273156
> Iteration 1:   log likelihood = -26.016988
> Iteration 2:   log likelihood = -25.527683
> Iteration 3:   log likelihood = -25.487362
> Iteration 4:   log likelihood = -25.480362
> Iteration 5:   log likelihood = -25.478768
> Iteration 6:   log likelihood = -25.478391
> Iteration 7:   log likelihood = -25.478309
> Iteration 8:   log likelihood = -25.478292
> Iteration 9:   log likelihood = -25.478288
> Iteration 10:  log likelihood = -25.478287
>
> Logistic regression                               Number of obs   =         61
>                                                   LR chi2(4)      =      27.59
>                                                   Prob > chi2     =     0.0000
> Log likelihood = -25.478287                       Pseudo R2       =     0.3513
>
> ------------------------------------------------------------------------------
>      foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>          mpg |   .1310881   .0707293     1.85   0.064    -.0075387    .2697149
>           d2 |          0  (omitted)
>           d3 |   14.28187   2465.084     0.01   0.995    -4817.194    4845.758
>           d4 |   16.29835   2465.084     0.01   0.995    -4815.177    4847.774
>           d5 |   17.41793   2465.084     0.01   0.994    -4814.058    4848.894
>        _cons |  -19.14137   2465.084    -0.01   0.994    -4850.618    4812.335
> ------------------------------------------------------------------------------
> ***** END:
>
> Technically, this model should not have converged. The coefficients on the
> binary predictors are way too big; the standard errors don't look reasonable
> either.
>
> The problem here is that 'd1' and 'd2' are perfect predictors for 'foreign',
> but Uli dropped 'd1' from the list of predictors. Dropping a level from the
> indicators of a factor variable is normally a natural thing to want to do. One
> of the levels is going to be omitted because of collinearity anyway, so by
> dropping you can control which level to treat as the base level for the fitted
> coefficient effects of the factor variable. But 'd1' is a perfect predictor,
> so -logit- would have dropped it along with 'd2' (and the observations they
> indicate) for that reason and then found that it still needed to drop one of
> the other 'd#' variables because of collinearity. However, by not including
> 'd1' in the list of predictors, the observations that 'd1' indicates are left
> in the estimation sample, and -logit- is unable to identify that it has a
> collinearity problem.
>
> We can prevent this by adding 'd1' back into the list of predictors:
>
> ***** BEGIN:
> . logit for mpg d1-d5
>
> note: d1 != 0 predicts failure perfectly
>       d1 dropped and 2 obs not used
>
> note: d2 != 0 predicts failure perfectly
>       d2 dropped and 8 obs not used
>
> note: d5 omitted because of collinearity
> Iteration 0:   log likelihood = -38.411464
> Iteration 1:   log likelihood = -25.814503
> Iteration 2:   log likelihood = -25.480135
> Iteration 3:   log likelihood = -25.478287
> Iteration 4:   log likelihood = -25.478287
>
> Logistic regression                               Number of obs   =         59
>                                                   LR chi2(3)      =      25.87
>                                                   Prob > chi2     =     0.0000
> Log likelihood = -25.478287                       Pseudo R2       =     0.3367
>
> ------------------------------------------------------------------------------
>      foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>          mpg |   .1310946    .070733     1.85   0.064    -.0075396    .2697287
>           d1 |          0  (omitted)
>           d2 |          0  (omitted)
>           d3 |  -3.136422   1.044601    -3.00   0.003    -5.183803    -1.08904
>           d4 |  -1.119903   .9741478    -1.15   0.250    -3.029198    .7893916
>           d5 |          0  (omitted)
>        _cons |  -1.723275   1.776453    -0.97   0.332    -5.205059    1.758509
> ------------------------------------------------------------------------------
> ***** END:
>
> Uli already mentioned that this specification reproduces the results from the
> one using factor variables.
>
> We do not recommend this, but Uli can reproduce the first model specification
> using factor variables notation by explicitly specifying the levels of 'rep78'
> to use:
>
> ***** BEGIN:
> . logit for mpg i(2/5).rep78
>
> note: 2.rep78 != 0 predicts failure perfectly
>       2.rep78 dropped and 8 obs not used
>
> Iteration 0:   log likelihood = -39.273156
> Iteration 1:   log likelihood = -26.016988
> Iteration 2:   log likelihood = -25.527683
> Iteration 3:   log likelihood = -25.487362
> Iteration 4:   log likelihood = -25.480362
> Iteration 5:   log likelihood = -25.478768
> Iteration 6:   log likelihood = -25.478391
> Iteration 7:   log likelihood = -25.478309
> Iteration 8:   log likelihood = -25.478292
> Iteration 9:   log likelihood = -25.478288
> Iteration 10:  log likelihood = -25.478287
>
> Logistic regression                               Number of obs   =         61
>                                                   LR chi2(4)      =      27.59
>                                                   Prob > chi2     =     0.0000
> Log likelihood = -25.478287                       Pseudo R2       =     0.3513
>
> ------------------------------------------------------------------------------
>      foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>          mpg |   .1310881   .0707293     1.85   0.064    -.0075387    .2697149
>              |
>        rep78 |
>           2  |          0  (empty)
>           3  |   14.28187   2465.084     0.01   0.995    -4817.194    4845.758
>           4  |   16.29835   2465.084     0.01   0.995    -4815.177    4847.774
>           5  |   17.41793   2465.084     0.01   0.994    -4814.058    4848.894
>              |
>        _cons |  -19.14137   2465.084    -0.01   0.994    -4850.618    4812.335
> ------------------------------------------------------------------------------
> ***** END:
>
> --Jeff
> jpitblado@stata.com
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
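
A minimal do-file sketch of the workflow suggested by the take-away message above, assuming the auto dataset that ships with Stata, as used in the thread. It is not taken from either post. Using -tabulate- as the check for perfect predictors is an assumption made here for illustration, and 'foreign' is spelled out rather than abbreviated to 'for'.

```stata
* Sketch only: look for perfect predictors, then let -logit- manage the levels.
sysuse auto, clear

* Cross-tabulate the categorical predictor against the binary outcome.
* A level whose observations all fall in a single outcome column is a
* perfect predictor and a poor choice of reference category.
tabulate rep78 foreign

* Factor-variable notation keeps every level visible to -logit-, so it can
* drop perfect predictors (and the observations they indicate) before it
* resolves collinearity among the remaining levels.
logit foreign mpg ib1.rep78
```

In the output quoted above, -logit- notes that levels 1 and 2 of 'rep78' predict failure perfectly, so those are the rows to look for in the cross-tabulation.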

