From:    jpitblado@stata.com (Jeff Pitblado, StataCorp LP)
To:      statalist@hsphsun2.harvard.edu
Subject: Re: st: Factor variable notation vs. hand made dummy vars
Date:    Mon, 06 Feb 2012 10:11:58 -0600
Ulrich Kohler <kohler@wzb.eu> is comparing results from -logit- between two
different specifications of what seem to be the same model, but is getting
different results:

> I cannot replicate the model
>
> . sysuse auto, clear
> . tab rep78, gen(d)
> . logit for mpg d2-d5
>
> with factor variable notation. I tried
>
> . logit for mpg ib1.rep78
>
> but results differ. Can anybody explain why?
>
> (Note as an aside that
>
> . logit for mpg d1-d5
>
> reproduces the factor variables solution, but normally I would not
> specify the model this way)

Here is the output from Uli's first model:

***** BEGIN:

. logit for mpg d2-d5

note: d2 != 0 predicts failure perfectly
      d2 dropped and 8 obs not used

Iteration 0:   log likelihood = -39.273156
Iteration 1:   log likelihood = -26.016988
Iteration 2:   log likelihood = -25.527683
Iteration 3:   log likelihood = -25.487362
Iteration 4:   log likelihood = -25.480362
Iteration 5:   log likelihood = -25.478768
Iteration 6:   log likelihood = -25.478391
Iteration 7:   log likelihood = -25.478309
Iteration 8:   log likelihood = -25.478292
Iteration 9:   log likelihood = -25.478288
Iteration 10:  log likelihood = -25.478287

Logistic regression                               Number of obs   =         61
                                                  LR chi2(4)      =      27.59
                                                  Prob > chi2     =     0.0000
Log likelihood = -25.478287                       Pseudo R2       =     0.3513

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1310881   .0707293     1.85   0.064    -.0075387    .2697149
          d2 |          0  (omitted)
          d3 |   14.28187   2465.084     0.01   0.995    -4817.194    4845.758
          d4 |   16.29835   2465.084     0.01   0.995    -4815.177    4847.774
          d5 |   17.41793   2465.084     0.01   0.994    -4814.058    4848.894
       _cons |  -19.14137   2465.084    -0.01   0.994    -4850.618    4812.335
------------------------------------------------------------------------------

***** END:

Technically, this model should not have converged.  The coefficients on the
binary predictors are far too big, and the standard errors do not look
reasonable either.

The problem here is that 'd1' and 'd2' are perfect predictors of 'foreign',
but Uli dropped 'd1' from the list of predictors.

Dropping a level from the indicators of a factor variable is normally a
natural thing to want to do.  One of the levels is going to be omitted
because of collinearity anyway, so by dropping one yourself you can control
which level is treated as the base level for the fitted effects of the
factor variable.

But 'd1' is a perfect predictor, so -logit- would have dropped it along with
'd2' (and the observations they indicate) for that reason, and then found
that it still needed to drop one of the other 'd#' variables because of
collinearity.  However, by not including 'd1' in the list of predictors, the
observations that 'd1' indicates are left in the estimation sample, and
-logit- is unable to identify that it has a collinearity problem.
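As an aside, a quick cross-tabulation shows where the perfect prediction
comes from; the commands below are only a minimal sketch, with the table
itself not reproduced here:

***** BEGIN:

. sysuse auto, clear

. * every observation with rep78==1 or rep78==2 is domestic, so d1 != 0 and
. * d2 != 0 each predict failure (foreign==0) perfectly
. tabulate rep78 foreign

***** END: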
We can prevent this by adding 'd1' back into the list of predictors:

***** BEGIN:

. logit for mpg d1-d5

note: d1 != 0 predicts failure perfectly
      d1 dropped and 2 obs not used
note: d2 != 0 predicts failure perfectly
      d2 dropped and 8 obs not used
note: d5 omitted because of collinearity

Iteration 0:   log likelihood = -38.411464
Iteration 1:   log likelihood = -25.814503
Iteration 2:   log likelihood = -25.480135
Iteration 3:   log likelihood = -25.478287
Iteration 4:   log likelihood = -25.478287

Logistic regression                               Number of obs   =         59
                                                  LR chi2(3)      =      25.87
                                                  Prob > chi2     =     0.0000
Log likelihood = -25.478287                       Pseudo R2       =     0.3367

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1310946    .070733     1.85   0.064    -.0075396    .2697287
          d1 |          0  (omitted)
          d2 |          0  (omitted)
          d3 |  -3.136422   1.044601    -3.00   0.003    -5.183803     -1.08904
          d4 |  -1.119903   .9741478    -1.15   0.250    -3.029198    .7893916
          d5 |          0  (omitted)
       _cons |  -1.723275   1.776453    -0.97   0.332    -5.205059    1.758509
------------------------------------------------------------------------------

***** END:

Uli already mentioned that this specification reproduces the results from the
one using factor variables.

We do not recommend this, but Uli can reproduce the first model specification
using factor variable notation by explicitly specifying the levels of 'rep78'
to use:

***** BEGIN:

. logit for mpg i(2/5).rep78

note: 2.rep78 != 0 predicts failure perfectly
      2.rep78 dropped and 8 obs not used

Iteration 0:   log likelihood = -39.273156
Iteration 1:   log likelihood = -26.016988
Iteration 2:   log likelihood = -25.527683
Iteration 3:   log likelihood = -25.487362
Iteration 4:   log likelihood = -25.480362
Iteration 5:   log likelihood = -25.478768
Iteration 6:   log likelihood = -25.478391
Iteration 7:   log likelihood = -25.478309
Iteration 8:   log likelihood = -25.478292
Iteration 9:   log likelihood = -25.478288
Iteration 10:  log likelihood = -25.478287

Logistic regression                               Number of obs   =         61
                                                  LR chi2(4)      =      27.59
                                                  Prob > chi2     =     0.0000
Log likelihood = -25.478287                       Pseudo R2       =     0.3513

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .1310881   .0707293     1.85   0.064    -.0075387    .2697149
             |
       rep78 |
          2  |          0  (empty)
          3  |   14.28187   2465.084     0.01   0.995    -4817.194    4845.758
          4  |   16.29835   2465.084     0.01   0.995    -4815.177    4847.774
          5  |   17.41793   2465.084     0.01   0.994    -4814.058    4848.894
             |
       _cons |  -19.14137   2465.084    -0.01   0.994    -4850.618    4812.335
------------------------------------------------------------------------------

***** END:

--Jeff
jpitblado@stata.com

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/