  # Re: st: proportion as a dependent variable

From: Roger Newson
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: proportion as a dependent variable
Date: Mon, 14 Jul 2003 14:55:56 +0100

At 14:06 14/07/03 +0200, Ronnie Babigumira wrote:

```
I was attending a workshop in which one of the presenters had a regression
in which a dependent variable was a proportion. One of the participants
noted that it was wrong but didnt follow it up with a clear explanation.
```
I assume that Ronnie is talking about a homoskedastic linear regression, not a logistic regression. A homoskedastic linear regression model should not be used for binary data, because the conditional variance of a binary dependent variable depends on the conditional mean, i.e. the conditional probability of a success, given the values of the independent variables. The standard errors for a homoskedastic linear regression are computed assuming that the conditional variance is constant, and will therefore be either too large or too small.
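To spell out the dependence of the variance on the mean, here is the standard Bernoulli identity, writing p(x) for the conditional probability that y==1 given x (the symbol p(x) is my notation, not part of the original question):

```latex
\mathrm{E}(y \mid x) = p(x), \qquad
\mathrm{Var}(y \mid x) = p(x)\,\{1 - p(x)\}
```

So the conditional variance is 0.25 where p(x) = 0.5 but only 0.09 where p(x) = 0.1, and it cannot be constant unless p(x) is the same for every x.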

If you use Huber variances by using the -robust- option of -regress-, then the standard errors will be correctly estimated, so it would not actually be wrong. However, the regression coefficient would be calculated by an inefficient formula, so the confidence intervals would be wider than necessary. The best way to do linear regression with a binary dependent variable is probably to use the -glm- command with the options

family(binomial) link(identity) robust

which will calculate regression coefficients equal to binomial probabilities and their differences. For instance, if you type

glm y x, family(binomial) link(identity) robust

then the intercept will be a baseline probability that y==1 if x is zero, and the slope will be an incremental probability that y==1, given a unit increase in x. These parameters will be calculated by an efficient formula, so the confidence limits will probably be narrower than those created if you type

regress y x,robust

which is not actually wrong for a binary y, but is inefficient.
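A minimal sketch of that comparison on simulated data may make it concrete. It assumes the -glm- options meant above are family(binomial) link(identity); the data, seed, and parameter values (baseline probability 0.2, slope 0.5) are invented for illustration, and runiform() needs a reasonably recent Stata (older releases have uniform() instead).

```stata
* Illustrative simulated data: names and parameter values are assumptions
clear
set seed 12345
set obs 1000
generate x = runiform()
* Pr(y==1 | x) = 0.2 + 0.5*x, which stays inside (0,1) for x in [0,1]
generate byte y = runiform() < 0.2 + 0.5*x

* Linear regression with Huber (robust) variances
regress y x, robust

* Binomial GLM with identity link: same parameters, fitted by maximum likelihood
glm y x, family(binomial) link(identity) robust
```

Both commands estimate the same intercept and slope, so the standard errors and confidence limits can be compared directly. Note that the identity link can fail to converge if the fitted probabilities stray outside the range from 0 to 1.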

If you want the parameters to be probabilities and their ratios, instead of probabilities and their differences, then you can use -glm- with the log link. For instance, if you type

generate byte baseline=1
glm y x baseline, family(binomial) link(log) robust eform noconst

then the intercept parameter will correspond to the variable -baseline-, and represent a baseline probability that y==1 if x is zero, and the slope parameter will correspond to x, and be a relative risk per unit x, assuming that the probability increases exponentially with x. Alternatively, if you want the parameters to be baseline odds and odds ratios, then you can use the -logit- command.
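As a sketch of the two alternatives side by side, the following fits the log-link model and the logit model to the same simulated data. The data, seed, and parameter values (baseline probability 0.2, relative risk exp(0.5) per unit x) are invented for illustration, and runiform() needs a reasonably recent Stata.

```stata
* Illustrative simulated data: names and parameter values are assumptions
clear
set seed 12345
set obs 1000
generate x = runiform()
* Pr(y==1 | x) = 0.2*exp(0.5*x), at most about 0.33 for x in [0,1]
generate byte y = runiform() < 0.2*exp(0.5*x)
generate byte baseline = 1

* Log link: exponentiated parameters are a baseline probability (baseline)
* and a relative risk per unit x (x)
glm y x baseline, family(binomial) link(log) robust eform noconstant

* Logit: the -or- option reports exponentiated coefficients, i.e. odds ratios
logit y x, or
```

The -baseline- variable with -noconstant- is there so that -eform- displays the exponentiated intercept alongside the relative risk.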

I hope this helps.

Roger

--
Roger Newson
Lecturer in Medical Statistics
Department of Public Health Sciences
King's College London
5th Floor, Capital House
42 Weston Street
London SE1 3QD
United Kingdom

Tel: 020 7848 6648 International +44 20 7848 6648
Fax: 020 7848 6620 International +44 20 7848 6620
or 020 7848 6605 International +44 20 7848 6605
Email: roger.newson@kcl.ac.uk
Website: http://www.kcl-phs.org.uk/rogernewson

Opinions expressed are those of the author, not the institution.
