
From: Roger Newson <roger.newson@kcl.ac.uk>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: proportion as a dependent variable
Date: Mon, 14 Jul 2003 14:55:56 +0100

At 14:06 14/07/03 +0200, Ronnie Babigumira wrote:

> I was attending a workshop in which one of the presenters had a regression in which a dependent variable was a proportion. One of the participants noted that it was wrong but didn't follow it up with a clear explanation.

I assume that Ronnie is talking about a homoskedastic linear regression, not a logistic regression. A homoskedastic linear regression model should not be used for binary data, because the conditional variance of a binary dependent variable depends on the conditional mean, i.e. the conditional probability of a success, given the values of the independent variables. The standard errors for a homoskedastic linear regression are computed assuming that the conditional variance is constant, and will therefore be either too large or too small.
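
The dependence of the conditional variance on the conditional mean can be written out explicitly. The following is a short sketch, in which the symbols p(x), \alpha and \beta are notation assumed here for illustration and do not appear in the original message:

```latex
% Binary y with conditional success probability p(x) = P(y = 1 | x)
\begin{align*}
  E(y \mid x)            &= p(x), \\
  \mathrm{Var}(y \mid x) &= p(x)\,\bigl(1 - p(x)\bigr).
\end{align*}
% Under a linear probability model p(x) = \alpha + \beta x:
\[
  \mathrm{Var}(y \mid x) = (\alpha + \beta x)\,(1 - \alpha - \beta x),
\]
% which is a non-constant function of x whenever \beta \neq 0.
```

So unless the slope is zero, the constant-variance assumption behind the ordinary standard errors cannot hold for a binary dependent variable.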

If you use Huber variances by using the -robust- option of -regress-, then the standard errors will be correctly estimated, so it would not be actually wrong. However, the regression coefficient would be calculated by an inefficient formula, so the confidence intervals would be wider than necessary. The best way to do linear regression with a binary dependent variable is probably to use the -glm- command with the options

family(binomial) link(identity) robust

which will calculate regression coefficients equal to binomial probabilities and their differences. For instance, if you type

glm y x,family(binomial) link(identity) robust

then the intercept will be a baseline probability that y==1 if x is zero, and the slope will be an incremental probability that y==1, given a unit increase in x. These parameters will be calculated by an efficient formula, so the confidence limits will probably be narrower than those created if you type

regress y x,robust

which is not actually wrong for a binary y, but is inefficient.
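
One way to see the comparison is to run both commands on simulated data. The following is a minimal sketch, not a definitive demonstration: the sample size, the seed, and the true probabilities (0.2 + 0.5*x) are assumptions made up for illustration, and it assumes a Stata version in which -runiform()- is available. The true probabilities are kept well inside (0,1), because the identity link can fail to converge if fitted probabilities stray outside that range.

```stata
* Simulated data (illustrative only): true P(y==1 | x) = 0.2 + 0.5*x
clear
set seed 12345
set obs 1000
generate double x = runiform()
generate byte y = runiform() < 0.2 + 0.5*x

* Linear regression with Huber (robust) standard errors
regress y x, robust

* Binomial GLM with identity link and robust standard errors
glm y x, family(binomial) link(identity) robust
```

Both commands estimate the same intercept (baseline probability) and slope (probability difference per unit x), so the two sets of confidence limits can be compared directly.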

If you want the parameters to be probabilities and their ratios, instead of probabilities and their differences, then you can use -glm- with the log link. For instance, if you type

gene byte baseline=1

glm y x baseline,family(binomial) link(log) robust eform noconst

then the intercept parameter will correspond to the variable -baseline-, and represent a baseline probability that y==1 if x is zero, and the slope parameter will correspond to x, and be a relative risk per unit x, assuming that the probability increases exponentially with x. Alternatively, if you want the parameters to be baseline odds and odds ratios, then you can use the -logit- command.
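
The three parameterizations can be set side by side. This is a sketch in notation assumed here for illustration (\alpha for the intercept coefficient and \beta for the coefficient of x, both on the scale of the link function), not notation from the original message:

```latex
\begin{align*}
  \text{identity link:} \quad & P(y = 1 \mid x) = \alpha + \beta x \\
  \text{log link:}      \quad & P(y = 1 \mid x) = e^{\alpha}\,\bigl(e^{\beta}\bigr)^{x} \\
  \text{logit link:}    \quad & \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = e^{\alpha}\,\bigl(e^{\beta}\bigr)^{x}
\end{align*}
% identity: \alpha is the baseline probability, \beta a probability difference per unit x
% log:      e^{\alpha} is the baseline probability, e^{\beta} a relative risk per unit x
% logit:    e^{\alpha} is the baseline odds, e^{\beta} an odds ratio per unit x
```

With the log link, the -eform- option displays the exponentiated coefficients, which is why the coefficient of -baseline- is reported as a probability and the coefficient of x as a ratio.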

I hope this helps.

Roger

--

Roger Newson

Lecturer in Medical Statistics

Department of Public Health Sciences

King's College London

5th Floor, Capital House

42 Weston Street

London SE1 3QD

United Kingdom

Tel: 020 7848 6648 International +44 20 7848 6648

Fax: 020 7848 6620 International +44 20 7848 6620

or 020 7848 6605 International +44 20 7848 6605

Email: roger.newson@kcl.ac.uk

Website: http://www.kcl-phs.org.uk/rogernewson

Opinions expressed are those of the author, not the institution.


References: st: proportion as a dependent variable, from "Ronnie Babigumira" <ronnie.babigumira@ios.nlh.no>
