Re: st: proportion as a dependent variable
At 14:06 14/07/03 +0200, Ronnie Babigumira wrote:
I assume that Ronnie is talking about a homoskedastic linear regression,
not a logistic regression. A homoskedastic linear regression model should
not be used for binary data, because the conditional variance of a binary
dependent variable depends on the conditional mean, i.e. the conditional
probability of a success, given the values of the independent variables. The
standard errors for a homoskedastic linear regression are computed assuming
that the conditional variance is constant, and will therefore generally be
either too large or too small.
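The dependence of the variance on the mean can be written out explicitly. As a sketch, writing p(x) for the conditional probability of a success (notation introduced here for illustration):

```latex
\operatorname{Var}(y \mid x) = p(x)\,\{1 - p(x)\}, \qquad p(x) = \Pr(y = 1 \mid x)
```

The variance is largest (0.25) when p(x) = 0.5 and shrinks towards zero as p(x) approaches 0 or 1, so it is constant in x only in special cases, such as when p(x) does not vary with x at all.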
I was attending a workshop in which one of the presenters had a regression
in which a dependent variable was a proportion. One of the participants
noted that it was wrong but didn't follow it up with a clear explanation.
If you use Huber variances by using the -robust- option of -regress-, then
the standard errors will be correctly estimated, so it would not
actually be wrong. However, the regression coefficient would be calculated by
an inefficient formula, so the confidence intervals would be wider than
necessary. The best way to do linear regression with a binary dependent
variable is probably to use the -glm- command with the options
family(binomial) link(identity) robust
which will calculate regression coefficients equal to binomial
probabilities and their differences. For instance, if you type
glm y x, family(binomial) link(identity) robust
then the intercept will be a baseline probability that y==1 if x is zero,
and the slope will be an incremental probability that y==1, given a unit
increase in x. These parameters will be calculated by an efficient formula,
so the confidence limits will probably be narrower than those produced by
regress y x, robust
which is not actually wrong for a binary y, but is inefficient.
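To make the interpretation of the intercept and slope concrete, here is a small sketch in Python rather than Stata, using made-up counts for a binary x. With a binary x the least-squares line passes through the two group proportions, so the fitted intercept is the proportion with y==1 at x==0 and the slope is the difference in proportions. The data and the function names are hypothetical, and the sketch illustrates only the meaning of the parameters, not the standard errors or the efficiency comparison.

```python
# Hypothetical grouped data: (x, y, count). The counts are made up.
cells = [(0, 0, 80), (0, 1, 20), (1, 0, 60), (1, 1, 40)]


def ols_fit(cells):
    """Least-squares intercept and slope of y on x from grouped counts."""
    n = sum(c for _, _, c in cells)
    mean_x = sum(x * c for x, _, c in cells) / n
    mean_y = sum(y * c for _, y, c in cells) / n
    sxy = sum((x - mean_x) * (y - mean_y) * c for x, y, c in cells)
    sxx = sum((x - mean_x) ** 2 * c for x, _, c in cells)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def proportion(cells, x_value):
    """Observed proportion with y == 1 among observations with x == x_value."""
    total = sum(c for x, _, c in cells if x == x_value)
    successes = sum(c for x, y, c in cells if x == x_value and y == 1)
    return successes / total


intercept, slope = ols_fit(cells)
p0 = proportion(cells, 0)  # baseline probability that y == 1
p1 = proportion(cells, 1)  # probability that y == 1 when x == 1

print("intercept:", intercept, "baseline proportion:", p0)
print("slope:", slope, "difference in proportions:", p1 - p0)
```

For a continuous x the slope is no longer a simple difference between two observed proportions, but it keeps the same reading: the change in the probability that y==1 per unit increase in x.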
If you want the parameters to be probabilities and their ratios, instead of
probabilities and their differences, then you can use -glm- with the log
link. For instance, if you type
generate byte baseline=1
glm y x baseline, family(binomial) link(log) robust eform noconstant
then the intercept parameter will correspond to the variable -baseline-,
and represent a baseline probability that y==1 if x is zero, and the slope
parameter will correspond to x, and be a relative risk per unit x,
assuming that the probability increases exponentially with x.
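As a rough illustration of what the exponentiated parameters mean under the log link, the following Python sketch works through made-up counts for a binary x. On the log scale the model is log p = b_baseline + b_x * x, so exponentiating the two parameters gives a baseline probability and a relative risk. The counts and variable names are hypothetical, and the parameters are computed directly from the observed proportions rather than by fitting a model.

```python
import math

# Hypothetical counts, made up for illustration: successes and totals by x.
successes = {0: 20, 1: 40}
totals = {0: 100, 1: 100}

# Observed probability that y == 1 at each value of x.
risk = {x: successes[x] / totals[x] for x in totals}

# Parameters on the log scale: log p = b_baseline + b_x * x.
b_baseline = math.log(risk[0])
b_x = math.log(risk[1]) - math.log(risk[0])

baseline_risk = math.exp(b_baseline)  # exponentiated baseline parameter
relative_risk = math.exp(b_x)         # exponentiated slope

# The model reproduces the risk at x == 1 as baseline risk times relative risk.
predicted_risk_x1 = baseline_risk * relative_risk

print("baseline risk:", baseline_risk)
print("relative risk:", relative_risk)
print("predicted risk at x == 1:", predicted_risk_x1)
```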
Alternatively, if you want the parameters to be baseline odds and odds
ratios, then you can use the -logit- command.
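For comparison, the odds-based parameters can be sketched in the same way. The Python snippet below uses made-up counts for a binary x and computes the baseline odds and the odds ratio directly from the observed proportions. The counts and names are hypothetical, and no model is fitted.

```python
# Hypothetical counts, made up for illustration: successes and totals by x.
successes = {0: 20, 1: 40}
totals = {0: 100, 1: 100}


def odds(x):
    """Observed odds that y == 1 at the given value of x: p / (1 - p)."""
    p = successes[x] / totals[x]
    return p / (1 - p)


baseline_odds = odds(0)          # odds that y == 1 when x == 0
odds_ratio = odds(1) / odds(0)   # odds ratio for x == 1 versus x == 0

print("baseline odds:", baseline_odds)
print("odds ratio:", odds_ratio)
```

With these counts the probability doubles between x==0 and x==1, but the odds ratio is larger than 2, which is a reminder that odds ratios and relative risks are different quantities unless the outcome is rare.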
I hope this helps.
Lecturer in Medical Statistics
Department of Public Health Sciences
King's College London
5th Floor, Capital House
42 Weston Street
London SE1 3QD
Tel: 020 7848 6648 International +44 20 7848 6648
Fax: 020 7848 6620 International +44 20 7848 6620
or 020 7848 6605 International +44 20 7848 6605
Opinions expressed are those of the author, not the institution.