The Stata listserver

st: RE: Proportion as a dependent variable

From   "Nick Cox" <>
To   <>
Subject   st: RE: Proportion as a dependent variable
Date   Thu, 17 Jul 2003 16:29:36 +0100

Andrew J. Vickers

> Ronnie Babigumira asked whether linear regression was appropriate
> for a proportion. Many wrote back to point out that proportions
> involved binary data and linear regression is for continuous
> outcomes. Ronnie then clarified that the proportion was a single
> value between 0 and 1 for each observation, in this case, the
> percentage of field space allocated to new variety maize for each
> farmer.
>
> My tuppence, with an open call for comment, is that many areas in
> medical research and psychometrics have similar properties to the
> problem Ronnie raises. For example, pain is often measured on a
> 0-100 scale; quality of life scales such as the SF36 convert
> various numerical scores into a proportion of the maximum score to
> give a quality of life between 0 and 100. Biostatisticians have
> used linear regression for many years without worrying too much
> about it, unless there was a particular reason: as Nick Cox put it,
> it all depends on the data and the use to which it is being put. If
> the dependent variable is normally distributed with a mean of 0.5
> and an SD of 0.1, linear regression is probably going to work fine.
> If the dependent variable has many 0s and/or 1s, as might well be
> the case with the maize data, you might have a problem, in
> particular that your regression will make out-of-sample
> predictions. My guess is that with the maize data, differences
> between, say, 55% and 65% are neither important nor likely, as
> farmers will plant certain whole areas with a particular crop. Thus
> you could categorize the data into quartiles (0-24.9%, 25-49.9%,
> 50-74.9%, 75-100%) and then do an ordinal regression.

I was agreeing strongly with Andrew until his very last suggestion.

Given measured proportions as a response, it would seem far too
pessimistic to degrade the data to ordinal classes.

To repeat and expand a point made earlier, there is a clear
difference between the possibility that individual data points
attain the limits of 0 and 1 (0%, 100%) and what the mean
response is doing. The link in a generalised linear model is
applied to the mean response, not to the individual data points.
So using a logit link is not affected by the fact that logit(0)
and logit(1) are indeterminate, any more than this is a problem
for binary responses as handled by -logit- and -probit-.
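
A minimal sketch of that point, on simulated data rather than the
maize data: the variable names y and x, the coefficients and the
sample size are all made up for illustration. The response is a
proportion that does hit 0 and 1 exactly, yet -glm- with a logit
link fits without complaint beyond a note about noninteger values,
and the fitted means stay inside (0, 1).

```stata
* Simulated data; y, x and the coefficients are hypothetical
clear
set seed 2003
set obs 500
generate x = rnormal()

* Proportions out of 10 trials, so exact 0s and 1s do occur
generate y = rbinomial(10, invlogit(0.5 + 1.5*x)) / 10
count if inlist(y, 0, 1)

* The logit link acts on the mean of y, not on each observed y
glm y x, family(binomial) link(logit) vce(robust)

* Fitted means; compare their range with that of y
predict yhat, mu
summarize y yhat
```

The robust standard errors are there because a measured proportion
need not have binomial variance. In Stata 14 or later,
-fracreg logit y x- should fit the same mean model.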

