# st: RE: Proportion as a dependent variable

 From "Nick Cox" To Subject st: RE: Proportion as a dependent variable Date Thu, 17 Jul 2003 16:29:36 +0100

```
Andrew J. Vickers

> Ronnie Babigumira asked whether linear regression was appropriate
> for a proportion. Many wrote back to point out that proportions
> involved binary data and linear regression is for continuous
> outcomes. Ronnie then clarified that the proportion was a single
> value between 0 and 1 for each observation, in this case, the
> percentage of field space allocated to new variety maize for each
> farmer.
>
> My tuppence, with an open call for comment, is that many areas in
> medical research and psychometrics have similar properties to the
> problem Ronnie raises. For example, pain is often measured on a
> 0 - 100 scale; quality of life scales such as the SF36 convert
> various numerical scores into a proportion of the maximum score to
> give a quality of life between 0 and 100. Biostatisticians have used
> linear regression for many years without worrying too much about it,
> unless there was a particular reason: as Nick Cox put it, it all
> depends on the data and the use to which it is being put. If the
> dependent variable is normally distributed with a mean of 0.5 and an
> SD of 0.1, linear regression is probably going to work fine. If the
> dependent variable has many 0's and / or 1's, as might well be the
> case with the maize data, you might have a problem, particularly
> that your regression will make out-of-sample predictions. My guess
> is that with the maize data, differences between, say, 55% and 65%
> are neither important nor likely, as farmers will plant certain
> whole areas with a particular crop. Thus you could categorize the
> data into quartiles (0-24.9%, 25%-49.9%, 50%-74.9%, 75%-100%) and
> then do an ordinal regression.

I was agreeing strongly with Andrew until his very last
suggestion.

Given measured proportions as a response, it would seem far too
pessimistic to degrade the data to ordinal classes.

To repeat and expand a point made earlier, there is a clear
difference between the possibility that individual data points
attain the limits of 0 and 1 (0%, 100%) and what the mean
response is doing. Using a logit as link in a generalised
linear model is not affected by the fact that logit(0) and logit(1)
are undefined, because the link is applied to the mean response
and not to the individual data points. This is no more of a
problem than it is for binary responses as handled by -logit-
and -probit-.

Nick
n.j.cox@durham.ac.uk

```
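
Nick's point about the logit link can be illustrated with a short Stata sketch. Everything in it is an assumption made for illustration and not taken from the thread: the simulated data, the variable names `share` and `x`, the cutoffs used to create exact 0s and 1s, and the choice of robust standard errors. It fits a binomial-family generalised linear model with a logit link to a proportion response that includes boundary values, which works because the link is applied to the fitted mean and never to the observed values.

```stata
* Hypothetical data: a proportion response with some exact 0s and 1s
clear
set seed 12345
set obs 200
generate x = rnormal()
generate share = invlogit(0.5 + x + rnormal())
replace share = 0 if share < 0.1
replace share = 1 if share > 0.9

* GLM with logit link; logit() is never evaluated at the observed 0s and 1s
glm share x, family(binomial) link(logit) vce(robust)

* Fitted means stay strictly inside (0, 1)
predict mu_hat, mu
summarize share mu_hat
```

Stata typically prints a note that the response has noninteger values and then proceeds with estimation. Comparing the ranges of `share` and `mu_hat` in the summary shows the distinction Nick draws: individual observations reach 0 and 1, while the modelled mean does not.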