
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: RE: Proportion as a dependent variable
Date: Thu, 17 Jul 2003 16:29:36 +0100

Andrew J. Vickers

> Ronnie Babigumira asked whether linear regression was appropriate for a
> proportion. Many wrote back to point out that proportions involved
> binary data and linear regression is for continuous outcomes. Ronnie
> then clarified that the proportion was a single value between 0 and 1
> for each observation, in this case, the percentage of field space
> allocated to new variety maize for each farmer.
>
> My tuppence, with an open call for comment, is that many areas in
> medical research and psychometrics have similar properties to the
> problem Ronnie raises. For example, pain is often measured on a 0-100
> scale; quality of life scales such as the SF36 convert various numerical
> scores into a proportion of the maximum score to give a quality of life
> between 0 and 100. Biostatisticians have used linear regression for many
> years without worrying too much about it, unless there was a particular
> reason: as Nick Cox put it, it all depends on the data and the use to
> which it is being put. If the dependent variable is normally distributed
> with a mean of 0.5 and an SD of 0.1, linear regression is probably going
> to work fine. If the dependent variable has many 0's and/or 1's, as
> might well be the case with the maize data, you might have a problem,
> in particular that your regression will make out of sample predictions. My
> guess is that with the maize data, differences between, say, 55% and 65%
> are neither important nor likely, as farmers will plant certain whole
> areas with a particular crop. Thus you could categorize the data into
> quartiles (0-24.9%, 25-49.9%, 50-74.9%, 75-100%) and then do an
> ordinal regression.

I was agreeing strongly with Andrew until his very last suggestion. Given measured proportions as a response, it would seem far too pessimistic to degrade the data to ordinal classes.

To repeat and expand a point made earlier, there is a clear difference between the possibility that individual data points attain the limits of 0 and 1 (0, 100%) and what the mean response is doing. Using a logit as link in a generalised linear model is not affected by the fact that logit(0) and logit(1) are indeterminate, any more than this is a problem for binary responses as handled by -logit- and -probit-.

Nick
n.j.cox@durham.ac.uk
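
The point about the link function can be checked numerically. What follows is a minimal sketch, not anything from the original thread: it fits a one-predictor model with a logit link by Newton-Raphson, using only the Python standard library. The data values and the names `fit_logit_link`, `x` and `y` are invented for illustration; `y` deliberately contains exact 0s and 1s. The fit goes through because the logit is applied to the modelled mean, which stays strictly inside (0, 1), and never to the observed proportions themselves.

```python
import math


def fit_logit_link(x, y, max_iter=50, tol=1e-10):
    """Fit E[y] = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson.

    Uses the Bernoulli quasi-likelihood score equations, which only
    involve y - mu, so y values of exactly 0 or 1 cause no difficulty.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # score vector
        h00 = h01 = h11 = 0.0    # information matrix X'WX
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu
            g1 += (yi - mu) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system (X'WX) d = score for the Newton step d.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical proportions, including observations at both limits.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.0, 0.0, 0.1, 0.3, 0.5, 0.8, 1.0, 1.0]

b0, b1 = fit_logit_link(x, y)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]

print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("fitted means:", [round(m, 3) for m in fitted])
```

The fitted means approach 0 and 1 at the ends of the range without reaching them, which is the distinction drawn above between what individual data points do and what the mean response does. In Stata the analogous fit would presumably be obtained from -glm- with a binomial family and logit link; that is a suggestion on my part and not a command quoted from the thread.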

**References**: **st: Proportion as a dependent variable** *From:* "Vickers, Andrew J./Integrative Medicine" <vickersa@mskcc.org>
