The Stata listserver

RE: st: RE: RE: RE: proportion as a dependent variable

From: Roger Newson
Subject: RE: st: RE: RE: RE: proportion as a dependent variable
Date: Mon, 14 Jul 2003 22:05:40 +0100

At 20:22 14/07/03 +0200, Ronnie Babigumira wrote:
Dear Laurel, Nick, Roger, Joao Pedro, Giorgio and Todd,
Many thanks for your comments. To put things in perspective, the presenter
was studying new maize varieties and sought to identify some socioeconomic
factors that may explain the adoption of these varieties. All respondents in
her sample grew some maize (traditional, improved or both), so her dependent
variable was area under improved varieties (which would then be handled
easily in a censored regression framework or, better still, as a corner
solution outcome). However, she argued that the area allocated needed to be
adjusted for total area under maize (someone with 1 acre of maize who
allocates 0.5 acres to the improved varieties should not be treated, in terms
of adoption, the same as someone with 10 acres of maize land who also
allocates 0.5 acres). Hence the dependent variable was total area under new
maize / total maize area (hence the proportion).

Laurel's email would imply that all the independent variables should
also be divided by the maize area, while Nick's email points out (correctly)
that while the dependent variable lies between 0 and 1, using OLS does not
guarantee that the predicted values of y will lie between 0 and 1 (which is
one of the main arguments against the Linear Probability Model). Roger
points to a binary dependent variable; however, the dependent variable here is
not quite binary. Joao Pedro suggests something that the presenter actually
did, while I still need to think through Giorgio's suggestion, and I am just
going to read through the paper suggested by Todd.

In the light of the "added flesh" to the problem, I would appreciate your
comments on the best way to proceed (for example, would just including the
total maize area as one of the independent variables be a sufficient

If the Y-variable is a proportion rather than a binary variable, then you can still use either -regress- with Huber variances, or -glm- with identity link and binomial family, or even -glm- with log link and binomial family if you want multiplicative effects. The -glm- command will warn you that your Y-variable is not binary, but will still do as it is asked.

The main problem with homoskedastic (equal-variance) linear regression is that, if the Y-variable is a proportion, then the conditional variance is not likely to be independent of the conditional mean, because proportions sampled from a distribution with a mean near 0.5 can vary more than proportions sampled from a distribution with a mean near 0 or 1. The -family- option of -glm- simply optimises the estimation under a particular assumption about the mean-variance relationship, in order to minimise the width of the confidence intervals if that assumption is true. If you also use the -robust- option, then your standard errors will still be consistent, even if you do not guess the mean-variance relationship right first time.

I myself would probably not simply use area under new maize as the Y-variable and area under total maize as an X-variable, because I would expect the effect of total maize area on area under new maize to be multiplicative rather than additive.
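
As a rough sketch of the three options mentioned above, the commands might look something like the following. The variable names (newarea, maizearea, educ, prop) are hypothetical, and the data are simulated only so that the commands can be run as they stand; they say nothing about the real survey. The sketch assumes a reasonably recent Stata, where runiform() and vce(robust) are available; older releases would use uniform() and the -robust- option instead.

```stata
* Simulated stand-in data (all names and values are illustrative assumptions)
clear
set seed 12345
set obs 200
* total maize area in acres
generate maizearea = 1 + 9*runiform()
* a made-up socioeconomic covariate
generate educ = floor(12*runiform())
* area under improved maize, never more than total maize area
generate newarea = maizearea*runiform()
* the proportion used as the Y-variable
generate prop = newarea/maizearea

* 1. Linear regression with Huber (robust) variances
regress prop educ maizearea, vce(robust)

* 2. GLM, binomial family, identity link: additive effects on the proportion
glm prop educ maizearea, family(binomial) link(identity) vce(robust)

* 3. GLM, binomial family, log link: multiplicative effects on the proportion
glm prop educ maizearea, family(binomial) link(log) vce(robust)
```

With the identity or log link, fitted values are not forced to stay inside 0 and 1, so it may be worth checking them with -predict- after estimation.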

I hope this helps.


Roger Newson
Lecturer in Medical Statistics
Department of Public Health Sciences
King's College London
5th Floor, Capital House
42 Weston Street
London SE1 3QD
United Kingdom

Tel: 020 7848 6648 International +44 20 7848 6648
Fax: 020 7848 6620 International +44 20 7848 6620
or 020 7848 6605 International +44 20 7848 6605

Opinions expressed are those of the author, not the institution.
