# RE: st: RE: Dependent var is a proportion, with large spike in .95+

 From jverkuilen To Subject RE: st: RE: Dependent var is a proportion, with large spike in .95+ Date Thu, 4 Sep 2008 12:04:47 -0400

```Nick Cox <n.j.cox@durham.ac.uk> wrote:

#My take differs from anybody else! From what you say, this is not a
#spike. It is just strong skewness.

After the results my coauthor sent me last night, I am inclined to agree. He fit mixture models to some endpoint-skewed DVs. The mixture always went to 0. We are planning some sims to test this, but the big problem is that a mixture with a true endpoint mass is hard to distinguish from a bimodal beta.

#A spike in my book is a big group of identical values, in this context
#usually lots of exact zeros or exact ones (or 100%s, naturally).

Interior spikes seem to be the real trouble, e.g., one on 0.5.

#A good approximation is that if you take logits of a beta-distributed
#variable, the distribution looks bell-shaped. That's true even for
#highly skewed betas with modes near 0 or near 1.

Yes, so long as the distribution is not J- or L-shaped, which can happen with the beta. It can handle those shapes and endpoint bimodality too.

#However, if you have any exact zeros or ones, you can't take logits, and
#equivalently you can't really fit a beta. You need either a fudge that
#denies that the zeros or ones really are that or a mixture model such as
#others are referring to.

Right. The beta likelihood is relatively insensitive to transformations that pull exact 0 or 1 observations into (0,1). I have gotten to the point where I just do it using

Y_new = 1/(2n) + (1 - 1/n)*Y_old.

But the choice of cheating factor is ultimately not very important thankfully.

Also I should note that a histogram is a crummy tool for identifying spikes unless the sample size is very large and the spike is distinct. Try the ECDF or the frequency table.

Nick [not Nic]
n.j.cox@durham.ac.uk

Dan Weitzenfeld

I am trying to determine which testing factors drive a proportion
dependent variable, PercentNoise.
In searching the archives, I came across -betafit-, and a link to the
FAQ: "How do you fit a model when the dependent variable is a
proportion?"  In that response, Allen McDowell and Nic Cox write, "In
practice, it is often helpful to look at the frequency distribution: a
marked spike at zero or one may well raise doubt about a single model
fitted to all data."
That describes my situation exactly:  I have a marked spike in my
histogram at the top bin, roughly .95 - 1.  I am wondering how to
account for this.
Does -betafit- take such a possibility into account?
Can someone briefly describe how I could use multiple models to fit
all the data, as implied in the FAQ response?
My fallback is setting a pass/fail bar and converting my proportions
to a binary, then using probit/logit.  But the obvious drawback is
that I am throwing away information by collapsing the continuous
(albeit bounded) proportion variable to a binary.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

```
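For readers who want to try the two practical suggestions in this thread (checking a frequency table for exact endpoint values, and nudging exact 0s and 1s into the open interval before fitting a beta), here is a minimal sketch. It is written in Python rather than Stata, purely for illustration. The function names `squeeze` and `top_values` and the sample data are hypothetical and do not come from the thread. The rescaling shown is one common version, assumed here to map 0 to 1/(2n) and 1 to 1 - 1/(2n), with n the sample size.

```python
from collections import Counter


def squeeze(y, n=None):
    """Pull proportions in [0, 1] into the open interval (0, 1).

    Maps 0 -> 1/(2n) and 1 -> 1 - 1/(2n); n defaults to the sample size.
    (Hypothetical helper name, not a Stata or library function.)
    """
    values = list(y)
    if n is None:
        n = len(values)
    if any(v < 0 or v > 1 for v in values):
        raise ValueError("all values must lie in [0, 1]")
    return [1 / (2 * n) + (1 - 1 / n) * v for v in values]


def top_values(y, k=5, digits=6):
    """Frequency table of the k most common values, after rounding.

    A large count on a single value (e.g. exactly 1.0) suggests a true
    spike; a crowded top bin with few ties suggests skewness instead.
    """
    return Counter(round(v, digits) for v in y).most_common(k)


# Made-up data: one exact zero, three exact ones, the rest interior.
y_old = [0.0, 0.42, 0.87, 0.95, 0.97, 1.0, 1.0, 1.0]

print(top_values(y_old, k=3))   # most frequent values and their counts

y_new = squeeze(y_old)
print(min(y_new), max(y_new))   # both now strictly inside (0, 1)
```

As noted in the thread, the exact choice of adjustment is reportedly not very important, so treat the constants here as one reasonable option rather than a recommendation.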