I say wrong to the first part, whether it is based
on a feeling or on logic. (Which is it, Tim?)
I think it's pretty much wired in that Poisson,
negative binomial, etc., really are for counts.
For example, comparison of variance and mean
depends on the fact that both are, as it were,
pure numbers, i.e. lacking in units or dimensions.
If your variable has some units, and is not
counted, then the variance and the mean don't
even have the same dimensions. The fact that
the units seem ill-defined in this example
doesn't clarify the issue here, but I don't
think it removes this difficulty. For example,
suppose out of whim I decide the original units
should be multiplied by 10. Then the mean
goes up by a factor of 10, and the variance by one
of 100. Thus how close the variance/mean ratio is to 1,
and also how close the distribution is to Poisson, etc.,
depend on whim about the units. By contrast,
there is no whim about counts. They just are.
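
A minimal sketch of the units argument in Python, for anyone who wants to check the arithmetic. The scores are made up for illustration and `variance_mean_ratio` is just a helper name chosen here; nothing in it comes from the data under discussion.

```python
import statistics

# Hypothetical scores in their "original" units (invented for illustration).
values = [0, 1, 1, 2, 3, 5, 8]

# The same scores after deciding, on a whim, to multiply the units by 10.
rescaled = [10 * v for v in values]

def variance_mean_ratio(xs):
    """Sample variance divided by sample mean."""
    return statistics.variance(xs) / statistics.mean(xs)

ratio_original = variance_mean_ratio(values)
ratio_rescaled = variance_mean_ratio(rescaled)

# The mean scales by 10 and the variance by 100, so the ratio scales by 10.
print(ratio_original, ratio_rescaled)
```

So the variance/mean ratio is whatever the choice of units makes it, which is the difficulty with reading it as evidence for or against Poisson when the variable is not a count.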
Median regression is a different ball game.
I wouldn't expect median regression to work
well with a variable with a lumpy distribution,
as this appears to be, but that's more a practical point than
an objection in principle.
Timothy Mak
My feeling is: There's no particular reason we can only use Poisson or
NBin on count data. Surely the important thing is that the distribution
matches, right? In Poisson or NBin regression, we express results in terms
of Incidence Rate Ratio, which I guess only makes sense if you're thinking
of events happening. But what about calling it 'mean ratios', as
effectively they are just that?
I have no backing from any reference or anything, but just thinking
logically (I feel), that is what I would conclude. Richard, you don't
agree with using count-type regression techniques on non-count data. Why
is that?
But concerning Matthew's situation, my feeling is: even NegBin or Poisson
may not give a very good approximation of the distribution. (Thanks
Matthew for sending me the references, BTW). Often a weird distribution is
the result of a mixture of distributions, which is why people have come
up with zero-inflated models, so that is what I would suggest
Matthew try first, to see whether the data can be modelled separately,
first with 0 vs greater than 0, and then model the 'greater than 0' data
using whatever means.
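
A rough sketch of that two-part split in Python, assuming made-up scores. The second part is summarised here by a plain mean, standing in for whatever model is eventually chosen; it is not a fitted zero-inflated model.

```python
import statistics

# Hypothetical questionnaire scores with a lump at zero (invented data).
scores = [0, 0, 0, 0, 2, 3, 3, 5, 8, 13]

# Part 1: zero versus greater than zero.
is_positive = [s > 0 for s in scores]
prob_positive = sum(is_positive) / len(scores)

# Part 2: only the 'greater than 0' data, summarised by a simple mean
# as a placeholder for a proper model of the positive values.
positives = [s for s in scores if s > 0]
mean_positive = statistics.mean(positives)

# The two parts recombine to give the overall mean.
overall_mean = prob_positive * mean_positive
print(prob_positive, mean_positive, overall_mean)
```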
On a slightly separate note, has anyone considered the use of median
regression for this sort of data? Using bootstrapping with median
regression does not require any assumption of the distribution of the
dependent variable. Apart from the difficulty in convergence, I can't
think of any disadvantage. Does anyone know of any work that has made use
of it (on questionnaire type data)?
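
To make the bootstrap idea concrete, here is a simplified stand-in in Python: a percentile bootstrap for the difference in medians between two groups, which is roughly what median regression on a single binary predictor estimates. It is not median regression itself, and the data, the function name `bootstrap_median_diff`, the number of replications and the seed are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_median_diff(group_a, group_b, reps=2000, seed=12345):
    """Approximate 95% percentile bootstrap interval for
    median(group_b) - median(group_a)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(reps):
        # Resample each group with replacement, keeping its size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.median(resample_b)
                     - statistics.median(resample_a))
    diffs.sort()
    lower = diffs[int(0.025 * reps)]
    upper = diffs[int(0.975 * reps) - 1]
    return lower, upper

# Hypothetical lumpy scores for two groups (invented data).
group_a = [0, 0, 1, 1, 2, 2, 3, 5]
group_b = [0, 2, 3, 3, 4, 6, 8, 9]

observed = statistics.median(group_b) - statistics.median(group_a)
lower, upper = bootstrap_median_diff(group_a, group_b)
print(observed, lower, upper)
```

No distribution is assumed for the scores, but with lumpy data the resampled medians can only take a few distinct values, so the interval tends to be coarse.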