Statalist



RE: st: Quantile Regression with a skewed and zero-inflated dependent variable?


From   jverkuilen <jverkuilen@gc.cuny.edu>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Quantile Regression with a skewed and zero-inflated dependent variable?
Date   Mon, 4 Aug 2008 11:02:59 -0400

A few things:

(1) QR doesn't like ties, so that is where zero inflation gets nasty. But you aren't modeling the lower tail, and QR doesn't consider the magnitude of discrepancies, if I recall correctly, just the signs. Why not model, say, the median, the 75th percentile and the 90th percentile? (I.e., throw the reviewer a nice juicy bone.) As to whether it turns into a logistic regression problem when you model any given quantile, I don't think so, but that would be resolved by considering the likelihood functions, and my copy of Koenker's book is not here.

(2) You can run a noninteger response through -zinb- or -zip-. Stata will complain, but it will give you answers that aren't nuts... usually. I have done this to impose a flattening constant in an -nbreg- I ran a while back.

(3) You could make a zero-inflated model for the gamma or inverse Gaussian (not sure about the latter) using -gllamm-. Partha Deb's mixture program could also do that, I believe. Then have two classes, one a degenerate (or near-degenerate) distribution and the other free. Worth a try.
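
On point (1), here is a minimal sketch of why the zeros need not disturb an upper quantile. It is plain Python rather than Stata, and it covers only the intercept-only case: the fitted "model" is a single number chosen to minimize the check (pinball) loss. The data are made up for illustration (26 zeros out of 201, about 13%, plus right-skewed positive values); the names check_loss and sample_quantile are hypothetical helpers, not anything from Stata or from Koenker's book.

```python
import math

def check_loss(u, tau):
    """Check (pinball) loss for a residual u at quantile level tau."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def sample_quantile(ys, tau):
    """Data point minimizing total check loss: the intercept-only QR fit."""
    candidates = sorted(set(ys))
    return min(candidates, key=lambda q: sum(check_loss(y - q, tau) for y in ys))

# Hypothetical data: 26 zeros plus 175 right-skewed positive delays,
# generated deterministically from exponential quantiles with mean 10.
positives = [-10.0 * math.log(1.0 - (i + 0.5) / 175) for i in range(175)]
with_zeros = [0.0] * 26 + positives
# Same data, but the lower-tail point mass is moved from 0 to 5.
shifted = [5.0] * 26 + positives

q90_with_zeros = sample_quantile(with_zeros, 0.9)
q90_shifted = sample_quantile(shifted, 0.9)
mean_with_zeros = sum(with_zeros) / len(with_zeros)
mean_shifted = sum(shifted) / len(shifted)

print("90th percentile:", q90_with_zeros, q90_shifted)
print("mean:           ", mean_with_zeros, mean_shifted)
```

Moving the point mass around below the fitted value leaves the 90th percentile untouched while the mean moves, because the minimizer depends only on how many observations fall on each side of it. With covariates the same logic applies to residual signs around the fitted line, but this sketch does not show that case.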

 
-----Original Message-----
From: "Allan Garland" <agarland@exchange.hsc.mb.ca>
To: statalist@hsphsun2.harvard.edu
Sent: 8/4/2008 10:03 AM
Subject: st: Quantile Regression with a skewed and zero-inflated dependent variable?

I am working on a problem that involves multivariable modeling of:

Y = a time delay that is not only right-skewed but also has a
fairly large probability mass at 0 (i.e., 13% of subjects have Y=0).

In particular, I'm interested in the independent variables associated
with unusually long values of Y.  So, I decided to create a QR model of
the 90th conditional percentile of Y.  I did not use a logistic
regression approach (after dichotomizing Y at some arbitrary
unconditional cutpoint that represents a  "long" delay) because of the
known problems with that approach (MacCallum R, Zhang S, Preacher K,
Rucker D. On the practice of dichotomization of quantitative variables.
Psychological Methods 2002;7(1):19-40).  

Here are 2 of the reviewer's comments for this paper:  

1. The real virtue of quantile regression, as argued by its author, is
to explore covariate effect by estimating an entire  family of
conditional quantile functions, albeit this has an  implicit ordinal
aspect [R. Koenker and K. F. Hallock. Quantile  regression.  Journal of
Economic Perspectives 15 (4):143-156,  2001].  There may be heuristic
value in using a selective quantile  regression, but this would seem to
reproduce the problem of logistic regression at a different level.
Moreover, quantile regression would presumably share the difficulty of
linear regression in explicitly modelling covariate effect at zero
probability. Such is not the case with zero-adjusted estimators within
the GLM family, as below.

2. The authors could consider (i) a  count-data approach [Y could be
expressed in integer hours;  fractional hours may be subject to
measurement error] and the various zero-inflated count estimators
available in Stata or (ii) for a continuous data approach, modelling via
zero-adjusted  estimators within generalized linear models (GLM), using,
say, the  inverse-Gaussian or gamma distribution both of which have
found  utility in modelling skewed distributions  [P. de Jong and G. Z.
Heller. Generalized Linear Models for Insurance Data, Cambridge,
UK:Cambridge University Press, 2008]. 

My question is:  Is he correct?  Specifically, I am uncertain about the
validity of the criticisms of using QR that he raises in #1.  I don't
dispute (as he indicates in #2) that alternative statistical approaches
are available for this question, but I still believe that a model of the
90th percentile is a legitimate approach to this question for this data
set and variable distribution.  

Thus, I'd appreciate anyone's thoughts on: (a) the reviewer's criticisms
of using QR for this purpose and in this manner, and (b) the "best"
approach to this multivariable modeling problem.

Thanks,
Allan

------------------------------------------------------------------------
--
Allan Garland, MD, MA
Associate Professor of Medicine & 
        Community Health Sciences
University of Manitoba
Health Sciences Center - GF 222
820 Sherbrook Street
Winnipeg, Manitoba R3A 1R9
phone:  204-787-1198
page: 204-935-2166
fax: 204-787-1087
email: agarland@hsc.mb.ca





