
Re: st: RE: AW: Sample selection models under zero-truncated negative binomial models

Subject   Re: st: RE: AW: Sample selection models under zero-truncated negative binomial models
Date   Fri, 5 Jun 2009 13:06:27 -0400

I agree with Austin. With the retrospective design, there is no
natural "first" visit or start time.  As such, a single visit isn't a
privileged event. I don't see a role for a time-to-event approach.

I, like the other responders, assumed a prospective design. Posters:
please describe your study designs in detail!  You will save all of us
a lot of time.


On Fri, Jun 5, 2009 at 12:50 PM, Austin Nichols wrote:
> Tony:
> I suppose the model required depends on what question the poster
> wishes to answer, but there is no clear advantage of a logit or probit
> over a poisson in this case unless you have no interest in the
> variation in positive outcomes, or you suspect overdispersion is a
> serious issue even conditional on X (in which case you have other count
> models to use).  Note that heteroskedasticity, measurement error in the
> binary outcome, and individual heterogeneity are all a much bigger deal
> in the logit/probit world.
> It may be that having no visits seems different from having one or
> more but having one visit also seems different from having two or
> more.  Where does that reasoning stop?  If your expected number of
> visits conditional on X is 0.01 then odds are you have no visits this
> month; you might have one, but you are very unlikely to have six.  If
> your expected number of visits conditional on X is 1 then odds are
> still good you have no visits this month; you might have one, and you
> are not terribly unlikely to have six.  The reasoning all gets easier
> in a poisson model IMHO.
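The two cases above can be checked directly from the Poisson probability mass function, P(Y = k) = exp(-mu) * mu^k / k!. What follows is only a minimal Python sketch of that arithmetic; the helper name `poisson_pmf` is illustrative and does not come from the thread.

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson count with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

# Expected visits of 0.01 and 1, as in the message above
for mu in (0.01, 1.0):
    probs = {k: poisson_pmf(k, mu) for k in (0, 1, 6)}
    print(mu, probs)
```

With a mean of 0.01, about 99% of people have zero visits and six visits is essentially impossible. With a mean of 1, about 37% still have zero visits, and roughly 0.05% have six.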
> A "preponderance of zeros" just means the conditional mean exp(Xb) is low, as is to be
> expected.  All too often, the long right tail is predictable from
> various X variables in the data, so conditional on X, the poisson
> variance may be closer to correct; if it isn't, you may need a richer
> model!  Or program up the "Flexible Regression Model for Count Data"
> (Kimberly F. Sellers and Galit Shmueli) with under- and
> overdispersion.
> Above all, why try to implement some kind of selection correction when
> you can just avoid the selection in the first place?
> On Fri, Jun 5, 2009 at 12:28 PM, Lachenbruch, Peter wrote:
>> I think the situations may be distinct:  having no hospital visits seems different from having one or more.  If these are not part of a mixture distribution (i.e., 0 visits is identifiable) one can estimate the probability of a person having 0 visits and then the count of number of non-zero visits.  If not identifiable, one can use zero-inflated Poisson or zero-inflated negative binomial.
>> The problem seems to separate naturally into the two parts. If you want a mean number of visits you can get it, but I'm unsure of the interpretation since there's a fraction that don't have any visits that is greater than that expected under the Poisson model.  In one dissertation, a student had 95% zeros and the rest were positive.  The idea was to predict costs of hospitalization - this had big implications for insurance companies.  In this case, the likelihood of finding hospitalization in a household survey may also have a preponderance of zeros.
>> Tony
>> Peter A. Lachenbruch
>> Department of Public Health
>> Oregon State University
>> Corvallis, OR 97330
>> Phone: 541-737-3832
>> FAX: 541-737-4001
>> -----Original Message-----
>> From: [] On Behalf Of Austin Nichols
>> Sent: Friday, June 05, 2009 9:15 AM
>> To:
>> Subject: Re: st: RE: AW: Sample selection models under zero-truncated negative binomial models
>> John Ataguba:
>> Again, why split the analysis?  If you are interested in the count,
>> use a count model, and then talk about what the results from that
>> model predict about the probability of a nonzero count when you are
>> interested in whether people have any visits.  You don't seem to have
>> any theory requiring "standard logit/probit model" assumptions.
>> -poisson- seems the natural starting point.
>> Why would you drop the zeros when trying to assess how many GP visits
>> a person seems likely to make conditional on X?  Zero is one possible
>> outcome...
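As a rough illustration of the suggestion above: under a Poisson model with a log link, the probability of at least one visit follows from the fitted mean as P(Y > 0 | X) = 1 - exp(-exp(Xb)). The Python sketch below assumes that setup; the function name and the linear-predictor values are made up for illustration and are not estimates from any data.

```python
import math

def prob_any_visit(xb):
    """P(Y > 0 | X) implied by a Poisson model with log link, where xb = X'b."""
    mu = math.exp(xb)            # expected number of visits
    return 1.0 - math.exp(-mu)   # one minus the Poisson probability of zero

# Hypothetical linear predictors (made-up values)
for xb in (-4.6, -1.0, 0.0):
    print(f"xb={xb:5.1f}  mean={math.exp(xb):.3f}  P(any visit)={prob_any_visit(xb):.3f}")
```

The same fitted count model therefore answers both questions: how many visits are expected, and how likely any visit is.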
>> On Fri, Jun 5, 2009 at 10:03 AM, John Ataguba wrote:
>>> Hi Austin,
>>> Specifically, I am not looking at the time dimension of the visits.  The data set is such that I have total number of visits to a GP (General Practitioner) in the past one month collected from a national survey of individuals.  Given that this is a household survey, there are zero visits for some individuals.
>>> One of my objectives is to determine the factors that predict positive utilization of GPs.  This is easily implemented using a standard logit/probit model.  The other part is the factors that affect the number of visits to a GP.  Given that the dependent variable is a count variable, the likely candidates are count regression models.  My concern is how to deal with unobserved heterogeneity and sample selection issues if I limit my analysis to the non-zero visits.  If I use the standard two-part or hurdle model, I do not know if this will account for sample selection in the fashion of the Heckman procedure.
>>> I think the class of mixture models (fmm) will be an alternative that I want to explore. I don't know much about them but would be happy to hear better ideas.
>>> Regards
>>> Jon
>>> ----- Original Message ----
>>> From: Austin Nichols
>>> To:
>>> Sent: Friday, 5 June, 2009 14:27:20
>>> Subject: Re: st: RE: AW: Sample selection models under zero-truncated negative binomial models
>>> Steven--I like this approach in general, but from the original post,
>>> it's not clear that data on the timing of first visit or even time at
>>> risk is on the data--perhaps the poster can clarify?  Also, would you
>>> propose using the predicted hazard in the period of first visit as
>>> some kind of selection correction?  The outcome is visits divided by
>>> time at risk for subsequent visits in your setup, so represents a
>>> fractional outcome (constrained to lie between zero and one) in
>>> theory, though only the zero limit is likely to bind, which makes it
>>> tricky to implement, I would guess--if you are worried about the
>>> nonnormal error distribution and the selection b
>>> Ignoring the possibility of detailed data on times of utilization, why
>>> can't you just run a standard count model on number of visits and use
>>> that to predict probability of at least one visit?  One visit in 10
>>> years is not that different from no visits in 10 years, yeah?  It
>>> makes no sense to me to predict utilization only for those who have
>>> positive utilization and worry about selection etc. instead of just
>>> using the whole sample, including the zeros.  I.e. run a -poisson- to
>>> start with.  If you have a lot of zeros, that can just arise from the
>>> fact that a lot of people have predicted number of visits in the .01
>>> range and number of visits has to be an integer.  Zero inflation or
>>> overdispersion also can arise often from not having the right
>>> specification for the explanatory variables...  but you can also move
>>> to another model in the -glm- or -nbreg- family.
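The point about specification can be illustrated with the law of total variance. In the Python sketch below, counts are exactly Poisson within each of two hypothetical groups, yet the pooled data look overdispersed and zero-heavy once the grouping variable is ignored. The group means and weights are made-up numbers chosen only for illustration.

```python
import math

# Two equally sized hypothetical groups; within each group visits are exactly Poisson.
group_means = [0.01, 2.0]   # made-up conditional means E[Y | X]
weights = [0.5, 0.5]

marginal_mean = sum(w * m for w, m in zip(weights, group_means))
var_of_means = sum(w * (m - marginal_mean) ** 2 for w, m in zip(weights, group_means))

# Law of total variance: Var(Y) = E[Var(Y|X)] + Var(E[Y|X]),
# and a Poisson count has Var(Y|X) = E[Y|X].
marginal_var = marginal_mean + var_of_means

# Share of zeros in the pooled data versus a single Poisson with the pooled mean.
zero_share = sum(w * math.exp(-m) for w, m in zip(weights, group_means))
zero_share_pooled_poisson = math.exp(-marginal_mean)

print(marginal_mean, marginal_var)
print(zero_share, zero_share_pooled_poisson)
```

Here the pooled variance is roughly twice the pooled mean, and about 56% of observations are zero against about 37% under a single Poisson with the same mean, even though nothing other than a Poisson process conditional on group is at work.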
>>> On Tue, Jun 2, 2009 at 1:21 PM, <> wrote:
>>>> A potential problem with Jon's original approach is that the use of
>>>> services is an event with a time dimension--time to first use of
>>>> services.  People might not use services until they need them.
>>>> Instead of a logit model (my preference also),   a survival model for
>>>> the first part might be appropriate.
>>>> With later first-use, the time available for later visits is reduced,
>>>> and  number of visits might be associated with the time from first use
>>>> to the end of observation.  Moreover, people with later first-visits
>>>> (or none) might differ in their degree of  need for subsequent visits.
>>>> To account for unequal follow-up times,  I suggest a supplementary
>>>> analysis in which the outcome for the second part of the hurdle model
>>>> is not the number of visits, but the rate of visits (per unit time at
>>>> risk).
>>>> -Steve.
>>>> On Tue, Jun 2, 2009 at 12:22 PM, Lachenbruch, Peter wrote:
>>>>> This could also be handled by a two-part or hurdle model.  The 0 vs. non-zero model is given by a probit or logit (my preference) model.  The non-zeros are modeled by the count data or OLS or what have you.  The results can be combined since the likelihood separates (the zero values are identifiable - no visits vs number of visits).
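To make "the likelihood separates" concrete, here is a minimal Python sketch of a hurdle log-likelihood split into its two additive parts. It assumes a constant probability of use and a zero-truncated Poisson for the positive counts. Both are simplifying assumptions (the thread discusses logit or probit for the first part and a truncated negative binomial for the second), and the function name and data are invented for illustration.

```python
import math

def hurdle_loglik_parts(counts, p_use, mu):
    """Return (binary part, truncated-count part) of a hurdle log-likelihood.

    p_use: P(Y > 0); mu: mean parameter of the untruncated Poisson.
    Both are constants here; in practice each would depend on X.
    """
    binary, truncated = 0.0, 0.0
    for y in counts:
        if y == 0:
            binary += math.log(1.0 - p_use)
        else:
            binary += math.log(p_use)
            # Zero-truncated Poisson: Poisson pmf divided by P(Y > 0) = 1 - exp(-mu)
            truncated += (-mu + y * math.log(mu) - math.lgamma(y + 1)
                          - math.log(1.0 - math.exp(-mu)))
    return binary, truncated

visits = [0, 0, 0, 1, 2, 0, 4]   # made-up data
b, t = hurdle_loglik_parts(visits, p_use=0.4, mu=1.5)
print(b, t, b + t)
```

Because the first part depends only on `p_use` and the second only on `mu`, the two parts can be maximized separately, which is what allows the binary model and the truncated count model to be fitted one after the other.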
>>>>> -----Original Message-----
>>>>> From: [] On Behalf Of Martin Weiss
>>>>> Sent: Tuesday, June 02, 2009 7:02 AM
>>>>> To:
>>>>> Subject: st: AW: Sample selection models under zero-truncated negative binomial models
>>>>> *************
>>>>> ssc d cmp
>>>>> *************
>>>>> -----Original Message-----
>>>>> From:
>>>>> [] On Behalf Of John Ataguba
>>>>> Sent: Tuesday, 2 June 2009 16:00
>>>>> To: Statalist statalist mailing
>>>>> Subject: st: Sample selection models under zero-truncated negative binomial
>>>>> models
>>>>> Dear colleagues,
>>>>> I want to enquire if it is possible to perform a ztnb (zero-truncated
>>>>> negative binomial) model on a dataset that has the zeros observed in a
>>>>> fashion similar to the heckman sample selection model.
>>>>> Specifically, I have a binary variable on use/non-use of outpatient health
>>>>> services and I fitted a standard probit/logit model to identify the factors
>>>>> that predict the probability of use.  Subsequently, I want to explain the
>>>>> factors that influence the number of visits to the health facilities. Since
>>>>> these are count data, I cannot fit the standard Heckman model using the
>>>>> standard two-part procedure in the Stata command -heckman-.
>>>>> My fear now is that my sample of users will be biased if I fit a ztnb model
>>>>> on only the users, given that I have information on the non-users, which I
>>>>> used to run the initial probit/logit estimation.
>>>>> Is it possible to generate the inverse Mills ratio from the probit model
>>>>> and include this in the ztnb model? Will this be consistent? etc...
>>>>> Are there any smarter suggestions?  Any reference that has used a similar
>>>>> sample selection form will be appreciated.
