My raw data contain evidence of both overdispersion and “excess zeros”. Is a zero-inflated negative binomial model the only count data model that can account for both the overdispersion and “excess zeros”?

Title   My raw count data contain evidence of both overdispersion and “excess zeros”
Author David M. Drukker, StataCorp
Date February 2000; minor revisions July 2007

Note: This FAQ uses Stata 10 syntax. However, the advice provided here is still valid for newer versions of Stata.

Short answer

The short answer is no. Either unobserved heterogeneity or a process that has separate mechanisms for generating zero and nonzero counts can produce both overdispersion and “excess zeros” in the raw data. A simple negative binomial model (nbreg), a zero-inflated Poisson model (zip), and a zero-inflated negative binomial model (zinb) are all candidates for count data with these characteristics. It is important to keep in mind, however, that very different probability models underlie these estimators. In particular, negative binomial models account for between-subject heterogeneity, whereas zero-inflated models have different probability models for the zero and nonzero counts.

There are Wald and likelihood-ratio (LR) tests for evaluating the relative fits of zip and zinb, and there is a Vuong test for choosing between nbreg and zinb.

Longer answer

We will consider an extended example.

Here we wish to model consultation rates in general practice according to social class. The dataset contains the following variables for each patient:

 nconany: number of consultations

 sclass: 1 = social class I/II; 2 = social class IIIN; 3 = social class IIIM; 4 = social class IV/V

 pyr: proportion of year registered with practice

summarize presents evidence that there is overdispersion in the raw data.

 . summarize nconany
  
 Variable |     Obs        Mean   Std. Dev.       Min        Max
 ---------+-----------------------------------------------------
  nconany |   59080    2.280383   3.351723          0         98  

tabulate presents evidence of zero-inflation in the number of consultations:

  . tabulate nconany
        
  nconany |      Freq.     Percent        Cum.
  --------+-----------------------------------
        0 |      19937       33.75       33.75
        1 |      12442       21.06       54.81
        2 |       8433       14.27       69.08
        3 |       5632        9.53       78.61
        4 |       3701        6.26       84.88

The first table shows that the unconditional variance of the count variable is larger than the mean: the standard deviation of 3.35 implies a variance of about 11.23, compared with a mean of 2.28. This result indicates that a researcher may want to fit a model other than the Poisson model, in which the two are constrained to be equal. There are several options. A negative binomial model, a zero-inflated Poisson model, or a zero-inflated negative binomial model could account for this overdispersion. A nice feature of the negative binomial model is that the Poisson model is nested within it. When the parameter alpha is zero, the conditional mean is equal to the conditional variance and the negative binomial model reduces to the Poisson model. (See Long [1997] and Cameron and Trivedi [1998] for the details of this nesting and for further interpretation of alpha.)
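
A quick way to make this comparison explicit is to display the variance-to-mean ratio from the results that summarize leaves behind. The lines below are a minimal sketch, assuming the dataset from this example is in memory; r(mean) and r(Var) are stored results of summarize. For the figures reported by summarize above, the ratio is roughly 4.9.

```stata
* Sketch: compare the raw variance of the count with its mean.
* A Poisson variable with no covariates would have a ratio near 1.
quietly summarize nconany
display "mean     = " %9.4f r(mean)
display "variance = " %9.4f r(Var)
display "ratio    = " %9.4f r(Var)/r(mean)
```

This is a descriptive check on the raw data only. It does not condition on sclass or adjust for the exposure variable pyr.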

The second table shows that just over a third of the counts are zeros. Both Long (1997) and Cameron and Trivedi (1998) note that the unobserved heterogeneity that can cause overdispersion can also cause there to be “excess zeros”. In fact, Cameron and Trivedi (1998) review related work by other authors that shows that for certain mixture models, the heterogeneity that gives rise to the overdispersion will always raise the proportion of zeros.
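
One rough way to gauge what “excess” means here is to compare the observed share of zeros with the share that a Poisson distribution with the same mean would imply, exp(-mean). The sketch below ignores the covariates and the exposure variable, so it is a descriptive check rather than a test. The scalar names are arbitrary labels chosen for this illustration, and the calculation assumes that nconany has no missing values. For the mean of 2.28 reported above, exp(-2.28) is about 0.10, against an observed share of 0.3375.

```stata
* Sketch: observed share of zeros versus the share implied by a
* Poisson distribution with the same mean (no covariates, no exposure).
quietly summarize nconany
scalar p0_poisson = exp(-r(mean))
quietly count if nconany == 0
scalar p0_observed = r(N) / _N
display "Poisson-implied share of zeros = " %6.4f p0_poisson
display "observed share of zeros        = " %6.4f p0_observed
```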

Now, let’s fit a zero-inflated Poisson model to these data using zip. The output from this estimation is

 . xi: zip nconany i.sclass, inflate(i.sclass) irr exposure(pyr) nolog
 i.sclass          _Isclass_1-4        (naturally coded; _Isclass_1 omitted)
  
 Zero-inflated Poisson regression                  Number of obs   =      59080
                                                   Nonzero obs     =      39143
                                                   Zero obs        =      19937
 
 Inflation model = logit                           LR chi2(3)      =    1037.06
 Log likelihood  = -140326.7                       Prob > chi2     =     0.0000

 ------------------------------------------------------------------------------
              |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
 nconany      |
   _Isclass_2 |   1.074051    .011087    6.921   0.000     1.052539    1.096002
   _Isclass_3 |   1.168243   .0089174   20.372   0.000     1.150895    1.185852
   _Isclass_4 |   1.287653   .0105419   30.881   0.000     1.267156    1.308481
          pyr | (exposure)
 -------------+----------------------------------------------------------------
 inflate      |
   _Isclass_2 |  -.0307179   .0328074   -0.936   0.349     -.0950192   .0335833
   _Isclass_3 |  -.0580417   .0244904   -2.370   0.018      -.106042  -.0100414
   _Isclass_4 |  -.0452335   .0269985   -1.675   0.094     -.0981496   .0076826
        _cons |  -.7961444   .0184429  -43.168   0.000     -.8322919  -.7599969
 ------------------------------------------------------------------------------
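
The xi: prefix in the command above reflects Stata 10 syntax. In Stata 11 and later, factor-variable notation lets i.sclass appear directly in both the count and the inflation equations. The following is a sketch of the equivalent command, assuming the same variable names. The coefficients are then labeled by the levels of sclass rather than as _Isclass_2 and so on.

```stata
* Sketch: the same ZIP model with factor-variable notation (Stata 11+).
* sclass = 1 is the base level by default, as in the xi: version.
zip nconany i.sclass, inflate(i.sclass) irr exposure(pyr) nolog
```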

The zip model does not allow for between-subject heterogeneity. nbreg will model the between-subject heterogeneity, but it will enforce the same process for the zero and nonzero counts.

For these data, the output is

  . xi: nbreg nconany i.sclass, irr exposure(pyr) nolog
  i.sclass          _Isclass_1-4        (naturally coded; _Isclass_1 omitted)
   
  Negative binomial regression                      Number of obs   =      59080
                                                    LR chi2(3)      =     327.27
  Dispersion     = mean                             Prob > chi2     =     0.0000
  Log likelihood = -118859.37                       Pseudo R2       =     0.0014
 
 
  ------------------------------------------------------------------------------
       nconany |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
    _Isclass_2 |   1.084091   .0201579    4.342   0.000     1.045293    1.124328
    _Isclass_3 |   1.182971   .0164207   12.105   0.000     1.151221    1.215597
    _Isclass_4 |    1.30598   .0201062   17.340   0.000     1.267161    1.345987
           pyr | (exposure)
  -------------+----------------------------------------------------------------
      /lnalpha |   .2679188   .0088858                       .2505029   .2853346
  -------------+----------------------------------------------------------------
         alpha |   1.307241   .0116159   112.539  0.000       1.284671  1.330207
  ------------------------------------------------------------------------------
  Likelihood-ratio test of alpha=0:  chibar2(1) = 80508.81 Prob>=chibar2 = 0.000

Here some issues become more complicated. It is true that zip does not allow for between-subject heterogeneity. However, the overdispersion in the raw data and the significance of alpha in the nbreg output could be the result of a process that gave rise to the zero inflation. Long (1997) notes on page 244 that in a ZIP model, the conditional variance of the count variable is larger than the conditional mean as long as the cumulative distribution function in the inflation equation, evaluated at xb (the linear combination of the coefficients and the data), is not zero. This value is zero only when the linear combination is negative infinity. In particular, if all the coefficients in the inflation equation are zero, then this value is one-half.
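
To make this concrete, write psi for the probability given by the inflation equation and mu for the Poisson mean. The ZIP model then has conditional mean mu(1 - psi) and conditional variance mu(1 - psi)(1 + mu*psi), so the variance exceeds the mean whenever psi is greater than zero. With a logit inflation model, psi = invlogit(xb). The lines below are a small sketch that evaluates this function at xb = 0 and at the constant from the inflation equation of the zip output above (-.7961444), which corresponds to social class I/II.

```stata
* Sketch: the logistic CDF used by the logit inflation model.
* All inflation coefficients equal to zero gives one-half:
display invlogit(0)
* Constant from the zip inflation equation (base class), about .31:
display invlogit(-.7961444)
```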

Either the ZIP model or the negative binomial model could account for both the overdispersion and the “excess zeros” in the raw data. Furthermore, both zip and nbreg produce results that seem well behaved. At this point, we might want a test of nonnested models to compare the ZIP model with the negative binomial model, but Stata does not have a command to perform this test out of the box. There may be assumptions that would permit a Hausman test of this hypothesis, but these assumptions would probably be rather arbitrary and very strong. There is a Vuong (1989) test for comparing these two models; however, it is not yet implemented in Stata.
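
Although it is not a formal test, one informal way to put the two nonnested fits side by side is to compare information criteria. This is a different technique from the Vuong and Hausman approaches mentioned above, offered only as a sketch. The names m_zip and m_nb are arbitrary labels chosen for this illustration.

```stata
* Sketch: informal comparison of the nonnested ZIP and negative
* binomial fits via AIC and BIC (not a hypothesis test).
quietly xi: zip nconany i.sclass, inflate(i.sclass) exposure(pyr)
estimates store m_zip
quietly xi: nbreg nconany i.sclass, exposure(pyr)
estimates store m_nb
estimates stats m_zip m_nb
```

estimates stats lists the log likelihood, the number of parameters, and the AIC and BIC for each stored model. Smaller AIC and BIC values indicate the preferred fit under these criteria.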

If we suspect that there is both a separate process for the zero and nonzero counts and between-subject heterogeneity, then we would want to try zinb. The output is shown below. For the data at hand, the estimates of the coefficients in the inflation equation have very large standard errors.

 . xi: zinb nconany i.sclass, inflate(i.sclass) irr exposure(pyr) vuong
 i.sclass          _Isclass_1-4        (naturally coded; _Isclass_1 omitted)
        
 Zero-inflated negative binomial regression        Number of obs   =      59080
                                                   Nonzero obs     =      39143
                                                   Zero obs        =      19937
 
 Inflation model = logit                           LR chi2(3)      =     324.56
 Log likelihood  = -118858.5                       Prob > chi2     =     0.0000

 ------------------------------------------------------------------------------
              |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
 nconany      |
   _Isclass_2 |   1.084098   .0201135    4.352   0.000     1.045384    1.124245
   _Isclass_3 |   1.183006   .0163847   12.134   0.000     1.151324    1.215559
   _Isclass_4 |   1.319803   .0227859   16.072   0.000      1.27589    1.365226
          pyr | (exposure)
 -------------+----------------------------------------------------------------
 inflate      |
   _Isclass_2 |  -10.70419   10872.28   -0.001   0.999    -21319.99    21298.58
   _Isclass_3 |  -8.182399   2243.279   -0.004   0.997    -4404.929    4388.564
   _Isclass_4 |   9.078406   31.94348    0.284   0.776    -53.52967    71.68648
        _cons |  -13.64615   31.93554   -0.427   0.669    -76.23867    48.94637
 -------------+----------------------------------------------------------------
     /lnalpha |   .2618876   .0099614   26.290   0.000     .2423636    .2814115
 -------------+----------------------------------------------------------------
        alpha |    1.29938   .0129436                      1.274257    1.324999
 ------------------------------------------------------------------------------
 Vuong test of zinb vs. standard negative binomial: z = 0.625     Pr>z = 0.7339

The output indicates that the Vuong test does not favor either model. Vuong (1989) developed some general tests of nonnested models. Greene (1994) adapts one of these tests to two cases: ZIP versus Poisson and zero-inflated negative binomial versus negative binomial. This test has been implemented in Stata. As described in Long (1997), the statistic has a standard normal distribution, with large positive values favoring the zero-inflated model and large negative values favoring the model without zero inflation (negative binomial in this case). Values close to zero in absolute value favor neither model. Here the statistic is 0.625 with a reported p-value of .7339, so the test does not favor either model. The very large standard errors on the coefficients in the inflation equation, however, do imply a definite lack of fit of the zero-inflated negative binomial model.
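
The likelihood-ratio comparison of zinb with zip mentioned in the short answer can be requested with the zip option of zinb. The command below is a sketch using the same specification as above; check help zinb in your version of Stata before relying on the option.

```stata
* Sketch: refit the ZINB model and request the LR test of alpha = 0,
* that is, ZINB against ZIP.
xi: zinb nconany i.sclass, inflate(i.sclass) irr exposure(pyr) zip nolog
```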

Even if one were to transform the confidence intervals to Bonferroni confidence intervals, it would appear that essentially any value of the cumulative distribution function of the linear combination of the coefficients of the inflation equation and the data is possible. The large negative coefficients at the lower bounds of the confidence intervals of the inflation equation mean that xb takes on large negative values when evaluated at these lower bounds. The distribution function evaluated at these values is essentially zero. When this value is zero, the zero-inflated negative binomial model reduces to the negative binomial model.
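
To see how wide this range is, the logistic distribution function can be evaluated at a few of the numbers in the inflation equation of the zinb output. This is only an arithmetic sketch using figures copied from that output for the constant, which corresponds to social class I/II.

```stata
* Sketch: logistic CDF at values taken from the zinb inflation equation.
* Point estimate of the constant:
display invlogit(-13.64615)
* Lower bound of the confidence interval for the constant:
display invlogit(-76.23867)
* Upper bound of the confidence interval for the constant:
display invlogit(48.94637)
```

The first two values are essentially zero and the last is essentially one, which is the sense in which almost any value of the distribution function is compatible with these estimates.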

This example has illustrated the ability of all three models to account for overdispersion and excess zeros in the raw data. While the analysis is not conclusive, it would seem that the data favor either a negative binomial model or a zero-inflated Poisson model.

References

Cameron, A. C. and P. K. Trivedi. 1998.
Regression Analysis of Count Data. Cambridge: Cambridge University Press.
Greene, W. H. 1994.
Accounting for excess zeros and sample selection in Poisson and negative binomial regression models.
Working paper, Stern School of Business, NYU EC-94-10.
Long, J. S. 1997.
Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.
Vuong, Q. H. 1989.
Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57: 307–333.