My raw data contain evidence of both overdispersion and “excess zeros”. Is a zero-inflated negative binomial model the only count data model that can account for both the overdispersion and “excess zeros”?

Title   My raw count data contains evidence of both overdispersion and “excess zeros”
Author David M. Drukker, StataCorp
Date February 2000; minor revisions July 2007

Note: This FAQ uses Stata 10 syntax. However, the advice provided here is still valid for newer versions of Stata.

Short answer

The short answer is no. Either unobserved heterogeneity or a process that has separate mechanisms for generating zero and nonzero counts can produce both overdispersion and “excess zeros” in the raw data. A simple negative binomial model, nbreg, a zero-inflated Poisson model, zip, and a zero-inflated negative binomial model, zinb, are all candidates for count data with these characteristics. It is important to keep in mind, however, that these models rest on very different probability models. In particular, negative binomial models account for between-subject heterogeneity, whereas zero-inflated models use different probability models for the zero and nonzero counts.

There are Wald and likelihood-ratio (LR) tests for evaluating the relative fits of zip and zinb, and there is a Vuong test for choosing between nbreg and zinb.

Longer answer

We will consider an extended example.

Here we wish to model consultation rates in general practice according to social class. The dataset contains the following variables for each patient:

 nconany: number of consultations

 sclass: 1 = social class I/II; 2 = social class IIIN; 3 = social class IIIM; 4 = social class IV/V

 pyr: proportion of year registered with practice

summarize presents evidence that there is overdispersion in the raw data.

 . summarize nconany
  
     Variable |       Obs        Mean    Std. Dev.       Min        Max
 -------------+--------------------------------------------------------
      nconany |     59080    2.280383     3.351723         0         98

tabulate presents evidence of zero-inflation in the number of consultations:

  . tabulate nconany

     nconany |      Freq.     Percent        Cum.
 ------------+-----------------------------------
           0 |      19937       33.75       33.75
           1 |      12442       21.06       54.81
           2 |       8433       14.27       69.08
           3 |       5632        9.53       78.61
           4 |       3701        6.26       84.88

The first table shows that the unconditional variance of the count variable (3.351723 squared, or about 11.23) is much larger than its mean (2.28). This result indicates that a researcher may want to fit a model other than the Poisson model, in which the two are constrained to be equal. There are several options: a negative binomial model, a zero-inflated Poisson model, or a zero-inflated negative binomial model could account for this overdispersion. A useful feature of the negative binomial model is that the Poisson model is nested within it. When the estimated parameter alpha is zero, the conditional mean is equal to the conditional variance and the negative binomial model reduces to the Poisson model. (See Long [1997] and Cameron and Trivedi [1998] for the details of this nesting and for further interpretation of alpha.)
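The comparison of variance and mean can be reproduced by hand from the summarize output. The following is a minimal sketch in Python (used here only as a calculator, not as part of the Stata workflow); the variable names are illustrative, and the check ignores covariates and exposure, so it speaks only to the raw, unconditional data.

```python
# Summary statistics copied from the summarize output for nconany.
mean = 2.280383
std_dev = 3.351723

# summarize reports the standard deviation, so square it to get the variance.
variance = std_dev ** 2

# Under a Poisson distribution with no covariates this ratio would be 1.
dispersion_ratio = variance / mean

print(f"variance = {variance:.2f}")
print(f"variance / mean = {dispersion_ratio:.2f}")
```

A ratio well above 1 is what the text calls overdispersion in the raw data; it does not by itself say which model is responsible for it.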

The second table shows that just over a third of the counts are zeros. Both Long (1997) and Cameron and Trivedi (1998) note that the unobserved heterogeneity that can cause overdispersion can also cause there to be “excess zeros”. In fact, Cameron and Trivedi (1998) review related work by other authors that shows that for certain mixture models, the heterogeneity that gives rise to the overdispersion will always raise the proportion of zeros.
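To see how heterogeneity alone can raise the share of zeros, one can compare the zero probability of a Poisson distribution with that of a gamma-Poisson mixture (the negative binomial with variance mu + alpha*mu^2) at the same mean. The sketch below is in Python and makes several simplifying assumptions: it uses the raw sample mean, ignores covariates and exposure, and the alpha values are hypothetical illustrations rather than estimates.

```python
import math

def poisson_p0(mu):
    """P(Y = 0) for a Poisson distribution with mean mu."""
    return math.exp(-mu)

def negbin_p0(mu, alpha):
    """P(Y = 0) for a gamma-Poisson mixture with mean mu and
    variance mu + alpha * mu**2 (alpha > 0)."""
    return (1.0 + alpha * mu) ** (-1.0 / alpha)

mu = 2.280383                    # sample mean from summarize
observed_zero_share = 19937 / 59080  # zero count over total, from tabulate

# Hypothetical alpha values, chosen only to show the direction of the effect.
alphas = [0.25, 0.5, 1.0, 1.5]
mixture_p0 = [negbin_p0(mu, a) for a in alphas]

print(f"observed share of zeros: {observed_zero_share:.4f}")
print(f"Poisson P(0):            {poisson_p0(mu):.4f}")
for a, p0 in zip(alphas, mixture_p0):
    print(f"mixture P(0), alpha={a}: {p0:.4f}")
```

For every positive alpha the mixture puts more mass on zero than the Poisson does at the same mean, and the gap grows with alpha, which is the pattern the cited results describe.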

Now, let’s fit a zero-inflated Poisson model to these data using zip. The output from this estimation is

 . xi: zip nconany i.sclass, inflate(i.sclass) irr exposure(pyr) nolog
 i.sclass          _Isclass_1-4        (naturally coded; _Isclass_1 omitted)
  
 Zero-inflated Poisson regression                  Number of obs   =      59080
                                                   Nonzero obs     =      39143
                                                   Zero obs        =      19937
 
 Inflation model = logit                           LR chi2(3)      =    1037.06
 Log likelihood  = -140326.7                       Prob > chi2     =     0.0000

               |        IRR  Std. Err.        z   P>|z|   [95% Conf.   Interval]
 --------------+----------------------------------------------------------------
 nconany       |
    _Isclass_2 |   1.074051    .011087    6.921   0.000     1.052539    1.096002
    _Isclass_3 |   1.168243   .0089174   20.372   0.000     1.150895    1.185852
    _Isclass_4 |   1.287653   .0105419   30.881   0.000     1.267156    1.308481
           pyr | (exposure)
 --------------+----------------------------------------------------------------
 inflate       |
    _Isclass_2 |  -.0307179   .0328074   -0.936   0.349    -.0950192    .0335833
    _Isclass_3 |  -.0580417   .0244904   -2.370   0.018     -.106042   -.0100414
    _Isclass_4 |  -.0452335   .0269985   -1.675   0.094    -.0981496    .0076826
         _cons |  -.7961444   .0184429  -43.168   0.000    -.8322919   -.7599969

The zip model does not allow for between-subject heterogeneity. nbreg will model the between-subject heterogeneity, but it will enforce the same process for the zero and nonzero counts.

For these data, the nbreg output is

  . xi: nbreg nconany i.sclass, irr exposure(pyr) nolog
  i.sclass          _Isclass_1-4        (naturally coded; _Isclass_1 omitted)
   
  Negative binomial regression                      Number of obs   =      59080
                                                    LR chi2(3)      =     327.27
  Dispersion     = mean                             Prob > chi2     =     0.0000
  Log likelihood = -118859.37                       Pseudo R2       =     0.0014
 
 
       nconany |        IRR  Std. Err.        z   P>|z|   [95% Conf.   Interval]
 --------------+----------------------------------------------------------------
    _Isclass_2 |   1.084091   .0201579    4.342   0.000     1.045293    1.124328
    _Isclass_3 |   1.182971   .0164207   12.105   0.000     1.151221    1.215597
    _Isclass_4 |    1.30598   .0201062   17.340   0.000     1.267161    1.345987
           pyr | (exposure)
 --------------+----------------------------------------------------------------
      /lnalpha |   .2679188   .0088858                      .2505029    .2853346
 --------------+----------------------------------------------------------------
         alpha |   1.307241   .0116159  112.539   0.000     1.284671    1.330207

 Likelihood-ratio test of alpha=0:  chibar2(1) = 80508.81  Prob>=chibar2 = 0.000
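The /lnalpha and alpha rows of the nbreg output describe the same parameter on two scales. A short Python check, offered as a sketch, shows how the alpha row can be recovered from the /lnalpha row: the point estimate and interval bounds by exponentiating, and the standard error by the delta method. The delta-method step is an assumption about how the reported standard error was obtained; the numbers agree with the output to the precision shown.

```python
import math

# Values copied from the /lnalpha row of the nbreg output.
lnalpha = 0.2679188
se_lnalpha = 0.0088858
ci_lnalpha = (0.2505029, 0.2853346)

# Point estimate: alpha = exp(lnalpha).
alpha = math.exp(lnalpha)

# Delta method: se(alpha) is approximately alpha * se(lnalpha).
se_alpha = alpha * se_lnalpha

# Interval bounds transform directly because exp() is monotone.
ci_alpha = tuple(math.exp(bound) for bound in ci_lnalpha)

print(f"alpha     = {alpha:.6f}")
print(f"se(alpha) = {se_alpha:.7f}")
print(f"95% CI    = ({ci_alpha[0]:.6f}, {ci_alpha[1]:.6f})")
```

Because alpha is estimated on the log scale, its confidence interval can never include zero; the hypothesis alpha = 0 is assessed instead by the likelihood-ratio test reported beneath the table.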

Here some issues become more complicated. It is true that zip does not allow for between-subject heterogeneity. However, the overdispersion in the raw data and the significance of alpha in the nbreg output could be the result of a process that gave rise to the zero inflation. Long (1997, 244) notes that in a ZIP model, the conditional variance of the count variable is larger than the conditional mean as long as the probability of an always-zero outcome is not zero. That probability is the cumulative distribution function evaluated at xb, the linear combination of the coefficients and the data in the inflation equation. It is zero only when xb is negative infinity. In particular, if all the coefficients in the inflation equation are zero, then xb is zero and the probability is one-half.
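The relationship between the inflation equation and overdispersion in a ZIP model can be illustrated numerically. In the sketch below (Python, illustrative names), psi is the logistic distribution function evaluated at xb, and the ZIP mean and variance are (1 - psi)*mu and (1 - psi)*mu*(1 + psi*mu), where mu is the Poisson mean. The value of mu is a hypothetical placeholder; the inflation constant is taken from the zip output and corresponds to the omitted category, social class I/II.

```python
import math

def invlogit(xb):
    """Logistic CDF, the link used by the logit inflation model."""
    return 1.0 / (1.0 + math.exp(-xb))

def zip_mean_var(mu, psi):
    """Mean and variance of a zero-inflated Poisson with Poisson mean mu
    and always-zero probability psi."""
    mean = (1.0 - psi) * mu
    var = mean * (1.0 + psi * mu)
    return mean, var

# All inflation coefficients equal to zero gives xb = 0.
psi_all_zero = invlogit(0.0)

# _cons from the inflate equation of the zip output (social class I/II).
psi_class1 = invlogit(-0.7961444)

mu = 3.0  # hypothetical Poisson mean, for illustration only
mean1, var1 = zip_mean_var(mu, psi_class1)
mean0, var0 = zip_mean_var(mu, 0.0)  # psi = 0 collapses to the Poisson

print(f"psi when xb = 0:       {psi_all_zero:.3f}")
print(f"psi for class I/II:    {psi_class1:.3f}")
print(f"ZIP mean, variance:    {mean1:.3f}, {var1:.3f}")
print(f"psi = 0 mean, variance: {mean0:.3f}, {var0:.3f}")
```

The variance exceeds the mean for any psi strictly between 0 and 1, so a ZIP model can generate overdispersion without any between-subject heterogeneity.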

Either the ZIP model or the negative binomial model could account for both the overdispersion and the “excess zeros” in the raw data. Furthermore, both zip and nbreg produce results that seem well behaved. At this point, we might want a test of nonnested models to compare the ZIP with the negative binomial model. Stata does not have a command to perform this test out of the box. There may be assumptions that would permit a Hausman test of this hypothesis. These assumptions would probably be rather arbitrary and very strong. There is a Vuong (1989) test for comparing these two models; however, it is not yet implemented in Stata.

If we suspect that there is both a separate process for the zero and nonzero counts and between-subject heterogeneity, then we would want to try zinb. The output from zinb is shown below. For the data at hand, the estimates of the coefficients in the inflation equation have very large standard errors.

 . xi: zinb nconany i.sclass, inflate(i.sclass) irr exposure(pyr) vuong
 i.sclass          _Isclass_1-4        (naturally coded; _Isclass_1 omitted)
        
 Zero-inflated negative binomial regression        Number of obs   =      59080
                                                   Nonzero obs     =      39143
                                                   Zero obs        =      19937
 
 Inflation model = logit                           LR chi2(3)      =     324.56
 Log likelihood  = -118858.5                       Prob > chi2     =     0.0000

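               |        IRR  Std. Err.        z   P>|z|   [95% Conf.   Interval]
 --------------+----------------------------------------------------------------
 nconany       |
    _Isclass_2 |   1.084098   .0201135    4.352   0.000     1.045384    1.124245
    _Isclass_3 |   1.183006   .0163847   12.134   0.000     1.151324    1.215559
    _Isclass_4 |   1.319803   .0227859   16.072   0.000      1.27589    1.365226
           pyr | (exposure)
 --------------+----------------------------------------------------------------
 inflate       |
    _Isclass_2 |  -10.70419   10872.28   -0.001   0.999    -21319.99    21298.58
    _Isclass_3 |  -8.182399   2243.279   -0.004   0.997    -4404.929    4388.564
    _Isclass_4 |   9.078406   31.94348    0.284   0.776    -53.52967    71.68648
         _cons |  -13.64615   31.93554   -0.427   0.669    -76.23867    48.94637
 --------------+----------------------------------------------------------------
      /lnalpha |   .2618876   .0099614   26.290   0.000     .2423636    .2814115
 --------------+----------------------------------------------------------------
         alpha |    1.29938   .0129436                      1.274257    1.324999

 Vuong test of zinb vs. standard negative binomial:  z = 0.625  Pr>z = 0.7339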

The output indicates that the Vuong test does not favor either model. Vuong (1989) developed some general tests of nonnested models. Greene (1994) adapts one of these tests to two cases: ZIP versus Poisson and zero-inflated negative binomial versus negative binomial. This test has been implemented in Stata. As described in Long (1997), the statistic has a standard normal distribution, with large positive values favoring the zero-inflated model and large negative values favoring the non-zero-inflated version (negative binomial in this case). Values close to zero in absolute value favor neither model. Here z = 0.625, with a reported probability of .7339, so the test does not favor either model. The very large standard errors on the coefficients in the inflation equation, however, do imply a definite lack of fit of the zero-inflated negative binomial model.
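Since the Vuong statistic is referred to a standard normal distribution, its tail probabilities are easy to check by hand. The sketch below (Python, standard library only) evaluates both tails at z = 0.625. The arithmetic suggests that the reported .7339 matches the lower-tail probability; this is an observation about the numbers, not a statement about how the output is labeled, and the 1.96 cutoff is a conventional choice assumed here for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = 0.625  # Vuong statistic from the zinb output

lower_tail = normal_cdf(z)      # Pr(Z < z)
upper_tail = 1.0 - lower_tail   # Pr(Z > z)

# Conventional two-sided 5% cutoff, assumed for illustration.
critical_value = 1.96
favors_neither = abs(z) < critical_value

print(f"Pr(Z < {z}) = {lower_tail:.4f}")
print(f"Pr(Z > {z}) = {upper_tail:.4f}")
print(f"|z| below {critical_value}: {favors_neither}")
```

Under either reading of the probability, the statistic is far from both rejection regions, which is consistent with the conclusion that the test favors neither model.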

Even if one were to transform the confidence intervals to Bonferroni confidence intervals, it would appear that essentially any value of the cumulative distribution function of xb, the linear combination of the coefficients of the inflation equation and the data, is possible. The large negative values at the lower bounds of the confidence intervals for the inflation-equation coefficients mean that xb takes on large negative values when evaluated at these lower bounds. The distribution function evaluated at such values is essentially zero. When this value is zero, the zero-inflated negative binomial model reduces to the negative binomial model.
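The claim that essentially any value of the distribution function is possible can be checked by evaluating the logistic function at the confidence limits of the inflation constant. The sketch below is in Python; it uses only the _cons row of the zinb inflate equation, so it applies to the omitted category (social class I/II) and is an illustration rather than a formal interval for the probability.

```python
import math

def invlogit(xb):
    """Logistic CDF, written to avoid overflow for large |xb|."""
    if xb >= 0:
        return 1.0 / (1.0 + math.exp(-xb))
    e = math.exp(xb)
    return e / (1.0 + e)

# _cons row of the inflate equation in the zinb output.
estimate, lower, upper = -13.64615, -76.23867, 48.94637

psi_at_estimate = invlogit(estimate)
psi_at_lower = invlogit(lower)
psi_at_upper = invlogit(upper)

print(f"psi at point estimate: {psi_at_estimate:.3e}")
print(f"psi at lower bound:    {psi_at_lower:.3e}")
print(f"psi at upper bound:    {psi_at_upper:.6f}")
```

The implied always-zero probability ranges from essentially 0 to essentially 1 across the interval, and at the point estimate it is so close to 0 that the fitted model is nearly indistinguishable from the plain negative binomial.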

This example has illustrated the ability of all three models to account for overdispersion and excess zeros in the raw data. While the analysis is not conclusive, it would seem that the data favor either a negative binomial model or a zero-inflated Poisson model.

References

Cameron, A. C., and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.
Greene, W. H. 1994. Accounting for excess zeros and sample selection in Poisson and negative binomial regression models. Working paper, Stern School of Business, NYU EC-94-10.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.
Vuong, Q. H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57: 307–333.