Title: My raw count data contains evidence of both overdispersion and “excess zeros”
Author: David M. Drukker, StataCorp
Date: February 2000; minor revisions July 2007
Note: This FAQ uses Stata 10 syntax. However, the advice provided here is still valid for newer versions of Stata.
The short answer is no. Either unobserved heterogeneity or a process that has separate mechanisms for generating zero and nonzero counts can produce both overdispersion and “excess zeros” in the raw data. A simple negative binomial model, nbreg; a zero-inflated Poisson model, zip; and a zero-inflated negative binomial model, zinb, are all candidates for count data with these characteristics. It is important to keep in mind, however, that very different probability models underlie these estimators. In particular, negative binomial models account for between-subject heterogeneity, whereas zero-inflated models have different probability models for the zero and nonzero counts.
There are Wald and likelihood-ratio (LR) tests for evaluating the relative fits of zip and zinb, and there is a Vuong test for choosing between nbreg and zinb.
We will consider an extended example.
Here we wish to model consultation rates in general practice according to social class. The dataset contains the following variables for each patient:
nconany: number of consultations
sclass: 1 = social class I/II; 2 = social class IIIN; 3 = social class IIIM; 4 = social class IV/V
pyr: proportion of year registered with practice
summarize presents evidence that there is overdispersion in the raw data.
. summarize nconany

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     nconany |   59080    2.280383   3.351723          0         98
tabulate presents evidence of zero-inflation in the number of consultations:
. tabulate nconany

    nconany |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      19937       33.75       33.75
          1 |      12442       21.06       54.81
          2 |       8433       14.27       69.08
          3 |       5632        9.53       78.61
          4 |       3701        6.26       84.88
The first table shows that the unconditional variance of the count variable is larger than the mean. This result indicates that a researcher may want to estimate a model other than the Poisson model in which the two are constrained to be equal. There are several options. Either a negative binomial model or a zero-inflated Poisson or a zero-inflated negative binomial model could account for this overdispersion. A nice facet of the negative binomial model is that the Poisson model is nested within it. When the estimated parameter alpha is zero, the conditional mean is equal to the conditional variance and the negative binomial model reduces to the Poisson model. (See Long [1997] and Cameron and Trivedi [1998] for the details of this nesting and for further interpretation of alpha.)
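As a quick check on the numbers in the summarize output, the following minimal Python sketch squares the reported standard deviation and compares it with the mean. The function nb2_variance is a hypothetical helper, written on the assumption that nbreg's default "Dispersion = mean" parameterization implies a variance of mu*(1 + alpha*mu); it is only meant to illustrate how the variance collapses to the Poisson variance when alpha is zero.

```python
# Raw moments taken from the `summarize nconany` output.
mean = 2.280383
std_dev = 3.351723

variance = std_dev ** 2      # unconditional variance of the raw counts
ratio = variance / mean      # would be close to 1 for Poisson data


def nb2_variance(mu, alpha):
    """Variance of a count with mean mu under the mean-dispersion
    negative binomial (assumed form: mu * (1 + alpha * mu))."""
    return mu * (1.0 + alpha * mu)


print(f"variance = {variance:.3f}, variance / mean = {ratio:.2f}")
print("alpha = 0 gives the Poisson variance:", nb2_variance(mean, 0.0) == mean)
print("alpha = 1 inflates the variance to:", round(nb2_variance(mean, 1.0), 3))
```

The raw variance is roughly five times the mean, which is the overdispersion the paragraph above refers to. This comparison uses unconditional moments only; it says nothing about the conditional variance once covariates and exposure are in the model.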
The second table shows that just over a third of the counts are zeros. Both Long (1997) and Cameron and Trivedi (1998) note that the unobserved heterogeneity that can cause overdispersion can also cause there to be “excess zeros”. In fact, Cameron and Trivedi (1998) review related work by other authors that shows that for certain mixture models, the heterogeneity that gives rise to the overdispersion will always raise the proportion of zeros.
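To see why a third of the counts being zero looks "excessive", and why heterogeneity alone can raise the share of zeros, here is a rough Python sketch. It compares the observed share of zeros with what a single Poisson distribution with the same mean would imply, and then evaluates the zero probability of a negative binomial with the same mean for a few illustrative dispersion values. The helper nb2_zero_prob and the alpha values are assumptions for illustration; covariates and exposure are ignored.

```python
import math

# Counts taken from the `tabulate` and `summarize` output.
n_obs = 59080
n_zero = 19937
mean = 2.280383

observed_zero_share = n_zero / n_obs
poisson_zero_share = math.exp(-mean)   # P(Y = 0) for a Poisson with this mean


def nb2_zero_prob(mu, alpha):
    """P(Y = 0) for a mean-dispersion negative binomial with mean mu
    and dispersion alpha > 0 (assumed form: (1 + alpha*mu)**(-1/alpha))."""
    return (1.0 + alpha * mu) ** (-1.0 / alpha)


print(f"observed share of zeros: {observed_zero_share:.4f}")
print(f"Poisson-implied share:   {poisson_zero_share:.4f}")
for alpha in (0.5, 1.0, 2.0):          # illustrative values, not estimates
    print(f"NB zero probability, alpha = {alpha}: {nb2_zero_prob(mean, alpha):.4f}")
```

In this sketch the zero probability rises with alpha while the mean stays fixed, which is consistent with the point that heterogeneity by itself can generate what looks like zero inflation.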
Now, let’s fit a zero-inflated Poisson model to these data using zip. The output from this estimation is
. xi: zip nconany i.sclass, inflate(i.sclass) irr exposure(pyr) nolog
i.sclass          _Isclass_1-4        (naturally coded; _Isclass_1 omitted)

Zero-inflated Poisson regression                Number of obs   =      59080
                                                Nonzero obs     =      39143
                                                Zero obs        =      19937

Inflation model = logit                         LR chi2(3)      =    1037.06
Log likelihood  = -140326.7                     Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |        IRR   Std. Err.       z     P>|z|   [95% Conf. Interval]
-------------+----------------------------------------------------------------
nconany      |
  _Isclass_2 |   1.074051    .011087     6.921   0.000    1.052539    1.096002
  _Isclass_3 |   1.168243   .0089174    20.372   0.000    1.150895    1.185852
  _Isclass_4 |   1.287653   .0105419    30.881   0.000    1.267156    1.308481
         pyr | (exposure)
-------------+----------------------------------------------------------------
inflate      |
  _Isclass_2 |  -.0307179   .0328074    -0.936   0.349   -.0950192    .0335833
  _Isclass_3 |  -.0580417   .0244904    -2.370   0.018    -.106042   -.0100414
  _Isclass_4 |  -.0452335   .0269985    -1.675   0.094   -.0981496    .0076826
       _cons |  -.7961444   .0184429   -43.168   0.000   -.8322919   -.7599969
------------------------------------------------------------------------------
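The inflate equation in the zip output is on the logit scale, so its coefficients are easier to read after converting them to probabilities. The Python sketch below is an illustration under the assumption that the inflation probability is the inverse logit of the linear combination and that exposure enters only the count equation; invlogit is a local helper, not a Stata call.

```python
import math


def invlogit(xb):
    """Inverse logit, written to avoid overflow for large |xb|."""
    if xb >= 0:
        return 1.0 / (1.0 + math.exp(-xb))
    e = math.exp(xb)
    return e / (1.0 + e)


# Inflation-equation estimates copied from the zip output.
cons = -0.7961444
coef = {2: -0.0307179, 3: -0.0580417, 4: -0.0452335}

# Social class 1 is the omitted category, so its linear combination is _cons.
psi = {1: invlogit(cons)}
for sclass, b in coef.items():
    psi[sclass] = invlogit(cons + b)

for sclass in sorted(psi):
    print(f"sclass {sclass}: estimated probability of the always-zero state = {psi[sclass]:.4f}")
```

Under these assumptions, the estimated probability of belonging to the always-zero group is about 0.31 for the base class and only slightly lower for the other classes.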
The zip model does not allow for between-subject heterogeneity. nbreg will model the between-subject heterogeneity, but it will enforce the same process for the zero and nonzero counts.
For these data, the output is
. xi: nbreg nconany i.sclass, irr exposure(pyr) nolog
i.sclass          _Isclass_1-4        (naturally coded; _Isclass_1 omitted)

Negative binomial regression                    Number of obs   =      59080
                                                LR chi2(3)      =     327.27
Dispersion     = mean                           Prob > chi2     =     0.0000
Log likelihood = -118859.37                     Pseudo R2       =     0.0014

------------------------------------------------------------------------------
     nconany |        IRR   Std. Err.       z     P>|z|   [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Isclass_2 |   1.084091   .0201579     4.342   0.000    1.045293    1.124328
  _Isclass_3 |   1.182971   .0164207    12.105   0.000    1.151221    1.215597
  _Isclass_4 |    1.30598   .0201062    17.340   0.000    1.267161    1.345987
         pyr | (exposure)
-------------+----------------------------------------------------------------
    /lnalpha |   .2679188   .0088858                      .2505029    .2853346
-------------+----------------------------------------------------------------
       alpha |   1.307241   .0116159   112.539   0.000    1.284671    1.330207
------------------------------------------------------------------------------
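The alpha row in the nbreg output is a transformation of the /lnalpha row. As a small consistency check, this Python sketch exponentiates the reported /lnalpha estimate and its confidence limits; the variable names are illustrative, and the numbers are copied from the output.

```python
import math

# /lnalpha estimate and 95% confidence limits from the nbreg output.
lnalpha = 0.2679188
lnalpha_lo = 0.2505029
lnalpha_hi = 0.2853346

# alpha and its interval are obtained by exponentiating the log-scale values.
alpha = math.exp(lnalpha)
alpha_lo = math.exp(lnalpha_lo)
alpha_hi = math.exp(lnalpha_hi)

print(f"alpha = {alpha:.6f}, interval = [{alpha_lo:.6f}, {alpha_hi:.6f}]")
print("interval excludes zero:", alpha_lo > 0.0)
```

The exponentiated values match the alpha row. Because alpha is an exponential of a finite number, its interval can never include zero, which is one reason the significance of alpha is usually judged by comparing the negative binomial fit with the Poisson fit rather than from this interval alone.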
Here some issues become more complicated. It is true that zip does not allow for between-subject heterogeneity. However, the overdispersion in the raw data and the significance of alpha in the nbreg output could be the result of a process that gave rise to the zero inflation. Long (1997) notes on page 244 that in a ZIP model, the conditional variance of the count variable is larger than the conditional mean as long as the cumulative distribution function of the inflation equation, evaluated at xb (the linear combination of the coefficients and the data), is not zero. This value is zero only when xb is negative infinity. In particular, if all the coefficients in the inflation equation are zero, then this value is one-half.
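The relationship between the conditional mean and variance in a ZIP model can be sketched numerically. The Python example below assumes the standard ZIP moments, mean = (1 - psi)*mu and variance = (1 - psi)*mu*(1 + psi*mu), where psi is the logit distribution function evaluated at xb from the inflation equation and mu is the Poisson mean of the count process. The value mu = 2.0 and the xb values are hypothetical.

```python
import math


def invlogit(xb):
    """Inverse logit, written to avoid overflow for large |xb|."""
    if xb >= 0:
        return 1.0 / (1.0 + math.exp(-xb))
    e = math.exp(xb)
    return e / (1.0 + e)


def zip_moments(mu, xb_inflate):
    """Return (psi, mean, variance) of a zero-inflated Poisson count,
    assuming a logit inflation model and a Poisson mean mu."""
    psi = invlogit(xb_inflate)
    mean = (1.0 - psi) * mu
    variance = mean * (1.0 + psi * mu)
    return psi, mean, variance


mu = 2.0                                  # hypothetical Poisson mean
for xb in (-1000.0, -2.0, 0.0, 2.0):      # hypothetical values of xb
    psi, mean, variance = zip_moments(mu, xb)
    print(f"xb = {xb:8.1f}  psi = {psi:.4f}  mean = {mean:.4f}  variance = {variance:.4f}")
```

With xb = 0 the inflation probability is exactly one-half, and the variance exceeds the mean for every finite xb. Only as xb heads toward negative infinity does psi reach zero, at which point the two moments coincide and the model is an ordinary Poisson.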
Either the ZIP model or the negative binomial model could account for both the overdispersion and the “excess zeros” in the raw data. Furthermore, both zip and nbreg produce results that seem well behaved. At this point, we might want a test of nonnested models to compare the ZIP model with the negative binomial model. There is a Vuong (1989) test for comparing these two models; however, it is not yet implemented in Stata, so no command performs this comparison out of the box. There may be assumptions that would permit a Hausman test of this hypothesis, but these assumptions would probably be rather arbitrary and very strong.
If we suspect that there is both a separate process generating the zero counts and between-subject heterogeneity, then we would want to try zinb. The results are shown in the output below. For the data at hand, the estimates of the coefficients in the inflation equation have very large standard errors.
. xi: zinb nconany i.sclass, inflate(i.sclass) irr exposure(pyr) vuong
i.sclass          _Isclass_1-4        (naturally coded; _Isclass_1 omitted)

Zero-inflated negative binomial regression      Number of obs   =      59080
                                                Nonzero obs     =      39143
                                                Zero obs        =      19937

Inflation model = logit                         LR chi2(3)      =     324.56
Log likelihood  = -118858.5                     Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |        IRR   Std. Err.       z     P>|z|   [95% Conf. Interval]
-------------+----------------------------------------------------------------
nconany      |
  _Isclass_2 |   1.084098   .0201135     4.352   0.000    1.045384    1.124245
  _Isclass_3 |   1.183006   .0163847    12.134   0.000    1.151324    1.215559
  _Isclass_4 |   1.319803   .0227859    16.072   0.000     1.27589    1.365226
         pyr | (exposure)
-------------+----------------------------------------------------------------
inflate      |
  _Isclass_2 |  -10.70419   10872.28    -0.001   0.999   -21319.99    21298.58
  _Isclass_3 |  -8.182399   2243.279    -0.004   0.997   -4404.929    4388.564
  _Isclass_4 |   9.078406   31.94348     0.284   0.776   -53.52967    71.68648
       _cons |  -13.64615   31.93554    -0.427   0.669   -76.23867    48.94637
-------------+----------------------------------------------------------------
    /lnalpha |   .2618876   .0099614    26.290   0.000    .2423636    .2814115
-------------+----------------------------------------------------------------
       alpha |    1.29938   .0129436                      1.274257    1.324999
------------------------------------------------------------------------------
The Vuong test requested with the vuong option does not favor either model. Vuong (1989) developed some general tests of nonnested models. Greene (1994) adapts one of these tests to two comparisons: ZIP versus Poisson, and zero-inflated negative binomial versus negative binomial. This test has been implemented in Stata. As described in Long (1997), this statistic has a standard normal distribution, with large positive values favoring the zero-inflated model and large negative values favoring the model without zero inflation (the negative binomial in this case). Values close to zero in absolute value favor neither model, and the value obtained here, .7339, is close to zero. The very large standard errors on the coefficients in the inflation equation, however, do imply a definite lack of fit of the zero-inflated negative binomial model.
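To put the Vuong statistic on a familiar scale, the following Python sketch computes the upper-tail probability of .7339 under a standard normal distribution, using the error function from the standard library. The 1.96 cutoff is a conventional 5% two-sided critical value chosen here for illustration; it is not something the FAQ prescribes.

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


vuong_z = 0.7339                   # value reported for zinb versus nbreg
critical = 1.96                    # illustrative two-sided 5% cutoff

upper_tail = 1.0 - std_normal_cdf(vuong_z)

if vuong_z > critical:
    verdict = "favors the zero-inflated model"
elif vuong_z < -critical:
    verdict = "favors the model without zero inflation"
else:
    verdict = "favors neither model"

print(f"upper-tail probability = {upper_tail:.3f}; the statistic {verdict}")
```

The upper-tail probability is about 0.23, far from any conventional rejection region, which agrees with the reading that the statistic favors neither model.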
Even if one were to transform the confidence intervals to Bonferroni confidence intervals, it would appear that essentially any value of the cumulative distribution function of the linear combination of the coefficients of the inflation equation and the data is possible. The large negative values at the lower bounds of the confidence intervals for the inflation-equation coefficients mean that xb takes on large negative values when evaluated at these lower bounds. The distribution function evaluated at such values is essentially zero. When this value is zero, the zero-inflated negative binomial model reduces to the negative binomial.
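The claim that essentially any inflation probability is compatible with these intervals can be illustrated with a short Python sketch. It evaluates the logit distribution function at the point estimate and at the 95% confidence limits reported for _cons in the inflation equation of the zinb output, which is xb for the omitted social class. The helper invlogit is a local function; treating the two limits as stand-ins for plausible values of xb is a simplification.

```python
import math


def invlogit(xb):
    """Inverse logit, written to avoid overflow for large |xb|."""
    if xb >= 0:
        return 1.0 / (1.0 + math.exp(-xb))
    e = math.exp(xb)
    return e / (1.0 + e)


# _cons from the inflate equation of the zinb output:
# point estimate and 95% confidence limits.
cons_hat = -13.64615
cons_lo = -76.23867
cons_hi = 48.94637

psi_hat = invlogit(cons_hat)
psi_lo = invlogit(cons_lo)
psi_hi = invlogit(cons_hi)

print(f"inflation probability at the estimate:    {psi_hat:.3e}")
print(f"inflation probability at the lower limit: {psi_lo:.3e}")
print(f"inflation probability at the upper limit: {psi_hi:.6f}")
```

At the lower limit the probability is indistinguishable from zero, which corresponds to the plain negative binomial model, while at the upper limit it is indistinguishable from one. The data therefore say very little about the inflation equation.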
This example has illustrated the ability of all three models to account for overdispersion and excess zeros in the raw data. While the analysis is not conclusive, it would seem that the data favor either a negative binomial model or a zero-inflated Poisson model.