My raw data contain evidence of both overdispersion and “excess
zeros”. Is a zero-inflated negative binomial model the only count
data model that can account for both the overdispersion and “excess
zeros”?
Title: My raw count data contains evidence of both overdispersion and “excess zeros”
Author: David M. Drukker, StataCorp
Date: February 2000; minor revisions July 2007
Note: This FAQ uses Stata 10 syntax. However, the advice provided here is still valid for
newer versions of Stata.
Short answer
The short answer is no. Either unobserved heterogeneity or a process that
has separate mechanisms for generating zero and nonzero counts can produce
both overdispersion and “excess zeros” in the raw data. A simple negative
binomial model (nbreg), a zero-inflated Poisson model (zip), and a
zero-inflated negative binomial model (zinb) are all candidates for count
data with these characteristics. Keep in mind, however, that very different
probability models underlie these estimators. Negative binomial models
account for between-subject heterogeneity. Zero-inflated models specify a
separate process that generates additional zeros alongside the count
process.
There are Wald and likelihood-ratio (LR) tests for evaluating the relative
fits of zip and zinb, and
there is a Vuong test for choosing between nbreg
and zinb.
Longer answer
We will consider an extended example.
Here we wish to model consultation rates in general practice according to
social class. The dataset contains the following variables for each
patient:
nconany: number of consultations
sclass: social class (1 = I/II; 2 = IIIN; 3 = IIIM; 4 = IV/V)
pyr: proportion of year registered with practice
summarize presents evidence that there is overdispersion in the raw data.
. summarize nconany
    Variable |       Obs        Mean   Std. Dev.       Min        Max
-------------+-------------------------------------------------------
     nconany |     59080    2.280383    3.351723         0         98
tabulate presents evidence of zero-inflation in the number of consultations:
. tabulate nconany
    nconany |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      19937       33.75       33.75
          1 |      12442       21.06       54.81
          2 |       8433       14.27       69.08
          3 |       5632        9.53       78.61
          4 |       3701        6.26       84.88
The first table shows that the unconditional variance of the count variable
(3.351723 squared, or about 11.23) is much larger than the mean (2.28). This
result indicates that a researcher may want to estimate a model other than
the Poisson model, which constrains the two to be equal. There are several
options. A negative binomial model, a zero-inflated Poisson model, or a
zero-inflated negative binomial model could account for this overdispersion.
A nice facet of the negative binomial model is that the Poisson model is
nested within it. When the dispersion parameter alpha is zero, the
conditional mean is equal to the conditional variance and the negative
binomial model reduces to the Poisson model. (See Long [1997] and Cameron
and Trivedi [1998] for the details of this nesting and for further
interpretation of alpha.)
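The nesting can be made concrete with a small numeric sketch. It assumes the mean-dispersion parameterization (the "Dispersion = mean" form that nbreg reports), in which the conditional variance is mu + alpha*mu^2. The function name and the value alpha = 0.5 are illustrative choices, not part of the original analysis.

```python
def nb_variance(mu: float, alpha: float) -> float:
    """Variance of a mean-dispersion negative binomial: mu + alpha * mu**2."""
    return mu + alpha * mu ** 2

# Raw summary statistics reported by summarize above.
raw_mean = 2.280383
raw_sd = 3.351723
raw_variance = raw_sd ** 2  # unconditional variance of nconany

# With alpha = 0 the variance equals the mean, which is the Poisson case.
poisson_case = nb_variance(raw_mean, 0.0)

# Any alpha > 0 pushes the variance above the mean.
# alpha = 0.5 is an arbitrary illustrative value.
overdispersed_case = nb_variance(raw_mean, 0.5)

print(f"raw variance-to-mean ratio: {raw_variance / raw_mean:.2f}")
print(f"alpha = 0.0 -> variance {poisson_case:.3f} (equals the mean)")
print(f"alpha = 0.5 -> variance {overdispersed_case:.3f}")
```

A variance-to-mean ratio well above 1 in the raw data is what motivates looking beyond the Poisson model. It does not by itself say which of the alternatives is appropriate.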
The second table shows that just over a third of the counts are zeros. Both
Long (1997) and Cameron and Trivedi (1998) note that the unobserved
heterogeneity that can cause overdispersion can also cause there to be
“excess zeros”. In fact, Cameron and Trivedi (1998) review
related work by other authors that shows that for certain mixture models,
the heterogeneity that gives rise to the overdispersion will always raise
the proportion of zeros.
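A rough calculation shows how heterogeneity alone can raise the share of zeros. The sketch compares the probability of a zero count under a Poisson distribution and under a mean-dispersion negative binomial with the same mean. It ignores covariates and exposure. The mean is the raw mean from summarize, and alpha = 1.3 is an illustrative value chosen for this sketch, so the numbers are indicative only.

```python
import math

def poisson_p0(mu: float) -> float:
    """P(Y = 0) for a Poisson distribution with mean mu."""
    return math.exp(-mu)

def nb_p0(mu: float, alpha: float) -> float:
    """P(Y = 0) for a mean-dispersion negative binomial, alpha > 0."""
    return (1.0 + alpha * mu) ** (-1.0 / alpha)

mu = 2.28     # raw mean of nconany; covariates and exposure are ignored
alpha = 1.3   # illustrative dispersion value (an assumption of this sketch)

print(f"Poisson           P(0) = {poisson_p0(mu):.3f}")
print(f"Negative binomial P(0) = {nb_p0(mu, alpha):.3f}")
```

Under these assumptions the Poisson distribution puts about a tenth of its mass at zero, while the negative binomial puts about a third there, even though no separate zero-generating process is involved.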
Now, let’s fit a zero-inflated Poisson model to these data using
zip. The output from this estimation is
. xi: zip nconany i.sclass, inflate(i.sclass) irr exposure(pyr) nolog
i.sclass _Isclass_1-4 (naturally coded; _Isclass_1 omitted)
Zero-inflated Poisson regression Number of obs = 59080
Nonzero obs = 39143
Zero obs = 19937
Inflation model = logit LR chi2(3) = 1037.06
Log likelihood = -140326.7 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| IRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nconany |
_Isclass_2 | 1.074051 .011087 6.921 0.000 1.052539 1.096002
_Isclass_3 | 1.168243 .0089174 20.372 0.000 1.150895 1.185852
_Isclass_4 | 1.287653 .0105419 30.881 0.000 1.267156 1.308481
pyr | (exposure)
-------------+----------------------------------------------------------------
inflate |
_Isclass_2 | -.0307179 .0328074 -0.936 0.349 -.0950192 .0335833
_Isclass_3 | -.0580417 .0244904 -2.370 0.018 -.106042 -.0100414
_Isclass_4 | -.0452335 .0269985 -1.675 0.094 -.0981496 .0076826
_cons | -.7961444 .0184429 -43.168 0.000 -.8322919 -.7599969
------------------------------------------------------------------------------
The zip model does not allow for between-subject
heterogeneity. nbreg will model the between-subject
heterogeneity, but it will enforce the same process for the zero and nonzero
counts.
For these data, the output is
. xi: nbreg nconany i.sclass, irr exposure(pyr) nolog
i.sclass _Isclass_1-4 (naturally coded; _Isclass_1 omitted)
Negative binomial regression Number of obs = 59080
LR chi2(3) = 327.27
Dispersion = mean Prob > chi2 = 0.0000
Log likelihood = -118859.37 Pseudo R2 = 0.0014
------------------------------------------------------------------------------
nconany | IRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Isclass_2 | 1.084091 .0201579 4.342 0.000 1.045293 1.124328
_Isclass_3 | 1.182971 .0164207 12.105 0.000 1.151221 1.215597
_Isclass_4 | 1.30598 .0201062 17.340 0.000 1.267161 1.345987
pyr | (exposure)
-------------+----------------------------------------------------------------
/lnalpha | .2679188 .0088858 .2505029 .2853346
-------------+----------------------------------------------------------------
alpha | 1.307241 .0116159 112.539 0.000 1.284671 1.330207
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(1) = 80508.81 Prob>=chibar2 = 0.000
Here some issues become more complicated. It is true that
zip does not allow for between-subject
heterogeneity. However, the overdispersion in the raw data and the
significance of alpha in the nbreg output could be
the result of a process that gave rise to the zero inflation. Long (1997)
notes on page 244 that in a ZIP model, the conditional variance of the count
variable is larger than the conditional mean as long as the probability of
an excess zero is not zero. That probability is the cumulative distribution
function (logistic, for the logit inflation model used here) evaluated at
xb, the linear combination of the coefficients and the data in the inflation
equation. It is zero only in the limit as xb goes to negative infinity. In
particular, if all the coefficients in the inflation equation are zero, then
this probability is one-half.
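The variance result can be checked numerically. The sketch below uses the standard ZIP moments, with psi denoting the probability of an excess zero: the mean is (1 - psi)*mu and the variance is (1 - psi)*mu*(1 + psi*mu). The function names and input values are illustrative assumptions, and the inflation model is taken to be logit, as in the zip output above.

```python
import math

def logistic_cdf(xb: float) -> float:
    """Logistic CDF, written to avoid overflow for large |xb|."""
    if xb >= 0:
        return 1.0 / (1.0 + math.exp(-xb))
    e = math.exp(xb)
    return e / (1.0 + e)

def zip_mean_variance(mu: float, xb_inflate: float) -> tuple[float, float]:
    """Mean and variance of a ZIP count with Poisson mean mu."""
    psi = logistic_cdf(xb_inflate)  # probability of an excess zero
    mean = (1.0 - psi) * mu
    variance = (1.0 - psi) * mu * (1.0 + psi * mu)
    return mean, variance

# All inflation coefficients zero: xb = 0, so psi = one-half.
mean_half, var_half = zip_mean_variance(mu=2.0, xb_inflate=0.0)

# xb very negative: psi is essentially zero and ZIP collapses to Poisson.
mean_lim, var_lim = zip_mean_variance(mu=2.0, xb_inflate=-50.0)

print(f"xb =   0: mean {mean_half:.3f}, variance {var_half:.3f}")
print(f"xb = -50: mean {mean_lim:.3f}, variance {var_lim:.3f}")
```

The variance-to-mean ratio is 1 + psi*mu, so any positive psi produces overdispersion without between-subject heterogeneity.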
Either the ZIP model or the negative binomial model could account for both
the overdispersion and the “excess zeros” in the raw data.
Furthermore, both zip and
nbreg produce results that seem well behaved. At
this point, we might want a test of nonnested models to compare the ZIP with
the negative binomial model. Stata does not have a command to perform this
test out of the box. There may be assumptions that would permit a Hausman
test of this hypothesis. These assumptions would probably be rather
arbitrary and very strong. There is a Vuong (1989) test for comparing these
two models; however, it is not yet implemented in Stata.
If we suspect both a separate process for the zero counts and
between-subject heterogeneity, then we would want to try
zinb, as shown in the output below.
For the data at hand, the estimates of the coefficients in the inflation
equation have very large standard errors.
. xi: zinb nconany i.sclass, inflate(i.sclass) irr exposure(pyr) vuong
i.sclass _Isclass_1-4 (naturally coded; _Isclass_1 omitted)
Zero-inflated negative binomial regression Number of obs = 59080
Nonzero obs = 39143
Zero obs = 19937
Inflation model = logit LR chi2(3) = 324.56
Log likelihood = -118858.5 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| IRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nconany |
_Isclass_2 | 1.084098 .0201135 4.352 0.000 1.045384 1.124245
_Isclass_3 | 1.183006 .0163847 12.134 0.000 1.151324 1.215559
_Isclass_4 | 1.319803 .0227859 16.072 0.000 1.27589 1.365226
pyr | (exposure)
-------------+----------------------------------------------------------------
inflate |
_Isclass_2 | -10.70419 10872.28 -0.001 0.999 -21319.99 21298.58
_Isclass_3 | -8.182399 2243.279 -0.004 0.997 -4404.929 4388.564
_Isclass_4 | 9.078406 31.94348 0.284 0.776 -53.52967 71.68648
_cons | -13.64615 31.93554 -0.427 0.669 -76.23867 48.94637
-------------+----------------------------------------------------------------
/lnalpha | .2618876 .0099614 26.290 0.000 .2423636 .2814115
-------------+----------------------------------------------------------------
alpha | 1.29938 .0129436 1.274257 1.324999
------------------------------------------------------------------------------
Vuong test of zinb vs. standard negative binomial: z = 0.625 Pr>z = 0.7339
The output indicates that the Vuong test does not favor either model. Vuong
(1989) developed some general tests of nonnested models. Greene (1994)
adapts one of these tests to the cases of ZIP versus Poisson and
zero-inflated negative binomial versus negative binomial. This test has been
implemented in Stata. As described in Long (1997), this statistic has a
standard normal distribution, with large positive values favoring the
zero-inflated model and large negative values favoring the
non-zero-inflated version (negative binomial in this case). Values close to
zero in absolute value favor neither model. Here the statistic is 0.625 in
absolute value, and the reported p-value of .7339 does not favor either
model. The very large standard errors on the coefficients in the inflation
equation, however, do imply a definite lack of fit of the zero-inflated
negative binomial model.
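To see how a Pr>z value relates to the statistic, the upper-tail probability of a standard normal can be computed directly. This is a minimal sketch using only the standard library. The function name is an illustrative choice, and the sketch says nothing about how Stata computes the statistic itself.

```python
import math

def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# A statistic of 0.625 in absolute value, taken with either sign.
for z in (0.625, -0.625):
    print(f"z = {z:+.3f}  Pr>z = {upper_tail(z):.4f}")
```

Under this convention, an upper-tail probability of about 0.734 corresponds to a statistic of about -0.625, while +0.625 gives about 0.266. Either way the statistic is small in absolute value, which is the sense in which the test favors neither model.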
Even if one were to transform the confidence intervals to Bonferroni
confidence intervals, it would appear that essentially any value of the
cumulative distribution function of the linear combination of the
coefficients of the inflation equation and the data is possible. The large
negative coefficients at the lower bounds of the confidence intervals in the
inflation equation mean that xb takes on
large negative values when evaluated at these lower bounds. The distribution
function evaluated at such values is essentially zero. When this value is
zero, the zero-inflated negative binomial model reduces to the negative
binomial.
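The size of these probabilities can be illustrated by evaluating the logistic distribution function at values taken from the zinb output. As a simplification, the sketch treats each single coefficient (the _cons estimate and the two confidence bounds for _Isclass_2) as a stand-in for xb. That is an assumption made for illustration, since xb for a given patient combines several coefficients.

```python
import math

def logistic_cdf(xb: float) -> float:
    """Logistic CDF, written to avoid overflow for large |xb|."""
    if xb >= 0:
        return 1.0 / (1.0 + math.exp(-xb))
    e = math.exp(xb)
    return e / (1.0 + e)

# Values copied from the inflate equation of the zinb output above.
cons_estimate = -13.64615
sclass2_lower = -21319.99
sclass2_upper = 21298.58

for label, xb in [("_cons estimate", cons_estimate),
                  ("_Isclass_2 lower bound", sclass2_lower),
                  ("_Isclass_2 upper bound", sclass2_upper)]:
    print(f"{label:>24}: psi = {logistic_cdf(xb):.3g}")
```

At the _cons estimate the implied probability of an excess zero is on the order of one in a million, and across the confidence bounds it ranges from 0 to 1 in floating point. That range is why the inflation equation adds essentially no information here.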
This example has illustrated the ability of all three models to account for
overdispersion and excess zeros in the raw data. While the analysis is not
conclusive, it would seem that the data favor either a negative binomial
model or a zero-inflated Poisson model.
References
- Cameron, A. C., and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.
- Greene, W. H. 1994. Accounting for excess zeros and sample selection in Poisson and negative binomial regression models. Working paper EC-94-10, Stern School of Business, NYU.
- Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.
- Vuong, Q. H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57: 307–333.