Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Count data model with underdispersion

 From Maarten Buis To statalist@hsphsun2.harvard.edu Subject Re: st: Count data model with underdispersion Date Tue, 24 Jan 2012 11:44:47 +0100

```>> On Mon, Jan 23, 2012 at 11:43 AM, F Tollnek wrote:
>>> Because there are uncertainties about the people who reported "zero"
>>> children, I have to truncate the data at the zero values. As often
>>> proposed, I used the Zero Truncated Poisson model for analyzing my
>>> data, but the problem is that there is significant underdispersion given.

On Mon, Jan 23, 2012 at 1:44 PM, F Tollnek wrote:
> thank you very much for your answer! I might also calculate a model, where I
> keep the persons with zero children. The problem is that the data stems from
> the 18th century and sometimes it is not clear if the people didn't have
> children or if they just were not reported. Also, there are couples with
> children, but from another marriage, where the number of children is set to
> zero; as we only would like to analyse families with children of both
> spouses, we have to leave out the couples with "zero" children at some
> point.

The problem is that -tpoisson- still needs to "fill in" some value for
the count of couples with 0 kids.  It does not do so by changing the
data, but by including that estimated count in the likelihood
function. It estimates those 0 counts by assuming that the number of
kids follow a Poisson distribution. Because of that the truncated
Poisson regression is much more vulnerable to deviations from the
assumed distribution than a regular Poisson regression. It is
substantively unlikely that the number of kids follow a Poisson
distribution, and you already found substantial underdispersion. As a
consequence, you have, by removing the 0 kids couples and estimating a
truncated poisson, replaced one set of problematic 0s with another set
of problematic 0s. Adding a -vce(robust)- option is not going help
with that, as that only changes the standard errors.

To illustrate this problem I have below added a simulation. In it I
created a variable number of kids (y) using a sequential logit process
such that it depends on one binary variable (x). From it I derive the
expected number of kids conditional on x, and if the model is correct
than the coefficient of x should be the logarithm of the ratio of
these two means. Even though the data where not created using a
Poisson process, the bias of -poisson- is negligible and the empirical
coverage of the 95% confidence interval almost  exactly matches the
nominal 95%. This cannot be said of the -tpoisson- model; there is a
severe bias and (as a consequence) the coverage of the confidence
interval is nowhere near the nominal 95%.

This simulation uses -simsum- to tabulate the results. You can install
it by typing in Stata -ssc install simsum-. It is described in: Ian R.
White (2010) simsum: Analyses of simulation studies including Monte
Carlo error, The Stata Journal, 10(3): 369-385.

*--------------------- begin simulation ---------------------------
set seed 12345

tempname p ip Ey0 Ey1

// expected number of kids given x==1
scalar `p'  = invlogit(1)   // probability of getting the ith child
scalar `ip' = invlogit(-1)  // probability of not getting the ith child
scalar `Ey1' = `ip'*0 +                          ///
`p'*`ip'*1 +                      ///
`p'*`p'*`ip'*2 +                  ///
`p'*`p'*`p'*`ip'*3 +              ///
`p'*`p'*`p'*`p'*`ip'*4 +          ///
`p'*`p'*`p'*`p'* `p'*`ip'*5 +     ///
`p'*`p'*`p'*`p'* `p'*`p'*`ip'*6 + ///
`p'*`p'*`p'*`p'* `p'*`p'* `p'*7

// expected number of kids given x==0
scalar `p'  = .5 // invlogit(0)=.5
scalar `ip' = .5
scalar `Ey0' = `ip'*0 +                          ///
`p'*`ip'*1 +                      ///
`p'*`p'*`ip'*2 +                  ///
`p'*`p'*`p'*`ip'*3 +              ///
`p'*`p'*`p'*`p'*`ip'*4 +          ///
`p'*`p'*`p'*`p'* `p'*`ip'*5 +     ///
`p'*`p'*`p'*`p'* `p'*`p'*`ip'*6 + ///
`p'*`p'*`p'*`p'* `p'*`p'* `p'*7

di as txt "the true rate ratio is: "  as result `Ey1'/`Ey0'
di as txt "the true coefficient is: " as result ln(`Ey1'/`Ey0')

program drop _all
program define sim, rclass
// generate data
drop _all
set obs 500
gen x = runiform() < .5
gen byte y = 0
forvalues i = 1/7 {
replace y = `i' if runiform() < invlogit(0 + 1*x) & y == `=`i'-1'
}

// estimate the poisson model
poisson y x, vce(robust)
return scalar bp = _b[x]
return scalar sp = _se[x]

// estimate the truncated poisson model
tpoisson y x if y > 0 , vce(robust)
return scalar bt = _b[x]
return scalar st = _se[x]
end

// repeat this 20,000 times and collect
// the resulting coefficients and standard errors
simulate bp=r(bp) sp=r(sp) bt=r(bt) st=r(st), reps(20000): sim

// tabulate the results
simsum b*, true(`=ln(`Ey1'/`Ey0')') se(s*) mcse format(%7.0g) ///
bsims sesims bias cover

// graph the results
twoway kdensity bp, xline(`=ln(`Ey1'/`Ey0')') || kdensity bt
*---------------------- end simulation ----------------------------
(For more on examples I sent to the Statalist see:
http://www.maartenbuis.nl/example_faq )

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```