Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Count data model with underdispersion

From	Maarten Buis <[email protected]>
To	[email protected]
Subject	Re: st: Count data model with underdispersion
Date	Tue, 24 Jan 2012 11:44:47 +0100

>> On Mon, Jan 23, 2012 at 11:43 AM, F Tollnek wrote:
>>> Because there are uncertainties about the people who reported "zero"
>>> children, I have to truncate the data at the zero values. As often
>>> proposed, I used the Zero Truncated Poisson model for analyzing my
>>> data, but the problem is that there is significant underdispersion given.

On Mon, Jan 23, 2012 at 1:44 PM, F Tollnek wrote:
> thank you very much for your answer! I might also calculate a model, where I
> keep the persons with zero children. The problem is that the data stems from
> the 18th century and sometimes it is not clear if the people didn't have
> children or if they just were not reported. Also, there are couples with
> children, but from another marriage, where the number of children is set to
> zero; as we only would like to analyse families with children of both
> spouses, we have to leave out the couples with "zero" children at some
> point.

The problem is that -tpoisson- still needs to "fill in" some value for
the count of couples with 0 kids.  It does not do so by changing the
data, but by including that estimated count in the likelihood
function. It estimates those 0 counts by assuming that the number of
kids follow a Poisson distribution. Because of that the truncated
Poisson regression is much more vulnerable to deviations from the
assumed distribution than a regular Poisson regression. It is
substantively unlikely that the number of kids follow a Poisson
distribution, and you already found substantial underdispersion. As a
consequence, you have, by removing the 0 kids couples and estimating a
truncated poisson, replaced one set of problematic 0s with another set
of problematic 0s. Adding a -vce(robust)- option is not going help
with that, as that only changes the standard errors.

To illustrate this problem I have below added a simulation. In it I
created a variable number of kids (y) using a sequential logit process
such that it depends on one binary variable (x). From it I derive the
expected number of kids conditional on x, and if the model is correct
than the coefficient of x should be the logarithm of the ratio of
these two means. Even though the data where not created using a
Poisson process, the bias of -poisson- is negligible and the empirical
coverage of the 95% confidence interval almost  exactly matches the
nominal 95%. This cannot be said of the -tpoisson- model; there is a
severe bias and (as a consequence) the coverage of the confidence
interval is nowhere near the nominal 95%.

This simulation uses -simsum- to tabulate the results. You can install
it by typing in Stata -ssc install simsum-. It is described in: Ian R.
White (2010) simsum: Analyses of simulation studies including Monte
Carlo error, The Stata Journal, 10(3): 369-385.

*--------------------- begin simulation ---------------------------
set seed 12345

tempname p ip Ey0 Ey1

// expected number of kids given x==1
scalar `p'  = invlogit(1)   // probability of getting the ith child
scalar `ip' = invlogit(-1)  // probability of not getting the ith child
scalar `Ey1' = `ip'*0 +                          ///
               `p'*`ip'*1 +                      ///
               `p'*`p'*`ip'*2 +                  ///
               `p'*`p'*`p'*`ip'*3 +              ///
               `p'*`p'*`p'*`p'*`ip'*4 +          ///
               `p'*`p'*`p'*`p'* `p'*`ip'*5 +     ///
               `p'*`p'*`p'*`p'* `p'*`p'*`ip'*6 + ///
               `p'*`p'*`p'*`p'* `p'*`p'* `p'*7

// expected number of kids given x==0			
scalar `p'  = .5 // invlogit(0)=.5
scalar `ip' = .5
scalar `Ey0' = `ip'*0 +                          ///
               `p'*`ip'*1 +                      ///
               `p'*`p'*`ip'*2 +                  ///
               `p'*`p'*`p'*`ip'*3 +              ///
               `p'*`p'*`p'*`p'*`ip'*4 +          ///
               `p'*`p'*`p'*`p'* `p'*`ip'*5 +     ///
               `p'*`p'*`p'*`p'* `p'*`p'*`ip'*6 + ///
               `p'*`p'*`p'*`p'* `p'*`p'* `p'*7

di as txt "the true rate ratio is: "  as result `Ey1'/`Ey0'
di as txt "the true coefficient is: " as result ln(`Ey1'/`Ey0')

program drop _all
program define sim, rclass
	// generate data
	drop _all
	set obs 500
	gen x = runiform() < .5
	gen byte y = 0
	forvalues i = 1/7 {
		replace y = `i' if runiform() < invlogit(0 + 1*x) & y == `=`i'-1'
	}

	// estimate the poisson model
	poisson y x, vce(robust)
	return scalar bp = _b[x]
	return scalar sp = _se[x]

	// estimate the truncated poisson model
	tpoisson y x if y > 0 , vce(robust)
	return scalar bt = _b[x]
	return scalar st = _se[x]
end

// repeat this 20,000 times and collect
// the resulting coefficients and standard errors
simulate bp=r(bp) sp=r(sp) bt=r(bt) st=r(st), reps(20000): sim

// tabulate the results
simsum b*, true(`=ln(`Ey1'/`Ey0')') se(s*) mcse format(%7.0g) ///
           bsims sesims bias cover

// graph the results		
twoway kdensity bp, xline(`=ln(`Ey1'/`Ey0')') || kdensity bt
*---------------------- end simulation ----------------------------
(For more on examples I sent to the Statalist see:
http://www.maartenbuis.nl/example_faq )		

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: Count data model with underdispersion
  - From: F Tollnek <[email protected]>
- Re: st: Count data model with underdispersion
  - From: Maarten Buis <[email protected]>
- Re: st: Count data model with underdispersion
  - From: F Tollnek <[email protected]>

Prev by Date: Re: st: ice error
Next by Date: st: Help with parametric frailty models
Previous by thread: Re: st: Count data model with underdispersion
Next by thread: st: scientific notation turn off
Index(es):
- Date
- Thread