Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Maarten Buis <maartenlbuis@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Count data model with underdispersion |

Date |
Tue, 24 Jan 2012 11:44:47 +0100 |

>> On Mon, Jan 23, 2012 at 11:43 AM, F Tollnek wrote: >>> Because there are uncertainties about the people who reported "zero" >>> children, I have to truncate the data at the zero values. As often >>> proposed, I used the Zero Truncated Poisson model for analyzing my >>> data, but the problem is that there is significant underdispersion given. On Mon, Jan 23, 2012 at 1:44 PM, F Tollnek wrote: > thank you very much for your answer! I might also calculate a model, where I > keep the persons with zero children. The problem is that the data stems from > the 18th century and sometimes it is not clear if the people didn't have > children or if they just were not reported. Also, there are couples with > children, but from another marriage, where the number of children is set to > zero; as we only would like to analyse families with children of both > spouses, we have to leave out the couples with "zero" children at some > point. The problem is that -tpoisson- still needs to "fill in" some value for the count of couples with 0 kids. It does not do so by changing the data, but by including that estimated count in the likelihood function. It estimates those 0 counts by assuming that the number of kids follow a Poisson distribution. Because of that the truncated Poisson regression is much more vulnerable to deviations from the assumed distribution than a regular Poisson regression. It is substantively unlikely that the number of kids follow a Poisson distribution, and you already found substantial underdispersion. As a consequence, you have, by removing the 0 kids couples and estimating a truncated poisson, replaced one set of problematic 0s with another set of problematic 0s. Adding a -vce(robust)- option is not going help with that, as that only changes the standard errors. To illustrate this problem I have below added a simulation. In it I created a variable number of kids (y) using a sequential logit process such that it depends on one binary variable (x). From it I derive the expected number of kids conditional on x, and if the model is correct than the coefficient of x should be the logarithm of the ratio of these two means. Even though the data where not created using a Poisson process, the bias of -poisson- is negligible and the empirical coverage of the 95% confidence interval almost exactly matches the nominal 95%. This cannot be said of the -tpoisson- model; there is a severe bias and (as a consequence) the coverage of the confidence interval is nowhere near the nominal 95%. This simulation uses -simsum- to tabulate the results. You can install it by typing in Stata -ssc install simsum-. It is described in: Ian R. White (2010) simsum: Analyses of simulation studies including Monte Carlo error, The Stata Journal, 10(3): 369-385. *--------------------- begin simulation --------------------------- set seed 12345 tempname p ip Ey0 Ey1 // expected number of kids given x==1 scalar `p' = invlogit(1) // probability of getting the ith child scalar `ip' = invlogit(-1) // probability of not getting the ith child scalar `Ey1' = `ip'*0 + /// `p'*`ip'*1 + /// `p'*`p'*`ip'*2 + /// `p'*`p'*`p'*`ip'*3 + /// `p'*`p'*`p'*`p'*`ip'*4 + /// `p'*`p'*`p'*`p'* `p'*`ip'*5 + /// `p'*`p'*`p'*`p'* `p'*`p'*`ip'*6 + /// `p'*`p'*`p'*`p'* `p'*`p'* `p'*7 // expected number of kids given x==0 scalar `p' = .5 // invlogit(0)=.5 scalar `ip' = .5 scalar `Ey0' = `ip'*0 + /// `p'*`ip'*1 + /// `p'*`p'*`ip'*2 + /// `p'*`p'*`p'*`ip'*3 + /// `p'*`p'*`p'*`p'*`ip'*4 + /// `p'*`p'*`p'*`p'* `p'*`ip'*5 + /// `p'*`p'*`p'*`p'* `p'*`p'*`ip'*6 + /// `p'*`p'*`p'*`p'* `p'*`p'* `p'*7 di as txt "the true rate ratio is: " as result `Ey1'/`Ey0' di as txt "the true coefficient is: " as result ln(`Ey1'/`Ey0') program drop _all program define sim, rclass // generate data drop _all set obs 500 gen x = runiform() < .5 gen byte y = 0 forvalues i = 1/7 { replace y = `i' if runiform() < invlogit(0 + 1*x) & y == `=`i'-1' } // estimate the poisson model poisson y x, vce(robust) return scalar bp = _b[x] return scalar sp = _se[x] // estimate the truncated poisson model tpoisson y x if y > 0 , vce(robust) return scalar bt = _b[x] return scalar st = _se[x] end // repeat this 20,000 times and collect // the resulting coefficients and standard errors simulate bp=r(bp) sp=r(sp) bt=r(bt) st=r(st), reps(20000): sim // tabulate the results simsum b*, true(`=ln(`Ey1'/`Ey0')') se(s*) mcse format(%7.0g) /// bsims sesims bias cover // graph the results twoway kdensity bp, xline(`=ln(`Ey1'/`Ey0')') || kdensity bt *---------------------- end simulation ---------------------------- (For more on examples I sent to the Statalist see: http://www.maartenbuis.nl/example_faq ) -------------------------- Maarten L. Buis Institut fuer Soziologie Universitaet Tuebingen Wilhelmstrasse 36 72074 Tuebingen Germany http://www.maartenbuis.nl -------------------------- * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**st: Count data model with underdispersion***From:*F Tollnek <f.tollnek@student.uni-tuebingen.de>

**Re: st: Count data model with underdispersion***From:*Maarten Buis <maartenlbuis@gmail.com>

**Re: st: Count data model with underdispersion***From:*F Tollnek <f.tollnek@student.uni-tuebingen.de>

- Prev by Date:
**Re: st: ice error** - Next by Date:
**st: Help with parametric frailty models** - Previous by thread:
**Re: st: Count data model with underdispersion** - Next by thread:
**st: scientific notation turn off** - Index(es):