What's this about?
Poisson regression is used when the dependent variable is a count from
a Poisson process.
Outcomes can be left-censored if they are not observed when they fall
below a certain level, and right-censored if they are not observed
when they rise above another level.
The new command cpoisson fits Poisson regression models on count
data and allows the counts to be left-censored, right-censored, or
both. The censoring can be at constant values, or it can differ across
observations.
An example of a right-censored count outcome is the number of cars in a
family, where data might be top-coded at 3 or more.
An example of a left-censored count outcome is the number of cookie
boxes sold by Girl Scouts if the lowest value recorded is
"10 or fewer" boxes.
Left- and right-censoring combined is also known as interval-censoring.
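Censoring limits are given to cpoisson through the ll() and ul() options,
which can be constants or variables, so the censoring points may differ
across observations. Below is a hedged sketch of how the examples above
might be fit; the variable names are hypothetical, and [R] cpoisson has
the exact option names and syntax.

. cpoisson ncars income i.urban, ul(3)
. cpoisson boxes nweeks i.troop, ll(10)
. cpoisson visits age i.smoker, ll(1) ul(20)

The first command treats recorded values of 3 as "3 or more", the second
treats recorded values of 10 as "10 or fewer", and the third censors on
both sides.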
Distinguish between censored and truncated data. With censoring, it is
the outcome value that is not fully observed even though the observation
is in our data; we still observe the person's other values. With
truncation, the observation is missing from our data entirely. Stata
has an estimator for truncated Poisson data; see [R] tpoisson.
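For comparison, here is a hedged sketch of a truncated-Poisson fit with
the same hypothetical cookie-sales variables, assuming tpoisson's ll()
option marks the lower truncation point:

. tpoisson boxes nweeks i.troop, ll(10)

Here scouts selling 10 or fewer boxes would be absent from the data
entirely rather than recorded as "10 or fewer".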
Let's see it work
Below we study the number of car accidents a person has during a year.
The number recorded is 0, 1, 2, or 3, and 3 means 3 or more accidents.
The number is right-censored.
We will model accidents as a function of the number of previous
accidents, whether the driver is a parent, and the number of traffic
tickets the driver received during the previous year.
. cpoisson accidents i.past i.parent i.ntickets, ul(3) irr
initial: log likelihood = -3352.1349
rescale: log likelihood = -3352.1349
Iteration 0: log likelihood = -3352.1349
Iteration 1: log likelihood = -3348.7553
Iteration 2: log likelihood = -3348.737
Iteration 3: log likelihood = -3348.737
Censored Poisson regression                     Number of obs     =      3,000
                                                LR chi2(8)        =     312.86
Log likelihood = -3348.737                      Prob > chi2       =     0.0000

------------------------------------------------------------------------------
   accidents |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.past |   1.325026   .1026562     3.63   0.000     1.138355    1.542308
    1.parent |    .644582   .0288632    -9.81   0.000     .5904226    .7037095
             |
    ntickets |
          1  |   1.028125   .0511614     0.56   0.577     .9325849    1.133452
          2  |   1.094165   .0783527     1.26   0.209      .950886    1.259032
          3  |   3.015248    .241031    13.81   0.000     2.577984    3.526679
          4  |   2.615793   .4061223     6.19   0.000     1.929513    3.546166
          5  |   4.317464   1.580035     4.00   0.000     2.107268    8.845809
          6  |   2.339149   1.655636     1.20   0.230     .5842281    9.365548
             |
       _cons |   .8550119    .026662    -5.02   0.000     .8043201    .9088984
------------------------------------------------------------------------------
          0  left-censored observations
      2,827    uncensored observations
        173  right-censored observations
We interpret the model coefficients (or incidence-rate ratios) as if the
censoring had not occurred, which is to say, as though we had observed
all of the accident counts, even those above 3.
We find that past accidents predict more future accidents, that being a
parent predicts fewer future accidents, and that more tickets generally
predict more future accidents, although the effects of having just 1 or 2
tickets are not statistically significant.
Because of the censoring, we do not know which of the people
coded as having 3 accidents really had exactly 3 accidents and which
had more.
We can, however, now make predictions of the expected uncensored number of
accidents and the probabilities of any specified number of accidents,
including values greater than 3.
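For instance, here is a hedged sketch of such predictions; the n and pr()
statistics are assumed to follow the usual poisson postestimation
conventions, so check [R] cpoisson postestimation for the exact list.

. predict nhat, n
. predict prexactly3, pr(3)

The first line stores each driver's expected uncensored number of
accidents, and the second stores the probability of exactly 3 accidents;
pr() also accepts a range, as in the pr(4,.) used next.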
We wonder, what are the chances anyone had more than 3 accidents in
our data? Our data were officially top-coded, but were they practically
top-coded? We can obtain each driver's probability of having
four or more accidents by typing
. predict fourplus, pr(4,.)
We now have the probability that each driver in our sample had four or more
accidents. To get the expected number of drivers who had 4 or more
accidents, we simply sum these probabilities
. total fourplus
Total estimation                    Number of obs   =      3,000

--------------------------------------------------------------
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    fourplus |   52.32738   2.496397      47.43256     57.2222
--------------------------------------------------------------
We expect 52.3 drivers in our data had more than 3 accidents,
and top-coding almost certainly affected our data.
Almost certainly? We have a standard error above, but that standard
error and confidence interval do not account for the probabilities
having themselves been estimated. If we use margins to perform
the computation, it will produce the correct standard error and
confidence interval.
. margins , expression(3000*predict(pr(4,.)))
Predictive margins                              Number of obs     =      3,000
Model VCE    : OIM

Expression   : 3000*predict(pr(4,.))

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   52.32738   4.377615    11.95   0.000     43.74741    60.90735
------------------------------------------------------------------------------
margins wants to report a mean, so we had to trick it into giving
us a total by multiplying the probabilities by our sample size of 3000.
With such a small standard error and a lower bound of 43.7 on our
confidence interval, we can definitively say, or at least as
definitively as any statistician can say, that top-coding affected our
data.
Tell me more
Read more about censored Poisson models in the Stata Base Reference
Manual; see [R] cpoisson.