Poisson models with sample selection


  • Endogenous sample selection, aka

    • Missing on unobservables

    • Missing not at random (MNAR)

  • Incidence rate ratios (IRRs)

  • Robust, cluster–robust, and bootstrap standard errors

  • Support for survey data

  • Advanced inference

    • Make inferences about:

      • Expected count

      • Probability of any count

      • Incidence rates

      • How covariates affect expected counts, incidence rates, or probability of a count

    • Make inferences for groups or individuals:

      • Full population

      • Subpopulations

      • Expected results for specific covariate values

    • Profile plots of counts, probabilities, and effects with CIs

Poisson regression is often used to model count outcomes, such as the number of patents that firms were granted, the number of times people visited the doctor, or the number of times unfortunate Prussian soldiers died by being kicked by horses.

With observational data, we do not always see the outcome for all subjects. This is different from observing zero events; we simply have no information at all about the outcome. Why? Surveys have nonresponse. Firms may prefer trade secrets to patent applications. And so on. We might expect the outcomes of those we observe and those we do not observe to be different. This kind of missingness is called sample selection, or more correctly, endogenous sample selection. It is also called missing not at random (MNAR).

Stata's heckpoisson command fits models for count data and produces estimates as though the sample selection had not occurred. That is to say, it fits models that let you make inferences about the whole population, not just those who would be observed.

Let's see it work

We are interested in how a firm's investment in research and development (R&D) affects the amount of innovation. We want to control for whether a firm is in the information technology (IT) sector, because we suspect such firms have a higher rate of innovation regardless of investment. We measure innovation as the number of patents granted (patents). We also have R&D investment in millions of dollars (investment) and an indicator for IT firms (i.firmtype).

We would like to type

. poisson patents investment i.firmtype

and make our inferences about the impact of R&D investment and firm type on patents. There is, however, a problem. Many firms did not apply for any patents. We assume that some made no patent-worthy discoveries; those firms would simply be the zeros in our Poisson distribution. But other firms might not file for patents at all because they prefer to keep innovations as trade secrets.

We suspect that firms who choose to keep trade secrets rather than file for patents are inherently different from those who regularly file for patents. Specifically, we think their choice to keep trade secrets is not independent of their expected number of patents, if they were to apply for patents.

We want to understand how investment affects overall innovation in the population of all firms, not just the expected number of patents obtained by firms who regularly apply for patents. We need to account for the non-random missingness induced by those firms that choose to keep trade secrets. We need to model the sample selection (missingness) process.

We think that the propensity to apply for patents is affected by firm size (size) in addition to investment and i.firmtype. Access to lawyers and such would depend on firm size. Whether a firm has ever applied for a patent, which we use as an indicator of participation in the patent process, is recorded in applied. Just 55% of our sample has ever applied for a patent.
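Written out, the two-part model we are about to fit can be sketched as follows. This is the standard formulation of a Poisson model with endogenous selection, which we assume is what heckpoisson estimates; see [R] heckpoisson for the exact specification. The symbols β, γ, ε1, and ε2 are our own notation, and IT_j stands for the firmtype indicator.

```latex
\begin{aligned}
E(\text{patents}_j \mid \text{investment}_j, \text{IT}_j, \epsilon_{1j})
  &= \exp(\beta_0 + \beta_1\,\text{investment}_j + \beta_2\,\text{IT}_j + \epsilon_{1j}) \\
\text{applied}_j
  &= \mathbf{1}(\gamma_0 + \gamma_1\,\text{investment}_j + \gamma_2\,\text{size}_j
     + \gamma_3\,\text{IT}_j + \epsilon_{2j} > 0) \\
\begin{pmatrix}\epsilon_{1j}\\ \epsilon_{2j}\end{pmatrix}
  &\sim N\!\left(\begin{pmatrix}0\\0\end{pmatrix},
     \begin{pmatrix}\sigma^2 & \rho\sigma\\ \rho\sigma & 1\end{pmatrix}\right)
\end{aligned}
```

The count patents_j is observed only when applied_j = 1. If ρ = 0, the two equations are independent and selection can be ignored. If ρ differs from 0, firms that are more (or less) likely to apply also tend to have systematically different patent counts.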

We fit our Poisson model for patents adding a model for those who apply for patents,

. heckpoisson patents investment i.firmtype,
     select(applied = investment size i.firmtype)

Poisson regression with endogenous selection    Number of obs     =     10,000
(25 quadrature points)                                Selected    =      5,575
                                                      Nonselected =      4,425

                                                Wald chi2(2)      =     443.90
Log likelihood = -17440.44                      Prob > chi2       =     0.0000

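------------------------------------------------------------------------------
     patents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
patents      |
  investment |    .497821   .0507866     9.80   0.000      .398281     .597361
   IT sector |   .5833501   .0300366    19.42   0.000     .5244795    .6422207
       _cons |  -1.855143    .208204    -8.91   0.000    -2.263216   -1.447071
-------------+----------------------------------------------------------------
applied      |
  investment |   .1369954   .0447339     3.06   0.002     .0493185    .2246723
        size |   .2774201   .0469132     5.91   0.000     .1854718    .3693683
   IT sector |   .2750208   .0277032     9.93   0.000     .2207236     .329318
       _cons |  -1.660778   .2631227    -6.31   0.000    -2.176489   -1.145066
-------------+----------------------------------------------------------------
     /athrho |   1.161677   .2847896     4.08   0.000     .6034999    1.719855
    /lnsigma |  -.3029685   .0499674    -6.06   0.000    -.4009028   -.2050342
-------------+----------------------------------------------------------------
         rho |   .8215857   .0925557                      .5395353    .9378455
       sigma |   .7386224    .036907                      .6697151    .8146195
------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0): chi2(1) =    16.64   Prob > chi2 = 0.0000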

The first part of the output reports the coefficients of the Poisson model for number of patents granted. The second reports the coefficients of the selection model. The coefficients reported in the first part of the output are applicable to 100% of the population, not just the 55% who participate in the patent process.
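The /athrho, /lnsigma, rho, and sigma rows of the output are two parameters reported on two scales. Assuming the usual convention that athrho is the inverse hyperbolic tangent of rho and lnsigma is the natural log of sigma, the relationship can be checked by hand with display, using values copied from the output above:

```stata
* rho = tanh(/athrho); should be close to the reported .8215857
display tanh(1.161677)

* sigma = exp(/lnsigma); should be close to the reported .7386224
display exp(-.3029685)
```

The model is estimated on the transformed scales, presumably because they are unbounded, and then mapped back to rho and sigma for reporting.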

The footer presents a test of the correlation between the errors of the selection and outcome equations. If there were no correlation, we could fit a simple Poisson model to the 55% sample, and those results would be equally applicable to the entire population. The test's null hypothesis is that of no correlation, and it is rejected. We did need to account for sample selection.
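To see what ignoring selection would cost, one could fit the simple Poisson model to the selected firms and place it side by side with the heckpoisson fit. This is only a sketch. It assumes that the example dataset is in memory, that applied is coded 0/1, and that the commands are run from a do-file (the /// line continuation does not work interactively). The names naive and heck are arbitrary labels of our choosing.

```stata
* Simple Poisson using only firms that participate in the patent process
poisson patents investment i.firmtype if applied == 1
estimates store naive

* Poisson with endogenous selection, as fit above
heckpoisson patents investment i.firmtype, ///
    select(applied = investment size i.firmtype)
estimates store heck

* Coefficients and standard errors side by side
estimates table naive heck, se

* The footer's Wald statistic is the squared z statistic for /athrho,
* which should be close to the reported 16.64
display (1.161677/.2847896)^2
```

Because the test rejects rho = 0, we would expect the two sets of outcome coefficients to differ, and only the heckpoisson estimates to apply to the full population of firms.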

Results from Poisson models are often reported as incidence rate ratios. To see them, we could type

. heckpoisson, irr
(output omitted)

Had we displayed these results, we would have seen that the IRR for IT firms is about 1.8, meaning that the expected number of patents in the IT sector is 1.8 times the expected number in the other sectors.
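An incidence rate ratio is simply the exponentiated coefficient, so the figure quoted above can be checked by hand from the coefficient table. A minimal sketch, using coefficients copied from the output:

```stata
* IRR for IT firms = exp(coefficient on IT sector); roughly 1.79
display exp(.5833501)

* The same transformation applied to investment gives its IRR
display exp(.497821)
```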

Perhaps more interestingly, we can use margins to estimate the expected number of patents for IT and non-IT firms over a range of R&D investment levels.

. margins firmtype, at(investment=(.5(.5)4))
(output omitted)

The output is fairly long, so we will plot the results on a graph instead.

Among other things that we could read off this graph, we see that IT firms can expect one patent per year at an investment level of about 2 million dollars. Other types of firms require just over 3 million dollars in investment before they can expect one patent per year.
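A graph like the one described can be drawn with marginsplot, which plots the results of the most recent margins command. The sketch below assumes the heckpoisson model above is the active estimation result and uses the variable names from that model. The dydx() line is an extra illustration of our own, not part of the original example.

```stata
* Expected number of patents by firm type across investment levels
margins firmtype, at(investment=(.5(.5)4))

* Profile plot of those margins with confidence intervals
marginsplot

* Average marginal effect of investment on the expected number of patents
margins, dydx(investment)
```

Running marginsplot immediately after margins matters, because it reads the stored margins results rather than refitting anything.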

Tell me more

Read more about Heckman selection models for count outcomes in [R] heckpoisson.