- Endogenous sample selection, aka
  - Missing on unobservables
  - Missing not at random (MNAR)
- Incidence rate ratios (IRRs)
- Robust, cluster–robust, and bootstrap standard errors
- Support for survey data
- Advanced inference
  - Make inferences about:
    - Expected count
    - Probability of any count
    - Incidence rates
    - How covariates affect expected counts, incidence rates, or probability of a count
  - Make inferences for groups or individuals:
    - Full population
    - Subpopulations
    - Expected results for specific covariate values
  - Profile plots of counts, probabilities, and effects with CIs


Poisson regression is often used to model count outcomes, such as the number of patents that firms were granted, the number of times people visited the doctor, or the number of unfortunate Prussian soldiers who died after being kicked by horses.

With observational data, we do not always see the outcome for all subjects. This is different from observing zero events; we simply have no information at all about the outcome. Why? Surveys have nonresponse. Firms may prefer trade secrets to patent applications. And so on. We might expect the outcomes of those we observe and those we do not observe to be different. This kind of missingness is called sample selection, or more correctly, endogenous sample selection. It is also called missing not at random (MNAR).

Stata's **heckpoisson** command fits models to count data and produces
estimates as though the sample selection did not occur. That is, it
fits models that let you make inferences about the whole population, not just
those who would be observed.

We are interested in how a firm's investment in research and development (R&D)
increases the amount of innovation. We want to control for those firms that
are in the information technology (IT) sector, because we suspect such firms
have a higher rate of innovation regardless of investment. We measure
innovation as the number of patents granted (**patents**),
R&D investment in millions of dollars (**investment**), and an
indicator for IT firms (**i.firmtype**).

We would like to type

. poisson patents investment i.firmtype

and make our inferences about the impact of R&D investment and firm type on patents. There is, however, a problem. Many firms did not apply for any patents. Some of them presumably made no patent-worthy discoveries; those firms are simply the zeros in our Poisson distribution. Others, however, may not file for patents at all because they prefer to keep innovations as trade secrets.

We suspect that firms who choose to keep trade secrets rather than file for patents are inherently different from those who regularly file for patents. Specifically, we think their choice to keep trade secrets is not independent of their expected number of patents, if they were to apply for patents.

We want to understand how investment affects overall innovation in the population of all firms, not just the expected number of patents obtained by firms who regularly apply for patents. We need to account for the non-random missingness induced by those firms that choose to keep trade secrets. We need to model the sample selection (missingness) process.
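One common way to write a count model with endogenous selection is sketched below. The notation ($y_j$, $s_j$, $x_j$, $w_j$, $\beta$, $\gamma$, $\varepsilon_{1j}$, $\varepsilon_{2j}$) is introduced here for illustration only; $\sigma$ and $\rho$ are meant to correspond to the parameters of the same names in the estimation output. See [R] heckpoisson for the exact specification that the command uses.

$$
\begin{aligned}
y_j \mid x_j, \varepsilon_{1j} &\sim \text{Poisson}\{\exp(x_j\beta + \varepsilon_{1j})\} \\
s_j &= \mathbf{1}(w_j\gamma + \varepsilon_{2j} > 0) \\
(\varepsilon_{1j}, \varepsilon_{2j}) &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix} \right)
\end{aligned}
$$

The count $y_j$ is observed only when $s_j = 1$. When $\rho = 0$, the errors of the two equations are unrelated and the selection can be ignored; when $\rho \neq 0$, a Poisson model fit to the selected firms alone does not describe the whole population.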

We think that the propensity to apply for patents is affected by firm size
(**size**) in addition to **investment** and **i.firmtype**. Access to
legal resources, for example, depends on firm size. Whether a firm has ever applied
for a patent, which we use as an indicator of participation in the patent
process, is recorded in **applied**. Just 55% of our sample has ever
applied for a patent.

We fit our Poisson model for patents, adding a selection model for whether a firm applies for patents:

. heckpoisson patents investment i.firmtype, select(applied = investment size i.firmtype)

Poisson regression with endogenous selection (25 quadrature points)

| Statistic | Value |
|---|---|
| Number of obs | 10,000 |
| Selected | 5,575 |
| Nonselected | 4,425 |
| Wald chi2(2) | 443.90 |
| Prob > chi2 | 0.0000 |
| Log likelihood | -17440.44 |

| patents | Coef. | Std. Err. | z | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| **patents** | | | | | | |
| investment | .497821 | .0507866 | 9.80 | 0.000 | .398281 | .597361 |
| firmtype: IT sector | .5833501 | .0300366 | 19.42 | 0.000 | .5244795 | .6422207 |
| _cons | -1.855143 | .208204 | -8.91 | 0.000 | -2.263216 | -1.447071 |
| **applied** | | | | | | |
| investment | .1369954 | .0447339 | 3.06 | 0.002 | .0493185 | .2246723 |
| size | .2774201 | .0469132 | 5.91 | 0.000 | .1854718 | .3693683 |
| firmtype: IT sector | .2750208 | .0277032 | 9.93 | 0.000 | .2207236 | .329318 |
| _cons | -1.660778 | .2631227 | -6.31 | 0.000 | -2.176489 | -1.145066 |
| /athrho | 1.161677 | .2847896 | 4.08 | 0.000 | .6034999 | 1.719855 |
| /lnsigma | -.3029685 | .0499674 | -6.06 | 0.000 | -.4009028 | -.2050342 |
| rho | .8215857 | .0925557 | | | .5395353 | .9378455 |
| sigma | .7386224 | .036907 | | | .6697151 | .8146195 |

The first part of the output reports the coefficients of the Poisson model for number of patents granted. The second reports the coefficients of the selection model. The coefficients reported in the first part of the output are applicable to 100% of the population, not just the 55% who participate in the patent process.

The footer presents a test of the correlation between the errors of the selection and outcome equations. If there were no correlation, we could fit a simple Poisson model to the 55% sample, and those results would be equally applicable to the entire population. The test's null hypothesis is that of no correlation, and it is rejected. We did need to account for sample selection.
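The rho and sigma rows of the output are transformations of the estimated /athrho and /lnsigma parameters. As a quick check, and assuming the inverse hyperbolic tangent and log parameterizations that those names suggest, the point estimates can be reproduced with **display**. The z statistic for /athrho is shown only as arithmetic (estimate divided by standard error); it is not necessarily the same test as the one reported in the footer.

```stata
* Back-transform the ancillary parameters copied from the output.
* Assumes athrho = atanh(rho) and lnsigma = ln(sigma).
display "rho   = " tanh(1.161677)
display "sigma = " exp(-.3029685)

* Wald z statistic for /athrho: estimate divided by its standard error
display "z     = " 1.161677/.2847896
```

The results should agree with the rho (about .82) and sigma (about .74) rows and with the z value of 4.08 in the /athrho row.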

Results from Poisson models are often reported as incidence rate ratios. To see them, we could type

. heckpoisson, irr

(output omitted)

Had we shown this output, we would see that the IRR for IT firms is about 1.8, meaning that, other things equal, the expected number of patents for firms in the IT sector is 1.8 times the expected number for firms in other sectors.
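An IRR is the exponentiated coefficient from the outcome equation, so the values can be checked by hand. The short sketch below simply exponentiates the two **patents** coefficients copied from the output above; it does not refit the model.

```stata
* IRRs are exponentiated outcome-equation coefficients
display "IRR, IT sector  = " exp(.5833501)
display "IRR, investment = " exp(.497821)
```

The first value is roughly 1.79, which is the "about 1.8" quoted above. The second is roughly 1.65, the factor by which the expected number of patents is multiplied for each additional unit of investment.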

Perhaps more interestingly, we can use **margins** to estimate the expected
number of patents for IT and non-IT firms over a range of R&D investment
levels.

. margins firmtype, at(investment=(.5(.5)4))

(output omitted)

The output is fairly long, so we plot the results on a graph instead.
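A graph of this kind is typically drawn with **marginsplot** immediately after **margins**. The sketch below assumes that the **heckpoisson** model above is still the active estimation result and uses the variable names from that model (**firmtype**, **investment**). The axis titles are illustrative choices, not necessarily those of the original figure.

```stata
* Assumes the heckpoisson model fit above is the current estimation result
margins firmtype, at(investment=(.5(.5)4))

* Plot expected patents against investment, one line per firm type, with CIs
marginsplot, xtitle("R&D investment") ytitle("Expected number of patents")
```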

Among other things that we could read off this graph, we see that IT firms can expect one patent per year at an investment level of about $2 million. Other types of firms require just over $3 million in investment before they can expect one patent per year.
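These break-even investment levels can be approximated from the coefficient table. The check below is a rough sketch that assumes the expected count takes the form exp(xb + sigma^2/2), which we take to be the default prediction after **heckpoisson**; consult [R] heckpoisson postestimation before relying on it. Setting the expected count to 1 means solving xb + sigma^2/2 = 0 for **investment**.

```stata
* Investment level at which the expected number of patents equals 1.
* Assumes E(patents) = exp(xb + sigma^2/2); all values are copied from the table.
scalar s2half = .7386224^2/2

* IT firms: _cons + b_IT + b_inv*investment + sigma^2/2 = 0
display "IT firms:    " (1.855143 - .5833501 - scalar(s2half))/.497821

* Other firms: _cons + b_inv*investment + sigma^2/2 = 0
display "Other firms: " (1.855143 - scalar(s2half))/.497821
```

Under that assumption, the two values come out near 2.0 and 3.2, in line with what the graph shows.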

Read more about Heckman selection models for count outcomes in [R] heckpoisson.