Poisson models with sample selection

Highlights

Endogenous sample selection, aka

Missing on unobservables
Missing not at random (MNAR)

Incidence rate ratios (IRRs)
Robust, cluster–robust, and bootstrap standard errors
Support for survey data
Advanced inference

Make inferences about:

Expected count
Probability of any count
Incidence rates
How covariates affect expected counts, incidence rates, or probability of a count

Make inferences for groups or individuals:

Full population
Subpopulations
Expected results for specific covariate values

Profile plots of counts, probabilities, and effects with CIs

Poisson regression is often used to model count outcomes, such as the number of patents that firms were granted, the number of times people visited the doctor, or the number of times unfortunate Prussian soldiers died by being kicked by horses.

With observational data, we do not always see the outcome for all subjects. This is different from observing zero events; we simply have no information at all about the outcome. Why? Surveys have nonresponse. Firms may prefer trade secrets to patent applications. And so on. We might expect the outcomes of those we observe and those we do not observe to be different. This kind of missingness is called sample selection, or more correctly, endogenous sample selection. It is also called missing not at random (MNAR).

Stata's heckpoisson command fits models to count data and produces estimates as though the sample selection did not occur. That is, it fits models that let you make inferences about the whole population, not just about those whose outcomes are observed.

Let's see it work

We are interested in how a firm's investment in research and development (R&D) increases the amount of innovation. We want to control for whether a firm is in the information technology (IT) sector, because we suspect such firms have a higher rate of innovation regardless of investment. Our variables are the number of patents granted (patents), which is our measure of innovation; R&D investment in thousands of dollars (investment); and an indicator for IT firms (firmtype, specified as i.firmtype).

We would like to type

. poisson patents investment i.firmtype

and make our inferences about the impact of R&D investment and firm type on patents. There is, however, a problem. Many firms did not apply for any patents. We assume that some did not make any patent-worthy discoveries and that would just be the zeros in our Poisson distribution. But some firms might not even file for patents because they prefer to keep innovations as trade secrets.

We suspect that firms who choose to keep trade secrets rather than file for patents are inherently different from those who regularly file for patents. Specifically, we think their choice to keep trade secrets is not independent of their expected number of patents, if they were to apply for patents.

We want to understand how investment affects overall innovation in the population of all firms, not just the expected number of patents obtained by firms who regularly apply for patents. We need to account for the non-random missingness induced by those firms that choose to keep trade secrets. We need to model the sample selection (missingness) process.

We think that a firm's propensity to apply for patents is affected by firm size (size) in addition to investment and i.firmtype; access to lawyers, for example, would depend on firm size. Whether a firm has ever applied for a patent, which we use as an indicator of participation in the patent process, is recorded in applied. Just 55% of our sample has ever applied for a patent.
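Before fitting, it can help to look at the selection indicator directly. The following is a minimal sketch, assuming the dataset described above is in memory and that patents is meaningful only when applied equals 1. That coding is an assumption, not something stated here.

```stata
* Share of firms that have ever applied for a patent (applied is 0/1)
tabulate applied

* Patent counts among applicants; zeros here are true zero counts,
* not firms whose outcome is unobserved
summarize patents if applied == 1, detail
```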

We fit our Poisson model for patents and add a selection model for whether a firm applies for patents,

. heckpoisson patents investment i.firmtype,
     select(applied = investment size i.firmtype)

Poisson regression with endogenous selection    Number of obs     =     10,000
(25 quadrature points)                                Selected    =      5,575
                                                      Nonselected =      4,425

                                                Wald chi2(2)      =     443.90
Log likelihood = -17440.44                      Prob > chi2       =     0.0000


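------------------------------------------------------------------------------
     patents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
patents      |
  investment |    .497821   .0507866     9.80   0.000      .398281     .597361
             |
    firmtype |
  IT sector  |   .5833501   .0300366    19.42   0.000     .5244795    .6422207
       _cons |  -1.855143    .208204    -8.91   0.000    -2.263216   -1.447071
-------------+----------------------------------------------------------------
applied      |
  investment |   .1369954   .0447339     3.06   0.002     .0493185    .2246723
        size |   .2774201   .0469132     5.91   0.000     .1854718    .3693683
             |
    firmtype |
  IT sector  |   .2750208   .0277032     9.93   0.000     .2207236     .329318
       _cons |  -1.660778   .2631227    -6.31   0.000    -2.176489   -1.145066
-------------+----------------------------------------------------------------
     /athrho |   1.161677   .2847896     4.08   0.000     .6034999    1.719855
    /lnsigma |  -.3029685   .0499674    -6.06   0.000    -.4009028   -.2050342
-------------+----------------------------------------------------------------
         rho |   .8215857   .0925557                      .5395353    .9378455
       sigma |   .7386224    .036907                      .6697151    .8146195
------------------------------------------------------------------------------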

Wald test of indep. eqns. (rho = 0): chi2(1) =    16.64   Prob > chi2 = 0.0000

The first part of the output reports the coefficients of the Poisson model for number of patents granted. The second reports the coefficients of the selection model. The coefficients reported in the first part of the output are applicable to 100% of the population, not just the 55% who participate in the patent process.

The footer presents a test of the correlation between the errors of the selection and outcome equations. If there were no correlation, we could fit a simple Poisson model to the 55% of firms that we observe, and those results would be equally applicable to the entire population. The null hypothesis of the test is no correlation, and it is rejected (chi2(1) = 16.64, p < 0.0001), so we did need to account for sample selection.
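The ancillary parameters in the output are estimated on transformed scales: /athrho is the inverse hyperbolic tangent of rho, and /lnsigma is the log of sigma. The sketch below uses display as a calculator, with numbers copied from the output above. It is only an arithmetic check, and it assumes that the Wald statistic is the squared z statistic for /athrho, which agrees with the reported 16.64 up to rounding.

```stata
* rho = tanh(athrho): reproduces the reported rho of about .82
display tanh(1.161677)

* sigma = exp(lnsigma): reproduces the reported sigma of about .74
display exp(-.3029685)

* Squared z statistic for /athrho: about 16.64, the reported chi2(1)
display (1.161677/.2847896)^2
```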

Results from Poisson models are often reported as incidence rate ratios. To see them, we could type

. heckpoisson, irr
(output omitted)

Had we reported these results, we would see that the IRR for IT firms is about 1.8, meaning that the expected number of patents in the IT sector is 1.8 times the expected number in the other sectors.
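An IRR is the exponentiated coefficient from the outcome equation, so the value quoted above can be checked by hand. A minimal sketch, using display as a calculator with the coefficients copied from the output above:

```stata
* IRR for IT firms: exp(.5833501), about 1.79
display exp(.5833501)

* IRR for a one-unit increase in investment: exp(.497821), about 1.65
display exp(.497821)
```

With the heckpoisson results still in memory, the same quantity should also be available from the stored coefficients, for example display exp(_b[patents:investment]).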

Perhaps more interestingly, we can use margins to estimate the expected number of patents for IT and non-IT firms over a range of R&D investment levels.

. margins firmtype, at(investment=(.5(.5)4))
(output omitted)

The output is fairly long, so we plot the results on a graph instead.

Among other things that we could read off of this graph, we see that IT firms expect to achieve one patent per year at an investment level of about 2 million. Other types of firms require just over 3 million in investment before they can expect one patent per year.
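A graph of this kind can be drawn with marginsplot, which plots the results of the most recent margins command. The following is a minimal sketch, assuming the heckpoisson results above are the active estimates and using the variable names from the fitted model (firmtype and investment):

```stata
* Expected number of patents by firm type over a grid of investment levels
margins firmtype, at(investment=(.5(.5)4))

* Profile plot of the margins results with confidence intervals
marginsplot
```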

Tell me more

Read more about Heckman selection models for count outcomes in [R] heckpoisson.


