We often cannot observe or measure an outcome over its full range. Tests for detecting a toxin often require the toxin to exceed a threshold before it can be detected—left-censoring. Patients’ weights will be censored at the upper limit of the scale used to weigh them—right-censoring.
Related to left- and right-censoring are interval measurements, or interval censoring. Income can be surveyed in ranges ($0 to $10,000, $10,001 to $30,000, $30,001 to $60,000, $60,001 and up), or patient weight can be recorded in ranges (0–80 pounds, 81–120 pounds, 121–150 pounds, 151–180 pounds, 181–220 pounds, 221–250 pounds, over 250 pounds).
Stata has long been able to estimate regression models with censored outcomes. tobit can estimate models with left- or right-censoring at fixed values. intreg can estimate models with interval measurements or censoring that varies across observations.
You can estimate models with censored or interval-measured Gaussian outcomes that also include Heckman-style selection, endogenous treatments to obtain average treatment effects (ATEs), covariate measurement error, and unobserved components. You can include endogenous regressors in any part of the models. You can also estimate these models in a panel-data or multilevel-data context with random effects (intercepts) and random coefficients in any part or all parts of the model. All of these models can be estimated as parts of larger multivariate systems. Censored or interval-measured outcomes can even participate in endogenous switching models.
Imagine we have data on incomes. These data are often top coded, or censored at an upper limit, to increase reporting rates. If that limit were $150,000, we could estimate a regression model of income on education and age by typing
. tobit income education age, ul(150000)
(We might prefer log income, but for simplicity, we will use income here.)
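If income were instead recorded in brackets, like the survey ranges described earlier, the same regression could be sketched with intreg. The variable names income_lo and income_hi are hypothetical. They are assumed to hold the lower and upper bounds of each respondent's bracket, with income_hi missing for the open-ended top bracket, which intreg treats as right-censored:
. intreg income_lo income_hi education age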
All the features listed above are available through Stata’s generalized structural equation modeling command, gsem. The gsem equivalent of the tobit command above is
. gsem income <- education age, family(gaussian, rcensored(150000))
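Left-censoring works the same way. As a sketch, suppose a hypothetical variable toxin records a concentration that the test cannot detect below 0.5, with undetected samples recorded at 0.5, and that dose is a hypothetical covariate. The limit moves from ul() to ll() in tobit and from rcensored() to lcensored() in gsem:
. tobit toxin dose age, ll(0.5)
. gsem toxin <- dose age, family(gaussian, lcensored(0.5))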
We can introduce an endogenous covariate, say, weeks worked, by adding an equation for weeks with instruments (z1 and z2) and a common unobserved component (UC). Identifying constraints are specified using @. Here UC@1 fixes the coefficient on UC in the weeks equation at 1, and var(UC@1) fixes the variance of UC at 1:
. gsem (income <- education age weeks UC, family(gaussian, rcensored(150000))) (weeks <- education age z1 z2 UC@1), var(UC@1)
If we have panel data with repeated measurements on individuals (id), we can introduce a random effect (intercept) into the income model by adding RE[id]:
. gsem (income <- education age weeks UC RE[id], family(gaussian, rcensored(150000))) (weeks <- education age z1 z2 UC@1), var(UC@1)
We can even add a random coefficient on age by interacting a random latent variable (RC[id]) with age:
. gsem (income <- education age c.age#RC[id] weeks UC RE[id], family(gaussian, rcensored(150000))) (weeks <- education age z1 z2 UC@1), var(UC@1)
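For comparison, dropping the endogenous covariate and the random coefficient leaves a random-intercept censored regression. A minimal sketch, assuming the same hypothetical variables, is
. gsem (income <- education age RE[id], family(gaussian, rcensored(150000)))
This is the kind of model that xttobit also fits:
. xtset id
. xttobit income education age, ul(150000)
The two commands should give similar estimates, but they need not agree exactly, because their default settings for numerically integrating over the random effect may differ.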
Handling Heckman-style selection in the gsem framework requires a bit of setup. An example with an uncensored outcome appears in the Structural Equation Modeling Reference Manual. For censored outcomes, you merely need to add the suboption lcensored() or rcensored() to the family() option. For interval-measured data, add the suboption ldepvar() or udepvar() to specify the lower or upper bound of the interval; the dependent variable then supplies the other bound.
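As a rough sketch of that setup for a top-coded outcome, assume income is missing for people who do not work and that married and children are hypothetical variables that affect selection only. The pattern below is modeled on the manual's Heckman example and should be checked against it. Selection is written as an interval-measured Gaussian equation that shares a latent variable L with the outcome equation. The names selected, notselected, and L and the constraint label a are illustrative:
. generate selected = 0 if income < .
. generate notselected = 0 if income >= .
. gsem (income <- education age L, family(gaussian, rcensored(150000))) (selected <- married children education age L@1, family(gaussian, udepvar(notselected))), var(L@1 e.income@a e.selected@a)
The fitted coefficients are on the scale of this parameterization rather than the scale that heckman reports; see the manual for the transformation.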
An endogenous treatment-effects example without censoring also appears in the Structural Equation Modeling Reference Manual. Again, just add lcensored() or rcensored() to family() if the outcome is censored. For interval-measured data, add the suboption ldepvar() or udepvar() to specify the lower or upper bound of the interval; the dependent variable supplies the other bound.
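The interval-measured suboptions can be tried on their own before they are combined with selection or treatment equations. As a sketch, assume hypothetical variables income_lo and income_hi that hold the lower and upper bounds of each respondent's income bracket. An interval-measured version of the income regression might then be written as
. gsem income_lo <- education age, family(gaussian, udepvar(income_hi))
Here the dependent variable income_lo supplies the lower bound, and udepvar() supplies the upper bound.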
You can use either the commands shown above or Stata’s SEM Builder to create and fit these models.
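Between tobit, intreg, and gsem, Stata covers left-censored, right-censored, and interval-measured outcomes in models ranging from a single regression to multilevel and multivariate systems.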
See Stata’s Structural Equation Modeling Reference Manual.