Extended regression models for panel-data/multilevel models


  • Regression combining common complications

    • Endogenous covariates

    • Sample selection

    • Nonrandom treatment assignment

      • Exogenous based on observed variables

      • Endogenous involving unobservable variables

    • Correlation of observations within panels or groups

      • Longitudinal data

      • Random effects

  • Outcome types

    • Continuous

    • Interval-measured (interval-censored)

    • Binary

    • Ordinal

  • Endogenous covariate types

    • Continuous

    • Binary

    • Ordinal

    • Interactions with other covariates

    • Quadratic and other polynomial forms

  • Treatment effects/Causal inference

    • Binary and ordinal treatment

    • Average treatment effects (ATEs)

    • ATEs on the treated (ATETs)

    • ATEs on the untreated (ATEUs)

    • Potential-outcome means (POMs)

    • ATEs, ATETs, ATEUs, and POMs for

      • Full population

      • Subpopulations

      • Expected values for specific covariate values

  • Advanced inferences

    • Inference statistics

      • Expected means and probabilities

      • Marginal effects and contrasts

      • Average structural functions (ASFs)

      • More ...

    • Conditional analysis: specify values of all covariates

    • Population-averaged: specify values of some covariates, or no covariates, and average (margin) over the rest

    • Tests against zero, tests of equality, CIs, and more

    • Inferences and plots over groups

    • Estimate parameters in the log scale, then predict mean in the exponential scale, accounting for endogeneity

Extended Regression Models (ERMs) fit models with problems. Stata's ERMs now support panel data.

By models, we mean linear regression and interval regression for continuous outcomes, probit for binary outcomes, and ordered probit for ordered outcomes.

By problems, we mean any combination of endogenous and exogenous sample selection, endogenous covariates (unobserved confounders), and nonrandom treatment assignment.

ERMs handle yet another problem—panel data (also known as longitudinal data or two-level multilevel data).

Random effects are included in each equation by default. Random effects are correlated; you can omit specific random effects, and you can test the correlations.

Other commands in Stata can fit models with any one of the problems listed. ERMs can handle any combination of the above problems and can fit models with continuous, interval, binary, and ordinal outcomes.

Discipline disambiguation

The problems that ERMs handle go by different names in different disciplines. Here is a list of the synonyms.

  • Endogenous and exogenous sample selection

    • Trials with informative dropout

    • Outcomes missing not at random (MNAR)

    • Nonignorable nonresponse

    • Selection on unobservables

    • Heckman selection

  • Endogenous covariates (unobserved confounders)

    • Bias due to unmeasured confounding

    • Simultaneous causality in linear models

    • Measurement error

    • Causal inference

  • Nonrandom treatment assignment

    • Causal inference

    • Average causal effects (ACEs)

    • Average treatment effects (ATEs)

  • Panel data

    • Longitudinal data

    • Two-level multilevel data

ERM syntax and workflow

You should be interested in ERMs' features if you fit cross-sectional time-series models, two-level multilevel models, or panel-data models.

Say you are interested in modeling wages and have repeated observations on individuals over the years 2011–2018. You might model their wages as a linear function of age, age squared, and education. There are two traditional ways you could fit this model in Stata:

. xtreg wage c.age##c.age ed
. meglm wage c.age##c.age ed || id:

xtreg is Stata's command for handling panel data.

meglm is Stata's command for handling multilevel and hierarchical data.

Both work because panel data are a special case of multilevel data. Panel data are multilevel data with two levels.

Or you could fit the model with xteregress, Stata's ERM command for linear models with panel data:

. xteregress wage c.age##c.age ed

All will produce equivalent results, and all will incorporate individual heterogeneity, a.k.a. random effects. It does not matter which command you use if your data are as we have described them.
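To make the comparison concrete, here is a minimal do-file sketch of the three fits side by side. It assumes the panel identifier is id (as in the output shown later) and that the time variable is named year; year is a name we made up for illustration.

```stata
* Declare the panel structure. xtreg and xteregress use this declaration;
* meglm instead names the grouping variable after "||".
* "year" is an assumed name for the time variable.
xtset id year

* The same random-effects linear model, fit three ways
xtreg      wage c.age##c.age ed
meglm      wage c.age##c.age ed || id:
xteregress wage c.age##c.age ed
```

By default xtreg uses the GLS random-effects estimator, whereas the other two use maximum likelihood, so expect the estimates to agree closely rather than digit for digit. Adding xtreg's mle option requests maximum likelihood there as well.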

If, however, your data suffer from any of the problems we listed above, you need to use xteregress. We say problems with your data, but whether the problems are specific to your data or a feature of reality makes no difference when it comes to fitting the model. The sources of the problems, which can vary problem by problem, do matter for how you interpret and test the results. Another feature of ERMs is that they provide the tools you need for interpretation.

Consider panel data and the model

. xteregress wage c.age##c.age ed

To show you how easy ERMs are to use, we are going to sequentially introduce problems and solve them. By solve, we mean that we will obtain estimates that any of the above commands would have produced if only the data did not have the problems we are about to add.

We are about to play fast and loose with problems and their solutions. Forgive us. We write software. We want to show you how easy ERMs make it to implement solutions. For the problems that you find, you will certainly be more thoughtful about the solutions that you implement than we are about to be.

First problem: Endogeneity

It would certainly be reasonable to suspect that educational attainment is correlated with unobservable components not included in the data and perhaps not even conceptually observable in reality. And it would be reasonable to assume that those same unobservables might positively affect wages. For instance, having a good home background might prepare one better for the world, leading to higher educational attainment and higher wages. Anyway, whatever the cause of the problem, you decide to model each person's educational attainment as a function of their mother's and father's educational attainment, med and fed. Here is how you would do that:

. xteregress wage c.age##c.age, endogenous(ed = med fed)

What we did was move ed from the list of exogenous variables before the comma and into the endogenous() option. ed will still be included in the model for wage but as an endogenous variable. Its reported coefficient will be adjusted for the confounding or endogeneity, just as if the data did not have the problem.

First problem solved.

Second problem: Endogenous sample selection

You observe wages in your data only for those who work. What if those who do not work would have a wage that was systematically higher or lower? It could go either way. People who could earn higher wages would find working more enticing. On the other hand, people who could earn only lower wages would have a greater need to work. You decide to model whether a person works as a function of age, education, and minimum wage so that the bias, whichever way it goes, can be washed away.

. xteregress wage c.age##c.age, endogenous(ed = med fed) select(working = age ed minwage)

Second problem solved.

And this second problem was solved even though education affects the choice to work and is itself endogenous!

Third problem: Endogenous treatment assignment

Some people in the data belong to unions. Union membership may affect wages, and membership might not be random. If it is not, we have endogenous treatment assignment. Assume that. The solution is to model union membership.

. xteregress wage c.age##c.age, endogenous(ed = med fed) select(working = age ed minwage)
>      entreat(union = age i.urban i.occupation)

Third problem solved.
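Once union membership is modeled as a treatment, the natural next step is to ask for the treatment effects themselves. The following do-file sketch refits the model above and then uses the estat teffects postestimation command. The variable names are the ones used in this example, and we are assuming the model converges on your data.

```stata
* Fit the full model; "///" continues a command across lines in a do-file
xteregress wage c.age##c.age, endogenous(ed = med fed) ///
    select(working = age ed minwage)                   ///
    entreat(union = age i.urban i.occupation)

* Average treatment effect (ATE) of union membership on wages
estat teffects

* Average treatment effect on the treated (ATET), that is, among union members
estat teffects, atet
```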

We will stop now. We have only scratched the surface of what ERMs can do.

Let's see it work

We discussed four models:

. xteregress wage c.age##c.age ed

. xteregress wage c.age##c.age, endogenous(ed = med fed)

. xteregress wage c.age##c.age, endogenous(ed = med fed) select(working = age ed minwage)

. xteregress wage c.age##c.age, endogenous(ed = med fed) select(working = age ed minwage)
>      entreat(union = age i.urban i.occupation)

Here is the second one:

. xteregress wage c.age##c.age, endogenous(ed = fed med)

Extended linear regression                         Number of obs    =    5,099
Group variable: id                                 Number of groups =    1,000

                                                   Obs per group:
                                                                min =        1
                                                                avg =      5.1
                                                                max =        8

Integration method: mvaghermite                    Integration pts. =        7

                                                   Wald chi2(3)     = 14787.14
Log likelihood = -15983.191                        Prob > chi2      =   0.0000

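---------------------------------------------------------------------------------------
                      | Coefficient  Std. err.        z   P>|z|    [95% conf. interval]
----------------------+----------------------------------------------------------------
wage                  |
                  age |    .1025838    .012142     8.45   0.000    .0787859    .1263818
          c.age#c.age |   -.0005417    .000144    -3.76   0.000   -.0008241   -.0002594
                   ed |    .9842219   .0084475   116.51   0.000    .9676651    1.000779
                _cons |    3.459407   .2931703    11.80   0.000    2.884804    4.034011
----------------------+----------------------------------------------------------------
ed                    |
                  fed |    .3019521   .0025062   120.48   0.000    .2970401    .3068642
                  med |    .7991331   .0023521   339.75   0.000     .794523    .8037432
                _cons |   -.5437278   .0526922   -10.32   0.000   -.6470027    -.440453
----------------------+----------------------------------------------------------------
          var(e.wage) |    2.637018   .0582987                     2.525195    2.753793
            var(e.ed) |    .3053737   .0067357                     .2924533     .318865
----------------------+----------------------------------------------------------------
    corr(e.ed,e.wage) |    .1002845   .0157035     6.39   0.000    .0694199    .1309574
----------------------+----------------------------------------------------------------
        var(wage[id]) |    13.32898   .6229608                     12.16226    14.60763
          var(ed[id]) |    .3917931   .0204081                      .353768    .4339053
----------------------+----------------------------------------------------------------
corr(ed[id],wage[id]) |    .8821957    .010383    84.97   0.000    .8601199     .900973
---------------------------------------------------------------------------------------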

There are six sections in the above output.

  1. wage. This section shows the wage equation. Here, ed is washed of its endogeneity.

  2. ed. This section shows the ed equation used to wash ed of its endogeneity.

The remaining sections present information about the residuals and random effects. So let's tell you how to read the encoded names.

var(...) means variance of ...
corr(..., ...) means correlation of ... and ...

e.wage is the residual on the wage equation
e.ed is the residual on the ed equation

wage[id] is the random effect in the wage equation
ed[id] is the random effect in the ed equation

e.wage and e.ed are the overall residuals. They vary over time and by person.

The random effects wage[id] and ed[id] are residuals too, but they are a different kind of residual. They vary across persons, but are constant within persons. That is why they include the subscript [id]. id is the person identification number.
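We said earlier that specific random effects can be omitted. As a hedged sketch, we believe this is done with the nore suboption inside the equation's option. Treat that suboption name as our assumption and confirm it with help xteregress before relying on it.

```stata
* Assumption: nore drops the random effect from the auxiliary equation.
* Here the ed equation would have no ed[id]; wage[id] stays in the model.
xteregress wage c.age##c.age, endogenous(ed = fed med, nore)
```

With ed[id] omitted, there would be no var(ed[id]) or corr(ed[id],wage[id]) to report, so only the residual correlation corr(e.ed,e.wage) would remain as evidence of endogeneity.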

We wondered whether education was endogenous. Were we right to worry? corr(e.ed,e.wage) and corr(ed[id],wage[id]) can answer that question. Correlation in either is endogeneity but of different kinds.

The striking correlation is 0.88 for the random effects. It is huge. It is whoppingly significant: the estimate lies about 85 standard errors away from zero. And it means that unobservables that increase wages also increase educational attainment.
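The figure of about 85 is simply the z statistic, the estimate divided by its standard error. You can check it by hand with display, using the numbers copied from the output above:

```stata
* z for corr(ed[id],wage[id]): estimate / standard error (about 85)
display .8821957 / .010383

* Same check for corr(e.ed,e.wage), the residual correlation (about 6.4)
display .1002845 / .0157035
```

Both ratios match the z column of the output up to rounding, so both correlations are statistically distinguishable from zero, although the random-effects correlation is far larger in magnitude.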

Tell me more

Learn more about Stata's ERMs for panel data and ERMs features in general.

Read more about ERMs in the Extended Regression Models Reference Manual.