The Stata News, Vol 40 No 5

In the spotlight: Machine learning unmasks group where interventions fail

Curious about how treatment effects vary across observations, subpopulations, or covariates? Don't want to rely on parametric assumptions to perform causal inference?

Stata’s new cate command for conditional average treatment-effects (CATE) analysis allows you to estimate heterogeneous treatment effects without assuming a functional form for the treatment or outcome models. This means you don't need to worry about biasing your treatment-effect estimates by misspecifying the auxiliary models.

In this spotlight article, we illustrate how to estimate the impact of a job-training program on months of employment, distinguishing its effect on Swiss citizens and noncitizens. We start with a fully parametric analysis and then exploit cate's machine learning features to uncover more nuanced relationships in our data.

Exploring treatment-effect heterogeneity

Our analysis is inspired by Knaus (2022) and uses an excerpt from the SWISSUbase database, which contains information on 54,152 economically active individuals in the German-speaking cantons of Switzerland. We are interested in measuring the impact of a job-training program introduced in 2003 on participants’ total months of employment over a three-year survey period. We are particularly interested in the program’s effect on immigrant workers.

We start by loading and describing the dataset.

. use swissemploy, clear
(SWISSUbase extract, 2003)

. describe

Contains data from swissemploy.dta
 Observations:        54,152                  SWISSUbase extract, 2003
    Variables:            11                  29 Oct 2025 13:05
--------------------------------------------------------------------------------------------------
Variable         Storage  Display  Value
    name            type   format  label     Variable label
--------------------------------------------------------------------------------------------------
canton_moth_t~e  byte     %8.0g              Noncitizen has mother tongue in canton's language
city             byte     %8.0g              City category
cw_cooperative   byte     %8.0g              Caseworker cooperative
cw_own_ue        byte     %8.0g              Caseworker has own unemployment experience
employability    byte     %11.0g   EMPLOY    Employability assessed by the caseworker
female           byte     %8.0g    FEMALE    Job seeker is female
married          byte     %9.0g    MARRIED   Job seeker is married
past_income      float    %9.0g              Past income in CHF (insured income)
swiss            byte     %13.0g   SWISS     Job seeker is Swiss citizen
employed_mo      float    %9.0g              Months of employment in the following 36 months
training         float    %11.0g   TREAT     Job skills training
--------------------------------------------------------------------------------------------------
Sorted by:

As a first step, we generate a bar plot showing the average total months of employment by training status and citizenship.

. graph bar (mean) employed_mo, over(training) over(swiss) ytitle("Average months of employment") 
     title("Job-training program impacts employment")
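[Figure: bar chart of average months of employment by training status, grouped by citizenship; title "Job-training program impacts employment"]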

This graph suggests that the job-training program affects subpopulations differently. For Swiss citizens, average total months of employment are higher among those who participated in the job-training program than among those who did not. For noncitizens, we see the opposite pattern.
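The cell means behind the bar chart can also be inspected numerically. The following is a minimal sketch rather than part of the original analysis: it assumes that swissemploy.dta is in the working directory and that training and swiss are coded 0/1, as the output elsewhere in this article suggests. The resulting differences are purely descriptive and carry no causal interpretation.

```stata
* Sketch only: assumes swissemploy.dta from the article is available
use swissemploy, clear

* Mean months of employment by citizenship (rows) and training (columns)
tabulate swiss training, summarize(employed_mo) means

* Raw, unadjusted gaps (assumes 0/1 coding of training and swiss):
* the coefficient on 1.training is the gap for noncitizens, and the
* interaction term is how much that gap differs for Swiss citizens
regress employed_mo i.training##i.swiss, vce(robust)
```

Because caseworkers assign job seekers to the program, these raw gaps mix the program's effect with selection into treatment, which is why the analysis that follows adjusts for covariates.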

CATE, parametrically

To estimate the training program’s causal effect on employment and explore potential heterogeneity in treatment effects, we use cate to estimate the overall average treatment effect (ATE), the individualized average treatment effect (IATE), and the group average treatment effect (GATE).

We model our outcome variable, total months of employment (employed_mo), as a function of citizenship, past income, gender, marital status, employability, city size, and native language. Because caseworkers determine whether a job seeker is assigned to the program, we also include caseworker traits, such as unemployment experience and cooperativeness, as controls. We store all of these variables in the global macro covariates,

. global covariates i.swiss past_income i.female i.married i.employability i.city i.canton_moth_tongue 
     i.cw_cooperative i.cw_own_ue

We also include these variables in our model for the binary treatment variable, participation in the job-training program (training), and in our model for the CATE function.

To do this with the cate command, we include the covariates macro in the first set of parentheses, following the dependent variable. CATE estimation requires that all the covariates in the CATE function are also included in the treatment and outcome models, so cate automatically incorporates them into both. We specify aipw to use the augmented inverse-probability weighted (AIPW) estimator. To construct a fully parametric specification, we specify omethod(regress) to fit the outcome model using linear regression, tmethod(logit) to fit the treatment model using logistic regression, and cmethod(regress) to fit the CATE model using linear regression. The rseed() option ensures reproducibility in the cross-fitting process.

. cate aipw (employed_mo $covariates) (training), rseed(1234) nolog 
     omethod(regress) tmethod(logit) cmethod(regress)

Conditional average treatment effects     Number of observations       = 54,152
Estimator:       Augmented IPW            Number of folds in cross-fit =     10
Outcome model:   Linear regression        Number of outcome controls   =     19
Treatment model: Logit                    Number of treatment controls =     19
CATE model:      Linear regression        Number of CATE variables     =     19

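------------------------------------------------------------------------------
             |               Robust
 employed_mo | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
ATE          |
    training |
   (Treated  |
         vs  |
Not treated) |   1.082079   .5020138     2.16   0.031     .0981495    2.066008
-------------+----------------------------------------------------------------
POmean       |
    training |
Not treated  |   17.00564   .0586647   289.88   0.000     16.89066    17.12062
------------------------------------------------------------------------------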

The results suggest that if everyone in the population participated in the training program, the average employment duration would be about one month longer than if no one participated. If no one participated in the training program, the average employment duration would be around 17 months.
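As a rough cross-check on the parametric ATE, one could fit a classical AIPW estimator with teffects aipw, which by default uses a linear outcome model and a logit treatment model. This is a sketch under assumptions, not part of the original analysis: teffects aipw does not cross-fit and has no separate CATE model, so its estimates need not match the cate output exactly.

```stata
* Sketch only: assumes swissemploy.dta from the article is available
use swissemploy, clear

* Same covariate list as in the article
global covariates i.swiss past_income i.female i.married i.employability ///
    i.city i.canton_moth_tongue i.cw_cooperative i.cw_own_ue

* Classical AIPW: linear outcome model, logit treatment model, no cross-fitting
teffects aipw (employed_mo $covariates) (training $covariates), ate
```

If the two estimators gave very different ATEs, that would be a reason to look more closely at the cross-fitting setup or at overlap in the estimated propensity scores.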

In addition to the ATE, we can estimate and plot the distribution of IATE predictions to display heterogeneity in treatment effects across individuals. The resulting histogram shows two peaks, around 0 and 2 months, with a range from about –4 to 6.

. categraph histogram, title("Parametric IATE predictions") color(stred%60) 
     saving(catehist_para, replace)
[Figure: histogram of the parametric IATE predictions; title "Parametric IATE predictions"]

Next we estimate the GATEs by citizenship. The GATE summarizes the IATE within each subgroup, allowing us to distinguish the treatment effects across subpopulations.

. cate, group(swiss) reestimate nolog

Conditional average treatment effects     Number of observations       = 54,152
Estimator:       Augmented IPW            Number of folds in cross-fit =     10
Outcome model:   Linear regression        Number of outcome controls   =     19
Treatment model: Logit                    Number of treatment controls =     19
CATE model:      Linear regression        Number of CATE variables     =     19

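-------------------------------------------------------------------------------
              |               Robust
  employed_mo | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
GATE          |
        swiss |
   Noncitizen |  -.1479488   .4473204    -0.33   0.741    -1.024681     .728783
Swiss citizen |   1.799268   .7506771     2.40   0.017     .3279679    3.270568
--------------+----------------------------------------------------------------
ATE           |
     training |
    (Treated  |
          vs  |
 Not treated) |   1.082079   .5020138     2.16   0.031     .0981495    2.066008
--------------+----------------------------------------------------------------
POmean        |
     training |
 Not treated  |   17.00564   .0586647   289.88   0.000     16.89066    17.12062
-------------------------------------------------------------------------------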

The GATE for Swiss citizens is approximately 1.8 months, whereas the GATE for noncitizens is close to 0 and not statistically distinguishable from it (p = 0.741). The postestimation command estat gatetest provides evidence, at the 5% level, that the treatment effects differ across groups.

. estat gatetest

Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous

 ( 1)  [GATE]0bn.swiss - [GATE]1.swiss = 0

    chi2(1) =   4.97
Prob > chi2 = 0.0259

Beyond estimating the GATE, we can estimate how treatment effects change over a continuous variable and graph the results using the postestimation command estat series.

. estat series past_income, graph(cateopts(lcolor(stred) mcolor(stred%30)) 
     ciopts(fcolor(stred%10)))

Computing approximating function

Minimizing cross-validation criterion

Iteration 0:  Cross-validation criterion =  13648.76

Computing average derivatives

Nonparametric series regression for IATE
Cubic B-spline estimation                  Number of obs      =         54,152
Criterion: cross-validation                Number of knots    =              1

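------------------------------------------------------------------------------
             |               Robust
             |     Effect   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
 past_income |   4.96e-06   .0000373     0.13   0.894    -.0000681     .000078
------------------------------------------------------------------------------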
Note: Effect estimates are averages of derivatives.
[Figure: estimated treatment effect as a function of past_income, with confidence band]

The graph indicates that past income affects the treatment effect only if it surpasses about 40,000 Swiss francs. Between 40,000 and 60,000 Swiss francs, treatment effects increase with past income before declining sharply.

So far, we have assumed that we know the functional forms of the outcome, treatment, and CATE models. If the parametric assumptions we made are incorrect, our causal estimates may be subject to misspecification bias.

CATE, unchained

To guard our estimates against misspecification bias, we fit a fully nonparametric model, keeping the same AIPW estimator and covariates, but using a random forest for the outcome, treatment, and CATE models. Because we have so few covariates, we limit the mean number of variables split at each node to 3. We also use out-of-bag prediction, which is generally faster than the default cross-fitting method.

. cate aipw (employed_mo $covariates) (training), rseed(1234) nolog omethod(rforest, 
     splitmeanvars(3)) tmethod(rforest, splitmeanvars(3)) cmethod(rforest, 
     splitmeanvars(3)) oob

Conditional average treatment effects     Number of observations       = 54,152
Estimator:       Augmented IPW            Number of folds in cross-fit =      1
Outcome model:   Random forest            Number of outcome controls   =     19
Treatment model: Random forest            Number of treatment controls =     19
CATE model:      Random forest            Number of CATE variables     =     19

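------------------------------------------------------------------------------
             |               Robust
 employed_mo | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
ATE          |
    training |
   (Treated  |
         vs  |
Not treated) |   .3607137   .2934781     1.23   0.219    -.2144928    .9359202
-------------+----------------------------------------------------------------
POmean       |
    training |
Not treated  |   17.02672   .0586521   290.30   0.000     16.91176    17.14167
------------------------------------------------------------------------------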

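Interestingly, the ATE is roughly 0.7 months lower than in the parametric model (0.36 versus 1.08 months) and is no longer statistically distinguishable from 0 (p = 0.219). The average potential outcome without the program is essentially unchanged at about 17 months.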

To study treatment-effect heterogeneity, we then plot the distribution of the IATE and compare it with the parametric IATE distribution. The nonparametric IATE distribution is more concentrated, with a large peak around 1 and a range from about –4 to 4. The reduced spread suggests less heterogeneity in the estimated treatment effects.

. categraph histogram, title("Nonparametric IATE predictions") saving(catehist_nonpara, replace) nodraw

. graph combine catehist_nonpara catehist_para, ycommon xcommon
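[Figure: combined histograms, "Nonparametric IATE predictions" and "Parametric IATE predictions", on common axes]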

Next we estimate the GATEs by citizenship.

. cate, group(swiss) reestimate nolog

Conditional average treatment effects    Number of observations       = 54,152
Estimator:       Augmented IPW           Number of folds in cross-fit =      1
Outcome model:   Random forest           Number of outcome controls   =     19
Treatment model: Random forest           Number of treatment controls =     19
CATE model:      Random forest           Number of CATE variables     =     19

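-------------------------------------------------------------------------------
              |               Robust
  employed_mo | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
GATE          |
        swiss |
   Noncitizen |  -1.503144   .4681074    -3.21   0.001    -2.420618   -.5856706
Swiss citizen |   1.447469   .3758458     3.85   0.000     .7108249    2.184114
--------------+----------------------------------------------------------------
ATE           |
     training |
    (Treated  |
          vs  |
 Not treated) |   .3607137   .2934781     1.23   0.219    -.2144928    .9359202
--------------+----------------------------------------------------------------
POmean        |
     training |
 Not treated  |   17.02672   .0586521   290.30   0.000     16.91176    17.14167
-------------------------------------------------------------------------------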

While the parametric model failed to show that the job-training program had any effect on noncitizens, the machine learning approach suggests that the program may actually have harmed this group, reducing employment by about 1.5 months. The effects for the two groups are in opposite directions, which explains why the overall ATE reported by the previous cate command is close to 0. The estat gatetest command provides further evidence of differences in treatment effects between citizens and noncitizens.

. estat gatetest

Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous

 ( 1)  [GATE]0bn.swiss - [GATE]1.swiss = 0

    chi2(1) =  24.16
Prob > chi2 = 0.0000
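The point that opposite-signed group effects pull the overall ATE toward 0 can be illustrated with the reported numbers. The sketch below assumes that the ATE equals the sample-share-weighted average of the two GATEs, ATE = s*GATE_swiss + (1 - s)*GATE_noncitizen, and solves for the implied share of Swiss citizens, s. This relationship is an assumption used for illustration, not something the output states.

```stata
* Implied share of Swiss citizens, s, from ATE = s*G_swiss + (1 - s)*G_non
* Parametric fit
scalar s_para = (1.082079 - (-.1479488)) / (1.799268 - (-.1479488))
* Random-forest fit
scalar s_rf   = (.3607137 - (-1.503144)) / (1.447469 - (-1.503144))

display "Implied Swiss share, parametric:    " %5.3f s_para
display "Implied Swiss share, random forest: " %5.3f s_rf
```

Both fits imply a Swiss share of roughly 0.63, so the drop in the ATE between the two models comes from the shift in the group effects themselves, chiefly the negative effect now estimated for noncitizens, rather than from any change in group composition.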

Final words

Stata’s new cate command lets you explore treatment-effect heterogeneity without assuming any fixed functional form in your model. In this example, we saw how making these assumptions led to different conclusions regarding the impact of a job-training program on the noncitizen group. Using machine learning algorithms, cate can model complex dependencies in your data, guarding your causal estimates from potential misspecification bias. To learn more about cate, see [CAUSAL] cate.

You can also perform CATE analysis with machine learning using h2oml.

References

Lechner, M., M. Knaus, M. Huber, M. Frölich, S. Behncke, G. Mellace, and A. Strittmatter. Evaluations of Swiss Active Labor Market Policies. SWISSUbase. https://www.swissubase.ch/en/catalogue/studies/13867/16652/overview.

Knaus, M. C. 2022. Double machine learning–based programme evaluation under unconfoundedness. Econometrics Journal 25: 602–627. https://arxiv.org/pdf/2003.03191.

— Lingyi Li
Staff Econometrician

— Eduardo Garcia Echeverri
Senior Econometrician
