In the spotlight: Machine learning unmasks group where interventions fail

Curious about how treatment effects vary across observations, subpopulations, or covariates? Don't want to rely on parametric assumptions to perform causal inference?

Stata’s new cate command for conditional average treatment-effects (CATE) analysis allows you to estimate heterogeneous treatment effects without assuming a functional form for the treatment or outcome models. This means you don't need to worry about biasing your treatment-effect estimates by misspecifying the auxiliary models.

In this spotlight article, we illustrate how to estimate the impact of a job-training program on months of employment, distinguishing its effect on Swiss citizens and noncitizens. We start with a fully parametric analysis and then exploit cate's machine learning features to uncover more nuanced relationships in our data.

Exploring treatment-effect heterogeneity

Our analysis is inspired by Knaus (2022) and uses an excerpt from the SWISSUbase database, which contains information on 54,152 economically active individuals in the German-speaking cantons of Switzerland. We are interested in measuring the impact of a job-training program introduced in 2003 on participants’ total months of employment over a three-year survey period. We are particularly interested in the program’s effect on immigrant workers.

We start by loading and describing the dataset.

. use swissemploy, clear
(SWISSUbase extract, 2003)

. describe

Contains data from swissemploy.dta
 Observations:        54,152                  SWISSUbase extract, 2003
    Variables:            11                  29 Oct 2025 13:05


Variable      Storage   Display    Value                    
    name         type    format    label      Variable label

canton_moth_t~e byte    %8.0g                 Noncitizen has mother tongue in
                                                canton's language
city            byte    %8.0g                 City category
cw_cooperative  byte    %8.0g                 Caseworker cooperative
cw_own_ue       byte    %8.0g                 Caseworker has own unemployment
                                                experience
employability   byte    %11.0g     EMPLOY     Employability assessed by the
                                                caseworker
female          byte    %8.0g      FEMALE     Job seeker is female
married         byte    %9.0g      MARRIED    Job seeker is married
past_income     float   %9.0g                 Past income in CHF (insured income)
swiss           byte    %13.0g     SWISS      Job seeker is Swiss citizen
employed_mo     float   %9.0g                 Months of employment in the following
                                              36 months
training        float   %11.0g     TREAT      Job skills training

Sorted by:

As a first step, we generate a bar plot showing the average total months of employment by citizenship and training status.

. graph bar (mean) employed_mo, over(training) over(swiss) ytitle("Average months of employment") 
     title("Job-training program impacts employment")

This graph suggests that the job-training program affects subpopulations differently. For Swiss citizens, the average total number of months of employment is higher among those who participated in the job-training program than among those who did not. For noncitizens, we see the opposite pattern.
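The cell means behind the bars can also be listed as numbers. The following is a minimal sketch, assuming swissemploy.dta is still in memory; it uses Stata's standard tabulate command with the summarize() option and is not part of the original analysis.

```stata
* Assumes swissemploy.dta is already in memory, as loaded above
* Mean months of employment in each citizenship-by-training cell
* (nostandard and nofreq suppress standard deviations and cell counts)
tabulate swiss training, summarize(employed_mo) nostandard nofreq
```

These are raw differences in means, so they describe the data but do not adjust for how caseworkers assigned job seekers to the program.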

CATE, parametrically

To estimate the training program’s causal effect on employment and explore potential heterogeneity in treatment effects, we use cate to estimate the overall average treatment effect (ATE), the individualized average treatment effect (IATE), and the group average treatment effect (GATE).
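The three estimands can be written in potential-outcome notation. The symbols below are our own shorthand and do not appear in the cate output: Y(1) and Y(0) are months of employment with and without training, X is the covariate vector, and G is a grouping variable such as citizenship.

```latex
\begin{align*}
\text{IATE:}\quad \tau(x) &= E\{\,Y(1) - Y(0) \mid X = x\,\} \\
\text{GATE:}\quad \tau_g  &= E\{\,\tau(X) \mid G = g\,\} \\
\text{ATE:}\quad  \tau    &= E\{\,\tau(X)\,\} = \sum_g \Pr(G = g)\,\tau_g
\end{align*}
```

The last equality is the law of total expectation: the ATE is a group-share-weighted average of the GATEs, so effects of opposite sign in different groups can partly cancel in the ATE.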

We model our outcome variable, total months of employment (employed_mo), as a function of citizenship, past income, gender, marital status, employability, city size, and native language. Because caseworkers determine whether a job seeker is assigned to the program, we also include caseworker traits, such as unemployment experience and cooperativeness, as controls. We store all of these variables in the global macro covariates.

. global covariates i.swiss past_income i.female i.married i.employability i.city i.canton_moth_tongue 
     i.cw_cooperative i.cw_own_ue

We also include these variables in our model for the binary treatment variable, participation in the job-training program (training), and in our model for the CATE function.

To do this with the cate command, we include the covariates macro in the first set of parentheses, following the dependent variable. CATE estimation requires that all the covariates in the CATE function are also included in the treatment and outcome models, so cate automatically incorporates them into both. We specify aipw to use the augmented inverse-probability weighted (AIPW) estimator. To construct a fully parametric specification, we specify omethod(regress) to fit the outcome model using linear regression, tmethod(logit) to fit the treatment model using logistic regression, and cmethod(regress) to fit the CATE model using linear regression. The rseed() option ensures reproducibility in the cross-fitting process.

. cate aipw (employed_mo $covariates) (training), rseed(1234) nolog 
     omethod(regress) tmethod(logit) cmethod(regress)

Conditional average treatment effects     Number of observations       = 54,152
Estimator:       Augmented IPW            Number of folds in cross-fit =     10
Outcome model:   Linear regression        Number of outcome controls   =     19
Treatment model: Logit                    Number of treatment controls =     19
CATE model:      Linear regression        Number of CATE variables     =     19



                              Robust                                           
  employed_mo   Coefficient  std. err.      z    P>|z|     [95% conf. interval]
   
ATE                                                                            
     training                                                                  
    (Treated                                                                   
          vs                                                                   
Not treated)      1.082079   .5020138     2.16   0.031     .0981495    2.066008

POmean                                                                         
     training                                                                  
 Not treated      17.00564   .0586647   289.88   0.000     16.89066    17.12062

The results suggest that if everyone in the population participated in the training program, the average employment duration would be about one month longer than if no one participated. If no one participated in the training program, the average employment duration would be around 17 months.
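To make the reading of the table concrete, the sketch below reproduces the reported z statistic, p-value, and confidence interval from the coefficient and robust standard error, and adds the ATE to the untreated potential-outcome mean. The numbers are copied from the output above; the scalar names are hypothetical, and the block assumes the usual normal-based Wald interval rather than anything specific to cate.

```stata
* Point estimates and robust standard error copied from the output above
scalar ate    = 1.082079
scalar se_ate = .5020138
scalar po0    = 17.00564

* Implied average months of employment if everyone were treated
scalar po1 = po0 + ate
display "Implied treated POmean: " %5.2f po1

* z statistic, two-sided p-value, and 95% confidence interval for the ATE
scalar z    = ate / se_ate
scalar pval = 2 * normal(-abs(z))
scalar lb   = ate - invnormal(0.975) * se_ate
scalar ub   = ate + invnormal(0.975) * se_ate
display "z = " %4.2f z "   p = " %5.3f pval
display "95% CI: [" %8.6f lb ", " %8.6f ub "]"
```

The implied treated mean is about 18.09 months. The scalar names were chosen so that none is an abbreviation of a variable in the dataset, because Stata resolves variable names before scalar names in expressions.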

In addition to the ATE, we can estimate and plot the distribution of IATE predictions to display heterogeneity in treatment effects across individuals. The histogram shows two peaks around 0 and 2 months, with a range from about –4 to 6.

. categraph histogram, title("Parametric IATE predictions") color(stred%60) 
     saving(catehist_para, replace)

Next we estimate the GATEs by citizenship. The GATE summarizes the IATE within each subgroup, allowing us to distinguish the treatment effects across subpopulations.

. cate, group(swiss) reestimate nolog

Conditional average treatment effects     Number of observations       = 54,152
Estimator:       Augmented IPW            Number of folds in cross-fit =     10
Outcome model:   Linear regression        Number of outcome controls   =     19
Treatment model: Logit                    Number of treatment controls =     19
CATE model:      Linear regression        Number of CATE variables     =     19



                               Robust                                           
   employed_mo   Coefficient  std. err.      z    P>|z|     [95% conf. interval]

GATE            
         swiss  
   Noncitizen     -.1479488   .4473204    -0.33   0.741    -1.024681     .728783
Swiss citizen      1.799268   .7506771     2.40   0.017     .3279679    3.270568

ATE             
      training  
     (Treated   
           vs   
 Not treated)      1.082079   .5020138     2.16   0.031     .0981495    2.066008

POmean          
      training  
  Not treated      17.00564   .0586647   289.88   0.000     16.89066    17.12062

The GATE for Swiss citizens is approximately 1.8 months, but the GATE for noncitizens is close to 0. The postestimation command estat gatetest provides evidence that the treatment effects differ across groups.

. estat gatetest

Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous

 ( 1)  [GATE]0bn.swiss - [GATE]1.swiss = 0

    chi2(1) =   4.97
Prob > chi2 = 0.0259
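The test statistic can be approximated by hand from the GATE table. The sketch below assumes that the two group estimates have zero covariance, which is plausible because the groups are disjoint but is our assumption; it is a back-of-the-envelope check, not a description of how estat gatetest computes the test. The scalar names are hypothetical.

```stata
* GATE estimates and robust standard errors copied from the table above
scalar b_non    = -.1479488
scalar se_non   =  .4473204
scalar b_swiss  = 1.799268
scalar se_swiss =  .7506771

* Wald statistic for H0: equal GATEs,
* ASSUMING zero covariance between the two group estimates
scalar W = (b_swiss - b_non)^2 / (se_non^2 + se_swiss^2)
display "chi2(1)     = " %5.2f W
display "Prob > chi2 = " %6.4f chi2tail(1, W)
```

Under that assumption, the statistic rounds to 4.97, in line with the reported test.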

Beyond estimating the GATE, we can estimate how treatment effects change over a continuous variable and graph the results using the postestimation command estat series.

. estat series past_income, graph(cateopts(lcolor(stred) mcolor(stred%30)) 
     ciopts(fcolor(stred%10)))

Computing approximating function

Minimizing cross-validation criterion

Iteration 0:  Cross-validation criterion =  13648.76

Computing average derivatives

Nonparametric series regression for IATE
Cubic B-spline estimation                  Number of obs      =         54,152
Criterion: cross-validation                Number of knots    =              1



                             Robust                                           
                   Effect   std. err.      z    P>|z|     [95% conf. interval]

 past_income     4.96e-06   .0000373     0.13   0.894    -.0000681     .000078

Note: Effect estimates are averages of derivatives.

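The average derivative reported in the table is close to zero and not statistically significant (p = 0.894), but an average can mask a nonlinear relationship. The graph indicates that past income affects the treatment effect only if it surpasses about 40,000 Swiss francs. Between 40,000 and 60,000 Swiss francs, treatment effects increase with past income before declining sharply.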

So far, we have assumed that we know the functional forms of the outcome, treatment, and CATE models: linear regression for the outcome and CATE models and logit for the treatment model. If these parametric assumptions are incorrect, our causal estimates may be subject to misspecification bias.

CATE, unchained

To guard our estimates against misspecification bias, we fit a fully nonparametric model, keeping the same AIPW estimator and covariates, but using a random forest for the outcome, treatment, and CATE models. Because we have so few covariates, we limit the mean number of variables split at each node to 3. We also use out-of-bag prediction, which is generally faster than the default cross-fitting method.

. cate aipw (employed_mo $covariates) (training), rseed(1234) nolog omethod(rforest, 
     splitmeanvars(3)) tmethod(rforest, splitmeanvars(3)) cmethod(rforest, 
     splitmeanvars(3)) oob

Conditional average treatment effects     Number of observations       = 54,152
Estimator:       Augmented IPW            Number of folds in cross-fit =      1
Outcome model:   Random forest            Number of outcome controls   =     19
Treatment model: Random forest            Number of treatment controls =     19
CATE model:      Random forest            Number of CATE variables     =     19



                              Robust                                           
  employed_mo   Coefficient  std. err.      z    P>|z|     [95% conf. interval]

ATE                                                                            
     training                                                                  
    (Treated                                                                   
          vs                                                                   
Not treated)      .3607137   .2934781     1.23   0.219    -.2144928    .9359202

POmean                                                                         
     training                                                                  
 Not treated      17.02672   .0586521   290.30   0.000     16.91176    17.14167

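Interestingly, the ATE is roughly 0.7 months lower than in the parametric model (0.36 versus 1.08) and is no longer statistically different from 0 at the 5% level. The average potential outcome without the program is essentially unchanged at about 17 months.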

To study treatment-effect heterogeneity, we then plot the distribution of the IATE and compare it with the parametric IATE distribution. The nonparametric IATE distribution is more concentrated, with a large peak around 1 and a range from about –4 to 4. The reduced spread suggests less heterogeneity in the estimated treatment effects.

. categraph histogram, title("Nonparametric IATE predictions") saving(catehist_nonpara, replace) nodraw

. graph combine catehist_nonpara catehist_para, ycommon xcommon

Next we estimate the GATEs by citizenship.

. cate, group(swiss) reestimate nolog

Conditional average treatment effects    Number of observations       = 54,152
Estimator:       Augmented IPW           Number of folds in cross-fit =      1
Outcome model:   Random forest           Number of outcome controls   =     19
Treatment model: Random forest           Number of treatment controls =     19
CATE model:      Random forest           Number of CATE variables     =     19



                              Robust                                           
  employed_mo   Coefficient  std. err.      z    P>|z|     [95% conf. interval]
   
GATE                                                                           
        swiss                                                                  
   Noncitizen    -1.503144   .4681074    -3.21   0.001    -2.420618   -.5856706
Swiss citizen     1.447469   .3758458     3.85   0.000     .7108249    2.184114

ATE            
     training  
    (Treated   
          vs   
Not treated)      .3607137   .2934781     1.23   0.219    -.2144928    .9359202

POmean         
     training  
 Not treated      17.02672   .0586521   290.30   0.000     16.91176    17.14167

While the parametric model failed to show that the job-training program had any effect on noncitizens, the machine learning approach reveals that the program might have actually harmed this group. Because the effects for the two groups run in opposite directions, they partly offset each other, which is why the overall ATE reported by the previous cate command is close to 0. The estat gatetest command provides further evidence of differences in treatment effects between citizens and noncitizens.

. estat gatetest

Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous

 ( 1)  [GATE]0bn.swiss - [GATE]1.swiss = 0

    chi2(1) =  24.16
Prob > chi2 = 0.0000
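The offsetting of the two group effects can be checked with simple arithmetic. If the ATE is the group-share-weighted average of the two GATEs (our assumption about how the estimates aggregate, not something the output states), the share of Swiss citizens implied by each fit can be backed out from the reported numbers. The scalar names below are hypothetical.

```stata
* Solve ATE = s*GATE_swiss + (1 - s)*GATE_noncitizen for the Swiss share s
* All estimates are copied from the two cate, group(swiss) tables

* Parametric fit
scalar share_para = (1.082079 - (-.1479488)) / (1.799268 - (-.1479488))
* Random-forest fit
scalar share_rf   = (.3607137 - (-1.503144)) / (1.447469 - (-1.503144))

display "Implied Swiss share (parametric):    " %5.3f share_para
display "Implied Swiss share (random forest): " %5.3f share_rf
```

Both fits imply the same share, about 0.632, which is consistent with the weighted-average reading. In the random-forest fit, a gain of about 1.4 months for roughly 63% of the sample and a loss of about 1.5 months for the remaining 37% average out to the small ATE of about 0.36 months.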

Final words

Stata’s new cate command lets you explore treatment-effect heterogeneity without assuming any fixed functional form in your model. In this example, we saw how imposing parametric functional forms led to a different conclusion about the impact of a job-training program on the noncitizen group. Using machine learning algorithms, cate can model complex dependencies in your data, guarding your causal estimates against potential misspecification bias. To learn more about cate, see [CAUSAL] cate.

You can also perform CATE analysis with machine learning using h2oml.

References

Lechner, M., M. Knaus, M. Huber, M. Frölich, S. Behncke, G. Mellace, and A. Strittmatter. Evaluations of Swiss Active Labor Market Policies. SWISSUbase. https://www.swissubase.ch/en/catalogue/studies/13867/16652/overview.

Knaus, M. C. 2022. Double machine learning–based programme evaluation under unconfoundedness. Econometrics Journal 25: 602–627. https://arxiv.org/pdf/2003.03191.

— Lingyi Li
Staff Econometrician

— Eduardo Garcia Echeverri
Senior Econometrician
