In the spotlight: A footballer ensemble of decision trees
Do you have a large dataset? Does it have complex, possibly nonlinear relationships? Are you unsure which predictors are the most important? If you answered “yes” to any of these questions, then machine learning might be the right choice for you and your data.
Fortunately, you don't have to be a programmer to use these methods effectively. Powered by H2O, you can now perform machine learning within Stata by using the new h2oml suite of commands. The streamlined H2O integration and graphical interface make it easy to fit gradient boosting machine (GBM) and random forest (RF) models for regression, binary classification, and multiclass classification.
In this spotlight article, we illustrate a basic H2O workflow by analyzing how on-field performance metrics predict football players’ average market value—a key factor in player transfers, contract negotiations, and overall valuation in the football industry. We will train two models for predictive comparison and interpret predictor influence using our selected model.
The data-exploration warm-up
Our analysis begins with a dataset merged from two different sources (GitHub and Kaggle) that contains over 1,000 observations of different players from 2023. You can obtain these data with the following command.
. use https://www.stata.com/users/lil/fifa, clear
We generate box plots to visualize the distribution of market value in millions of euros by position.
. graph hbox mil_average_market_value, over(position) title("Market value distribution by position") ytitle("Market value (in millions)")
Wondering who that dot is way to the right of everyone else? That’s Kylian Mbappé, the French forward. At the end of this article, we’ll investigate why he is worth so much.
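For a numeric complement to the box plots, a detailed summary of the raw market values makes the skew explicit. This is standard Stata, shown here as a sketch with the output omitted.

. summarize mil_average_market_value, detail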
Because of the outliers and right skew, we apply a logarithmic transformation to market value to stabilize the variance, which may help our models generalize better later on.
. generate ln_average_market_value = log(average_market_value)
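One caveat worth flagging now: models fit to ln_average_market_value make predictions on the log scale, so reporting values in euros requires exponentiating. The line below is only a sketch with a hypothetical variable name (predicted_ln_value) standing in for whichever prediction you generate later; exp() is simply the inverse of Stata's natural-log log() function.

. generate predicted_value_euros = exp(predicted_ln_value)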
Getting H2O ready for kickoff
Let’s demonstrate how to begin working with H2O from Stata. First, h2o init starts a new H2O cluster, establishing the connection between Stata and H2O. Next, we transfer the current Stata dataset into an H2O frame named fifa and make it the current, active frame for subsequent operations.
. h2o init
(output omitted)

. _h2oframe put, into(fifa) current
Progress (%): 0 100
Now that we have established our H2O environment, we begin preparing our data for machine learning. We use _h2oframe toenum to convert string variables to categorical (enumerated) types. To verify that this encoding is done correctly, we run _h2oframe describe.
. _h2oframe toenum position nationality league_rank, replace

. _h2oframe describe
Rows: 1108  Cols: 21

  Column            Type     Missing   Zeros   +Inf   -Inf   Cardinality
  -----------------------------------------------------------------------
  name              string         0       0      0      0
  position          enum           0     269      0      0             4
  age               int            0       0      0      0
  height            int            0       0      0      0
  league_rank       enum           0     281      0      0             5
  average_marke~e   real           0       0      0      0
  highest_marke~e   int            0       0      0      0
  total_played_~s   int            0       0      0      0
  average_minut~d   real           0       0      0      0
  average_assis~e   real           0     419      0      0
  total_assists     int            0     419      0      0
  assist_per_mi~e   real           0     419      0      0
  average_goals~e   real           0     414      0      0
  total_goals       int            0     414      0      0
  goals_per_min~e   real           0     414      0      0
  total_yellow_~s   int            0     163      0      0
  team_win_ratio    real           0       1      0      0
  data_year         int            0       0      0      0
  nationality       enum           0       6      0      0            73
  mil_average_m~e   real           0       0      0      0
  ln_average_ma~e   real           0       0      0      0
The final step is to divide the data into training and testing sets using _h2oframe split. We use a standard train–test split, allocating 80% of our data for training and reserving the remaining 20% for testing, and we set a random seed for reproducibility.
. _h2oframe split fifa, into(train test) split(0.8 0.2) rseed(19)
Machine learning play in action
We are ready to begin training! First, we change frames to make the training dataset the working frame.
. _h2oframe change train
We define the global macro predictors to store the variables to be used by our models. The variables include player demographics, league information, performance metrics, and other behavioral and team success indicators.
. global predictors position age height nationality league_rank average_minutes_played average_goals_per_game average_assists_per_game total_yellow_cards team_win_ratio
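As a quick sanity check before training, we can confirm that the macro holds the 10 predictors we intend to use. This is plain Stata with nothing H2O-specific, shown as a sketch with the output omitted.

. display "$predictors"
. display "Number of predictors: " wordcount("$predictors")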
We begin by implementing a random forest regression, applying three-fold cross-validation with the cv() option, and ensuring reproducibility with the h2orseed() option.
. h2oml rfregress ln_average_market_value $predictors, cv(3) h2orseed(19)
Progress (%): 0 28.9 100

Random forest regression using H2O

Response: ln_average_market_value
Frame:                                  Number of observations:
  Training: train                                 Training         =    877
                                                  Cross-validation =    877
Cross-validation: Random                          Number of folds  =      3

Model parameters
Number of trees  =    50
         actual  =    50
Tree depth:                             Pred. sampling value =     -1
      Input max  =    20                Sampling rate        =   .632
            min  =    17                No. of bins cat.     =  1,024
            avg  =  18.9                No. of bins root     =  1,024
            max  =    20                No. of bins cont.    =     20
                                        Min. obs. leaf split =      1
                                        Min. split thresh.   = .00001

Metric summary
-------------------------------------
          |               Cross-
   Metric |  Training   validation
----------+--------------------------
 Deviance |  .6068654     .6541131
      MSE |  .6068654     .6541131
     RMSE |  .7790156     .8087726
    RMSLE |  .0480274     .0497248
      MAE |  .6133148     .6394611
R-squared |  .6423362     .6144901
-------------------------------------
For this example, we focus on mean squared error (MSE) as our key measure of overall goodness of fit. This model achieves a cross-validation MSE of 0.65, which we take as our baseline performance. Let's see whether we can lower this value through tuning.
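For reference, here are the standard definitions rather than anything specific to H2O: MSE averages the squared prediction errors, and RMSE is its square root. In the regression output here, the reported deviance coincides with the MSE, which is why those two rows match in the metric tables.

\[
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\text{RMSE} = \sqrt{\text{MSE}}
\]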
Tuning is the process of tweaking our model by adjusting hyperparameters. For demonstration purposes, we will tune only the number of trees by comparing models with 20 to 80 trees. To see a full list of tunable options for RF, see [H2OML] h2oml rf.
. h2oml rfregress ln_average_market_value $predictors, cv(3) h2orseed(19) ntrees(20(10)80)
Progress (%): 0 100

Random forest regression using H2O

Response: ln_average_market_value
Frame:                                  Number of observations:
  Training: train                                 Training         =    877
                                                  Cross-validation =    877
Cross-validation: Random                          Number of folds  =      3

Tuning information for hyperparameters
Method: Cartesian
Metric: Deviance
------------------------------------------------------
                |            Grid values
Hyperparameters |   Minimum     Maximum    Selected
----------------+-------------------------------------
Number of trees |        20          80          70
------------------------------------------------------

Metric summary
-------------------------------------
          |               Cross-
   Metric |  Training   validation
----------+--------------------------
 Deviance |   .587266     .6434246
      MSE |   .587266     .6434246
     RMSE |  .7663328     .8021375
    RMSLE |  .0472597     .0493562
      MAE |  .6040548     .6320858
R-squared |  .6538873     .6207895
-------------------------------------
We modestly lower the MSE to 0.64. In practice, we would continue tuning to find the model with the lowest cross-validation MSE. Let’s go ahead and store this model.
. h2omlest store rf
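If we wanted to keep tuning, we could widen the grid beyond the number of trees, for example, by also searching over tree depth, which appears in the model-parameters output above. The command below is only a sketch: it assumes that maxdepth() accepts a numlist for tuning in the same way ntrees() does; see [H2OML] h2oml rf for the exact names and syntax of the tunable options.

. h2oml rfregress ln_average_market_value $predictors, cv(3) h2orseed(19) ntrees(20(10)80) maxdepth(10(5)25)

With a Cartesian grid, every combination of trees and depth is evaluated, and the combination with the best cross-validation deviance is selected.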
Next, we train a gradient boosting regression using the same three-fold cross-validation, random-number seed, and tree grid and store this model. To see a full list of tunable options for GBM, see [H2OML] h2oml gbm.
. h2oml gbregress ln_average_market_value $predictors, cv(3) h2orseed(19) ntrees(20(10)80)
Progress (%): 0 100

Gradient boosting regression using H2O

Response: ln_average_market_value
Loss:     Gaussian
Frame:                                  Number of observations:
  Training: train                                 Training         =    877
                                                  Cross-validation =    877
Cross-validation: Random                          Number of folds  =      3

Tuning information for hyperparameters
Method: Cartesian
Metric: Deviance
------------------------------------------------------
                |            Grid values
Hyperparameters |   Minimum     Maximum    Selected
----------------+-------------------------------------
Number of trees |        20          80          80
------------------------------------------------------

Metric summary
-------------------------------------
          |               Cross-
   Metric |  Training   validation
----------+--------------------------
 Deviance |  .1025126     .6337925
      MSE |  .1025126     .6337925
     RMSE |  .3201759     .7961109
    RMSLE |  .0198991     .0492887
      MAE |    .23022     .6083482
R-squared |  .9395829     .6264663
-------------------------------------
. h2omlest store gbm
To help us choose a model, we evaluate its predictive performance on the testing sample using the h2oml postestimation commands. First, we restore each model using h2omlest restore. Then, we activate the testing dataset for evaluation with h2omlpostestframe test. Once both models are prepared, we run h2omlgof to compare their performance metrics on the testing data.
. h2omlest restore rf
(results rf are active now)

. h2omlpostestframe test
(testing frame test is now active for h2oml postestimation)

. h2omlest restore gbm
(results gbm are active now)

. h2omlpostestframe test
(testing frame test is now active for h2oml postestimation)

. h2omlgof rf gbm

Performance metrics for model comparison using H2O
Testing frame: test
----------------------------------------------
                      |        rf        gbm
----------------------+-----------------------
Testing               |
  No. of observations |       231        231
  Deviance            |  .5692276   .5008833
  MSE                 |  .5692276   .5008833
  RMSE                |  .7544718   .7077311
  RMSLE               |  .0458549   .0430169
  MAE                 |  .6169705    .567296
  R-squared           |  .6051586   .6525652
----------------------------------------------
GBM has the lower MSE on the testing set, demonstrating better predictive performance than RF. If we were interested in making predictions in the testing data, we could do that now by using h2omlpredict after restoring our chosen model.
. h2omlest restore gbm
(results gbm are active now)

. h2omlpredict gbm_pred_value
Progress (%): 0 100
Scoring insights with variable importance and SHAP
Now that we have selected the best-performing model, we examine the influence of its predictors by plotting variable importance using h2omlgraph varimp. In tree-based models, variable importance is measured by the total reduction in MSE resulting from splits on each predictor.
. h2omlgraph varimp
The variable importance plot identifies the top three predictors of market value as team_win_ratio, age, and nationality. In contrast, variables such as player position and height contribute comparatively little.
For further interpretation, we now turn to SHAP (SHapley Additive exPlanations) values to understand how each predictor contributes to each player's predicted market value. Let’s look at the SHAP contributions for our most valuable player, Kylian Mbappé, using h2omlgraph shapvalues. His predicted average market value (log-transformed) is 18.65, a difference of 3.06 units from the average prediction across the training dataset, 15.59. We use SHAP values to explain this difference.
. h2omlgraph shapvalues, obs(590) title("SHAP values for Kylian Mbappé") frame(fifa)
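In case you are wondering how we knew Mbappé is observation 590: because the fifa frame was created directly from the Stata dataset with _h2oframe put, we assume the H2O frame preserves the Stata row order, so a quick lookup in the Stata data (a sketch using standard Stata) returns his row number.

. generate obsno = _n
. list obsno name mil_average_market_value if strpos(name, "Mbapp")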
Together, team win ratio (1.3), average goals per game (0.47), and average minutes played (0.42) account for over two-thirds of the increase in predicted market value from the baseline to Mbappé's prediction, marking them as the key drivers of his high valuation.
Lastly, to gain insights at the sample level, we can generate a SHAP summary plot, also known as a beeswarm plot, using the h2omlgraph shapsummary command.
. h2omlgraph shapsummary, frame(fifa)
Progress (%): 0 100
In this plot, each observation has a dot for each predictor, with its horizontal position indicating the predictor’s SHAP value for that observation. High observed values of the predictor are represented by a red color, and low values are represented by blue. Predictors are ranked on the y axis by their overall SHAP importance.
Team win ratio emerges as the strongest predictor, with higher ratios consistently boosting player market value. Age shows the inverse pattern—youth adds value while older age reduces it, reflecting the premium on younger players.
While individual predictor contributions vary by player, the SHAP summary results broadly align with the variable importance rankings, indicating team success, age, and performance metrics as the most influential drivers of player valuation.
Final whistle
Stata’s machine learning tools via H2O provide both high predictive accuracy and valuable interpretability through detailed assessments of variable importance. To learn more about what machine learning can do for you in Stata, see [H2OML] h2oml.
References
Ahmed, M. 2023. Football players data. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/6960429.
Ocak, M., and H. Bal. 2023. Fifa-overall-prediction (commit 141de05). GitHub. https://github.com/m0cak/Fifa-Overall-Prediction.
— Lingyi Li
Staff Econometrician
— Neel Gopal
Associate Software Developer
— Sarah Lenox
Software Developer II