In the spotlight: A footballer ensemble of decision trees
Do you have a large dataset? Does it have complex, possibly nonlinear relationships? Are you unsure which predictors are the most important? If you answered “yes” to any of these questions, then machine learning might be the right choice for you and your data.
Fortunately, you don't have to be a programmer to use these methods effectively. Powered by H2O, you can now perform machine learning within Stata by using the new h2oml suite of commands. The streamlined H2O integration and graphical interface make it easy to fit gradient boosting machine (GBM) and random forest (RF) models for regression, binary classification, and multiclass classification.
In this spotlight article, we illustrate a basic H2O workflow by analyzing how on-field performance metrics predict football players’ average market value—a key factor in player transfers, contract negotiations, and overall valuation in the football industry. We will train two models for predictive comparison and interpret predictor influence using our selected model.
The data-exploration warm-up
Our analysis begins with a dataset merged from two different sources (GitHub and Kaggle) that contains over 1,000 observations of different players from 2023. You can obtain these data with the following command.
. use https://www.stata.com/users/lil/fifa, clear
We generate box plots to visualize the distribution of market value in millions of euros by position.
. graph hbox mil_average_market_value, over(position) title("Market value distribution by position") ytitle("Market value (in millions)")
Wondering who that dot is way to the right of everyone else? That’s Kylian Mbappé, the French forward. At the end of this article, we’ll investigate why he is worth so much.
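For a numeric complement to the box plots, a detailed summary of the raw market values makes the skew explicit. This is standard Stata, shown here as a sketch with the output omitted.

. summarize mil_average_market_value, detail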
Because of the outliers and right skew, we apply a logarithmic transformation to market value to stabilize the variance, which may help our models generalize better later on.
. generate ln_average_market_value = log(average_market_value)
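One caveat worth flagging now: models fit to ln_average_market_value make predictions on the log scale, so reporting values in euros requires exponentiating. The line below is only a sketch with a hypothetical variable name (predicted_ln_value) standing in for whichever prediction you generate later; exp() is simply the inverse of Stata's natural-log log() function.

. generate predicted_value_euros = exp(predicted_ln_value)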
Getting H2O ready for kickoff
Let’s demonstrate how to begin working with H2O from Stata. First, h2o init starts a new H2O cluster, establishing the connection between Stata and H2O. Next, we transfer the current Stata dataset into an H2O frame named fifa and make it the current, active frame for subsequent operations.
. h2o init
(output omitted)

. _h2oframe put, into(fifa) current
Progress (%): 0 100
Now that we have established our H2O environment, we begin preparing our data for machine learning. We use _h2oframe toenum to convert string variables to categorical (enumerated) types. To verify that this encoding is done correctly, we run _h2oframe describe.
. _h2oframe toenum position nationality league_rank, replace

. _h2oframe describe
Rows: 1108  Cols: 21

  Column            Type     Missing   Zeros   +Inf   -Inf   Cardinality
  -----------------------------------------------------------------------
  name              string         0       0      0      0
  position          enum           0     269      0      0             4
  age               int            0       0      0      0
  height            int            0       0      0      0
  league_rank       enum           0     281      0      0             5
  average_marke~e   real           0       0      0      0
  highest_marke~e   int            0       0      0      0
  total_played_~s   int            0       0      0      0
  average_minut~d   real           0       0      0      0
  average_assis~e   real           0     419      0      0
  total_assists     int            0     419      0      0
  assist_per_mi~e   real           0     419      0      0
  average_goals~e   real           0     414      0      0
  total_goals       int            0     414      0      0
  goals_per_min~e   real           0     414      0      0
  total_yellow_~s   int            0     163      0      0
  team_win_ratio    real           0       1      0      0
  data_year         int            0       0      0      0
  nationality       enum           0       6      0      0            73
  mil_average_m~e   real           0       0      0      0
  ln_average_ma~e   real           0       0      0      0
The final step is to divide the data into training and testing sets using _h2oframe split. We use a standard train–test split, allocating 80% of our data for training and reserving the remaining 20% for testing, and we set a random seed for reproducibility.
. _h2oframe split fifa, into(train test) split(0.8 0.2) rseed(19)
Machine learning play in action
We are ready to begin training! First, we change frames to make the training dataset the working frame.
. _h2oframe change train
We define the global macro predictors to store the variables to be used by our models. The variables include player demographics, league information, performance metrics, and other behavioral and team success indicators.
. global predictors position age height nationality league_rank average_minutes_played average_goals_per_game average_assists_per_game total_yellow_cards team_win_ratio
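As a quick sanity check before training, we can confirm that the macro holds the 10 predictors we intend to use. This is plain Stata with nothing H2O-specific, shown as a sketch with the output omitted.

. display "$predictors"
. display "Number of predictors: " wordcount("$predictors")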
We begin by implementing a random forest regression, applying three-fold cross-validation with the cv() option, and ensuring reproducibility with the h2orseed() option.
. h2oml rfregress ln_average_market_value $predictors, cv(3) h2orseed(19)
Progress (%): 0 28.9 100

Random forest regression using H2O

Response: ln_average_market_value
Frame:                                  Number of observations:
  Training: train                                 Training         =    877
                                                  Cross-validation =    877
Cross-validation: Random                          Number of folds  =      3

Model parameters
Number of trees  =    50
         actual  =    50
Tree depth:                             Pred. sampling value =     -1
      Input max  =    20                Sampling rate        =   .632
            min  =    17                No. of bins cat.     =  1,024
            avg  =  18.9                No. of bins root     =  1,024
            max  =    20                No. of bins cont.    =     20
                                        Min. obs. leaf split =      1
                                        Min. split thresh.   = .00001

Metric summary
-------------------------------------
          |               Cross-
   Metric |  Training   validation
----------+--------------------------
 Deviance |  .6068654     .6541131
      MSE |  .6068654     .6541131
     RMSE |  .7790156     .8087726
    RMSLE |  .0480274     .0497248
      MAE |  .6133148     .6394611
R-squared |  .6423362     .6144901
-------------------------------------
For this example, we focus on mean squared error (MSE) as our key measure of overall goodness of fit. This model achieves a cross-validation MSE of 0.65, which we take as our baseline performance. Let's see whether we can lower this value through tuning.
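For reference, here are the standard definitions rather than anything specific to H2O: MSE averages the squared prediction errors, and RMSE is its square root. In the regression output here, the reported deviance coincides with the MSE, which is why those two rows match in the metric tables.

\[
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\text{RMSE} = \sqrt{\text{MSE}}
\]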
Tuning is the process of tweaking our model by adjusting hyperparameters. For demonstration purposes, we will tune only the number of trees by comparing models with 20 to 80 trees. To see a full list of tunable options for RF, see [H2OML] h2oml rf.
. h2oml rfregress ln_average_market_value $predictors, cv(3) h2orseed(19) ntrees(20(10)80)
Progress (%): 0 100

Random forest regression using H2O

Response: ln_average_market_value
Frame:                                  Number of observations:
  Training: train                                 Training         =    877
                                                  Cross-validation =    877
Cross-validation: Random                          Number of folds  =      3

Tuning information for hyperparameters
Method: Cartesian
Metric: Deviance
------------------------------------------------------
                |            Grid values
Hyperparameters |   Minimum     Maximum    Selected
----------------+-------------------------------------
Number of trees |        20          80          70
------------------------------------------------------

Metric summary
-------------------------------------
          |               Cross-
   Metric |  Training   validation
----------+--------------------------
 Deviance |   .587266     .6434246
      MSE |   .587266     .6434246
     RMSE |  .7663328     .8021375
    RMSLE |  .0472597     .0493562
      MAE |  .6040548     .6320858
R-squared |  .6538873     .6207895
-------------------------------------
We modestly lower the MSE to 0.64. In practice, we would continue tuning to find the model with the lowest cross-validation MSE. Let’s go ahead and store this model.
. h2omlest store rf
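If we wanted to keep tuning, we could widen the grid beyond the number of trees, for example, by also searching over tree depth, which appears in the model-parameters output above. The command below is only a sketch: it assumes that maxdepth() accepts a numlist for tuning in the same way ntrees() does; see [H2OML] h2oml rf for the exact names and syntax of the tunable options.

. h2oml rfregress ln_average_market_value $predictors, cv(3) h2orseed(19) ntrees(20(10)80) maxdepth(10(5)25)

With a Cartesian grid, every combination of trees and depth is evaluated, and the combination with the best cross-validation deviance is selected.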
Next, we train a gradient boosting regression using the same three-fold cross-validation, random-number seed, and tree grid and store this model. To see a full list of tunable options for GBM, see [H2OML] h2oml gbm.
. h2oml gbregress ln_average_market_value $predictors, cv(3) h2orseed(19) ntrees(20(10)80)
Progress (%): 0 100

Gradient boosting regression using H2O

Response: ln_average_market_value
Loss:     Gaussian
Frame:                                  Number of observations:
  Training: train                                 Training         =    877
                                                  Cross-validation =    877
Cross-validation: Random                          Number of folds  =      3

Tuning information for hyperparameters
Method: Cartesian
Metric: Deviance
------------------------------------------------------
                |            Grid values
Hyperparameters |   Minimum     Maximum    Selected
----------------+-------------------------------------
Number of trees |        20          80          80
------------------------------------------------------

Metric summary
-------------------------------------
          |               Cross-
   Metric |  Training   validation
----------+--------------------------
 Deviance |  .1025126     .6337925
      MSE |  .1025126     .6337925
     RMSE |  .3201759     .7961109
    RMSLE |  .0198991     .0492887
      MAE |    .23022     .6083482
R-squared |  .9395829     .6264663
-------------------------------------
. h2omlest store gbm
To help us choose a model, we evaluate its predictive performance on the testing sample using the h2oml postestimation commands. First, we restore each model using h2omlest restore. Then, we activate the testing dataset for evaluation with h2omlpostestframe test. Once both models are prepared, we run h2omlgof to compare their performance metrics on the testing data.
. h2omlest restore rf
(results rf are active now)

. h2omlpostestframe test
(testing frame test is now active for h2oml postestimation)

. h2omlest restore gbm
(results gbm are active now)

. h2omlpostestframe test
(testing frame test is now active for h2oml postestimation)

. h2omlgof rf gbm

Performance metrics for model comparison using H2O
Testing frame: test
----------------------------------------------
                      |        rf        gbm
----------------------+-----------------------
Testing               |
  No. of observations |       231        231
  Deviance            |  .5692276   .5008833
  MSE                 |  .5692276   .5008833
  RMSE                |  .7544718   .7077311
  RMSLE               |  .0458549   .0430169
  MAE                 |  .6169705    .567296
  R-squared           |  .6051586   .6525652
----------------------------------------------
GBM has the lower MSE on the testing set, demonstrating better predictive performance than RF. If we were interested in making predictions in the testing data, we could do that now by using h2omlpredict after restoring our chosen model.
. h2omlest restore gbm
(results gbm are active now)

. h2omlpredict gbm_pred_value
Progress (%): 0 100
Scoring insights with variable importance and SHAP
Now that we have selected the best-performing model, we examine the influence of its predictors by plotting variable importance using h2omlgraph varimp. In tree-based models, variable importance is measured by the total reduction in MSE resulting from splits on each predictor.
. h2omlgraph varimp
The variable importance plot identifies the top three predictors of market value as team_win_ratio, age, and nationality. In contrast, variables such as player position and height contribute comparatively little.
For further interpretation, we now turn to SHAP (SHapley Additive exPlanations) values to understand how each predictor contributes to each player's predicted market value. Let’s look at the SHAP contributions for our most valuable player, Kylian Mbappé, using h2omlgraph shapvalues. His predicted average market value (log-transformed) is 18.65, a difference of 3.06 units from the average prediction across the training dataset, 15.59. We use SHAP values to explain this difference.
. h2omlgraph shapvalues, obs(590) title("SHAP values for Kylian Mbappé") frame(fifa)
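In case you are wondering how we knew Mbappé is observation 590: because the fifa frame was created directly from the Stata dataset with _h2oframe put, we assume the H2O frame preserves the Stata row order, so a quick lookup in the Stata data (a sketch using standard Stata) returns his row number.

. generate obsno = _n
. list obsno name mil_average_market_value if strpos(name, "Mbapp")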
Together, team win ratio (1.3), average goals per game (0.47), and average minutes played (0.42) account for over two-thirds of the increase in predicted market value from the baseline to Mbappé's prediction, marking them as the key drivers of his high valuation.
Lastly, to gain insights at the sample level, we can generate a SHAP summary plot, also known as a beeswarm plot, using the h2omlgraph shapsummary command.
. h2omlgraph shapsummary, frame(fifa)
Progress (%): 0 100
In this plot, each observation has a dot for each predictor, with its horizontal position indicating the predictor’s SHAP value for that observation. High observed values of the predictor are represented by a red color, and low values are represented by blue. Predictors are ranked on the y axis by their overall SHAP importance.
Team win ratio emerges as the strongest predictor, with higher ratios consistently boosting player market value. Age shows the inverse pattern—youth adds value while older age reduces it, reflecting the premium on younger players.
While individual predictor contributions vary by player, the SHAP summary results broadly align with the variable importance rankings, indicating team success, age, and performance metrics as the most influential drivers of player valuation.
Final whistle
Stata’s machine learning tools via H2O provide both high predictive accuracy and valuable interpretability through detailed assessments of variable importance. To learn more about what machine learning can do for you in Stata, see [H2OML] h2oml.
References
Ahmed, M. 2023. Football players data. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/6960429.
Ocak, M., and H. Bal. 2023. Fifa-overall-prediction (commit 141de05). GitHub. https://github.com/m0cak/Fifa-Overall-Prediction.
— Lingyi Li
Staff Econometrician
— Neel Gopal
Associate Software Developer
— Sarah Lenox
Software Developer II