The 2019 Spanish Stata Conference was held on 17 October in Madrid at Lexington Madrid.
Introduction to lasso models
Abstract: The increasing availability of high-dimensional data and increasing interest in more realistic functional forms have sparked a renewed interest in automated methods for selecting the covariates to include in a model. I discuss the promises and perils of model selection and pay special attention to estimators that provide reliable inference after model selection. I will demonstrate how to use Stata 16's new features for double selection, partialing out, and cross-fit partialing out to estimate the effects of variables of interest while using lasso methods to select control variables.
David M. Drukker
Simultaneous hierarchical summary ROC analysis of cohort and case-control diagnostic accuracy studies
Abstract: The hierarchical summary ROC (HSROC) model (Rutter and Gatsonis 2001) is one of two statistically rigorous multilevel or mixed-effects models recommended for diagnostic test accuracy meta-analysis by the Cochrane Collaboration. The original parameterization of the HSROC model does not incorporate the difference in log-likelihood expressions between cohort (prevalence-dependent) and case-control (prevalence-independent) diagnostic test data (Ma X 2016). Using publicly available data from a meta-analysis of gadolinium-enhanced MRI for detecting lymph node metastases, I show in this presentation how bayesmh and its many postestimation commands, the cond() function, and substitutable expressions in Stata facilitate estimation, graphical depiction, and interrogation of simultaneous HSROC modeling of cohort and case-control diagnostic test accuracy studies.
Ben Adarkwa Dwamena
University of Michigan
Evaluating the out-of-sample prediction performance of panel-data models
Abstract: We have developed four new commands (xtreg_oust, xtreg_ousi, xtlogit_oust, and xtlogit_ousi) that evaluate the out-of-sample prediction performance of panel-data models separately in their time-series and cross-individual dimensions, with separate procedures for continuous and dichotomous dependent variables. The time-series procedures exclude a user-defined number of time periods from the estimation sample for each individual in the panel. Similarly, the cross-individual procedures exclude a user-defined group of individuals (for example, countries) from the estimation sample, including all their observations over time. The procedures then fit the specified models on the remaining subsamples and use the resulting parameters to forecast the dependent variable (or the probability of a positive outcome) in the unused periods or individuals. The unused time-period or individual sets are then recursively reduced, by one period at each subsequent step or in a random or ordered fashion, and the estimation and forecast evaluation are repeated until no further periods or individuals remain to be evaluated. In the continuous case, the model's forecasting performance is reported both in absolute terms (RMSE) and relative to an AR(1) model through a Theil's U ratio. In the dichotomous case, prediction performance is evaluated using the area under the receiver operating characteristic (ROC) curve, computed both in the training sample and out of sample. Despite their names, the procedures allow one to choose different estimation methods, including some dynamic methodologies, and can also be used with a purely time-series or purely cross-sectional dataset. They also allow evaluating the model's forecasting performance for one particular individual or for a defined group of individuals instead of the whole panel.
Alfonso Ugarte Ruiz
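The recursive time-series evaluation described in the abstract above can be illustrated with a minimal Python sketch. It is not the implementation of the xtreg_oust family of commands: it handles a single series rather than a panel, the function names (rmse, ar1_forecast, recursive_evaluation) are hypothetical, and it assumes that the Theil's U ratio is the model's RMSE divided by the RMSE of an AR(1) benchmark fitted by OLS on the same expanding window.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def ar1_forecast(history):
    """One-step-ahead forecast from y_t = a + b*y_{t-1}, fitted by OLS on history."""
    lagged, current = history[:-1], history[1:]
    n = len(current)
    mean_x, mean_y = sum(lagged) / n, sum(current) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    # Fall back to a constant forecast if the lagged values do not vary.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, current)) / sxx if sxx else 0.0
    a = mean_y - b * mean_x
    return a + b * history[-1]

def recursive_evaluation(series, holdout, forecaster):
    """Hold out the last `holdout` periods, then shrink the held-out set one
    period at a time, forecasting the next period at each step.

    Returns (model RMSE, ratio of model RMSE to AR(1) benchmark RMSE)."""
    actual, model_pred, bench_pred = [], [], []
    for origin in range(len(series) - holdout, len(series)):
        history = series[:origin]          # estimation sample for this step
        actual.append(series[origin])      # first unused period
        model_pred.append(forecaster(history))
        bench_pred.append(ar1_forecast(history))
    model_rmse = rmse(actual, model_pred)
    bench_rmse = rmse(actual, bench_pred)
    ratio = model_rmse / bench_rmse if bench_rmse > 0 else float("nan")
    return model_rmse, ratio

# Illustrative data and a naive "last value" forecaster (both made up).
series = [1.0, 2.0, 1.5, 3.0, 2.5, 4.0, 3.5, 5.0]
naive_rmse, naive_u = recursive_evaluation(series, 3, lambda history: history[-1])
```

Under this convention, a ratio below 1 means the model forecasts better than the AR(1) benchmark over the held-out periods, and a ratio above 1 means it forecasts worse.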
tripod: A postestimation command for internal validation of predictive logistic regression models
Abstract: The TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) standards for predictive model reporting in research include internal model validation. Bootstrap techniques are the most appropriate procedure for internal validation, because they use all the data used in the development of the model and allow optimism to be quantified.
Objective: Provide researchers with a tool implemented in Stata as a postestimation command for internal bootstrap validation of logistic regression models.
Methods: The validation method follows the following algorithm:
Conclusions: This tool makes the internal validation methods more accessible to researchers and allows better reporting of predictive models according to the TRIPOD standards.
Borja M. Fernández-Félix
Instituto Ramón y Cajal de Investigación Sanitaria
discretize: Command to convert a continuous instrument into a dummy variable for instrumental-variables estimation
Abstract: The instrumental variables (IV) method is a standard econometric approach to address endogeneity issues. Many instruments rely on cross-sectional variation produced by a dummy variable that is discretized from a continuous variable. Converting a continuous variable into a binary instrument provides a simple tool to evaluate the IV strategy and the identification assumptions. Unfortunately, the construction of the binary instrument often appears to be arbitrary, which may raise concerns about the robustness of the second-stage results. We propose a data-driven procedure to build this discrete instrument, implemented in a command called discretize. The boundaries of the discrete variable are chosen to maximize the F statistic in the first stage. This procedure has two main advantages. First, it minimizes the weak-instrument problem, which can arise in the case of incorrect functional specification in the first stage. Second, it offers a transparent, data-driven procedure to select an instrument that does not depend on arbitrary decisions. Several options are available with the command to graphically check the robustness of the estimates. The presentation also illustrates the command's usefulness with an example that relates the rise of violent crime in city centers to the process of suburbanization. The endogeneity is addressed using lead poisoning as an instrument.
Colegio Universitario de Estudios Financieros
Université catholique de Louvain
University of New South Wales
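The selection rule described in the discretize abstract above, choosing the cutoff that maximizes the first-stage F statistic, can be sketched in Python as follows. This is a simplified illustration and not the command's implementation: it assumes a single cutoff, a first stage with only a constant and the dummy (no controls), and a hypothetical minimum group size; the function names and data are made up.

```python
def first_stage_f(endogenous, dummy):
    """F statistic from regressing the endogenous variable on a constant and
    one binary instrument (equal to the squared t statistic on the dummy)."""
    n = len(endogenous)
    n1 = sum(dummy)
    n0 = n - n1
    if n0 == 0 or n1 == 0:
        return 0.0  # the dummy has no variation, so it carries no information
    mean_all = sum(endogenous) / n
    mean1 = sum(x for x, z in zip(endogenous, dummy) if z) / n1
    mean0 = sum(x for x, z in zip(endogenous, dummy) if not z) / n0
    explained = n1 * (mean1 - mean_all) ** 2 + n0 * (mean0 - mean_all) ** 2
    residual = sum((x - (mean1 if z else mean0)) ** 2 for x, z in zip(endogenous, dummy))
    if residual == 0:
        return float("inf")
    return explained / (residual / (n - 2))

def best_threshold(continuous, endogenous, min_group=2):
    """Scan candidate cutoffs t, build z = 1[continuous >= t], and keep the
    cutoff with the largest first-stage F statistic."""
    best_t, best_f = None, -1.0
    for t in sorted(set(continuous)):
        dummy = [1 if c >= t else 0 for c in continuous]
        if min(sum(dummy), len(dummy) - sum(dummy)) < min_group:
            continue  # skip cutoffs that leave a group too small
        f = first_stage_f(endogenous, dummy)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Made-up data: the endogenous variable jumps when the instrument reaches 5.
continuous = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
endogenous = [1.0, 1.1, 0.9, 1.0, 1.2, 3.0, 3.1, 2.9, 3.0, 3.2]
threshold, f_stat = best_threshold(continuous, endogenous)
```

Because the cutoff is chosen by searching over the data, the maximized F statistic is itself a selected quantity; the robustness checks mentioned in the abstract are relevant for that reason.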
crtrees: An implementation of classification and regression trees (CART) and Random Forests in Stata
Abstract: crtrees fits classification trees and regression trees (Breiman et al. 1984) and Random Forests (Breiman 2001; Scornet et al. 2015). Classification and regression trees consist of three algorithms: tree growing, tree pruning, and finding the honest tree. The Random Forests algorithm is an ensemble method that implements tree growing for many random subsets of the data and of the set of splitting variables. Random Forests can be implemented both for classification and for regression applications.
Universidad Carlos III de Madrid
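As a rough illustration of the tree-growing step mentioned in the crtrees abstract above, the following Python sketch finds the best single split of one numeric variable for a classification tree. It assumes Gini impurity as the splitting criterion, which the abstract does not specify, and it is not the crtrees implementation; the function names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Try cut points halfway between consecutive distinct values and return
    (cut, weighted impurity) for the split that lowers impurity the most.
    Returns (None, parent impurity) if no split improves on the parent node."""
    best_cut, best_impurity = None, gini(labels)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= cut]
        right = [l for v, l in zip(values, labels) if v > cut]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_cut, best_impurity = cut, weighted
    return best_cut, best_impurity

# Made-up data with two well-separated classes.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
cut, impurity = best_split(values, labels)
```

Growing a full tree would apply this search recursively to each resulting node and across all splitting variables; a random forest would repeat the growing step on random subsets of the observations and variables, as the abstract describes.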
Performing meta-analysis with Stata
Abstract: Meta-analysis provides a theoretical framework to integrate and analyze empirical evidence from multiple studies. It has been applied to many areas of research, such as econometrics, education, psychology, and medicine. The new meta suite of commands provides an integrated framework that addresses the different aspects of a meta-analysis in a simple way. I'll discuss how to prepare and summarize our data, address heterogeneity using random-effects models, extend these models to meta-regression, and use postestimation commands to perform statistical tests and assess possible issues in our data.
Airports' managerial practices and efficiency
Abstract: Technical efficiency analysis has limited utility for policymakers and managers unless sources of inefficiency are identified. Apart from decisions about inputs and outputs, managers' skills and competencies gained over the years, as well as the environment, are drivers of airports' efficiency. Previous studies in agricultural economics have found that managerial practices (for example, Rougoor, Trip et al. 1998; Hansson 2008; and Manevska-Tasevska and Hansson 2011), agricultural education based on knowledge (for example, Galanopoulos et al. 2006 and Manevska-Tasevska 2013), experience (for example, Puig-Junoy and Argiles 2004), and economy-driven goals (for example, Willock et al. 1999 and Wilson et al. 2001) have a positive effect on efficiency. Nevertheless, these factors have not been accounted for in air transport studies. In this study, a stochastic frontier production function is used to measure airports' efficiency. Because investments have been made in some airports to the detriment of others, capital investments are treated as nonneutral technical changes, allowing time-varying efficiencies. The overall efficiency is expected to separate into airport-specific factors, such as airport managers' practices, and a component related to time-varying residual factors. Both managerial skills and the economic size of the airports, understood as technological endowments, should provide insights into airports' performance.
Ane Elixabete Ripoll-Zarraga
Universidad Pontificia de Comillas
On the cost of demand uncertainties in product-mix decisions: A simulation study built upon Goldratt's PQ problem
Abstract: Product-mix problems, where a range of products that generate different income compete for a limited set of resources, are key to the success of organizations in many industries. These are simple optimization problems in their most basic forms; however, the consideration of uncertainties may turn them into intractable problems. In this presentation, I investigate the economic impact of demand uncertainties on organizations facing such decision-making problems. I extend Goldratt's PQ problem to uncertain settings by considering variability in the volume and mix of customer needs. To this end, I develop a hybrid model-driven decision support system, characterized by a colored Petri net that shapes the dynamics of the agent-based production system along with a discrete-event engine for the simulation clock that makes it more agile. I design a Drum-Buffer-Rope mechanism aimed at protecting the throughput of the production system when it is exposed to uncertainty. Through a statistical study, I obtain regression models that link the net profit to the degree of volume and mix variability. I observe that, as demand uncertainty grows, the net profit becomes more volatile and tends to decrease. However, I find that up to a certain threshold, the production system is barely affected by variability; thus, the Drum-Buffer-Rope mechanism makes it robust to uncertainties. This has important implications for professionals dealing with product-mix problems because understanding the link between uncertainties and performance may prompt them to consider several actionable changes in their organizations.
Stata for millennials (and generation Z): Stata teaching videos using YouTube
Abstract: The objective of our presentation is to exchange information and experiences with other professors as well as university researchers and professionals interested in promoting and improving teaching methods, mainly in sociology, political science, and economics. We will share our experience with a teaching innovation project that was funded through a competitive process at the Universidad Autónoma de Madrid (D_020.18_INN "Quantitative techniques in short teaching videos"). The innovation project consists of producing a series of educational videos on the analysis of social data, aggregate or individual, using Stata. In this presentation, we will discuss at least three aspects:
Universidad Autónoma de Madrid
Interactive graphs with Stata
Abstract: Given the emergence of big data generated by massive digitization, as well as the growing access to information from the so-called second digital revolution, social scientists face a number of methodological challenges to better understand social life: data collection, new ways of sampling, automatic coding, and statistical analysis of information.
This presentation proposes the graphical analysis of information based on data binarization. The idea is to build three-dimensional binary matrices formed by 1) temporal or spatial sets, 2) scenarios, and 3) events or characteristics, supported by matrices with their attributes. The treatment of this structure is based on the methodology of two-mode networks, combined with statistical tools for selection and location of nodes and representation of edges. Graphs have been used not only to solve topographic problems and to represent social structures, but also to study relationships between variables. To improve their analytical potential, these graphs are endowed with an interactive potential that includes the selection of various attributes for the recognition of the elements analyzed and the modification of parameters to focus on stronger relationships.
In this presentation, we present a Stata program that uses Stata's recent link with Python to produce these interactive graphs. We give a variety of examples, ranging from the analysis of photo collections, content analysis of text, representations of concerts and exhibitions, and surveys of personal correspondence to the analysis of multiple-response questions in questionnaires.
Modesto Escobar et al.
Universidad de Salamanca
More graphics and tables with Stata
Abstract: It has become more difficult to publish an academic article in which you show only the table with the results of the different regressions used to test your hypotheses. At the very least, authors who add a few graphs of predicted probabilities (margins plots) or the occasional coefficient plot know that their chances of success are higher. Even so, not all graphics produced by Stata are the same and, without a doubt, some look prettier than others. Our presentation is dedicated to the graphic presentation of the results of multiple regression models. We divide the presentation into two parts.
The first is dedicated to the presentation of the estimates of the effects of the variables (the betas) with the help of the community-contributed coefplot command. We will start from a basic graph, and we will enrich it with headings, groups, notes, statistical significance, and other less-known options that help to better distinguish the contributions of each model. In addition, we will discuss the relevance or irrelevance of standardizing the variables (standard deviation = 1), showing how, surprisingly, some results change depending on whether we standardize the variables or leave them as they were.
However, "betas" are of little use when the corresponding variables are not quantitative. Because many of the variables of interest in the social sciences are qualitative, the second part of the presentation is dedicated to the presentation of the average marginal effects (AMEs) of all the independent variables, using the post option to save results that can then be graphed using official and community-contributed Stata commands.
Universidad Autónoma de Madrid
Open panel discussion with Stata developers
Dr. Ricardo Mora
Department of Economics, Universidad Carlos III de Madrid
Sociology and Political Science:
Dr. Modesto Escobar
Department of Sociology and Communication, Universidad de Salamanca
Health Sciences:
Dr. Alexander Zlotnik
Cuerpo Superior de Sistemas y Tecnologías de la Información de la Administración del Estado
Head of Service at the Spanish Ministry of Health, Consumer Affairs and Social Welfare (Ministerio de Sanidad, Consumo y Bienestar Social)