The Spanish Stata Conference was held on 24 October 2018 at Universitat Pompeu Fabra, Campus Ciutadella, Edifici Mercé Rodoreda, but you can view the program below.


Introduction to Bayesian analysis using Stata
Abstract: Researchers' interest in Bayesian regression analysis has increased significantly in recent years. One of the fundamental reasons for this growing interest is that a wide variety of models can be accommodated within this alternative regression approach. This flexibility is due in part to the possibility of using a common theoretical framework to estimate the parameters of the posterior distributions associated with different kinds of model specifications. I will outline the main aspects of Bayesian regression in Stata, and I will show the facilities incorporated in Stata 15 to make this kind of analysis more accessible to those who are not very familiar with this approach.

Gustavo Sánchez
Ensemble learning targeted maximum-likelihood estimation for Stata users
Abstract: eltmle is a Stata program implementing targeted maximum-likelihood estimation (TMLE) of the average treatment effect (ATE) for a binary or continuous outcome and a binary treatment. eltmle includes the use of a super learner called from the SuperLearner package v.2.0-21 (Polley et al. 2011). Modern epidemiology has been able to identify significant limitations of classic epidemiological methods, like outcome regression analysis, when estimating causal quantities such as the ATE for observational data. For example, using classical regression models to estimate the ATE requires the assumption that the effect measure is constant across levels of confounders included in the model, that is, that there is no effect modification. Other methods do not require this assumption, including g-methods (for example, the g-formula) and TMLE. The ATE, or risk difference, is the most commonly used causal parameter. Many estimators of the ATE, but not all, rely on parametric modeling assumptions. Therefore, correct model specification is crucial to obtain unbiased estimates of the true ATE. TMLE is a semiparametric, efficient substitution estimator allowing for data-adaptive estimation while obtaining valid statistical inference based on targeted minimum loss-based estimation. TMLE has the advantage of being doubly robust. Moreover, TMLE allows inclusion of machine learning algorithms to minimize the risk of model misspecification, a problem that persists for competing estimators. Evidence shows that TMLE typically provides the least biased estimates of the ATE compared with other doubly robust estimators. The following links provide access to a TMLE tutorial, https://migariane.github.io/TMLE.nb.html, and the GitHub repository for the eltmle Stata package, https://github.com/migariane/meltmle.

Miguel Ángel Luque-Fernández
Universidad de Granada, London School of Hygiene and Tropical Medicine, CIBERESP ISCIII
Text mining with ngram variables
Abstract: Text data, such as answers to open-ended questions, are sometimes ignored because they are hard to analyze. My community-contributed Stata command ngram turns text into hundreds of variables using the "bag of words" approach. Broadly speaking, each variable records how often the corresponding word or word sequence occurs in a given text. This is more useful than it sounds. The program supports text in 12 European languages.
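The bag-of-words idea described above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the ngram command itself: the function names ngram_counts and to_variables and the sample answers are invented for this example, and no stemming or stop-word handling is attempted.

```python
from collections import Counter


def ngram_counts(text, max_n=2):
    """Count every word sequence of length 1..max_n in one text."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def to_variables(texts, max_n=2):
    """Turn texts into rows of counts, one column per distinct n-gram."""
    per_text = [ngram_counts(t, max_n) for t in texts]
    columns = sorted(set().union(*per_text))
    # A Counter returns 0 for n-grams that a text does not contain.
    rows = [[c[col] for col in columns] for c in per_text]
    return columns, rows


# Hypothetical open-ended answers, invented for illustration only.
answers = ["the service was good", "the service was slow and the staff was rude"]
columns, rows = to_variables(answers)
```

Each entry of columns plays the role of one generated variable, and each row holds the counts for one text, which is why even short answers can produce hundreds of variables.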

Matthias Schonlau
University of Waterloo
Cross-validated area under the ROC curve (cvAUROC)
Abstract: Receiver operating characteristic (ROC) analysis is used for comparing predictive models, both in model selection and model evaluation. This method is often applied in clinical medicine and social science to assess the tradeoff between model sensitivity and specificity. After one fits a binary logistic regression model with a set of independent variables, the predictive performance of this set of variables, as assessed by the area under the curve (AUC) from a ROC curve, must be estimated for a sample (the "test" sample) that is independent of the sample used to fit the model (the "training" sample). An important aspect of predictive modeling (regardless of model type) is the ability of a model to generalize to new cases. Evaluating the predictive performance (AUC) of a set of independent variables using all cases from the original analysis sample tends to result in an overly optimistic estimate of predictive performance. When the number of observations is not very large, cross-validation and bootstrap strategies are useful for assessing this ability; in particular, k-fold cross-validation can be used to generate a more realistic estimate of predictive performance. cvAUROC is a community-contributed Stata command that implements k-fold cross-validation for the AUC for a binary outcome after fitting a logistic regression model and provides the cross-validated fitted probabilities for the dependent variable or outcome, contained in a new variable named _fit. Different options and examples for the use of cvAUROC can be downloaded at https://github.com/migariane/cvAUROC, and the command can be directly installed in Stata using ssc install cvAUROC.
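The general idea of a k-fold cross-validated AUC can be sketched as follows. This is a Python illustration, not the cvAUROC implementation: the helper names, the one-predictor gradient-ascent logistic fit, and the simulated data are all assumptions made for this example, and pooling the out-of-fold probabilities into a single AUC may differ from how cvAUROC summarizes the folds.

```python
import math
import random


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_logit(xs, ys, steps=500, rate=0.1):
    """Fit a one-predictor logistic regression by plain gradient ascent."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += rate * g0 / len(xs)
        b1 += rate * g1 / len(xs)
    return b0, b1


def cross_validated_fit(xs, ys, k=5, seed=1):
    """Return out-of-fold fitted probabilities from k-fold cross-validation."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    fit = [None] * len(xs)
    for fold in range(k):
        test = set(order[fold::k])  # every k-th shuffled index forms one fold
        train = [i for i in order if i not in test]
        b0, b1 = fit_logit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            # Each observation is scored by a model that never saw it.
            fit[i] = 1.0 / (1.0 + math.exp(-(b0 + b1 * xs[i])))
    return fit


# Simulated data, for illustration only: one predictor, binary outcome.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [1 if rng.random() < 1 / (1 + math.exp(-x)) else 0 for x in xs]

b0, b1 = fit_logit(xs, ys)
naive_auc = auc(ys, [b0 + b1 * x for x in xs])  # same data used to fit and evaluate
cv_auc = auc(ys, cross_validated_fit(xs, ys))   # evaluated on held-out folds
```

The list returned by cross_validated_fit is analogous in spirit to the _fit variable mentioned above: one fitted probability per observation, each obtained from a model that was fit without that observation.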

Additional information:
spain18_Miguel Ángel Luque-Fernández(2).pdf

Miguel Ángel Luque-Fernández
Universidad de Granada, London School of Hygiene and Tropical Medicine, CIBERESP ISCIII
Camille Maringe
London School of Hygiene and Tropical Medicine
The impact of the priority review voucher on research and development investment for neglected diseases
Abstract: The priority review voucher (PRV) was implemented in the United States in 2007 with the aim of stimulating research and development (R&D) for neglected diseases. The idea is the following: pharmaceutical companies are granted a priority review voucher by the Food and Drug Administration (FDA) (for example, review within 6 months compared with the standard 10 months) upon successful development of a product (for example, a drug or vaccine) for a disease on the PRV list. The voucher either can be used for a blockbuster drug or sold to a third party. The PRV is believed to be a strong consideration among pharmaceutical companies to initiate or continue a project for a neglected disease, with the most recent voucher having been granted in June 2018. R&D investment is measured by the number of clinical trials initiated yearly and per disease, which is downloadable from the WHO platform registry. Because the policy targets a specific group of diseases in a specific country (the U.S.), we isolate the impact of the policy through the differences-in-differences (DD) approach and the differences-in-differences-in-differences (DDD) approach.
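The arithmetic behind the DD and DDD estimators can be sketched as below. This Python illustration rests on assumptions that are not taken from the abstract: "treated" stands for diseases on the PRV list and "control" for other diseases, the third difference contrasts the U.S. with some comparison setting, and every number is made up.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """DD estimate: change in the treated group minus change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)


def triple_diff(dd_target, dd_comparison):
    """DDD estimate: the DD in the targeted setting minus the DD in a comparison setting."""
    return dd_target - dd_comparison


# Made-up yearly trial counts, purely illustrative (not results from the study).
dd_us = diff_in_diff(treated_pre=10, treated_post=18, control_pre=40, control_post=44)
dd_elsewhere = diff_in_diff(treated_pre=8, treated_post=10, control_pre=30, control_post=31)
ddd = triple_diff(dd_us, dd_elsewhere)
```

In practice these contrasts are usually estimated as interaction terms in a regression, which also yields standard errors; the functions above show only the underlying differences of means.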
Céline Aerts
Barcelona Institute for Global Health (ISGlobal)
Marisa Miraldo
Eliana Barenho
Imperial College London
Elisa Sicuri
Barcelona Institute for Global Health (ISGlobal), Imperial College London
Demand for house improvement in rural Gambia
Abstract: We estimated the demand for house improvement in rural Gambia, West Africa, by exploring three definitions of demand: utility-derived demand, stated demand, and revealed preferences-based demand. Data were collected in the context of a cluster-randomized controlled trial aiming at identifying and measuring the impact of improved houses on selected health outcomes. We collected panel data (4 rounds over approximately 1 year to control for seasonality) from nearly 200 households representing intervention, control and nonstudy groups, from a random subsample of 15 study villages. We collected information on satisfaction with owned houses (utility), willingness to pay for house improvement (stated preferences), and routine housing behavior (revealed preferences). We estimated the determinants of demand through ordered logit or linear (depending on the outcome variable distribution) fixed-effects models. Under the hypothesis that housing investment choices in such a rural context (and considering the short term) aim at maintaining utility constant across seasons, we plotted predicted demand from the estimated models against time (rounds) and analyzed and interpreted differences across the three definitions of demand.

Elisa Sicuri
Barcelona Institute for Global Health (ISGlobal), Imperial College London
Lesong Conteh
Barcelona Institute for Global Health (ISGlobal)
Estimating and interpreting effects for nonlinear and nonparametric models
Abstract: After we fit a model, our analysis does not stop. We want to use our results to construct counterfactual scenarios. We want to study the effects of changes in variables over the population or for a specific subpopulation. Answering such questions is more challenging for nonlinear models and, in particular, for models in which we make no assumptions about functional forms—nonparametric models. In this presentation, we will illustrate how to answer these and other relevant empirical questions for nonlinear cross-sectional and panel-data models and for nonparametric models. We do this within a unified framework using Stata.

Enrique Pinzón
Propensity-score matching with clustered data in Stata
Abstract: In observational studies, estimation of causal effects often relies on the assumption that all relevant confounders are observed. Under this assumption, propensity-score matching (PSM) can be used to adjust for observed confounders. PSM is a semiparametric alternative to regression models that consists of two steps: 1) estimation of the probability of receiving the treatment (propensity score); 2) matching on the estimated propensity score.

PSM was originally proposed for unstructured data, and available Stata routines are designed for these types of data. However, clustered or hierarchical data are common in many fields of study (for example, students nested in schools, voters in parties, patients in hospitals). Building on recent methodological developments, the goal of this presentation is to show how PSM can be implemented with clustered data in Stata. Using examples on real data, I will present methods that exploit the information on the clustered structure of the data in two ways: in the estimation of the propensity-score model (through the inclusion of fixed or random effects) or in the implementation of the matching algorithm.
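The matching step, and one possible way of using cluster information in it, can be sketched as follows. This Python illustration is not the presenter's method or any Stata routine: it assumes the propensity scores have already been estimated in step 1, and the function names, unit identifiers, scores, and clusters are hypothetical.

```python
def nearest_neighbor_match(treated, controls, caliper=None):
    """Match each treated unit to the control with the closest propensity score.

    treated and controls map unit id -> estimated propensity score.
    Matching is with replacement; pairs farther apart than caliper are dropped.
    """
    matches = {}
    for t_id, t_score in treated.items():
        c_id = min(controls, key=lambda c: abs(controls[c] - t_score))
        if caliper is None or abs(controls[c_id] - t_score) <= caliper:
            matches[t_id] = c_id
    return matches


def match_within_clusters(treated, controls, cluster_of, caliper=None):
    """Same matching, but only among controls from the treated unit's own cluster."""
    matches = {}
    for t_id, t_score in treated.items():
        same = {c: s for c, s in controls.items() if cluster_of[c] == cluster_of[t_id]}
        if same:  # treated units with no control in their cluster stay unmatched
            matches.update(nearest_neighbor_match({t_id: t_score}, same, caliper))
    return matches


# Hypothetical units, scores, and clusters, invented for illustration only.
treated = {"t1": 0.62, "t2": 0.30}
controls = {"c1": 0.60, "c2": 0.33, "c3": 0.10}
cluster_of = {"t1": "A", "t2": "B", "c1": "B", "c2": "A", "c3": "B"}

pooled = nearest_neighbor_match(treated, controls)
within = match_within_clusters(treated, controls, cluster_of)
```

Pooled matching ignores the clusters and picks the closest score overall, whereas within-cluster matching trades closeness on the score for comparability on cluster-level characteristics.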

Bruno Arpino
Universitat Pompeu Fabra
Exercises on the Internet for researchers and students to learn Stata
Abstract: Since the release of Stata 15, it has been possible to convert the results of analyses into .docx (putdocx), .pdf (putpdf), and .html (dyndoc) files. This presentation demonstrates how this capability was used to create a set of basic online exercises (http://bit.ly/Analisis2018) so that researchers and students can learn how to use Stata. First, I discuss the various file types and how to work with them. Then, I present the steps necessary for obtaining basic analyses with the program, including percentage tables, means, and regressions. Beyond this use, Stata's dyndoc command can generate other web pages unrelated to the program, with minimal knowledge of the HTML language.

Modesto Escobar
Universidad de Salamanca
Graphical and numerical solutions to standard research problems in the social sciences: Some suggestions and unresolved challenges
Abstract: The goal of this presentation is to identify some common analytical problems that are often encountered in quantitative research in a wide array of social science applications (and possibly in other research fields as well), such as the analysis of multicollinearity of independent variables when qualitative variables are involved; the elaboration of three-way contingency tables with percentages; the presentation of predictive margins and frequency distributions of both qualitative and quantitative variables; the presentation of information both on predictive margins and on contrasts of the statistical significance of the differences of the effects of adjacent and nonadjacent categories of qualitative independent variables; and the construction of time-series graphs based on the frequency distribution of categorical variables. I will put forward some solutions with Stata for discussion among the audience and identify some unresolved challenges.

José Rama
Andrés Santana
Universidad Autónoma de Madrid
Does interview length affect panel attrition?
Abstract: Panel attrition is a threat to data quality in longitudinal studies, especially if those who drop out of the study differ from the respondents who remain in the panel. This presentation investigates the effect of survey length on wave nonresponse using data from Understanding Society, the United Kingdom Household Longitudinal Study (UKHLS). The concept of survey length is addressed from a theoretical point of view, and two measures, length and interview pace, are computed to test their effect on survey cooperation.
Pablo Cabrera-Álvarez
David Dóncel Abad
Universidad de Salamanca
A new proposal for the comparative analysis of attitudes toward linguistic educational policies in multinational settings: An application with the Stata software for the Catalan and Basque cases
Abstract: The goal of this presentation is to put forward a new set of indexes and data analytic strategies for the comparative study of attitudes toward linguistic educational policies in multinational settings. These indexes deal with the attitudes toward the linguistic mix in primary and secondary education, most notably regarding the local-international dimension (regional and state-wide languages vis-à-vis English) and the subnational-national one. Empirical analysis will be performed with Stata using data from a specialized survey for the Catalan case (N > 2,200) and the Eusko-barometer of May 2018 (N > 600). Several analytical options will be presented for discussion.

Andrés Santana
Universidad Autónoma de Madrid
Wishes and grumbles
Abstract: Stata developers present will carefully and cautiously consider wishes and grumbles from Stata users in the audience. Questions, and possibly answers, may concern reports of present bugs and limitations or requests for new features in future releases of the software.
StataCorp personnel

Scientific committee

Dr. Andre Groger
Universitat Autónoma de Barcelona and Barcelona GSE

Sociology and Political Science:
Dr. Modesto Escobar
Department of Sociology and Communication, Universidad de Salamanca

Dr. Mariano Torcal
Political and Social Sciences, RECSM and Universitat Pompeu Fabra

Dr. Sergi Sanz
Biostatistics and Data Management Unit, ISGlobal and Universitat de Barcelona

Dr. Llorenç Quintó
ISGlobal

Logistics organizer

The logistics organizer for the 2018 Spanish Stata Conference is Timberlake Consulting S.L.,
the distributor of Stata in Spain.

View the proceedings of previous Stata Users Group meetings.