The Spanish Stata Users Group meeting was held on Thursday, 19 October 2017, at Instituto de Salud Carlos III. You can view the program and presentation slides below.

Now what do I do with this function?
Abstract: Nonparametric analysis has been traditionally descriptive. We fit the regression function that relates the outcome of interest and the covariates, and then we graph. But we can go beyond the descriptive. We may use this function to compute marginal effects, counterfactuals, and other statistics of interest. In other words, we may use margins after npregress to conduct semiparametric analysis. I will show you how.

Enrique Pinzón
Generalized linear models (GLM) applied to the prediction of health expenditure
Abstract: Objectives: To implement a health expenditure prediction system based on morbidity and to analyze its goodness of fit.

Methods: Observational, descriptive, retrospective and cross-sectional study on total health expenditure using explanatory-predictive-stratified models.

The database covered 156,811 inhabitants of the Denia health department and included age, Clinical Risk Group (CRG), and total health expenditure, among other variables. GLMs with a logarithmic link and a gamma distribution were fit in several iterations, with total health expenditure as the dependent variable and age, gender, and CRG membership as independent variables, in order to select the model that best explains the behavior of health expenditure.

The model with the highest statistical significance used the combination of the variables age, sex, CRG health status, and severity level, whose Akaike information criterion was 14.2. By correlating the values estimated by the model and the real value, we obtain a correlation of 25%.

Differing by type of expenditure, CRG showed a greater explanatory capacity in outpatient pharmaceutical spending and a lower explanatory capacity in hospital expenditure.

Conclusion: Multimorbidity factors have a greater impact on the explanation of health expenditure than demographic variables.

Vicente Caballer
Universidad Politécnica de Valencia, Universidad Politécnica de Madrid
David Vivas, N. Guadalajara, Alexander Zlotnik, Isabel Barrachina
Universidad Politécnica de Valencia, Universidad Politécnica de Madrid
Construction and validation of a predictive model for the identification of complex chronic patients
Abstract: The objective of this study was the construction and validation of a predictive model for the identification of complex chronic patients.

A cross-sectional study was performed on the population of the Comunidad Valenciana region in 2015 (4,708,754 persons). Dependent variable: resource use variables greater than or equal to the 95th percentile (P95), including the number of primary contacts, number of hospital admissions, number of visits to emergency departments, pharmaceutical problems, and the pharmaceutical costs. Predictive variables: age, morbidity (according to clinical risk groups (CRG)), and variables corresponding to the resource use mentioned above. Persons exceeding P95 were 0.2% of the population; thus, the study was carried out on a sample of 10% stratified by CRGs, and all persons without chronic or moderate conditions were eliminated, in other words, those belonging to health states 1, 2, 3, and 4, totaling 150,252 persons. A logistic regression model was then built. Its validity was analyzed with sensitivity, specificity, a goodness of fit test, and area under the ROC curve (AUC) metrics.
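
The validation metrics named above can be illustrated with a minimal Python sketch. It is not the authors' code: the labels, the predicted scores, and the 0.5 cutoff are hypothetical, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one).

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when scores >= threshold are flagged positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve: share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = exceeds P95) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

sens, spec = sensitivity_specificity(y_true, scores, threshold=0.5)
area = auc(y_true, scores)
```

With a rare outcome such as exceeding P95, the choice of cutoff trades sensitivity against specificity, which is one reason the threshold-free AUC is usually reported alongside them.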

Silvia Badal
Universidad Politécnica de Valencia, Universidad Politécnica de Madrid
Alexander Zlotnik, Ruth Usó, David Vivas
Universidad Politécnica de Valencia, Universidad Politécnica de Madrid
Dealing with missing data in practice: Methods, applications, and implications for HIV cohort studies
Abstract: Missing data are common in HIV cohort studies, affecting both the covariates and the outcome. In this case study, we compare different methods to deal with missing data applied to estimate mortality by Hepatitis C virus coinfection in the cohort of the Spanish Network of HIV Research (CoRIS) using Stata.

I used Poisson regression to estimate mortality rate ratios, using five methods to handle missing data in both the covariates and the cause of death: complete-case, indicator method (IM), multiple imputation by chained equations (MICE), multiple imputation then deletion (MID), and inverse probability weighting (IPW).

Strong predictors were found for incomplete variables' values and for their probability of being missing. No significant differences were found in excess hazard ratios between the different methods. However, the complete-case approach led to less precise estimations; and incorrect classification of cause of death or deletion of cases with a missing cause of death when using complete-case, MID, or IPW led to underestimation of the excess mortality rates.

In this case study, MICE seemed to work best, because it both corrected bias and produced the most accurate estimates. Although MICE rests on the untestable assumption that data are missing at random, this assumption seemed plausible in this context.

Belén Alejos Ferreras
Instituto Carlos III
Statistical analysis of the evolution of autonomic symptoms in patients with Parkinson's disease
Abstract: Introduction: Autonomic symptoms (AS) of Parkinson's disease (PD) may be present from the time of diagnosis and even precede it. So far, there are many unknowns in the relationship between the evolution of AS and other variables of the disease.

Objective: To describe the evolution of AS and its relationship with motor and non-motor symptoms in PD.

Methods: Observational, multicenter study (Spain and Holland) with longitudinal follow-up, with evaluations at baseline and in the fourth year. The SCOPA-Motor, SCOPA-Cognition, and HY stage scales were used, along with the self-administered SCOPA-AUT and SCOPA-Sleep questionnaires. Statistical magnitudes (effect size and relative change) and sensitivity to change (standard error of measurement, 10% of the total maximum score, and ½ standard deviation) were calculated for each subscale of the SCOPA-AUT and their mean value (estimated value of change, EVC). Patients were classified according to their worsening in the autonomic subscales, depending on whether the difference between baseline and follow-up score exceeded or not.

Abelardo Fernández Chávez
Hospital Ramón y Cajal
Methodology for using Weka machine learning algorithms from Stata
Abstract: Stata does not include most classical machine learning algorithms in its core libraries. A few algorithms are available through plug-ins, such as wrappers for the LIBSVM library; however, these sometimes exhibit performance problems, do not expose the full functionality of the algorithm, and are often challenging to modify.

Weka is an open source software suite written in Java that implements most well-known machine learning algorithms. Because its source code is available and documented, it is relatively easy to introduce custom modifications that should fulfill most practical business and research needs.

In this talk, we present a simple method for integrating Stata ado-file programs with standard Weka algorithms (CART and C4.5 decision trees, support vector machines, neural networks, Bayesian networks, KNNs, LogitBoost classifiers, stacking classifiers, generic ensemble classifiers, etc.) as well as custom Weka algorithms (such as CART trees with LogitBoost on its branches).

Alexander Zlotnik
Universidad Politécnica de Madrid
A proposal for a new Stata licensing scheme based on blockchain, cloud computing, and grid computing
Abstract: Stata is a well-known statistical software package used for a wide variety of statistical analyses. As Stata users in several fields know, current and future data processing often requires increasingly large computing resources. Stata/MP is a multiprocessor version suitable for some of these tasks, but even on powerful hardware, its capacity is sometimes exceeded by computationally demanding tasks. If these tasks can be parallelized, distributed computing approaches can also be used.

Some software packages that require powerful computing resources, such as ChessBase engines used for deep chess variant analysis, have introduced the possibility of offloading some calculations to their private clouds. Alternatively, projects tackling large computational problems, such as SETI@home, have chosen a grid computing model. The latter approach could be further enhanced with a blockchain-based distributed ledger that registers the computational power contributed to the community by each of its members and rewards them for their contribution. All of these approaches, or combinations thereof, could be used for new Stata licensing schemes.

Alexander Zlotnik
Universidad Politécnica de Madrid
David Arroyo Manzano
sdmxuse: Module to import data from statistical agencies using the SDMX standard
Abstract: Statistical Data and Metadata eXchange (SDMX) is an ISO standard developed by seven international organizations (BIS, ECB, Eurostat, IMF, OECD, the United Nations, and the World Bank) to facilitate the exchange of statistical data. The package sdmxuse (available from the SSC archive) allows Stata users to download and import SDMX data directly within their favorite software. The program builds and sends a query to the statistical agency (using RESTful web services), then imports and formats the downloaded dataset (in XML format). The complex structure of the datasets (the so-called "cube") is reviewed to show how users can send specific queries and import only the required time series. sdmxuse might prove useful for researchers who need frequently updated time series and wish to automate the downloading and formatting process.

Sébastien Fontenay (was unable to attend and present)
Université catholique de Louvain
Using Stata to estimate dynamic binary random effects models with unbalanced panels
Abstract: The purpose of this paper is to implement estimators proposed by Albarrán et al. (2017) in Stata for dynamic binary choice correlated random effects (CRE) models with unbalanced panel data. The procedure allows for unrestricted correlation between the sample selection process that determines the unbalancedness and the time-invariant unobserved heterogeneity. We create a specific command for this procedure, named xtunbalmd. It fits the model for each subpanel separately and obtains estimates of the common parameters across subpanels by minimum distance (MD). This estimation method is faster than estimation by maximum likelihood (ML), because it allows the same estimation routines that we would use if we had a balanced panel, while keeping the good asymptotic properties of the ML estimator for the whole sample.

Pedro Albarrán
Universidad de Alicante
Raquel Carrasco, Jesús M. Carro
Universidad Carlos III de Madrid
Complementarity analysis in multinomial models: The gentzkow command
Abstract: In the presence of a choice of two binary variables, the usual econometric procedure within the framework of the random utility model is the estimation of a bivariate probit that accounts for the potential correlation between the error terms for the utilities of all different alternatives. This approach is not useful if the objective of the analysis is the study of the complementarity or substitutability of the two alternatives, because the bivariate model assumes by construction that the two alternatives are independent from the economic point of view. In other words, in the bivariate probit, a factor that points to one alternative in the first choice, but does not affect the utilities of the other choice directly, does not induce a change in the second choice. To study the complementarity or substitutability of alternatives, it is necessary to fit a more flexible model such as the multinomial model and to compute expected complementarity patterns from the standard results. In this presentation, we show the Stata command gentzkow, which performs the complete analysis, and we show its usefulness with an example with data from China on the double choice of grandparents to first live in the same house as their children and grandchildren, and secondly to help with the care of the grandchild.

Ricardo Mora
Universidad Carlos III
Yunrong Li
Southwestern University of Finance and Economics
Performing probabilistic cost-effectiveness analysis via decision tree modeling in Stata: The manantial command
Abstract: Decision models are based on Markov processes that describe the statistical laws for possible states or sequential events to which an individual or patient is subject within a system. Every decision model can be represented as a probability tree containing nodes and branches. Each node represents a possible state of the patient in its clinical evolution and socioeconomic status, while each branch joins two states that are sequentially possible. Thus, several branches arise from the initial node, which represents the patient's entry into the system, and lead to the different possible subsequent states. Each terminal node of the tree represents the last possible state after a particular patient evolution. Therefore, probability trees are diagrams that represent all possible evolutions of a patient within a system. By assigning net costs to each node and conditional probabilities to each branch, it is possible to calculate the expected net cost per patient. Using Monte Carlo techniques, the distribution of estimated net costs per patient in the population of interest can be estimated to incorporate the uncertainty inherent in using estimated values for conditional probabilities and net costs. I introduce the manantial command, which takes as inputs the decision tree, probability distributions, and payoffs. The command provides significance tests and confidence intervals, and performs sensitivity analysis. We illustrate the use of the command with an evaluation of early intervention in psychosis. Early intervention in psychosis is a clinical approach for those who experience symptoms of psychosis for the first time. It is part of a new paradigm of prevention in psychiatry that is conditioning the reform of mental health services. The focus is on the early detection and treatment of early symptoms of psychosis during the formative years critical to the psychotic condition.
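
The expected-cost calculation and the Monte Carlo step described in this abstract can be sketched in a few lines of Python. This is an illustration only and does not reproduce the manantial command: the tree, the costs, and the Beta and Gamma distributions are hypothetical, and the tree is encoded as nested dictionaries purely for convenience.

```python
import random
import statistics


def expected_cost(node):
    """Expected net cost of a probability tree.

    A node is a dict with a net 'cost' and an optional list of 'branches',
    each a (conditional probability, child node) pair. Terminal nodes have
    no branches.
    """
    total = node["cost"]
    for prob, child in node.get("branches", []):
        total += prob * expected_cost(child)
    return total


def build_tree(p_relapse, cost_relapse):
    """Hypothetical tree: an initial node followed by relapse or remission."""
    return {
        "cost": 1000.0,  # assumed net cost at the initial node
        "branches": [
            (p_relapse, {"cost": cost_relapse}),  # terminal node: relapse
            (1.0 - p_relapse, {"cost": 200.0}),   # terminal node: remission
        ],
    }


def simulate(n_draws=5000, seed=12345):
    """Propagate parameter uncertainty through the tree by Monte Carlo."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        p = rng.betavariate(20, 80)        # assumed Beta(20, 80), mean 0.2
        c = rng.gammavariate(25.0, 200.0)  # assumed shape 25, scale 200, mean 5000
        draws.append(expected_cost(build_tree(p, c)))
    draws.sort()
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws) - 1]
    return statistics.mean(draws), (lower, upper)


# Point estimate at the assumed mean parameter values: 1000 + 0.2*5000 + 0.8*200
point = expected_cost(build_tree(0.2, 5000.0))  # 2160.0
mean_cost, interval = simulate()
```

The percentile interval returned by `simulate` is one simple way to summarize the simulated distribution of net costs; other interval constructions are possible.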

Manuel García-Goñi
Universidad Complutense de Madrid
Ricardo Mora
Universidad Carlos III de Madrid
Latent class analysis and finite mixture models with Stata
Abstract: Sometimes we are interested in identifying and understanding different groups in a population, even though we cannot directly observe which group each individual belongs to. Latent class analysis deals with these problems.

Often, those classes are determined by heterogeneity in regression models, where the relationship of a dependent variable (or variables) with a group of covariates varies from group to group. The new features added to the gsem command in Stata 15 allow us to fit many latent class models, including finite mixture models, which can also be fit using the new fmm prefix. We will introduce these topics and discuss examples using Stata.

Isabel Cañette
Random samples generation with Stata from continuous and discrete distributions
Abstract: Simulation is nowadays a very important way of analyzing improvements in different areas before physical implementation, which may require substantial resources that can be justified only when the probability of success is high. Random samples from different distributions are a must in simulations.

In this talk, we introduce new Stata functions for generating random samples from continuous and discrete distributions that are not covered by Stata's existing random-number generation functions. In addition, we introduce new Stata functions for generating random samples as an alternative to the built-in Stata functions.

The quality of the generated samples will be checked using the mean squared error (MSE) of the differences between the sample frequencies and the theoretical expected ones. We will also provide bar charts that allow the user to graphically compare the sample with the exact distribution function of the distribution being sampled.
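
A minimal Python sketch of this kind of check, under stated assumptions: the sampler uses plain inverse-transform sampling (the talk does not say which algorithms its functions use), the four-point distribution is hypothetical, and the MSE is computed on relative rather than absolute frequencies.

```python
import random
from collections import Counter


def sample_discrete(values, probs, n, rng):
    """Draw n values from a discrete distribution by inverse-transform sampling."""
    cumulative = []
    running = 0.0
    for p in probs:
        running += p
        cumulative.append(running)
    draws = []
    for _ in range(n):
        u = rng.random()
        for value, bound in zip(values, cumulative):
            if u < bound:
                draws.append(value)
                break
        else:
            # Guard against floating-point rounding in the last cumulative bound.
            draws.append(values[-1])
    return draws


def frequency_mse(draws, values, probs):
    """MSE between observed relative frequencies and theoretical probabilities."""
    counts = Counter(draws)
    n = len(draws)
    return sum((counts[v] / n - p) ** 2 for v, p in zip(values, probs)) / len(values)


# Hypothetical target distribution on {0, 1, 2, 3}.
values = [0, 1, 2, 3]
probs = [0.1, 0.2, 0.3, 0.4]

rng = random.Random(2017)
draws = sample_discrete(values, probs, 20000, rng)
mse = frequency_mse(draws, values, probs)
```

For a correct sampler, the MSE should shrink roughly in proportion to 1/n as the sample size grows, so comparing it across sample sizes is a quick sanity check.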

Gabriel Aguilera-Venegas
Universidad de Málaga, Universidad Politécnica de Madrid
José Luis Galán-García, M. Ángeles Galán-García, Pedro Rodríguez-Cielos, Ricardo Rodríguez-Cielos
Universidad de Málaga, Universidad Politécnica de Madrid
Multilevel models for cross-sectional and longitudinal data
Abstract: Typical multilevel analysis in comparative research implies the use of cross-sectional data for multiple countries. Multilevel models in such settings are likely to be affected by problems of endogeneity and omitted-variable bias because of unobserved heterogeneity. However, there is a growing volume of longitudinal data in comparative data projects, because they typically span multiple waves (e.g., the European Social Survey or the World Values Survey). This allows us to exploit the longitudinal dimension of the data by splitting the effect of aggregate variables into two different sources of variation (between and within countries), which makes multilevel models robust against the problem of unobserved heterogeneity. Drawing upon a few recent works in the literature that propose to include both cross-sectional and longitudinal effects in multilevel models, I focus on the theoretical and practical implications of this modeling strategy. Furthermore, I provide some examples and practical recommendations using this approach with Stata.
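
The between/within split of an aggregate variable mentioned above can be sketched as follows. This Python example is only an illustration of the decomposition, not the speaker's code: the country-wave data and the variable name `gdp` are hypothetical. Each observation is separated into the country mean (the between component) and the deviation from that mean (the within component), and both would then enter the multilevel model as separate regressors.

```python
from collections import defaultdict


def split_between_within(rows, group_key, var):
    """Add '<var>_between' (group mean) and '<var>_within' (deviation from it)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[group_key]] += row[var]
        counts[row[group_key]] += 1
    out = []
    for row in rows:
        mean = sums[row[group_key]] / counts[row[group_key]]
        new_row = dict(row)  # leave the input rows untouched
        new_row[f"{var}_between"] = mean
        new_row[f"{var}_within"] = row[var] - mean
        out.append(new_row)
    return out


# Hypothetical country-wave data for an aggregate variable.
panel = [
    {"country": "ES", "wave": 1, "gdp": 20.0},
    {"country": "ES", "wave": 2, "gdp": 24.0},
    {"country": "NL", "wave": 1, "gdp": 30.0},
    {"country": "NL", "wave": 2, "gdp": 34.0},
]
decomposed = split_between_within(panel, "country", "gdp")
```

By construction, the within component sums to zero inside each country, so it carries only over-time variation and is unrelated to time-invariant country characteristics.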
Antonio M. Jaime-Castillo
Universidad de Málaga
The effect of birth weight on cognitive performance: Is there a social gradient? Is there compensation?
Abstract: Demography has traditionally been interested in birth weight and the impact that certain descriptive characteristics have on birth weight. Most of the interest in this variable lies in the fact that weight at birth is a significant predictor of infant health outcomes (as well as health at adult ages), but also of cognitive performance and educational results. While evidence explaining the prevalence of low birth weight is common in different disciplines, it is much less frequent to see high-quality evidence built from large sample sizes quantifying the impact of weight at birth on schooling outcomes.

In this paper, we use data from the Chinese Family Panel Study (2010 wave), a large-scale representative sample of Chinese households, to model the effect of low birth weight on standardized test scores among Chinese children aged 10-15 years.

Our evidence confirms a highly significant negative effect of low birth weight on the results obtained by children in both mathematics and Chinese language. The paper also shows a clear gradient in the prevalence of low birth weight by family background. Our evidence also implies that highly educated parents (mothers) can actually compensate for the disadvantage that low birth weight represents in terms of cognitive performance.

Héctor Cebolla-Boado
Leire Salazar
Education in Spain: Tell me how you look at data and I will tell you what you will see
Abstract: According to Eurostat, early school leaving in Spain fell from 26.3% to 19.0% between 2011 and 2016. At this pace, the target for 2020 (15%) will certainly be reached soon. According to the same database, the percentage of the population aged 30-34 years with tertiary studies has remained above 40%. With these indicators, we can only congratulate a society that is winning the battle against premature school leaving and has such an abundant volume of highly qualified young people.

The above information is based on the Spanish Labour Force Survey (EPA, Encuesta de Población Activa), a panel survey in which an individual can be observed up to six times in a row. When we apply Stata's xt suite of commands for panel-data analysis, the picture of the level of education in Spain changes dramatically: early school leaving is much higher, and the proportion of people who completed a university degree is much lower. We just need to take into account that the EPA surveys the same individuals on different occasions.

Pau Miret Gamundi
Universidad Autónoma de Barcelona, Centro de Estudios Demográficos
postweight or calibrate? Survey post-adjustments in Stata
Abstract: Normally, after survey data collection is completed, the final sample differs from population figures on key variables. If the population figures are known, the final sample can be adjusted using techniques such as post-stratification or calibration. These techniques are used to compute weights, which ensure that the weighted distribution of the final sample matches the population on key variables. This presentation compares two commands available in Stata that, under certain circumstances, lead to different results: svyset and calibrate. svyset, which is the Stata command to deal with survey data, includes poststrata and postweight options to post-adjust survey data; calibrate (D'Souza, 2011) is used to compute different types of calibration. The main difference between how these two commands compute the weights is the treatment of missing values. Here, we show how to use these commands, explain the differences between them, and follow with research examples.
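
The basic idea of post-stratification can be shown with a short Python sketch. It is a textbook illustration with hypothetical strata and population totals, and it makes no claim about how svyset or calibrate compute their weights internally or how either handles missing values.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_totals):
    """Weight each respondent by (population total in stratum) / (sample count in stratum)."""
    counts = Counter(sample_strata)
    return [population_totals[s] / counts[s] for s in sample_strata]


# Hypothetical sample in which women are overrepresented.
sample_strata = ["F", "F", "F", "M"]
population_totals = {"F": 500, "M": 500}

weights = poststratification_weights(sample_strata, population_totals)

# Weighted totals by stratum now reproduce the population figures.
weighted_totals = Counter()
for stratum, w in zip(sample_strata, weights):
    weighted_totals[stratum] += w
```

In this sketch every respondent has a known stratum. Once some respondents have missing values on the adjustment variables, a rule is needed for how they enter the stratum counts, and that choice alone can change the resulting weights.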

Pablo Cabrera
Universidad de Salamanca
Modesto Escobar
Universidad de Salamanca
Strategies and tricks for teaching and researching with Stata
Abstract: The presentation we propose is mainly conceived as a contribution to Stata teaching strategies and tricks in university courses, but many of the tricks we show are very useful for research purposes as well. Some of the questions we will address in our presentation are: How can we open datasets that are originally in older versions of Stata? How can we open datasets available in other programs, like SPSS? We bring attention to several short commands that enable us to accomplish these goals. How do we show both codes and label values in our tables? How do we perform comparisons of means that show both means and p-values? We point to the existence of a command that allows us to do so while showing a very easy-to-interpret output. How do we show correlations with their corresponding significance levels? How do we combine several graphs with a unique legend with the same scale in both axes? How do we compare several models with tables and with graphs?

Andrés Santana
Universidad Autónoma de Madrid
José Rama
Universidad Autónoma de Madrid
Wishes and grumbles


Scientific committee

Ricardo Mora
Universidad Carlos III de Madrid

Modesto Escobar
Universidad de Salamanca

Alexander Zlotnik
Universidad Politécnica de Madrid

Logistics organizer

The logistics organizer for the 2017 Spanish Stata Users Group meeting is Timberlake Consulting S.L.,
the distributor of Stata in Spain.