2013 Spanish Stata Users Group meeting: Abstracts
Treatment effects using Stata
Treatment-effects estimation is a fundamental tool in the empirical analysis of program and policy evaluations. Its importance in these areas is reflected in the ample body of work produced in recent decades. Using Stata, I will illustrate estimation of the more traditional treatment-effects models used by researchers in these areas and some of the more recent advances.
Meta-analysis of diagnostic tests with Stata: A simulation study
CIBER Epidemiología y Salud Pública (CIBERESP), Spain
Meta-analysis of diagnostic tests is a complex statistical technique for which Stata offers several user-written commands (metandi, midas, metan). Besides accounting for the differing precision with which sensitivity and specificity are estimated in the primary studies, the method must account for the correlation between these two validity indices. For this setting, nonlinear mixed models (xtmelogit) have recently been proposed as the standard analysis models, replacing the linear mixed models (xtmixed) that were proposed initially. The room left for univariate fixed-effects and random-effects models is increasingly narrow. In certain clinical situations, however, the differences between the estimates of diagnostic validity obtained by the different approaches (bivariate and univariate) are negligible, although each method shows relative advantages.
We will present the results of simulation studies of meta-analyses of diagnostic validity carried out with different analysis models (bivariate and univariate), and conclude with a proposed algorithm for approaching the meta-analysis of diagnostic validity studies.
Stata logistic regression nomogram generator
Department of Electronic Engineering, Polytechnic University of Madrid
Predictive models for clinical decision making are often neglected because of the lack of output calculation aids. Logistic regression nomograms are one such aid, and they have seen wide adoption in biomedical research in recent years. We present a nomogram generator for Stata that may be executed after an almost arbitrary logistic regression model. The coefficients are extracted from the e(b) coefficient vector, and variable ranges are obtained from scalars stored by the summarize command. Although Stata possesses a well-developed graphics library, several uncommon procedures had to be used in some parts of the program. Time-series graphs with a fictitious time dimension and a custom data-point-generation technique were implemented. A graph for individual scores was then combined with another graph that shows the conversion of total score to outcome probability. The fysize and aspect options of the graph combine command were used to account for the unequal heights of the graphs. Continuous variable labels are dynamically adjusted so that they do not overlap. The current version of the program has limitations concerning the regression command executed beforehand: interaction operators are not supported, dummy variables must use the *b.var syntax, and continuous variables must precede dummies.
Further explanations, examples and download links for nomogram generators for logistic and Cox regressions are available at: www.zlotnik.net/stata/nomograms
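The scoring logic behind a logistic regression nomogram can be sketched in a few lines. The following Python sketch is not the Stata program described above; it only illustrates one common convention, assumed here, in which the variable with the largest effect over its range spans 0 to 100 points and total points map back to a probability through the logistic function. The function names, coefficients, and ranges are hypothetical.

```python
import math

def point_scale(coefs, ranges):
    """Points per unit of linear predictor (assumed convention: the variable
    with the largest |beta| * range spans exactly 100 points)."""
    spans = [abs(coefs[v]) * (ranges[v][1] - ranges[v][0]) for v in coefs]
    return 100.0 / max(spans)

def points(var, x, coefs, ranges, scale):
    """Points for one variable; the least risky end of its range scores 0."""
    lo, hi = ranges[var]
    ref = lo if coefs[var] >= 0 else hi
    return coefs[var] * (x - ref) * scale

def probability(total_points, intercept, coefs, ranges, scale):
    """Convert total points back to an outcome probability."""
    base = intercept + sum(
        b * (ranges[v][0] if b >= 0 else ranges[v][1]) for v, b in coefs.items()
    )
    lp = base + total_points / scale
    return 1.0 / (1.0 + math.exp(-lp))

# Hypothetical model: logit(p) = -3 + 0.04*age + 0.9*smoker
coefs = {"age": 0.04, "smoker": 0.9}
ranges = {"age": (20, 80), "smoker": (0, 1)}
intercept = -3.0
scale = point_scale(coefs, ranges)

patient = {"age": 50, "smoker": 1}
total = sum(points(v, x, coefs, ranges, scale) for v, x in patient.items())
p = probability(total, intercept, coefs, ranges, scale)
```

Under these assumptions the probability read off the total-points axis matches the one computed directly from the linear predictor, which is what makes the nomogram a faithful graphical calculator for the fitted model.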
The measurement of the effect on citation inequality of differences in citation practices across scientific fields
This paper has two aims: (i) to introduce a novel method for measuring the proportion of overall citation inequality that can be attributed to differences in citation practices across scientific fields; and (ii) to implement an empirical strategy for making meaningful comparisons between the number of citations received by articles in the 22 broad fields distinguished by Thomson Scientific. The paper is based on a model in which the number of citations received by any article is a function of the article’s scientific influence and the field to which it belongs. A key assumption for this model is that articles in the same quantile of any field citation distribution have the same degree of citation impact in their respective fields. Using a dataset of 4.4 million articles published from 1998 to 2003 with a five-year citation window, we find that differences in citation practices between the 22 fields account for about 14% of overall citation inequality. Our empirical strategy for making comparisons of citation counts across fields is based on the strong similarities found in the behavior of citation distributions over a large quantile interval. We obtain three main results. First, we provide a set of exchange rates that convert citations in any field into citations in the all-fields case. Results are very satisfactory for 20 out of 22 fields. Second, when the raw citation data are normalized with our exchange rates, the effect of differences in citation practices is reduced to approximately 2% of overall citation inequality in the normalized citation distributions. Third, we provide an empirical explanation of why the usual normalization procedure based on the fields’ mean citation rates is found to be equally successful.
Implementing the mutual information index in Stata
Frankel and Volij (2011) have recently shown that the mutual information index is the only multigroup segregation index that, in addition to possessing other desirable properties, satisfies the strong group decomposability property (SGD). Consider, for example, the study of occupational segregation jointly by ethnicity and gender. SGD is an important property because an index that satisfies it allows us to identify the proportion of occupational segregation by ethnicity and gender that can be attributed exclusively to either ethnicity or gender. In this presentation, I present a flexible Stata ado-file that permits the computation of the mutual information index and its decomposition based on the SGD property. I also provide illustrations based on the joint study of gender and ethnicity in occupational segregation as well as the measurement of the gender division of labor.
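As a rough illustration of the index, and not of the ado-file itself, the Python sketch below computes the mutual information between organizational units (for example, occupations) and groups, and then splits joint segregation by two attributes into a between term and a within term. The split uses the chain rule of mutual information, which is assumed here to correspond to the decomposition the abstract refers to. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information index (natural log) for (unit, group) pairs,
    one pair per individual."""
    pairs = list(pairs)
    n = len(pairs)
    units = Counter(u for u, _ in pairs)
    groups = Counter(g for _, g in pairs)
    cells = Counter(pairs)
    return sum(
        (c / n) * math.log((c / n) / ((units[u] / n) * (groups[g] / n)))
        for (u, g), c in cells.items()
    )

def decompose(records):
    """Split segregation by (a, b) jointly into segregation by a alone plus
    the weighted average of segregation by b within each level of a.
    records: (unit, a, b) triples, one per individual."""
    records = list(records)
    n = len(records)
    total = mutual_information((u, (a, b)) for u, a, b in records)
    between = mutual_information((u, a) for u, a, _ in records)
    within = 0.0
    for level, count in Counter(a for _, a, _ in records).items():
        subset = [(u, b) for u, a, b in records if a == level]
        within += (count / n) * mutual_information(subset)
    return total, between, within

# Toy data: (occupation, gender, ethnicity)
toy = (
    [("nurse", "f", "x")] * 30 + [("nurse", "f", "y")] * 10
    + [("nurse", "m", "x")] * 5 + [("clerk", "f", "y")] * 15
    + [("clerk", "m", "x")] * 20 + [("clerk", "m", "y")] * 20
)
total, between, within = decompose(toy)
```

With natural logarithms the index is 0 when every unit has the same group composition, and it reaches log 2 when two equally sized groups never share a unit.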
The role of parenting practices and preschool education for social inequalities in learning outcomes: A cross-country comparison
Research on preschool education has gained momentum since the discovery of its potential to reduce the effect of social inequalities on educational attainment. Much of this research uses case studies to measure mid- and long-term effects of preschool attendance on academic performance and school transitions. In this paper we adopt a broader, comparative approach by exploring the extent to which child stimulation before primary school can reduce the effect of background differentials on learning outcomes across countries. In particular, we seek to unveil the relative impact of parental involvement and preschool education on reducing the effect of educational disadvantages that stem from parental education. Are the effects of both practices cumulative or complementary? Are there cross-country differences in the learning benefits of preschool education? Is kindergarten equally stimulating for children from different social origins? Is the impact of different parenting practices sensitive to national contexts? We use PIRLS 2011 data, which provide a standardized measure of reading literacy among fourth-grade students from a large number of countries. We estimate random-intercept and random-slope multilevel models to assess the effect of the type of child care adopted by families on educational outcomes. This approach enables us to decompose the variance in reading skills into its constitutive parts at the country, school, and student level.
Generalized structural equation models: Fitting customized models without programming (Español)
Statisticians and economists often need to fit models that have not been implemented in statistical packages: for example, bivariate response models where one variable is continuous and the other is discrete, or ordinal-response models with endogenous regressors. The usual way to estimate the parameters of those models would be to write customized programs. Fortunately, generalized structural equation models, implemented in the Stata gsem command, allow us to build many customized models without the need for programming. I will first introduce the different aspects of generalized structural equation models: family and link, latent variables, and random effects. Then I will demonstrate how to use these building blocks to perform customized estimations.
Survival analysis with Stata: Case studies of fertility of immigrant women (Español)
Event history analysis (and survival analysis) is being used increasingly in several areas of social science research, expanding beyond its traditional use in demography and the health sciences. This is due, first, to its potential for analyzing how the occurrence of an event evolves over time and, second, to the availability of statistical packages to perform these analyses. This presentation will address the tools available in Stata for survival analysis. First, I will present the data structure and the treatment of censored observations. Second, I will show how to display the survival function graphically. Finally, I will develop multivariate regression models in discrete time. All of this is applied to predicting the probability that immigrant women have their first child in Spain after arrival, using the 2007 National Immigrant Survey (ENI) of INE.
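The data structure that discrete-time models rely on can be illustrated with a small Python sketch. It is not taken from the presentation, which uses Stata. It assumes each woman is observed for a whole number of periods and either experiences the event in her last observed period or is censored there. The field names and toy data are hypothetical.

```python
from collections import Counter

def to_person_period(subjects):
    """Expand one row per subject into one row per subject and period at risk.
    A censored subject contributes rows with event = 0 only."""
    rows = []
    for s in subjects:
        for t in range(1, s["duration"] + 1):
            happened = int(s["event"] == 1 and t == s["duration"])
            rows.append({"id": s["id"], "period": t, "event": happened})
    return rows

def life_table(rows):
    """Discrete-time hazard and survival per period from person-period rows."""
    at_risk = Counter(r["period"] for r in rows)
    events = Counter(r["period"] for r in rows if r["event"] == 1)
    table, surv = {}, 1.0
    for t in sorted(at_risk):
        hazard = events[t] / at_risk[t]
        surv *= 1.0 - hazard
        table[t] = {"hazard": hazard, "survival": surv}
    return table

# Toy data: durations in years since arrival; event = 1 means first birth observed
subjects = [
    {"id": "A", "duration": 2, "event": 1},
    {"id": "B", "duration": 3, "event": 0},  # censored after 3 years
    {"id": "C", "duration": 1, "event": 1},
]
rows = to_person_period(subjects)
table = life_table(rows)
```

Once the data are in this person-period form, a discrete-time regression model amounts to a binary-outcome regression of the event indicator on period and covariates, which is why the expansion step usually comes first.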
Distributive conflicts and willingness to pay for the environment
Previous studies suggest that high-income individuals are more willing to pay environmental taxes than low-income individuals, while the opposite would be true for redistributive taxes. In this presentation we argue that the positive relationship between willingness to pay for the environment and income is the result of a distributive conflict between redistributive taxes and environmental taxes. To test our hypotheses, we use data from ISSP (Environment III) for a sample of European countries and estimate a multilevel bivariate probit model with two dependent variables: willingness to pay for the environment and willingness to pay redistributive taxes. Our explanatory variables include income as well as other control variables. We use the gllamm command to estimate this model, taking advantage of its ability to account simultaneously for the correlation between the errors of the two equations and for the correlation within countries. Our results confirm that there is a distributive conflict between environmental and redistributive taxes in which rich (poor) individuals are more (less) willing to pay environmental taxes and less (more) willing to pay redistributive taxes.
Context-conditional effects of electoral cycles: The usefulness of margins and marginsplot
Models of political business cycles remain an active topic in political economy analysis. Though some consensus has emerged about how and why governments conduct elections in democratic societies, supportive evidence of political business cycles is mixed and sometimes inconsistent in the literature. However, a renewed approach has emerged that considers the contextual effect of cycles. In a nutshell, an incumbent’s incentives to build up policy cycles might be conditioned by the latitude to do so as well as by the electoral payoff of such a strategy. This presentation shows how margins and marginsplot help social researchers explore and communicate conditional effects in political science.
Three approaches to study co-occurrence: Principal components, correspondence, and network analysis (Español)
While much of the academic literature is devoted to the study of causal phenomena, in many cases these causal analyses rest on the co-occurrence of phenomena rather than on the causal mechanisms that connect them. This presentation addresses three different approaches to studying the co-occurrence of data measured on a dichotomous scale or with two-value factors: first, a principal component analysis performed with tetrachoric correlations; second, a factorial correspondence analysis with non-exclusive categories; and third, a network analysis based on Haberman standardized residuals. The approaches are illustrated with three research studies: a text analysis, a study of a multiple-response quiz, and an analysis of photographs from family albums.