
The 2016 Swiss Stata Users Group meeting was November 17, but you can still interact with the user community even after the meeting and learn more about the presentations shared.


Data consolidation and cleaning using fuzzy string comparisons with the matchit command
Abstract: matchit is a user-written command allowing one to combine two datasets based on similar but not necessarily equal text strings and to compare the text similarity between two string variables from the same dataset. These features make matchit a handy and powerful tool in the preparation of data for statistical and econometric analysis as well as in the creation of metrics based on text similarity.

A nonexhaustive list of typical uses for matchit includes consolidating duplicate records within a nonstandardized dataset (for example, cleaning a list of patient names that includes multiple spellings), combining two datasets with nonstandardized keys (for example, merging hospital and insurance data based on treatment names), and creating quantitative measures based on string similarity (for example, comparing the scientific proximity between two medical schools based on their scientific publications and patents).

matchit can perform a wide range of string similarity algorithms—such as ngram, token, soundex, nysiis, or hybrid ones—that, combined with different weighting and scoring functions, allow users to perfect the resulting dataset. Moreover, it also allows for coding custom algorithms and functions benefiting from indexation and other built-in functionalities.
Julio Raffo
WIPO Economics and Statistics Division
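
To make the n-gram idea in the matchit abstract concrete, here is a minimal sketch in Python rather than Stata. It is not matchit's implementation: the function names, the normalization step, and the Jaccard-style score are assumptions chosen for illustration only.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the multiset of overlapping character n-grams of a normalized string."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_similarity(a, b, n=2):
    """Jaccard-style score: shared n-grams divided by all n-grams (0 to 1)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    total = sum((grams_a | grams_b).values())   # multiset union
    return shared / total if total else 0.0


# Two spellings of the same (made-up) patient name share most bigrams.
print(ngram_similarity("Jon Smith", "John Smith"))
```

A score near 1 flags likely duplicates. In practice a cutoff would be chosen by inspecting matched pairs, and different weighting or scoring functions would change the ranking.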
"Match of the day": Finding least proximal measurements to a given date with fmatch
Abstract: Researchers working with observational data are often faced with the problem of finding the nearest measurement around a particular date. For instance, a medical researcher may be interested in finding the closest CD4+ T-cell count measurement prior to initiation of an antiretroviral therapy against HIV, which is known to be predictive of treatment success. While this is not an overly difficult programming task, it takes several lines of potentially error‐prone code to implement.

The fmatch command offers a versatile tool that can achieve such tasks with a single line of code. fmatch is a "wrapper" program for the well‐known mmerge command by J. Weesie. It offers several options that control the merge by specifying date ranges that define eligible measurements and by selecting the smallest or largest value among all eligible measurements. The functionality of fmatch will be illustrated with examples from HIV research.
Viktor von Wyl
Epidemiology, Biostatistics, and Prevention Institute, University of Zurich
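
The task described in the fmatch abstract can be sketched in a few lines of Python. This is only an illustration of the logic, not fmatch or mmerge syntax. The function name, the optional `max_days_before` window, the choice to count a measurement taken on the index date as "prior", and the CD4 values are all assumptions.

```python
from datetime import date


def closest_prior(measurements, index_date, max_days_before=None):
    """Return the latest (date, value) pair on or before index_date, or None.

    measurements: iterable of (date, value) pairs for one individual.
    max_days_before: optional eligibility window, in days, before index_date.
    """
    eligible = [
        (when, value)
        for when, value in measurements
        if when <= index_date
        and (max_days_before is None or (index_date - when).days <= max_days_before)
    ]
    # Among eligible measurements, the most recent one is the closest.
    return max(eligible, key=lambda m: m[0], default=None)


# Made-up CD4 counts for one hypothetical patient.
cd4 = [(date(2015, 1, 10), 410), (date(2015, 5, 2), 350), (date(2015, 7, 20), 520)]
art_start = date(2015, 6, 1)

print(closest_prior(cd4, art_start))                      # no window restriction
print(closest_prior(cd4, art_start, max_days_before=14))  # nothing within 14 days
```

Returning `None` when no measurement falls in the window makes the unmatched cases explicit instead of silently pairing them with a distant value.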
Handling missing data in Stata: Imputation and likelihood-based approaches
Abstract: Missing values are common in many fields. If analyses do not properly account for missing values, the resulting estimates may be biased. Stata gives the user access to multiple principled methods of handling missing values in a dataset. This talk will focus on two methods, multiple imputation (MI) and full information maximum likelihood (FIML). After an introduction to important concepts in the analysis of missing data, this presentation will provide an overview of how to perform analyses using MI and FIML in Stata. A comparison of the techniques and their advantages and disadvantages will be included.
Rose Medeiros
StataCorp LP
Reproducible research with Stata
Abstract: MarkDoc is a general-purpose literate programming package for Stata that can serve a variety of purposes such as creating dynamic documents, dynamic presentation slides, Stata package help files, and Stata package documentation in various formats. The presentation introduces the package and its overall workflow as well as the recent improvements in the package. Moreover, the applications of the package for data analysis, teaching statistics, and documenting new Stata packages are discussed.
E.F. Haghish
Université de Fribourg
Creating LaTeX and HTML documents from within Stata using texdoc and webdoc
Abstract: At the 2009 meeting in Bonn, I presented a new Stata command called texdoc. The command allowed weaving Stata code into a LaTeX document, but its functionality and its usefulness for larger projects were limited. In the meantime, I have heavily revised the texdoc command to simplify the workflow and improve support for complex documents. The command is now well suited, for example, to generating automatic documentation of data analyses or even to writing an entire book. In this talk, I will present the new features of texdoc and provide examples of their application. Furthermore, I will present a newly released companion command called webdoc that can be used to produce HTML or Markdown documents.
Ben Jann
Institute of Sociology, University of Bern
The assessment of fit in the class of logistic regression models: A pathway out of the jungle of pseudo-R2s using Stata
Abstract: Since the early nineties, logistic regression for binary, ordinal, and nominal dependent variables has become widespread in the social sciences. Nevertheless, there is no consensus on how to assess the fit of these models in terms of practical significance. Many pseudocoefficients of determination have been proposed, but they are seldom used in applied research. Most of these pseudo-R2s follow the principle of proportional reduction of error, comparing the likelihood, the log likelihood, or the precision of prediction with that of a baseline model that includes only the constant.

Alternatively, McKelvey and Zavoina (1975) proposed a measure that estimates the proportion of explained variance of the underlying latent dependent variable. The Monte Carlo studies of Hagle and Mitchell (1992), Veall and Zimmermann (1992, 1994), and Windmeijer (1995) show that the McKelvey and Zavoina pseudo-R2 is the best one for evaluating the fit of binary and ordinal logit or probit models. Assuming independent and identically distributed errors, I also propose a generalization of the McKelvey and Zavoina pseudo-R2 to multinomial logistic regression that assesses the fit of each binary comparison simultaneously. The usefulness of this concept is demonstrated by an applied analysis of election study data in Stata using the self-developed mzr2 command.
Wolfgang Langer
Institute of Sociology, University of Halle-Wittenberg
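
For the binary case, the McKelvey and Zavoina measure can be sketched as the variance of the fitted linear predictor divided by that variance plus the assumed latent error variance (π²/3 for logit, 1 for probit). The Python function below is an illustration under those textbook assumptions. It is not the mzr2 command, it does not cover the multinomial generalization proposed in the talk, and the use of the population variance is an assumption.

```python
import math
from statistics import pvariance


def mckelvey_zavoina_r2(linear_predictor, link="logit"):
    """Share of latent-variable variance explained by the fitted linear predictor."""
    if link == "logit":
        error_variance = math.pi ** 2 / 3  # variance of the standard logistic
    elif link == "probit":
        error_variance = 1.0               # variance of the standard normal
    else:
        raise ValueError("link must be 'logit' or 'probit'")
    explained = pvariance(linear_predictor)
    return explained / (explained + error_variance)


# Hypothetical fitted values of x'b from an already estimated model.
xb = [-1.2, -0.4, 0.1, 0.9, 1.6]
print(mckelvey_zavoina_r2(xb, link="logit"))
```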
Counterfactual distributions: Estimation and inference in Stata
Abstract: Counterfactual distributions are important ingredients for policy and decomposition analysis. For example, we might be interested in what the outcome distribution for the treated units would be had they not received the treatment or in what the distribution of wages for female workers would be in the absence of gender discrimination in the labor market (that is, if female workers are paid the same as male workers with the same characteristics) or in what the distribution of housing prices would be if we clean up a local hazardous waste site. More generally, we can think of a policy intervention either as a change in the distribution of a set of explanatory variables X that determine the outcome variable of interest Y or as a change in the conditional distribution of Y given X. The Stata commands counterfactual and cdeco implement estimation and inference procedures for these two types of applications. The estimation of the conditional distribution can be based on the main regression methods, including classical, quantile, duration, and distribution regressions. The commands provide not only pointwise but also functional confidence bands, which cover the entire functions with prespecified probability and can be used to test functional hypotheses such as no effect, positive effect, or stochastic dominance.
Blaise Melly
Department of Economics, University of Bern
Distribution regression made easy
Abstract: Incorporating covariates in (income or wage) distribution analysis typically involves estimating conditional distribution models, that is, models for the cumulative distribution of the outcome of interest conditionally on the value of a set of covariates. A simple strategy is to estimate a series of binary outcome regression models for F(z|x_i) = Pr(y_i ≤ z|x_i) for a grid of values for z (Peracchi and Foresi, 1995, Journal of the American Statistical Association; Chernozhukov et al., 2013, Econometrica). This approach, now often referred to as "distribution regression", is attractive and easy to implement. This talk illustrates how the Stata commands margins and suest can be useful for inference here and suggests various tips and tricks to speed up the process and solve potential computational issues. It also shows how to use conditional distribution model estimates to analyze various aspects of unconditional distributions.
Philippe van Kerm
Luxembourg Institute of Socio-Economic Research
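
The grid-of-thresholds idea can be illustrated in Python for a deliberately simple special case: a single categorical covariate. With a model that is saturated in the groups, the fitted probability Pr(y ≤ z | group) is just the within-group share of observations at or below z, so no iterative estimation is needed. This is a sketch under that assumption. It is not the talk's Stata workflow with margins and suest, and the function names and data are made up.

```python
from collections import defaultdict


def conditional_cdfs(data, grid):
    """data: list of (group, y). Returns {group: [Pr(y <= z | group) for z in grid]}."""
    by_group = defaultdict(list)
    for group, y in data:
        by_group[group].append(y)
    return {
        group: [sum(y <= z for y in ys) / len(ys) for z in grid]
        for group, ys in by_group.items()
    }


def unconditional_cdf(cdfs, shares):
    """Average the conditional CDFs using covariate shares {group: weight}."""
    n_grid = len(next(iter(cdfs.values())))
    return [sum(shares[g] * cdfs[g][k] for g in shares) for k in range(n_grid)]


# Made-up outcomes for two groups, evaluated on a grid of thresholds z.
data = [("a", 1), ("a", 3), ("b", 2), ("b", 4)]
grid = [1, 2, 3, 4]
cdfs = conditional_cdfs(data, grid)

print(unconditional_cdf(cdfs, {"a": 0.5, "b": 0.5}))  # observed group shares
print(unconditional_cdf(cdfs, {"a": 1.0, "b": 0.0}))  # hypothetical shares
```

Changing the shares while holding the conditional CDFs fixed is one way to move from conditional estimates to statements about an unconditional distribution. With continuous covariates, each threshold would instead need a fitted binary regression such as a logit or probit.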
Effective plots to assess bias and precision in method comparison studies
Abstract: Bland and Altman’s limits of agreement (LoA) have traditionally been used in clinical research to assess the agreement between different methods of measurement for quantitative variables. However, when the variances of the measurement errors of the two methods are different, Bland and Altman’s plot may be misleading: there are settings where the regression line shows an upward or downward trend although there is no bias, and settings where it has a zero slope although there is a bias.

Therefore, the goal of this presentation is to clearly illustrate why and when a bias arises, particularly when heteroskedastic measurement errors are expected, and to propose new plots to help the investigator visually and clinically appraise the performance of the new method. These plots do not have the above-mentioned defect and are still easy to interpret, in the spirit of Bland and Altman’s LoA.

To achieve this goal, we rely on the modeling framework recently developed by Nawarathna and Choudhary, which allows the measurement errors to be heteroskedastic and depend on the underlying latent trait. Their estimation procedure, however, is complex and rather daunting to implement. Therefore, we have developed a new estimation procedure that is much simpler to implement and yet performs very well, as illustrated by our simulations.

The methodology requires several measurements with the reference standard and possibly only one with the new method for each individual.
Patrick Taffé
Institute of Social and Preventive Medicine, University of Lausanne
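
For reference, the traditional limits of agreement that the abstract takes as its starting point can be sketched as the mean of the paired differences plus or minus 1.96 times their standard deviation. The Python function below shows only this classical calculation, not the new estimation procedure or plots proposed in the talk. The function name and the measurements are made up.

```python
from statistics import mean, stdev


def limits_of_agreement(method_a, method_b, z=1.96):
    """Return (lower limit, mean difference, upper limit) for paired measurements."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)     # average difference between the two methods
    spread = stdev(diffs)  # sample standard deviation of the differences
    return bias - z * spread, bias, bias + z * spread


# Made-up paired readings from a new method and a reference method.
new_method = [10, 12, 14]
reference = [9, 12, 12]
print(limits_of_agreement(new_method, reference))
```

These limits treat the spread of the differences as constant across the measurement range, which is the assumption that fails when the measurement errors are heteroskedastic.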
New figure schemes for Stata: plotplain and plottig
Abstract: While Stata’s computational capabilities have increased considerably over the last decade, the quality of its default figure schemes is still a matter of debate among users. Clearly, some of the arguments against Stata figures are a matter of individual taste, but others are not, such as horizontal labeling, unnecessary background tinting, missing gridlines, and oversized markers. The two schemes introduced here attempt to solve the major shortcomings of Stata’s default figure schemes. Furthermore, the schemes come with 21 new colors, of which 7 are distinguishable for people with color blindness.
Daniel Bischof
Department of Political Science, University of Zurich
Easy multipanel plotting with grcomb
Abstract: grcomb is a user-written wrapper for Stata's graph combine. It makes quick-and-dirty multipanel plotting easy.
Alex Gamma
Psychiatric University Hospital, Zurich


Scientific committee

Ben Jann
Institute of Sociology, University of Bern

Radoslaw Panczak
Institute of Social and Preventive Medicine, University of Bern

Marcel Zwahlen
Institute of Social and Preventive Medicine, University of Bern

Logistics organizer

The logistics organizer for the 2016 Swiss Stata Users Group meeting is Ritme, scientific solutions, the distributor of Stata in Switzerland, France, and Belgium.
