2015 German Stata Users Group meeting: Abstracts


Statistical learning with boosting

Matthias Schonlau
University of Waterloo, Canada

Estimating survival-time treatment effects and endogenous treatment effects using Stata

David Drukker
StataCorp LP
After reviewing the potential-outcome approach to estimating treatment effects from observational data, this talk discusses new estimators in Stata 14 for estimating average treatment effects from survival-time data and estimators for average treatment effects from endogenous-treatment designs. The talk also covers new research on estimating quantile treatment effects.

Multiprocess modeling with Stata

Tamás Bartus
Corvinus University of Budapest
Multiprocess hazard models consist of multilevel hazard and discrete choice equations with correlated random effects and are routinely used by demographers to correct estimates for endogeneity and sample selection. Although no official Stata command is devoted to estimating systems of hazard equations, the official gsem command and the user-written cmp command offer the opportunity to estimate models of this sort (Roodman 2011; Bartus and Roodman 2014). The presentation addresses (1) the joint estimation of multilevel discrete-time survival and discrete-choice equations with the gsem and the cmp commands; (2) the estimation of (either multilevel or single-level) systems of lognormal survival and discrete choice equations with the cmp command; and (3) the preparation of multi-spell survival datasets for the purpose of estimation. Multiprocess survival modeling is illustrated using standard examples from demographic research.
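
The data-preparation step in point (3) can be illustrated with a minimal sketch, written in Python rather than Stata and not taken from the talk. It assumes the common person-period layout for discrete-time survival models: each spell becomes one row per period at risk, with a binary outcome that is 1 only in the period in which the event occurs. The field names (id, spell, duration, event) are hypothetical.

```python
def expand_to_person_periods(spells):
    """Turn one record per spell into one record per period at risk.

    Each spell is a dict with hypothetical keys:
      id       - person identifier
      spell    - spell number within person
      duration - number of discrete periods observed (>= 1)
      event    - 1 if the spell ended with the event, 0 if censored
    """
    rows = []
    for s in spells:
        for t in range(1, s["duration"] + 1):
            rows.append({
                "id": s["id"],
                "spell": s["spell"],
                "period": t,
                # The outcome is 1 only in the last period of a completed spell.
                "y": 1 if (t == s["duration"] and s["event"] == 1) else 0,
            })
    return rows


spells = [
    {"id": 1, "spell": 1, "duration": 3, "event": 1},  # event in period 3
    {"id": 1, "spell": 2, "duration": 2, "event": 0},  # censored after 2 periods
]
person_periods = expand_to_person_periods(spells)
```

A binary-response model fitted to the y column of such a layout, with period indicators as covariates, is the usual route to a discrete-time hazard equation.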


Roodman, D. 2011. Fitting fully observed recursive mixed-process models with cmp. Stata Journal 11: 159–206.
Bartus, T., and D. Roodman. 2014. Estimation of multiprocess survival models with cmp. Stata Journal 14: 756–777.

A Stata ado for categorical data analysis with latent variables

Hans-Jürgen Andreß
University of Cologne
Maximilian Hörl
University of Cologne
Alexander Schmidt-Catran
University of Cologne
Path models are widely used in the social sciences to illustrate the statistical models used in applied research. They describe the assumed relationships and dependencies between the variables of interest and are easy to comprehend even for statistical laypersons. Up to now, they have mostly been applied to quantitative data, but the main ideas transfer easily to the analysis of categorical data. In doing so, they offer a unified approach to the different statistical methods for categorical data analysis. The catsem ado attempts to make all of these methods, which are scattered over a whole range of Stata commands, accessible through an easy-to-understand and intuitive command language that basically describes path diagrams. Moreover, it adds functionality that is not yet included in Stata: the possibility to include categorical latent variables (Andreß et al. 1997) and the possibility to analyze fairly general functions of the responses as described by Grizzle et al. (1969).


Andreß, H.-J., J. A. Hagenaars, and S. Kühnel. 1997. Analyse von Tabellen und kategorialen Daten: Log-lineare Modelle, latente Klassenanalyse, logistische Regression und GSK-Ansatz. Berlin: Springer-Lehrbuch.
Grizzle, J., C. Starmer, and G. Koch. 1969. Analysis of categorical data by linear models. Biometrics 25: 489–504.

simarwilson: DEA-based two-step efficiency analysis

Harald Tauchmann
Friedrich-Alexander-Universität Erlangen-Nürnberg
Measuring the efficiency of production units (decision-making units, DMUs) has developed into an industry in applied econometrics. Unlike parametric approaches, nonparametric techniques, namely data envelopment analysis (DEA), yield individual efficiency scores for DMUs but do not directly answer the question of what determines efficiency differentials between them. One obvious way to circumvent this limitation is to conduct a two-stage analysis in which the DEA scores obtained in the first stage serve as left-hand-side variables in a second-stage regression that links efficiency to exogenous factors. Such a two-step approach, however, encounters severe problems: (i) DEA efficiency scores are bounded from above or from below at the value of one, depending on how efficiency is defined; and (ii) DEA generates a complex and generally unknown correlation pattern among estimated efficiency scores, resulting in invalid inference in the subsequent regression analysis. To address these problems, Simar and Wilson (2007) suggest a simulation-based, multistep iterative procedure that follows DEA and is based on (i) truncated regressions, (ii) simulating the unknown error correlation, and (iii) calculating bootstrapped standard errors. We introduce the new Stata command simarwilson, which implements this procedure. It complements the user-written command dea (Ji and Lee 2010), which has to precede simarwilson in applied work.


Simar, L., and P. W. Wilson. 2007. Estimation and inference in two-stage, semiparametric models of production processes. Journal of Econometrics 136: 31–64.
Ji, Y.-B., and C. Lee. 2010. Data envelopment analysis. Stata Journal 10: 267–280.

A simple procedure to correct for measurement errors in survey research

Anna de Castellarnau
Research and Expertise Centre for Survey Methodology, University Pompeu Fabra

Although there is an extensive literature on the existence of measurement errors, few researchers correct for them in their analyses. In this presentation, we will show that correction for measurement errors in survey research is not only necessary but also possible and actually rather simple. Using the quality estimates obtained from the free online software Survey Quality Predictor (SQP), correlation and covariance matrices can easily be corrected and used as input for analyses. This procedure was described for Stata, LISREL, and R in the ESS EduNet module "A simple procedure to correct for measurement errors in survey research". This presentation will focus on the correction of measurement errors in regression analysis and causal models using Stata.
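
As a rough illustration of what correcting a correlation matrix can look like, the following Python sketch applies the classical correction for attenuation: each observed correlation is divided by the square root of the product of the two variables' quality estimates. It assumes that a quality estimate acts like a reliability (the share of observed variance due to the trait) and that common method variance can be ignored. It is not necessarily the exact procedure of the ESS EduNet module, and the function name and numbers are made up for the example.

```python
import math


def correct_correlations(r, quality):
    """Disattenuate a correlation matrix.

    r       - observed correlation matrix as a list of lists
    quality - one quality estimate per variable (assumed to act like a
              reliability, between 0 and 1)
    """
    n = len(r)
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                corrected[i][j] = 1.0
            else:
                # Classical correction for attenuation.
                corrected[i][j] = r[i][j] / math.sqrt(quality[i] * quality[j])
    return corrected


observed = [[1.0, 0.3],
            [0.3, 1.0]]
quality = [0.64, 0.81]  # hypothetical quality estimates
corrected = correct_correlations(observed, quality)
```

With these illustrative numbers, the observed correlation of 0.3 is divided by 0.8 × 0.9 = 0.72, so the corrected correlation is larger than the observed one. The corrected matrix could then serve as input to a regression or structural model.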

Time-series analysis using ARFIMA

Frank Ebert
Ebert Beratung und Innovationen GmbH

Since version 12, Stata has supported the estimation of ARFIMA models. How can they be applied, and what should be considered when using them? Weather data are reported to show a "long" memory. This can be checked by estimating the fractional integration parameter d of an autoregressive fractionally (or fractal) integrated moving average (ARFIMA) process. Further relevant data are high-frequency stock market quotations and energy prices. Weather data (in particular wind time series) seem to show a complementary behavior to energy prices. A further aspect is the characterization of a time series by its fractional integration parameter d. Can it be used to compress large amounts of time-series data? More technical questions are the following: What should be considered when working with data that are influenced by fractal (nonwhite) noise, and what could be done to overcome performance problems?
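
To make the role of d concrete, the Python sketch below expands the fractional difference operator (1 - L)^d into its lag weights and applies it to a series. The weights follow the standard binomial recursion w_0 = 1, w_k = w_{k-1} (k - 1 - d) / k. With d = 1 this reduces to the ordinary first difference. For 0 < d < 0.5 the weights decay slowly, which is how long memory shows up. This only illustrates the operator and is not how Stata estimates d. The function names are hypothetical, and the filter is simply truncated at the start of the sample.

```python
def fracdiff_weights(d, n_terms):
    """First n_terms coefficients of the expansion of (1 - L)^d."""
    weights = [1.0]
    for k in range(1, n_terms):
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def fractional_difference(series, d):
    """Apply (1 - L)^d to a series, truncating at the first observation."""
    w = fracdiff_weights(d, len(series))
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]


half_weights = fracdiff_weights(0.5, 3)               # 1, -0.5, -0.125
first_diff = fractional_difference([1.0, 3.0, 6.0], 1)
```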

PSIDTOOLS: An interface to the Panel Study of Income Dynamics

Ulrich Kohler
University of Potsdam

The presentation discusses a collection of user-written programs designed to make analyses of the Panel Study of Income Dynamics (PSID) easier. The PSID is the longest-running longitudinal household survey in the world. Beginning in 1968, the PSID collected yearly information from over 18,000 individuals living in 5,000 households. The PSID offers data to study a broad range of topics, including employment, income, wealth, expenditures, health, and numerous others. However, as in many other panel studies, the hurdles for using the data are relatively high. One reason is that the main corpus of the PSID data is delivered to the end user in sets of yearly ASCII text files, forcing the user to first retrieve a dataset streamlined to the research topic. The PSID tools make these initial steps of PSID data analysis very easy. In particular, the programs automatically create Stata datasets from ASCII text files, load and merge items from several PSID waves, ease wide-long conversions (while keeping labeling information), and automatically add value-label information from the PSID homepage to the dataset in memory.

Extensions to the label commands

Daniel Klein
University of Kassel
Stata has commands to change variable names, as well as their contents, using expressions, a variety of functions, or simple transformation rules. Name abbreviations, wildcard characters, time-series operators, and factor-variable notation further facilitate working with variables. Managing value and variable labels, on the other hand, is not as convenient. Despite a large number of existing user-written commands for this purpose, there is still room for improvement. In this presentation, I introduce a new package, elab, that aims at transferring concepts for manipulating variables to value and variable labels. The package enhances the capabilities of official Stata's label suite and introduces additional tools similar to existing Stata commands for managing variables. Features of elab include support for value-label name abbreviations and wildcard characters and for restricting requests to subsets of integer-to-text mappings. The package offers commands to systematically change integer values and text in value labels using arithmetic expressions or string functions. It further provides programming utilities, making it easy to implement these features in do- and ado-files.

A new Stata command for computing and graphing percentile shares

Ben Jann
University of Bern
Percentile shares provide an intuitive and easy-to-understand way of analyzing income or wealth distributions. A celebrated example is the top income shares reported in the works of Thomas Piketty and colleagues. Moreover, series of percentile shares, defined as differences between Lorenz ordinates, can be used to visualize whole distributions or changes in distributions. In this talk, I present a new command called pshare that computes and graphs percentile shares (or changes in percentile shares) from individual-level data. The command also provides confidence intervals and supports survey estimation.
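
The definition of percentile shares as differences between Lorenz ordinates can be sketched in a few lines of Python. This is an unweighted illustration with linear interpolation at the cut points. It is not the algorithm used by pshare, it provides no confidence intervals, and the function names and data are made up for the example.

```python
def lorenz_ordinate(sorted_values, p):
    """Share of the total held by the poorest fraction p of observations."""
    n = len(sorted_values)
    total = sum(sorted_values)
    pos = p * n                  # number of observations below the cut point
    whole = int(pos)             # observations that are fully included
    cum = sum(sorted_values[:whole])
    if whole < n:
        # Add the fractional part of the observation straddling the cut point.
        cum += (pos - whole) * sorted_values[whole]
    return cum / total


def percentile_shares(values, cutpoints):
    """Differences between Lorenz ordinates at consecutive cut points."""
    xs = sorted(values)
    ordinates = [lorenz_ordinate(xs, p) for p in cutpoints]
    return [hi - lo for lo, hi in zip(ordinates, ordinates[1:])]


incomes = [40, 10, 30, 20]       # illustrative data
shares = percentile_shares(incomes, [0.0, 0.5, 1.0])
```

For these four incomes, the bottom half holds 30 of the total of 100 and the top half holds 70, so the two shares are 0.3 and 0.7.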

Report to users / Wishes and grumbles

Bill Rising

Bill Rising, director of Educational Services at StataCorp LP, will be happy to receive wishes for developments in Stata and almost as happy to receive grumbles about the software.




