Stata Conference
Baltimore 2017

July 27–28

Share information.
Exchange ideas.


Network with fellow Stata users and experts at the 2017 Stata Conference in Baltimore on July 27–28.

Stata's user community is inventive and unique, developing their own commands and applying Stata in new ways to real-world situations. At the Stata Conference, you will learn new techniques from Stata developers and users. Discuss, develop, and dine with us in "Charm City".

Program: Thursday, July 27

8:00–8:50 Registration and breakfast
8:50–9:00 Welcome and introduction
Working with demographic life table data in Stata
Abstract: This presentation introduces two user-written Stata commands related to the data and calculations of demographic life tables, whose most prominent feature is the calculation of life expectancy at birth. The first command, hmddata, provides a convenient interface to the Human Mortality Database (HMD, www.mortality.org), a database widely used for mortality data by demographers, health researchers, and social scientists. Different subcommands of hmddata allow data from this database to be easily loaded, transformed, reshaped, tabulated, and graphed. The second command, lifetable, produces demographic period life tables. The main features are that life table columns can be flexibly calculated using any valid minimum starting information; abridged tables can be generated from complete ones; finally, a Stata dataset can hold any number of life tables, and the various lifetable subcommands can operate on any subset of them.
Daniel C. Schneider
Max Planck Institute for Demographic Research
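As a rough illustration of the kind of calculation a period life table involves (this is not the code of the lifetable command), the Python sketch below derives life expectancy at birth from single-year age-specific death rates. It assumes that deaths occur on average halfway through each age interval and that the table is closed with an open-ended last age group; the function name and the input layout are hypothetical.

```python
def life_expectancy_at_birth(mx):
    """Return e0 from single-year death rates mx.

    mx[-1] is the death rate of the open-ended last age group.
    """
    if not mx or any(m < 0 for m in mx) or mx[-1] <= 0:
        raise ValueError("rates must be non-negative and the last rate positive")
    lx = 1.0              # survivors at exact age x, with a radix of 1
    person_years = 0.0    # running total of L_x
    for m in mx[:-1]:
        qx = m / (1.0 + 0.5 * m)        # death probability, assuming a_x = 0.5
        dx = lx * qx                    # deaths in the age interval
        person_years += lx - 0.5 * dx   # person-years lived in the interval
        lx -= dx
    person_years += lx / mx[-1]         # open-ended interval: L = l / m
    return person_years                 # radix is 1, so e0 = T_0


print(round(life_expectancy_at_birth([0.02] * 101), 6))
```

With a constant death rate at every age, the result equals the reciprocal of that rate, which makes a convenient sanity check.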
Use of Stata for psychometrics: Reflections from a novice user from a low-resource setting
Abstract: Background: There has been an exponential increase in the development and transcultural validation of patient-reported outcomes (PROs). Consequently, the Multidimensional Scale of Perceived Social Support (MSPSS) has been extensively translated and validated. Social support is envisaged as an essential buffer to stressful lifetime events. However, the psychometrics of translated PROs depend heavily on the quality of the translation and psychometric evaluation processes. Methodology and results: The MSPSS was translated into Shona (a Zimbabwean native language) using the backward-forward translation method and administered to 1,125 informal caregivers. Factorial validity was assessed using both EFA and CFA techniques. The original three-factor structure was replicated using Stata 14. Reflections: Stata is invaluable, first, because it can process both EFA and CFA, which is cost effective. Second, its lower demand for CPU operational memory compared with other packages such as SPSS makes it the software of choice. Third, the output is easy to interpret, which is important for novice researchers in low-resource settings where access to biostatistician services is scarce and expensive. However, there is a need for tutorial videos on media platforms such as YouTube and for a simplified user guide to cater for novice researchers. Conclusion: Stata is an indispensable psychometric evaluation tool with potential for domination in the low-resource setting market given its versatility. Making support resources more available to novice researchers could increase the utility of the Stata software.
Jermaine Dambi
University of Cape Town and University of Zimbabwe
Application of the MIMIC model to detect and predict differential item functioning
Abstract: There has been extensive research indicating gender-based differences in STEM subjects, particularly mathematics (Albano and Rodriguez 2013; Lane, Wang, and Magone 1996). Similarly, gender-based differential item functioning (DIF) has been researched because of the disadvantages females face in STEM subjects when compared with their male counterparts. Given that, this study will apply the multiple indicators multiple causes (MIMIC) model, a type of structural equation model, to detect the presence of gender-based DIF using the Program for International Student Assessment (PISA) mathematics data from students in the United States of America and then predict the DIF using math-related covariates. This study will build upon a previous study that explored the same data using the hierarchical generalized linear model and will be confirmatory in nature. Based on the results of the previous study, it is expected that several items will exhibit DIF that disadvantages females and that mathematics-based self-efficacy will predict the DIF. However, additional covariates will also be explored, and the two models will be compared in terms of their DIF detection and the subsequent modeling of DIF. Implications of these results include females underachieving when compared with their male counterparts, thus continuing the current trend. These gender differences can further manifest at the national level, causing U.S. students as a whole to underperform at the international level. Last, the efficacy of the MIMIC model in detecting and predicting DIF will be illustrated, in the hope that the model becomes more widely used to model and better understand differences and DIF.
Kevin Krost
Virginia Tech
Joshua Cohen
Virginia Tech
Extended-value logic
Abstract: In 2001, I gave a presentation on three-valued logic. Since then, I have developed some ideas that grew out of that investigation, leading to new insights about missing values and to the development of five-valued logic. I will also show how these notions extend to numeric computation and to an abstract generalization of the principles involved. This is not about analysis; this is about data construction and preparation, and it is a possibly interesting conceptual tool.
David Kantor
Data for Decisions
10:20–10:40 Break
Uncomplicated parallel computing with Stata
Abstract: Parallel lets you run Stata faster, sometimes faster than MP itself. By organizing your job in several Stata instances, parallel allows you to work with out-of-the-box parallel computing. Using the 'parallel' prefix, you can get faster simulations, bootstrapping, reshaping big data, etc., without having to know a thing about parallel computing. With no need to have Stata/MP installed on your computer, parallel has been shown to speed up computations dramatically, by a factor of two, four, or more, depending on how many processors your computer has.
Brian Quistorff
George G. Vega Yon
University of Southern California
Stata extensibility with the Java API: Tools, examples, and advice
Abstract: The inclusion of the Java API for Stata provides users, and user programmers, with exciting opportunities to leverage a wide array of existing work in the context of their Stata workflow. This talk will introduce a few tools designed to help others wanting to integrate Java libraries into their workflow: the Stata Maven Archetype and the StataJavaUtilities library. In addition to a higher-level overview, the presentation will also show examples of using existing Java libraries to expand statistical models in psychometrics, to send yourself emails when your job is complete, to compute phonetic string encodings and string distances, and to access file and operating system properties, as well as examples to use as starting points for developing Java plugins in Stata.
Billy Buchanan
Fayette County Public Schools
Big data in Stata with the ftools package
Abstract: In recent years, very large datasets have become increasingly prevalent in most social sciences. However, some of the most important Stata commands (collapse, egen, merge, sort, etc.) rely on algorithms that are not well suited for big data. In my talk, I will present the ftools package, which contains plugin alternatives to these commands and performs up to 20 times faster on large datasets [1]. Further, I will explain the underlying algorithm and Mata function and show how to use this function to create new Stata commands and to speed up existing packages. [1]: See benchmarks here: https://github.com/sergiocorreia/ftools/#benchmarks
Sergio Correia
Board of Governors of the Federal Reserve System
12:00–1:00 Lunch
On the shoulders of giants, or not reinventing the wheel
Abstract: Part of the art of coding is writing as little as possible to do as much as possible. The presentation expands on this truism. Examples are given of Stata code to yield graphs and tables in which most of the real work is happily delegated to workhorse commands. In graphics, a key principle is that graph twoway is the most general command, even when you do not want rectangular axes. Variations on scatter- and line plots are precisely that, variations on scatter- and line plots. More challenging illustrations include commands for circular and triangular graphics, in which x and y axes are omitted, with an inevitable but manageable cost in re-creating scaffolding, titles, labels, and other elements. In tabulations and listings, the better-known commands sometimes seem to fall short of what you want. However, a few preparation commands (such as generate, egen, collapse, or contract) followed by list, tabdisp, or _tab can get you a long way. The examples range in scope from a few lines of interactive code to fully developed programs. The presentation is thus pitched at all levels of Stata users.
Nicholas Cox
Durham University, United Kingdom
Incorporating Stata into reproducible documents
Abstract: Part of reproducible research is eliminating manual steps such as hand-editing documents. Stata 15 introduces several commands which facilitate automated document production, including dyndoc for converting dynamic Markdown documents to web pages, putdocx for creating Word documents, and putpdf for creating PDF files. These commands allow you to mix formatted text and Stata output, and allow you to embed Stata graphs, in-line Stata results, and tables containing the output from selected Stata commands. We will show these commands in action, demonstrating automating the production of documents in various formats, and including Stata results in those documents.
Hua Peng
2:20–2:40 Break
Propensity scores and causal inference using machine learning methods
Abstract: We compare a variety of methods for predicting the probability of a binary treatment (the propensity score), with the goal of comparing otherwise like cases in treatment and control conditions for causal inference about treatment effects. Better prediction methods can under some circumstances improve causal inference by reducing both the finite sample bias and variability of estimators, but sometimes, better predictions of the probability of treatment can increase bias and variance, and we clarify the conditions under which different methods produce better or worse inference (in terms of mean squared error of causal impact estimates).
Austin Nichols
Abt Associates
Linden McBride
Cornell University
Now you see me: High school dropout and machine learning
Abstract: In this paper, we create an algorithm to predict which students are eventually going to drop out of U.S. high school using information available in ninth grade. We show that using a naive model, as implemented in many schools, leads to poor predictions. In addition, we explain how schools can obtain more precise predictions by exploiting the big data available to them, as well as more sophisticated quantitative techniques. We also compare the performance of econometric techniques like logistic regression with machine learning tools such as support vector machines, boosting, and LASSO. We offer practical advice on how to apply machine learning methods using Stata to the high-dimensional datasets available in education. Model parameters are calibrated by taking into account policy goals and budget constraints.
Dario Sansone
Georgetown University
3:40–4:00 Break
Small area estimation/Poverty Map in Stata
Abstract: We present a new Stata package for small-area estimations of poverty and inequality implementing methodologies from Elbers, Lanjouw, and Lanjouw (2003). Small-area methods attempt to solve low representativeness of surveys within areas or the lack of data for specific areas and subpopulations. This is accomplished by incorporating information from outside sources. A common outside source is census data, which often lack detailed information on welfare. Thus far, a major limitation toward such analysis in Stata has been the memory required to work with census data. The povmap package introduces new Mata functions and a plugin used to circumvent memory limitations that will arise when working with big data.
Minh Nguyen
World Bank
Paul Andres Corral Rodas; Joao Pedro Wagner De Azevedo; Qinghua Zhao
World Bank
Interactive maps
Abstract: We present examples of how to construct interactive maps in Stata, using only built-in commands available even in secure environments. One can also use built-in commands to smooth geographic data as a pre-processing step. Smoothing can be done using methods from twoway contour, or predictions from a GMM model as described in Drukker, Prucha, and Raciborski (2013). The basic approach to creating a map in Stata is twoway area, with the options nodropbase cmiss(no) yscale(off) xscale(off), with a polygon “shape file” dataset (often created by the user-written shp2dta by Kevin Crow, possibly with a change of projection using programs by Robert Picard) and multiple calls to area with if qualifiers to build a choropleth or scatter to superimpose point data. This approach is automated by several user-written commands and works well for static images but is less effective for web content where a JavaScript entity is desirable. However, it is straightforward to write out the requisite information using the file command and to use open-source map tools to create interactive maps for the web. We present two useful examples.
Ali Lauer
Abt Associates
Analyzing satellite data in Stata
Abstract: We provide examples of how one can use satellite or other remote sensing data in Stata, with a variety of analysis methods, including examples of measuring economic disadvantage using satellite imagery.
Hiren Nisar
Abt Associates

Program: Friday, July 28

8:30–9:00 Registration and breakfast
Computing occupational segregation indices with standard errors: An ado-file application with an illustration for Colombia
Abstract: We developed an ado-file to easily estimate three selected occupational segregation indicators with standard errors using a bootstrap procedure. The indicators are the Duncan and Duncan (1955) dissimilarity index, the Gini coefficient based on the distribution of jobs by gender (see Deutsch et al. [1994]), and the Karmel and MacLachlan (1988) index of labor market segregation. This routine can be easily applied to conventional labor market microdata in which information regarding the occupation classification, industry, and occupational category variables is usually available. As an illustration of the application of this ado-file, we present estimates of both occupational and industry segregation by gender drawn from Colombian household survey microdata. The estimation of occupational segregation measures with standard errors proves to be useful in assessing statistical differences in segregation measures within labor market groups and over time.
Jairo G Isaza Castro
Universidad de la Salle
Karen Hernandez; Karen Guerrero; Jessy Hemer
Universidad de la Salle
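To make the first of these indicators concrete, the Python sketch below computes the Duncan and Duncan dissimilarity index, D = 0.5 × Σ |f_i/F − m_i/M| over occupations i, and a bootstrap standard error obtained by resampling individuals with replacement. It is a minimal illustration under assumed inputs, not the authors' ado-file; the function names, the record layout, and the number of replications are hypothetical.

```python
import random
import statistics
from collections import Counter


def duncan_index(records):
    """records: list of (occupation, gender) pairs with gender "F" or "M"."""
    women = Counter(occ for occ, g in records if g == "F")
    men = Counter(occ for occ, g in records if g == "M")
    n_f, n_m = sum(women.values()), sum(men.values())
    if n_f == 0 or n_m == 0:
        raise ValueError("need at least one woman and one man")
    occupations = set(women) | set(men)
    # Half the summed absolute gap between the two occupational distributions.
    return 0.5 * sum(abs(women[o] / n_f - men[o] / n_m) for o in occupations)


def bootstrap_se(records, reps=200, seed=12345):
    """Standard deviation of the index across bootstrap resamples."""
    duncan_index(records)  # fail early if the data hold only one gender
    rng = random.Random(seed)
    draws = []
    while len(draws) < reps:
        sample = rng.choices(records, k=len(records))
        try:
            draws.append(duncan_index(sample))
        except ValueError:
            continue  # redraw when a resample contains only one gender
    return statistics.stdev(draws)


data = [("a", "F")] * 3 + [("b", "F")] + [("a", "M")] + [("b", "M")] * 3
print(duncan_index(data), bootstrap_se(data, reps=100))
```

The index is 0 when women and men have identical occupational distributions and 1 under complete segregation.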
cvcrand and cptest: Efficient design and analysis of cluster randomized trials
Abstract: Cluster randomized trials (CRTs), where clusters (for example, schools or clinics) are randomized but measurements are taken on individuals, are commonly used to evaluate interventions in public health and social science. Because CRTs typically involve only a few clusters, simple randomization frequently leads to baseline imbalance of cluster characteristics across treatment arms, threatening the internal validity of the trial. In CRTs with a small number of clusters, classic approaches to balancing baseline characteristics, such as matching and stratification, have several drawbacks, especially when the number of baseline characteristics the researcher desires to balance is large (Ivers et al. 2012). An alternative approach is constrained randomization, whereby an allocation scheme is randomly selected from a subset of all possible allocation schemes based on the value of a balancing criterion (Raab and Butcher 2001). Subsequently, an adjusted permutation test can be used in the analysis, which provides increased efficiency under constrained randomization compared with simple randomization (Li et al. 2015). We describe constrained randomization and permutation tests for the design and analysis of CRTs and provide examples to demonstrate the use of our newly created Stata package (cvcrand), which uses Mata to efficiently process large allocation matrices, to implement constrained randomization and permutation tests.
John Gallis
Duke University
Fan Li; Hengshi Yu; Elizabeth L. Turner
Duke University
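The idea of constrained randomization described in the abstract can be sketched in a few lines of Python: enumerate every possible allocation, score each on a balance criterion, keep the best-balanced fraction, and draw the final allocation at random from that subset. This is a toy illustration, not the cvcrand implementation; the single covariate, the squared difference in arm means as the balance score, the 10% cutoff, and all names are assumptions made for the example.

```python
import itertools
import random


def constrained_randomization(covariate, n_treat, cutoff=0.1, seed=2017):
    """Pick a treatment allocation at random from the best-balanced schemes.

    covariate: one baseline value per cluster
    n_treat:   number of clusters assigned to treatment
    cutoff:    fraction of all schemes kept in the constrained space
    """
    n = len(covariate)
    if not 0 < n_treat < n:
        raise ValueError("n_treat must leave clusters in both arms")
    scored = []
    for treated in itertools.combinations(range(n), n_treat):
        control = [c for c in range(n) if c not in treated]
        mean_t = sum(covariate[c] for c in treated) / n_treat
        mean_c = sum(covariate[c] for c in control) / len(control)
        scored.append(((mean_t - mean_c) ** 2, treated))  # smaller is better
    scored.sort(key=lambda pair: pair[0])
    n_keep = max(1, int(cutoff * len(scored)))
    constrained_space = [alloc for _, alloc in scored[:n_keep]]
    chosen = random.Random(seed).choice(constrained_space)
    return chosen, constrained_space


chosen, space = constrained_randomization([1, 2, 3, 4, 5, 6, 7, 8], n_treat=4)
print(chosen, len(space))
```

Full enumeration is only feasible for a small number of clusters, which is the setting the abstract has in mind.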
Using theory to define a computationally tractable specification space in confirmatory factor models
Abstract: Researchers constructing measurement models must decide how to proceed when an initial specification fits poorly. Common approaches include search algorithms that optimize fit and piecemeal changes to the item list or the error specification. The former approach may yield a good-fitting model that is inconsistent with theory or may fail to identify the best-fitting model because of local optimization issues. The latter suffers from poor reproducibility and may also fail to identify the optimal model. We outline a new approach that defines a computationally tractable specification space based on theory. We use the example of a hypothesized latent variable with 25 candidate indicators divided across 5 content areas. Using Stata’s tuples command, we identify all combinations of indicators containing at least one indicator per content area. In our example, this yields 7,294 models. We estimate each model on a derivation dataset and select candidate models with fit statistics that are acceptable or could be rendered acceptable by permitting correlated errors. Eight models fit these criteria. We evaluate modification indices, respecify if there is theoretical justification for correlated errors, and select a final model based on fit statistics. In contrast to other methods, this approach is easily replicable and may result in a model that is consistent with theory and has acceptable fit.
Geoff Dougherty
Johns Hopkins Bloomberg School of Public Health
Dr. Lorraine Dean
Johns Hopkins Bloomberg School of Public Health
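The enumeration step can be illustrated with a toy Python sketch that lists every set of indicators containing at least one indicator from each content area. It does not use the tuples command and does not try to reproduce the 7,294 models reported in the abstract, whose count depends on details the abstract does not state; the area and indicator names below are hypothetical.

```python
import itertools


def nonempty_subsets(items):
    """All non-empty combinations of the given items."""
    return [combo
            for size in range(1, len(items) + 1)
            for combo in itertools.combinations(items, size)]


def specification_space(areas):
    """All indicator sets with at least one indicator per content area.

    areas: dict mapping a content-area name to its list of indicators
    """
    per_area = [nonempty_subsets(items) for items in areas.values()]
    # One non-empty subset is chosen from each area, then the picks are merged.
    return [tuple(itertools.chain.from_iterable(choice))
            for choice in itertools.product(*per_area)]


toy_areas = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
models = specification_space(toy_areas)
print(len(models))
```

For areas with n_1, ..., n_k indicators, the unrestricted space holds (2^n_1 − 1) × ... × (2^n_k − 1) specifications, so any practical application has to restrict it further.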
A theory-based method for linking individual-level data from multiple data sources
Abstract: We have developed, and implemented in Stata, a theory-based method for subject-level linking of data from multiple sources, such as data from electronic medical records of different healthcare provider organizations. This method is intended to support linking data for research purposes. Numerous linkage approaches exist and are generally classified as either “deterministic” or “probabilistic.” However, even probabilistic approaches do not provide end users with certain needed information, namely, theoretically justified estimates of false linkage (“false-positive”) levels and of false nonlinkage (“false-negative”) levels. By contrast, our theory-based method uses probability theory to generate linkage solutions, as well as theory-based estimates of false-positive and false-negative levels for those solutions. Actually, this method produces a sequence of linkage solutions from which a researcher can choose, with the false-positive and false-negative rates estimated for each of those solutions. The method functions in two phases, both very computationally intensive. The first phase generates the sequence of linkage solutions, including a “false-positive” estimate for each solution. The second phase estimates the “false-negative” level for each solution. As implemented, masked data elements are used as inputs to provide strong privacy protection.
Ronald Horswell
Pennington Biomedical Research Center, LSU System
10:20–10:40 Break
Response surface models for the Elliott, Rothenberg, Stock DF-GLS unit-root test
Abstract: We present response surface coefficients for a large range of quantiles of the Elliott, Rothenberg, and Stock (Econometrica 1996) DF-GLS unit-root tests for different combinations of the number of observations and the lag order in the test regressions, where the latter can be either specified by the user or endogenously determined. The critical values depend on the method used to select the number of lags. The Stata command ersur is presented, and its use is illustrated with an empirical example that tests the validity of the expectations hypothesis of the term structure of interest rates.
Christopher Baum
Boston College and DIW Berlin
Jesús Otero
Universidad del Rosario, Colombia
lqmm: A Stata command for estimating linear quantile mixed models
Abstract: Linear (normal) mixed-effects models are highly popular and flexible regression models used to analyze the conditional mean of clustered outcome variables. Normal mixed models assume that, conditional on the random effects, the outcome distribution is affected by the predictors only through its location parameter. In the presence of complex effects of the predictors on the scale or shape of the conditional distribution, alternative approaches may be warranted. Quantile regression is a statistical tool that extends regression for the mean to the analysis of the entire distribution of the outcome variable. Quantile regression with clustered data is a very active area of research in statistics. In this talk, we present the Stata implementation (based on Mata) of the homonymous R package (Geraci 2014, Journal of Statistical Software 57: 1–29) for fitting quantile regression models with random effects. Some examples will be provided.
Akhtar Hossain
University of South Carolina
Marco Geraci
University of South Carolina
crreg: A new command for generalized continuation ratio models
Abstract: A continuation ratio model represents a variant of an ordered regression model that is suited to modeling processes that unfold in stages, such as educational attainment. The parameters for covariates in continuation ratio models may be constrained to be equal, subject to a proportionality constraint across stages, or freely vary across stages. Currently, there are three user-written Stata commands that fit continuation ratio models. Each of these commands fits some subset of continuation ratio models involving parameter constraints, but none of them offer complete coverage of the range of possibilities. In addition, all the commands rely on reshaping the data into a stage-case format to facilitate estimation. The new crreg command expands the options for continuation ratio models to include the possibility for some or all of the covariates to be constrained to be equal, to freely vary, or to have a proportionality constraint across stages. The crreg command relies on Stata’s ML routines for estimation and avoids reshaping the data. The crreg command includes options for three different link functions (the logit, probit, and cloglog) and supports Stata’s survey and multiple imputation suites of commands.
Shawn Bauldry
Purdue University
Jun Xu
Ball State University
Andrew Fullerton
Oklahoma State University
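The stage-case reshaping that the abstract says existing commands depend on (and that crreg avoids) can be sketched in Python as follows. Under one common convention, assumed here, a person with ordinal outcome y out of K categories contributes one record for each stage they reach, with a binary indicator of whether they stopped at that stage; the function and field names are hypothetical.

```python
def to_stage_case(outcomes, n_categories):
    """Expand ordinal outcomes (coded 1..K) into one row per person and stage."""
    rows = []
    for person, y in enumerate(outcomes):
        if not 1 <= y <= n_categories:
            raise ValueError("outcome outside 1..K")
        # A person is at risk in stages 1..min(y, K-1); there is no stage K.
        for stage in range(1, min(y, n_categories - 1) + 1):
            rows.append({"person": person,
                         "stage": stage,
                         "stopped": int(y == stage)})
    return rows


for row in to_stage_case([1, 2, 3], n_categories=3):
    print(row)
```

A binary model fit to these rows, with stage-specific intercepts, corresponds to a continuation ratio model with covariate effects constrained to be equal across stages.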
12:00–1:00 Lunch
The multivariate dustbin
Abstract: When I was in graduate school, I was taught that multivariate methods were the future of data analysis. In that dark computer stone age, multivariate meant multivariate analysis of variance (MANOVA), linear discriminant function analysis (LDA), canonical correlation analysis (CA), and factor analysis (which will not be discussed in this presentation). Statistical software has evolved considerably since those ancient days. MANOVA, LDA, and CA are still around but have been eclipsed and pushed aside by newer, sexier methodologies. These three methods have been consigned to the multivariate dustbin, so to speak. This presentation will review MANOVA, LDA, and CA, discuss the connections among the three approaches, and highlight the positives and negatives of each approach.
Phil Ender
UCLA (Ret.)
Analyzing interval-censored survival-time data in Stata
Abstract: In survival analysis, right-censored data have been studied extensively and can be analyzed using Stata's extensive suite of survival commands, including streg for fitting parametric survival models. Right-censored data are a special case of interval-censored data. Interval-censoring occurs when the failure time of interest is not exactly observed but is only known to lie within some interval. Left-censoring, which occurs when the failure is known to happen some time before the observed time, is also a special case of interval-censoring. Survival data may contain a mixture of uncensored, right-censored, left-censored, and interval-censored observations. In this talk, I will describe basic types of interval-censored data and demonstrate how to fit parametric survival models to these data using Stata's new stintreg command. I will also discuss postestimation features available after this command.
Xiao Yang
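The way the four observation types enter a parametric likelihood can be made concrete with a small Python sketch for the exponential model, where S(t) = exp(−λt). This is an illustration only and not the stintreg implementation; the coding of each observation as a (lower, upper) pair, with None for an unbounded side, is an assumption made for the example.

```python
import math


def exponential_loglik(rate, intervals):
    """Log likelihood of an exponential model for interval-type survival data.

    intervals: list of (lower, upper) pairs; None marks an unbounded side.
    """
    def surv(t):
        return math.exp(-rate * t)

    total = 0.0
    for lower, upper in intervals:
        if lower is None and upper is None:
            raise ValueError("at least one endpoint must be observed")
        if lower is not None and lower == upper:
            total += math.log(rate) - rate * lower          # exact time: log f(t)
        elif upper is None:
            total += -rate * lower                          # right-censored: log S(lower)
        elif lower is None:
            total += math.log(1.0 - surv(upper))            # left-censored: log F(upper)
        else:
            total += math.log(surv(lower) - surv(upper))    # interval-censored
    return total


obs = [(2.0, 2.0), (3.0, None), (None, 1.0), (1.0, 2.0)]
print(exponential_loglik(1.0, obs))
```

Setting the lower endpoint to 0 in the interval-censored case reproduces the left-censored contribution, which is the sense in which left-censoring is a special case of interval-censoring.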
2:20–2:40 Break
Stata implementation of alternative residual inclusion estimators for models with endogenous regressors
Abstract: Empirical analyses often require implementation of nonlinear models whose regressors include one or more endogenous variables – regressors that are correlated with the unobserved random component of the model. Failure to account for such correlation in estimation leads to bias and produces results that are not causally interpretable. Terza et al. (2008) discuss a relatively simple framework designed to explicitly account for such endogeneity – the residual inclusion (RI) framework. They also give the analytic details of a corresponding two-stage estimator that yields consistent parameter estimates in a wide variety of nonlinear regression contexts – two-stage residual inclusion (2SRI). The 2SRI estimates can be obtained using packaged Stata commands, but the corresponding asymptotically correct standard errors (ACSE) require some analytic derivation and Mata coding (for details, see Terza [2016]). In the proposed presentation, we will discuss two alternative estimation approaches for RI models with a view toward broadening the menu of Stata implementation options for users who may prefer not to program in Mata or may be inclined to avoid analytic derivations: generalized method of moments (GMM) (StataCorp 2015) and quasi-limited information maximum likelihood (QLIML) (Wooldridge 2014). GMM can be applied using the packaged Stata gmm command, and although it requires that the user supply analytic formulae for the relevant moment conditions, it frees the user from the Mata coding required by 2SRI for calculation of the ACSE. QLIML is implemented via the Mata optimize command, so it does require some knowledge of Mata coding. On the other hand, it does not place any analytic demands on the user for calculation of the parameter estimates or their ACSE. We will detail all three of these approaches and, in the context of an empirical example, give template code for their Stata implementation (including calculation of the ACSE).
We note that although the methods are essentially asymptotically equivalent, the methods yield different results in the context of our example. We offer analytic explanations for these differences. We also apply the methods to Monte Carlo simulated samples to further elucidate their implementation. Such simulations also serve to validate their large-sample properties and reveal aspects of finite sample performance. Because the methods are essentially asymptotically equivalent, we conclude that one’s choice of approach should depend solely on the user’s coding preferences and his or her proclivity for analytic derivation. We hope that this presentation will broaden Stata users’ access to this important class of models (the RI framework) for the specification and estimation of econometric models involving endogenous regressors.


Terza, J., A. Basu, and P. Rathouz. 2008. Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling. Journal of Health Economics 27: 531–543.

Terza, J. V. 2016. Simpler standard errors for two-stage optimization estimators. Stata Journal 16: 368–385.

StataCorp. 2015. Stata Release 14. Statistical software. College Station, TX: StataCorp LP.

Wooldridge, J. M. 2014. Quasi-maximum likelihood estimation and testing for nonlinear models with endogenous explanatory variables. Journal of Econometrics 182: 226–234.

Joseph Terza
Indiana University–Purdue University Indianapolis
David M. Drukker
Estimating treatment effects in the presence of correlated binary outcomes and contemporaneous selection
Abstract: Estimating the causal effect of a treatment is challenging when selection into the treatment is based on contemporaneous unobservable characteristics, and the outcome of interest is represented by a series of correlated binary outcomes. Under these assumptions, traditional nonlinear panel-data models, such as the random-effects logistic model, will produce biased estimates of the treatment effect because of correlation between the treatment variable and model unobservables. In this presentation, I will introduce a new Stata estimation command, etxtlogit, that can estimate a model where the outcome is a series of J-correlated logistic binary outcomes and selection into the treatment is based on contemporaneous unobservable characteristics. The presentation will introduce the new estimation command, present Monte Carlo evidence, and offer empirical examples. Special cases of the model will be discussed, including applications based on the explanatory (behavioral) Rasch model, a model from item response theory (IRT).
Matthew P. Rabbitt
Economic Research Service, U.S. Department of Agriculture
Wishes and grumbles


Seats are limited. Choose one of the options below. Lunch and refreshments are included in the registration fee.

                                   Price   Student price
Both days                          $195    $75
Day 1: Thursday, July 27, 2017     $125    $50
Day 2: Friday, July 28, 2017       $125    $50
Dinner (optional), July 27, 2017   $45

The optional users dinner will be at Chiapparelli’s on Thursday, July 27, at 6:30.

237 South High Street
Baltimore, MD 21202
Tel: 410-837-0309


The Renaissance Baltimore Harborplace Hotel is offering a special rate of $229 per night for Stata Conference attendees staying between July 26 and July 29, 2017. There is limited availability, so book your room soon to get the conference rate.



Renaissance Baltimore Harborplace Hotel
202 East Pratt Street
Baltimore, MD 21202

The conference venue is near several tourist attractions, including the USS Constellation and other vessels in the harbor, the American Visionary Art Museum, and the National Aquarium.

Scientific committee

Joe Canner (Chair)
Department of Surgery
Johns Hopkins University

John McGready
Department of Biostatistics
Johns Hopkins University

Austin Nichols
Abt Associates

Sharon Weinberg
Applied Statistics and Psychology
New York University






© Copyright 1996–2017 StataCorp LLC