Last updated: 20 August 2014

2014 Stata Conference Boston

31 July–1 August 2014


Omni Parker House
60 School Street
Boston, MA 02108


Do-it-yourself multiple imputation: Mode-effect correction in a public opinion survey

Stas Kolenikov
In this talk, I demonstrate how to build a multiple-imputation procedure from scratch. The motivating example comes from a public opinion survey in which the sampled respondents provided their responses on the web or by phone. As is known in the survey methodology literature, the presence of an interviewer on the phone produces higher reports of socially desirable behaviors, such as number of friends or political engagement, or lower reports of undesirable behaviors, such as illicit drug use. Treating these less accurate responses as partially missing data, I develop a non-standard multiple-imputation model that is driven by a concept of utility from choice and decision literature in economics. My implementation is designed to supply the data to Stata's mi suite, in the sense that I create the imputations, and mi combines them using Rubin's rules. Additionally, the workflow of the mode-effect detection features multiple testing corrections. It requires extensive post operations and the exchange of variable lists between the project's do-files, both of which I also demonstrate in this presentation.
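The combining step mentioned above can be sketched outside Stata. The following is a minimal Python illustration of Rubin's rules for a single scalar parameter, not the author's code and not the mi implementation; the function name and the three sets of numbers are hypothetical.

```python
from statistics import mean, variance

def rubin_combine(estimates, variances):
    """Pool M completed-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b        # total variance of the pooled estimate
    return q_bar, t

# Hypothetical results from M = 3 imputed datasets
est = [1.0, 1.2, 0.8]
var = [0.04, 0.05, 0.03]
pooled, total_var = rubin_combine(est, var)
```

The between-imputation term is what carries the extra uncertainty due to the imputed (here, mode-affected) responses.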

ctgov: A suite of Stata commands for reporting trial results to ClinicalTrials.gov

Phil Schumm
Department of Health Studies, University of Chicago
In response to the 1997 Food and Drug Administration Modernization Act, the National Institutes of Health established ClinicalTrials.gov, an online, publicly accessible registry for clinical trials. The 2007 Food and Drug Administration Amendments Act broadened the scope of eligible trials, added outcomes reporting as a requirement, and established penalties for non-compliance. Although ClinicalTrials.gov has increased the transparency with which clinical trials are conducted in the U.S. and opened up new possibilities for research using the information collected, additional resources, time, and effort are required to comply with this mandate. This presentation will introduce ctgov, a suite of Stata commands to facilitate the reporting of trial results. By using this tool, researchers will be able to generate results for automatic upload to ClinicalTrials.gov as they are doing their primary analyses, thereby eliminating much of the additional effort and ensuring that the results in ClinicalTrials.gov match those in the official publication or report. Although primarily of interest to clinical researchers, biostatisticians, and pharmaceutical companies, the approach taken by ctgov also has connections to work being done in the area of reproducible research.

Using Stata for educational accountability and compliance reporting

Billy Buchanan
Mississippi Department of Education
In 2013, the Mississippi State Legislature passed a law requiring the state to adopt a single combined statewide accountability system for schools and districts; the law also restricted the state from using some of the methods used in the accountability system of the time. Once the Mississippi Board of Education voted to adopt the proposed model, the next major task was to program all the business rules, requirements, and calculations. This presentation will focus on how that led to the current accountability system. Compared with other software, Stata let me remove much of the complexity of the previous accountability model. The current model uses 15 programs written in Stata to import data from an internal server, implement the rules specified in the business rules document, estimate the ratios required by the system, create graphs to illustrate school versus district versus state comparisons, and build school and district reports for public consumption. Using Stata's capabilities, we can generate reports by writing LaTeX source code and a Bash script used to compile and clean up the output from the LaTeX files. This saves considerable time.

Profile analysis

Phil Ender
UCLA Statistical Consulting Group
This presentation will discuss profile analysis, a multivariate method for examining differences in the shapes of profiles across groups. Profile analysis uses Stata’s manova command along with manovatest for estimation. This presentation will also demonstrate the user-written command profileplot to graphically display group profiles.

Computer simulation of patient flow through an operating suite

David Clark
Maine Medical Center, Portland, Maine
Operating room (OR) inefficiency is costly and stressful for patients and staff. To evaluate possible improvements, we simulated our OR and recovery room (RR) processes with Stata. We used hospital data (in long format) and parametric time-to-event regression (streg) to derive loglogistic distributions for the duration of procedures, RR stays, and room turnaround. Variables were then reshaped into a single row (wide format) for the simulation program. Patient and room status for a 24-hour day were changed sequentially using a forvalues loop with 5-minute steps. Scheduled and historical times were first used deterministically to recreate anticipated and actual events and durations. Patient observations were then replicated (using expand) with different pseudorandom parameters in each row. Distributions of patient length of stay in OR and RR (and room turnaround times) thus approximated theoretical input distributions. Refinements included reassigning cases if the scheduled room was running late, changing staff availability, and incorporating unscheduled emergencies. Summary statistics were compiled (using egen) for each case and the system as a whole and were consistent with historical data. Stata has some advantages over specialized simulation programs, especially for current Stata users. We plan to build a user interface, make other improvements, and share our program through RePEc.
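The stepping logic described above can be illustrated in a few lines. This is a rough Python sketch, not the authors' Stata program: it assumes a single room, a log-logistic duration drawn by inverse CDF, and hypothetical values for the scale, shape, and turnaround time.

```python
import random

STEP = 5                       # minutes per simulation step
TICKS = 24 * 60 // STEP        # 288 steps in a 24-hour day

def loglogistic_draw(rng, scale, shape):
    """Inverse-CDF draw from a log-logistic distribution whose median is `scale`."""
    u = rng.random()
    while u == 0.0:            # avoid a zero-length duration
        u = rng.random()
    return scale * (u / (1.0 - u)) ** (1.0 / shape)

def simulate_room(n_cases, rng, scale=90.0, shape=4.0, turnaround=30):
    """Run cases back to back in one room; return (start, end) minutes per case."""
    schedule = []
    busy_until = 0             # minute at which the room is ready again
    next_case = 0
    for tick in range(TICKS):
        now = tick * STEP
        if next_case < n_cases and now >= busy_until:
            end = now + loglogistic_draw(rng, scale, shape)
            schedule.append((now, end))
            busy_until = end + turnaround
            next_case += 1
    return schedule

rng = random.Random(2014)      # fixed seed so that a run is reproducible
day = simulate_room(n_cases=4, rng=rng)
```

Repeating the run with different seeds gives a distribution of start and end times, which is the role played by the replicated patient rows in the abstract.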

Stata hybrids: Updates and ideas

James Fiedler
Universities Space Research Association
At last year’s Stata conference, I presented projects that facilitate the combined use of Stata and Python. One project provides the ability to use Python within Stata via a C plugin. The other project provides a custom Python class that can be used to open, modify, and save Stata datasets. In this talk, I will begin by describing some modifications and extensions to these projects. I will then present a few new ideas for useful combinations of Stata with other tools. Some of these ideas can be realized using the Python projects above, some using JavaScript and a web browser.

Mata routines for solution of nonlinear systems using interval methods

Matthew Baker
Hunter College and the Graduate Center, CUNY
Solution of nonlinear systems has become increasingly important as a step in many estimation problems and is a problem of interest in its own right. I introduce a collection of Mata routines that can be used to find all solutions to nonlinear equation systems and demonstrate their usage on a sequence of test problems. While specifically tailored to solving polynomial systems, the method can be applied to any continuous system with continuous Jacobian. The methods rely on interval Newton methods, a technique that combines Taylor expansion, bisection, and interval programming. The routines come equipped with a heuristic solver that allows for approximate solution of problems that are especially time consuming or problems that do not require that all solutions be found. Support tools for the solver include functions for interval arithmetic and the manipulation of a series of matrices in parallel. I discuss an extended application of the solution tools to the problem of finding all equilibria of discrete action games, which in general requires solving polynomial systems.
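To give a flavor of the interval Newton idea in one dimension, here is a small Python sketch. It is not the Mata routines from the talk: the example function x^2 - 2, the helper names, and the tolerance are assumptions, and ordinary floating-point arithmetic is used without the outward rounding that a rigorous interval library would apply.

```python
def f(x):
    return x * x - 2.0

def f_range(lo, hi):
    """Enclosure of f over [lo, hi], using interval arithmetic for x^2 - 2."""
    sq = (lo * lo, hi * hi)
    sq_lo = 0.0 if lo <= 0.0 <= hi else min(sq)
    return sq_lo - 2.0, max(sq) - 2.0

def df_range(lo, hi):
    """Enclosure of f'(x) = 2x over [lo, hi]."""
    return 2.0 * lo, 2.0 * hi

def all_roots(lo, hi, tol=1e-10):
    roots, stack = [], [(lo, hi)]
    while stack:
        lo, hi = stack.pop()
        f_lo, f_hi = f_range(lo, hi)
        if f_lo > 0.0 or f_hi < 0.0:
            continue                          # f cannot vanish on this box
        if hi - lo < tol:
            roots.append(0.5 * (lo + hi))     # box is small enough: report it
            continue
        mid = 0.5 * (lo + hi)
        d_lo, d_hi = df_range(lo, hi)
        if d_lo <= 0.0 <= d_hi:
            stack += [(lo, mid), (mid, hi)]   # derivative may vanish: bisect
            continue
        # Newton step: intersect the box with mid - f(mid) / [d_lo, d_hi]
        q = (f(mid) / d_lo, f(mid) / d_hi)
        new_lo, new_hi = max(lo, mid - max(q)), min(hi, mid - min(q))
        if new_lo <= new_hi:
            stack.append((new_lo, new_hi))
    return sorted(roots)

roots = all_roots(-3.0, 3.0)
```

Boxes on which the function enclosure excludes zero are discarded, which is how the method can claim to have found all solutions in the starting interval.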

Making interactive online visualizations with stata2leaflet and stata2d3

Robert Grant
St. George’s Medical School, University of London
The last three years have seen explosive growth in the variety and sophistication of interactive online graphics. These are mostly implemented in the web language JavaScript, with the D3 (Data-Driven Documents) library being the most popular and flexible at present. Leaflet is a mapping library also being widely used. R users have some packages that translate their data and specifications into interactive graphics and maps; these packages write a text file containing the HTML and JavaScript instructions that make up a webpage containing the desired visualization. This translation into a webpage is easily achieved in Stata, and I will present the stata2leaflet command which produces zoomable, clickable online maps. Contemporary interactive graphs benefit from allowing the viewer to filter and select data of interest, which is a second layer of specification implemented in the stata2d3 commands. stata2d3 capitalizes on the consistency of Stata graph syntax by parsing and translating a standard Stata graph command into a webpage. Users can choose to include explanatory comments against each line in the source code; these are invisible to viewers but help them learn HTML and JavaScript and make further refinements.

Classification using random forests in Stata and R

Linden McBride
Cornell University
Many estimation problems focus on classification of cases (into bins) with tools that aim to identify cases using only a small subset of all possible questions. These tools can be used in diagnoses of disease, identification of advanced or failing students using tests, or classification into poor and nonpoor for the targeting of a means-tested social program. Most popular estimation procedures for generating these tools prioritize minimization of in-sample prediction errors, but the objective in generating such tools is the minimization of out-of-sample prediction errors. We provide a comparison of linear discriminant, discrete choice, and random forest methods, with applications to means-tested social programs. Out-of-sample prediction error is typically minimized by random forest algorithms.

A midas retouch regarding diagnostic meta-analysis

Ben Dwamena
University of Michigan
The talk describes recent updates for midas, a comprehensive and medically popular program for diagnostic test accuracy meta-analysis. A major change is that midas is now an estimation command and a wrapper for meglm in Stata 13. The update allows more flexibility for specifying covariance structures, link functions other than logit, more extensive postestimation options and specification of starting values (especially with sparse data), and a choice between univariate (independent) and bivariate (correlated) modeling of sensitivity and specificity.

Nonstandard deviation: Making the global local

Marcello Pagano
Harvard School of Public Health
In October 2012, HarvardX, through edX, offered its first two online courses. One of these was PH207X: Health in Numbers. The course covered biostatistics and epidemiology at an introductory level and lasted 12 weeks. 60,000 students later, we have exposed more students to those disciplines than we could have over the next 250 years with typical brick and mortar teaching. To do this, we had to have a statistical package, and we chose Stata. This talk will cover some of what we learned from the experience.

Transformation survival models

Yulia Marchenko
StataCorp LP
The Cox proportional hazards model is one of the most popular methods for analyzing survival or failure-time data. The key assumption underlying the Cox model is that of proportional hazards. This assumption may often be violated in practice. Transformation survival models extend the Cox regression methodology to allow for nonproportional hazards. They represent the class of semiparametric linear transformation models, which relates an unknown transformation of the survival time linearly to covariates. In my presentation, I will describe these models and demonstrate how to fit them in Stata.

Generalized quantile regression in Stata

David Powell
Quantile regression techniques are useful in understanding the relationship between explanatory variables and the conditional distribution of the outcome variable, which allows the parameters of interest to vary based on a nonseparable disturbance term. Additional covariates may be necessary or simply desirable for identification, but including additional variables in a conditional quantile model separates the disturbance term, which alters the underlying structural model. To address this problem, Powell (2013) introduces the Generalized Quantile Regression (GQR) estimator, which provides the impact of the treatment variables on the outcome distribution and allows for conditioning on control variables without altering the interpretation of the estimates. Quantile regression and instrumental-variable quantile regression are special cases of GQR, but GQR allows for more flexible estimation of quantile treatment effects. We can easily extend the estimator to include instrumental variables and panel data. We introduce a Stata command, gqr, that implements a GMM-based GQR estimator. User-specified options for the command include the usual panel data options and allow the user to control for endogeneity in explanatory variables by using instruments. The command offers users different means of characterizing standard errors of estimated parameters, including both direct methods and Markov chain Monte Carlo simulation.

Small multiples, or the science and art of combining graphs

Nicholas J. Cox
Durham University
Good graphics often exploit one simple graphical design that is repeated for different parts of the data, which Edward R. Tufte dubbed small multiples. In Stata, small multiples are supported for different subsets of the data with by() or over() options of many graph commands; users can easily emulate this in their own programs by writing wrapper programs that call twoway or graph bar and its siblings. Otherwise, specific machinery offers repetition of a design for different variables, such as the (arguably much under-used) graph matrix command. Users can always put together their own composite graphs by saving individual graphs and then combining them. This presentation offers further modest automation of the same design repeated for different data. Three general programs allow small multiples in different ways. sparkline, also inspired by Tufte but using a centuries-old design popular in many sciences, is most suitable for multiple time series, yet it also has other applications. crossplot offers a simple student-friendly graph matrix for each y and each x variable specified, which is more general than a scatterplot matrix. combineplot is a command for combining univariate or bivariate plots for different variables.

Measuring mobility

Austin Nichols
Urban Institute
I review various measures of mobility using panel data, with applications to measuring economic or social mobility in survey data. I demonstrate a variety of approaches.

bipolate: A Stata command for bivariate interpolation with particular application to 3D graphing

Joseph Canner
Johns Hopkins University School of Medicine
Stata has a variety of flexible commands for graphing in two dimensions; however, it has few options for graphing in three dimensions. The user-written surface command by Adrian Mander, available from SSC, attempts to fill this gap, providing both 3D wire-frame plots and dropline plots. However, when some (x,y) combinations do not have a corresponding z-value, the graphs produced by surface are often unintelligible. SAS addresses this problem with PROC G3GRID, which creates a dataset of interpolated values, providing a smooth surface plot when used as input for PROC G3D. The default method of interpolation used by PROC G3GRID was proposed by Hiroshi Akima in 1978. To reproduce this functionality in Stata, we used a publicly available Fortran implementation of Akima's method. We converted these Fortran subroutines into Mata and created the Stata command bipolate to interface with these subroutines. The bipolate command contains options for interpolating z-values at all possible combinations of the specified x- and y-values and for specifying specific (x,y) combinations at which to interpolate. There is also an option for handling multiple z-values for a given (x,y). Examples will be provided to illustrate the use of surface, with and without bipolate, and to illustrate various bipolate options.

Dialog-driven event study using Stata (Cancelled)

Chuntao Li
Zhongnan University of Economics and Law
We present our user-written ado program, eventstudy. This package allows users to perform large-scale event studies with market models such as the CAPM. The program is written with Stata's dialog command and is menu driven. Users simply feed the black box the key settings for the event study, and the program automatically performs the complex procedure.

Estimating average treatment effects from observational data using teffects

David Drukker
StataCorp LP
After reviewing the potential-outcome framework for estimating treatment effects from observational data, I will discuss how to estimate the average treatment effect and the average treatment effect on the treated by the regression-adjustment estimator, the inverse-probability-weighted estimator, two doubly robust estimators, and two matching estimators implemented in teffects.
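As a rough illustration of one of the estimators listed, the following Python sketch computes a normalized inverse-probability-weighted estimate of the average treatment effect. It is not the teffects implementation: it assumes the propensity scores are already known, and the function name and the four observations are hypothetical.

```python
def ipw_ate(y, t, p):
    """Normalized IPW estimate of the average treatment effect.

    y: outcomes, t: treatment indicators (0 or 1), p: propensity scores P(t = 1 | x).
    """
    w1 = [ti / pi for ti, pi in zip(t, p)]               # weights for treated units
    w0 = [(1 - ti) / (1 - pi) for ti, pi in zip(t, p)]   # weights for control units
    mean1 = sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
    mean0 = sum(w * yi for w, yi in zip(w0, y)) / sum(w0)
    return mean1 - mean0

# Hypothetical data: two treated and two control observations
y = [3.0, 5.0, 2.0, 4.0]
t = [1, 1, 0, 0]
p = [0.8, 0.4, 0.5, 0.2]
ate = ipw_ate(y, t, p)
```

Treated units that were unlikely to be treated, and controls that were likely to be treated, receive the largest weights, which is what rebalances the two groups.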

Optimal interval design for phase I oncology clinical trials

Bryan Fellman
MD Anderson Cancer Center
The optimal interval design is a novel phase I trial design for finding the maximum tolerated dose (MTD). The optimal interval design casts dose finding as a sequential decision-making problem for assigning an appropriate dose for each enrolled patient. The design optimizes the assignment of doses to patients by minimizing incorrect decisions of dose escalation or deescalation, that is, erroneously escalating (or deescalating) the dose when the current dose is actually higher (or lower) than the MTD. This feature of the optimal interval design strongly ensures adherence to ethical standards. In addition, because the optimal dose assignment tends to treat patients at (or close to) the MTD, at the end of the trial, this design will be able to select the MTD with a high probability since most data and statistical power are concentrated around the MTD. This presentation will briefly cover the methods of the design and demonstrate a command that implements them in a clinical setting.

Distributed computations in Stata

Michael Lokshin
Sergiy Radyakin
Development Economics Research Group, The World Bank
Many complex tasks in simulation modeling and estimation strain the available computational resources. Often these tasks consist of a number of separable iterations that can be performed in parallel, independently of each other. Earlier approaches (e.g., PARALLEL, 2013) were limited to execution in parallel sessions on a single machine. We are developing a system that can be run on an MS Windows network, with automatic registration and deregistration of computing nodes (each running Stata), a task scheduler, and a results aggregator. A multiple-machine networked approach allows greater scale and ultimately higher performance.

Binned scatterplots: Introducing binscatter and exploring its applications

Michael Stepner
binscatter is a new program that produces binned scatterplots, which provide a nonparametric estimate of a conditional expectation function. This presentation will describe the features of binscatter and explore its versatile applications. Those applications include: observing the relationship between two variables in large datasets, visualizing OLS regressions, visualizing regression-discontinuity designs, plotting event studies, and visualizing IV regressions. The presentation will demonstrate how binscatter can be used to complement the empirical techniques most commonly used in applied economic research.
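The core computation behind a binned scatterplot is simple enough to sketch. The Python fragment below groups observations into equal-sized bins of x and reports the mean of x and y within each bin; it is an illustration only, and binscatter itself may form bins and handle ties differently. The function name and the data are hypothetical.

```python
from statistics import mean

def binned_means(x, y, n_bins=20):
    """Sort by x, cut into n_bins equal-sized groups, and average x and y in each."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    points = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:                                  # skip empty bins when n < n_bins
            points.append((mean(p[0] for p in chunk),
                           mean(p[1] for p in chunk)))
    return points

# Hypothetical data: y is an exact linear function of x
x = list(range(100))
y = [2 * v + 1 for v in x]
points = binned_means(x, y, n_bins=4)
```

Plotting these few points in place of every observation is what makes the shape of the conditional mean visible in a large dataset.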

Floating-point numbers: A visit through the looking glass

William Gould
StataCorp LP
In lieu of his usual Report to users, Bill Gould will talk on floating-point numbers. Researchers do not adequately appreciate that floating-point numbers are a simulation of real numbers and, as with all simulations, some features are preserved while others are not. When writing code, or even do-files, treating the computer's floating-point numbers as if they were real numbers can lead to substantive problems and to numerical inaccuracy. In this, the relationship between computers and real numbers is not entirely unlike the relationship between tea and Douglas Adams's Nutri-Matic drink dispenser. The Nutri-Matic produces a concoction that is "almost, but not quite, entirely unlike tea." Gould shows what the universe would be like if it were implemented in floating-point rather than in real numbers. The floating-point universe turns out to be nothing like the real universe and probably could not be made to function. Without jargon and without resort to binary, Gould shows how floating-point numbers are implemented on an imaginary base-10 computer and quantifies the kinds of errors that can arise. In this, floating-point subtraction stands out as really being almost, but not quite, entirely unlike subtraction. Gould shows how to work around such problems. The point of the talk is to build your intuition about the floating-point world so that you as a researcher can predict when calculations might go awry, know how to think about the problem, and determine how to fix it.
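The subtraction problem can be reproduced on an imaginary base-10 machine with Python's decimal module. This is an illustrative sketch, not material from the talk; the four-digit precision and the input values are assumptions chosen to make the effect visible.

```python
from decimal import Context

# An imaginary base-10 computer that keeps 4 significant digits
ctx = Context(prec=4)

a = ctx.create_decimal("1.2346")   # stored as 1.235
b = ctx.create_decimal("1.2341")   # stored as 1.234
diff = ctx.subtract(a, b)          # 0.001, although the exact difference is 0.0005

# The same effect in ordinary double precision
big = 1e16
lost = (big + 1.0) - big           # 0.0, not 1.0: the added 1 was rounded away
```

Each stored value is accurate to four digits, yet their difference is off by a factor of two, because the digits that agreed cancel and only the rounding error remains.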

Scientific organizers

Stephen Soldz (chair), Boston Graduate School of Psychoanalysis

Christopher F. Baum, Boston College

Marcello Pagano, Harvard School of Public Health

Logistics organizers

Nathan Bishop, StataCorp

Chris Farrar, StataCorp

Gretchen Farrar, StataCorp