Stas Kolenikov

Abt SRBI

In this talk, I demonstrate how to build a multiple-imputation
procedure from scratch. The motivating example
comes from a public opinion survey in which the sampled respondents
provided their responses on the web or by phone. As is known in the survey
methodology literature, the presence of an interviewer on the phone produces
higher reports of socially desirable behaviors, such as number of
friends or political engagement, or lower reports of undesirable
behaviors, such as illicit drug use. Treating these less accurate
responses as partially missing data, I develop a non-standard multiple-imputation
model that is driven by a concept of utility from choice and
decision literature in economics. My implementation is designed to supply
the data to Stata's **mi** suite, in the sense that I create the
imputations, and **mi** combines them using Rubin's rules. Additionally,
the workflow of the mode-effect detection features multiple-testing
corrections. It requires extensive **post** operations and the exchange
of variable lists between the do-files of the project,
both of which I also demonstrate in this presentation.
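
The combining step that **mi** performs can be illustrated outside Stata. The following is a minimal Python sketch of Rubin's rules for a single scalar parameter; the function name `rubin_combine` and the example numbers are hypothetical and are not taken from the talk.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b        # total variance of the pooled estimate
    return q_bar, t


# Hypothetical estimates of one coefficient from m = 3 imputed datasets.
pooled, total_var = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total_var)
```

The between-imputation term is what carries the extra uncertainty due to the imputed responses; with identical estimates across imputations, the total variance reduces to the within-imputation variance.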

**Additional information**

boston14_kolenikov.pdf

Phil Schumm

Department of Health Studies, University of Chicago

In response to the 1997 Food and Drug Administration
Modernization Act, the National Institutes of Health
established ClinicalTrials.gov, an online, publicly accessible registry
for clinical trials. The 2007 Food and Drug Administration Amendments
Act broadened the scope of eligible trials, added outcomes
reporting as a requirement, and established penalties for
non-compliance. Although ClinicalTrials.gov increased the transparency
with which clinical trials are conducted in the U.S. and opened up new
possibilities for research using the information collected, additional
resources, time, and effort are required to comply with this mandate.
This presentation will introduce **ctgov**, a suite of Stata commands to
facilitate the reporting of trial results. By using this tool,
researchers will be able to generate results for automatic upload to
ClinicalTrials.gov as they are doing their primary
analyses, thereby eliminating much of the additional effort and ensuring
that the results in ClinicalTrials.gov match those in the official
publication or report. Although primarily of interest to clinical
researchers, biostatisticians, and pharmaceutical companies, the approach
taken by **ctgov** also has connections to work being done in the area of
reproducible research.

**Additional information**

boston14_schumm.pdf

Billy Buchanan

Mississippi Department of Education

In 2013, the Mississippi State Legislature passed a law
requiring the state to adopt a single combined statewide accountability
system for schools and districts; the law also restricted
the state from using some of the methods used in the
accountability system of the time. Once the Mississippi Board of Education
voted to adopt the proposed model, the next major task was to program
all the business rules, requirements, and calculations. This
presentation will focus on how that programming work led to the
current accountability system. Compared with other software, Stata let me
reduce much of the complexity of the previous accountability model.
The current model uses 15 programs
written in Stata to import data from an internal server, implement the
rules specified in the business rules document, estimate the ratios
required by the system, create graphs to illustrate school versus district
versus state comparisons, and build school and district reports for public
consumption. We generate the reports from Stata
by writing LaTeX source code and a Bash script that
compiles the LaTeX files and cleans up the output, which saves
considerable time.

**Additional information**

boston14_buchanan.pdf

Phil Ender

UCLA Statistical Consulting Group

This presentation will discuss profile analysis, a multivariate method for examining differences in the
shapes of profiles across groups. Profile analysis uses Stata’s **manova**
command along with **manovatest** for estimation. This presentation will also
demonstrate the user-written command **profileplot** to graphically
display group profiles.

**Additional information**

boston14_ender.pdf

David Clark

Maine Medical Center, Portland, Maine

Operating room (OR) inefficiency is costly and stressful for
patients and staff. To evaluate possible improvements, we simulated our
OR and recovery room (RR) processes with Stata. We used hospital data
(in long format) and parametric time-to-event regression (**streg**) to
derive loglogistic distributions for the duration of procedures, RR
stays, and room turnaround. Variables were then reshaped into a single
row (wide format) for the simulation program. Patient and room status
for a 24-hour day were changed sequentially using a **forvalues** loop with
5-minute steps. Scheduled and historical times were first used
deterministically to recreate anticipated and actual events and
durations. Patient observations were then replicated (using **expand**) with
different pseudorandom parameters in each row. Distributions of patient
length of stay in OR and RR (and room turnaround times) thus
approximated theoretical input distributions. Refinements included
reassigning cases if the scheduled room was running late, changing staff
availability, and incorporating unscheduled emergencies. Summary
statistics were compiled (using **egen**) for each case and the system as a
whole and were consistent with historical data. Stata has some
advantages over specialized simulation programs, especially for current
Stata users. We plan to build a user interface, make other improvements,
and share our program through RePEc.
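
The stepping logic can be sketched in a few lines. The Python example below is only an illustration of the idea, not the authors' Stata program: it draws loglogistic durations by inverting the distribution function and advances one room through a day in 5-minute steps. The medians and shape parameters are made-up values, and the function names are hypothetical.

```python
import random


def draw_loglogistic(rng, median, shape):
    """Inverse-CDF draw from F(t) = 1 / (1 + (t / median) ** -shape)."""
    u = rng.random()
    while u == 0.0:                # avoid a zero duration at the boundary
        u = rng.random()
    return median * (u / (1.0 - u)) ** (1.0 / shape)


def simulate_room(rng, n_cases, step=5, day_minutes=24 * 60):
    """Step one room through a day; return (start, end) minutes for each case."""
    schedule = []
    busy_until = 0.0               # time at which the room is free again
    for minute in range(0, day_minutes, step):
        if len(schedule) == n_cases:
            break
        if minute >= busy_until:   # room is free, so the next case starts
            procedure = draw_loglogistic(rng, median=90.0, shape=4.0)
            turnaround = draw_loglogistic(rng, median=25.0, shape=5.0)
            schedule.append((minute, minute + procedure))
            busy_until = minute + procedure + turnaround
    return schedule


rng = random.Random(2014)
for start, end in simulate_room(rng, n_cases=4):
    print(f"case starts at minute {start}, ends at minute {end:.0f}")
```

Repeating the run with different seeds plays the role of the replicated patient rows with different pseudorandom parameters described above.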

**Additional information**

boston14_clark.ppt

James Fiedler

Universities Space Research Association

At last year’s Stata conference, I presented projects that
facilitate the combined use of Stata and Python. One project provides
the ability to use Python within Stata via a C plugin. The other project
provides a custom Python class that can be used to open, modify, and
save Stata datasets. In this talk, I will begin by describing some
modifications and extensions to these projects. I will then present a
few new ideas for useful combinations of Stata with other tools. Some of
these ideas can be realized using the Python projects above, some using
JavaScript and a web browser.

**Additional information**

boston14-fiedler.pdf

Matthew Baker

Hunter College and the Graduate Center, CUNY

Solution of nonlinear systems has become increasingly
important as a step in many estimation problems and is a problem of
interest in its own right. I introduce a collection of Mata routines
that can be used to find all solutions to nonlinear equation systems
and demonstrate their usage on a sequence of test problems. While
specifically tailored to solving polynomial systems, the method can be
applied to any continuous system with continuous Jacobian. The methods
rely on interval Newton methods, a technique that combines Taylor
expansion, bisection, and interval programming. The routines come
equipped with a heuristic solver that allows for approximate solution
of problems that are especially time consuming or problems that do not
require that all solutions be found. Support tools for the solver
include functions for interval arithmetic and the manipulation of a series of
matrices in parallel. I discuss an extended application of the solution
tools to the problem of finding all equilibria of discrete action games,
which in general requires solving polynomial systems.
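
To give a flavor of the technique, here is a one-dimensional Python sketch of an interval Newton search for all real roots of a polynomial on an interval. It is not the Mata code from the talk: the function names are hypothetical, it assumes simple (non-repeated) roots, and it uses ordinary floating-point arithmetic without outward rounding, so it is illustrative rather than rigorous.

```python
def imul(a, b):
    """Product of two intervals given as (lo, hi) tuples."""
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(p), max(p))


def horner_interval(coefs, x):
    """Enclose a polynomial (highest-degree coefficient first) over interval x."""
    acc = (coefs[0], coefs[0])
    for c in coefs[1:]:
        acc = imul(acc, x)
        acc = (acc[0] + c, acc[1] + c)
    return acc


def horner(coefs, t):
    """Evaluate the polynomial at a single point t."""
    acc = 0.0
    for c in coefs:
        acc = acc * t + c
    return acc


def all_roots(coefs, lo, hi, tol=1e-6):
    """Return approximate real roots in [lo, hi], assuming simple roots."""
    degree = len(coefs) - 1
    deriv = [c * (degree - i) for i, c in enumerate(coefs[:-1])]
    found, stack = [], [(lo, hi)]
    while stack:
        a, b = stack.pop()
        f_lo, f_hi = horner_interval(coefs, (a, b))
        if f_lo > 0 or f_hi < 0:
            continue                          # enclosure excludes zero: no root
        if b - a < tol:
            found.append(0.5 * (a + b))       # box is small enough to report
            continue
        m = 0.5 * (a + b)
        d_lo, d_hi = horner_interval(deriv, (a, b))
        if d_lo > 0 or d_hi < 0:
            # Newton step N = m - f(m) / F'(X), then intersect N with X.
            fm = horner(coefs, m)
            q = (fm / d_lo, fm / d_hi)
            a2, b2 = max(a, m - max(q)), min(b, m - min(q))
            if a2 > b2:
                continue                      # empty intersection: no root
            if b2 - a2 < 0.5 * (b - a):
                stack.append((a2, b2))        # box shrank enough: keep iterating
                continue
        stack.append((a, m))                  # otherwise bisect
        stack.append((m, b))
    roots = []
    for r in sorted(found):                   # merge boxes that share a root
        if not roots or r - roots[-1] > 10 * tol:
            roots.append(r)
    return roots


# (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6
print(all_roots([1.0, -6.0, 11.0, -6.0], 0.0, 4.0))
```

Bisection handles boxes where the derivative enclosure contains zero, and the Newton step either discards a box, shrinks it, or leaves it for further bisection; this is the combination of bisection and interval programming mentioned above, reduced to a single equation.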

**Additional information**

boston14_baker.pdf

Robert Grant

St. George’s Medical School, University of London

The last three years have seen explosive growth in the
variety and sophistication of interactive online graphics. These are
mostly implemented in the web language JavaScript, with the D3 (Data-Driven
Documents) library being the most popular and flexible at
present. Leaflet is a mapping library also being widely used. R users
have some packages that translate their data and specifications into
interactive graphics and maps; these packages write a text file containing the
HTML and JavaScript instructions that make up a webpage containing the
desired visualization. This translation into a webpage is easily
achieved in Stata, and I will present the **stata2leaflet** command, which
produces zoomable, clickable online maps. Contemporary interactive
graphs benefit from allowing the viewer to filter and select data of
interest, which is a second layer of specification implemented in the
**stata2d3** commands. **stata2d3** capitalizes on the consistency of Stata
graph syntax by parsing and translating a standard Stata graph command
into a webpage. Users can choose to include explanatory comments
against each line in the source code; these are invisible to viewers of the
page but help users learn HTML and JavaScript and make further refinements.
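
The translation step amounts to writing a text file. As a rough Python sketch of that idea (not the **stata2leaflet** implementation), the code below fills an HTML template with point data and Leaflet calls. The function name `build_map_page`, the example coordinates, and the choice of the unpkg CDN and OpenStreetMap tiles are assumptions made for illustration.

```python
import html
import json

# Page template; doubled braces are literal braces after str.format().
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.9.4/dist/leaflet.css">
<script src="https://unpkg.com/leaflet@1.9.4/dist/leaflet.js"></script>
</head>
<body>
<div id="map" style="height: 500px"></div>
<script>
var points = {points};
var map = L.map('map').setView([{lat}, {lon}], {zoom});
L.tileLayer('https://tile.openstreetmap.org/{{z}}/{{x}}/{{y}}.png', {{
  attribution: '&copy; OpenStreetMap contributors'
}}).addTo(map);
points.forEach(function (p) {{
  L.marker([p.lat, p.lon]).addTo(map).bindPopup(p.label);
}});
</script>
</body>
</html>
"""


def build_map_page(rows, zoom=10):
    """Return HTML for a zoomable map with one clickable marker per row."""
    if not rows:
        raise ValueError("need at least one point")
    points = [
        # Labels are HTML-escaped because Leaflet popups render HTML.
        {"lat": float(lat), "lon": float(lon), "label": html.escape(str(label))}
        for lat, lon, label in rows
    ]
    center_lat = sum(p["lat"] for p in points) / len(points)
    center_lon = sum(p["lon"] for p in points) / len(points)
    return PAGE.format(points=json.dumps(points), lat=center_lat,
                       lon=center_lon, zoom=int(zoom))


# Hypothetical data: (latitude, longitude, popup label).
page = build_map_page([(42.36, -71.06, "Boston"), (41.88, -87.63, "Chicago")], zoom=5)
print(page[:60])
```

Saving the returned string to a file such as `map.html` and opening it in a browser gives the map; the data travel inside the page as a JavaScript array.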

**Additional information**

boston14_grant.pdf

Linden McBride

Cornell University

Many estimation problems focus on classification of cases
(into bins) with tools that aim to identify cases using only a small
subset of all possible questions. These tools can be used in diagnoses
of disease, identification of advanced or failing students using tests,
or classification into poor and nonpoor for the targeting of a
means-tested social program. Most popular estimation procedures for
generating these tools prioritize minimization of in-sample prediction
errors, but the objective in generating such tools is the minimization
of out-of-sample prediction errors. We provide a comparison of linear
discriminant, discrete choice, and random forest methods, with
applications to means-tested social programs. Out-of-sample prediction
error is typically minimized by random forest algorithms.
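
The gap between in-sample and out-of-sample error is easy to reproduce. The Python sketch below uses simulated data and a one-nearest-neighbor rule, which is a stand-in chosen for brevity and is not one of the methods compared in the talk; all names and numbers are hypothetical.

```python
import random


def make_data(rng, n):
    """Hypothetical data: poverty status depends on one score plus noise."""
    rows = []
    for _ in range(n):
        score = rng.random()
        poor = 1 if score + rng.gauss(0.0, 0.15) < 0.5 else 0
        rows.append((score, poor))
    return rows


def predict_1nn(train, score):
    """Predict the label of the training case with the closest score."""
    return min(train, key=lambda row: abs(row[0] - score))[1]


def error_rate(train, rows):
    """Share of rows misclassified by the rule built from the training data."""
    wrong = sum(1 for score, poor in rows if predict_1nn(train, score) != poor)
    return wrong / len(rows)


rng = random.Random(2014)
data = make_data(rng, 400)
train, holdout = data[:300], data[300:]
in_sample = error_rate(train, train)        # each case is its own nearest neighbor
out_of_sample = error_rate(train, holdout)  # cases the rule has never seen
print(in_sample, out_of_sample)
```

The rule classifies its own training cases perfectly, so only the held-out error says anything about how it would perform when targeting new households.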

**Additional information**

boston14_mcbride.pdf

Ben Dwamena

University of Michigan

The talk describes recent updates for **midas**, a comprehensive
and medically popular program for diagnostic test accuracy
meta-analysis. A major change is that **midas** is now an estimation command
and a wrapper for **meglm** in Stata 13. The update allows more flexibility
for specifying covariance structures, link functions other than logit,
more extensive postestimation options, specification of starting
values (especially with sparse data), and a choice between
univariate (independent) and bivariate (correlated) modeling of
sensitivity and specificity.

**Additional information**

boston14_dwamena.pdf

Marcello Pagano

Harvard School of Public Health

In October 2012, HarvardX, through edX, offered its first two
online courses. One of these was *PH207X: Health in Numbers*. The
course covered biostatistics and epidemiology at an introductory level
and lasted 12 weeks. 60,000 students later, we have exposed more students
to those disciplines than we could have over the next 250 years with
typical brick-and-mortar teaching. To do this, we had to have a
statistical package, and we chose Stata. This talk will cover some of
what we learned from the experience.

**Additional information**

boston14_pagano.pdf

Yulia Marchenko

StataCorp LP

The Cox proportional hazards model is one of the most popular
methods for analyzing survival or failure-time data. The key assumption
underlying the Cox model is that of proportional hazards. This
assumption may often be violated in practice. Transformation survival
models extend the Cox regression methodology to allow for
nonproportional hazards. They represent the class of semiparametric
linear transformation models, which relates an unknown transformation of
the survival time linearly to covariates. In my presentation, I will
describe these models and demonstrate how to fit them in Stata.
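
As a hedged sketch of the model class (the notation here is assumed, not taken from the talk), a semiparametric linear transformation model is often written as

$$
H(T) = -\beta' Z + \varepsilon,
$$

where $T$ is the survival time, $Z$ is the covariate vector, $H$ is an unspecified increasing function, and $\varepsilon$ is an error term with a known distribution. An extreme-value distribution for $\varepsilon$ gives the Cox proportional hazards model, while a standard logistic distribution gives the proportional odds model; sign conventions vary across references.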

**Additional information**

boston14_marchenko.pdf

David Powell

RAND

Quantile regression techniques are useful in understanding the
relationship between explanatory variables and the conditional
distribution of the outcome variable, which allows the parameters of
interest to vary based on a nonseparable disturbance term. Additional
covariates may be necessary or simply desirable for identification, but
including additional variables in a conditional quantile model
separates the disturbance term, which alters the underlying structural
model. To address this problem, Powell (2013) introduces the Generalized
Quantile Regression (GQR) estimator, which provides the impact of the
treatment variables on the outcome distribution and allows for
conditioning on control variables without altering the interpretation of
the estimates. Quantile regression and instrumental-variable quantile
regression are special cases of GQR, but GQR allows for more flexible
estimation of quantile treatment effects. We can easily extend the estimator
to include instrumental variables and panel data. We introduce
a Stata command, **gqr**, that implements a GMM-based GQR estimator.
User-specified options for the command include the usual panel-data
options and allow the user to control for endogeneity in
explanatory variables by using instruments. The command offers
users different means of characterizing the standard errors of
estimated parameters, including both direct methods and
Markov chain Monte Carlo simulation.

**Additional information**

boston14_powell.pdf

Nicholas J. Cox

Durham University

Good graphics often exploit one simple graphical design that is repeated
for different parts of the data, which Edward R. Tufte dubbed the use of
small multiples. In Stata, small multiples are supported for different
subsets of the data with **by()** or **over()** options of many graph
commands; users can easily emulate this in their own programs by writing
wrapper programs that call **twoway** or **graph bar** and its siblings.
Otherwise, specific machinery offers repetition of a design for different
variables, such as the (arguably much under-used) **graph matrix** command.
Users can always put together their own composite
graphs by saving individual graphs and then combining them.
This presentation offers further modest automation of
the same design repeated for different data. Three general
programs allow small multiples in different ways. **sparkline**, also inspired
by Tufte but using a centuries-old design popular in many
sciences, is most suitable for multiple time series, yet it also has
other applications. **crossplot** offers a simple student-friendly graph
matrix for each *y* and each *x* variable specified, which is more general
than a scatterplot matrix. **combineplot** is a command for
combining univariate or bivariate plots for different variables.

**Additional information**

boston14_cox.ppt

boston14_cox.do

Austin Nichols

Urban Institute

I review various measures of mobility using panel data,
with applications to measuring economic or social mobility in survey
data. I demonstrate a variety of approaches.

**Additional information**

boston14_nichols.pdf

Joseph Canner

Johns Hopkins University School of Medicine

Stata has a variety of flexible commands for graphing in two
dimensions; however, it has few options for graphing in three
dimensions. The user-written **surface** command by Adrian Mander, available
from SSC, attempts to fill this gap, providing both 3D wire-frame plots
and dropline plots. However, when some (x,y) combinations do not have a
corresponding *z*-value, the graphs produced by **surface** are often
unintelligible. SAS addresses this problem with PROC G3GRID, which
creates a dataset of interpolated values, providing a smooth surface
plot when used as input for PROC G3D. The default method of
interpolation used by PROC G3GRID was proposed by Hiroshi Akima in 1978.
To reproduce this functionality in Stata, we used a publicly available
Fortran implementation of Akima's method. We converted these Fortran
subroutines into Mata and created the Stata command **bipolate** to
interface with these subroutines. The **bipolate** command contains options
for interpolating *z*-values at all possible combinations of the specified
x- and y-values and for specifying specific (x,y) combinations at which
to interpolate. There is also an option for handling multiple *z*-values
for a given (x,y). Examples will be provided to illustrate the use of
**surface**, with and without **bipolate**, and to illustrate various **bipolate**
options.

**Additional information**

boston14_canner.pptx

Chuntao Li

Zhongnan University of Economics and Law

We present our user-written ado program, **eventstudy**.
This package allows users to perform large-scale event studies with market
models such as CAPM. The program is written with Stata's **dialog** command and
is menu driven. Users simply feed the black box the key settings
for the event study, and the program automatically performs the complex procedure.

David Drukker

StataCorp LP

After reviewing the potential-outcome framework for
estimating treatment effects from observational data, I will
discuss how to estimate the average treatment effect and the average
treatment effect on the treated by the regression-adjustment estimator,
the inverse-probability-weighted estimator, two doubly robust
estimators, and two matching estimators implemented in **teffects**.
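
For intuition about one of these estimators, the Python sketch below computes an inverse-probability-weighted estimate of the average treatment effect using normalized weights. It is a toy illustration, not the **teffects** implementation: the propensity scores are treated as known (in practice they would be estimated), and the function name and data are hypothetical.

```python
def ipw_ate(outcomes, treated, pscores):
    """Normalized inverse-probability-weighted estimate of the ATE."""
    w1 = [d / p for d, p in zip(treated, pscores)]               # treated weights
    w0 = [(1 - d) / (1 - p) for d, p in zip(treated, pscores)]   # control weights
    mean1 = sum(w * y for w, y in zip(w1, outcomes)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, outcomes)) / sum(w0)
    return mean1 - mean0


# Hypothetical data: two covariate strata with treatment probabilities 0.25 and 0.75.
y = [3, 1, 1, 1, 5, 5, 5, 2]
d = [1, 0, 0, 0, 1, 1, 1, 0]
p = [0.25] * 4 + [0.75] * 4

treated_mean = sum(yi for yi, di in zip(y, d) if di) / sum(d)
control_mean = sum(yi for yi, di in zip(y, d) if not di) / (len(d) - sum(d))
print("naive difference:", treated_mean - control_mean)
print("IPW estimate:", ipw_ate(y, d, p))
```

Because treatment is more likely in the stratum with higher outcomes, the naive difference in means overstates the effect; reweighting each group to resemble the full sample removes that imbalance in this example.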

**Additional information**

boston14_drukker.pdf

Bryan Fellman

MD Anderson Cancer Center

The optimal interval design is a novel phase I trial design for finding the
maximum tolerated dose (MTD). The optimal interval design casts dose finding
as a sequential decision-making problem for assigning an appropriate dose
for each enrolled patient. The design optimizes the assignment of doses to
patients by minimizing incorrect decisions of dose escalation or
deescalation, that is, erroneously escalating (or deescalating) the dose
when the current dose is actually higher (or lower) than the MTD. This
feature of the optimal interval design strongly ensures adherence to ethical
standards. In addition, because the optimal dose assignment tends to treat
patients at (or close to) the MTD, at the end of the trial, this design will
be able to select the MTD with a high probability since most data and
statistical power are concentrated around the MTD. This presentation will
briefly cover the methods of the design and demonstrate a command that
implements them in a clinical setting.

**Additional information**

boston14_fellman.pdf

Michael Lokshin

Sergiy Radyakin

Development Economics Research Group, The World Bank

Complex tasks in simulation modeling and estimation frequently
strain computational resources. Often these tasks have a distinct number of separable
iterations that can be performed in parallel, simultaneously, and
independently from each other. Earlier approaches were limited to an
execution on a single machine (e.g., PARALLEL, 2013) in parallel
sessions. We are developing a system that can be run in an MS Windows
network, with automatic registration and deregistration of computing
nodes (each running Stata), a task scheduler, and a results aggregator.
A multiple-machine networked approach allows greater scale and ultimately
higher performance.

**Additional information**

boston14_radyakin.pdf

Michael Stepner

MIT

boston14_stepner.pdf

William Gould

StataCorp LP

In lieu of his usual *Report to users*, Bill Gould will talk on
floating-point numbers.

Researchers do not adequately appreciate that floating-point numbers are a
simulation of real numbers and, as with all simulations, some features are
preserved while others are not. When writing code, or even do-files,
treating the computer's floating-point numbers as if they were real numbers can
lead to substantive problems and to numerical inaccuracy. In this, the
relationship between computers and real numbers is not entirely unlike the
relationship between tea and Douglas Adams's Nutri-Matic drink dispenser.
The Nutri-Matic produces a concoction that is "almost, but not quite,
entirely unlike tea."

Gould shows what the universe would be like if it were implemented in
floating-point rather than in real numbers. The floating-point universe
turns out to be nothing like the real universe and probably could not be
made to function.

Without jargon and without resort to binary, Gould shows how floating-point
numbers are implemented on an imaginary base-10 computer and quantifies the
kinds of errors that can arise. In this, floating-point subtraction stands
out as really being almost, but not quite, entirely unlike subtraction.
Gould shows how to work around such problems.

The point of the talk is to build your intuition about the floating-point
world so that you as a researcher can predict when calculations might go
awry, know how to think about the problem, and determine how to fix it.
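
A base-10 machine of this kind can be imitated with Python's `decimal` module. The sketch below is not from the talk; it assumes a machine that keeps 4 significant digits, and the helper name `fl` and the example numbers are hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# An imaginary base-10 computer that stores 4 significant digits (assumed).
machine = Context(prec=4, rounding=ROUND_HALF_EVEN)


def fl(text):
    """Store a number on the machine, rounding it to 4 significant digits."""
    return machine.create_decimal(text)


a = fl("3.14159")                       # stored as 3.142
b = fl("3.14100")                       # stored as 3.141
computed = machine.subtract(a, b)       # difference of the stored values
exact = Decimal("3.14159") - Decimal("3.14100")
print("computed:", computed, "exact:", exact)

# Adding a small number to a large one can leave the large one unchanged.
print(machine.add(fl("1000"), fl("0.4")))
```

Each stored value is accurate to 4 digits, yet their difference comes out as 0.001 when the exact answer is 0.00059: the leading digits cancel and only the rounding error of the inputs is left.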

**Additional information**

boston14_gould.pdf