Meet with us in the Windy City!
Network. Learn. Grow.
The Stata Conference is a unique opportunity to hear from Stata experts at the top of their fields, as well as Stata’s own researchers and developers. Open to users of all disciplines, the Stata Conference has something for everyone. Join us for this chance to learn from and network with Stata users from all over the world.
New this year, network with other users and Stata developers at the mixer and poster session.
Day 1: Thursday, July 11
|7:15–7:50||Registration and continental breakfast|
|7:50–8:00||Welcome and introduction|
Using Stata for data collection and management
Travel time with Chinese map Abstract: People in China commonly use Baidu Map for navigation. Many researchers in economics need the travel time from one place to another under different travel modes, such as driving, public transportation, bicycling, and walking.
In this presentation, we provide an algorithm in Stata that calculates the travel time and travel distance for each pair of Chinese addresses. We also package the algorithm as an ado-command.
Zhongnan University of Economics and Law
Automating a codebook Abstract: We present a command for automating what is otherwise a time-consuming task: the creation of comprehensive codebooks in Microsoft Word format.
Our command uses Stata 15’s putdocx suite of commands to create a codebook of all variables in the data (by default) or of a user-specified varlist. The resulting Microsoft Word output file begins with a cover page listing details of the data file, including the filename, last saved date, number of observations, number of variables, data labels, and data notes. The remainder of the output file contains one page per variable with that variable's name, label, type, number of observations, number of unique values, and number of missing values. Numeric variables also display the mean, standard deviation, range, percentiles, and a histogram. Dates display their range. String variables show a complete or a truncated frequency distribution as desired. The output file is useful for accession into a data archive such as ICPSR or your local repository, which was the problem we set out to solve. Output files can also serve as a starting point for exploratory data analysis. An unexpected benefit of this approach is exposing variable names and labels to system-level search indexes (Spotlight in macOS or Windows Search).
University of Alaska Anchorage
ietoolkit: How DIME Analytics develops Stata code from primary data work Abstract: Over the years, the complexity of data work in development research has grown exponentially, and standardizations for workflows are needed for researchers and data analysts
to work simultaneously on multiple projects. ietoolkit was developed to standardize and simplify best practices for data management and analysis across the 100-plus members of the World Bank's Development Research Group, Impact Evaluations team (DIME). It includes a standardized project folder structure; standardized Stata "boilerplate" code; standardized balance tables, graphs, and matching procedures; and modified dropping and saving commands with built-in safety checks. The presentation will outline how the ietoolkit structure is meant to serve as a guide for projects to move their data through the analysis process in a standardized way, as well as offer a brief introduction to the other commands. The intent is for many projects within one organization to have a predictable workflow, such that researchers and data analysts can move between multiple projects and support other teams easily and rapidly without expending time relearning idiosyncratic project organization structures and standards. These tools are developed open-source on GitHub and available publicly.
World Bank Group (DIME)
iefieldkit: Stata commands for primary data collection and cleaning Abstract: Data collection and cleaning workflows use highly repetitive but extremely important processes. iefieldkit was developed to standardize and simplify
best practices for high-quality primary data collection across the 100-plus members of the World Bank's Development Research Group, Impact Evaluations team (DIME). It automates error-checking for electronic ODK-based survey modules such as those implemented in SurveyCTO; duplicate checking and resolution; data cleaning, including renaming, labeling, recoding, and survey harmonization; and codebook creation. The presentation will outline how the iefieldkit package is intended to provide a data-collection workflow skeleton for nearly any type of primary data collection, from questionnaire design to data import. One feature of many iefieldkit commands is their utilization of spreadsheet-based workflows, which reduce repetitive coding in Stata and document corrections and cleaning in a human-readable format. This enables rapid review of data quality in a standardized process, with the goal of producing maximally clean primary data for the downstream data construction and analysis phases in a transparent and accessible manner. These tools are developed open-source on GitHub and available publicly.
World Bank Group (DIME)
Graphics development
Barrel-aged software development: brewscheme as a four-year-old Abstract: The term "software development" implies some type of change over time. While Stata goes through extraordinary steps to support backward compatibility, user-contributors may not always see a need to continue developing programs shared with the community.
How do you know if or when you should add additional programs or functionality to an existing package? Is it easy and practical to extend existing Stata code, or is it easier to refactor everything from the ground up? What can you do to make it easier to extend existing code? While brewscheme may have started as a relatively simple package with a couple of commands and limited functionality, in the four years since it was introduced, it has grown into a multifunctional library of tools to make it easier to create customized visualizations in Stata while being mindful of color sight impairments. I will share my experience, what I have learned, and strategies related to how I dealt with these questions in the context of the development of the brewscheme package. I will also show what the additional features do that the original brewscheme did not do.
Fayette County Public Schools
Substantive applications
Simulating baboon behavior using Stata Abstract: This presentation originated from a field study of the behavior of feral baboons in Tanzania. The field study used behavior sampling methods,
including on-the-moment (instantaneous) and thru-the-moment (one-zero). Some primatologists critiqued behavioral sampling as not reflecting true frequency or duration. A Monte Carlo simulation study was performed to compare behavior sampling with actual frequency and duration.
Using cluster analysis to understand complex datasets: Experience from a national nursing consortium Abstract: Cluster analysis is a type of exploratory data analysis for classifying observations and identifying distinct groups. It may be useful for complex datasets where commonly used regression modeling approaches may be inadequate because of outliers, complex interactions, or violation of assumptions.
In health care, the complex effect of nursing factors (including staffing levels, experience, and contract status), hospital size, and patient characteristics on patient safety (including pressure ulcers and falls) has not been well understood. In this presentation, I will explore the use of Stata cluster analysis (cluster) to describe five groups of hospital units that have distinct characteristics to predict patient pressure ulcers and hospital falls in relationship to employment of supplemental registered nurses (SRNs) in a national nursing database. The use of SRNs is a common practice among hospitals to fill gaps in nurse staffing. But the relationship between the use of SRNs and patient outcomes varies widely, with some groups reporting a positive relationship, while other groups report an adverse relationship. The purpose of this presentation is to identify the advantages and disadvantages of cluster analysis and other methods when analyzing nonnormally distributed, nonlinear data that have unpredictable interactions.
Virginia Mason Medical Center
The individual process of neighborhood change and residential segregation in 1940: An implication of a discrete choice model Abstract: Using the 1940 restricted census microdata, this study develops discrete choice models to investigate how individual and household characteristics, along with the features of neighborhoods of residence, affect individual choices of residential outcomes in US cities.
This study will make several innovations: (1) We will take advantage of 100% census microdata on the whole population of the cities to establish discrete choice models estimating the attributes of alternatives (for example, neighborhoods) and personal characteristics simultaneously. (2) This study will set a routine of reconstructing personal records to the data structure eligible for discrete choice models and then test whether the assumptions are violated. (3) This study will assess the extent and importance of discrimination and residential preferences, respectively, through the model specification. The results suggest that both in-group racial and class preferences can explain the individual process of neighborhood changes. All groups somehow practice out-group avoidance based on race and social class. Such phenomena are more pronounced in multiracial cities.
Karl X.Y. Zou
Texas A&M University
Featured presentation from StataCorp
Using other programming languages within Stata Abstract: Users may extend Stata's features using other programming languages, for example, Python, Java, and C. I will discuss how other programming languages can be used from within Stata. I will demonstrate calling code from Stata to extend Stata's functionality, obtaining access to Stata's data and metadata from within that code, and returning results to Stata.
Difference-in-differences
Extending the difference-in-differences (DID) to settings with many treated units and same intervention time: Model and Stata implementation Abstract: The difference-in-differences (DID) estimator is a popular way to estimate average treatment effects in causal inference studies. Under the common support assumption, DID overcomes the problem of unobservable selection using panel, time, or location fixed effects and the knowledge of the pre- or postintervention times.
New developments of DID have recently been proposed: (i) the synthetic control method (SCM) applies when a long pre- and postintervention time series is available, only one unit is treated, and the intervention occurs at a specific time (implemented in Stata via SYNTH by Hainmueller, Abadie, and Diamond); (ii) an extension to binary time-varying treatment with many treated units has also been proposed and implemented in Stata via TVDIFF (Cerulli and Ventura, 2018). However, a command to accommodate a setting with many treated units and the same intervention time is still lacking. In this presentation, I propose a potential-outcome model to accommodate this latter setting and provide a Stata implementation via the new Stata routine FTMTDIFF (standing for fixed-time multiple treated DID). Finally, I will set out some guidelines for future DID developments.
IRCrES-CNR, National Research Council of Italy
Bacon decomposition for understanding differences-in-differences with variation in treatment timing Abstract: In applications of a difference-in-differences (DD) model, researchers often exploit natural experiments with variation in onset, comparing outcomes across groups of units that receive treatment starting at different times.
Goodman-Bacon (2019) shows that this DD estimator is a weighted average of all possible two-group or two-period DD estimators in the data. The bacon command performs this decomposition and graphs all two-by-two DD estimates against their weight, which displays all identifying variation for the overall DD estimate. Given the widespread use of the two-way fixed effects DD model, bacon has broad applicability across domains and will help researchers understand how much of a given DD estimate comes from different sources of variation.
Stata programming
The matching problem using Stata Abstract: The main purpose of this presentation is to discuss an algorithm for the matching problem. As an example, the K-cycle kidney exchange problem is defined and solved using a user-written Stata program.
Korea National Defense University
Mata implementation of Gauss-Legendre quadrature in the M-estimation context: Correcting for sample-selection bias in a generic nonlinear setting Abstract: Many contexts in empirical econometrics require nonclosed form integration for appropriate modeling and estimation design. Applied researchers often avoid such correct but computationally demanding specifications and opt for simpler misspecified modeling designs.
The presentation will detail a newly developed Mata implementation of a relatively simple numerical integration technique – Gauss-Legendre quadrature. Although this Mata code is applicable in a variety of circumstances, it was mainly written for use in M-estimation when the relevant objective function (for example, the likelihood function) involves integration at the observation level. As inputs, the user supplies a vector-valued integrand function (for example, a vector of sample log-likelihood integrands) and a matrix of upper and lower integration limits. The code outputs the corresponding vector of integrals (for example, the vector of observation-specific log-likelihood values). To illustrate the use of this Mata implementation, we conduct an empirical analysis of classical sample-selection bias in the estimation of wage offer regressions. We estimate a nonlinear version of the model based on the modeling approach suggested by Terza (Econometric Reviews 2009) which requires numerical integration. This model is juxtaposed with the classical linear sample-selection specification of Heckman (Annals of Economic and Social Measurement 1976), for which numerical integration is not required.
Indiana University Purdue University Indianapolis
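As a rough illustration of the Gauss-Legendre quadrature described in the abstract above, the following Python sketch computes one integral per observation from an integrand function and a list of lower and upper limits. It is not the presenters' Mata code: the function names `gauss_legendre_nodes` and `integrate_rows`, the interface, and the default of 20 nodes are all assumptions made for this example.

```python
import math


def gauss_legendre_nodes(n):
    """Return nodes and weights of the n-point Gauss-Legendre rule on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        # Standard starting guess for the i-th root of the Legendre polynomial P_n.
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(100):
            # Evaluate P_n(x) and P_{n-1}(x) with the three-term recurrence.
            p_prev, p = 1.0, x
            for k in range(2, n + 1):
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            # Derivative of P_n from the identity (x^2 - 1) P_n' = n (x P_n - P_{n-1}).
            dp = n * (x * p - p_prev) / (x * x - 1.0)
            dx = p / dp
            x -= dx  # Newton step toward the root
            if abs(dx) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def integrate_rows(integrand, limits, n=20):
    """Approximate one integral per observation.

    integrand(i, t) returns the integrand for observation i at point t
    (hypothetical interface); limits is a list of (lower, upper) pairs.
    """
    nodes, weights = gauss_legendre_nodes(n)
    results = []
    for i, (a, b) in enumerate(limits):
        # Map the reference interval [-1, 1] onto [a, b].
        half, mid = 0.5 * (b - a), 0.5 * (a + b)
        total = sum(w * integrand(i, mid + half * x) for x, w in zip(nodes, weights))
        results.append(half * total)
    return results
```

In an M-estimation setting, the returned list would play the role of the vector of observation-specific likelihood contributions, recomputed at each trial parameter value.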
Economic applications
A practical application of the mvport package: CAPM-based optimal portfolios Abstract: The mvport package has commands for financial portfolio optimization and portfolio backtesting. I present a practical implementation of a CAPM-based strategy to select stocks, and then apply different optimization settings, and evaluate the resulting portfolios.
The presentation illustrates how to automate the process through a simple do-file that makes it easy to change parameters (for example, stock list, market index, risk-free rate) using an Excel interface. The program automates the following: a) data collection, b) CAPM model estimation for all stocks, c) selection of stocks based on CAPM parameters, d) portfolio optimization with different configurations, and e) portfolio backtesting. For data collection, the getsymbols and freduse commands are used to get online price data for all the S&P500 stocks and the risk-free rate. For each stock, two competing CAPM models are estimated: one using a simple regression and one using an autoregressive conditional heteroskedasticity (ARCH) model. The CAPM parameters are used to select stocks. Then the mvport package is used to optimize different configurations of the portfolio. Finally, the performance of each portfolio configuration is calculated and compared with the market portfolio.
Tec de Monterrey
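The simple-regression CAPM step in the abstract above can be sketched in Python as an ordinary least-squares fit of a stock's excess return on the market's excess return. This is only an illustration, not the mvport workflow: the function names are hypothetical, and the selection rule (keep stocks with positive alpha) is an assumption, since the abstract does not state which CAPM parameters drive the selection.

```python
def capm_fit(stock_returns, market_returns, risk_free):
    """Return (alpha, beta) from an OLS fit of excess stock on excess market returns."""
    ex_s = [r - f for r, f in zip(stock_returns, risk_free)]
    ex_m = [r - f for r, f in zip(market_returns, risk_free)]
    n = len(ex_s)
    mean_s = sum(ex_s) / n
    mean_m = sum(ex_m) / n
    # Slope = sample covariance / sample variance (the 1/n factors cancel).
    cov = sum((m - mean_m) * (s - mean_s) for m, s in zip(ex_m, ex_s))
    var = sum((m - mean_m) ** 2 for m in ex_m)
    beta = cov / var
    alpha = mean_s - beta * mean_m
    return alpha, beta


def select_stocks(returns_by_ticker, market_returns, risk_free):
    """Keep tickers whose estimated alpha is positive (assumed rule, for illustration)."""
    selected = []
    for ticker, returns in returns_by_ticker.items():
        alpha, _ = capm_fit(returns, market_returns, risk_free)
        if alpha > 0:
            selected.append(ticker)
    return selected
```

The ARCH variant mentioned in the abstract would replace the OLS fit with a model that also estimates a time-varying error variance; that step is not sketched here.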
Tools to analyze interest rates and value bonds Abstract: Bond markets contain a wealth of information about investor preferences and expectations. However, extracting such information from market interest rates can be computationally burdensome. I introduce a suite of new Stata commands to aid finance professionals and researchers in using Stata to analyze the term structure of interest rates and value bonds.
The genspot command uses a bootstrap methodology to construct a spot rate curve from a yield curve of market interest rates under a no-arbitrage assumption. The genfwd command generates a forward rate curve from a spot rate curve, allowing researchers to infer market participants’ expectations of future interest rates. Finally, the pricebond command uses forward rates to value a bond with user-specified terms.
Discover Financial Services
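The three steps in the abstract above (bootstrapping a spot curve, deriving forward rates, and pricing a bond) can be sketched in Python under simplifying assumptions that the abstract does not specify: annual-pay par yields at maturities 1, 2, 3, ... years, annual compounding, and one-year forward rates. The function names are hypothetical and are not the syntax of genspot, genfwd, or pricebond.

```python
def spot_from_par(par_yields):
    """Bootstrap annual spot rates from par yields for maturities 1..N years."""
    discounts, spots = [], []
    for n, y in enumerate(par_yields, start=1):
        # A par bond prices at 1: 1 = y * (d_1 + ... + d_{n-1}) + (1 + y) * d_n.
        d = (1.0 - y * sum(discounts)) / (1.0 + y)
        discounts.append(d)
        spots.append(d ** (-1.0 / n) - 1.0)
    return spots


def forwards_from_spot(spots):
    """One-year forward rates implied by the spot curve (no-arbitrage)."""
    forwards, prev = [], 1.0
    for n, s in enumerate(spots, start=1):
        d = (1.0 + s) ** (-n)          # discount factor for year n
        forwards.append(prev / d - 1.0)  # rate from year n-1 to year n
        prev = d
    return forwards


def price_bond(forwards, coupon_rate, face=100.0):
    """Price an annual-coupon bond maturing in len(forwards) years."""
    price, d = 0.0, 1.0
    for f in forwards:
        d /= 1.0 + f                   # chain forward rates into a discount factor
        price += coupon_rate * face * d
    return price + face * d
```

A quick consistency check on this sketch: a bond whose coupon equals the par yield at its maturity should price at face value, because that is the condition the bootstrap step imposes.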
Panel stochastic frontier models with endogeneity in Stata Abstract: I introduce xtsfkk, a new Stata command for fitting panel stochastic frontier models with endogeneity. The advantage of xtsfkk is that it can control for the endogenous variables in the frontier or the inefficiency term in a longitudinal setting.
Hence, xtsfkk performs better than standard panel frontier methodologies such as xtfrontier that overlook endogeneity by design.
Social networking mixer and poster session
Enjoy complimentary drinks and connect with new and long-time Stata users, all while viewing a number of accepted poster presentations on a variety of topics.
Post-estimation analysis with Stata by SPost13 commands of survey data analyzed by MNLM
The causal effects of wages on labor supply for married women: Evidence from American couples
Triple deficit hypothesis: A dynamic panel model for East African countries
The Eastern Africa Statistical Training Centre
Fitting generalized linear models when the data exceeds available memory
Johns Hopkins University School of Medicine
Estimation of varying coefficient models in Stata
Levy Economics Institute
Psychiatric morbidity in physically injured children and adolescents: A national evaluation
Predicting student academic interruption based on prior math course performance: An example of using Stata’s margins command in institutional research
Natasha A. Baloch
University of South Florida
Day 2: Friday, July 12
Statistical topics
Recentered influence functions (RIF) in Stata: RIF-regression and RIF-decomposition Abstract: Recentered influence functions (RIF) are statistical tools that have been popularized by Firpo, Fortin, and Lemieux (2009) for analyzing unconditional partial effects (UPE) on quantiles in a regression analysis framework (unconditional quantile regressions).
The flexibility and simplicity of this tool, however, have opened the possibility of extending the analysis to other distributional statistics, using linear regressions or decomposition approaches. In this presentation, I introduce three Stata commands to facilitate the use of RIFs in the analysis of outcome distributions: rifvar() is an egen extension used to create RIFs for a large set of distributional statistics; rifhdreg facilitates the estimation of RIF regressions enabling the use of high-dimensional fixed effects; and oaxaca_rif implements Oaxaca-Blinder-type decomposition analysis.
Verifying the existence of maximum likelihood estimates in generalized linear models Abstract: There has been considerable ambiguity over how to verify whether estimates from nonlinear models "exist" and what can be done if they do not. This is the so-called separation problem. We characterize the problem in detail across a wide range of generalized linear models and introduce a novel method for dealing with it in the presence of high-dimensional fixed effects,
as are often recommended for gravity models of international trade and in other common panel-data settings. We have included these methods in a new Stata command for HDFE-Poisson estimation called PPMLHDFE. We have also created a suite of test cases developers may use in the future for testing whether their estimation packages are correctly identifying instances of separation. These projects are joint with Sergio Correia and Paulo Guimaraes. We have written two papers related to these topics and also created a website with example code and data illustrating the separation issue and how we solve it. Please see our GitHub repository for more details: https://github.com/sergiocorreia/ppmlhdfe/.
University of Richmond
Unbiased IV in Stata Abstract: A well-known result is that exactly identified IV has no moments, including in the ideal case of an experimental design (that is, a randomized control trial with imperfect compliance). This result no longer holds when the sign of the first stage is known, however.
I describe a Stata implementation of an unbiased estimator for instrumental-variable models with a single endogenous regressor where the sign of one or more first‐stage coefficients is known (due to Andrews and Armstrong 2017) and its finite sample properties under alternative error structures.
Using Stata for reproducible research
Incorporating Stata into reproducible documents during survey data management for intervention projects in Africa Abstract: Program intervention will be very successful if a proper data management system is created and quality checks are implemented during survey data collection.
Stata 15 introduces several commands that facilitate automated document production, including putdocx for creating Word documents with images and graphs and putdocx append for combining Word documents. These commands allow you to mix formatted text and Stata output and allow you to embed Stata graphs, in-line Stata results, and tables containing the output from selected Stata commands. We will show these commands in action, demonstrating how to automate the production of documents in various formats and include Stata results in those documents, while combining existing Stata commands to identify outliers, produce descriptive statistics and univariate and bivariate tables, and run automated quality checks for any survey data collection app (for example, ODK, CSPro, Survey Solutions, and REDCap).
Jerry Chukwuebuka Agulehi
Connecting Stata and Microsoft Word using StatTag for collaborative reproducibility Abstract: Although Stata can render output and reports to Microsoft Word, PDF, and HTML files, Stata users must sometimes transcribe statistical content into separate Microsoft Word documents (for example, documents drafted by colleagues in Word or documents that must be prepared in Word), a process that is error-prone, irreproducible, and inefficient.
This talk will illustrate how StatTag (www.stattag.org), an open source, free, and user-friendly program that we developed, addresses this problem. Since its introduction in 2016, StatTag has undergone substantial improvements and refinements. StatTag establishes a bidirectional link between Stata files and a Word document and supports a reproducible pipeline even when (1) statistical results must be included and updated in Word documents that were never generated from Stata; and (2) text in Word files generated from Stata has departed substantially from original content, for example, through tracked changes or comments. We will demonstrate how to use StatTag to connect Stata and Word files so that all files can be edited separately, but statistical content—values, tables, figures, and verbatim output—can be updated automatically in Word. Using practical examples, we will also illustrate how to use StatTag to view, edit, and rerun Stata code directly from Word.
Abigail S. Baldridge
Featured presentations from StataCorp
Using the lasso and related estimators for prediction Abstract: The lasso, the elastic net, and ridge regression are three popular machine-learning methods. In this presentation, I will discuss prediction using these methods for linear, binary, and count outcomes. We will discover why these estimators are effective and how they work. Then I will show some examples of these tools in action.
Inference after lasso model selection Abstract: The increasing availability of high-dimensional data and increasing interest in more realistic functional forms have sparked a renewed interest in automated methods for selecting the covariates to include in a model. I discuss the promises and perils of model selection and pay special attention to some new estimators that provide reliable inference after model selection.
Topics in biostatistics
Uncovering the true variability in meta-analysis results using resampling methods Abstract: Traditionally, meta-analyses are performed using a single effect estimate from each included study, resulting in a single combined effect estimate and confidence interval. However, there are a number of processes that could give rise to multiple effect estimates from each study, such as multiple individuals extracting study data, the use of different analysis methods for dealing with missing data or dropouts, and the use of different types of endpoints for measuring the same outcome.
Depending on the number of studies and the number of possible estimates per study, the number of combinations of studies for which a meta-analysis could be performed could be in the thousands. Accordingly, meta-analysts need a tool that can iterate through all of these possible combinations (or a reasonably sized sample thereof), compute an effect estimate for each, and summarize the distribution of the effect estimates and standard errors for all combinations. We have developed a Stata command, resmeta, for this purpose that can generate results for 10,000 combinations in a few seconds. This command can handle both continuous and categorical data, can handle a variable number of estimates per study, and has options to compute a variety of different estimates and standard errors. In the presentation, we will cover case studies where this approach was applied, considerations for more general application of the approach, command syntax and options, and different ways of summarizing the results and evaluating different sources of variability in the results.
Johns Hopkins University School of Medicine
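The resampling idea in the abstract above can be sketched in Python: draw one candidate estimate per study, pool each drawn combination, and summarize the spread of the pooled results. This is not the resmeta command. The function names are hypothetical, and fixed-effect inverse-variance pooling is an assumed choice, since the abstract says only that a variety of estimates and standard errors are available.

```python
import random
import statistics


def pooled_fixed_effect(pairs):
    """Inverse-variance pooled estimate and standard error for (estimate, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in pairs]
    total = sum(weights)
    estimate = sum(w * e for w, (e, _) in zip(weights, pairs)) / total
    return estimate, (1.0 / total) ** 0.5


def resample_meta(studies, draws=1000, seed=0):
    """Pool `draws` random combinations, picking one candidate pair per study.

    studies is a list with one entry per study; each entry is a list of
    candidate (estimate, se) pairs, and the lists may differ in length.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(draws):
        combo = [rng.choice(candidates) for candidates in studies]
        results.append(pooled_fixed_effect(combo))
    return results


def summarize(results):
    """Minimum, median, and maximum of the pooled estimates."""
    estimates = [e for e, _ in results]
    return min(estimates), statistics.median(estimates), max(estimates)
```

When the number of combinations is small, enumerating them all with `itertools.product` would be an alternative to random draws.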
Comparing treatments in the presence of competing risks based on life years lost Abstract: Competing risks are frequently encountered in medical research. Examples are clinical trials in head-and-neck and prostate cancer where deaths from cancer and deaths from other causes are competing risks.
Andersen (Stat in Med 2013) showed that the area under the cause j cumulative incidence curve from 0 to t* can be interpreted as the number of life years lost (LYL) due to cause j before time t*. LYL can be estimated and compared in Stata using either the pseudo-observations approach described in Overgaard, Andersen, and Parner (Stata Journal 2015) or by modification of a routine by Pepe and Mori (Stat in Med 1993) for testing the equality of cumulative incidence curves. We describe an application of the method to the DeCIDE trial, a phase III randomized clinical trial of induction chemotherapy plus chemoradiotherapy versus chemoradiotherapy alone in patients with locally advanced head-and-neck cancer. We present simulation results demonstrating that the pseudo-observations and Pepe-Mori approaches yield similar results. We also evaluate the power obtained from comparing life years lost relative to standard procedures for analyzing competing risks data, including cause-specific logrank tests (Freidlin and Korn; Stat in Med 2005) and the Fine-Gray model (Fine and Gray; JASA 1999).
University of Chicago and NRG Oncology
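The quantity described in the abstract above, life years lost to cause j before t*, is the area under the cause-j cumulative incidence curve from 0 to t*. The Python sketch below estimates that curve with the standard Aalen-Johansen estimator and then integrates the resulting step function. It is an illustration only, not the pseudo-observations or Pepe-Mori routines the talk describes; the function names and the coding of censoring as cause 0 are assumptions.

```python
def cumulative_incidence(times, causes, cause):
    """Aalen-Johansen cumulative incidence steps for one cause.

    causes[i] == 0 marks a censored observation (assumed coding).
    Returns a list of (time, value) pairs; the curve equals `value`
    from `time` until the next step.
    """
    data = sorted(zip(times, causes))
    at_risk = len(data)
    surv = 1.0  # probability of no event of any cause just before the current time
    cif = 0.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        j = i
        while j < len(data) and data[j][0] == t:
            j += 1
        tied = [c for _, c in data[i:j]]
        d_cause = sum(1 for c in tied if c == cause)
        d_any = sum(1 for c in tied if c != 0)
        if d_cause:
            cif += surv * d_cause / at_risk
            steps.append((t, cif))
        surv *= 1.0 - d_any / at_risk
        at_risk -= len(tied)
        i = j
    return steps


def life_years_lost(steps, t_star):
    """Area under the cumulative incidence step function from 0 to t_star."""
    area, prev_t, prev_v = 0.0, 0.0, 0.0
    for t, v in steps:
        if t >= t_star:
            break
        area += prev_v * (t - prev_t)
        prev_t, prev_v = t, v
    return area + prev_v * (t_star - prev_t)
```

With no censoring, the result equals the average over all subjects of the time lost between a cause-j failure and t*, which gives a simple way to sanity-check the sketch on small data.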
Hierarchical summary ROC analysis: A frequentist-Bayesian colloquy in Stata Abstract: Meta-analysis of diagnostic accuracy studies requires the use of more advanced methods than meta-analysis of intervention studies. Hierarchical or multilevel modeling accounts for the bivariate nature of the data, both within- and between-study heterogeneity and threshold variability. The hierarchical summary receiver operating characteristic (HSROC) and the bivariate random-effects models are currently recommended by the Cochrane Collaboration.
The bivariate model is focused on estimating summary sensitivity and specificity and, as a generalized linear mixed model, is estimable in most statistical software, including Stata. The HSROC approach models the implicit threshold and diagnostic accuracy for each study as random effects and includes a shape or scale parameter that enables asymmetry in the SROC by allowing accuracy to vary with implicit threshold. As a generalized nonlinear mixed model, it has not previously been directly estimable in Stata, though it can be fit with WinBUGS and SAS Proc NLMIXED or obtained indirectly by extrapolating its parameters from the bivariate model in Stata. This talk will demonstrate for the first time how the HSROC model can be fit in Stata using ML programming and the recently introduced bayesmh command. Using a publicly available dataset, I will show the comparability of Stata results with those obtained with WinBUGS and SAS Proc NLMIXED.
Ben Adarkwa Dwamena
University of Michigan Medical School
Using Stata to solve problems in applied work
kmr: A command to correct survey weights for unit nonresponse using a group's response rates Abstract: This article describes kmr, a Stata command to estimate a micro compliance function using group-level nonresponse rates (Korinek, Mistiaen, and Ravallion 2007, Journal of Econometrics 136: 213-235), which can be used to correct survey weights for unit nonresponse. We illustrate the use of kmr with an empirical example using the Current Population Survey and state-level nonresponse rates.
CUNY Graduate Center and Stone Center on Socio-economic Inequality
tesensitivity: A Stata package for assessing the unconfoundedness assumption Abstract: This talk will discuss a new set of methods for quantifying the robustness of treatment effects estimated under the unconfoundedness assumption (also known as selection on observables or conditional ignorability). Specifically, we estimate bounds on the ATE, the ATT, and the QTE under nonparametric relaxations of unconfoundedness indexed by a scalar sensitivity parameter c.
These deviations allow for limited selection on unobservables, depending on the value of c. For large enough c, these bounds equal the no assumptions bounds. Our methods allow for both continuous and discrete outcomes but require discrete treatments. We implement these methods in a new Stata package, tesensitivity, for easy use in practice. We illustrate how to use this package and these methods with an empirical application to the National Supported Work Demonstration program.
|3:40–4:40||Open panel discussion with Stata developers|
Seats are limited. Lunch and refreshments are included in the registration fee.
Day 1: Thursday, 11 July 2019
Day 2: Friday, 12 July 2019
The optional users dinner will be at Il Porcellino on Thursday, 11 July 2019, at 6:30.
59 W Hubbard
Chicago, IL 60654
The InterContinental Chicago Magnificent Mile is offering a special group rate of $189 per night for Stata Conference attendees staying between 10–13 July 2019.
InterContinental Chicago Magnificent Mile
505 North Michigan Avenue
Chicago, IL 60611
The conference hotel is conveniently located a short three-minute walk from the Gleacher Center. There is limited availability, so book your room by 10 June 2019 to receive the special rate.
Contact the reservations department at (800) 628-2112 and identify yourself as a participant in the 2019 Stata Conference Chicago (or use group code MSG).
University of Chicago
Booth School of Business
450 N Cityfront Plaza Dr
Chicago, IL 60611 USA
Phil Schumm (Chair)
Department of Public Health Sciences
The University of Chicago
Department of Sociology
University of Notre Dame
Department of Sociology
Department of Economics
University of Michigan