2023 Stata Conference

Stanford, California · 20–21 July

Don't forget to save the date and join us next year at the 2024 Stata Conference in Portland, Oregon on 1–2 August 2024!

View the conference photos here, and view the proceedings and presentation slides below.

Proceedings

All times Pacific Daylight Time

Day 1

8:15–8:50 a.m.

Registration

8:50–9:00 a.m.

Welcome and introductions

9:00–9:30 a.m.

Quantile regressions with multiple fixed effects

Additional information:
US23_Rios-Avila.html

Fernando Rios-Avila, Levy Economics Institute

Quantile regression (QR) is an estimation strategy that provides richer characterizations of the relationships between dependent and independent variables. Some developments in the literature have focused on extending quantile regression analysis to include individual fixed effects in the framework of panel data, avoiding the incidental parameter problem under different assumptions. One recent article by Machado and Santos Silva (2019) proposed a location-scale estimator that allows for the inclusion of individual fixed effects in panel data and permits individual effects to vary across quantiles. In this presentation, I propose an extension to this estimator that permits using any number of fixed effects and provides alternative standard-error estimators beyond those suggested in Machado and Santos Silva (2019). I also present the command mmqreg, which implements these extensions.

9:30–10:00 a.m.

iedorep: Quickly locate reproducibility failures in Stata code

Additional information:
US23_Daniels.pptx

Benjamin Daniels, The World Bank (Development Impact Evaluation)

iedorep is a new Stata command in DIME Analytics’s ietoolkit package to check reproducibility of each line of a Stata do-file. First, iedorep takes a single do-file as an argument, runs it, and stores the Stata state after each line executes. This includes the current data signature, the state of the RNG, and the state of the sort RNG. Then it runs the do-file again, checking the state at all the same points. Finally, it reports exactly which lines (if any) have produced unstable states—quickly and accurately identifying hard-to-find reproducibility failures. This presentation will cover potential ways of using iedorep. We will discuss how it detects reproducibility errors, how it provides an efficient way to debug and check reproducibility of Stata code, and how it encourages users to write more accessible code. We will also explore how iedorep can be used in workshops and teaching activities and how it can serve as an important tool in research teams to review code and ensure project reproducibility. Finally, we will highlight areas for improvement and development challenges, such as within-loop implementation and recursive use in projects that use run or do to manage subtasks.
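
The run-twice comparison described above can be sketched in a few lines. This is a minimal Python illustration of the idea, not iedorep's implementation: `fingerprint`, `run_and_record`, `unstable_steps`, and the four-step script are hypothetical names, and a list plus a `random.Random` object stand in for the Stata dataset and RNG state.

```python
import hashlib
import random


def fingerprint(data, rng):
    """Hash the data and the RNG state into a short signature."""
    payload = repr((data, rng.getstate())).encode()
    return hashlib.sha256(payload).hexdigest()[:16]


def run_and_record(steps):
    """Run every step from a clean start, recording the state after each one."""
    data, rng = [], random.Random(0)
    signatures = []
    for step in steps:
        step(data, rng)
        signatures.append(fingerprint(data, rng))
    return signatures


def unstable_steps(steps):
    """Run the script twice; return 1-based positions whose states differ."""
    first, second = run_and_record(steps), run_and_record(steps)
    return [i for i, (a, b) in enumerate(zip(first, second), start=1) if a != b]


# Hypothetical four-line "do-file": the last line reseeds from system entropy.
script = [
    lambda data, rng: data.extend(range(5)),  # load data
    lambda data, rng: rng.seed(12345),        # set a fixed seed
    lambda data, rng: rng.shuffle(data),      # reproducible shuffle
    lambda data, rng: rng.seed(),             # unseeded: differs between runs
]

print(unstable_steps(script))  # [4]
```

Only the fourth step is flagged, because it is the first point at which the two runs leave the RNG in different states.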

10:00–10:30 a.m.

Introducing the Stata linter: A tool to produce clear and transparent Stata code

Additional information:
US23_San_Martin.pdf

Luis Eduardo San Martin, The World Bank (Development Impact Evaluation)

Co-author: Rony Rodriguez-Ramirez, World Bank-DECRG

Statistical programming code developed collaboratively is common in modern data work. However, it is also usual for people to have different coding conventions, making it challenging for one reader to quickly understand another's code and impeding transparency. This is especially true for researchers using Stata because it does not have a widely accepted style guide and few economics graduate students are taught best practices for writing code. To tackle the problem of poor and inconsistent coding conventions in Stata, DIME Analytics recently launched a new tool: the Stata linter. The Stata linter uses the new lint Stata command to help users write good Stata code by identifying problematic code practices. It reads a Stata do-file and automatically detects coding style that makes code hard to follow or that can lead to unintended errors, following DIME Analytics' Stata style guide. This presentation will cover the main functionalities of lint, showcasing how it can be used to detect and correct bad coding practices and improve the readability and transparency of Stata do-files.

10:30–11:00 a.m.

Break

11:00 a.m.–12:00 p.m.

Heterogeneous difference-in-differences estimation

Additional information:
US23_Pinzón.pdf

Enrique Pinzón, StataCorp

Treatment effects might differ over time and for groups that are treated at different points in time. These groups are known as treatment cohorts. In Stata 18, we introduced two commands that estimate treatment effects that vary over time and cohort. For repeated cross-sectional data, we have hdidregress. For panel data, we have xthdidregress. Both commands let you graph the evolution of treatment over time. They also allow you to aggregate treatment within cohort and time and visualize these effects. I will show you how both commands work and briefly discuss the theory underlying them.

12:00–1:00 p.m.

Lunch

1:00–1:20 p.m.

Generalized 2SLS procedure for Stata

Additional information:
US23_Suarez_Chavarria.pdf

Nicolas Suarez Chavarria, Stanford University

In this presentation, I implement code to run the generalized 2SLS procedure for estimating peer effects described in Bramoullé, Djebbari, and Fortin (2009). With this, we can fit peer-effects models in Stata easily: we define an adjacency matrix in Mata, specify the dependent variable and the exogenous variables, and run the regression. The program returns the standard display of a regression command, with coefficients, standard errors, p-values, and so on for the endogenous and exogenous effects. All of these coefficients are also stored in e() for later use with postestimation commands. The program also allows us to row-normalize the adjacency matrix and to add group-level fixed effects.
Reference
Bramoullé, Y., H. Djebbari, and B. Fortin. 2009. Identification of peer effects through social networks. Journal of Econometrics 150: 41–55.
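
As a rough illustration of the row-normalization step, the sketch below builds a row-normalized matrix G and the network lags Gx and G²x; in Bramoullé, Djebbari, and Fortin (2009), characteristics of peers' peers can serve as instruments for the peers' mean outcome. This is Python rather than the presenter's Stata/Mata program, and the function names and the three-person network are assumptions made for the example.

```python
def row_normalize(adjacency):
    """Divide each row by its sum so that the weights in a row add up to 1.

    Rows with no links (sum of zero) are left as all zeros.
    """
    normalized = []
    for row in adjacency:
        total = sum(row)
        normalized.append([w / total if total else 0.0 for w in row])
    return normalized


def network_lag(weights, values):
    """Return the matrix-vector product: each unit's weighted average over peers."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


# Hypothetical three-person network: person 1 is linked to persons 0 and 2.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
x = [1.0, 2.0, 5.0]

G = row_normalize(adjacency)
Gx = network_lag(G, x)     # peers' mean characteristic
G2x = network_lag(G, Gx)   # mean over peers of peers, a candidate instrument

print(Gx)   # [2.0, 3.0, 2.0]
print(G2x)  # [3.0, 2.0, 3.0]
```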

1:20–1:40 p.m.

The longitudinal effects of disability types on incomes and employment

Additional information:
US23_Millard.pdf

Robert Millard, Stony Brook University

This presentation studies the heterogeneous effects of disability onset on the level and composition of personal income. I use linked Canadian survey and administrative tax data to estimate the change in disaggregated income measures in the 10 years following onset. Estimates are obtained using a recent inverse weighting methodology that corrects for biases in two-way fixed-effect and event-study estimators. I differentiate disability based on limitations to daily activities, constructing three aggregate types: physical, cognitive, and concurrent. I then analyze the variation in effects across activity limitations within these aggregate types. I find that people with cognitive disabilities experience declines of greater magnitude and permanence in employment rates and employment income than people with physical disabilities. However, people with only cognitive disabilities experience less of an increase in government transfer payments from programs targeting individuals with disabilities. Within cognitive disabilities, people with intellectual and mental limitations experience greater declines in employment and employment income and less of an increase in government transfers than people with activity limitations within the physical type. Within physical disabilities, dexterity, mobility, and flexibility limitations show remarkably similar treatment paths. In contrast, I find insignificant effects for limitations caused by pain alone, which confounds the estimated effects of physical disabilities.

1:40–2:10 p.m.

Bayesian meta-analysis of time to benefit

Additional information:
US23_Boscardin.pdf

John Boscardin, University of California San Francisco

Co-authors: Irena Cenzer, Sei J. Lee, Matthew Growdon, W. James Deardorff (UCSF Division of Geriatrics)

The clinical decisions to start a treatment for any condition require balancing short-term risks with long-term benefits. A clinically interpretable survival analysis metric in such decisions is time to benefit (TTB), the time at which a specific absolute risk reduction (ARR) is first obtained between two treatment arms. We describe a method for estimating TTB using Bayesian methods for meta-analysis. We first extract published survival curves using DigitizeIt and use these to reconstruct person-level time-to-event data with the Stata module ipdfc. Next, using the bayesmh command, we fit a hierarchical Bayesian model allowing for parameters of Weibull survival curves that are specific to each study and arm. We use the resulting joint posterior distribution to estimate study-specific and overall TTB for given ARR (for example, estimates and credible intervals for time until an ARR of 0.01, which is the time until an additional 1 out of 100 patients would benefit from the treatment). As a case study, the presentation shows results from a study of TTB of blood pressure medications on prevention of cardiovascular events.
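
To make the TTB definition concrete, the following sketch finds the first time at which the ARR between two Weibull survival curves reaches a target. It is a Python illustration for a single set of parameter values, not the authors' bayesmh model; the parameterization S(t) = exp(-(t/scale)^shape), the function names, and the numbers are all assumptions. Applying the same function to each posterior draw of the parameters would be one way to obtain a posterior sample of TTB.

```python
import math


def weibull_survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def time_to_benefit(control, treated, target_arr, horizon, grid=10_000):
    """Return the first time at which ARR(t) reaches target_arr, or None.

    control and treated are (shape, scale) pairs; ARR(t) is the treated-arm
    survival minus the control-arm survival.
    """
    def arr(t):
        return weibull_survival(t, *treated) - weibull_survival(t, *control)

    step = horizon / grid
    for i in range(1, grid + 1):
        if arr(i * step) >= target_arr:
            lo, hi = (i - 1) * step, i * step
            for _ in range(60):  # bisection refines the crossing point
                mid = (lo + hi) / 2
                if arr(mid) >= target_arr:
                    hi = mid
                else:
                    lo = mid
            return hi
    return None


# Hypothetical parameters for one study (time in years), not fitted values.
ttb = time_to_benefit(control=(1.0, 10.0), treated=(1.0, 20.0),
                      target_arr=0.01, horizon=5.0)
print(round(ttb, 3))
```

The grid scan locates the first interval in which the target is crossed, so a later second crossing (when the two curves converge again) is ignored.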

2:10–2:30 p.m.

spgen: Creating spatially lagged variables in Stata

Additional information:
US23_Kondo.zip

Keisuke Kondo, Research Institute of Economy, Trade and Industry

This presentation introduces the new community-contributed command spgen, which computes spatially lagged variables in Stata. Spatial econometric analysis has gained attention from researchers and policymakers, and demand for its use is continuously growing among Stata users. The Sp commands, available in Stata 15 or later, facilitate the handling of spatial data and the estimation of spatial econometric models. The newly developed command spgen extends the spgenerate command of the Sp suite to deal with large spatial datasets, such as mesh data and grid square statistics. The computation of spatially lagged variables requires a spatial weight matrix, which describes the structure of spatial dependence. However, when the spatial weight matrix is too large for the available computing resources, the Sp commands may be unable to perform the matrix operations needed to calculate spatially lagged variables. The spgen command addresses this problem, and the presentation provides some examples of spatial data analysis.
Reference
Kondo, K. 2015. SPGEN: Stata module to generate spatially lagged variables. Statistical Software Components S458105, Boston College Department of Economics, revised 17 Jun 2021. https://ideas.repec.org/c/boc/bocode/s458105.html.
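
One way to compute a spatial lag without holding the full weight matrix in memory is to build the weights one row at a time. The sketch below does this in Python for a row-standardized distance-band weight; it illustrates the general idea only and is not necessarily how spgen works. The function name, the cutoff rule, and the three sample points are assumptions.

```python
import math


def spatial_lag(coords, values, cutoff):
    """Row-standardized spatial lag using a distance-band weight.

    Unit j is a neighbor of unit i when j != i and distance(i, j) <= cutoff.
    Weights are built one row at a time, so the full n-by-n matrix is never
    stored. Units with no neighbors get None.
    """
    lags = []
    for i, (xi, yi) in enumerate(coords):
        total, count = 0.0, 0
        for j, (xj, yj) in enumerate(coords):
            if i != j and math.hypot(xi - xj, yi - yj) <= cutoff:
                total += values[j]
                count += 1
        lags.append(total / count if count else None)
    return lags


# Hypothetical points on a line; the third has no neighbor within the cutoff.
coords = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
values = [10.0, 20.0, 30.0]

print(spatial_lag(coords, values, cutoff=1.5))  # [20.0, 10.0, None]
```

The trade-off is time for memory: the double loop still visits every pair of units, but only a running sum and a count are kept per row.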

2:30–3:15 p.m.

Using Stata for Q-methodology studies

Additional information:
US23_Akhtar-Danesh.pptx

Noori Akhtar-Danesh, McMaster University

Q‐methodology is an innovative research method where qualitative data are analyzed using quantitative techniques. It has the strengths of both qualitative and quantitative methods and is regarded as a bridge between these two approaches. It is used for the assessment of subjectivity, including attitudes, perceptions, feelings and values, preferences, life experiences such as stress and quality of life, and intraindividual concerns such as self-esteem, body image, and satisfaction. Q-methodology can be used in any type of research where the outcome variable involves assessment of subjectivity. It is used to identify unique salient viewpoints, as well as shared views on subjective issues, thereby providing unique insights into the richness of human subjectivity. Currently, there are only a handful of programs with limited capability for Q-methodology analysis. In this presentation, I provide a brief review of Q-methodology and three community-contributed commands, qconvert, qfactor, and qpair, in Stata, that offer an attractive set of options for Q-methodology analysis, including different factor-extraction and factor-rotation techniques. Applications of these commands will be illustrated using two real datasets.

3:15–4:00 p.m.

Break

4:00–4:30 p.m.

locproj: A new Stata command to estimate local projections

Additional information:
US23_Ugarte-Ruiz.pdf

Alfonso Ugarte-Ruiz, BBVA

locproj estimates linear and nonlinear impulse–response functions (IRFs) based on the local projections methodology first proposed by Jordà (2005). The command makes it easy to implement several options used in the growing literature on local projections, and the desired specification can be defined either fully automatically or in a customized way. For instance, it allows defining any nonlinear combination of variables as the impulse (shock) or defining methodological options that depend on the response horizon. It allows choosing different estimation methods for both time-series and panel data, including the instrumental-variables options currently available in Stata. It transforms the dependent variable as needed to estimate the local projections in the desired form, such as levels, logs, differences, log-differences, cumulative changes, and cumulative log-differences. For every option, the command also generates the corresponding transformation of the dependent variable in case the user wants to include its lags. It reports the IRF, together with its standard error and confidence interval, as an output matrix and through an IRF graph. The user can easily choose different options for the desired IRF graph and other options to save and use the results.
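
The core of the local projections approach is one regression per horizon: the outcome h periods ahead is regressed on the current shock, and the slope is the estimated response at horizon h. The Python sketch below shows this in its simplest form, with no controls, lags, or standard errors. It is not locproj's code; the function names and the simulated data are assumptions.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def local_projection_irf(y, shock, horizons):
    """Estimate the response at each horizon h by regressing y[t+h] on shock[t]."""
    irf = []
    for h in range(horizons + 1):
        lead_y = y[h:]                        # y[t+h]
        base_shock = shock[:len(shock) - h]   # shock[t]
        irf.append(ols_slope(base_shock, lead_y))
    return irf


# Simulated data: y responds 2.0 on impact and 1.0 one period later.
rng = random.Random(1)
shock = [rng.gauss(0, 1) for _ in range(5000)]
y = [2.0 * shock[t] + (1.0 * shock[t - 1] if t > 0 else 0.0)
     for t in range(len(shock))]

print([round(b, 2) for b in local_projection_irf(y, shock, horizons=3)])
```

With this simulated series, the estimates should be close to 2 and 1 at horizons 0 and 1 and close to 0 afterward.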

4:30–5:00 p.m.

Optimal policy learning using Stata

Additional information:
US23_Cerulli.pdf

Giovanni Cerulli, IRCRES-CNR

In the footsteps of the recent literature on empirical welfare maximization (EWM), I present a new Stata command called opl to carry out "optimal policy learning", a statistical procedure to design treatment assignments using a machine learning approach. The opl command focuses on three policy classes: threshold based, linear combination, and fixed-depth tree. I show a practical example based on a real policy case—that is, the popular LaLonde training program—where, by stressing the policymaker perspective, I show how to carry out optimal treatment assignment and the potential operative problems that can come up in applying this procedure to real-world case studies. I will discuss, in particular, problems of “angle solutions”. The presentation offers a general protocol to carry out optimal policy assignment using Stata and stresses the policymaker empirical perspective and related issues arising when carrying out optimal policy assignment in practice.

5:00 p.m.

Adjourn

Day 2

8:45–9:15 a.m.

Registration

9:15–9:45 a.m.

Consistent estimation of finite mixtures: An application to latent group panel structures

Additional information:
US23_Langevin.pdf

Raphaël Langevin, McGill University

In this presentation, I show that maximizing the likelihood of a mixture of a finite number of parametric densities leads to inconsistent estimates under weak regularity conditions. The size of the asymptotic bias is positively correlated with the overall degree of overlap between the densities within the mixture. In contrast, I show that slight modifications in the classification expectation-maximization (CEM) algorithm—the likelihood generalization of the K-means algorithm—produce consistent estimates of all parameters in the mixture, and I derive the asymptotic distribution of the proposed estimation procedure. I confirm the inconsistency of MLE procedures, such as the expectation-maximization (EM) algorithm, using numerical experiments with simple Gaussian mixture models. Simulation results show that the proposed estimation strategy generally outperforms the EM algorithm when estimating latent group panel structures with unrestricted group membership across units and over time. I also compare the finite-sample performance of each estimation strategy using a mixture of two-part models to predict individual healthcare expenditures from health administrative data. Estimation results show that the proposed consistent CEM approach leads to smaller prediction errors than models fit with the EM algorithm, with a reduction of more than 40% in the out-of-sample prediction error compared with the standard, single-component, two-part model. The proposed estimation procedure thus represents a useful tool when both homogeneity of the parameters and constant group membership are assumed not to hold in panel-data analysis.

9:45–10:05 a.m.

Reproducible research in Stata: Managing dependencies and project files

Additional information:
US23_Correia.pdf

Sergio Correia, Board of Governors of the Federal Reserve

Co-author: Matthew Seay (Board of Governors of the Federal Reserve)

Reproducibility of results is one of Stata's most valuable features, as well as an essential goal for researchers and journal editors. This ability, however, is limited by the lack of version control for user-submitted packages, which are often distributed through GitHub and other channels outside of the Statistical Software Components (SSC) archive. Thus, other researchers or even coauthors might fail to reproduce a given result given the same code and data because of different package versions. In this talk, we present REQUIRE, a Stata package that fills this gap by ensuring that package dependencies are consistent across users. For this, REQUIRE is able to extract a package version number based on the "starbang lines" included by users at the top of each ado-file. Because starbangs are not standardized and come in many different variants, our package takes particular care to cover corner cases and to achieve coverage as broad as possible across all packages available on SSC and GitHub. Then REQUIRE can be used to assert that an exact or minimum package version is present and install it if asked for. Last, we showcase how to use this package together with the related SETROOT package that tracks projects' working directories.
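
To give a sense of why unstandardized starbang lines need careful parsing, here is a small Python sketch that pulls a version number out of a few line variants and checks it against a minimum. It does not reproduce REQUIRE's actual rules; the patterns, function names, and sample lines are hypothetical.

```python
import re

# A starbang line starts with "*!"; everything after it is free-form text.
STARBANG = re.compile(r"^\*!\s*(.*)$")
# Prefer a number that follows "version" or "v" ...
AFTER_KEYWORD = re.compile(r"\bv(?:ersion)?\.?\s*(\d+(?:\.\d+)*)", re.IGNORECASE)
# ... otherwise fall back to the first dotted number such as 3.1.4.
DOTTED = re.compile(r"\b(\d+(?:\.\d+)+)\b")


def parse_starbang(line):
    """Return the version as a tuple of ints, or None if none is found."""
    match = STARBANG.match(line.strip())
    if match is None:
        return None
    body = match.group(1)
    hit = AFTER_KEYWORD.search(body) or DOTTED.search(body)
    if hit is None:
        return None
    return tuple(int(part) for part in hit.group(1).split("."))


def at_least(installed, required):
    """Compare versions after padding with zeros, so (1, 2) equals (1, 2, 0)."""
    width = max(len(installed), len(required))
    pad = lambda version: version + (0,) * (width - len(version))
    return pad(installed) >= pad(required)


# Hypothetical starbang lines in three different styles.
for line in ["*! version 1.2.3 12jan2023", "*! mypkg v2.0 by A. Author", "*! 3.1.4"]:
    print(line, "->", parse_starbang(line))
```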

10:05–10:25 a.m.

Creating Likert-scale visualizations: An approach using Stata and Tableau

Additional information:
US23_Cervantes.pptx

Sergio Cervantes, WestEd

Stata and Tableau are tools that can be used to gain insight into Likert-scale responses. However, very little research exists that discusses how one can create Likert-scale visualizations with the use of Stata and Tableau in tandem. The purpose of this work is to help researchers create Likert-scale visualizations efficiently. The step-by-step process will serve as a guide for researchers to create dashboard-worthy visualizations that effectively present data. The key is creating an Excel file exported from Stata that can be imported as a data source into Tableau. It is important that this file include respondents' IDs, group variable, and Likert-scale responses. In addition, the raw data must be prepared using reshape, and an additional variable indicating the numeric values of the Likert-scale responses (or vice versa) must be generated using gen. Once the Excel file is imported into Tableau, we can set up the visual with the sheet interface. Using Tableau, we can create a Likert-scale visual with select mark modifications and even include item-response averages using level of detail (LOD) arithmetic. Using best data practices and formatting, we can create visuals that effectively communicate findings from raw survey data.
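
The data-preparation step (one row per respondent and item, plus a numeric code for each response) can be pictured with the short Python sketch below. It mirrors what reshape and gen would do in Stata but is not the presenter's code; the variable names, the three-point coding, and the sample rows are assumptions.

```python
def reshape_long(rows, id_keys, item_keys, coding):
    """Turn one row per respondent into one row per respondent-item pair."""
    long_rows = []
    for row in rows:
        for item in item_keys:
            record = {key: row[key] for key in id_keys}
            record["item"] = item
            record["response"] = row[item]
            record["value"] = coding[row[item]]  # numeric code for the label
            long_rows.append(record)
    return long_rows


# Hypothetical survey extract and coding scheme.
CODING = {"Disagree": 1, "Neutral": 2, "Agree": 3}
wide = [
    {"id": 1, "group": "A", "q1": "Agree", "q2": "Neutral"},
    {"id": 2, "group": "B", "q1": "Disagree", "q2": "Agree"},
]

for record in reshape_long(wide, ["id", "group"], ["q1", "q2"], CODING):
    print(record)
```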

10:25–10:45 a.m.

Metaprogramming: What it is, how to use it, and why you should care

Additional information:
US23_Buchanan (.html)

Billy Buchanan, SAG Corporation

Metaprogramming provides a highly flexible approach to solving complex programming problems. Although metaprogramming can be challenging to implement in some programming languages, metaprogramming is easy to implement in Stata largely because of the evaluation of local macros. However, metaprogramming is rarely discussed in the Stata community despite the benefits that metaprogramming can and does provide for many Stata users already. This talk will include a discussion of what metaprogramming is and how metaprogramming can be used effectively to increase efficiency and will illustrate the use of metaprogramming in Stata.

10:45–11:15 a.m.

Break

11:15 a.m.–12:15 p.m.

Bayesian model averaging

Additional information:
US23_Marchenko.pdf

Yulia Marchenko, StataCorp

Model uncertainty accompanies many data analyses. Stata's new bma suite that performs Bayesian model averaging (BMA) helps address this uncertainty in the context of linear regression. Which predictors are important given the observed data? Which models are more plausible? How do predictors relate to each other across different models? BMA can answer these and more questions. BMA uses the Bayes theorem to aggregate the results across multiple candidate models to account for model uncertainty during inference and prediction in a principled and universal way. In my presentation, I will describe the basics of BMA and demonstrate it with the bma suite. I will also show how BMA can become a useful tool for your regression analysis, Bayesian or not!
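
The aggregation step can be illustrated numerically: Bayes' theorem turns each model's marginal likelihood and prior probability into a posterior model probability, and those probabilities weight the model-specific estimates. The Python sketch below shows only this arithmetic and is not the bma suite; the function names and the numbers are assumptions.

```python
import math


def posterior_model_probs(log_marginal_liks, prior_probs):
    """Bayes' theorem over models: p(M|y) is proportional to p(y|M) * p(M)."""
    log_post = [ll + math.log(p) for ll, p in zip(log_marginal_liks, prior_probs)]
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def model_average(estimates, probs):
    """Posterior-probability-weighted average of model-specific estimates."""
    return sum(e * p for e, p in zip(estimates, probs))


# Hypothetical inputs: three candidate models with equal prior probability.
log_ml = [-100.0, -101.0, -104.0]
prior = [1 / 3, 1 / 3, 1 / 3]
coef = [0.80, 0.65, 0.00]  # 0.00 where the predictor is excluded

probs = posterior_model_probs(log_ml, prior)
print([round(p, 3) for p in probs])
print(round(model_average(coef, probs), 3))
```

A model that excludes a predictor contributes a coefficient of zero, so the averaged estimate shrinks toward zero in proportion to the probability of such models.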

12:15–1:15 p.m.

Lunch

1:15–2:00 p.m.

Open panel discussion with Stata developers

Contribute to the Stata community by sharing your feedback with StataCorp's developers. From feature improvements to bug fixes and new ways to analyze data, we want to hear how Stata can be made better for our users.

2:00–2:30 p.m.

dqrep: Facilitating harmonized data-quality assessments with Stata

Additional information:
US23_Schmidt.zip

Carsten Oliver Schmidt, University Medicine Greifswald

Co-authors: Stephan Struckmann, Birgit Schauer (University Medicine Greifswald)

Transparent data-quality reporting is a key element of reproducible research. Transparency ranges from explicit assumptions underlying any data-quality checkup to harmonized reporting that facilitates comparisons of results within and across studies. However, this is far from being common. To the best of our knowledge, none of the existing routines was capable of triggering a series of structured reports on multiple datasets with potentially unknown errors based on a single command call to grade and compare data-quality issues. Therefore, the dqrep Stata package was developed. dqrep triggers a set of more than 60 newly developed Stata ados to compute a customizable range of quality checks. This comprises descriptive overviews, missing values, rule violations, outliers, time trends, observer and device effects. Underlying assumptions are read from easily modifiable spreadsheets. Based on this, all results are integrated in PDF and .docx files, as well as in result summary files to facilitate postprocessing, for example, to create benchmarks. It is shown how a single command call is used to control the data-quality pipeline in a large-scale cohort study and how this may contribute to FAIR research.
Availability
dqrep can be downloaded using the net command from https://packages.qihs.uni-greifswald.de/repository/stata/dqrep.

2:30–3:00 p.m.

Measuring associations and evaluating forecasts of categorical variables

Additional information:
US23_Sirchenko.pdf

Andriy Sirchenko, Nyenrode Business University

Co-authors: Jochem Huismans (University of Amsterdam), Jan Willem Nijenhuis (Nedap NV)

This presentation introduces a new Stata command, classify, that computes various measures of association and correlation between two categorical variables (binary, ordinal, or nominal), evaluates the performance of categorical deterministic forecasts, and provides diagnostic probability scores of the accuracy of probabilistic forecasts. We compiled a comprehensive catalogue of 9 diagnostic scores for probabilistic forecasts and over 210 measures of association and correlation employed in different fields, along with the terminological synonymy and bibliography associated with them. In addition to the overall measures, the command computes the category-specific metrics for each observed category and its macro and weighted averages. We also classify all measures according to the two types of symmetry as well as propose and compute the complement and transpose symmetric variants of those measures that are not symmetric.
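
As an example of a category-specific metric with its macro and weighted averages, the sketch below computes recall for each observed category of a deterministic forecast. Recall is only one illustrative choice among the many measures the abstract mentions. The code is Python, not the classify command, and the function names and the data are assumptions.

```python
from collections import Counter


def recall_by_category(observed, predicted):
    """Share of each observed category that the forecast got right."""
    totals = Counter(observed)
    hits = Counter(o for o, p in zip(observed, predicted) if o == p)
    return {cat: hits[cat] / totals[cat] for cat in totals}


def macro_and_weighted(scores, observed):
    """Unweighted mean across categories and mean weighted by category frequency."""
    totals = Counter(observed)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[cat] * totals[cat] for cat in scores) / len(observed)
    return macro, weighted


# Hypothetical observed outcomes and deterministic forecasts.
observed = ["low", "low", "low", "mid", "high", "high"]
predicted = ["low", "low", "mid", "mid", "high", "low"]

scores = recall_by_category(observed, predicted)
macro, weighted = macro_and_weighted(scores, observed)
print(scores)
print(round(macro, 3), round(weighted, 3))
```

The macro average treats every category equally, whereas the weighted average gives more influence to frequent categories.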

3:00–3:30 p.m.

Break

3:30–4:00 p.m.

Program monitoring of educational tablet-based interventions using topic modeling in Stata

Additional information:
US23_Bahlibi.pptx

Abraham Bahlibi, Imagine Worldwide

Rigorous research conducted in Africa since 2015 established that onebillion's software, an award-winning tablet-based curriculum, produces meaningful impacts in literacy and numeracy (Levesque, Bardack, and Chigeda 2020; Levesque et al. 2022; Pitchford, Hubber, and Chigeda 2017). As these programs are scaled up, program monitoring will become critical for maintaining the quality of implementation and outcomes. International organizations have called for using text analysis as a tool for monitoring and evaluation (Wencker 2019). The present study piloted the use of text analysis to identify themes from field observations of a tablet-based program using onebillion's software for early grade learners. We collected 426 open-ended observations by field officers. We used the Stata package ldagibbs to run topic modeling/latent Dirichlet allocation (LDA). LDA clusters text documents into a user-chosen number of topics (Schwarz 2018). We anticipated that LDA would generate topics that help us more efficiently summarize field observations. LDA successfully generated topics such as faulty audio cables and how they contributed to noisier classrooms. We will receive more survey data as we scale to new sites. Pilot results suggest that LDA may be an efficient means of identifying topics otherwise difficult to identify with staff review of voluminous survey responses.

Scientific committee

The scientific committee is responsible for the Stata Conference program.

With submissions encouraged from both new and longtime Stata users from all backgrounds, the committee will review all abstracts in developing an exciting, diverse, and informative program. We look forward to seeing you in Stanford.

Colin Cameron

University of California, Davis

Department of Economics

Margaret Stedman

Stanford University

School of Medicine

Jeremy Freese

Stanford University

School of Humanities and Sciences

Lakshika Tennakoon

Stanford University

School of Medicine

Why attend?

Connect with the inventive and creative user community.

Experience what happens when new and longtime Stata users from across all disciplines gather to discuss real-world applications of Stata. Whether you are a beginner or an expert, you will find something just for you at Stata Conferences, which are held each year in several different locations around the world.

These conferences provide in-depth presentations from experienced Stata users and experts from StataCorp that focus on helping you use Stata more effectively.

Network

Open to users of all disciplines and experience levels, Stata Conferences bring together a unique mix of experts and professionals. Develop a well-established network within the Stata community.

Stay up to date

Hear from Stata experts at the top of their fields, as well as Stata's own researchers and developers. Gain valuable insights, discover new commands, learn best practices, and improve your knowledge of Stata.

Discover new features

Presentation topics have included new community-contributed commands, methods and resources for teaching with Stata, new approaches for using Stata together with other software, and much more.
