Don't forget to save the date and join us next year at the 2024 Stata Conference in Portland, Oregon on 1–2 August 2024!
View the conference photos here, and view the proceedings and presentation slides below.
Quantile regression (QR) is an estimation strategy that provides richer characterizations of the relationships between dependent and independent variables. Some developments in the literature have focused on extending quantile regression analysis to include individual fixed effects in the framework of panel data, avoiding the incidental parameter problem under different sets of assumptions. A recent article by Machado and Santos Silva (2019) proposed a location-scale estimator that allows for the inclusion of individual fixed effects in panel data and permits individual effects to vary across quantiles. In this presentation, I propose an extension of this estimator that permits any number of fixed effects and provides alternative standard-error estimators beyond those suggested in Machado and Santos Silva (2019). I also present the command mmqreg, which implements these extensions.
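As a sketch of the location-scale framework underlying this approach (notation follows the usual presentation of method-of-moments quantile regression; the exact assumptions are in Machado and Santos Silva 2019):

```latex
% Location-scale panel model with individual effects in both parts:
Y_{it} = \alpha_i + X_{it}'\beta + (\delta_i + X_{it}'\gamma)\,U_{it},
\qquad \delta_i + X_{it}'\gamma > 0 .
% Implied conditional quantiles, with q(\tau) the \tau-quantile of U_{it}:
Q_{Y}(\tau \mid X_{it}) = \bigl(\alpha_i + \delta_i\, q(\tau)\bigr)
  + X_{it}'\beta + X_{it}'\gamma\, q(\tau).
```

The individual effect thus enters the quantile function through both the location term and the scale term, which is what allows it to vary across quantiles.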
iedorep is a new Stata command in DIME Analytics’s ietoolkit package to check reproducibility of each line of a Stata do-file. First, iedorep takes a single do-file as an argument, runs it, and stores the Stata state after each line executes. This includes the current data signature, the state of the RNG, and the state of the sort RNG. Then it runs the do-file again, checking the state at all the same points. Finally, it reports exactly which lines (if any) have produced unstable states—quickly and accurately identifying hard-to-find reproducibility failures. This presentation will cover potential ways of using iedorep. We will discuss how it detects reproducibility errors, how it provides an efficient way to debug and check reproducibility of Stata code, and how it encourages users to write more accessible code. We will also explore how iedorep can be used in workshops and teaching activities and how it can serve as an important tool in research teams to review code and ensure project reproducibility. Finally, we will highlight areas for improvement and development challenges, such as within-loop implementation and recursive use in projects that use run or do to manage subtasks.
Statistical programming code developed collaboratively is common in modern data work. However, collaborators often follow different coding conventions, making it challenging for one reader to quickly understand another's code and impeding transparency. This is especially true for researchers using Stata because it has no widely accepted style guide and few economics graduate students are taught best practices for writing code. To tackle the problem of poor and inconsistent coding conventions in Stata, DIME Analytics recently launched a new tool: the Stata linter. The Stata linter uses the new lint command to help users write good Stata code by identifying problematic code practices. Following DIME Analytics's Stata style guide, it reads a Stata do-file and automatically detects coding style that makes code hard to follow or that can lead to unintended errors. This presentation will cover the main functionality of lint, showcasing how it can be used to detect and correct bad coding practices and improve the readability and transparency of Stata do-files.
Treatment effects might differ over time and for groups that are treated at different points in time. These groups are known as treatment cohorts. In Stata 18, we introduced two commands that estimate treatment effects that vary over time and cohort. For repeated cross-sectional data, we have hdidregress. For panel data, we have xthdidregress. Both commands let you graph the evolution of treatment over time. They also allow you to aggregate treatment within cohort and time and visualize these effects. I will show you how both commands work and briefly discuss the theory underlying them.
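As a rough usage sketch (the variable names below are hypothetical; see the Stata 18 documentation for the exact syntax and the available estimators):

```stata
* Hypothetical variables: outcome y, covariate x, treatment indicator d,
* treatment-cohort identifier g, and time variable t.
hdidregress aipw (y x) (d), group(g) time(t)   // repeated cross-sections
estat atetplot                                  // ATETs over time, by cohort
estat aggregation, cohort graph                 // effects aggregated by cohort
```

For panel data, xtset the data and use xthdidregress with an analogous syntax.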
In this presentation, I implement code to run the generalized 2SLS procedure to estimate peer effects described in Bramoullé, Djebbari, and Fortin (2009). With this, we can fit peer-effects models in Stata very easily: we just need to define an adjacency matrix in Mata, define our dependent variable and our exogenous variables, and then run the regression. The program returns the standard display of a regression command with coefficients, standard errors, p-values, and so on for our endogenous and exogenous effects, and all of these coefficients are also stored in e() for use with other postestimation commands. The program also allows us to row-normalize the adjacency matrix and to add group-level fixed effects.
Bramoullé, Y., H. Djebbari, and B. Fortin. 2009. Identification of peer effects through social networks. Journal of Econometrics 150: 41–55.
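For reference, the linear-in-means model estimated in this framework can be written as follows (in the notation commonly used for Bramoullé, Djebbari, and Fortin 2009, with G the row-normalized adjacency matrix):

```latex
% Peer-effects (linear-in-means) model with adjacency matrix G:
y = \alpha \iota + \beta\, G y + \gamma\, x + \delta\, G x + \epsilon ,
% \beta: endogenous peer effect; \delta: exogenous (contextual) effect.
% Identification requires \beta\gamma + \delta \neq 0 and that
% I, G, and G^2 be linearly independent; peers-of-peers characteristics
% G^2 x then instrument for Gy in the generalized 2SLS procedure.
```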
This presentation studies the heterogeneous effects of disability onset on the level and composition of personal income. I use linked Canadian survey and administrative tax data to estimate the change in disaggregated income measures in the 10 years following onset. Estimates are obtained using a recent inverse-weighting methodology that corrects for biases in two-way fixed-effects and event-study estimators. I differentiate disability based on limitations to daily activities, constructing three aggregate types: physical, cognitive, and concurrent. I then analyze the variation in effects across activity limitations within these aggregate types. I find that people with cognitive disabilities experience declines of greater magnitude and permanence in employment rates and employment income than people with physical disabilities. However, people with only cognitive disabilities experience less of an increase in government transfer payments from programs targeting individuals with disabilities. Within cognitive disabilities, people with intellectual and mental limitations experience greater declines in employment and employment income, and smaller increases in government transfers, than people with physical activity limitations. Within physical disabilities, people with dexterity, mobility, and flexibility limitations experience remarkably similar treatment paths. In contrast, I find insignificant effects for limitations caused by pain alone, which confounds the estimated effects of physical disabilities.
The clinical decision to start a treatment for any condition requires balancing short-term risks against long-term benefits. A clinically interpretable survival-analysis metric for such decisions is time to benefit (TTB), the time at which a specific absolute risk reduction (ARR) is first obtained between two treatment arms. We describe a method for estimating TTB using Bayesian methods for meta-analysis. We first extract published survival curves using DigitizeIt and use these to reconstruct person-level time-to-event data with the Stata module ipdfc. Next, using the bayesmh command, we fit a hierarchical Bayesian model that allows the parameters of the Weibull survival curves to be specific to each study and arm. We use the resulting joint posterior distribution to estimate study-specific and overall TTB for a given ARR (for example, estimates and credible intervals for the time until an ARR of 0.01, the time until an additional 1 out of 100 patients would benefit from the treatment). As a case study, the presentation shows results from a study of the TTB of blood pressure medications for prevention of cardiovascular events.
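Schematically, under a study- and arm-specific Weibull model such as the one described above, the estimand can be written as follows (the notation here is illustrative):

```latex
% Weibull survival in arm j, with shape k_j and scale \sigma_j:
S_j(t) = \exp\!\bigl[-(t/\sigma_j)^{k_j}\bigr].
% Time to benefit for a target absolute risk reduction (ARR):
\mathrm{TTB}(\mathrm{ARR}) =
  \inf\{\, t : S_{\text{treat}}(t) - S_{\text{control}}(t) \ge \mathrm{ARR} \,\}.
```

The joint posterior over the Weibull parameters then induces a posterior (and hence credible intervals) for TTB at any chosen ARR.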
This presentation introduces the new community-contributed command spgen, which computes spatially lagged variables in Stata. Spatial econometric analysis has gained attention from researchers and policymakers, and demand for it is continuously growing among Stata users. The Sp commands, available in Stata 15 or later, facilitate the handling of spatial data and the estimation of spatial econometric models. The newly developed spgen command extends the spgenerate command in the Sp suite to deal with large spatial datasets, such as mesh data and grid square statistics. The computation of spatially lagged variables requires a spatial weight matrix, which mathematically describes the spatial dependence structure of the data. However, when the spatial weight matrix is too large for the computer's specifications, the matrix operations in the Sp commands may fail to calculate spatially lagged variables. The spgen command deals with this problem, and the presentation provides some interesting examples of spatial data analysis.
Kondo, K. 2015. SPGEN: Stata module to generate spatially lagged variables. Statistical Software Components S458105, Boston College Department of Economics, revised 17 Jun 2021.
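For reference, a spatially lagged variable is simply a weighted average of neighboring values under a spatial weight matrix W:

```latex
% Spatial lag of y_i under weight matrix W = (w_{ij}), with w_{ii} = 0:
(Wy)_i = \sum_{j} w_{ij}\, y_j ,
% rows are typically normalized so that \sum_j w_{ij} = 1.
```

The dimension of W grows with the square of the number of spatial units, which is why fine-grained data such as mesh statistics can exhaust memory under a naive matrix implementation.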
Q-methodology is an innovative research method in which qualitative data are analyzed using quantitative techniques. It has the strengths of both qualitative and quantitative methods and is regarded as a bridge between the two approaches. It is used for the assessment of subjectivity, including attitudes, perceptions, feelings and values, preferences, life experiences such as stress and quality of life, and intraindividual concerns such as self-esteem, body image, and satisfaction. Q-methodology can be used in any type of research where the outcome variable involves assessment of subjectivity. It is used to identify unique salient viewpoints, as well as shared views on subjective issues, thereby providing unique insights into the richness of human subjectivity. Currently, there are only a handful of programs with limited capability for Q-methodology analysis. In this presentation, I provide a brief review of Q-methodology and three community-contributed Stata commands, qconvert, qfactor, and qpair, that offer an attractive set of options for Q-methodology analysis, including different factor-extraction and factor-rotation techniques. Applications of these commands will be illustrated using two real datasets.
locproj estimates linear and nonlinear impulse–response functions (IRFs) based on the local-projections methodology first proposed by Jordà (2005). The command makes it easy to implement several options used in the growing literature on local projections, allowing the desired specification to be defined fully automatically or in a customized way. For instance, it allows defining any nonlinear combination of variables as the impulse (shock) and setting methodological options that depend on the response horizon. It supports different estimation methods for both time-series and panel data, including the instrumental-variables options currently available in Stata. It performs the necessary transformations of the dependent variable so that the local projections are estimated in the desired form, such as levels, logs, differences, log differences, cumulative changes, and cumulative log differences. For every option, the procedure generates the corresponding transformation of the dependent variable in case the user wants to include its lags. It reports the IRF, together with its standard errors and confidence intervals, as an output matrix and through an IRF graph. The user can easily choose different options for the IRF graph and other options to save and use the results.
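The basic local-projection regression behind such IRFs can be sketched as follows (the notation is illustrative):

```latex
% Local projection at horizon h: regress the h-step-ahead outcome on the
% shock s_t and controls x_t; the sequence {\beta_h} traces out the IRF.
y_{t+h} = \alpha_h + \beta_h\, s_t + x_t'\Gamma_h + \varepsilon_{t+h},
\qquad h = 0, 1, \ldots, H .
```

One separate regression is fit per horizon, which is what makes it straightforward to let estimation options vary with h.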
In the footsteps of the recent literature on empirical welfare maximization (EWM), I present a new Stata command called opl to carry out "optimal policy learning", a statistical procedure for designing treatment assignments using a machine learning approach. The opl command focuses on three policy classes: threshold-based, linear-combination, and fixed-depth tree. I show a practical example based on a real policy case (the popular LaLonde training program) where, stressing the policymaker's perspective, I show how to carry out optimal treatment assignment and discuss the operative problems that can arise when applying this procedure to real-world case studies. In particular, I will discuss problems of "angle solutions". The presentation offers a general protocol for carrying out optimal policy assignment in Stata and stresses the policymaker's empirical perspective and the related issues that arise when carrying out optimal policy assignment in practice.
In this presentation, I show that maximizing the likelihood of a mixture of a finite number of parametric densities leads to inconsistent estimates under weak regularity conditions. The size of the asymptotic bias is positively correlated with the overall degree of overlap between the densities within the mixture. In contrast, I show that slight modifications in the classification expectation-maximization (CEM) algorithm—the likelihood generalization of the K-means algorithm—produce consistent estimates of all parameters in the mixture, and I derive the asymptotic distribution of the proposed estimation procedure. I confirm the inconsistency of MLE procedures, such as the expectation-maximization (EM) algorithm, using numerical experiments with simple Gaussian mixture models. Simulation results show that the proposed estimation strategy generally outperforms the EM algorithm when estimating latent group panel structures with unrestricted group membership across units and over time. I also compare the finite-sample performance of each estimation strategy using a mixture of two-part models to predict individual healthcare expenditures from health administrative data. Estimation results show that the proposed consistent CEM approach leads to smaller prediction errors than models fit with the EM algorithm, with a reduction of more than 40% in the out-of-sample prediction error compared with the standard, single-component, two-part model. The proposed estimation procedure thus represents a useful tool when both homogeneity of the parameters and constant group membership are assumed not to hold in panel-data analysis.
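As a sketch, the classification likelihood that the CEM algorithm targets (in the standard formulation of Celeux and Govaert; the modifications proposed in the talk are not shown here) is:

```latex
% Classification (complete-data) log likelihood for a K-component mixture:
C\ell(\theta, z) = \sum_{i=1}^{n} \sum_{k=1}^{K}
  z_{ik}\bigl[\log \pi_k + \log f_k(y_i;\theta_k)\bigr].
% CEM alternates an E-step (posterior membership probabilities), a C-step
% setting z_{ik} = 1 for the most probable component, and an M-step
% maximizing C\ell over \theta given the hard assignments z.
```

With Gaussian components of equal spherical variance, this reduces to the K-means objective, which is the sense in which CEM is its likelihood generalization.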
Reproducibility of results is one of Stata's most valuable features, as well as an essential goal for researchers and journal editors. This ability, however, is limited by the lack of version control for user-submitted packages, which are often distributed through GitHub and other channels outside the Statistical Software Components (SSC) archive. Thus, other researchers, or even coauthors, might fail to reproduce a given result from the same code and data because of different package versions. In this talk, we present REQUIRE, a Stata package that fills this gap by ensuring that package dependencies are consistent across users. To do so, REQUIRE extracts a package's version number from the "starbang line" included by authors at the top of each ado-file. Because starbangs are not standardized and come in many variants, our package takes particular care to cover corner cases and to achieve coverage as broad as possible across the packages available on SSC and GitHub. REQUIRE can then be used to assert that an exact or minimum package version is present and to install it if asked. Last, we showcase how to use this package together with the related SETROOT package, which tracks projects' working directories.
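For illustration, a typical starbang line (the version-comment convention at the top of an ado-file) looks like this; the version number, date, and author below are hypothetical:

```stata
*! version 1.2.0  15aug2023  Jane Doe
* Tools like REQUIRE parse the version number (here, 1.2.0) from this
* first comment line of the ado-file; the surrounding text varies
* considerably across authors, which is what makes parsing nontrivial.
```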
Stata and Tableau are tools that can be used to gain insight into Likert-scale responses. However, very little research discusses how to create Likert-scale visualizations using Stata and Tableau in tandem. The purpose of this work is to help researchers create Likert-scale visualizations efficiently. The step-by-step process will serve as a guide for researchers to create dashboard-worthy visualizations that effectively present data. The key is creating an Excel file exported from Stata that can be imported as a data source into Tableau. It is important that this file include respondents' IDs, a group variable, and the Likert-scale responses. In addition, the raw data must be prepared using reshape, and an additional variable indicating the numeric values of the Likert-scale responses (or vice versa) must be generated using gen. Once the Excel file is imported into Tableau, we can set up the visual with the sheet interface. Using Tableau, we can create a Likert-scale visual with select mark modifications and even include item-response averages using level-of-detail (LOD) arithmetic. Using best data practices and formatting, we can create visuals that effectively communicate findings from raw survey data.
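A minimal sketch of the Stata side of this workflow (the file and variable names here are hypothetical):

```stata
* Hypothetical wide data: one row per respondent, with id, a group
* variable, and Likert items q1-q10 stored as value-labeled integers.
use "survey.dta", clear
reshape long q, i(id) j(item)            // one row per respondent-item
gen response_num = q                     // numeric Likert value
decode q, gen(response_txt)              // text label, if q carries labels
export excel id group item response_num response_txt ///
    using "likert_long.xlsx", firstrow(variables) replace
```

The resulting long-format file, with one row per respondent-item pair, is the shape Tableau expects as a data source.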
Metaprogramming provides a highly flexible approach to solving complex programming problems. Although it can be challenging to implement in some programming languages, it is easy in Stata, largely because of the evaluation of local macros. Yet metaprogramming is rarely discussed in the Stata community, despite the benefits it already provides for many Stata users. This talk will discuss what metaprogramming is, explain how it can be used effectively to increase efficiency, and illustrate its use in Stata.
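A minimal sketch of the idea, using only built-in local-macro evaluation (the model and variables chosen here are arbitrary):

```stata
* Build a command as a string, then let macro expansion execute it.
sysuse auto, clear
local depvar price
local rhs    mpg weight
local cmd    regress `depvar' `rhs'
display `"about to run: `cmd'"'
`cmd'                      // the local expands into a full command line
```

Because a macro can hold any fragment of Stata syntax, code can assemble, inspect, and then run other code, which is the essence of metaprogramming.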
Model uncertainty accompanies many data analyses. Stata's new bma suite, which performs Bayesian model averaging (BMA), helps address this uncertainty in the context of linear regression. Which predictors are important given the observed data? Which models are more plausible? How do predictors relate to each other across different models? BMA can answer these questions and more. BMA uses Bayes's theorem to aggregate results across multiple candidate models to account for model uncertainty during inference and prediction in a principled and universal way. In my presentation, I will describe the basics of BMA and demonstrate it with the bma suite. I will also show how BMA can become a useful tool for your regression analysis, Bayesian or not!
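Schematically, BMA combines candidate models M_1, ..., M_J as follows:

```latex
% Posterior model probabilities by Bayes's theorem:
p(M_j \mid y) = \frac{p(y \mid M_j)\, p(M_j)}
                     {\sum_{k=1}^{J} p(y \mid M_k)\, p(M_k)} .
% Model-averaged posterior of a quantity \Delta (coefficient, prediction):
p(\Delta \mid y) = \sum_{j=1}^{J} p(\Delta \mid y, M_j)\, p(M_j \mid y).
```

A predictor's posterior inclusion probability is simply the total posterior probability of the models that contain it.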
Transparent data-quality reporting is a key element of reproducible research. Transparency ranges from explicit assumptions underlying any data-quality checkup to harmonized reporting that facilitates comparisons of results within and across studies. However, this is far from common. To the best of our knowledge, none of the existing routines is capable of producing a series of structured reports on multiple datasets with potentially unknown errors, based on a single command call, to grade and compare data-quality issues. Therefore, the dqrep Stata package was developed. dqrep triggers a set of more than 60 newly developed Stata ado-files to compute a customizable range of quality checks. These comprise descriptive overviews, missing values, rule violations, outliers, time trends, and observer and device effects. Underlying assumptions are read from easily modifiable spreadsheets. All results are then integrated into PDF and .docx files, as well as result-summary files that facilitate postprocessing, for example, to create benchmarks. It is shown how a single command call is used to control the data-quality pipeline in a large-scale cohort study and how this may contribute to FAIR research. dqrep can be downloaded with the net command from https://packages.qihs.uni-greifswald.de/repository/stata/dqrep.
This presentation introduces a new Stata command, classify, that computes various measures of association and correlation between two categorical variables (binary, ordinal, or nominal), evaluates the performance of categorical deterministic forecasts, and provides diagnostic probability scores of the accuracy of probabilistic forecasts. We compiled a comprehensive catalogue of 9 diagnostic scores for probabilistic forecasts and over 210 measures of association and correlation employed in different fields, along with the associated terminological synonymy and bibliography. In addition to the overall measures, the command computes category-specific metrics for each observed category and their macro and weighted averages. We also classify all measures according to the two types of symmetry, and we propose and compute the complement and transpose symmetric variants of those measures that are not symmetric.
Rigorous research conducted in Africa since 2015 established that onebillion's software, an award-winning tablet-based curriculum, produces meaningful impacts in literacy and numeracy (Levesque, Bardack, and Chigeda 2020; Levesque et al. 2022; Pitchford, Hubber, and Chigeda 2017). As these programs are scaled up, program monitoring will become critical for maintaining the quality of implementation and outcomes. International organizations have called for using text analysis as a tool for monitoring and evaluation (Wencker 2019). The present study piloted the use of text analysis to identify themes from field observations of a tablet-based program using onebillion's software for early grade learners. We collected 426 open-ended observations by field officers. We used the Stata package ldagibbs to run topic modeling/latent Dirichlet allocation (LDA). LDA clusters text documents into a user-chosen number of topics (Schwarz 2018). We anticipated that LDA would generate topics that help us more efficiently summarize field observations. LDA successfully generated topics such as faulty audio cables and how they contributed to noisier classrooms. We will receive more survey data as we scale to new sites. Pilot results suggest that LDA may be an efficient means of identifying topics otherwise difficult to identify with staff review of voluminous survey responses.
With submissions encouraged from both new and longtime Stata users from all backgrounds, the committee will review all abstracts in developing an exciting, diverse, and informative program. We look forward to seeing you in Stanford.
Experience what happens when new and longtime Stata users from across all disciplines gather to discuss real-world applications of Stata. Whether you are a beginner or an expert, you will find something just for you at Stata Conferences, which are held each year in several different locations around the world.
These conferences provide in-depth presentations from experienced Stata users and experts from StataCorp that focus on helping you use Stata more effectively.
Open to users of all disciplines and experience levels, Stata Conferences bring together a unique mix of experts and professionals. Develop a well-established network within the Stata community.
Hear from Stata experts at the top of their fields, as well as Stata's own researchers and developers. Gain valuable insights, discover new commands, learn best practices, and improve your knowledge of Stata.
Presentation topics have included new community-contributed commands, methods and resources for teaching with Stata, new approaches for using Stata together with other software, and much more.