The Stata Conference was held 4–5 August 2022. Don't forget to save the date and join us next year at the 2023 Stata Conference in Stanford, California on 20–21 July 2023!
View the conference photos here, and view the proceedings and presentation slides below.
In this presentation, I introduce the gtsheckman command, which implements a generalized two-step Heckman sample-selection estimator adjusted for heteroskedasticity. This estimator was proposed in Carlson and Joshi (2022), where the presence of heteroskedasticity was motivated by a panel-data setting with random coefficients. The gtsheckman command offers several advantages over the heckman, twostep command, including robust inference, a more general control function specification, and an allowance for heteroskedasticity.
Quantile regression (command qreg) estimates quantiles of the outcome variable, conditional on the values of the independent variables, with median regression as the default form. Quantile regression can be used for several purposes: to estimate medians instead of means as a measure of central tendency (for instance, when data are markedly skewed); to estimate a particular quantile that may be of interest, such as the 10th percentile of birthweight to find predictors of low birthweight; or to study how the effects of independent variables vary over different quantiles of the dependent variable. Specifying the variance–covariance estimator for quantile regression is not straightforward. qreg offers both independent and identically distributed (i.i.d.) and robust estimators. The density estimation technique (DET) can be fitted, residual (i.i.d. only), or kernel. Three different bandwidth methods are available with the fitted and residual DETs, and eight kernel functions are available for the kernel DET. There is also a bootstrap option, which puts the total number of methods at 26. A natural question arises: which one should be used? The aim of this presentation is to explore the performance of the methods and to arrive at some overall recommendations for which methods to use.
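The count of 26 methods can be checked by enumerating the combinations as the abstract describes them. The following Python sketch is only a tally: the option labels are placeholders rather than qreg syntax, and it assumes that each kernel function counts as one method regardless of bandwidth, which is the reading that reproduces the stated total.

```python
# Placeholder labels (assumptions for illustration, not qreg option names).
BANDWIDTHS = ["bandwidth_1", "bandwidth_2", "bandwidth_3"]  # three bandwidth methods
KERNELS = [f"kernel_{i}" for i in range(1, 9)]              # eight kernel functions


def enumerate_vce_methods():
    """List every variance-estimation method described in the abstract."""
    methods = []
    for vce in ("iid", "robust"):
        for det in ("fitted", "residual", "kernel"):
            # The residual DET is available with the i.i.d. estimator only.
            if det == "residual" and vce == "robust":
                continue
            # Kernel DET varies by kernel function; the others vary by bandwidth.
            choices = KERNELS if det == "kernel" else BANDWIDTHS
            methods.extend((vce, det, choice) for choice in choices)
    # The bootstrap is counted as one additional method.
    methods.append(("bootstrap", None, None))
    return methods


methods = enumerate_vce_methods()
print(len(methods))  # 26
```

The breakdown is 14 i.i.d. methods (3 fitted, 3 residual, 8 kernel), 11 robust methods (3 fitted, 8 kernel), and 1 bootstrap.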
Co-authors: Joseph V. Terza (IUPUI), James Fisher (Henan University)
We present a Stata command, bivpoisson, that allows efficient estimation of seemingly unrelated count data models. This command is an extension of and improvement upon sureg, which is a linear, seemingly unrelated regression command based on Zellner (1963). This is the first command in Stata that allows for a user-specified cross-equation correlation structure in the context of a nonlinear system of equations. This package can be applied to count data in many fields, such as accident analysis, RNA sequencing, and healthcare. The theoretical advantage of this model is the efficiency gain. When we encounter count-valued correlated dependent variables, linear system-of-equations estimation is no longer efficient. See details of the simulation study for efficiency comparison in Terza and Zhang (2022, Working paper). Maximum likelihood estimation is used for deep-parameter estimation and causal inference, and these numerical tasks are implemented in Stata/Mata with the two-dimensional Gauss–Legendre quadrature integration algorithm. See Terza and Zhang (2020 Stata Conference) and Kazeminezhad, Terza, and Zhang (2021 Stata Conference) for the details of the algorithm and validation. The deep parameters estimated by this package include the point estimates and standard errors of (1) a vector of coefficients beta for the exponentiated linear index; and (2) the correlation coefficient parameter rho for the cross-equation heterogeneity term, which is multivariate normally distributed. A postestimation command for average treatment-effect (ATE) estimation will be developed in a later version of this command, as will model-specification tests. Other types of count marginal distributions such as Conway–Maxwell–Poisson will also be added in a future version as options for dispersion flexibility.
This presentation describes a new Stata command, rbicopula, for copula-based maximum-likelihood estimation of recursive bivariate models. These models enable a flexible residual distribution and differ from bivariate copula or probit models in allowing the first dependent variable to appear on the right-hand side of the equation for the second dependent variable. The new command provides various copulas, allowing the user to choose a copula that best captures the dependence features of the data caused by the presence of common unobserved heterogeneity. Although the estimation of model parameters does not differ from the bivariate case, the existing community-contributed command bicop does not consider the structural model's recursive nature for predictions and does not enable margins as a postestimation command. rbicopula estimates the model parameters, computes treatment effects of the first dependent variable, and gives the marginal effects of independent variables. In addition, marginal effects can be decomposed into direct and indirect effects if covariates appear in both equations. Moreover, the postestimation commands incorporate two goodness-of-fit tests. Dependent variables of the recursive bivariate model may be binary, ordinal, or a mixture of both. I present and explain the rbicopula command and the available postestimation commands using data from the Stata website.
Recent versions of Stata provide helpful tools to generate reproducible reports in Microsoft Word, HTML, and PDF formats. However, for better or worse, Microsoft PowerPoint presentations are the most common form of communication in many business and academic settings. Therefore, many Stata users may benefit from tools to integrate Stata and PowerPoint. I introduce a suite of new Stata programs that facilitate creating PowerPoint presentations with Stata-generated content, particularly graphs. These programs take advantage of Stata version 17’s tighter integration with the Python programming language. Using this suite of programs, collectively called “Slide Deck,” Stata users can easily create PowerPoint presentations within Stata. Slide Deck encompasses two easy-to-use, original Stata classes: “deck” and “slide.” With a few simple commands, these classes enable users to create and save a deck of PowerPoint slides that incorporate Stata graphs and other output, as well as user-supplied text (e.g., titles and bullet points), without ever leaving Stata.
Shared computing environments are regularly used by academia, government, and industry. While they help organizations manage the costs and upkeep of computing infrastructure, working in a shared computing environment presents both unique benefits and challenges compared with using hardware owned or operated by the researcher. This talk will provide some advice for successfully navigating the challenges of working in shared computing environments. Topics will include dealing with memory and disk/storage constraints, leveraging metadata for documentation and infrastructure, standardizing project setup and workflow, and using some newer community-contributed tools that can maximize the efficiency of your computing consumption.
You can use treatment-effects estimators to draw causal inferences from observational data. You can use lasso when you want to control for many potential covariates. With standard treatment-effects models, there is an intrinsic conflict between two required assumptions. The conditional independence assumption is more likely to be satisfied with many variables in the model, while the overlap assumption is more likely to be satisfied with fewer variables in the model. This presentation shows how to overcome this conflict by using Stata 17's telasso command. telasso estimates average treatment effects with high-dimensional controls while using lasso for model selection. This estimator is robust to model-selection mistakes. Moreover, it is doubly robust, so only one of the outcome and treatment models needs to be correctly specified.
Co-authors: Andrés Garcia-Suaza (U. del Rosario), Miguel Henry (Greylock McKinnon Associates), Jesús Otero (U. del Rosario)
We offer a two-stage (time-series and cross-section) econometric modeling approach to examine the drivers behind the spread of COVID-19 deaths across counties in the United States. Our empirical strategy exploits the availability of two years (January 2020 through January 2022) of daily data on the number of confirmed deaths and cases of COVID-19 in the 3,000 U.S. counties of the 48 contiguous states and the District of Columbia. In the first stage of the analysis, we use daily time-series data on COVID-19 cases and deaths to fit mixed models of deaths against lagged confirmed cases for each county. Because the resulting coefficients are county specific, they relax the homogeneity assumption that is implicit when the analysis is performed using geographically aggregated cross-section units. In the second stage of the analysis, we assume that these county estimates are a function of economic and sociodemographic factors that are taken as fixed over the course of the pandemic. Here we employ the novel one-covariate-at-a-time variable-selection algorithm proposed by Chudik et al. (Econometrica, 2018) to guide the choice of regressors.
Co-authors: Wilson Hernández (GRADE), José Carlos Aguilar (PUCP), Jorge Agüero (Univ of Connecticut)
The consensus is that intimate partner violence (IPV) increased during the COVID-19 lockdown. However, neither the long-term effect nor the mechanisms that explain this variation have been adequately identified (Peterman and Donnell 2020), a gap that applies to the literature in Peru and worldwide. The objective of this study is to assess the long-term impact of lockdown on IPV over the 11 months from its start, differentiating the effects by type of violence (psychological, physical, and sexual) and examining three mechanisms through which these effects may appear: prior violence, substance use, and social isolation. We do so by applying an event study and exploiting the time and location of hourly calls (N = 235,555) received by the only national helpline for domestic violence in Peru (Línea 100) from 01/2018 to 02/2021. By focusing on Peru, we were able to examine what happened to IPV during COVID-19 in a country with a complex situation for women: high pre-COVID-19 prevalence of IPV (Bott et al. 2019a), restrictive long-lasting lockdown measures during the pandemic, and the worst performance against COVID-19 in terms of deaths per capita and loss of national gross income. The results show that IPV varied, but nonlinearly, in the 11 months from the start of lockdown. Furthermore, psychological IPV showed the greatest increase, followed by physical IPV. Sexual IPV showed no changes. In terms of the impact mechanisms, previous history and alcohol consumption were the most important ones, with nonlinear variations over time. While nonlinearity may indicate a regression to the mean for some cases as a sign of “new normal levels” of IPV, relationships with risk factors show an opposite situation in which IPV is still rising a year after the initial lockdown.
In the aftermath of the global financial crisis of 2008, macrofinancial linkages have gained more attention from policymakers as primary issues of financial system stability. A clearer understanding of probability of default (PD) drivers may help predict whether a bank will default on its portfolio liabilities. This presentation develops a method to assess a bank's PD based on a multivariate copula distribution to capture nonlinear relationships between variables with complex data structures. Then we use the generalized method of moments (GMM) to examine the relationship between PD and both bank performance (bank-specific indicators) and macroeconomic indicators. Our findings illustrate some critical links between PD and macroeconomic environments. For example, empirical evidence suggests that bank-specific indicators such as the CET 1 ratio, inefficiency ratio, and deposit ratio appear to be negatively and significantly related to a bank's PD. When we examined the structural and macroeconomic variables, we found that the policy rate, the real exchange rate, economic growth, and the unemployment rate may reduce the PD. We also found that central state-owned banks tend to have a higher risk than other bank groups and that regional state-owned banks in the central region have the greatest likelihood of default.
Co-authors: Kristoffer Bjarkefur (World Bank), Benjamin Daniels (World Bank/Georgetown Univ), Avnish Singh (World Bank)
This presentation introduces three commands providing new functionality for high-quality and transparent data handling. First, iecorrect uses human-readable sheets to document and implement all changes (corrections) to data points in one line of Stata code. Second, iecodebook export creates data dictionaries and includes new features for validating the structure or contents of datasets and creating replication datasets. Third, iesave enhances save with the additional feature of tracking changes to datasets over time in a Git-friendly way. Altogether, these commands allow users to access data descriptions and changelogs without reviewing Stata code, and they allow team members to contribute to data quality control without using Stata. In addition to the commands, the presentation will discuss general challenges of documenting datasets that the authors solved while creating the commands.
Ex-ante evaluation of the distributional effects of a macroeconomic shock is a difficult task. One approach relies on microsimulation models often combined with a macroeconomic model (e.g., a CGE model). This approach typically follows a top-down sequence where the microsimulation model takes the outputs from the macroeconomic model as given and then uses a household survey to generate changes in the data that mimic the resulting macroeconomic aggregates. For example, this approach could be used to model how changes in the level of employment and wages by industry derived from a given macroeconomic scenario (e.g., a set of climate change policies) impact poverty and inequality. This presentation compares two methods (reweighting versus modeling occupational choices) for analyzing changes in the labor market in the context of a top-down macro–micro model. I use two surveys that are more than 10 years apart to explore how these two different ways of modeling changes in the labor market using the older survey can predict what we observe in the newer survey.
Co-authors: Kristoffer Bjarkefur (World Bank), Luiza Cardoso de Andrade (World Bank), Maria Jones (World Bank)
Development Research In Practice: The DIME Analytics Data Handbook is a new Stata-centric handbook for empirical researchers. It guides readers through best practices for code and data handling in research projects from inception to publication. It includes code snippets, links to a complete Stata project repository on GitHub, links to continuously updated workflows on the DIME Wiki, and the DIME Analytics Stata Style Guide, as well as a series of recorded lectures accompanying each chapter. The handbook is intended as a complete introduction to modern reproducible code and data work. It can be used as a training manual for new staff; a textbook companion to an undergraduate or graduate-level empirical methods course; or a desk reference for practitioners at any level. In addition to the paperback, free ebook and PDF versions are available online. In this presentation, the authors will discuss the reasons for publishing the handbook, focusing on the need for Stata practitioners to improve standardization across projects. With the continued rise of research centers and labs, nonstandardized and idiosyncratic approaches slow down learning and impair collaboration. This handbook and discussion will provide a starting point for Stata users worldwide.
Co-authors: Sarah D. Newton (University of Connecticut), D. Betsy McCoach (University of Connecticut)
Model evaluation is an unavoidable facet of multilevel modeling (MLM). Current guidance encourages researchers to focus on two overarching model-selection factors: model fit and model adequacy (McCoach et al. 2022). Researchers routinely use information criteria to select from a set of competing models and assess the relative fit of each candidate model to their data. However, researchers must also consider the ability of their models and their various constituent parts to explain variance in the outcomes of interest (i.e., model adequacy). Prior methods for assessing model adequacy in MLM are limited. Therefore, Rights and Sterba (2019) proposed a new framework for decomposing variance in MLM to estimate R-squared measures. Yet there is no Stata package that implements this framework. Thus, we propose a new Stata package that computes both (1) a variety of model fit criteria and (2) the model adequacy measures described by Rights and Sterba to facilitate multilevel model selection for Stata users. The goal of this package is to provide researchers with an easy way to utilize a variety of complementary methods to evaluate their multilevel models.
When do leaplings (persons born on February 29) celebrate their birthdays in nonleap years? What is the difference, say, in milliseconds, between two timestamps if leap seconds are counted, based on Coordinated Universal Time (UTC) standards? What if you want to make sure that the dates are properly stored and ready for fitting a time-series model or performing survival analysis? Dates and times are all too familiar concepts we often take for granted. They lurk under data management and statistical analysis with various degrees of importance depending on the task at hand. In this talk, we will demonstrate how to handle these tasks using Stata's vast collection of date and time functions with highlights of the new functions in Stata 17.
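The first two questions can be made concrete with a short Python sketch. It is an illustration only and does not show Stata's date functions: the helper name birthday_in_year and its nonleap argument are hypothetical, and the two conventions offered (28 February or 1 March) are assumptions about how a leapling's birthday might be defined in a nonleap year.

```python
import calendar
from datetime import date, datetime


def birthday_in_year(dob, year, nonleap="mar01"):
    """Return the birthday date in `year` (hypothetical helper, not a Stata function).

    For someone born on 29 February, `nonleap` chooses the convention used in
    nonleap years: "feb28" or "mar01".
    """
    if dob.month == 2 and dob.day == 29 and not calendar.isleap(year):
        if nonleap == "feb28":
            return date(year, 2, 28)
        if nonleap == "mar01":
            return date(year, 3, 1)
        raise ValueError("nonleap must be 'feb28' or 'mar01'")
    return date(year, dob.month, dob.day)


leapling = date(2000, 2, 29)
print(birthday_in_year(leapling, 2023))           # 2023-03-01
print(birthday_in_year(leapling, 2023, "feb28"))  # 2023-02-28
print(birthday_in_year(leapling, 2024))           # 2024-02-29

# Python's datetime ignores leap seconds, so this difference is 1000 ms.
# A leap second was inserted at the end of 31 December 2016, so a count that
# includes leap seconds would give 2000 ms for the same pair of timestamps.
naive_ms = (datetime(2017, 1, 1) - datetime(2016, 12, 31, 23, 59, 59)).total_seconds() * 1000
print(naive_ms)  # 1000.0
```

The point of the sketch is that both answers depend on a stated convention, which is why the choice has to be explicit before dates are used in time-series or survival models.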
Stata has a strong suite of survey data-analysis references and tools and remains the primary choice for researchers working with survey data. On the other hand, R is the primary choice for data visualization in many academic papers, given its flexibility, especially when using the ggplot2 package based on the design philosophy of The Grammar of Graphics. An unfulfilled need for many researchers is presenting survey data-analysis results in innovative ways without feeling limited to a single statistical package. This presentation discusses a workflow of using Stata for analysis and exporting the results through the postfile commands, then handing the data off to R to create a rich array of figures. As a proof of concept, the presentation will show results from an ongoing health economics research project in the Philippines, which uses around 200,000 observations from national income and expenditure survey data, to create publication-quality dumbbell plots, concentration curves, and Pen’s parades. Finally, the presentation will briefly describe how to share code and results in a public repository like GitHub.
We demonstrate the power of the underutilized Stata spatial analytical module Sp, with an eye on the broader and older path analytic modeling framework (gsem and sem, where SEM stands for structural equation modeling). Spatial aggregate data have become widely available, yet analysts often ignore their spatial structure (regions have neighbors, and neighboring regions are more similar than would be expected by chance). Research often reports artificial naïve/a-spatial associations that ignore this spatial nonindependence. We analyze public data from the CDC on social vulnerability and life expectancy at the census tract level, using the state of Connecticut as an illustration. We compare (1) the spregress modeling options against SEM models that include the outcome’s spatial lag as copredictor; (2) a two-step mediation model with spregress against SEM with indirect effects; (3) the total effects of a spatial predictor on a spatial outcome estimated with spregress by adding up effects from neighbors to each region (and back), against nonrecursive SEM models that use spatial lag versions of each spatial variable as instrumental variables. We point to several extensions of spatial modeling into the SEM approach, like spatial factor analysis and spatial "causal" mediation models, and contrast Stata’s utilities against comparable models in GeoDa and Mplus.
Co-authors: Fernando Rios-Avila (Levy Economics Institute)
We propose a method to analyze interval-censored data using multiple imputation based on a heteroskedastic interval regression approach. The proposed model aims to obtain a synthetic dataset that can be used for standard analysis, including standard linear regression, quantile regression, or poverty and inequality estimation. We present two applications to show the performance of our method. First, we run a Monte Carlo simulation to show the method's performance under the assumption of multiplicative heteroskedasticity, with and without conditional normality. Second, we use the proposed methodology to analyze labor income data in Grenada for 2013–2020, where the salary data are interval-censored according to the salary intervals prespecified in the survey questionnaire. The results obtained are consistent across both exercises.
The scientific committee is responsible for the Stata Conference program. With submissions encouraged from both new and long-time Stata users from all backgrounds, the committee will review all abstracts in developing an exciting, diverse, and informative program. We look forward to seeing you in DC!
Open to users of all disciplines and experience levels, Stata Conferences bring together a unique mix of experts and professionals. Develop a well-established network within the Stata Community.
Hear from Stata experts at the top of their fields, as well as Stata's own researchers and developers. Gain valuable insights, discover new commands, learn best practices, and improve your knowledge of Stata.
Presentation topics have included new community-contributed commands, methods and resources for teaching with Stata, new approaches for using Stata together with other software, and much more.