The Stata Conference took place July 27-28, 2017. You can still view the program and presentation slides (below) and the conference photos.
Working with demographic life table data in Stata
Abstract: This presentation introduces two user-written Stata commands related to the data and calculations of demographic life tables, whose most prominent feature is the calculation of life expectancy at birth. The first command, hmddata, provides a convenient interface to the Human Mortality Database (HMD, www.mortality.org), a database widely used for mortality data by demographers, health researchers, and social scientists. Different subcommands of hmddata allow data from this database to be easily loaded, transformed, reshaped, tabulated, and graphed. The second command, lifetable, produces demographic period life tables. The main features are that life table columns can be flexibly calculated using any valid minimum starting information; abridged tables can be generated from complete ones; finally, a Stata dataset can hold any number of life tables, and the various lifetable subcommands can operate on any subset of them.
Daniel C. Schneider
Max Planck Institute for Demographic Research
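As background for the calculations that lifetable automates, here is a minimal sketch of standard period life-table arithmetic in plain Stata. It assumes a dataset of age groups sorted by age with variables n (interval width), mx (age-specific death rate), and ax (average years lived in the interval by those who die in it); it is not the syntax of hmddata or lifetable themselves.

. * Sketch only: life-table columns from mx (variable names are assumptions)
. sort age
. gen qx = n*mx / (1 + (n - ax)*mx)      // probability of dying within the interval
. replace qx = 1 if _n == _N             // open-ended last age group
. gen lx = 100000 if _n == 1             // radix: survivors at exact age 0
. replace lx = lx[_n-1]*(1 - qx[_n-1]) if _n > 1
. gen dx = lx*qx                         // deaths within the interval
. gen Lx = n*(lx - dx) + ax*dx           // person-years lived in the interval
. replace Lx = lx/mx if _n == _N         // person-years in the open interval
. gsort -age
. gen Tx = sum(Lx)                       // person-years remaining above age x
. sort age
. gen ex = Tx/lx                         // life expectancy at exact age x
. list age ex if age == 0                // life expectancy at birth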
Application of the MIMIC model to detect and predict differential item functioning
Abstract: There has been extensive research indicating gender-based differences in STEM subjects, particularly mathematics (Albano and Rodriguez 2013; Lane, Wang, and Magone 1996). Similarly, gender-based differential item functioning (DIF) has been researched because of the disadvantages females face in STEM subjects when compared with their male counterparts. Given this, the study will apply the multiple indicators multiple causes (MIMIC) model, a type of structural equation model, to detect the presence of gender-based DIF using the Program for International Student Assessment (PISA) mathematics data from students in the United States and then predict the DIF using math-related covariates. This study will build upon a previous study that explored the same data using the hierarchical generalized linear model and will be confirmatory in nature. Based on the results of the previous study, it is expected that several items will exhibit DIF that disadvantages females and that mathematics-based self-efficacy will predict the DIF. Additional covariates will also be explored, and the two models will be compared in terms of their DIF detection and the subsequent modeling of DIF. Implications of these results include females continuing to underachieve relative to their male counterparts; such gender differences can further manifest at the national level, causing U.S. students as a whole to underperform internationally. Last, the efficacy of the MIMIC model for detecting and predicting DIF will be illustrated, so that the model can become more widely used to understand such differences.
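For readers unfamiliar with the setup, a MIMIC DIF model of this kind can be fit with Stata's gsem command. The sketch below is only illustrative: item1-item5 (binary item responses), Math (the latent trait), and female are hypothetical names, and the direct path from female to item3 is the uniform-DIF test for that item.

. * Illustrative MIMIC model for uniform DIF on item3 (variable names assumed)
. gsem (Math -> item1 item2 item3 item4 item5, logit) (female -> Math) (female -> item3)
. * A significant coefficient on female in the item3 equation, net of the
. * female -> Math path, indicates gender-based DIF on that item.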
Abstract: In 2001, I gave a presentation on three-valued logic. Since then, I have developed some ideas that grew out of that investigation, leading to new insights about missing values and to the development of five-valued logic. I will also show how these notions extend to numeric computation and to an abstract generalization of the principles involved. This is not about analysis; this is about data construction and preparation, and it is a possibly interesting conceptual tool.
Data for Decisions
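As context for the talk, a few lines of core Stata show how missing values already interact with logical expressions, which is the kind of behavior the presentation builds on (income is a hypothetical variable):

. display (. > 5)        // 1: missing sorts above every nonmissing number
. display (. == .a)      // 0: system missing and extended missing .a are distinct
. gen byte highinc = (income > 50000) if !missing(income)   // keep "unknown" unknown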
Uncomplicated Parallel Computing with Stata
Abstract: The parallel package lets you run Stata faster, sometimes even faster than Stata/MP itself. By splitting your job across several Stata instances, parallel gives you out-of-the-box parallel computing. Using the parallel prefix, you can speed up simulations, bootstrapping, reshaping big data, and more, without having to know a thing about parallel computing. It does not require Stata/MP to be installed, and it has been shown to speed up computations by a factor of two, four, or more, depending on how many processors your computer has.
George G. Vega Yon
University of Southern California
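A hedged sketch of typical usage, with option names recalled from the package documentation and therefore possibly differing across versions:

. ssc install parallel
. sysuse auto, clear
. parallel setclusters 4                                        // spawn four child Stata instances
. parallel bs, reps(2000): regress price mpg weight foreign     // parallelized bootstrap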
Stata extensibility with the Java API: Tools, examples, and advice
Abstract: The inclusion of the Java API for Stata provides users, and user programmers, with exciting opportunities to leverage a wide array of existing work in the context of their Stata workflow. This talk will introduce a few tools designed to help others wanting to integrate Java libraries into their workflow: the Stata Maven Archetype and the StataJavaUtilities library. In addition to a higher-level overview, the presentation will show examples of using existing Java libraries to expand statistical models in psychometrics, to send yourself an email when your job completes, to compute phonetic string encodings and string distances, and to access file and operating system properties, as well as examples to use as starting points for developing Java plugins in Stata.
Fayette County Public Schools
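On the Stata side, a compiled Java class is invoked with the official javacall command; the class and method below are hypothetical placeholders for your own compiled code, which must be visible on Stata's Java classpath (see help javacall):

. * MyTools and sayHello are placeholders, not part of the presented libraries
. javacall MyTools sayHello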
Big data in Stata with the ftools package
Abstract: In recent years, very large datasets have become increasingly prevalent in most social sciences. However, some of the most important Stata commands (collapse, egen, merge, sort, etc.) rely on algorithms that are not well suited for big data. In my talk, I will present the ftools package, which contains plugin alternatives to these commands and performs up to 20 times faster on large datasets. Further, I will explain the underlying algorithm and Mata function and show how to use this function to create new Stata commands and to speed up existing packages. Benchmarks are available at https://github.com/sergiocorreia/ftools/#benchmarks
Board of Governors of the Federal Reserve System
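A brief example of the drop-in style of the package, using the auto data (install from SSC):

. ssc install ftools
. sysuse auto, clear
. fegen cell = group(foreign rep78)            // fast alternative to egen ... group()
. fcollapse (mean) price weight, by(foreign)   // fast alternative to collapse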
On the shoulders of giants, or not reinventing the wheel
Abstract: Part of the art of coding is writing as little as possible to do as much as possible. The presentation expands on this truism. Examples are given of Stata code to yield graphs and tables in which most of the real work is happily delegated to workhorse commands. In graphics, a key principle is that graph twoway is the most general command, even when you do not want rectangular axes. Variations on scatter- and line plots are precisely that, variations on scatter- and line plots. More challenging illustrations include commands for circular and triangular graphics, in which x and y axes are omitted, with an inevitable but manageable cost in re-creating scaffolding, titles, labels, and other elements. In tabulations and listings, the better-known commands sometimes seem to fall short of what you want. However, a few preparation commands (such as generate, egen, collapse, or contract) followed by list, tabdisp, or _tab can get you a long way. The examples range in scope from a few lines of interactive code to fully developed programs. The presentation is thus pitched at all levels of Stata users.
Durham University, United Kingdom
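A small taste of the approach, using only official commands and the auto data:

. sysuse auto, clear
. twoway (scatter mpg weight) (lfit mpg weight), by(foreign)   // a variation on a scatterplot
. preserve
. contract rep78 foreign                  // preparation: frequencies of combinations
. tabdisp rep78 foreign, cellvar(_freq)   // a neat two-way display with little work
. restore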
Incorporating Stata into reproducible documents
Abstract: Part of reproducible research is eliminating manual steps such as hand-editing documents. Stata 15 introduces several commands that facilitate automated document production, including dyndoc for converting dynamic Markdown documents to web pages, putdocx for creating Word documents, and putpdf for creating PDF files. These commands let you mix formatted text and Stata output and embed Stata graphs, in-line Stata results, and tables containing the output from selected Stata commands. We will show these commands in action, demonstrating how to automate the production of documents in various formats and how to include Stata results in those documents.
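One way these commands fit together (the file and table names here are illustrative):

. sysuse auto, clear
. regress price mpg weight
. putdocx begin
. putdocx paragraph, style(Heading1)
. putdocx text ("Regression of price on mpg and weight")
. putdocx table results = etable              // embed the current estimation table
. graph twoway scatter price mpg
. graph export price_mpg.png, replace
. putdocx paragraph
. putdocx image price_mpg.png
. putdocx save example.docx, replace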
Propensity scores and causal inference using machine learning methods
Abstract: We compare a variety of methods for predicting the probability of a binary treatment (the propensity score), with the goal of comparing otherwise like cases in treatment and control conditions for causal inference about treatment effects. Better prediction methods can, under some circumstances, improve causal inference by reducing both the finite-sample bias and the variability of estimators; sometimes, however, better predictions of the probability of treatment can increase bias and variance. We clarify the conditions under which different methods produce better or worse inference (in terms of mean squared error of causal impact estimates).
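For orientation, the kind of baseline the comparison starts from can be fit with official commands; the variable names below are hypothetical, and the talk's interest is in replacing the simple logit propensity model with better prediction methods:

. teffects ipw (outcome) (treat age income education, logit), atet      // inverse-probability weighting
. teffects psmatch (outcome) (treat age income education), atet         // propensity-score matching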
Now you see me: High school dropout and machine learning
Abstract: In this paper, we create an algorithm to predict which students will eventually drop out of U.S. high school using information available in ninth grade. We show that using a naive model—as implemented in many schools—leads to poor predictions. We then explain how schools can obtain more precise predictions by exploiting the big data available to them, as well as more sophisticated quantitative techniques. We also compare the performance of econometric techniques such as logistic regression with machine learning tools such as support vector machines, boosting, and lasso. We offer practical advice on how to apply machine learning methods in Stata to the high-dimensional datasets available in education. Model parameters are calibrated by taking into account policy goals and budget constraints.
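A core-Stata sketch of the simple logit baseline and its out-of-sample evaluation, with hypothetical ninth-grade variables; the machine learning comparisons in the paper go beyond this:

. set seed 2017
. gen byte train = runiform() < 0.7                  // 70/30 train/test split
. logit dropout gpa9 absences9 suspended9 if train
. predict phat if !train, pr                         // out-of-sample predicted risk
. roctab dropout phat if !train                      // discrimination (AUC) in the test set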
Small area estimation and poverty map in Stata
Abstract: We present a new Stata package for small-area estimation of poverty and inequality, implementing the methodology of Elbers, Lanjouw, and Lanjouw (2003). Small-area methods attempt to overcome the low representativeness of surveys within areas, or the lack of data for specific areas and subpopulations, by incorporating information from outside sources. A common outside source is census data, which often lack detailed information on welfare. Thus far, a major limitation to such analysis in Stata has been the memory required to work with census data. The povmap package introduces new Mata functions and a plugin used to circumvent the memory limitations that arise when working with big data.
Paul Andres Corral Rodas; Joao Pedro Wagner De Azevedo; Qinghua Zhao
Analyzing satellite data in Stata
Abstract: We provide examples of how to use satellite and other remote sensing data in Stata with a variety of analysis methods, including measuring economic disadvantage from satellite imagery.
Computing occupational segregation indices with standard errors: An ado-file application with an illustration for Colombia
Abstract: We developed an ado-file to easily estimate three selected occupational segregation indicators with standard errors using a bootstrap procedure. The indicators are the Duncan and Duncan (1955) dissimilarity index, the Gini coefficient based on the distribution of jobs by gender (see Deutsch et al.), and the Karmel and MacLachlan (1988) index of labor market segregation. This routine can be easily applied to conventional labor market microdata in which information on occupation classification, industry, and occupational category is usually available. As an illustration, we present estimates of both occupational and industry segregation by gender drawn from Colombian household survey microdata. Estimating occupational segregation measures with standard errors proves useful in assessing statistical differences in segregation across labor market groups and over time.
Jairo G. Isaza-Castro
Universidad de la Salle
Karen Hernandez; Karen Guerrero; Jessy Hemer
Universidad de la Salle
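To give a flavor of what the ado-file automates, here is a do-file sketch of the Duncan and Duncan (1955) dissimilarity index with bootstrap standard errors in core Stata; occ (occupation) and female (0/1) are hypothetical variables, and the presenters' own ado-file is not reproduced here.

* Sketch only: Duncan dissimilarity index D with bootstrap standard errors
program define duncanD, rclass
    preserve
    gen byte male = !female
    collapse (sum) male female, by(occ)
    summarize male, meanonly
    local M = r(sum)
    summarize female, meanonly
    local F = r(sum)
    gen diff = abs(male/`M' - female/`F')
    summarize diff, meanonly
    return scalar D = 0.5*r(sum)
    restore
end
bootstrap D = r(D), reps(500) seed(2017): duncanD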
cvcrand and cptest: Efficient design and analysis of cluster randomized trials
Abstract: Cluster randomized trials (CRTs), where clusters (for example, schools or clinics) are randomized but measurements are taken on individuals, are commonly used to evaluate interventions in public health and social science. Because CRTs typically involve only a few clusters, simple randomization frequently leads to baseline imbalance of cluster characteristics across treatment arms, threatening the internal validity of the trial. In CRTs with a small number of clusters, classic approaches to balancing baseline characteristics—such as matching and stratification—have several drawbacks, especially when the number of baseline characteristics the researcher desires to balance is large (Ivers et al. 2012). An alternative approach is constrained randomization, whereby an allocation scheme is randomly selected from a subset of all possible allocation schemes based on the value of a balancing criterion (Raab and Butcher 2001). Subsequently, an adjusted permutation test can be used in the analysis, which provides increased efficiency under constrained randomization compared with simple randomization (Li et al. 2015). We describe constrained randomization and permutation tests for the design and analysis of CRTs and provide examples to demonstrate the use of our newly created Stata package (cvcrand), which uses Mata to efficiently process large allocation matrices, to implement constrained randomization and permutation tests.
Fan Li; Hengshi Yu; Elizabeth L. Turner
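As a point of reference, official Stata already provides an unconstrained, individual-level permutation test via permute; cptest implements the cluster-level, constrained-randomization version the abstract describes. The variable names below are hypothetical:

. permute treat _b[treat], reps(1000) seed(1): regress outcome treat baseline_x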
Using theory to define a computationally tractable specification space in confirmatory factor models
Abstract: Researchers constructing measurement models must decide how to proceed when an initial specification fits poorly. Common approaches include search algorithms that optimize fit and piecemeal changes to the item list or the error specification. The former approach may yield a good-fitting model that is inconsistent with theory or may fail to identify the best-fitting model because of local optimization issues. The latter suffers from poor reproducibility and may also fail to identify the optimal model. We outline a new approach that defines a computationally tractable specification space based on theory. We use the example of a hypothesized latent variable with 25 candidate indicators divided across 5 content areas. Using the tuples command, we identify all combinations of indicators containing at least one indicator per content area. In our example, this yields 7,294 models. We estimate each model on a derivation dataset and select candidate models with fit statistics that are acceptable or could be rendered acceptable by permitting correlated errors. Eight models fit these criteria. We evaluate modification indices, respecify if there is theoretical justification for correlated errors, and select a final model based on fit statistics. In contrast to other methods, this approach is easily replicable and may result in a model that is consistent with theory and has acceptable fit.
Johns Hopkins Bloomberg School of Public Health
Lorraine Dean
Johns Hopkins Bloomberg School of Public Health
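A do-file sketch of the enumeration step, assuming the user-written tuples command from SSC and five hypothetical indicators v1-v5; the option names and the fit-screening rule are illustrative rather than the authors' exact code:

* Illustrative only: enumerate indicator subsets and screen one-factor CFAs by fit
ssc install tuples
tuples v1 v2 v3 v4 v5, min(3)          // combinations with at least three indicators
forvalues i = 1/`ntuples' {
    capture quietly sem (F -> `tuple`i'')
    if _rc continue                     // skip combinations that fail to converge
    quietly estat gof, stats(rmsea)
    if r(rmsea) < .08 display "`tuple`i''  RMSEA = " %5.3f r(rmsea)
}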
Response surface models for the Elliott, Rothenberg, Stock DF-GLS unit-root test
Abstract: We present response surface coefficients for a large range of quantiles of the Elliott, Rothenberg, and Stock (Econometrica 1996) DF-GLS unit-root tests for different combinations of the number of observations and the lag order in the test regressions, where the latter can be either specified by the user or endogenously determined. The critical values depend on the method used to select the number of lags. The Stata command ersur is presented, and its use illustrated with an empirical example that tests the validity of the expectations hypothesis of the term structure of interest rates.
Boston College and DIW Berlin
Universidad del Rosario, Colombia
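For comparison, the DF-GLS test itself is available as the official dfgls command; ersur adds the response-surface critical values. The interest-rate series below is hypothetical:

. tsset quarter
. dfgls rate10y, maxlag(8)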
Estimating treatment effects in the presence of correlated binary outcomes and contemporaneous selection
Abstract: Estimating the causal effect of a treatment is challenging when selection into the treatment is based on contemporaneous unobservable characteristics and the outcome of interest is a series of correlated binary outcomes. Under these conditions, traditional nonlinear panel-data models, such as the random-effects logistic model, will produce biased estimates of the treatment effect because of correlation between the treatment variable and model unobservables. In this presentation, I will introduce a new Stata estimation command, etxtlogit, which fits a model where the outcome is a series of J correlated binary (logistic) outcomes and selection into the treatment is based on contemporaneous unobservable characteristics. The presentation will introduce the new estimation command, present Monte Carlo evidence, and offer empirical examples. Special cases of the model will be discussed, including applications based on the explanatory (behavioral) Rasch model, a model from item response theory (IRT).
Matthew P. Rabbitt
Economic Research Service, U.S. Department of Agriculture
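The traditional baseline the abstract refers to is the random-effects logit, fit with official commands as below (variable names hypothetical); etxtlogit is designed for the case in which this estimator is biased:

. xtset personid
. xtlogit y treat x1 x2, re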
crreg: A new command for generalized continuation ratio models
Abstract: A continuation ratio model represents a variant of an ordered regression model that is suited to modeling processes that unfold in stages, such as educational attainment. The parameters for covariates in continuation ratio models may be constrained to be equal, subject to a proportionality constraint across stages, or freely vary across stages. Currently, there are three user-written Stata commands that fit continuation ratio models. Each of these commands fits some subset of continuation ratio models involving parameter constraints, but none of them offer complete coverage of the range of possibilities. In addition, all the commands rely on reshaping the data into a stage-case format to facilitate estimation. The new crreg command expands the options for continuation ratio models to include the possibility for some or all of the covariates to be constrained to be equal, to freely vary, or to have a proportionality constraint across stages. The crreg command relies on Stata’s ML routines for estimation and avoids reshaping the data. The crreg command includes options for three different link functions (the logit, probit, and cloglog) and supports Stata’s survey and multiple imputation suites of commands.
Ball State University
Oklahoma State University
The multivariate dustbin
Abstract: When I was in graduate school, I was taught that multivariate methods were the future of data analysis. In that dark computer stone age, multivariate meant multivariate analysis of variance (MANOVA), linear discriminant function analysis (LDA), canonical correlation analysis (CA), and factor analysis (which will not be discussed in this presentation). Statistical software has evolved considerably since those ancient days. MANOVA, LDA, and CA are still around but have been eclipsed and pushed aside by newer, sexier methodologies. These three methods have been consigned to the multivariate dustbin, so to speak. This presentation will review MANOVA, LDA, and CA, discuss the connections among the three approaches, and highlight the positives and negatives of each approach.
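All three methods remain available as official commands; for example, with the auto data:

. sysuse auto, clear
. manova price mpg weight = foreign               // MANOVA: group differences on a set of outcomes
. discrim lda price mpg weight, group(foreign)    // linear discriminant function analysis
. canon (price mpg) (weight length trunk)         // canonical correlations between two variable sets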
Analyzing interval-censored survival-time data in Stata
Abstract: In survival analysis, right-censored data have been studied extensively and can be analyzed using Stata's extensive suite of survival commands, including streg for fitting parametric survival models. Right-censored data are a special case of interval-censored data. Interval-censoring occurs when the failure time of interest is not exactly observed but is only known to lie within some interval. Left-censoring, which occurs when the failure is known to happen some time before the observed time, is also a special case of interval-censoring. Survival data may contain a mixture of uncensored, right-censored, left-censored, and interval-censored observations. In this talk, I will describe basic types of interval-censored data and demonstrate how to fit parametric survival models to these data using Stata's new stintreg command. I will also discuss postestimation features available after this command.
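A minimal example of the new command, with hypothetical variable names; ltime and rtime hold the lower and upper bounds of the censoring interval:

. stintreg age treatment, interval(ltime rtime) distribution(weibull)
. predict med_t, median time          // predicted median survival time (see help stintreg postestimation)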
Estimating effects from extended regression models
Abstract: I use the new extended regression command eoprobit to estimate the effect of an endogenous treatment on an ordinal profit outcome.
David M. Drukker
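A sketch of the setup described above, with hypothetical variable names; profitcat is the ordinal outcome and treat is the endogenous binary treatment, modeled with instruments z1 and z2:

. eoprobit profitcat x1 x2, entreat(treat = z1 z2)
. estat teffects                      // average treatment effect of the endogenous treatment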
Wishes and grumbles
Registration is now closed.
Renaissance Baltimore Harborplace Hotel
202 East Pratt Street
Baltimore, MD 21202
The conference venue is near several tourist attractions, including the USS Constellation and other vessels in the harbor, the American Visionary Art Museum, and the National Aquarium.