
Last updated: 4 October 2012

2012 Stata Conference San Diego

26–27 July 2012

San Diego

Manchester Grand Hyatt
One Market Place
San Diego, CA 92101

Proceedings


Custom Stata commands for semi-automatic confidentiality screening of Statistics Canada data

Jesse McCrosky
University of Saskatchewan
The use of Statistics Canada census and survey data in research data centers is subject to very specific and sometimes complex confidentiality requirements. Ensuring that statistical output meets these requirements adds an additional step to analysis that can be difficult and time-consuming. Thanks to the flexibility of Stata, this additional step can sometimes be avoided. I present newly developed Stata commands that partially automate this process. Features include reporting of minimum unweighted frequencies for weighted output, automatic rounding of results as required by a given survey, and warnings when potentially unreleasable results are generated. These commands have the potential to save time and reduce error rates for researchers using Statistics Canada data as well as for Research Data Analysts, the Statistics Canada employees responsible for confidentiality screening.
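
As a rough illustration of the kind of check these commands automate, the sketch below uses only official Stata to flag cells of a weighted tabulation that rest on too few unweighted observations. The variable name and the threshold of 10 are hypothetical, and this is not the author's command.

    * Hedged sketch: warn when a cell has too few unweighted observations
    * (threshold and variable names are hypothetical; actual Statistics
    * Canada rules vary by survey).
    local threshold 10
    levelsof region, local(groups)
    foreach g of local groups {
        quietly count if region == `g'
        if r(N) < `threshold' {
            display as error "region == `g': only " r(N) ///
                " unweighted observations; result may not be releasable"
        }
    }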

Additional information
sd12_mccrosky.pdf

scdensity: A program for self-consistent density estimation

Joerg Luedicke
University of Florida and Yale University
Estimating the density of a distribution from a finite number of data points is an important tool in the statistician’s and data analyst’s toolbox. In a recent paper, Bernacchia and Pigolotti (2011) introduce a new nonparametric method for the density estimation of univariate distributions. Whereas conventional methods, such as histograms or kernel density estimates, require arbitrary choices to be made beforehand (for example, choosing a smoothing parameter), Bernacchia and Pigolotti’s approach makes no a priori assumptions and instead estimates the density in a “self-consistent” way by iteratively finding an optimal shape of the kernel. The method of self-consistent density estimation is implemented in Stata as an ado-file (scdensity), with its main engine written in Mata. In this presentation, I will discuss the underlying theory and main features of this program. In addition, I will present results of Monte Carlo simulations that compare the performance of the self-consistent density estimate with various kernel estimates and maximum likelihood fits. Finally, I will evaluate the potential usefulness of the self-consistent estimator in other contexts, such as nonparametric regression modeling.

Reference:
Bernacchia, A., and Pigolotti, S. 2011. Self-consistent method for density estimation. Journal of the Royal Statistical Society, Series B 73: 407–422.
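
For context, the bandwidth sensitivity that motivates the method is easy to reproduce with official commands; the sketch below uses the auto data and two arbitrary bandwidths. The syntax of scdensity itself is not given in the abstract and is not shown here.

    * Two arbitrary bandwidth choices give visibly different kernel
    * estimates of the same variable -- the kind of a priori choice the
    * self-consistent estimator is designed to avoid.
    sysuse auto, clear
    kdensity mpg, bwidth(1) name(narrow, replace)
    kdensity mpg, bwidth(4) name(wide, replace)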

Additional information
sd12_luedicke.pdf

TMPM: The trauma mortality prediction model is robust to ICD-9, ICD-10, and AIS coding lexicons

Alan Cook
Baylor University Medical Center
Many methods have been developed to predict mortality following trauma. Two classification systems are used to provide a taxonomy for diseases, including injuries: the ICD-9 is the classification system for administrative data in the United States, while the AIS was developed to characterize injuries alone. The Trauma Mortality Prediction Model (TMPM) is based on empirical estimates of severity for each injury in the ICD-9 and AIS lexicons. The probability of mortality for each patient is estimated from that patient’s five worst injuries. TMPM has been rigorously tested against other mortality prediction models using ICD-9 and AIS data and has been found superior. The tmpm command allows Stata users to efficiently apply TMPM to datasets coded in either ICD-9 or AIS. The command uses model-averaged regression coefficients that assign empirically derived severity measures to each of the 1,322 AIS codes and 1,579 ICD-9 injury codes. The injury codes are sorted into body regions and then merged with the table of model-averaged regression coefficients to assemble a set of regression coefficients. A logit model is generated to calculate the probability of death. tmpm accommodates either the AIS or the ICD-9 lexicon from a single command and adds the probability of mortality for each patient to the original dataset as a new variable.
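
The general workflow described above can be sketched with official commands. The file and variable names below are hypothetical, and the real tmpm command applies model-averaged regression coefficients rather than this simplified scoring.

    * Hedged outline, not the tmpm command itself: merge injury records
    * with a table of empirically derived severity weights, keep each
    * patient's five worst injuries, and convert the summed index to a
    * probability of death.
    use patient_injuries, clear                        // one record per injury
    merge m:1 icd9_code using severity_coefficients    // hypothetical coefficient table
    bysort patient_id (severity): keep if _n > _N - 5  // five worst injuries
    collapse (sum) xb = severity, by(patient_id)
    generate prob_death = invlogit(xb)                 // logit-scale index to probability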

Additional information
sd12_cook.pdf

Adoption: A new Stata routine for consistently estimating population technological adoption parameters

Aliou Diagne
Africa Rice Center
Diagne and Demont (2007) used a counterfactual outcomes framework to show that the observed sample technological adoption rate does not consistently estimate the population adoption rate even if the sample is random. They likewise showed that a model with the observed adoption outcome as the dependent variable cannot yield consistent estimates of the determinants of adoption when exposure to the technology is not observed and controlled for. In this talk, I present a new user-written Stata command called adoption. The command uses Stata estimation commands internally to carry out the various estimations and computes the correct standard errors for the average treatment effect (ATE) parameter estimates: population mean potential adoption in the exposed subpopulation (ATE1), population mean potential adoption in the non-exposed subpopulation (ATE0), population mean joint exposure and adoption (JEA), population adoption gap (GAP), and population selection bias (PSB). The ATE adoption parameters are estimated using either a semiparametric method (inverse probability weighting) or a parametric method that fits the adoption outcome on independent variables using one of Stata’s parametric models, such as probit, logit, generalized linear models, ordinary least squares, Poisson, or tobit.

Reference:
Diagne, A., and M. Demont. 2007. Taking a new look at empirical models of adoption: Average treatment effect estimation of adoption rates and their determinants. Agricultural Economics 37: 201–210.
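
As a rough sketch of the semiparametric option, the inverse-probability-weighting step can be written out with official commands. The variable names are hypothetical (exposed indicates exposure to the technology, adopt the observed adoption outcome), and unlike the adoption command, this sketch does not correct the standard errors.

    * Hedged IPW sketch of the ATE-type population adoption rate: weight
    * each exposed observation's outcome by the inverse of its estimated
    * probability of exposure.
    probit exposed x1 x2 x3
    predict double pexp, pr
    generate double y_ipw = exposed * adopt / pexp
    summarize y_ipw        // the mean estimates the population adoption rate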

Additional information
sd12_diagne.pdf

Graphics (and numerics) for univariate distributions

Nicholas J. Cox
Durham University, UK
How to plot (and summarize) univariate distributions is a staple of introductory data analysis. Graphical (and numerical) assessment of marginal and conditional distributions remains important for much statistical modeling. Research problems can easily evoke needs for many comparisons, across groups, across variables, across models, and so forth. Over several centuries, many methods have been suggested, and their relative merits are a source of lively ongoing debate. I offer a selective but also detailed review of Stata functionality for univariate distributions. The presentation ranges from official Stata commands through various user-written commands, including some new programs, to suggestions on how to code your own graphics commands when other sources fail. I also discuss both continuous and discrete distributions. The tradeoff between showing detail and allowing broad comparisons is an underlying theme.
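
A few of the official building blocks such a review draws on are shown below with the auto data; the selection is illustrative and is not a summary of the talk.

    * Official starting points for plotting univariate distributions.
    sysuse auto, clear
    histogram mpg, bin(15) normal        // histogram with a normal density overlay
    kdensity mpg                         // kernel density estimate
    qnorm mpg                            // quantile-normal plot
    graph box mpg, over(foreign)         // distribution compared across groups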

Additional information
sd12_cox.ppt

Binary choice models with endogenous regressors

Christopher Baum
Boston College and DIW Berlin
Yingying Dong
University of California–Irvine
Arthur Lewbel
Boston College
Dong and Lewbel have developed the theory of simple estimators for binary choice models with endogenous or mismeasured regressors, depending on a “special regressor” as defined by Lewbel (2000). “Control function” methods such as Stata’s ivprobit are generally only valid when the endogenous regressors are continuous. The estimators proposed here can be used with limited, censored, continuous, or discrete endogenous regressors, and they have significant advantages over alternatives such as maximum likelihood and the linear probability model. These estimators are numerically straightforward to implement. We present and demonstrate an improved version of a Stata routine that provides both estimation and postestimation features.

Reference:
Lewbel, A. 2000. Semiparametric qualitative response model estimation with unknown heteroskedasticity and instrumental variables. Journal of Econometrics 97: 145–177.
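
For contrast, the official control-function estimator is shown below with hypothetical variable names (y2 is the endogenous regressor; z1 and z2 are its instruments). The special-regressor routine presented in the talk is a separate user-written command whose syntax is not reproduced here.

    * Official control-function approach, generally valid only for a
    * continuous endogenous regressor (hypothetical variable names).
    ivprobit y x1 x2 (y2 = z1 z2)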

Additional information
sd12_baum.pdf

An application of multiple imputation and sampling-based estimation

Haluk Gedikoglu
Lincoln University of Missouri
Missing data occur frequently in agricultural household surveys, possibly leading to biased and inefficient regression estimates. Multiple imputation can be used to overcome the missing-data problem. Previous studies applied multiple imputation to datasets in which only some of the variables have missing observations while the rest have none; in reality, however, all the variables in a survey might have missing observations. Currently, there is no theoretical or practical guidance for practitioners on how to apply multiple imputation when all the variables in a dataset have missing observations. The objective of this study is to evaluate the impact of alternative multiple-imputation methods when all the variables have missing observations. The data for this study were collected through a mail survey of 2,995 farmers in Missouri and Iowa in spring 2011. Two multiple-imputation methods are applied in the imputation step: one using only the complete observations and the other using all the observations. The results show that using all the observations in the imputation step, even those with missing values, produces estimates with lower standard errors. Hence, practitioners should use all the observations in the imputation step.
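
A minimal sketch of imputation by chained equations when every variable may contain missing values is given below; the variable names and imputation models are hypothetical and do not reproduce the specific setups compared in the talk.

    * Hedged sketch: impute all variables jointly with chained equations,
    * then combine estimates across the imputed datasets.
    mi set mlong
    mi register imputed income acres educ adopt
    mi impute chained (regress) income acres (ologit) educ (logit) adopt, add(20) rseed(1234)
    mi estimate: logit adopt income acres educ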

Additional information
sd12_gedikoglu.pdf

The application of Stata’s multiple-imputation techniques to analyze a design of experiments with multiple responses

Clara Novoa
Texas State University
In this talk, I exemplify the application of the multiple-imputation techniques available in Stata to analyze a design of experiments with multiple responses and missing data. Analyses with no imputation and with multiple imputation are compared.

Additional information
sd12_novoa.pdf

EFA within a CFA context

Phil Ender
UCLA Statistical Consulting
EFA within a CFA framework combines aspects of both EFA and CFA. It uses CFA to produce a factor solution that is close to an EFA solution while providing features typically found in CFA, such as standard errors, statistical tests, and modification indices. In this presentation, I include an example using the sem command introduced in Stata 12.
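
A plain two-factor CFA starting point with the official sem command is sketched below using hypothetical items x1-x6. The EFA-within-CFA specification discussed in the talk additionally frees most cross-loadings while fixing anchor loadings, which is not shown here.

    * Hedged CFA baseline; standard errors, fit statistics, and
    * modification indices are the CFA-side features noted above.
    sem (F1 -> x1 x2 x3) (F2 -> x4 x5 x6)
    estat gof, stats(all)
    estat mindices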

Additional information
sd12_ender.pdf

Structural equation modeling using the SEM Builder and the sem command

Kristin MacDonald
StataCorp LP
In this talk, I will give a brief introduction to structural equation modeling (SEM) and Stata’s sem command. I will also introduce the SEM Builder—the graphical user interface for drawing path diagrams, fitting structural equation models, and analyzing the results. Using the SEM Builder, we will take a more detailed look at some of the models commonly fit within the SEM framework, including confirmatory factor models, path models with observed variables, structural models with latent variables, and multiple-group models.
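
For readers who prefer the command line, the kinds of models listed above map onto sem syntax along the lines sketched below (variable names hypothetical); the Builder draws the equivalent path diagrams.

    * Command-line counterparts of models one might draw in the SEM Builder.
    sem (y1 <- x1 x2) (y2 <- y1 x2)              // path model with observed variables
    sem (Latent -> m1 m2 m3) (y <- Latent x1)    // structural model with a latent variable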

Additional information
sd12_macdonald.pdf

Imagining a Stata/Python combination

James Fiedler
Universities Space Research Association
There are occasions when a task is difficult in Stata but fairly easy in a more general programming language. Python is a popular language for a range of uses. It is easy to use, has many high-quality packages, and allows programs to be written relatively quickly. Is there any advantage to combining Stata and Python within a single interface? Stata already offers support for user-written programs, which allow extensive control over calculations but somewhat less control over graphics. Also, except for specifying output, the user has minimal programmatic control over the user interface. Python can be used in ways that allow more control over the interface and graphics and, in so doing, provide roundabout methods for satisfying some user requests (for example, transparency levels in graphics and the ability to clear the Results window). My talk will explore these ideas, present a possible method for combining Stata and Python, and give examples to demonstrate how this combination might be useful.
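
The loosest form of such a combination is already possible by shelling out to Python from Stata, as sketched below with a hypothetical script name; the talk envisions a much tighter integration than this.

    * Hedged sketch of the loose coupling available today: export data,
    * run an external Python script, and read the results back.
    outsheet using for_python.csv, comma replace
    !python process_results.py for_python.csv from_python.csv
    insheet using from_python.csv, comma clear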

Additional information
sd12_fiedler.pdf

Issues for analyzing competing-risks data with missing or misclassification in causes

Ronny Westerman
Philipps-University of Marburg
Competing-risks models have a wide field of application in medical and public health studies. A key challenge in applying cause-specific survival models is the problem of missing or misclassified causes of death. Masked causes of death arise from incomplete or only partially identifiable information on death certificates. In this presentation, I will introduce some alternative approaches for competing-risks models together with the Stata commands that implement them, and I will discuss their limitations using hands-on examples. A further aim is to introduce more sophisticated tools for modeling the long-term survival function in terms of competing risks. The data analysis uses freely accessible SEER data from the National Cancer Institute.
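
Among the implemented tools, official Stata provides cause-specific Cox regression and the Fine-Gray subdistribution hazard model; a minimal sketch with hypothetical variable names follows (failtype records the cause of death).

    * Hedged sketch: declare the survival data and fit competing-risks
    * models, treating cause 2 as the competing event.
    stset time, failure(failtype == 1)
    stcox age stage                              // cause-specific hazard model
    stcrreg age stage, compete(failtype == 2)    // Fine-Gray subdistribution hazard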

Additional information
sd12_westerman.pptx

Generating survival data for fitting marginal structural Cox models using Stata

Ehsan Karim
University of British Columbia
Marginal structural models (MSMs) can be used to estimate the effect of a time-dependent exposure in the presence of time-dependent confounding. Previously, Fewell et al. (2004) described how to estimate this model in Stata based on a weighted pooled logistic model approximation. However, based on the current literature and some recent simulation study results, this model can be suitably fit in other ways too, and various new weighting schemes are proposed accordingly. In this presentation, I will first explain the idea behind MSMs and justify the use of various weighting schemes through simple examples and tabulations using Stata. Then I will illustrate the procedure of generating survival data from a Cox MSM by using existing Stata commands. I will compare the performance of simulated data generation and the procedure of fitting MSMs via Stata with other standard statistical packages such as SAS and R.

Reference:
Fewell, Z., M. A. Hernán, F. Wolfe, K. Tilling, H. Choi, and J. A. C. Sterne. 2004. Controlling for time-dependent confounding using marginal structural models. Stata Journal 4: 402–420.
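
In outline, the Fewell et al. (2004) approach fits a weighted pooled logistic model as an approximation to the Cox MSM. The sketch below assumes the inverse probability of treatment and censoring weights, ipw, have already been constructed from separate treatment and censoring models; the variable names are hypothetical.

    * Hedged sketch of the weighted pooled logistic approximation to a Cox MSM.
    logit event exposure visit x1 x2 [pweight = ipw], vce(cluster id)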

Additional information
sd12_karim.pdf

Computing optimal strata bounds using dynamic programming

Eric Miller
Summit Consulting
Stratification is a sampling design that can improve efficiency. It works by first partitioning the population into homogeneous subgroups and then performing simple random sampling within each group. For a continuous variable, stratification involves determining strata boundaries. Holding the number of strata fixed, a reduction in the width of a given stratum reduces its associated variance at the expense of the variances from the other strata. Dynamic programming provides a method for simultaneously minimizing all the strata variances by determining optimal strata boundaries. In this presentation, I describe a new user-written command, optbounds, that uses dynamic programming to find optimal boundary points for a continuous stratification variable. The command uses the variance minimization technique developed by Khan, Nand, and Ahmad (2008). The user first chooses a known probability distribution that approximates the stratification variable. Parameter estimates are then generated from the data, and goodness-of-fit statistics are used to assess the quality of the approximation. A brief overview of the theory, a description of the command, and several illustrative examples will be provided.

Reference:
Khan, M. G. M., N. Nand, and N. Ahmad. 2008. Determining the optimum strata boundary points using dynamic programming. Survey Methodology 34: 205–214.

Additional information
sd12_miller.pdf

Correct standard errors for multistage regression-based estimators: A guide for practitioners with illustrations

Joseph Terza
University of North Carolina–Greensboro
With a view toward lessening the analytic and computational burden faced by practitioners seeking to correct the standard errors of two-stage estimators, I offer a heretofore unnoticed simplification of the conventional formulation for the most commonly encountered cases in empirical application—two-stage estimators involving maximum likelihood estimation or nonlinear least squares in either stage. Also with the applied researcher in mind, I cast the discussion in the context of nonlinear regression models involving endogeneity—a sampling problem whose solution often requires two-stage estimation. I detail simplified standard error formulations for three very useful estimators in applied contexts involving endogeneity in a nonlinear setting (endogenous regressors, endogenous sample selection, and causal effects). The analytics and Stata/Mata code for implementing the simplified formulae are demonstrated with illustrative real-world examples and simulated data.
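
A common, if computationally heavier, alternative to the analytic corrections discussed above is to bootstrap both stages together. The sketch below is a generic two-stage residual-inclusion setup with hypothetical variable names; it is not the simplified formulation derived in the talk.

    * Hedged sketch: bundle both stages in one program and bootstrap it so
    * the reported standard errors reflect first-stage estimation error.
    capture program drop twostage
    program define twostage
        capture drop dhat uhat
        probit d z x                        // first stage for endogenous regressor d
        predict double dhat, pr
        generate double uhat = d - dhat     // first-stage residual
        poisson y d uhat x                  // second stage with residual included
    end
    bootstrap _b, reps(200): twostage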

Additional information
sd12_terza.pdf

Shrinkage estimators for structural parameters

Tirthankar Chakravarty
University of California–San Diego
Instrumental-variables estimators of parameters in single-equation structural models, like 2SLS and LIML, are the most commonly used econometric estimators. Hausman-type tests are commonly used to choose between OLS and IV estimators. However, recent research has revealed troublesome size properties of Wald tests based on these pre-test estimators. These problems can be circumvented by using shrinkage estimators, particularly James–Stein estimators. I introduce the ivshrink command, which encompasses nearly 20 distinct variants of the shrinkage-type estimators proposed in the econometrics literature, based on optimal risk properties, including fixed (k-class estimators are a special case) and data-dependent shrinkage estimators (random convex combinations of OLS and IV estimators, for example). Analytical standard errors to be used in Wald-type tests are provided where appropriate, and bootstrap standard errors are reported otherwise. Where the variance–covariance matrices of the resulting estimators are expected to be degenerate, options for matrix norm regularization are also provided. We illustrate the techniques using a widely used dataset in the econometric literature.
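
The endpoints that such shrinkage estimators combine are available in official Stata and are shown below with hypothetical variable names (y2 endogenous; z1 and z2 excluded instruments); the syntax of ivshrink itself is not reproduced here.

    * The estimators being combined: OLS and IV/k-class fits of the same equation.
    regress y y2 x1                       // OLS, efficient if y2 is exogenous
    ivregress 2sls y x1 (y2 = z1 z2)      // 2SLS
    estat endogenous                      // the Hausman-type pretest the talk cautions against
    ivregress liml y x1 (y2 = z1 z2)      // LIML, another k-class estimator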

Additional information
sd12_chakravarty.pdf

Stata implementation of the nonparametric spatial heteroskedasticity- and autocorrelation-consistent covariance matrix estimator

P. Wilner Jeanty
Hobby Center for the Study of Texas/Kinder Institute for Urban Research, Rice University
In this talk, I introduce two Stata routines to implement the nonparametric spatial heteroskedasticity- and autocorrelation-consistent (SHAC) estimator of the variance–covariance matrix in a spatial context, as proposed by Conley (1999) and Kelejian and Prucha (2007). The SHAC estimator is robust against potential misspecification of the disturbance terms and allows for unknown forms of heteroskedasticity and correlation across spatial units. Heteroskedasticity is likely to arise when spatial units differ in size or structural features.

References:
Conley, T. 1999. GMM estimation with cross sectional dependence. Journal of Econometrics 92: 1–45.

Kelejian, H. H., and I. R. Prucha. 2010. Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances. Journal of Econometrics 157: 53–67.

Additional information
sd12_jeanty.pdf

Big data, little spaces, high speed: Using Stata to analyze the determinants of broadband access in the United States

David Beede
U.S. Department of Commerce
Brittany Bond
U.S. Department of Commerce
This study brings together Census block-level data on broadband service availability, economics, demographics, regulations, and terrain to model the supply and demand of high-speed broadband service in the United States. While Stata is the primary tool for data management and multilevel modeling, other software tools, such as GIS, are used in conjunction with Stata to generate visually arresting pictures that help communicate the study’s findings.

Additional information
sd12_beede.pdf

A comparative analysis of lottery-, charter-, and traditional-based elementary schools within the Anchorage school district

Matthew McCauley
University of Alaska–Anchorage
The growing popularity of alternatives to traditional public schools—such as public charter and lottery-based schools—has prompted nationwide research. In Anchorage, however, there are few quantitative studies that compare student performance across traditional public schools, charter schools, and lottery-based schools. The purpose of this project is to create and analyze panel data for all public elementary schools within the Anchorage School District (ASD) and compare the achievement of charter and lottery-based schools with that of traditional schools. This will be done using public Terra Nova and SBA data from ASD for the years 2007–2010 in addition to U.S. Census data. The data will be imported into Stata, a robust statistical software application, and regression techniques will be used to compare student Terra Nova and SBA scores while controlling for other factors that also influence test scores.

Additional information
sd12_mccauley.pptx

Matching individuals in the Current Population Survey: A distance-based approach

Stuart Craig
Yale University
In this presentation, I introduce a set of Stata programs designed to match individuals from year to year in the Current Population Survey (CPS) using a distance-based measure of similarity. Unlike panel data, the CPS is a repeated cross section of geographic residences, which are continually surveyed regardless of whether the occupants are the same. Previous work has taken the person and household identifiers supplied in the datasets as given and validated or invalidated identifier-derived matches based on demographic variables. This work has focused on selecting the best set of demographic verifiers. Recognizing that there is substantial error in the supplied identifiers, the distance-based approach extends these methods by treating demographic variables as pseudo-identifiers and selecting matches based on a criterion of distance minimization. This approach possesses several advantages over prior methods. First, by reducing the weight placed on the survey-provided identifiers, the distance approach provides a matching technique that can be uniformly applied across the entire CPS series to create a consistent historical series of CPS matches, even in those years where the survey-provided identifiers are particularly error-prone. Second, this approach provides a flexible framework for matching individuals in the CPS, which allows for the selection of pseudo-identifiers to vary based on the measurement of interest. Third, it generates a matched series with low and consistent mismatch rates, which is ideal for measuring secular trends in dynamics, such as income volatility. Several measures of distance and the analytical decisions regarding acceptable year-to-year variation are discussed.
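
The core idea can be sketched with official commands under hypothetical file and variable names and arbitrary weights; the actual programs select pseudo-identifiers and weights far more carefully.

    * Hedged sketch: form all candidate pairs within a household across
    * adjacent years, score each pair by a weighted distance over
    * demographic pseudo-identifiers, and keep the closest candidate.
    use cps_2010, clear
    rename (lineno age sex race educ) (lineno0 age0 sex0 race0 educ0)
    joinby hhid using cps_2011                   // candidate pairs within household
    generate double dist = abs(age - age0 - 1) ///
        + 5*(sex != sex0) + 3*(race != race0) + (educ < educ0)
    bysort hhid lineno0 (dist): keep if _n == 1  // best match per year-1 person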

Additional information
sd12_craig.pdf

Allocative efficiency analysis using DEA in Stata

Choonjoo Lee
Korea National Defense University
In this presentation, I present a procedure and an illustrative application of a user-written allocative efficiency (AE) model in Stata. The AE model measures allocative efficiency and economic efficiency as well as technical efficiency when price and cost information on production is available. The model is an extension of the basic DEA models that I also wrote.

Additional information
sd12_lee.pdf

Psychometric analysis using Stata

Chuck Huber
StataCorp LP
In this talk, I will provide an overview of Stata features that are typically used for the analysis of psychometric and educational testing data. Traditional multivariate tools such as canonical correlation, MANOVA, multivariate regression, Cronbach’s alpha, exploratory and confirmatory factor analysis, cluster analysis, and discriminant analysis will be discussed as well as more modern techniques based on latent trait models such as the Rasch model, multidimensional scaling, and correspondence analysis. Multilevel mixed-effects models for continuous, binary, and count outcomes will be described in the context of both ecological systems theory and longitudinal data analysis. Structural equation modeling will also be mentioned but not discussed in detail.
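
A sampling of the official commands in this space is sketched below with hypothetical item names; the talk's own worked examples are in the accompanying do-file and datasets listed below.

    * Classical test theory and factor-analytic basics in official Stata.
    alpha item1-item10, item             // Cronbach's alpha with item statistics
    factor item1-item10, pcf             // exploratory factor analysis
    rotate, promax                       // oblique rotation
    sem (Verbal -> item1 item2 item3) (Quant -> item4 item5 item6)    // CFA via sem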

Additional information
sd12_huber.pdf
Huber_2012SanDiego.do
Huber_2012SanDiego.dta
Huber_2012SanDiego_Pilot.dta
Huber_2012SanDiego_SEM.stsem

Scientific organizers

Phil Ender (chair), UCLA

A. Colin Cameron, UC Davis

Xiao Chen, UCLA

Estie Hudes, UC San Francisco

Michael Mitchell, U.S. Department of Veterans Affairs

Logistics organizers

Chris Farrar, StataCorp

Gretchen Farrar, StataCorp