Last updated: 16 November 2007

2007 West Coast Stata Users Group meeting

25–26 October 2007

Marina del Rey

Marina del Rey Hotel
13534 Bali Way
Marina del Rey, CA 90292


Prediction of random effects and effects of misspecification of their distribution

Charles McCulloch
University of California, San Francisco
Statistical models that include random effects are commonly used to analyze longitudinal and clustered data. These models are often used to derive predicted values of the random effects, for example in predicting which physicians or hospitals are performing exceptionally well or exceptionally poorly. I start this talk with a brief introduction and several examples of the use of prediction of random effects in practice. In typical applications, the data analyst specifies a parametric distribution for the random effects (often Gaussian) although there is little information available to guide this choice. Are predictions sensitive to this specification? Through theory, simulations, and an example illustrating the prediction of who is likely to go on to develop high blood pressure, I show that misspecification can have a moderate impact on predictions of random effects and describe simple ways to diagnose such sensitivity.

Additional information
West_Coast_Stata_2007_talk_predict_random_effects.pdf (slides)

Panel data methods for microeconometrics using Stata

Colin Cameron
University of California, Davis
This presentation provides an overview of the subset of methods for panel data and the associated Stata xt commands most commonly used by microeconometricians. First, attention is focused on a short panel, meaning data on many individual units and few time periods. Examples include longitudinal surveys of many individuals and panel datasets on many firms. The data can then be viewed as being clustered on the individual unit, and the panel methods used are also applicable to other forms of clustered data, such as cross-section data from individual-level surveys conducted at many villages with clustering at the village level. Second, emphasis is placed on using the repeated measures aspect of panel data to estimate key marginal effects that can be interpreted as measuring causation rather than mere correlation. The leading methods assume time-invariant individual-specific effects (or “fixed effects”). Instrumental variables (IV) methods can also be used, with data from periods other than the current year potentially serving as instruments. Third, some analyses use dynamic models rather than static models. Particular interest lies in fitting models with both lagged dependent variables and fixed effects. The paper additionally surveys other panel methods used in econometrics, such as those for nonlinear models and those for dynamic panels with many periods of data.
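
As a rough illustration of the short-panel, fixed-effects setup described in this abstract, the following sketch uses Stata's xt commands. The dataset (nlswork, loaded with webuse) and the choice of regressors are assumptions made for illustration only; they are not taken from the talk.

```stata
* Illustrative only: the dataset and variables are assumptions, not from the talk
webuse nlswork, clear        // example panel shipped by StataCorp: many women, few years each
xtset idcode year            // declare the individual identifier and the time variable
* fixed-effects (within) estimator, standard errors clustered on the individual
xtreg ln_wage tenure ttl_exp, fe vce(cluster idcode)
```

Clustering on idcode reflects the point made in the abstract that a short panel can be treated as data clustered on the individual unit.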

Additional information
cameronwcsug.pdf (slides)

Repeated measures anova: The wide, the long, and the long

Phil Ender
University of California, Los Angeles
This presentation will give an overview of the three main approaches to repeated measures analysis of variance: 1) multivariate models, 2) traditional anova models, and 3) linear mixed models, along with a discussion of the advantages and disadvantages of each. The presentation includes Stata code using manova, anova, regress, and xtmixed. The three approaches are illustrated through the use of a split-plot factorial design with one between-subjects factor and one repeated factor.
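
The "wide" and "long" in the title refer to data layouts. A minimal sketch of moving between them is shown here; the dataset is made up for illustration (four subjects, one between-subjects factor called group, three repeated measurements y1 to y3) and does not come from the talk.

```stata
* Hypothetical data: 4 subjects, between-subjects factor "group",
* repeated factor measured on 3 occasions (y1 y2 y3)
clear
input id group y1 y2 y3
1 1 10 12 15
2 1  9 11 14
3 2 11 15 20
4 2 12 16 22
end
* wide layout: one row per subject (the layout a multivariate analysis works from)
list
* long layout: one row per subject-occasion (the layout anova and xtmixed work from)
reshape long y, i(id) j(time)
list, sepby(id)
```

After the reshape, the repeated factor is held in the single variable time, and y holds all measurements.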

Additional information
repeated_anova.pdf (slides)

Survey data analysis with Stata 10: Accessible and comprehensive

Christine Wells
University of California, Los Angeles
The presentation will discuss Stata’s evolution into a comprehensive survey data analysis package by looking at its past, present, and possible future. Comparisons will be made with other survey data analysis software packages, such as SUDAAN, WesVar, SAS, and SPSS, with respect to both the survey designs that can be analyzed and the types of analyses that can be conducted.

Additional information
Wells_Stata10talk.pdf (slides)

Multilevel modeling of complex survey data

Sophia Rabe-Hesketh
University of California, Berkeley
Survey data are often analyzed using multilevel or hierarchical models. For example, in education surveys, schools may be sampled at the first stage and students at the second stage and multilevel models used to model within-school and between-school variability. An important aspect of most surveys that is often ignored in multilevel modeling is that units at each stage are sampled with unequal probabilities. Standard maximum likelihood estimation can be modified to take the sampling probabilities into account, yielding pseudomaximum likelihood estimation, which is typically combined with robust standard errors based on the sandwich estimator. This approach is implemented in gllamm. I will introduce the ideas, discuss issues that arise such as the scaling of the weights, and illustrate the approach by applying it to data from the Program for International Student Assessment (PISA).

Additional information
stata_sophia.pdf (slides)

Calculating measures of comorbidity using administrative data

Vicki Stagg
University of Calgary
The development of a Stata program to calculate published measures of comorbidity will be of value to researchers working with inpatient discharge data coded in ICD-9-CM or ICD-10. The comorbid command calculates the weighted sum of comorbidities, as well as comorbidity scores based on the Charlson Index, which reflects the cumulative increase in likelihood of 1-year mortality from comorbidities. This allows for the calculation of three different comorbidity measures: ICD-9-CM, Enhanced ICD-9-CM, or ICD-10 (Quan et al. 2005). Exclusion of less severe comorbidities can occur using an optional hierarchical method that excludes from the calculations a mild comorbidity when a patient has also exhibited a more severe form of the same diagnosis. The comparable elixhauser command calculates the sum of this alternate set of comorbidity measures, which may be associated with negative hospital outcomes (Elixhauser et al. 1998). Both Stata algorithms can handle patients or visits as the observational unit. Options allow for a choice of summary output.
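
The idea of a weighted sum with a hierarchical exclusion can be sketched in a few lines. This is not the syntax or the internals of the comorbid command; the indicator variables are hypothetical, only four conditions are shown, and the weights follow the Charlson index as commonly reported, so they should be checked against the published tables before any real use.

```stata
* Hypothetical indicators (1 = condition recorded for the patient)
clear
input id byte(diab diab_comp liver_mild liver_sev)
1 1 0 0 0
2 1 1 0 0
3 0 0 1 1
end
* hierarchical rule: ignore the mild form when the severe form is also present
replace diab       = 0 if diab_comp == 1
replace liver_mild = 0 if liver_sev == 1
* weighted sum over this illustrative subset of conditions
gen charlson_part = 1*diab + 2*diab_comp + 1*liver_mild + 3*liver_sev
list
```

Without the two replace statements, a patient with both forms of a diagnosis would be counted twice, which is what the hierarchical option is meant to prevent.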

Additional information
Stagg_Stata_Presentation_final.ppt (slides)
stagg_notes_final.pdf (presentation notes)

Managing meta-data in Stata

Elliott Lowy
VA Health Services Research and Development
A collection of user-written commands will be presented, which in one way or another facilitate dealing with meta-data—from manipulation and presentation of variable names and types, through labels, notes, and other meta-data fields included with data files, and on to a command for accessing small text databases for interrelated datasets.

Additional information
The repository for the ado-files and packages used in this talk can be found at http://datadata.info/ado.
It is easier and more Stata-like to access the repository by typing
   net from http://datadata.info/ado
in the command window in Stata. This method also allows web access to individual help files.

An algorithm for creating models for imputation using the MICE approach: An application in Stata

Rose Medeiros
University of California, Los Angeles
It is generally advised that imputation models contain as many “predictor” variables as possible, since the greater the number of variables, the greater the amount of information from which to make estimations (van Buuren, Boshuizen, and Knook 1999). Ideally, an imputation model might contain all variables in the dataset. Hence, the default in software packages that perform multivariate imputation by chained equations (e.g., ice in Stata) is often to use all other variables in the imputation model to predict missing values. However, in datasets with moderate to large numbers of variables, attempting to use all other variables in the dataset results in imputation models that are too large to actually run. One solution to this problem is to select a relatively large, but reasonable, number of predictors based on bivariate correlations and then drop predictors as necessary to create a regression model that is tractable using the complete data. This set of regression models forms the imputation model for the entire dataset. This presentation outlines this approach in more detail and presents an overview of the Stata package that implements it.
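
The first step described here, screening candidate predictors by bivariate correlation, might look roughly like the following. This is a sketch and not the package presented in the talk: the dataset (auto), the target variable, the candidate list, and the 0.3 cutoff are all assumptions chosen for illustration.

```stata
sysuse auto, clear
local target rep78                       // variable with missing values to be imputed
local keep                               // will collect the selected predictors
foreach v of varlist price mpg weight length turn displacement {
    quietly correlate `target' `v'       // pairwise correlation on complete cases
    * keep the candidate if the correlation is defined and large enough (cutoff is an assumption)
    if !missing(r(rho)) & abs(r(rho)) > 0.3 {
        local keep `keep' `v'
    }
}
display "candidate predictors for `target': `keep'"
```

The resulting list would still need to be pruned, as the abstract notes, until the regression for each incomplete variable is tractable.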

Additional information
medeiros_mice.pdf (slides)

Modeling multiple source risk factor data and health outcomes in twins

Andy Bogart
Jack Goldberg
University of Washington, Seattle
One challenging feature of some medical research is the existence of multiple sources of exposure information about individual subjects. When an exposure of interest has been measured in a variety of ways or has been reported on by multiple informants, analysts must decide how best to estimate its association with some interesting outcome. Simply performing a multiple regression analysis of the outcome on all the sources together can be problematic, since those reports are likely to be highly correlated. Alternatively, collapsing the reports into one measure invariably implies an unfortunate loss of information and a nagging question as to whether one has done the right thing. Instead, we used Stata 9 to implement a novel application of complex sample survey methods (Pepe, Whitaker, and Seidel 1999; Horton and Fitzmaurice 2004), which allows simultaneous use of multiple reports in a single regression model. We further extended the method to accommodate estimation of within- and between-pair effects in twin research. My presentation will use Vietnam-era veteran twin data to explore the association between military service in Vietnam and posttraumatic stress disorder and address within- and between-pair effects. We will gently explore how to properly reshape data, derive necessary variables, specify models, and implement Stata’s svy commands to apply the method.

Pepe, M. S., R. C. Whitaker, and K. Seidel. 1999. Estimating and comparing univariate associations with application to the prediction of adult obesity. Statistics in Medicine 18: 163–173.

Horton, N. J., and G. M. Fitzmaurice. 2004. Regression analysis of multiple source and multiple informant data from complex survey samples. Statistics in Medicine 23: 2911–2933.

Additional information
Bogart_WCSUG_2007_FINAL.ppt (slides)

Rapid formation of regression tables for research purposes

Roy Wada
University of California, Los Angeles
The ostensible reason for preparing regression tables is to have them submitted to journals for publication purposes. Contrary to this professed view, regression tables are mostly used during research and not after. Journals require regression tables because they allow visual comparisons across regressions. It is difficult to compare specifications without placing them in close proximity, even if it means printing hard copies. Past users of statistical packages have often resorted to printing hundreds of pages and flipping them back and forth. The technology for postestimation display has historically lagged behind the production of estimation itself. A bottleneck existed in the research process when regressions were produced much faster than they could be interpreted. The next logical step in the development of statistical packages is to be able to produce regression tables as fast and as naturally as performing regressions themselves. Regression tables ought to be produced easily, rapidly, and sequentially; they need to be displayed immediately on the computer screen. The usefulness of regression tables is much reduced if postponed until the end of your research. outreg, a program by John Gallup, has been modified and augmented extensively for this purpose. outreg2 will immediately produce and open formatted regression tables in programs associated with LaTeX, Word, or Excel files. seeout will immediately display a regression table in the Stata Data Browser.

Additional information
Rapid_Formation_presentation.pdf (slides)
Rapid_Formation_article.pdf (article)

Syntax coloring, etc.

Elliott Lowy
VA Health Services Research and Development
I will present a sweet syntax coloring using jEdit, a free, open-source, Java-based, cross-platform text editor. The syntax coloring distinguishes commands, variables, macros, simple and compound quoted strings (and unquoted string literals), and different kinds of comments. This includes macros inside of strings, strings in expressions in macro functions, etc. Mata syntax coloring included. On the integration side, added bits allow a line, selection, or separately defined section of code (as well as the whole file) to be run in Stata with a keystroke. Semicolon delimited, and Mata, lines are recognized from context and run correctly. The code can also be run in do, run, or trace modes, as determined by a mode button in jEdit. Multiline commands (i.e., split with triple slashes) are also recognized and run as a whole without the need to select all lines.

Additional information
Find all the plug-ins and information about using jEdit.

Meta-analytical integration of diagnostic accuracy studies in Stata

Ben Dwamena
University of Michigan
This presentation will demonstrate how to perform diagnostic meta-analysis using midas, a user-written command. midas is a comprehensive program of statistical and graphical routines for undertaking meta-analysis of diagnostic test performance in Stata. Primary data synthesis is performed within the bivariate mixed-effects binary regression modeling framework. Model specification, estimation (by adaptive Gaussian quadrature), and prediction are carried out with xtmelogit in Stata release 10 or gllamm (Rabe-Hesketh et al.) in Stata release 9. Using the model estimated coefficients and variance–covariance matrices, midas calculates summary operating sensitivity and specificity (with confidence and prediction contours in SROC space), summary likelihood and odds ratios. Global and relevant test performance metric-specific heterogeneity statistics are also provided. midas facilitates extensive statistical and graphical data synthesis and exploratory analyses of unobserved heterogeneity, covariate effects, publication bias, and subgroup analyses. Bayes’ nomograms, likelihood-ratio matrices, and conditional probability plots may be obtained and used to guide clinical decision making.

Additional information
Dwamena_WCSUG2007.pdf (slides)

Estimating heterogeneous choice models with Stata

Richard Williams
University of Notre Dame
When a binary or ordinal regression model incorrectly assumes that error variances are the same for all cases, the standard errors are wrong and (unlike OLS regression) the parameter estimates are biased. Heterogeneous choice/location-scale models explicitly specify the determinants of heteroskedasticity in an attempt to correct for it. These models are also useful when the variability of underlying attitudes is itself of substantive interest. This paper illustrates how Williams’ user-written command oglm (ordinal generalized linear models) can be used to fit heterogeneous choice and related models. It further shows how two other models that have appeared in the literature—Allison’s (1999) model for comparing logit and probit coefficients across groups, and Hauser and Andrew’s (2006) logistic response model with partial proportionality constraints (LRPPC)—are special cases of the heterogeneous choice model and/or algebraically equivalent to it and can also be fitted with oglm. Other key features of oglm that are illustrated include support for linear constraints, the use of prefix commands such as svy and stepwise, and the computation of predicted probabilities and marginal effects.

Additional information
rw_WCSUG2007.pdf (slides)
rw_WCSUG2007.ppt (slides)
rw_WCSUG2007_Handout.pdf (handout)

Using regular expressions for data management in Stata

Rose Medeiros
University of California, Los Angeles
Regular expressions make a number of data management operations involving string variables much easier. They do this by allowing the user to search for (and copy or replace) complex patterns of characters within a string. Examples of when regular expressions are useful include extracting zip codes from addresses, reformatting dates if they were entered in an inconsistent manner, and removing excess spaces from string expressions. This presentation will give the user a basic introduction to the use of regular expressions, and the Stata functions related to regular expressions, as well as examples of applications where regular expressions can be used to streamline data management.
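
As a small illustration of the zip code example, the sketch below uses Stata's regexm(), regexs(), and regexr() functions. The addresses are made up, and the patterns assume a five-digit zip code at the very end of the string; the slides may well use different patterns.

```stata
* Made-up addresses for illustration
clear
input str50 address
"12 Example Street, Springfield, IL 62701"
"99 Sample Avenue, Riverside, CA 92501"
"no zip code here"
end
* regexm() is 1 when the pattern matches; regexs(1) returns the first parenthesized group
gen str5 zip = regexs(1) if regexm(address, "([0-9][0-9][0-9][0-9][0-9])$")
* regexr() replaces the first match; here it removes the trailing zip code and the space before it
gen str50 street_city = regexr(address, " [0-9][0-9][0-9][0-9][0-9]$", "")
list
```

Rows that do not match keep an empty zip and an unchanged street_city, which makes unmatched records easy to find afterward.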

Additional information
medeiros_reg_ex.pdf (slides)

Teaching with Stata

Alan Acock
Tony Lachenbruch
Oregon State University
Stata is a useful tool to demonstrate statistical concepts to elementary (and advanced) statistics classes. For elementary classes, one of the issues is to avoid making the class one in how to use Stata but keep the focus on learning statistics. We have found a lab to be helpful to teach students how to use Stata. The basic commands need to be demonstrated, and since most students don’t have full Stata documentation, some simple command descriptions are useful. It is also a good idea to use datasets from real life to illustrate the ideas. Some pitfalls can be shown—our greatest goof (that we continue to do) is when using logical commands to create new variables—missing values are always an issue. Some moderately advanced ideas can be introduced into the elementary class. Tony Lachenbruch is experimenting with the permutation and bootstrap commands this year. Alan Acock is trying to find a way to move a college of SPSS and SAS users to Stata by getting students on the Stata bandwagon. Alan Acock is also trying to find which user-written commands should be incorporated in the first-year labs.
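
The missing-value pitfall mentioned here can be shown with the auto dataset that ships with Stata, in which rep78 is missing for a few cars. Stata treats a numeric missing value as larger than any number, so an unguarded comparison silently assigns those cars to the "high" group. The variable names created below are illustrative.

```stata
sysuse auto, clear
* unguarded: cars with missing rep78 satisfy rep78 > 3 and are coded 1
gen byte good_wrong = rep78 > 3
* guarded: cars with missing rep78 stay missing in the new variable
gen byte good_right = rep78 > 3 if !missing(rep78)
* the cross-tabulation shows where the two versions disagree
tabulate good_wrong good_right, missing
```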

Additional information
Teaching_with_Stata_alan.ppt (slides by Alan Acock)
Teaching_with_Stata_Tony.ppt (slides by Tony Lachenbruch)

Graph Editing

Vince Wiggins
We will take a quick tour of the Graph Editor, covering the basic concepts: adding text, lines, and markers; changing the defaults for added objects; changing properties; working quickly by combining the contextual toolbars with the more complete object dialogs; and using the object browser effectively. Leveraging these concepts, we'll discuss how and when to use the grid editor and techniques for combined and by-graphs. Finally, we will look at some tricks and features that aren't apparent at first blush.

Creating self-validating datasets

Bill Rising
One of Stata’s great strengths is its data management abilities. When either assembling, sharing, or using shared datasets, some of the most time-consuming activities are validating the data and writing documentation for the data. Much of this futility could be avoided if datasets were self-contained, i.e., if they could validate themselves. I will show how to achieve this goal within Stata by attaching validation rules to the variables themselves via Stata’s characteristics. I will show a dialog box that makes attaching simple validation rules to variables simple enough that for most rules no Stata expertise is needed, but which also allows arbitrarily complicated validation rules. Along with this I'll demonstrate commands for running error checks, or marking suspicious observations, as well as documenting the validation rules. The validation system is flexible enough that simple checks continue to work even if variable names change or if the data are reshaped, and it is rich enough that validation may depend on other variables in the dataset. Since the validation is at the variable level, the self-validation continues to work if variables are recombined with data from other datasets. With these tools, Stata’s datasets can become truly self-contained.
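
The underlying mechanism, storing a rule in a variable's characteristics and reading it back later, can be sketched with built-in commands alone. This is a simplified illustration and not the tools presented in the talk: the characteristic name valid_rule is hypothetical, and the rule hard-codes the variable name, which the system described above avoids.

```stata
sysuse auto, clear
* attach a rule to the variable; the characteristic name "valid_rule" is hypothetical
char price[valid_rule] "price > 0 & !missing(price)"
* read the rule back into a local macro and check every observation against it
local rule : char price[valid_rule]
assert `rule'
* characteristics are saved with the dataset, so the rule travels with the data
char list price[]
```

Because assert stops with an error when any observation violates the rule, the same three lines can serve as a check in a do-file.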

Additional information
ckvarTalk.beamer.pdf (slides)

Estimating average treatment effects in Stata

Guido Imbens
Harvard University
In this talk, I look at several methods for estimating average effects of a program, treatment, or regime, under unconfoundedness. The setting is one with a binary program. The traditional example in economics is that of a labor market program where some individuals receive training and others do not, and interest is in some measure of the effectiveness of the training. Unconfoundedness, a term coined by Rubin (1990), refers to the case where (nonparametrically) adjusting for differences in a fixed set of covariates removes biases in comparisons between treated and control units, thus allowing for a causal interpretation of those adjusted differences. This is perhaps the most important special case for estimating average treatment effects in practice.

Under the specific assumptions we make in this setting, the population-average treatment effect can be estimated at the standard parametric root-N rate without functional form assumptions. A variety of estimators, at first sight quite different, have been proposed for implementing this. The estimators include regression estimators, propensity score based estimators, and matching estimators. Many of these are used in practice, although rarely is this choice motivated by principled arguments. In practice, the differences between the estimators are relatively minor when applied appropriately, although matching in combination with regression is generally more robust and is probably the recommended choice. More important than the choice of estimator are two other issues. Both involve analyses of the data without the outcome variable. First, one should carefully check the extent of the overlap in covariate distributions between the treatment and control groups. Often there is a need for some trimming based on the covariate values if the original sample is not well balanced. Without this, estimates of average treatment effects can be sensitive to the choice of, and small changes in the implementation of, the estimators. In this part of the analysis, the propensity score plays an important role. Second, it is useful to do some assessment of the appropriateness of the unconfoundedness assumption. Although this assumption is not directly testable, its plausibility can often be assessed using lagged values of the outcome as pseudooutcomes. Another issue is variance estimation. For matching estimators, bootstrapping, although widely used, has been shown to be invalid. I discuss general methods for estimating the conditional variance that do not involve resampling.
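
The overlap check described here uses only the treatment indicator and the covariates, never the outcome. A minimal sketch of the mechanics follows; it is not the procedure from the talk. The auto dataset has no real treatment, so foreign merely stands in for a binary treatment indicator, the covariates are placeholders, and the 0.1 and 0.9 trimming cutoffs are an assumption.

```stata
* Mechanics only: "foreign" stands in for a binary treatment indicator
sysuse auto, clear
logit foreign mpg weight            // propensity-score model; covariates are placeholders
predict double pscore, pr           // estimated probability of "treatment"
* compare the propensity-score distributions of the two groups
tabstat pscore, by(foreign) statistics(n min p25 p50 p75 max)
* one possible trimming rule; the cutoffs are an assumption, not from the talk
keep if inrange(pscore, 0.1, 0.9)
tabulate foreign
```

If the two groups' propensity-score ranges barely overlap, trimming removes many observations, which is itself a useful warning about how much the estimates will rely on extrapolation.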

Additional information
stata_07oct_final.pdf (slides)

Scientific organizers

Colin Cameron, UC Davis

Xiao Chen, UCLA

Phil Ender, UCLA

Estie Hudes, UCSF

Tony Lachenbruch, Oregon State

Bill Mason (cochair), UCLA

Sophia Rabe-Hesketh (cochair), UC Berkeley

Logistics organizers

Chris Farrar, StataCorp

Gretchen Farrar, StataCorp