Charles McCulloch

University of California, San Francisco

Statistical models that include random effects are commonly used to analyze
longitudinal and clustered data. These models are often used to derive
predicted values of the random effects, for example in predicting which
physicians or hospitals are performing exceptionally well or exceptionally
poorly. I start this talk with a brief introduction and several examples of
the use of prediction of random effects in practice. In typical
applications, the data analyst specifies a parametric distribution for the
random effects (often Gaussian) although there is little information
available to guide this choice. Are predictions sensitive to this
specification? Through theory, simulations, and an example illustrating the
prediction of who is likely to go on to develop high blood pressure, I show
that misspecification can have a moderate impact on predictions of random
effects and describe simple ways to diagnose such sensitivity.

**Additional information**

West_Coast_Stata_2007_talk_predict_random_effects.pdf (slides)

Colin Cameron

University of California, Davis

This presentation provides an overview of the subset of methods for panel data and
the associated Stata **xt** commands most commonly used by
microeconometricians. First, attention is focused on a short panel, meaning
data on many individual units and few time periods. Examples include
longitudinal surveys of many individuals and panel datasets on many firms.
Then the data can be viewed as being clustered on the individual unit and
panel methods used are also applicable to other forms of clustered data such
as cross-section data from individual-level surveys conducted at many
villages with clustering at the village level. Second, emphasis is placed on
using the repeated measures aspect of panel data to estimate key marginal
effects that can be interpreted as measuring causation rather than mere
correlation. The leading methods assume time-invariant individual-specific
effects (or “fixed effects”). Instrumental variables (IV) methods
can also be used, with data from periods other than the current year
potentially serving as instruments. Third, some analyses use dynamic models
rather than static models. Particular interest lies in fitting models
with both lagged dependent variables and fixed effects. The paper
additionally surveys other panel methods used in econometrics, such as those
for nonlinear models and those for dynamic panels with many periods of data.
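As a rough illustration of the short-panel setup described above, the following sketch simulates a panel with a time-invariant individual effect and compares pooled OLS with the within (fixed-effects) estimator. The data and all variable names (`id`, `year`, `alpha`, `x`, `y`) are invented for this example and are not taken from the presentation; it assumes a Stata release that provides `rnormal()` and accepts `vce(cluster)` with **xtreg, fe**.

```stata
* Simulated short panel: 200 units, 5 periods each (all names hypothetical)
clear
set seed 12345
set obs 200
generate id = _n
* Time-invariant individual-specific effect
generate alpha = rnormal()
expand 5
bysort id: generate year = _n
* Regressor correlated with the individual effect
generate x = 0.5*alpha + rnormal()
generate y = 1 + 2*x + alpha + rnormal()

xtset id year

* Pooled OLS ignores alpha, so the slope on x is pushed away from 2
regress y x, vce(cluster id)

* The within estimator removes alpha; standard errors clustered on the unit
xtreg y x, fe vce(cluster id)

* In this simulation the fixed-effects slope should land near the true value 2
assert abs(_b[x] - 2) < 0.2
```

Clustering on `id` reflects the point made in the abstract that short-panel data can be viewed as clustered on the individual unit.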

**Additional information**

cameronwcsug.pdf (slides)

Phil Ender

University of California, Los Angeles

This presentation will give an overview of the three main approaches to
repeated-measures analysis of variance: 1) multivariate models, 2)
traditional anova models, and 3) linear mixed models, along with a discussion
of the advantages and disadvantages of each. The presentation includes Stata
code using **manova**, **anova**, **regress**, and **xtmixed**. The
three approaches are illustrated through the use of a split-plot factorial
design with one between-subjects factor and one repeated factor.

**Additional information**

repeated_anova.pdf (slides)

Christine Wells

University of California, Los Angeles

The presentation will discuss Stata’s evolution into a comprehensive
survey data analysis package by looking at its past, present, and possible
future. Comparisons will be made with other survey data analysis software
packages, such as SUDAAN, WesVar, SAS, and SPSS, with respect to both survey
designs that can be analyzed as well as the types of analyses that can be
conducted.

**Additional information**

Wells_Stata10talk.pdf (slides)

Sophia Rabe-Hesketh

University of California, Berkeley

Survey data are often analyzed using multilevel or hierarchical models. For
example, in education surveys, schools may be sampled at the first stage and
students at the second stage and multilevel models used to model
within-school and between-school variability. An important aspect of most
surveys that is often ignored in multilevel modeling is that units at
each stage are sampled with unequal probabilities. Standard maximum
likelihood estimation can be modified to take the sampling probabilities
into account, yielding pseudomaximum likelihood estimation, which is typically
combined with robust standard errors based on the sandwich estimator. This
approach is implemented in **gllamm**. I will introduce the ideas, discuss
issues that arise such as the scaling of the weights, and illustrate the
approach by applying it to data from the Program for International Student
Assessment (PISA).

**Additional information**

stata_sophia.pdf (slides)

Vicki Stagg

University of Calgary

The development of a Stata program to calculate published measures of
comorbidity will be of value to researchers working with inpatient discharge
data coded in ICD-9-CM or ICD-10. The **comorbid** command
calculates the weighted sum of comorbidities, as well as comorbidity scores
based on the Charlson Index, which reflects the cumulative increase in
likelihood of 1-year mortality from comorbidities. This allows for
the calculation of three different comorbidity measures: ICD-9-CM, Enhanced
ICD-9-CM, or ICD-10 (Quan et al. 2005). Exclusion of less severe
comorbidities can occur using an optional hierarchical method that excludes
from the calculations a mild comorbidity when a patient has also exhibited a
more severe form of the same diagnosis. The comparable
**elixhauser** command calculates the sum of this alternate set of
comorbidity measures, which may be associated with negative hospital outcomes
(Elixhauser et al. 1998). Both Stata algorithms can handle patients or
visits as the observational unit. Options allow for a choice of summary
output.

**Additional information**

Stagg_Stata_Presentation_final.ppt (slides)

stagg_notes_final.pdf (presentation notes)

Elliott Lowy

VA Health Services Research and Development

A collection of user-written commands will be presented, which in one way or
another facilitate dealing with meta-data—from manipulation and
presentation of variable names and types, through labels, notes, and other
meta-data fields included with data files, and on to a command for accessing
small text databases for interrelated datasets.

**Additional information**

The repository for the ado-files and packages used in this talk can be found at http://datadata.info/ado.

It is easier and more Stata-like to access the repository by typing

**net from http://datadata.info/ado**

in the command window in Stata. This method also allows web access to individual help files.

Rose Medeiros

University of California, Los Angeles

It is generally advised that imputation models contain as many
“predictor” variables as possible, since the greater the number
of variables the greater the amount of information from which to make
estimations (van Buuren, Boshuizen, and Knook 1999). Ideally, an imputation
model might contain all variables in the dataset. Hence, the default in
software packages that perform multivariate imputation by chained equations
(e.g., **ice** in Stata) is often to use all other variables in the imputation
model to predict missing values. However, in datasets with moderate to
large numbers of variables, attempting to use all other variables in the
dataset results in imputation models that are too large to actually run.
One solution to this problem is to select a relatively large, but
reasonable, number of predictors based on bivariate correlations and then
drop predictors as necessary to create a regression model that is tractable
using the complete data. This set of regression models forms the imputation
model for the entire dataset. This presentation outlines this approach in
more detail and presents an overview of the Stata package that implements
it.

**Additional information**

medeiros_mice.pdf (slides)

Andy Bogart

Jack Goldberg

University of Washington, Seattle

One challenging feature of some medical research is the existence of
multiple sources of exposure information about individual subjects. When an
exposure of interest has been measured in a variety of ways or has been
reported on by multiple informants, analysts must decide how best to
estimate its association with some interesting outcome. Simply performing a
multiple regression analysis of the outcome on all the sources together can
be problematic, since those reports are likely to be highly correlated.
Alternatively, collapsing the reports into one measure invariably
implies an unfortunate loss of information and a nagging question as to
whether one has done the right thing. Instead, we used Stata 9 to implement
a novel application of complex sample survey methods (Pepe, Whitaker, and
Seidel 1999; Horton and Fitzmaurice 2004), which allows simultaneous use of
multiple reports in a single regression model. We further extended the
method to accommodate estimation of within- and between-pair effects in twin
research. My presentation will use Vietnam-era veteran twin data to explore
the association between military service in Vietnam and posttraumatic
stress disorder and to address within- and between-pair effects. We will
gently explore how to properly reshape data, derive necessary variables,
specify models, and implement Stata’s **svy** commands to apply the
method.

**References:**

Pepe, M. S., R. C. Whitaker, and K. Seidel. 1999. Estimating and comparing univariate associations with application to the prediction of adult obesity. *Statistics in Medicine* 18: 163–173.

Horton, N. J., and G. M. Fitzmaurice. 2004. Regression analysis of multiple source and multiple informant data from complex survey samples. *Statistics in Medicine* 23: 2911–2933.

**Additional information**

Bogart_WCSUG_2007_FINAL.ppt (slides)

Roy Wada

University of California, Los Angeles

The ostensible reason for preparing regression tables is to submit them
to journals for publication. Contrary to this professed
view, regression tables are mostly used during research, not after it.
Journals require regression tables because they allow visual comparisons
across regressions. It is difficult to compare specifications without
placing them in close proximity, even if it means printing hardcopies.
Past users of statistical packages have often resorted to printing hundreds
of pages and flipping them back and forth. The technology for postestimation
display has historically lagged behind the production of estimation itself.
A bottleneck existed in the research process when regressions were produced
much faster than they could be interpreted. The next logical step in the
development of statistical packages is to be able to produce regression
tables as fast and as naturally as performing regressions themselves.
Regression tables ought to be produced easily, rapidly, and sequentially;
they need to be displayed immediately on the computer screen. The
usefulness of regression tables is much reduced if postponed until the end
of your research. **outreg**, a program by John Gallup, has been modified
and augmented extensively for this purpose. **outreg2** will immediately
produce and open formatted regression tables in programs associated with
LaTeX, Word, or Excel files. **seeout** will immediately display a
regression table in the Stata Data Browser.

**Additional information**

Rapid_Formation_presentation.pdf (slides)

Rapid_Formation_article.pdf (article)

Elliott Lowy

VA Health Services Research and Development

I will present a sweet syntax coloring using jEdit, a free, open-source,
Java-based, cross-platform text editor. The syntax coloring
distinguishes commands, variables, macros,
simple and compound quoted strings (and unquoted string literals), and
different kinds of comments. This includes macros inside of strings, strings
in expressions in macro functions, etc. Mata syntax coloring is included. On
the integration side, added bits allow a line, selection, or separately
defined section of code (as well as the whole file) to be run in Stata with
a keystroke. Semicolon-delimited lines and Mata lines are recognized from
context and run correctly. The code can also be run in **do**,
**run**, or **trace** modes, as determined by a mode button in jEdit.
Multiline commands (i.e., split with triple slashes) are also recognized and
run as a whole without the need to select all lines.

**Additional information**

Find all the plug-ins and information about using jEdit.

Ben Dwamena

University of Michigan

This presentation will demonstrate how to perform diagnostic meta-analysis
using **midas**, a user-written command. **midas** is a comprehensive
program of statistical and graphical routines for undertaking meta-analysis
of diagnostic test performance in Stata. Primary data synthesis is
performed within the bivariate mixed-effects binary regression modeling
framework. Model specification, estimation (by adaptive Gaussian
quadrature), and prediction are carried out with **xtmelogit** in Stata
release 10 or **gllamm** (Rabe-Hesketh et al.) in Stata release 9. Using
the model estimated coefficients and variance–covariance matrices,
**midas** calculates summary operating sensitivity and specificity (with
confidence and prediction contours in SROC space), summary likelihood and
odds ratios. Global and relevant test performance metric-specific
heterogeneity statistics are also provided. **midas** facilitates
extensive statistical and graphical data synthesis and exploratory analyses
of unobserved heterogeneity, covariate effects, publication bias, and
subgroup analyses. Bayes’ nomograms, likelihood-ratio matrices, and
conditional probability plots may be obtained and used to guide clinical
decision making.

**Additional information**

Dwamena_WCSUG2007.pdf (slides)

Richard Williams

University of Notre Dame

When a binary or ordinal regression model incorrectly assumes that error
variances are the same for all cases, the standard errors are wrong and
(unlike OLS regression) the parameter estimates are biased. Heterogeneous
choice/location-scale models explicitly specify the determinants of
heteroskedasticity in an attempt to correct for it. These models are also
useful when the variability of underlying attitudes is itself of substantive
interest. This paper illustrates how Williams’ user-written command
**oglm** (ordinal generalized linear models) can be used to fit
heterogeneous choice and related models. It further shows how two other
models that have appeared in the literature—Allison’s (1999)
model for comparing logit and probit coefficients across groups, and Hauser
and Andrew’s (2006) logistic response model with partial
proportionality constraints (LRPPC)—are special cases of the
heterogeneous choice model and/or algebraically equivalent to it and can
also be fitted with **oglm**. Other key features of **oglm** that are
illustrated include support for linear constraints, the use of prefix
commands such as **svy** and **stepwise**, and the computation of
predicted probabilities and marginal effects.

**Additional information**

rw_WCSUG2007.pdf (slides)

rw_WCSUG2007.ppt (slides)

rw_WCSUG2007_Handout.pdf (handout)

Rose Medeiros

University of California, Los Angeles

Regular expressions make a number of data management operations involving
string variables much easier. They do this by allowing the user to search
for (and copy or replace) complex patterns of characters within a string.
Examples of when regular expressions are useful include extracting zip codes
from addresses, reformatting dates if they were entered in an inconsistent
manner, and removing excess spaces from string expressions. This
presentation will give the user a basic introduction to the use of regular
expressions, and the Stata functions related to regular expressions, as well
as examples of applications where regular expressions can be used to
streamline data management.
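To make the zip-code case concrete, here is a minimal sketch using Stata's `regexm()` and `regexs()` functions. The addresses and the variable names `address` and `zip` are made up for illustration, and the pattern is only one plausible way to match a five-digit code (optionally followed by a four-digit extension) at the end of a string; it is not taken from the presentation.

```stata
* Hypothetical addresses; the last one has no zip code
clear
input str40 address
"123 Main St, Davis, CA 95616"
"4 Elm Ave, Los Angeles, CA 90095-1563"
"no zip here"
end

* regexm() returns 1 when the pattern matches; regexs(1) then returns
* the text matched by the first parenthesized subexpression
generate str5 zip = regexs(1) if ///
    regexm(address, "([0-9][0-9][0-9][0-9][0-9])(-[0-9][0-9][0-9][0-9])?$")

list

* Check the extracted values against the input above
assert zip == "95616" in 1
assert zip == "90095" in 2
assert zip == "" in 3
```

Anchoring the pattern with `$` keeps a five-digit street number at the start of an address from being mistaken for the zip code.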

**Additional information**

medeiros_reg_ex.pdf (slides)

Alan Acock

Tony Lachenbruch

Oregon State University

Stata is a useful tool to demonstrate statistical concepts to elementary
(and advanced) statistics classes. For elementary classes, one challenge is
to keep the focus on learning statistics rather than on learning how to use
Stata. We have found a lab to be helpful to teach
students how to use Stata. The basic commands need to be demonstrated, and
since most students don’t have full Stata documentation, some simple
command descriptions are useful. It is also a good idea to use datasets
from real life to illustrate the ideas. Some pitfalls can be
shown: our greatest goof (one that we continue to make) comes when using logical
expressions to create new variables, where missing values are always an issue.
Some moderately advanced ideas can be introduced into the elementary class.
Tony Lachenbruch is experimenting with the permutation and bootstrap
commands this year. Alan Acock is trying to find a way to move a college of
SPSS and SAS users to Stata by getting students on the Stata bandwagon. Alan
Acock is also trying to find which user-written commands should be
incorporated in the first-year labs.
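The missing-value pitfall mentioned above can be shown to a class in a few lines. Stata treats a numeric missing value as larger than any number, so a bare comparison silently codes missing observations as 1. The tiny dataset and the variable names here are invented for illustration.

```stata
* Three hypothetical observations, one with age missing
clear
input age
30
70
.
end

* Pitfall: missing counts as larger than 65, so it is coded 1
generate old_wrong = age > 65

* Safer: leave the new variable missing when age is missing
generate old_right = age > 65 if !missing(age)

list

assert old_wrong == 1 in 3
assert missing(old_right) in 3
```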

**Additional information**

Teaching_with_Stata_alan.ppt (slides by Alan Acock)

Teaching_with_Stata_Tony.ppt (slides by Tony Lachenbruch)

Vince Wiggins

StataCorp

We will take a quick tour of the Graph Editor, covering the basic concepts:
adding text, lines, and markers; changing the defaults for added objects;
changing properties; working quickly by combining the contextual toolbars
with the more complete object dialogs; and using the object browser
effectively. Leveraging these concepts, we'll discuss how and when to use
the grid editor and techniques for combined and by-graphs. Finally, we will
look at some tricks and features that aren't apparent at first blush.

Bill Rising

StataCorp

One of Stata’s great strengths is its data management abilities. When either
assembling, sharing, or using shared datasets, some of the most
time-consuming activities are validating the data and writing documentation
for the data. Much of this futility could be avoided if datasets were
self-contained, i.e., if they could validate themselves. I will show how to
achieve this goal within Stata by attaching validation rules to the
variables themselves via Stata’s characteristics. I will show a
dialog box that makes attaching simple validation rules to variables simple
enough that for most rules no Stata expertise is needed, but which also
allows arbitrarily complicated validation rules. Along with this I'll
demonstrate commands for running error checks, or marking suspicious
observations, as well as documenting the validation rules. The validation
system is flexible enough that simple checks continue to work even if
variable names change or if the data are reshaped, and it is rich enough
that validation may depend on other variables in the dataset. Since the
validation is at the variable level, the self-validation continues to work
if variables are recombined with data from other datasets. With these tools,
Stata’s datasets can become truly self-contained.
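The general idea of storing a rule in a variable's characteristics can be sketched with built-in commands alone. This is not the validation system presented in the talk: the characteristic name `rule`, the variable `suspect`, and the range check are all hypothetical choices made for this example.

```stata
* Two hypothetical observations, one with an implausible age
clear
input age
25
140
end

* Attach a validation rule to the variable as a characteristic
char age[rule] "inrange(age, 0, 120)"

* Later, possibly in another session: read the rule back and apply it
local rule : char age[rule]
generate byte suspect = !(`rule')

list

assert suspect == 0 in 1
assert suspect == 1 in 2
```

Because characteristics are saved with the dataset, a rule stored this way travels with the data file rather than living in a separate do-file.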

**Additional information**

ckvarTalk.beamer.pdf (slides)

Guido Imbens

Harvard University

In this talk, I look at several methods for estimating average effects of a
program, treatment, or regime, under unconfoundedness. The setting is one
with a binary program. The traditional example in economics is that of a
labor market program where some individuals receive training and others do
not, and interest is in some measure of the effectiveness of the training.
Unconfoundedness, a term coined by Rubin (1990), refers to the case where
(nonparametrically) adjusting for differences in a fixed set of covariates
removes biases in comparisons between treated and control units, thus
allowing for a causal interpretation of those adjusted differences. This is
perhaps the most important special case for estimating average treatment
effects in practice.

Under the specific assumptions we make in this setting, the population-average treatment effect can be estimated at the standard parametric root-N rate without functional form assumptions. A variety of estimators, at first sight quite different, have been proposed for implementing this. The estimators include regression estimators, propensity-score-based estimators, and matching estimators. Many of these are used in practice, although rarely is this choice motivated by principled arguments. In practice, the differences between the estimators are relatively minor when applied appropriately, although matching in combination with regression is generally more robust and is probably the recommended choice.

More important than the choice of estimator are two other issues. Both involve analyses of the data without the outcome variable. First, one should carefully check the extent of the overlap in covariate distributions between the treatment and control groups. Often there is a need for some trimming based on the covariate values if the original sample is not well balanced. Without this, estimates of average treatment effects can be sensitive to the choice of, and small changes in the implementation of, the estimators. In this part of the analysis, the propensity score plays an important role. Second, it is useful to do some assessment of the appropriateness of the unconfoundedness assumption. Although this assumption is not directly testable, its plausibility can often be assessed using lagged values of the outcome as pseudooutcomes.

Another issue is variance estimation. For matching estimators, bootstrapping, although widely used, has been shown to be invalid. I discuss general methods for estimating the conditional variance that do not involve resampling.
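The overlap check described above, which uses the covariates and treatment indicator but not the outcome, might be sketched as follows. The simulated data, the variable names (`x`, `treat`, `ps`), and the trimming cutoffs of 0.1 and 0.9 are assumptions made for this example; the abstract does not specify a particular trimming rule.

```stata
* Simulated data: one covariate that strongly predicts treatment
clear
set seed 2007
set obs 1000
generate x = rnormal()
generate treat = runiform() < invlogit(1.5*x)

* Estimate the propensity score from the covariates only
logit treat x
predict ps, pr

* Compare the score distributions in the treated and control groups
tabstat ps, by(treat) statistics(min p5 p50 p95 max)

* Trim to a region with overlap (the cutoffs are assumed, not prescribed)
keep if inrange(ps, 0.1, 0.9)

* After trimming, every remaining score lies inside the chosen range
assert inrange(ps, 0.1, 0.9)
```

Treatment effects would then be estimated on the trimmed sample with whichever estimator is chosen.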

**Additional information**

stata_07oct_final.pdf (slides)
