2008 Fall North American Stata Users Group meeting

*Last updated: 24 November 2008*

The Handlery Union Square Hotel

351 Geary Street

San Francisco, CA 94102

Roberto G. Gutierrez

StataCorp

Stata’s **xtmixed** command can be used to fit mixed models, models
that contain both fixed and random effects. The fixed effects are merely the
coefficients from a standard linear regression. The random effects are not
directly estimated but summarized by their variance components, which are
estimated from the data. As such, **xtmixed** is typically used to
incorporate complex and multilevel random-effects structures into standard
linear regression. **xtmixed**’s syntax is complex but versatile,
allowing it to be widely used, even for situations that do not fit the
classical “mixed” framework. In this talk, I will give a
tutorial on the uses of **xtmixed** not commonly considered, including
examples of heteroskedastic errors, group structures on random effects, and
smoothing via penalized splines.
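As a point of reference, a two-level linear mixed model of the kind described above can be written in generic notation. The symbols below are assumptions chosen for illustration and are not taken from the talk:

```latex
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}
       + \mathbf{z}_{ij}^{\top}\mathbf{u}_{j}
       + \varepsilon_{ij},
\qquad
\mathbf{u}_{j} \sim N(\mathbf{0}, \boldsymbol{\Sigma}),
\qquad
\varepsilon_{ij} \sim N(0, \sigma_{\varepsilon}^{2})
```

Here $\boldsymbol{\beta}$ holds the fixed effects, while the random effects $\mathbf{u}_{j}$ for group $j$ are summarized by the variance components $\boldsymbol{\Sigma}$ and $\sigma_{\varepsilon}^{2}$ rather than estimated as parameters.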

**Additional information**

gutierrez.pdf

Minjeong Jeon

University of California–Berkeley

Sophia Rabe-Hesketh

University of California–Berkeley

We consider multilevel models for longitudinal data where membership in the
highest level units changes over time. The application is a four-year study
of Korean students who are in middle school during the first two waves and
in high school during the second two waves, where middle schools and high
schools are not nested. The model includes crossed random effects for middle
schools and high schools and can be estimated by using Stata’s
**xtmixed** command. An important consideration is how the impact of the
middle school and high school random effects on the response variable should
change over time.

John M. Neuhaus

University of California–San Francisco

Charles E. McCulloch

University of California–San Francisco

Generalized linear mixed models provide effective analyses of clustered and
longitudinal data and typically require the specification of the
distribution of the random effects. The consequences of misspecifying this
distribution are subject to debate; some authors suggest that large biases
can arise, while others show that there will typically be little bias for the
parameters of interest. Using analytic results, simulation studies, and
example data, I summarize the results of extensive assessments
of the bias in parameter estimates due to random-effects distribution
misspecification. I also present assessments of the accuracy of
random-effects predictions under misspecification. These assessments
indicate that random-effects distribution misspecification often produces
little bias when estimating slope coefficients but may yield biased
intercepts and variance-components estimators as well as mildly inaccurate
predicted random effects.

**Additional information**

neuhaus_stata2008.talk.pdf

Sophia Rabe-Hesketh

University of California–Berkeley

Anders Skrondal

Norwegian Institute of Public Health

This presentation focuses on predicted probabilities for multilevel models
for dichotomous or ordinal responses. For instance, in a three-level model
with patients nested in doctors nested in hospitals, predictions for
patients could be for new or existing doctors and, in the latter case, for
new or existing hospitals. In a new version of **gllamm**, these
different types of predicted probabilities can be obtained very easily. We
will give examples of graphs that can be used to help interpret an estimated
model. We will also introduce a program we have written to
construct 95% confidence intervals for predicted probabilities.

**Additional information**

rabe_hesketh_predict5.pdf

Garrett Glasgow

University of California–Santa Barbara

Heterogeneous choice models are extensions of binary and ordinal regression
models that explicitly model the determinants of heteroskedasticity. I show
that, often, moderation (proximity to a choice threshold) will produce
empirical results identical to heteroskedasticity in binary heterogeneous
choice models, while extremity (a preference for endpoint categories) will
produce empirical results identical to heteroskedasticity in ordinal
heterogeneous choice models. I show how a simple extension of
Williams’ user-written **oglm** command can create
ordered heterogeneous choice models that can distinguish between
heteroskedasticity, extremity, and moderation.

**Additional information**

glasgow_stata.ppt

Ben Dwamena

University of Michigan Radiology and VA Nuclear Medicine

Methods for meta-analysis of diagnostic test accuracy studies must, in
addition to unobserved heterogeneity, account for covariate heterogeneity,
threshold effects, methodological quality, and small-study bias, which
constitute the major threats to the validity of meta-analytic results. These
have traditionally been addressed independent of each other. Two recent
methodological advances include 1) the bivariate random-effects model for
joint synthesis of sensitivity and specificity, which accounts for unobserved
heterogeneity and threshold variation using random effects, and covariate and
quality effects as independent variables in a metaregression; and 2) a
linear regression test for funnel plot asymmetry in which the diagnostic
odds ratio as an effect-size measure is regressed on effective sample size as a
precision measure. I propose a generalized framework for diagnostic
meta-analysis that integrates both developments based on a modification of
the bivariate Dale’s model in which two univariate random-effects logistic
models for sensitivity and specificity are associated through a log-linear
model of odds ratios with the effective sample size as an independent
variable. This framework unifies the estimation of the summary test
performance and the assessment of the presence, extent, and sources of
variability. Taking advantage of the ability of **gllamm** to model a
mixture of discrete and continuous outcomes, I will discuss specification,
estimation, diagnostics, and prediction of the model, using a motivating
dataset of 43 studies investigating FDG-PET for staging the axilla in
patients with newly diagnosed breast cancer.

Christine Wells

UCLA

Analyzing survey data is different from analyzing data generated by
experiments in several important ways. I will discuss these differences
using the NHANES III adult dataset as an example. Topics will
include specifying the survey elements, analysis of subpopulations, model
diagnostics, and model comparison.

**Additional information**

wells_stata2008.pdf

Roy Costilla

LLECE/UNESCO Santiago

Stata is a very good tool for analyzing survey data. It accounts for
many important aspects of complex survey design and offers
alternative variance-estimation methods. Through its matrix and macro
languages, it also lets the user store and manage output results
conveniently and automate the entire estimation and testing process.
I will discuss the estimation of the main results of the
Second Regional Comparative and Explanatory Study, an
assessment of the performance in the domains of mathematics, reading, and
science of third- and sixth-grade students in 16 countries of Latin
America in 2005–2006. In particular, I will consider the estimation of the
mean scores and their variability by country, area, grade, and
subpopulation. I will also present the comparisons made to check
for the differences in performance among countries and subpopulations.

**Additional information**

costilla_serce_stata_sfo.pdf

Phil Ender

UCLA

I will present three approaches to understanding 3-way ANOVA interactions:
1) a conceptual approach, 2) an ANOVA approach, and 3) a regression approach
using dummy coding. I will illustrate the three approaches through the use of
a synthetic dataset with a significant 3-way interaction.

**Additional information**

ender_3way_anova.pdf

Noori Akhtar-Danesh

McMaster University

In this presentation, I demonstrate some common challenges with large
datasets in survival analysis. I investigate the relationship between the age
of smoking initiation and some demographic factors in the Canadian Community
Health Survey, Cycle 3.1 (CCHS-3.1) dataset. CCHS-3.1 is a large dataset
that includes information for over 130,000 individuals. I used different
techniques for model fitting and model checking. Test-based techniques for
assessing the proportional-hazards (PH) assumption are not very useful
because, with a dataset this large, even a small deviation from the
theoretical model leads to rejection of the PH assumption. In contrast,
graphical approaches seem to be more helpful.
However, not every diagnostic graph can be drawn because of the large
dataset. Preliminary results show that 63% of Canadians have ever smoked a
whole cigarette. Therefore, it seems more appropriate to use a cure fraction
model (Lambert, 2007, *Stata Journal* 7: 351–375) to handle the large proportion of
censored data. However, sampling weights cannot be used in this model. In
conclusion, survival analysis for large datasets cannot be done easily. Some
challenges include assessing the PH assumption and drawing diagnostic
graphs. In addition, the cure fraction model may not be appropriate if
sampling weights cannot be incorporated in the model estimation.

**Additional information**

akhtar_danesh_stata2008_meeting.ppt

Martin Weiss

University of Tuebingen, Germany

Using Stata, I have researched the market efficiency of the German 6/49
parimutuel lottery game. I investigate the existence of profit opportunities
for particularly unpopular combinations of numbers (Papachristou and
Karamanis [1998]), employing the covariates proposed by Henze and Riedwyl
(1998). Furthermore, I examine the time-series behavior of stakes bet in
relation to the size of the jackpot in the respective draw. In particular, I
attempt to verify the conjecture that the skewness of the payoff
distribution drives bettors' appetite for participation (Golec and Tamarkin
[1998]). Along the way, I show how one can set up Stata to retrieve data
from the Internet, unpack them automatically, and shape them for the
analysis. I also show how one can schedule tasks to automate the process
further.

**References**

Henze, N. and H. Riedwyl. (1998). *How to Win More:
Strategies for Increasing a Lottery Win.* Natick, MA: A K Peters.

Papachristou, G. and D. Karamanis. (1998). Investigating efficiency in betting markets: Evidence from the Greek 6/49 lotto. *Journal of Banking & Finance*
22: 1597–1615.

Golec, J. and M. Tamarkin. (1998). Bettors love skewness, not risk, at the horse track. *Journal of Political Economy*
106: 205–225.

**Additional information**

Presentation_Nov_13_Martin_Weiss.pdf

Yulia Marchenko

StataCorp

In the past decade, many statistical methods have been proposed for the
analysis of case–control genetic data with an emphasis on
haplotype-based disease association studies. Most of the methodology has
concentrated on the estimation of genetic (haplotype) main effects. Most
methods accounted for environmental and gene–environment interaction effects
by utilizing prospective-type analyses that may lead to biased estimates
when used with case–control data. Several recent publications
have addressed the issue of retrospective sampling in the analysis of
case–control genetic data in the presence of environmental factors by
developing new efficient semiparametric statistical methods. I present a
new Stata command, **haplologit**, that implements efficient
profile-likelihood semiparametric methods for fitting gene–environment
models in the very important special cases of 1) a rare disease, 2) a
single candidate gene in Hardy–Weinberg equilibrium, and 3)
the independence of genetic and environmental factors.

**Additional information**

marchenko_SF08.pdf

Christopher F. Baum

Boston College and DIW Berlin

Stata’s matrix language, Mata, highlighted in Bill Gould’s Mata
Matters columns in the *Stata Journal*, is very useful and powerful in
its interactive mode. Stata users who write do-files or ado-files should
gain an understanding of the Stata–Mata interface: how Mata can be
called upon to do one or more tasks and return its results to Stata. Mata's
broad and extensible menu of functions offers assistance with many
programming tasks, including many that are not matrix-oriented. In this
tutorial, I will present examples of how do-file and ado-file writers might
effectively use Mata in their work.

**Additional information**

baum_StataMata.beamer.FNASUG08.pdf

Elliott Lowy

VAPSHCS HSR&D

I will present a selection of user-written Mata functions that serve to
streamline the process of writing other Mata functions, and I will
demonstrate what makes them handy. I will present debugging/programming
functions for the following: dropping and re-creating one or a few functions
without clearing Mata of all other useful info; displaying the contents of
a matrix in a compact and informative way; and copying private function
information into the global space. I will present text-handling functions
for the following: concatenating and dividing blocks of text; processing
lists of file/directory paths; and converting between matrices of text and
ASCII values. I will present more general-purpose functions for the
following: combining matrices of different sizes; reading and writing Mata
matrices to spreadsheet files; generating a map of matching values in two
matrices; and returning an entire (small) matrix of values to Stata locals.
I will finish with a combined Stata/Mata command for storing Stata command
preferences.

**Additional information**

Package available by typing **net from http://datadata.info/ado** within
Stata.

Colin Cameron

University of California–Davis

This talk will be an overview of how to estimate nonlinear regression models
that are not covered by Stata’s many built-in estimation commands.
The Mata **optimize()** function will be emphasized, and the Stata **ml**
command will also be covered. The material is drawn from chapter 11 of
Cameron and Trivedi’s (2009) *Microeconometrics Using Stata*,
Stata Press.

**Additional information**

cameronwcsug2008.pdf

Roy Epstein

Boston College

I will present a Stata program for improved quality control of
econometric models. Reported econometric results often
have unknown reliability because of selective reporting by the researcher.
In particular, *t*-statistics are often uninformative or misleading when
multiple models are estimated from the same dataset. Econometric best
practices should include routine stress tests to assess the robustness of
estimation results to reasonable perturbations of the model specification
and underlying data. It is feasible to implement these tests as standard
outputs from the statistical software. This information should lead to
greater transparency and greater ability of others to interpret a given
regression. The Stata program I will discuss can be used after commands
that perform cross-section, time-series, and panel regression. It is easily
extensible to include additional tests as desired.

**Additional information**

epstein_stata_november_2008.ppt

Elliott Lowy

VAPSHCS HSR&D

While Stata, of course, comes with a serviceable set of I/O commands,
I have found room for improvement. I will present a set of user-written
commands for using, saving, appending, and merging. Highlights include
wildcards in file paths, drastically reducing the amount that needs to be
typed; options to change the working directory to match the file specified;
quick reloading of the current analysis file; saving partial datasets;
using/appending sets of multiple data files; transparent use of Stat/Transfer
within all commands to use, save, append, and merge from and/or to other
formats such as SAS and Excel; maintaining a “recent file” list
through the command interface; and eliminating the irritating irregular need
for quotes.

The **merge** command, in particular, has an even larger set of
advantages, which together with the above advantages means never having to
open and fiddle with a file before merging it. These advantages include
merging on disparately named variables; automatic conversion of
string/numeric variables; case-insensitive merging; renaming variables added
to the current data; automatic tabulation of the (labeled) _merge variable
or summarized merge information with automatic deletion of the _merge
variable; automatic deletion of matched or unmatched records; merging with a
single record from a multiply-matching merge file; and true many-to-many
merging.

**Additional information**

Package available by typing **net from http://datadata.info/ado** within
Stata.

Joseph Schafer

Penn State

Joseph Kang

Penn State

Literature on causal inference has emphasized the average causal effect,
defined as the mean difference in potential outcomes under different
treatment conditions. We consider marginal regression models that describe
how causal effects vary in relation to covariates. To estimate parameters,
we replace missing potential outcomes in estimating functions with fitted
values from imputation models that include confounders and prognostic
variables as predictors. When the imputation and analytic models are
linear, our procedure is equivalent to maximum likelihood for normally
distributed outcomes and covariates. Robustness to misspecification of the
imputation models is enhanced by including functions of propensity scores as
regressors. In simulations where the analytic, imputation, and propensity
models are misspecified, the method performs better than inverse-propensity
weighting. Using data from the National Longitudinal Study of Adolescent
Health, we analyze the effects of dieting on emotional distress in the
population of girls who diet, taking into account the study's complex sample
design.

Rose Medeiros

UCLA

Through the use of user-written programs, primarily **mim** (Carlin,
Galati, and Royston, 2008, *Stata Journal* 8: 49–67), Stata users can analyze multiply imputed (MI)
datasets. Among other capabilities, **mim** allows the user to estimate
a range of regression models and to perform a multiparameter hypothesis test
after model estimation using a Wald test. The program presented here allows
the user to perform likelihood-ratio tests after fitting models with **mim** on
MI datasets. This provides an additional means of testing nested models
after estimation using MI data. The process used to perform the
likelihood-ratio tests is described in Meng and Rubin (1992, *Biometrika* 79: 103–111). The test
statistic is calculated based on two sets of likelihood-ratio tests. The
first involves calculating the likelihood ratio for the null versus
the alternative hypothesis in each of the MI datasets. The second involves
calculating the likelihood for the null and the alternative hypotheses in each
of the MI datasets, constraining the parameters to be the estimates based on
combining coefficient estimates from the MI datasets (i.e., the average of
the parameter estimates across the MI datasets). The current version allows
testing for a limited number of regression commands (i.e., **regress**,
**logit**, and **ologit**), but subsequent versions may include
compatibility with additional commands.
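The combining step described above can be sketched in a few lines. This is a minimal illustration in Python, not the program presented in the talk: the function name `meng_rubin_lr` and its arguments are hypothetical, and the formulas follow the Meng and Rubin statistic as it is commonly stated in the multiple-imputation literature, so they should be checked against the original paper before being relied on. The inputs are the two sets of per-dataset likelihood-ratio statistics, which must already have been computed by the fitting software.

```python
from statistics import mean


def meng_rubin_lr(d_own, d_pooled, k):
    """Combine likelihood-ratio (LR) statistics across m imputed datasets.

    d_own    -- LR statistic in each dataset, evaluated at that dataset's
                own null and alternative estimates
    d_pooled -- LR statistic in each dataset, with parameters constrained
                to the estimates averaged across datasets
    k        -- number of parameters being tested

    Returns (D, r, w): the test statistic, the estimated relative increase
    in variance due to missingness, and the denominator degrees of freedom.
    D is referred to an F distribution with k and w degrees of freedom.
    """
    m = len(d_own)
    if m < 2 or len(d_pooled) != m:
        raise ValueError("need matching lists from at least two imputations")

    d_bar = mean(d_own)       # average of the first set of LR statistics
    d_tilde = mean(d_pooled)  # average of the second set

    # Guard added for this sketch: truncate a negative estimate at zero.
    r = max(0.0, (m + 1) / (k * (m - 1)) * (d_bar - d_tilde))
    D = d_tilde / (k * (1 + r))

    t = k * (m - 1)
    if r == 0:
        w = float("inf")      # no between-imputation inflation detected
    elif t > 4:
        w = 4 + (t - 4) * (1 + (1 - 2 / t) / r) ** 2
    else:
        w = t * (1 + 1 / k) * (1 + 1 / r) ** 2 / 2
    return D, r, w
```

For example, with five imputations, `meng_rubin_lr([10, 12, 11, 9, 13], [8, 9, 10, 7, 11], k=2)` gives D = 1.8 with r = 1.5, to be compared with an F distribution on 2 and 13 degrees of freedom. The p-value is left out to keep the sketch free of third-party dependencies.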

**Additional information**

medeiros_2008.pdf

Xiao Chen

UCLA

The Statistical Consulting Group provides a variety of resources to Stata
users on campus, from walk-in and email consulting to an extensive website
of Stata-related materials. In this presentation, I will explain how the group
offers such services and will discuss the three major components of the
consulting process: consulting, learning, and documenting. I will also
discuss the benefits and challenges involved in sharing contributed Stata
packages with clients, and the role the Internet has played in shaping the
collaboration aspect of our consulting model.

Xiao Chen (cochair), UCLA

Sophia Rabe-Hesketh (cochair), UC Berkeley

Phil Ender, UCLA

Estie Hudes, UCSF

Tony Lachenbruch, Oregon State

Bill Mason, UCLA

Doug Steigerwald, UC Santa Barbara

Chris Farrar, StataCorp

Gretchen Farrar, StataCorp