Abstracts
Day 1: Users Group Meeting
Analysis of multiple source/multiple informant data in Stata
Nicholas Horton
Department of Mathematics, Smith College
Coauthor: Garrett Fitzmaurice, Harvard University
Abstract
We describe regression-based methods for analyzing multiple-source
data arising from complex sample survey designs in Stata. We use the term
multiple-source data to encompass all cases where data are simultaneously
obtained from multiple informants, or raters (e.g., self-reports, family
members, health care providers, administrators), or via different/parallel
instruments, indicators, or methods (e.g., symptom rating scales, standardized
diagnostic interviews, or clinical diagnoses). We review regression models
for analyzing multiple-source risk factors or multiple-source outcomes and
show that they can be considered special cases of generalized linear models,
albeit with correlated outcomes. We show how these methods can be extended to
handle the common survey features of stratification, clustering, and sampling
weights, as well as missing reports, and how they can be fitted within Stata.
The methods are illustrated using data from the Stirling County Study, a
longitudinal community study of psychopathology and mortality.
Additional information
Horton.ppt (PowerPoint presentation)
sim-tutorial.pdf (documentation/PDF)
gologit2: Generalized logistic regression models for ordinal dependent variables
Richard Williams
Sociology Department, University of Notre Dame
Abstract
gologit2 is a user-written program that fits
generalized logistic regression models for ordinal dependent variables. The
actual values taken on by the dependent variable are irrelevant, except that
larger values are assumed to correspond to "higher" outcomes. A major
strength of gologit2 is that it can also fit two special cases
of the generalized model: the proportional odds model and the partial
proportional odds model. Hence, gologit2 can fit models that
are less restrictive than the proportional odds/parallel lines models
fitted by ologit (whose assumptions are often violated) but are more
parsimonious and interpretable than those fitted by a nonordinal method,
such as multinomial logistic regression. The autofit option greatly
simplifies the process of identifying partial proportional odds models that
fit the data. Two alternative but equivalent parameterizations of the model
that have appeared in the literature are both supported. Other key
advantages of gologit2 include support for linear constraints, Stata
8.2 survey data (svy) estimation, and the computation of estimated
probabilities via the predict command. gologit2 was inspired by
Vincent Fu's gologit program and is backward compatible with it but
offers several additional powerful options.
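To make the model concrete: in the generalized ordered logit model, each cutpoint has its own coefficients, P(Y > j | x) = invlogit(alpha_j + x*beta_j), and category probabilities are successive differences of these cumulative terms; when all beta_j are equal, the model reduces to proportional odds. A minimal numeric sketch in Python (illustrative only; the talk concerns the Stata command, and these function names are not part of gologit2):

```python
import math

def invlogit(z):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(x, alphas, betas):
    """Category probabilities for a generalized ordered logit model with
    J cutpoints (J + 1 categories): P(Y > j | x) = invlogit(a_j + x*b_j).
    Note: when the betas differ, some x can yield negative probabilities,
    a known caveat of the generalized model."""
    exceed = [invlogit(a + x * b) for a, b in zip(alphas, betas)]
    probs = [1.0 - exceed[0]]
    probs += [exceed[j] - exceed[j + 1] for j in range(len(exceed) - 1)]
    probs.append(exceed[-1])
    return probs
```

With equal betas (the proportional odds special case), the probabilities are guaranteed nonnegative and sum to one.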
Additional information
gologit2.pdf (PDF)
Williams_NASUG.pdf (PDF)
Williams_NASUG_handout.pdf (PDF)
L-statistics, especially L-moments, for fun and profit
Nicholas J. Cox
Geography Department, Durham University
Abstract
L-statistics are weighted combinations of order statistics. They have a
long history under that or other names and are simple to analyze, easy to
compute, and helpful in many applied problems. I give examples of their use
for numerical summary, graphical representation, and distribution fitting.
I discuss some Stata implementations of methods for quantile estimation
developed by Harrell and Davis, Kaigh and Lachenbruch, and others, and for
the method of L-moments formalized by Hosking. In particular, L-moments are
more resistant to outliers, and for higher moments less biased, than
classical estimators, yet they are in many ways less ad hoc than various
alternatives based on order statistics. They are also helpful for fitting
various distributions to data, particularly when maximum likelihood is
impractical. Datasets used for illustration are principally environmental.
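To fix ideas, the first two sample L-moments can be computed directly from the order statistics; this Python sketch is purely illustrative and is not the Stata implementation discussed in the talk:

```python
def sample_l_moments(data):
    """First two sample L-moments. l1 is the mean; l2 is half the mean
    absolute difference between two randomly drawn observations,
    estimated from the order statistics x_(1) <= ... <= x_(n) as
    sum_i (2i - 1 - n) x_(i) / (n (n - 1))."""
    x = sorted(data)
    n = len(x)
    l1 = sum(x) / n
    l2 = sum((2 * (i + 1) - 1 - n) * xi
             for i, xi in enumerate(x)) / (n * (n - 1))
    return l1, l2
```

Because l2 is a linear combination of order statistics rather than squared deviations, a single outlier shifts it far less than it shifts the classical standard deviation.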
Integrating Stata with database management systems
Ed Bassin
ProfSoft, Inc.
Abstract
Stata offers users a wide array of powerful data-management functions that
allow them to manipulate their data with flexibility and ease. Relational
database management systems (RDBMS),
such as Oracle and MySQL, also offer many useful capabilities for storing
and retrieving data, particularly when working with large datasets or
confidential information. In many cases, indexing saves a lot of time when
retrieving just a portion of the dataset. Database joins can provide a
much more efficient means for storing large datasets while also allowing
users to combine datasets with greater flexibility than Stata's
merge command. Furthermore, databases generally have password
protection, allowing users of sensitive data, e.g., medical information, to
store data securely. Stata 8's ODBC functionality makes it possible to
have the best of both worlds. In this presentation, we show how Stata can
work with MySQL. We highlight the performance gains that can be derived by
retrieving data with indexed queries rather than "use if" commands. We
also compare two different methods for loading Stata files into MySQL, the
native MySQL command and Stata's ODBC command. Our presentation concludes
with suggestions about how Stata and MySQL could be more tightly integrated
in the future with substantial benefits to Stata users.
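As a self-contained illustration of the indexed-retrieval idea, the sketch below uses Python's built-in sqlite3 module as a stand-in for the MySQL server discussed in the talk; the table and column names are hypothetical:

```python
import sqlite3

# In-memory database standing in for the MySQL server used in the talk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, cost REAL)")
conn.executemany("INSERT INTO visits VALUES (?, ?)",
                 [(i % 1000, float(i)) for i in range(10000)])

# With an index, the engine can fetch one patient's rows without a full
# table scan -- the analogue of an indexed query versus 'use if'.
conn.execute("CREATE INDEX idx_patient ON visits (patient_id)")
rows = conn.execute(
    "SELECT cost FROM visits WHERE patient_id = ?", (42,)).fetchall()
```

Only the matching subset ever leaves the database, which is the performance gain the presentation quantifies for MySQL.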
Mass producing appendices using Stata and word processor mail merge
Michael Blasnik
M. Blasnik & Associates
Abstract
Confronted with the task of producing a large appendix to a report
that involved a page of tables and 3 graphs for each of 186 panels, the
author discovered an approach to automate this process using Stata combined
with the mail-merge facilities of a word processor. A Stata do-file
produces all 558 graphs and writes an ASCII file of data that also includes
the graph file names for each panel. A one-page mail-merge document is set
up in the word processor, and the Stata output is used as the data source to
automatically create the entire 186-page appendix with all tables and graphs
placed as desired. This session will outline how to employ this approach
for such otherwise daunting tasks.
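The data-source step can be sketched as follows; this Python fragment (not the author's Stata do-file) writes one row per panel containing statistics plus the graph file names that the word processor's mail merge pulls in as linked images. All names here are hypothetical:

```python
import csv

# Hypothetical panel summaries; in the talk these come from a Stata do-file.
panels = [{"panel": p, "mean_use": 100.0 + p} for p in (1, 2, 3)]

with open("mergedata.csv", "w", newline="") as f:
    w = csv.writer(f)
    # One row per panel: statistics plus the three graph file names the
    # mail-merge document will substitute into its placeholders.
    w.writerow(["panel", "mean_use", "graph1", "graph2", "graph3"])
    for d in panels:
        w.writerow([d["panel"], d["mean_use"]] +
                   [f"panel{d['panel']}_g{k}.png" for k in (1, 2, 3)])
```

The merge document is laid out once; repeating it over every row of this file yields the full appendix.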
Additional information
mblasniknasug.ppt (PowerPoint presentation)
Reproducible research using Stata
Phil Schumm
Department of Health Studies, University of Chicago
Coauthor: Ronald A. Thisted
Abstract
In this presentation, we shall address two issues concerning the
research process: (1) the need for an efficient way to transfer results
obtained using a statistical package into a written report, and (2) the need
to organize and package one's work so that those results are easily
reproducible. Previous work addressing these issues has concentrated on
tools written specifically for use with S-Plus/R and LaTeX and has involved
mixing code and written material in a single document (e.g., Sweave). In
contrast, our approach involves a method for extracting results (e.g.,
estimation results, graphs, and even pieces of code and raw output)
generated by a suitably organized do-file (or files) into a series of
intermediary files that can then be easily imported into a second written
document. Although the document could be constructed in a variety of
formats, we shall demonstrate how this may be done using
reStructuredText—an intuitive and easy-to-use markup syntax that can
then be automatically translated into a variety of final formats (e.g.,
LaTeX, HTML). This system may be used not only for research but also when
teaching statistics with Stata as a method for students to organize and to
submit their work.
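The extraction step can be sketched as follows, assuming results in a log are bracketed by tag comments; the tag syntax here is hypothetical, not the authors' actual format:

```python
import re

# Hypothetical log produced by a suitably organized do-file; each result
# is bracketed by begin/end tag comments so it can be pulled out into its
# own intermediary file for the written document.
log = """\
*--begin table1
 beta = 0.42 (se 0.05)
*--end table1
*--begin table2
 n = 186
*--end table2
"""

def extract_results(text):
    """Return {tag: contents} for every *--begin/*--end block."""
    pattern = re.compile(r"\*--begin (\w+)\n(.*?)\*--end \1\n", re.S)
    return {tag: body for tag, body in pattern.findall(text)}

results = extract_results(log)
```

Each extracted block would then be written to its own file and imported into the reStructuredText source, so rerunning the do-file regenerates every number in the report.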
Additional information
Schumm_NASUG-presentation.pdf (PDF)
Collaborative data management for longitudinal studies
Stephen Brehm
University of Chicago
Coauthor: L. Philip Schumm
Abstract
Efficient data cleaning and management are critical to the success of any
large research project. This is particularly true in the case of
longitudinal studies or those in which the data management tasks are
shared among many individuals. Faced with several such projects, we
developed a flexible, easy-to-use system for cleaning and managing research
datasets. The system is modular, making it easy for different individuals
to work on different parts of the process. This modularity also permits
substantial code reuse over multiple waves of a longitudinal study. A
central focus of the system is the idea of data testing; users write tests
for specific variables that may then be rerun when a new wave of data
becomes available or when changes to the data have been made. Although the
basic ideas could be implemented in any statistical package or programming
language, Stata is particularly well-suited to the task. In addition, we
have written an ado-file to automate the process of building a dataset and
another to generate basic tests automatically from an existing dataset.
Although the system was designed for use by large, collaborative projects,
individuals can also benefit from using it for personal research projects.
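The data-testing idea can be sketched as follows; the variable names and rules are hypothetical, and this Python fragment merely stands in for the authors' Stata ado-files:

```python
# Per-variable tests that can be rerun unchanged when a new wave of data
# arrives; each test maps a variable name to a validity rule.
tests = {
    "age": lambda v: 0 <= v <= 110,
    "income": lambda v: v >= 0,
}

def run_tests(records, tests):
    """Return a list of (record_index, variable) pairs that fail a test."""
    failures = []
    for i, rec in enumerate(records):
        for var, ok in tests.items():
            if var in rec and not ok(rec[var]):
                failures.append((i, var))
    return failures

# A new wave of (hypothetical) data: the second record has a bad age.
wave2 = [{"age": 34, "income": 52000}, {"age": -1, "income": 10000}]
```

Because the tests live apart from the cleaning code, the same suite is reapplied after every change to the data or arrival of a new wave.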
Additional information
CollaborativeDataManagementforLongitudinalStudies.ppt (PowerPoint presentation)
Creating valid and effective measures: Using -optifact- to create better summated rating scales
Paul Millar
University of Calgary
Abstract
While scales are an integral part of the measurements involved in many
research projects, few researchers make the effort to ensure that these
scales are valid, reliable, unidimensional, and consistent with relevant
correlates. This presentation outlines how the Stata routine
optifact can be used to create summated rating scales that are better
measures of the concept at issue, yet often require fewer questions. An
example using a major Canadian survey is presented to illustrate the
potential for better analysis and potentially substantial cost savings in
survey research.
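One standard reliability measure for a summated rating scale is Cronbach's alpha; the Python sketch below illustrates that computation only. It is not the optifact implementation, whose criteria also cover validity, dimensionality, and correlates:

```python
def variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """Cronbach's alpha for a summated scale. items is a list of columns,
    one list of respondent scores per question:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))
```

Dropping weak items can raise alpha while shortening the questionnaire, which is the cost-saving point the abstract makes.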
Additional information
BostonScales.ppt (PowerPoint presentation)
Using and teaching Stata in a semester-length introduction to biostatistics course
Clinton Thompson
Public Health Program, Department of Family & Preventive Medicine, University of Utah
Coauthors: Stephen C. Alder, Justin Brown, Laurie Johnson
Abstract
The University of Utah's Public Health Program, housed in the Department of
Family and Preventive Medicine, uses Stata to supplement the core course
"Introduction to Biostatistics, I". All MPH/MSPH students are required to
enroll in the lab component of the Biostatistics course, wherein Stata is
used to (1) juxtapose theory with practice, (2) provide students with the
tools necessary to complete the homework successfully, and (3) provide them
with a manual that is less dense than the Stata documentation and less
expensive than the Stata users' books, yet specific enough to address the
problems encountered in the course. The material in the Stata Lab Manual parallels
Pagano's and Gauvreau's Principles of Biostatistics and utilizes
many of the same examples and datasets. Each of the 16 lab sections has
the following format: a brief introduction to the topic du jour, an example
using a dataset, and an interpretation of the output. Although the
Lab Manual has been tailored to meet the specific needs of the course, many
sections rely heavily on explanations from the Stata online and print
documentation, and every effort is made to acquaint the student
with the help feature and how it is most efficiently accessed.
Additional information
Thompson.ppt (PowerPoint presentation)
Using Stata graphics as a method of understanding and presenting interaction effects
Joanne Garrett
Robert Wood Johnson Clinical Scholars Program, University of North Carolina at Chapel Hill
Abstract
It is fairly simple to add an interaction term (known to epidemiologists
as "effect modification") to a linear or logistic regression model
and test whether that term is statistically significant. However, it is
much more difficult to explain what a significant interaction means in an
intuitive way. A graphical representation of the interaction effect may
help. Stata graphics can be used to give students a better understanding of
what is actually happening when interaction is present. This can be helpful
before introducing the mathematical approach and interpretation in a model.
Graphing interactions also can be used as a simple method of exploratory
data analysis, or for reporting final results in a nonstatistical way in
presentations or journal articles.
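The point can also be made numerically: in the model y = b0 + b1*x + b2*g + b3*x*g, the interaction coefficient b3 is exactly the difference in the slope of x between the two groups. A tiny Python sketch with made-up coefficients:

```python
# Made-up coefficients for y = b0 + b1*x + b2*g + b3*x*g; the interaction
# b3 is how much the slope of x differs between groups g = 0 and g = 1.
b0, b1, b2, b3 = 2.0, 0.5, 1.0, 0.8

def predict(x, group):
    """Fitted value from the interaction model."""
    return b0 + b1 * x + b2 * group + b3 * x * group

# Slope of x within each group: b1 for group 0, b1 + b3 for group 1.
slope = {g: predict(1.0, g) - predict(0.0, g) for g in (0, 1)}
```

Plotting the two predicted lines makes the non-parallel slopes visible at a glance, which is the graphical intuition the talk develops.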
Additional information
Garrett.ppt (PowerPoint presentation)
cron, perl and Stata: automated production and presentation of a business-daily index
Kit Baum
Boston College and RePEc
Coauthor: Atreya Chakraborty, UMass Boston
Abstract
In a Unix-based environment such as Mac OS X, it is very simple to
set up 'cron jobs' that run at any periodic interval. This presentation
illustrates how 'cron', the Perl scripting language, and Stata can be used to
generate a database of daily stock quotes from a regional index, compute an
investor sentiment index based on a Spearman rank correlation, and "publish"
the results to a web page. Stata's file command is particularly
useful in generating the various formats needed in the web page
presentation.
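The Spearman rank correlation underlying the sentiment index can be sketched as follows (Python here as an illustration; the talk computes it in Stata). It is simply the Pearson correlation of the midranks:

```python
def ranks(xs):
    """1-based ranks with ties given the average (mid) rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied rank positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Applied to each day's stock returns against a reference ordering, this yields the business-daily index the cron job publishes.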
Additional information
RAIM-NASUG2005aug.pdf (PDF)
Selecting the appropriate statistical distribution for the primary analysis: a case study
Peter A. Lachenbruch
FDA/CBER
Abstract
An article in The Lancet discussed a clinical trial of a product
for a rare disease. The authors had modified the primary analysis from an
unadjusted Wilcoxon rank-sum test to a Poisson regression. This led to
several questions: Was a Poisson distribution appropriate? How
were the covariates selected? What is the effect of outliers (there were
some)? If the Poisson model is not appropriate, can a permutation test
approach provide information regarding the effect of the treatment? The
presentation will show how Stata was used to evaluate these assumptions and
provide an alternative analysis. The implications of this analysis are
discussed.
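A two-sample permutation test of the kind mentioned can be sketched as follows; this exact-enumeration Python version is feasible only for tiny samples and is illustrative, not the presenter's analysis:

```python
import itertools

def permutation_pvalue(a, b):
    """Exact two-sided permutation test for a difference in means:
    enumerate every relabeling of the pooled sample and count how often
    the relabeled difference is at least as extreme as the observed one."""
    combined = a + b
    n = len(a)
    obs = abs(sum(a) / len(a) - sum(b) / len(b))
    count = total = 0
    for idx in itertools.combinations(range(len(combined)), n):
        grp_a = [combined[i] for i in idx]
        grp_b = [combined[i] for i in range(len(combined)) if i not in idx]
        stat = abs(sum(grp_a) / n - sum(grp_b) / len(grp_b))
        count += stat >= obs - 1e-12  # tolerance for float comparison
        total += 1
    return count / total
```

Because the reference distribution is built from the data themselves, the test needs no Poisson (or any other) distributional assumption, which is why it serves as a check on the primary analysis.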
Additional information
Lachenbruch.ppt (PowerPoint presentation)
Using Stata 9 to model complex nonlinear relationships with restricted cubic splines
William D. Dupont
Department of Biostatistics, Vanderbilt University School of Medicine
Coauthor: Dale Plummer
Abstract
Restricted cubic splines (RCSs) are used with generalized linear
models and other regression methods to model complex nonlinear
relationships. An RCS with k knots is linear before the first knot
and after the last knot, is a cubic polynomial between adjacent knots, and
is continuous and smooth. An RCS model with k knots can be fitted
with only k-1 covariates. rc_spline calculates these
covariates from an independent covariate and the knot values. Default
numbers of knots or knot values suggested by Harrell (2001) may be used. We
can then use the full power of Stata 9 to build models, construct graphs,
and perform residual analyses using these covariates. RCSs are illustrated
by modeling length of stay (LOS) and discharge mortality as a function of
admission blood pressure (BP) from the SUPPORT study. LOS and log-odds of
death are highly nonlinear functions of BP. Multiple linear and logistic
regression with RCSs are used to model these data. Plots of expected
outcome with 95% confidence bands are easily overlaid on scatterplots using
standard Stata graphics. These regression curves are little affected by the
knot placements. This robust methodology is easily taught to
nonstatisticians and greatly expands the modeling capacity of standard
regression methods.
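The spline covariates rc_spline computes can be sketched from Harrell's (2001) formula; this Python version is illustrative (normalization conventions vary) and is not the rc_spline code itself:

```python
def rcs_basis(x, knots):
    """Restricted cubic spline covariates for one value x, following
    Harrell (2001): k knots yield k - 1 covariates (a linear term plus
    k - 2 spline terms), and the fitted curve is linear beyond the
    outer knots. Terms are normalized by (t_k - t_1)^2."""
    t = knots
    k = len(t)
    norm = (t[-1] - t[0]) ** 2

    def cube(u):
        return max(u, 0.0) ** 3  # truncated cubic (u)_+^3

    cols = [x]  # linear term
    for j in range(k - 2):
        term = (cube(x - t[j])
                - cube(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                + cube(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
        cols.append(term / norm)
    return cols
```

The restriction is visible numerically: beyond the last knot every spline term grows linearly, so the second difference of each covariate vanishes there.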
Additional information
RCsplines.pdf (presentation slides/PDF)
RCsplines.ppt (PowerPoint presentation)
support.do (sample program; text/plain)
support.dta (sample dataset)
support.log (sample log file)
Adjusting for unequal selection probability in multilevel models: A comparison of software packages
Kim Chantala
Carolina Population Center, UNC at Chapel Hill
Coauthors: C. M. Suchindran, Dan Blanchette
Abstract
Most surveys collect data using complex sampling plans that involve
selection of both clusters and individuals with unequal probability of
selection. Research in methods of using multilevel modeling (MLM)
procedures to analyze such data is relatively new. Often sampling weights
based on selection probabilities of individuals are used to fit
population-based models. However, sampling weights used for fitting
multilevel models need to be constructed differently than weights used for
single-level (population-average) models. This paper compares the
capabilities of MLwiN, Mplus, LISREL, PROC MIXED (SAS), and gllamm
(Stata) for fitting multilevel models from data collected with a complex sampling plan.
We illustrate how sampling weights for fitting multilevel models with these
software packages can be constructed from population average weights.
Finally, we use data from the National Longitudinal Survey of Adolescent
Health to contrast the results from these packages.
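One common scaling rule rescales each individual's weight so that the weights within a cluster sum to the cluster's sample size; the sketch below illustrates that rule only and is not any particular package's implementation:

```python
def scale_weights(cluster_weights):
    """Rescale level-1 sampling weights within each cluster so they sum
    to the cluster's sample size (one common multilevel scaling rule).
    cluster_weights maps a cluster id to its list of raw weights."""
    scaled = {}
    for cluster, w in cluster_weights.items():
        factor = len(w) / sum(w)
        scaled[cluster] = [wi * factor for wi in w]
    return scaled
```

Raw population-average weights fed directly into a multilevel likelihood can badly bias variance components, which is why each package's weight handling must be checked as the paper does.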
Additional information
Chantala.ppt (PowerPoint presentation)
Day 2: Training courses from StataCorp
Analysis of survey data and correlated data
Jeff Pitblado, StataCorp
Abstract
This talk discusses Stata's features for analyzing survey data and correlated
data and explains how and when to use the three major variance estimators
for survey and correlated data: the linearization estimator, balanced repeated
replications, and the clustered jackknife (the latter two having been added in
Stata 9). The talk will also discuss sampling designs and stratification,
including Stata's new features for estimation with data from multistage
designs and for applying poststratification. A theme of the seminar will be
how you can make inferences with correct coverage from data collected by
single stage or multistage surveys or from data with inherent correlation,
such as data from longitudinal studies.
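The clustered (delete-one-cluster) jackknife can be sketched for the simplest case, the variance of a sample mean; this Python fragment is illustrative only, not Stata's implementation:

```python
def jackknife_cluster_variance(clusters):
    """Delete-one-cluster jackknife estimate of the variance of the
    overall sample mean. clusters is a list of lists of observations;
    whole clusters are deleted so within-cluster correlation is
    reflected in the variance."""
    k = len(clusters)
    reps = []
    for i in range(k):
        # Recompute the statistic with cluster i deleted.
        kept = [x for j, c in enumerate(clusters) if j != i for x in c]
        reps.append(sum(kept) / len(kept))
    rbar = sum(reps) / k
    return (k - 1) / k * sum((r - rbar) ** 2 for r in reps)
```

Deleting clusters rather than individual observations is what makes the estimator valid when observations within a cluster are correlated.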
Additional information
svyNASUG2005.do
(installer for talk materials; text/plain)
Mata — matrix programming language
William Gould, StataCorp
Abstract
Mata is both an interactive environment for manipulating matrices and a full
development environment that produces compiled and optimized code. This talk
will cover both applications, with an emphasis on how you can use Mata to
quickly program solutions and how you can easily create new Stata commands
with Mata. (Mata is fully integrated with Stata.) As you learn how to use
Mata, it will become clear why Stata developers chose to implement some of the
major new features in Stata 9 using Mata, including linear mixed models and
multinomial probit.
Additional information
bostonMataTalk.do
(installer for Mata talk materials; text/plain)