3rd North American Stata Users Group meetings: Abstracts
Monday, August 23, 2004
Use of Gaussian integration in
Stata
Alan Feiveson
NASA—Johnson Space Center

Abstract
Gaussian integration can be used to obtain surprisingly accurate evaluations
of definite integrals with as few as 10 or 20 function evaluations. In this
presentation, it will be shown how to incorporate tables of Gaussian
integration weights into Stata datasets and use them to evaluate integrals
for each observation. The same approach can be used to incorporate integrals
involving model parameters as part of a maximum likelihood or nonlinear
least-squares estimation process. Examples will be given using data from
NASA's biomedical research for developing countermeasures to the adverse
effects of prolonged spaceflight on astronauts.
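The idea of storing a table of Gaussian integration weights once and reusing it for every observation can be sketched as follows. This is a minimal illustration in Python rather than Stata, assuming a Gauss-Legendre rule; the function names, the integrand, and the per-observation upper limits are hypothetical and are not taken from the presentation.

```python
import math

def gauss_legendre(n):
    """Return (nodes, weights) of the n-point Gauss-Legendre rule on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        # Standard initial guess for the i-th root of the Legendre polynomial P_n
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(100):
            # Evaluate P_n(x) (p1) and P_{n-1}(x) (p0) by the three-term recurrence
            p0, p1 = 1.0, x
            for k in range(2, n + 1):
                p0, p1 = p1, ((2 * k - 1) * x * p1 - (k - 1) * p0) / k
            dp = n * (x * p1 - p0) / (x * x - 1.0)  # derivative P_n'(x)
            dx = p1 / dp                            # Newton step
            x -= dx
            if abs(dx) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights

def integrate(f, a, b, rule):
    """Approximate the integral of f over [a, b] using a precomputed rule."""
    nodes, weights = rule
    half, mid = 0.5 * (b - a), 0.5 * (a + b)
    return half * sum(w * f(mid + half * x) for x, w in zip(nodes, weights))

def normal_density(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

# The rule plays the role of the stored table of weights: computed once,
# then reused for every (hypothetical) observation-specific upper limit.
rule = gauss_legendre(10)
upper_limits = [0.5, 1.0, 2.0]
integrals = [integrate(normal_density, 0.0, z, rule) for z in upper_limits]
```

Each integral here costs only 10 function evaluations, and an n-point rule is exact for polynomials up to degree 2n - 1, which is why so few points can suffice for smooth integrands.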
Additional information
feiveson.ppt
feiveson.pdf
Generating random variables
from the N/I distributions
Peter A. Lachenbruch
U.S. FDA

Abstract
The N/I distributions are the ratio of a normal distribution to an independent
distribution; they include the normal, Cauchy, t, and slash for various cases
of the denominator distribution. The author has developed a program that will
generate these distributions for use in simulations. Additionally, mixtures
are allowed, and one can obtain the distribution of the inverse of the I
distribution by setting the numerator normal to have mean 1 and standard
deviation 0. These were originally developed for the robust estimation study
of Andrews et al. (1972).
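The construction described above can be sketched as a normal variate divided by an independent denominator. The following Python fragment is an illustration only and is not the nirand program; the function names are hypothetical, and the denominators shown are the textbook choices that yield the normal, slash, Cauchy, and t distributions.

```python
import math
import random

def ni_draw(rng, denom, mu=0.0, sigma=1.0):
    """One N/I draw: a Normal(mu, sigma) variate over an independent denominator."""
    return rng.gauss(mu, sigma) / denom(rng)

def one(rng):
    return 1.0                              # N/1 is the normal itself

def uniform01(rng):
    return 1.0 - rng.random()               # uniform on (0, 1]: gives the slash

def half_normal(rng):
    return abs(rng.gauss(0.0, 1.0))         # |N(0,1)|: gives the Cauchy

def chi_over_df(df):
    """Denominator sqrt(chi2_df / df), which gives Student's t with df d.f."""
    def draw(rng):
        # A chi-squared(df) variate is Gamma(shape=df/2, scale=2)
        return math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)
    return draw

rng = random.Random(2004)
slash_sample = [ni_draw(rng, uniform01) for _ in range(5)]
t10_sample = [ni_draw(rng, chi_over_df(10)) for _ in range(5)]

# Numerator with mean 1 and standard deviation 0: the draw is 1/denominator,
# that is, a draw from the inverse of the I distribution.
inverse_uniform = ni_draw(rng, uniform01, mu=1.0, sigma=0.0)
```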
Additional information
nirand.ppt
nirand.ado
nirand.hlp
Econometric techniques for estimating
treatment effects
Zhehui Luo
Department of Epidemiology, Michigan State University

Abstract
One way to evaluate the econometric techniques of estimating treatment effects
is to use experimental data to gauge results of different methods (LaLonde
1986). There has been heated debate since LaLonde's seminal paper as to
whether the propensity-score techniques overcome the selection problem (Smith
and Todd 2003; Dehejia 1999, 2002). This study uses a randomized trial of
cognitive behavioral intervention on reducing the severity of symptoms and
their impact on emotional distress and physical function for cancer patients.
We use several other datasets from which cancer patients were selected as
comparison groups. We estimate the "true" treatment effect on physical
function and mental health (SF-36) with the randomized trial and compare the
results of the following econometric techniques using the comparison groups:
(1) difference-in-differences (DID) method, (2) instrumental variables, and
(3) propensity-score matching estimators (including nearest neighbor, radius
matching, stratification, and kernel matching) (Becker and Ichino 2002). The
results show that the performance of propensity-score matching depends on the
comparison sample and the outcome compared, and that the bias is larger when
the comparison sample differs more from the treated group.
Additional information
luo.ppt
Sample-size calculation for
longitudinal studies
Phil Schumm
Department of Health Studies, University of Chicago

Abstract
Consider a longitudinal study designed to estimate the difference in the rates
of change in some outcome between two different groups. In this case, the
variance of the estimator depends on several factors, including the
variability in the outcome, the amount of missing data due to dropout, the
distribution of additional covariates, and the degree and structure of the
within-unit correlation across time. Although it is often possible to compute
the variance (or an approximation to it) directly from a mathematical formula,
this can be unwieldy for those unfamiliar with such computations. In this
presentation, I will demonstrate (using real examples) how xtgee can be
used to compute the variance, from which an estimate of power may be obtained.
By creating an appropriate pseudo-dataset, it is possible to specify virtually
any covariate distribution and pattern of dropout. In addition, because
xtgee will accept an arbitrary fixed correlation matrix, it is easy to
specify whatever correlation structure is considered most plausible. This
method is intuitive and makes it easy for researchers to explore the effects
that changes in their assumptions have on a study's power. A comparison of the
results of this method with those generated by other sample size software will
also be presented.
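The step from a variance to an estimate of power can be sketched as follows. This assumes a two-sided Wald test with a normal reference distribution, which may differ from the calculation used in the presentation; the function name and all numbers are hypothetical. In practice, the standard error would come from the model fitted to the pseudo-dataset.

```python
from statistics import NormalDist

def wald_power(delta, se, alpha=0.05):
    """Power of a two-sided Wald test of H0: difference = 0.

    delta : hypothesized true difference in rates of change (assumed)
    se    : standard error of its estimator under the planned design
    """
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / se
    # Probability that the test statistic falls in either rejection region
    return z.cdf(shift - crit) + z.cdf(-shift - crit)

# Hypothetical inputs: a difference in slopes of 0.5 units per year,
# estimated with a standard error of 0.2 under the planned design.
power = wald_power(0.5, 0.2)
```

Rerunning the calculation with the standard error obtained under a different dropout pattern or correlation structure shows directly how those assumptions affect power.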
Using Stata for questionnaire
development
Theodore Pollari
Phil Schumm
Department of Health Studies, University of Chicago

Abstract
In studies that collect survey data, the investigator(s) often construct the
questionnaire using a word processor and then deliver it to a survey
organization, which translates it into an electronic data collection instrument
(e.g., CAPI or CATI). Unfortunately, this approach suffers from the following
problems: (1) a word processor is not well suited to the development of a
complex questionnaire, (2) time is wasted and errors may occur when
translating the questionnaire into CAPI, and (3) background information about
the individual questions that is often relevant for analysis of the data
(e.g., question source and rationale, scoring instructions, etc.) is not
preserved in the final data file. We will describe a system that permits an
investigator to construct a questionnaire in Stata by representing questions
as variables and using labels and characteristics to specify attributes such
as question text, response categories, and background information together
with specifications regarding the structure of the interview (e.g., skip
patterns and loops). The resulting .dta file is automatically translated
into a variety of useful forms, including a human-readable version of the
questionnaire and a format that may be imported directly into CAPI. The file
also serves as a shell into which the actual data may be placed so that
researchers analyzing the data have easy access to question attributes.
Translating data between MySQL
and Stata
Michael Johnson
Phil Schumm
Department of Health Studies, University of Chicago

Abstract
As web-based and other electronic data collection methods become more widely
used in research, the opportunities to use statistical software in conjunction
with conventional database systems are increasing. Among such systems, MySQL
is particularly well suited for research purposes. For example, MySQL's ENUM
and SET column types are ideal for storing data collected via the
multiple-choice questions typically used in social surveys. At the same time, Stata is
uniquely suited for working in conjunction with a database; for example, its
implementation of characteristics makes it possible to preserve (in a usable
form) important information about how the database and front-end application
are constructed (e.g., column types and other attributes). In this
presentation, we shall describe a Python script we have developed for
translating data from MySQL to Stata and will indicate briefly how we are
using it in the development of tools for the collection and management of
research data.
Working with ODBC data sources in
Stata: tips and techniques
Joseph Coveney
Cobridge Co., Ltd., Tokyo

Abstract
With its suite of ODBC-related commands, Stata can now be used directly with
many popular database management systems (DBMSs) and other ODBC data sources.
Stat/Transfer's ODBC capabilities have permitted indirect access for some
time. ODBC has advantages over copy and paste or save as/insheet operations
for reproducibility of the analysis and for documentation of its trail of
events. The suite's ability to use Structured Query Language statements also
facilitates use of DBMSs for storage and organization of massive sets of data,
while economizing on memory during analysis with Stata by limiting what is
loaded into the active dataset to only pertinent rows and columns. This
presentation will briefly review the suite, and then, using a case study
approach, it will illustrate the use of this suite in solutions to selected
data management problems, for example, when clients deliver data for analysis
in spreadsheets laid out in an unfavorable manner, or when datasets are
delivered in ill-designed relational databases or in those that are subject to
frequent updating. The presentation will also share tips and precautions from
experience with its use with two popular spreadsheet and database packages,
and give pointers on using ODBC data sources that contain text in double-byte
character sets (Unicode).
Using Stata with large datasets
in corporate America: lessons learned
Ed Bassin
ProfSoft, Inc.

Abstract
While Stata has gained wide acceptance in academia, its use in corporate
environments lags far behind. For academic Stata users, the product's
inability to penetrate business has important consequences. Students trained
in Stata have fewer opportunities than those trained in other tools,
particularly SAS, which are widely used by businesses. Academic statisticians
have fewer opportunities to collaborate and consult with colleagues in
business. For the past five years, ProfSoft has been developing and marketing
a medical claims analysis system that builds on the data management and data
analysis capabilities in Stata. Our experience has shown that Stata can
greatly enhance the analytic capabilities of health plans, provider
organizations, and purchaser coalitions and that Stata is a very powerful tool
for working with large corporate databases, sometimes in excess of 100 million
records. During that time, we have learned many lessons about how Stata can
gain acceptance in mainstream corporate America. In this presentation, we
discuss factors that helped gain acceptance for a software product that is
based on Stata. We discuss the features in Stata that are most important to our
customers and those that are of little interest. We demonstrate business
applications with web-based graphical user interfaces that unleash the power
of Stata to users who have little or no interest in learning how to use Stata
directly.
Additional information
bassin.ppt
Graphics for categories and
compositions
Nicholas J. Cox
Department of Geography, Durham University, UK

Abstract
Graphics and categorical data are odd bedfellows. A pie chart of the
frequencies of a categorical variable may be the first statistical technique
taught to young children, and there is a very substantial if self-contained
literature on biplots and related methods. Yet, in between, many texts and
papers on categorical data make little or no use of graphical methods. Is this
because appropriate graphs do not exist, or are they too trivial or too
ineffective to be worth attention? I shall discuss various Stata
implementations of graphs for categorical data, both familiar and unfamiliar,
old and new, including bar and dot charts, cumulative and sliding plots,
triangular plots, and tabular plots. Subsidiary themes will include, on the
statistical side, support for logit and other appropriate nonlinear scales,
respect for ordinal structure, smoothers for categorical data and
transformations of the simplex; and on the Stata side the strategy and
trickery of writing user-written graphics programs as wrappers for the new
graphics of Stata 8, aiming both to maximize user choice and to minimize
user-programmer effort.
Metagraphiti by Stata: Visuographical exploration and presentation of meta-analytic data using Stata
Ben Dwamena
University of Michigan Medical School

Abstract
Meta-analysis is considered the highest level of evidence on effectiveness of
healthcare interventions. It provides important information by capitalizing on
the large numbers of studies performed to assess the impact of healthcare
interventions, helps reduce variability and uncertainty among published
reports of efficacy, produce summary estimates of effectiveness for clinical
decision making, and evaluate the quality of the published evidence. However, a
large proportion of meta-analyses pose a surprising challenge for the
uninitiated user: in order to figure out what the researchers found, the user
must struggle through a maze of textual jargon, statistical formulas and
lengthy lists of actual studies and extensive tables of overall average effect
size and mean effect sizes for important subgroups of studies. On the premise
that "a picture is worth more than a thousand words but a 'metagraphita' is
worth more than a thousand words and statistical tests", the purpose of this
presentation is to provide an idiot-proof overview of statistical
graphics/diagnostic plots for exploration of publication bias, data
distribution, heterogeneity and for summarizing overall datasets. Discussion
will include the construction and interpretation of general graphical displays
such as weighted histograms, normal quantile plots, forest plots, funnel
graphs, scatter diagrams, as well as plots unique to diagnostic meta-analysis
(e.g., ROC plane graphic, Accuracy-Threshold regression plots, summary receiver
operator characteristic curves and likelihood-ratio scattergrams).
The presentation will consist of a didactic slide presentation supplemented by
handouts and an annotated bibliography, and an illustration of the derivation and
interpretation of visual displays from published meta-analyses using Stata.
Additional information
metagraphitinotes.pdf
Density-distribution sunflower plots in Stata 8
William D. Dupont
Department of Biostatistics, Vanderbilt University School of Medicine

Abstract
Density-distribution sunflower plots are used to display high-density
bivariate data. They are useful for data where a conventional scatter plot is
difficult to read due to overstriking of the plot symbol. The x-y plane is
subdivided into a lattice of regular hexagonal bins of width w specified by
the user. The user also specifies the values of l, d, and k that affect the
plot as follows. Individual observations are plotted when there are fewer than
l observations per bin, as in a conventional scatterplot. Each bin with from l
to d observations contains a light sunflower. Other bins contain a dark
sunflower. In a light sunflower, each petal represents one observation. In a
dark sunflower, each petal represents k observations. The user can control the
sizes and colors of the sunflowers. By selecting appropriate colors and sizes
for the light and dark sunflowers, plots can be obtained that give both the
overall sense of the data density distribution as well as the number of data
points in any given region. The use of this graphic is illustrated with data
from the Framingham Heart Study. Stata version 8.2 contains a program, called
sunflower, which draws these graphs.
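The rule that decides how each bin is drawn can be sketched as follows. This Python fragment is only a reading of the description above and is not the sunflower program: it skips the hexagonal binning, treats the range "from l to d" as inclusive at both ends, and rounds the number of dark petals to the nearest integer, all of which are assumptions.

```python
def classify_bin(count, l, d, k):
    """Return (kind, n) for a bin holding `count` observations.

    kind is "points" (plot observations individually), "light"
    (one petal per observation), or "dark" (one petal per k observations);
    n is the number of points or petals to draw.
    """
    if count < l:
        return ("points", count)
    if count <= d:                      # assumed inclusive upper bound
        return ("light", count)
    return ("dark", round(count / k))   # assumed rounding of petal count

# Hypothetical bin counts with l = 3, d = 13, k = 5
examples = [classify_bin(c, 3, 13, 5) for c in (2, 3, 13, 50)]
```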
Additional information
sunflower.pdf
Replication methods for complex survey analysis in Stata
Nicholas Winter
Department of Government, Cornell University

Abstract
This talk will discuss the svr suite of user-written commands in
Stata. These commands facilitate the analysis of data from surveys with
complex sampling plans and represent an alternative to official Stata's
Taylor series linearization-based svy commands. I will touch briefly on
the theoretical basis for these techniques and contrast them with Taylor
series. The heart of the talk will present the commands. I will conclude with
some observations of the joys and sorrows of constructing add-on commands to
official Stata.
Additional information
winter.ppt
Rolling regressions in Stata
Kit Baum
Department of Economics, Boston College and RePEc

Abstract
This talk will describe some work underway to add a "rolling regression"
capability to Stata's suite of time-series features. Although commands such as
statsby permit analysis of nonoverlapping subsamples in the time
domain, they are not suited to the analysis of overlapping (e.g.,
"moving-window") samples. Both moving-window and widening-window techniques are often
used to judge the stability of time series regression relationships. We will
present an implementation of a rolling regression command and illustrate with
examples from the empirical literature.
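The difference between the moving-window and widening-window techniques can be sketched as follows. This is a minimal Python illustration with a single regressor, not the rolling regression command described in the talk; the function names and the data are hypothetical.

```python
def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

def rolling_slopes(x, y, window, widening=False):
    """One slope per window end point.

    Moving window: the most recent `window` observations.
    Widening window: all observations from the start up to the end point.
    """
    slopes = []
    for end in range(window, len(x) + 1):
        start = 0 if widening else end - window
        slopes.append(ols_slope(x[start:end], y[start:end]))
    return slopes

# Hypothetical series whose slope changes from 1 to 3 halfway through
x = list(range(20))
y = [t if t < 10 else 3 * t - 20 for t in x]
moving = rolling_slopes(x, y, window=5)
widening = rolling_slopes(x, y, window=5, widening=True)
```

In this example the moving-window slopes settle at the new value once the window has passed the break, while the widening-window slopes adjust only gradually, which is why the two techniques are used together to judge stability.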
Additional information
baum.pdf
rollreg_X.do
rollreg_X2.do
mvcorr_X.do
Implementation of quasi-least
squares using xtgee in Stata
Justine Shults
Department of Biostatistics, University of Pennsylvania

Abstract
Liang and Zeger's original formulation of generalized estimating equations
(GEE) has been widely applied since its introduction in 1986 because it
extends the application of generalized linear models to clustered data. In
this presentation, we discuss a method, quasi-least squares (QLS), that is in
the framework of GEE and builds on this popular approach by allowing for
consideration of correlation matrices that were previously difficult to apply.
In particular, we describe how to implement QLS in a straightforward fashion by making
use of Stata's xtgee procedure. We also discuss some data analysis
examples.
Additional information
shults.ppt
To help others in teaching
statistics using the Stata software
Susan Hailpern
Albert Einstein College of Medicine

Abstract
This presentation will discuss the issues involved with teaching statistics
with Stata to physicians in an MS program at Albert Einstein College of
Medicine (AECOM). The Clinical Research Training Program (CRTP) at AECOM is a
2-year course of study for physicians wishing to earn a Master of Science
degree in Clinical Research Methods. The program has two complementary
components: a) a didactic program with emphasis on epidemiology, biostatistics,
study design, and ethics, and b) a mentored clinical research experience.
Since the program's beginning in 1998, basic statistics has been taught using the SPSS
statistical software. SPSS was felt to be easy to teach and learn because of
the "pull-down" menus. However, as students advanced, SPSS was found to be too
limited in its application to their clinical research. In particular, Stata
has the capability to perform multinomial and ordinal logistic regressions,
frailty models for multivariate survival analysis (semiparametric and
parametric), and immediate commands, none of which SPSS offers. This
summer, Stata 8 will be taught to CRTP students for the first time. Our
experience with the new Stata has convinced us that Stata 8 will be
easy to learn and use with the addition of "pull-down" menus. The fact that
the instructors teaching statistics with Stata come from very different
backgrounds will make this an interesting challenge. The senior instructor has
had extensive experience using SPSS and is a relative newcomer to Stata. The
other instructor has had extensive experience using Stata, but with expertise
in writing Stata programs (and is unfamiliar with using the "pull-down" menus
available in version 8). This presentation will discuss the course changes
planned in converting to Stata, as well as the successes and failures of
teaching statistics with Stata to physicians in an MS program at Albert
Einstein College of Medicine.
Additional information
hailpern.ppt
Sensitivity analysis on traffic
crash prediction models by using Stata
Deo Chimba
Department of Civil Engineering, Florida State University

Abstract
Traffic accidents result from the interaction of different parameters,
including highway geometrics, traffic characteristics, and human factors.
Geometric variables include number of lanes, lane width, median width,
shoulder width, roadway length, number of intersections, and access density,
while traffic characteristics include AADT and speed. The effect of these
parameters can be captured by predictive models that predict crash rates at a
particular roadway section. Stata commands can be used to test the sensitivity
of the crash rate to these variables after modeling. In current research
sponsored by the Florida Department of Transportation, titled "Evaluation of
Geometric and Operational Characteristics affecting the safety of Six-lane
divided Roadways", we use these commands to determine the effect on crash rate
of changes in these independent variables. We selected our model based on the
user-written command nbvargr, which gives the dispersion factor between
Poisson and negative binomial. By using Vuong's value, we were able to choose
between zero-inflated and normal models. With the listcoef, percent
command, we determine the percent change in crash rate for unit and standard
deviation increases in the independent variables. By using the mfx, compute
command, we were able to determine numerically the marginal effects or the
elasticities between crash rate and the independent variables. These commands,
and other built-in commands, reveal whether an increase in the size or dimension
of roadway geometrics will result in a higher or lower crash rate.
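The percent-change interpretation mentioned above can be sketched as follows. For a count model with a log link (Poisson or negative binomial), a coefficient b implies that the expected count is multiplied by exp(b * delta) when the covariate increases by delta. The Python function below applies that standard transformation; the coefficient and standard deviation are hypothetical, and it is an assumption that this reproduces the presenter's output exactly.

```python
import math

def percent_change(b, delta=1.0):
    """Percent change in the expected count for a `delta` increase in a
    covariate whose log-link coefficient is b."""
    return 100.0 * (math.exp(b * delta) - 1.0)

# Hypothetical coefficient on shoulder width (per foot) and its sample SD
b_shoulder = -0.08
sd_shoulder = 2.5
per_unit = percent_change(b_shoulder)             # one-foot increase
per_sd = percent_change(b_shoulder, sd_shoulder)  # one-SD increase
```

A negative result indicates a predicted reduction in the crash rate and a positive result indicates an increase.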
Additional information
chimba.pdf
Tuesday, August 24, 2004
Stata Graphics
Vince Wiggins
StataCorp LP

Abstract
This course will cover in detail the basic commands and concepts for
building high-quality Stata graphs from scratch. You will learn new
approaches to creating graphs, including organizing and managing your
data, and creating custom schemes.
Additional information
There are annotated materials for this talk that can be viewed and run from within
Stata. To find, install, and begin the materials, type the following commands
in Stata:
net from http://www.stata.com/users/vwiggins
net describe boston04
net install boston04
bgrtalk
whelp bgrtalk