(Last updated: 25 April 2001)
First North American meeting: announcement and abstracts
Longwood Galleria Conference Center
342 Longwood Avenue
Boston, Massachusetts
March 12 and 13, 2001
Meeting overview
by Bill Gould
First, I would like to add a few words about what these meetings mean to
StataCorp.
For those who have never attended a user group meeting, let me briefly
describe the format. The presentations were interspersed with relaxed
discussion over coffee and food, and the first day's meeting was followed by a
dinner. Presentations varied from descriptions of statistical methods, how to
perform such analyses in Stata, and comparisons of the results so obtained
with those of other packages, to descriptions of new user-programmed features,
to advice on the use of Stata in teaching.
In all of these meetings, user presentations have come first and Stata's
presentations at the end. StataCorp is always happy to participate, but the
user meetings are organized by users for users.
What makes each of these meetings worthwhile is the high quality of the work
that is performed by users for users. In my talk at the end, I mentioned that
a goal of Stata has been to (re)open the development of statistical software
to the users (remember that, originally, statistical software was written by
the users). Based on the evidence presented at the meetings, users have had
much success.
For those interested, here is a more detailed summary of the meetings:
In the first session on Monday morning, we opened with a review of fitting GEE
models in Stata by Nicholas Horton (Boston University), followed by an
introduction with software to social network analysis (QAP or dyadic data) by
William Simpson (Harvard Business School), and closing with Stas Kolenikov (U.
North Carolina) discussing the use of ml to fit normal mixture
decomposition models.
After a break, we returned to hear Jeremy Freese (U. of Wisconsin) present his
and Scott Long's postestimation commands for use with regression models for
categorical and count data, followed by Jeroen Weesie (U. of Utrecht, but
currently visiting StataCorp) discussing his work on a new command for testing
for omitted variables, which is to say, verifying specification of models.
We broke for lunch, and thereafter heard Michael Duggan (Suffolk U.)
and then Alicia Dowd (U. Mass. Boston) discuss survey analysis, the main point
being to compare (correct) answers calculated by Stata's svy commands
with the ex post F-deflator approach for adjusting results that is popularly
used with SPSS and SAS. Michael Blasnik (Blasnik & Associates) then spoke on
using Stata ado-files to produce reams of tables based on many statistics
calculated by the svy commands. Richard Goldstein (Qualitas Inc.)
closed the session by showing how random-coefficient models can be fit
in Stata, which, beyond its obvious practical benefits, served the purpose of
emphasizing where the estimates of random coefficients come from.
After a final break for the day, Rino Bellocco (Karolinska Institutet) spoke on
the analysis of longitudinal data and compared results produced by Stata, SAS,
and S-PLUS. Harriet Griesinger (Wellesley Child Care Res.), in "Date and time
tags for filenames in WinXX", provided a solution for a problem that
originally appeared on Statalist, and Kit Baum (Boston College) provided a
summary of his work in analyzing multi-frequency panel data with Stata (the
data being essentially of dimension i x i x t), and introduced us to the
concepts of "long-long" data, "long-wide" data, etc. Petia Petrova (Boston
College) spoke on using cross-year family-individual files from the PSID,
which is one of the most popular datasets used by economists these days.
Such was the first day; we broke, some went for drinks, others to attend to
personal matters, and we met some hours later at the Indian restaurant.
The second day opened with a talk by Phil Ender (UCLA) on teaching with Stata,
whose real-time teaching tools he demonstrated to the delight of all. That
was followed by David Kantor (Johns Hopkins Univ.) on three-valued logic
(which talk probably produced the most questions and comments at its
conclusion). Following that, Nicholas J. Cox (Durham University) spoke on
analyzing circular data with Stata (circular referring to the fact that there
are 360 degrees in a circle, and an important point being to emphasize that
Geography is *NOT* about providing answers to questions such as "Why is Albany
the state capital of New York?"). David Drukker (StataCorp) then spoke on
panel-data analysis, with an emphasis on the Arellano–Bond estimator.
After a break, the final formal session focused on Stata. Joe Newton (Texas
A&M Univ. and Editor, STB) spoke on the STB and its future, and then Bill
Gould (President, StataCorp) spoke on Stata and its future, which I
refer to as "Report to Users" and which was first popularized in London.
(These talks tend to be fairly honest assessments of recent successes and
failures, so do not expect a written summary).
After the break, the Wishes and Grumbles session opened with Kit Baum as the
moderator. These sessions are always of great use to StataCorp. Right now, I
know others are putting together a detailed summary of this section, so I will
leave that for later.
The description above does not do an adequate job of explaining just how much
I enjoyed the meeting and was impressed by the presentations. I want to
express my gratitude to all of those who presented and attended, and I want to
thank the organizers, Kit Baum, Nicholas Cox, and Marcello Pagano. The
organizers found just the right balance between casualness and formality, fun
and work, and succeeded in putting together a mix of presentations that
surprised, delighted, informed, and taught.
I came away having learned something about Stata, and that's saying
something.
Session 1, 12 March 2001
Estimation and fitting
Fitting Generalized Estimating Equation (GEE) regression models in Stata
Nicholas Horton,
Boston University School of Public Health

Abstract
Researchers are often interested in analyzing data which arise from a
longitudinal or clustered design. While there are a variety of standard
likelihood-based approaches to analysis when the outcome variables are
approximately multivariate normal, models for discrete-type outcomes
generally require a different approach. Liang and Zeger formalized an
approach to this problem using Generalized Estimating Equations (GEEs) to
extend Generalized Linear Models (GLMs) to a regression setting with
correlated observations within subjects. In this talk, I will briefly
review the GEE methodology, introduce some examples, and provide a
tutorial on how to fit models using xtgee in Stata.
Additional information
Handouts/slides
The Quadratic Assignment Procedure (QAP)
William Simpson,
Harvard Business School

Abstract
Some datasets contain observations corresponding to pairs of entities
(people, companies, countries, etc.). Conceptually, each observation
corresponds to a cell in a square matrix, where the rows and columns are
labeled by the entities. For example, consider a square matrix where the
rows and columns are the 50 U.S. states. Each observation would contain
numbers such as the distance between the pair of states, exports from one
state to the other, etc. The observations are not independent, so
estimation procedures designed for independent observations will
calculate incorrect standard errors. The quadratic assignment procedure
(QAP), which is commonly used in social network analysis, is a
resampling-based method, similar to the bootstrap, for calculating the
correct standard errors. This talk explains the QAP algorithm and
describes the command, with syntax similar to the bstrap command,
which implements the quadratic assignment procedure and allows running
any estimation command using QAP samples.
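The relabel-and-recompute loop described above can be sketched outside Stata. Below is a minimal Python illustration (not the presenter's code) of a QAP permutation test for the correlation between two dyadic matrices; the toy data and the choice of correlation as the test statistic are assumptions for the example.

```python
import random

def offdiag(m):
    """Flatten the off-diagonal cells of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(n) if i != j]

def corr(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def qap_pvalue(x, y, reps=1000, seed=1):
    """Two-sided QAP test of the correlation between dyadic matrices x and y.

    Rows and columns of y are permuted *together* (relabeling the nodes),
    which preserves the dependence structure within each permuted matrix.
    """
    rng = random.Random(seed)
    n = len(x)
    observed = corr(offdiag(x), offdiag(y))
    hits = 0
    for _ in range(reps):
        p = list(range(n))
        rng.shuffle(p)
        yp = [[y[p[i]][p[j]] for j in range(n)] for i in range(n)]
        if abs(corr(offdiag(x), offdiag(yp))) >= abs(observed):
            hits += 1
    return observed, hits / reps
```

A QAP wrapper for regression works the same way: permute the node labels, re-run the estimation command on the permuted data, and compare the observed coefficients against the permutation distribution.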
Additional information
Handouts/slides
The normal mixture decomposition
Stanislav Kolenikov,
University of North Carolina at Chapel Hill

Abstract
This talk will present the program for univariate normal mixture maximum
likelihood estimation developed by the author. It will demonstrate the
use of the ml lf estimation method, as well as a number of
programming tricks, including global macro manipulation and dynamic
definition of the program to be used by ml. The merits and
limitations of Stata's ml optimizer will be discussed. The
application to income distribution analysis with a real dataset will
also be shown.
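The talk fits the mixture with Stata's ml, which maximizes the likelihood directly; as a language-neutral sketch of what such a decomposition estimates, here is a bare-bones EM iteration for a two-component univariate normal mixture in Python (the data, starting values, and the EM route are illustrative assumptions, not taken from the talk).

```python
import math

def normpdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and s.d. sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_mixture(data, mu1, mu2, sigma=1.0, p=0.5, iters=50):
    """EM for a two-component univariate normal mixture.

    Returns (p, mu1, sigma1, mu2, sigma2) after `iters` E/M sweeps,
    starting from the supplied means and a common starting sigma.
    """
    s1 = s2 = sigma
    for _ in range(iters):
        # E step: posterior probability that each point came from component 1
        w = []
        for x in data:
            a = p * normpdf(x, mu1, s1)
            b = (1 - p) * normpdf(x, mu2, s2)
            w.append(a / (a + b))
        # M step: weighted means, standard deviations, and mixing proportion
        n1 = sum(w)
        n2 = len(data) - n1
        mu1 = sum(wi * x for wi, x in zip(w, data)) / n1
        mu2 = sum((1 - wi) * x for wi, x in zip(w, data)) / n2
        s1 = math.sqrt(sum(wi * (x - mu1) ** 2 for wi, x in zip(w, data)) / n1)
        s2 = math.sqrt(sum((1 - wi) * (x - mu2) ** 2 for wi, x in zip(w, data)) / n2)
        p = n1 / len(data)
    return p, mu1, s1, mu2, s2
```

With well-separated starting values, the posterior weights pull each component's mean, standard deviation, and mixing proportion toward its own cluster.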
Additional information
Handouts/slides
ex1.do
ex2.do
ex3.do
ex4.do
ex5.do
Session 2, 12 March 2001
Model testing
Postestimation commands for regression models for categorical and count outcomes
Jeremy Freese,
University of Wisconsin
J. Scott Long,
Indiana University

Abstract
Although Stata has made estimating regression models for categorical and
count outcomes virtually as fast and easy as estimating the familiar
regression model for continuous outcomes, interpreting the results from
the former is complicated by the nonlinear relationship between the
independent variables and the dependent quantities of interest (i.e.,
predicted probabilities and predicted counts). As a consequence, the
change in the predicted value associated with a unit change in the
independent variable depends on the specific values of all of the
independent variables. We have developed a series of tools that are
intended to facilitate the effective use and interpretation of these
models. Our command listcoef presents lists of different types
of transformed coefficients from these models, and also provides a guide
to their interpretation. A suite of commands, known collectively as
pr*, computes predicted values and the discrete change for
specified values of the independent variables. Our command
fitstat computes a large number of goodness-of-fit statistics.
Specifically for the multinomial logit model, the command
mlogtest performs a number of commonly desired tests, and
mlogview creates discrete change and/or odds ratio plots.
Additional information
Handouts/slides
Testing for omitted variables
Jeroen Weesie,
Utrecht University

Abstract
Testing for omitted variables should play an important part in
specification analyses of statistical "linear form" models. Such
omissions may comprise terms in variables that were included themselves
(e.g., a quadratic term or a categorical specification instead of a
metric one), interactions between variables in the model, and variables
that were left out in the beginning. Reestimating models with
additional variables and performing, for example, likelihood-ratio tests
is time-consuming. Score tests provide an attractive alternative,
since the tests can be computed using only results from the model
already estimated. We present a Stata command for performing score
testing after most Stata estimation commands (e.g., logit,
heckman, streg, etc.). This command supports
multipleequation models, clustered observations, and adjusted
p-values for simultaneous testing.
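For plain linear regression, the score (LM) flavor of an omitted-variables test can be written in a few lines. The Python/NumPy sketch below is a generic illustration of the principle that only the restricted fit is needed; it is not the command presented in the talk, which handles general maximum-likelihood models.

```python
import numpy as np

def lm_omitted_test(y, X, Z):
    """Score (LM) test for omitted regressors Z in the OLS model y = X b + e.

    Fit only the restricted model, then regress its residuals on [X, Z];
    under H0 the statistic n * R^2 is asymptotically chi-squared with
    Z.shape[1] degrees of freedom. No refit of the augmented model's
    parameters of interest is required, which is the computational
    appeal of score tests.
    """
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    W = np.hstack([X, Z])
    g, *_ = np.linalg.lstsq(W, e, rcond=None)
    fitted = W @ g
    r2 = (fitted @ fitted) / (e @ e)  # aux-regression R^2 (X includes a constant)
    return len(y) * r2
```

A large statistic relative to the chi-squared critical value signals that the candidate variables should not have been left out.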
Additional information
Handouts/slides
Session 3, 12 March 2001
Survey and multilevel data analysis
Computing variances from data with complex sampling designs:
A comparison of Stata and SPSS
Alicia C. Dowd,
Univ. Mass. Boston, Graduate College of Education
Michael B. Duggan,
Suffolk University

Abstract
Most of the datasets available through the National Center for Education
Statistics (NCES) are based on complex sampling designs involving
multistage sampling, stratification, and clustering. These complex
designs require appropriate statistical techniques to calculate the
variance. Stata employs specialized methods that appropriately adjust
for the complex designs, while SPSS does not. Researchers using SPSS
must obtain the design effects through NCES and adjust the standard
errors generated by SPSS with these values. This presentation addresses
the pros and cons of recommending Stata or SPSS to novice researchers.
The first presenter teaches research methods to doctoral students and
uses Stata to conduct research with NCES data. She uses SPSS to teach
her research methods course due to its user-friendly interface. The
second presenter is a doctoral student conducting dissertation research
with NCES data. In his professional life as an institutional
researcher, he uses SPSS. NCES datasets are a rich resource, but the
complex sampling designs create conceptual issues beyond the immediate
grasp of most doctoral candidates in the field. The session considers
and invites comments on the best approaches to introducing new
researchers to complex sampling designs in order to enable them to use
NCES data.
Additional information
Handouts/slides
svytabs: A program for producing complex survey tables
Michael Blasnik,
Blasnik & Associates

Abstract
Stata's svytab command is quite limited: tables that users
need to produce for reports often involve extracting a single point
estimate (and standard error, confidence interval, or p-value)
from each of dozens or hundreds of svytab commands. svytabs
was designed to produce these tables directly. It sets up and performs
many svytab commands and grabs the appropriate output to create
formatted tables ready to export to a word processor or spreadsheet. The
added features include: 1) allows a full varlist for the rowvars if they
are dichotomous (sequencing through and grabbing the estimate of
interest from each); 2) allows either dichotomous or multivalue rowvars
(if multivalued, then the varlist is restricted to one); 3) allows multiple
subpops and cycles through them; 4) doesn't require, but allows,
a columnvar (allowing subpops to substitute); 5) formats the
output into a log file for exporting as a CSV (with table-titling options);
6) uses characteristics to provide "nice" naming of rows and columns; 7)
provides options for outputting standard errors, confidence intervals,
asterisked significance levels, deff, etc. I think anyone producing
complex survey tables would find svytabs quite useful.
Additional information
svytabs.ado
svytabs.hlp
Simple cases of multilevel models
Rich Goldstein

Abstract
While much has been made of multilevel models and specialized software
for such models, in many cases standard methods can be used in estimating
these models. Use of such standard methods is faster and easier, in many
cases, than use of specialized software; further, use of standard
methods helps clarify what these models actually are estimating. I limit
my discussion here to linear regression models and include a new ado-file
that puts together the steps to match multilevel models, in certain
cases. If time allows, a comparison with the much slower gllamm6,
for these limited situations, will be briefly presented.
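As a concrete example of a "standard method" in this spirit (purely illustrative, not Goldstein's ado-file), the two-step, slopes-as-outcomes approach fits an ordinary OLS line within each group and then summarizes the group-specific coefficients:

```python
def ols_slope_intercept(xs, ys):
    """Simple-regression OLS: slope = cov(x, y) / var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def two_step(groups):
    """groups: dict mapping group id -> (xs, ys).

    Step 1: a separate OLS fit per group.
    Step 2: the mean slope and intercept across groups, a simple
    stand-in for the fixed part of a random-coefficient model.
    """
    fits = {g: ols_slope_intercept(xs, ys) for g, (xs, ys) in groups.items()}
    k = len(fits)
    mean_slope = sum(s for s, _ in fits.values()) / k
    mean_icept = sum(a for _, a in fits.values()) / k
    return fits, mean_slope, mean_icept
```

Comparing the per-group slopes with the shrunken predictions from a multilevel fit is one way to see where random-coefficient estimates come from.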
Additional information
Handouts/slides
Session 4, 12 March 2001
Longitudinal data analysis
Date and time tags for filenames in WinXX
Harriet E. Griesinger,
Wellesley Child Care Research Partnership

Abstract
I receive several (ir)regular deliveries of data files for the ongoing
development of a panel dataset. Both the delivering agency systems and
the targets of our research group change over time — by the hour
and/or by the year. I need to be able to identify from the filenames
which Stata .dta files were created with which .do files, leaving which
.log files. I use the Stata shell facility and DOS rename to attach tags
generated by an ado-file and held in the global macros datetag and
hourminutetag.
Additional information
Handouts/slides
Efficient management of multi-frequency panel data with Stata
Christopher F. Baum,
Boston College

Abstract
This presentation discusses how the tasks involved with carrying out a
sizable research project, involving panel data at both monthly and daily
frequencies, could be efficiently managed by making use of built-in and
user-contributed features of Stata. The project entails the
construction of a dataset of cross-country monthly measures for 18
nations and the evaluation of bilateral economic activity between each
distinct pair of countries. One measure of volatility, at a monthly
frequency, is calculated from daily spot exchange rate data and
effectively merged back to the monthly dataset. Nonlinear least squares
models are estimated for every distinct bilateral relationship, and the
results of those 300+ models are organized for further analysis and
production of summary tables and graphics using a postfile. The various
labor-saving techniques presented allow data of differing frequencies to be
integrated with the panel dataset with ease.
Additional information
Handouts/slides
Challenges of creating and working with cross-year family-individual files:
An example from the PSID dataset
Petia Petrova,
Boston College

Abstract
Often researchers need to build longitudinal datasets in order to study
individuals and families or firms and plants across time. No matter if
individuals or firms are points of interest, the resulting matrix is no
longer rectangular due to the changes in family or firm composition.
Many times, the data come in a different format, and simply merging,
for example, family and person IDs leads to wrong records. Here, we are
using the Panel Study of Income Dynamics to illustrate some of the
pitfalls in creating a cross-year family-individual file. In order to
create a cross-year family-individual file, one has to merge the family
files with the individual files. As of 1990 the file format of PSID
consists of singleyear files with familylevel data collected in each
wave (i.e., 26 family files for data collected from 1968 through 1993)
and one crossyear individual file with the individuallevel data
collected from 1968 to the most recent interviewing wave. Attaching
family records to the individual ones, without taking into
consideration split-offs and movers in and out of the family, however,
leads to some cases in which members of the same family appear to have
different information for family income. The core of the problem is
that some of the information reported in the interview year refers to
the previous year. If a person is a split-off, he reports, for example,
the income of the family he is currently in. This income then is
incorrectly attached to his record of the previous year when he was in
a different family. We suggest a way to fix problems like this one.
The idea is to extract separately all variables referring to the year
previous to the year of the interview, and then to use the split-off
indicator to attach them to the individuals' records.
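The fix described above can be illustrated with a toy merge (hypothetical field names and values; the real PSID files are far richer):

```python
def attach_prev_year(individuals, family_income_by_year):
    """Attach previous-year family income to individual records.

    individuals: list of dicts with keys "person", "year", "family_now",
    and "family_prev" (the family the person belonged to in the *previous*
    year; for a split-off this differs from "family_now").
    family_income_by_year: dict mapping (family, year) -> income earned
    in that year.

    The naive merge keys the lookup on the person's current family; the
    fix keys it on the family the person was actually in during the year
    the income refers to.
    """
    out = []
    for rec in individuals:
        prev = rec["year"] - 1
        naive = family_income_by_year.get((rec["family_now"], prev))
        fixed = family_income_by_year.get((rec["family_prev"], prev))
        out.append({**rec, "income_naive": naive, "income_fixed": fixed})
    return out
```

For a person who split off between waves, the two lookups disagree, which is exactly the inconsistency the abstract describes.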
Additional information
Handouts/slides
Analysis of longitudinal data in Stata, S-PLUS, and SAS
Rino Bellocco,
Karolinska Institutet

Abstract
Longitudinal data are commonly collected in experimental and
observational studies, where both disease and risk factors are measured
at different times. The goal of this project is to compare analyses
performed using Stata, S-PLUS, and SAS under two different families of
distributions: normal and logistic. I will show the results obtained
from the analyses of two sample datasets; these will be analyzed using
both Generalized Estimating Equations (GEE) and a random-effects
model. In Stata, I will use both the xt programs and the routine
provided by Rabe-Hesketh (gllamm6): confidence intervals,
hypothesis testing, and model fitting will be discussed. Missing data
issues will be raised and discussed as well.
Additional information
Handouts/slides
Session 5, 13 March 2001
Assorted topics
Stata teaching tools
Phil Ender,
UCLA Department of Education

Abstract
This presentation will cover a collection of statistics teaching tools
written in Stata. These programs involve demonstrations or simulations
of various statistical topics that are used both in the classroom and
individually by the students. Topics include probability (coin, dice,
box models), common probability distributions (normal, t,
chi-square, F), sampling distributions, central limit theorem,
confidence intervals, correlation, regression, and other topics. These
programs are currently being used in introductory and intermediate
research methods courses being taught in the UCLA Department of
Education. The presentation will conclude with a short review of my
experiences using Stata in the classroom over the past two years.
Threevalued logic operations in Stata
David Kantor,
Institute for Policy Studies, Johns Hopkins University

Abstract
Stata uses numeric quantities as logical values and provides logical
operators (&, |, ~) to build expressions from basic entities. These
operators can be regarded as faulty when missing values are present in
the operands. In this context, missing is equivalent to true, which is
often not the desired result. Instead, one may want to obtain the
maximal set of nonmissing results over all combinations of operand values,
while preserving the behavior of the operators on twovalued operands
— in other words, one should adopt threevalued logic. I have
developed a set of egen functions that provide this capability. As
such, they can only do one type of operation at a time, so that complex
expressions would need to be built in stages. They can be a great help
when you wish to generate indicator variables and want the maximal set
of nonmissing results.
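The behavior the abstract asks for is Kleene's three-valued logic. Here is a sketch of the truth tables in Python, with None standing in for Stata's missing value (the actual implementation is a set of egen functions, not shown here):

```python
def and3(a, b):
    """Three-valued AND: False dominates, then missing (None), then True."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    """Three-valued OR: True dominates, then missing (None), then False."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def not3(a):
    """Three-valued NOT: missing stays missing."""
    return None if a is None else not a
```

On two-valued operands these agree with the ordinary operators; the difference appears only when a missing value is present, which the built-in operators described in the abstract simply treat as true.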
Additional information
Handouts/slides
Analyzing circular data in Stata
Nicholas J. Cox,
University of Durham

Abstract
Circular data are a large class of directional data, which are of interest
to scientists in many fields, including biologists (movements of migrating
animals), meteorologists (winds), geologists (directions of joints and
faults), and geomorphologists (landforms, oriented stones). Such
examples are all recordable as compass bearings relative to North. Other
examples include phenomena that are periodic in time, including daily and
seasonal rhythms. The analysis of circular data is an odd corner of
statistical science which many never visit, even though it has a long and
curious history. Perhaps for that reason, it seems that no major
statistical language provides direct support for circular statistics,
although there is a commercially available specialpurpose program
called Oriana. This paper describes the development and use of some
routines which have been written in Stata, primarily to allow graphical
and exploratory analyses. They include commands for data management,
summary statistics and significance tests, univariate graphics, and
bivariate relationships. The graphics routines were developed partly
with gph. (By the time of the meeting, it may be possible to
enhance these using new facilities in Stata 7.) Collectively, they offer
about as many facilities as does Oriana.
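As a taste of why circular data need their own toolkit: the ordinary average of bearings 350 and 10 degrees is 180, yet the mean direction is plainly 0. The usual remedy averages sines and cosines. The sketch below is a generic Python illustration of these standard formulas, not one of the routines from the talk.

```python
import math

def circular_mean(degrees):
    """Mean direction of compass bearings, via the mean resultant vector."""
    s = sum(math.sin(math.radians(d)) for d in degrees)
    c = sum(math.cos(math.radians(d)) for d in degrees)
    return math.degrees(math.atan2(s, c)) % 360

def resultant_length(degrees):
    """Mean resultant length R in [0, 1]: 1 means all bearings coincide;
    near 0 means directions spread evenly around the circle."""
    n = len(degrees)
    s = sum(math.sin(math.radians(d)) for d in degrees) / n
    c = sum(math.cos(math.radians(d)) for d in degrees) / n
    return math.hypot(s, c)
```

R plays roughly the role the standard deviation plays for linear data: tightly clustered directions give R near 1, scattered directions give R near 0.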
Additional information
Handouts/slides
Session 6, 13 March 2001
Econometric analysis of panel data in Stata
Econometric analysis of panel data in Stata
David Drukker,
StataCorp

Abstract
Many researchers need to estimate panel data models in which either the
idiosyncratic term is autocorrelated or the model includes a lagged
dependent variable. This talk will review some of the estimation and
inference methods that have appeared in the econometric literature to
deal with these problems. These issues will be discussed in the context
of an extended example based on the same data used by Arellano and Bond in
their 1991 Review of Economic Studies paper. In the course of the
example, some twostage least squares estimators for simultaneous
equations with panel data will also be discussed.
Additional information
Handouts/slides
xt_pres_out.do
abdata.dta
Session 7, 13 March 2001
Stata
The evolving nature of the Stata Technical Bulletin
H. Joseph Newton,
Texas A&M University
Report to users
William W. Gould,
StataCorp
Session 8, 13 March 2001
Wishes and grumbles
Christopher F. Baum (moderator),
Boston College and RePEc