Last updated: 4 April 2003

2003 North American Stata Users Group meeting

18–19 March 2003


Longwood Galleria Conference Center
342 Longwood Avenue
Boston, Massachusetts


Session 1, 0830–0945

Generalized latent class modeling using gllamm

Sophia Rabe-Hesketh, Institute of Psychiatry, King's College
Andrew Pickles, University of Manchester
Anders Skrondal, Norwegian Institute of Public Health


gllamm can estimate both conventional and unconventional latent class models. Models are specified using discrete latent variables whose values determine the conditional response distributions for the classes. A new feature of gllamm is that latent class probabilities can depend on covariates. We will first discuss the conventional exploratory latent class model. When a number of fallible diagnoses of some disease are available, this model can be used to estimate the prevalence of the disease as well as the sensitivities and specificities of the tests in the absence of a gold standard. After estimating the model in gllamm, gllapred can be used to diagnose individual subjects based on their posterior class probabilities. An advantage of using gllamm is that a wide range of response types can be accommodated. To illustrate this, we consider the analysis of rankings of political goals in the study of value orientations. We will also discuss confirmatory models such as latent class factor models and apply them to attitudes to abortion data, taking the survey design into account by using probability weighting and robust standard errors. Finally, we consider latent trajectory models for investigating distinct patterns of change in longitudinal data.
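
The diagnostic use of posterior class probabilities described above can be illustrated with a minimal Python sketch. It is not gllamm or gllapred code. It assumes a two-class model (diseased, not diseased) in which the test results are conditionally independent given the class; the function name and all numeric values are hypothetical.

```python
def posterior_disease_probability(results, prevalence, sensitivities, specificities):
    """Posterior probability of the 'diseased' class given binary test results.

    Assumes the tests are conditionally independent given the latent class.
    results: sequence of 0/1 outcomes, one per test (1 = positive).
    """
    like_diseased = prevalence
    like_healthy = 1.0 - prevalence
    for positive, sens, spec in zip(results, sensitivities, specificities):
        # Probability of this result within each latent class.
        like_diseased *= sens if positive else 1.0 - sens
        like_healthy *= (1.0 - spec) if positive else spec
    return like_diseased / (like_diseased + like_healthy)


if __name__ == "__main__":
    # Hypothetical parameter values, as if taken from a fitted model.
    prevalence = 0.10
    sensitivities = [0.90, 0.80]
    specificities = [0.95, 0.90]
    for pattern in ([1, 1], [1, 0], [0, 1], [0, 0]):
        prob = posterior_disease_probability(
            pattern, prevalence, sensitivities, specificities
        )
        print(pattern, round(prob, 4))
```

A subject would then be assigned to the class with the larger posterior probability.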

Case–control study power and sample size calculations using Stata

Katie Saunders, Cancer Research UK, Genetic Epidemiology Division, University of Leeds
Tim Bishop, Cancer Research UK, Genetic Epidemiology Division, University of Leeds
Jenny Barrett, Cancer Research UK, Genetic Epidemiology Division, University of Leeds


We use Stata's npnchi2 and nchi2 functions to calculate power and required sample size for case–control studies. Following the method described by Self et al. (1992), a large exemplary dataset with expected risk factor frequencies among cases and controls under any alternative hypothesis is created. The likelihood-ratio test statistic for the hypothesis of interest is distributed as a non-central chi-squared statistic under the alternative hypothesis, and the likelihood-ratio test statistic from the analysis of the exemplary dataset is an approximation to the non-centrality parameter for this distribution. We apply these methods to power and sample-size calculations for case–control studies of gene-gene and gene-environment interactions. Because of the low power of case–control studies to detect interactions, a wide range of different strategies have been proposed. Required sample size depends on several design parameters, and the simplicity of these methods means that the efficiency of many designs can be compared over different ranges, a valuable tool at the planning stage of a study. Results are presented for population-based, family, and matching schemes that have been proposed to improve power, and comparisons of the power of different designs are made. Stata programs are available for these comparisons.
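
The exemplary-dataset idea can be sketched in Python for the simplest possible case. This is not the authors' Stata program. It assumes a single binary risk factor in an unmatched design, so the likelihood-ratio test has 1 degree of freedom and reduces to the G statistic of a 2x2 table. With 1 degree of freedom the non-central chi-squared tail can be written with the standard normal distribution, which stands in here for Stata's nchi2 and npnchi2. All function names and input values are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def exemplary_table(n_cases, n_controls, p_exposed_controls, odds_ratio):
    """Expected exposed/unexposed counts for cases and controls under the alternative."""
    odds_cases = odds_ratio * p_exposed_controls / (1 - p_exposed_controls)
    p_exposed_cases = odds_cases / (1 + odds_cases)
    return [
        [n_cases * p_exposed_cases, n_cases * (1 - p_exposed_cases)],
        [n_controls * p_exposed_controls, n_controls * (1 - p_exposed_controls)],
    ]


def lr_statistic(table):
    """Likelihood-ratio (G) statistic for independence in a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:
                g += observed * log(observed / expected)
    return 2 * g


def power_1df(ncp, alpha=0.05):
    """Power of a 1-df chi-squared test with non-centrality parameter ncp."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = sqrt(ncp)
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


def ncp_for_power(target_power=0.80, alpha=0.05):
    """Non-centrality parameter giving the target power (1 df), by bisection."""
    low, high = 0.0, 1000.0
    for _ in range(200):
        mid = (low + high) / 2
        if power_1df(mid, alpha) < target_power:
            low = mid
        else:
            high = mid
    return high


if __name__ == "__main__":
    # Hypothetical design: 1,000 cases, 1,000 controls, 30% exposed controls, OR 1.3.
    table = exemplary_table(1000, 1000, 0.30, 1.3)
    ncp = lr_statistic(table)  # approximates the non-centrality parameter
    print("power:", power_1df(ncp))
    # The non-centrality parameter scales linearly with sample size.
    print("total n for 80% power:", 2000 * ncp_for_power(0.80) / ncp)
```

Because the statistic from the exemplary data scales linearly with the number of subjects, one exemplary analysis per design is enough to obtain power at any sample size.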

The effects of self-perception on students' mathematics and science achievement in 36 countries

Ce Shen, Academic Technology Services, Boston College
Oleksandr Talavera, Academic Technology Services, Boston College

Earlier analyses of data from the Third International Mathematics and Science Study (TIMSS) identified an interesting but conflicting finding on the effects of three self-perception measures on students' achievement in the two subjects at two different levels: within-country data generally show a positive correlation between the three measures and students' actual achievement, while at the country level the direction is the opposite. The three measures of self-perception are how much students like the two subjects, how difficult they perceive the two subjects to be, and how well they think they are doing in the two subjects. Because TIMSS used a two-stage stratified sample design, this study uses Stata's svyreg command (for complex survey analysis) to replicate earlier analyses. We find that at the individual level, when the number of books at home, school resources, and indicators of school management are controlled for, the three self-perceptions have positive effects on students' achievement in most countries, while at the school level the picture becomes mixed. For most countries, the effect of perceived easiness of the two subjects becomes negative. We suggest this inconsistency reflects differences in culture and in academic standards from country to country.

Session 2, 1015–1200

Generalized linear models for prediction: some principles, some programs and some practice

Nicholas J. Cox, Durham University

Despite a history now over 30 years long, the adoption of generalized linear models (GLMs) remains patchy: they are well known in several fields, but used little if at all in many others. One major advantage of GLMs is that they return predictions on the scale of the response. The use of link functions avoids the need for prior transformation of the response, for back-transformation of predictions, and above all for bias corrections to back-transformations, whether systematic or ad hoc. Case studies from environmental applications (suspended sediment concentrations of rivers, heights of forest trees) are introduced in which predictions on the response scale are of paramount scientific and practical interest. Heavy use is made of a suite of Stata programs written by the author producing graphic and numeric diagnostics after regression-type models, which extend and complement commands in official Stata. Most of these programs have uses beyond GLMs and they will also be discussed directly.
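
The back-transformation bias mentioned above can be seen in the simplest intercept-only setting, sketched here in Python rather than Stata. Averaging the logged response and exponentiating gives the geometric mean, which never exceeds the arithmetic mean. An intercept-only model with a log link instead predicts the arithmetic mean directly. The data values and function names are hypothetical.

```python
from math import exp, log
from statistics import fmean


def naive_back_transformed_mean(values):
    """Mean of log(y), exponentiated: the prediction from regressing log(y) on a constant."""
    return exp(fmean([log(v) for v in values]))


def response_scale_mean(values):
    """Arithmetic mean of y: what an intercept-only log-link model predicts."""
    return fmean(values)


if __name__ == "__main__":
    # Hypothetical right-skewed concentrations.
    concentrations = [2.0, 3.0, 5.0, 8.0, 40.0]
    print("back-transformed:", naive_back_transformed_mean(concentrations))
    print("response scale:  ", response_scale_mean(concentrations))
```

The gap between the two numbers is what a bias correction to the back-transformation would have to make up.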

Using Stata to manage and create a research data bank

Frederick Wolfe, National Data Bank for Rheumatic Diseases
Kaleb Michaud, National Data Bank for Rheumatic Diseases

We manage a longitudinal research data bank containing 3,000 variables that adds 25,000 observations per year. Data are batch converted from SQL to Stata on a daily basis, resulting in the creation of 20 preliminary datasets. We then use Stata to quality control the data and to prepare a single research dataset that can be augmented as required by the data analyst by calls to specialized programs that access the additional datasets. Our philosophy is that most of the quality control, programming, and dataset preparation should be built into the dataset creation process rather than requiring the data user to do this. For example, data quality checks and complex data preparation of items such as costs and hospital and mortality codes are programmed into the dataset creation process, and relevant additional datasets are automatically created to reflect such new data. The basic dataset consists of research and control variables that are needed for most analyses. With simple programming statements such as getwork and getcosts, preprocessed work and cost data, for example, are merged with the basic set. Global macros identify file locations, database versions, and variable sets, making updating and sharing simple.

Asymptotic confidence intervals (CIs) for a difference between two independent proportions

Joseph Coveney, Cobridge Co., Ltd.

Binary response variables arise in a variety of studies. It is often of interest to summarize a treatment effect in terms of the difference in proportions of successes between groups. Stata produces the Wald-type asymptotic CI for differences in proportions in cs and glm, family(binomial) link(identity). The Wald-type CI is easy to compute, but it is sometimes desirable to have a large-sample CI with better coverage properties. Alternative asymptotic CI methods have appeared in the literature with claims of better performance. Miettinen and Nurminen (1985) describe an iterative method for such an improved CI that requires repeatedly solving a cubic equation. A noniterative approximation by Wallenstein (1997) requires only the solution of a quadratic equation. Newcombe (1998) favored a method involving Wilson intervals for each of the two proportions (see ciw). Agresti and Caffo (2000) describe a simple method inspired by an analogous Wilson-type interval (see propci) for single proportions. Each of these CIs' implementations in Stata will be illustrated in the context of a therapeutic equivalence clinical trial.
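
Three of the closed-form intervals named above can be sketched in a few lines of Python. This is an illustration of the published formulas as commonly stated, not the code behind cs, ciw, or propci. The Newcombe interval here is the hybrid score interval without continuity correction, and the Agresti and Caffo interval is the Wald interval after adding one success and one failure to each group. Function names and the example counts are chosen for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def z_value(level=0.95):
    """Two-sided standard normal critical value for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def wald_ci(x1, n1, x2, n2, level=0.95):
    """Wald-type interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    half = z_value(level) * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2 - half, p1 - p2 + half


def wilson_ci(x, n, level=0.95):
    """Wilson score interval for a single proportion."""
    z = z_value(level)
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def newcombe_ci(x1, n1, x2, n2, level=0.95):
    """Hybrid score interval for p1 - p2 built from two Wilson intervals."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_ci(x1, n1, level)
    l2, u2 = wilson_ci(x2, n2, level)
    diff = p1 - p2
    lower = diff - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


def agresti_caffo_ci(x1, n1, x2, n2, level=0.95):
    """Wald interval after adding one success and one failure to each group."""
    return wald_ci(x1 + 1, n1 + 2, x2 + 1, n2 + 2, level)


if __name__ == "__main__":
    # Illustrative counts: 56/70 successes versus 48/80 successes.
    for name, method in [("Wald", wald_ci), ("Newcombe", newcombe_ci),
                         ("Agresti-Caffo", agresti_caffo_ci)]:
        print(name, method(56, 70, 48, 80))
```

The Miettinen and Nurminen and Wallenstein intervals are omitted because they require solving a cubic or quadratic equation, which would lengthen the sketch considerably.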

Extending xi

Phil Ender, UCLA Department of Education
Michael Mitchell, UCLA Academic Technology Services

Stata's xi command performs dummy (indicator) coding on the fly and with the "*" operator allows for the interaction of two categorical variables or a categorical with a continuous variable. xi3 extends the capabilities of xi to include a number of additional coding systems and can create codings that allow for testing simple contrasts and simple effects. In addition to indicator coding, xi3 supports the following coding schemes:

simple coding - compares each level to a reference level
deviation coding - deviations from the grand mean
Helmert coding - compares levels of a variable with the mean of subsequent levels
reverse Helmert coding - compares levels of a variable with the mean of previous levels
forward differences - adjacent levels, each versus next
backward differences - adjacent levels, each versus previous
orthogonal polynomial coding

Additionally, xi3 supports user-defined coding schemes, which allow virtually any type of contrast to be used. Like xi, xi3 can be used in conjunction with any of the estimation commands. xi3 will do three-way interactions with categorical variables, a mixture of categorical and continuous variables, or with continuous variables alone. xi3 can be issued as a stand-alone command. In addition to the "*" operator for interactions, xi3 adds the "@" operator, which performs the coding separately for each level of the second variable to allow for simple contrasts and simple effects.
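
The comparisons behind the coding schemes listed above can be written out as weights applied to group means. The Python sketch below does this for a factor with k levels. It shows the comparisons each scheme tests, not the regressor columns that xi3 itself generates, and all function names are hypothetical.

```python
from fractions import Fraction


def simple(k, reference=0):
    """Each non-reference level versus the reference level."""
    rows = []
    for level in range(k):
        if level != reference:
            row = [Fraction(0)] * k
            row[level] = Fraction(1)
            row[reference] = Fraction(-1)
            rows.append(row)
    return rows


def deviation(k):
    """Each of the first k-1 levels versus the grand mean of all levels."""
    rows = []
    for level in range(k - 1):
        row = [Fraction(-1, k)] * k
        row[level] += 1
        rows.append(row)
    return rows


def helmert(k):
    """Each level versus the mean of the subsequent levels."""
    rows = []
    for level in range(k - 1):
        later = k - 1 - level
        rows.append([Fraction(0)] * level + [Fraction(1)] + [Fraction(-1, later)] * later)
    return rows


def reverse_helmert(k):
    """Each level versus the mean of the previous levels."""
    rows = []
    for level in range(1, k):
        rows.append([Fraction(-1, level)] * level + [Fraction(1)]
                    + [Fraction(0)] * (k - 1 - level))
    return rows


def forward_difference(k):
    """Each level versus the next level."""
    rows = []
    for level in range(k - 1):
        row = [Fraction(0)] * k
        row[level], row[level + 1] = Fraction(1), Fraction(-1)
        rows.append(row)
    return rows


def backward_difference(k):
    """Each level versus the previous level."""
    return [[-w for w in row] for row in forward_difference(k)]


def apply_contrasts(contrasts, means):
    """Value of each comparison for a list of group means."""
    return [sum(w * m for w, m in zip(row, means)) for row in contrasts]


if __name__ == "__main__":
    means = [10, 12, 15, 20]  # hypothetical group means
    for scheme in (simple, deviation, helmert, reverse_helmert,
                   forward_difference, backward_difference):
        print(scheme.__name__, [str(v) for v in apply_contrasts(scheme(4), means)])
```

Exact fractions are used so that weights such as -1/3 print without rounding noise.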

Session 3, 1330–1500

Teaching Stata for data management

Phil Bardsley, Carolina Population Center, University of North Carolina
Dan Blanchette, Carolina Population Center, University of North Carolina

The Carolina Population Center is a SAS shop, and its 25 programmers have long favored SAS for data management. Its research faculty, however, encourage use of Stata to adjust for survey sampling effects. In an effort to introduce both SAS programmers and new trainees to Stata, we wrote a web-based Stata tutorial. It focuses on the subset of Stata commands necessary to manage survey research files. This includes commands to clean data that are out of range, find duplicate identifiers that should not exist, recode variables and create new ones, and document the data. Because the surveys often involve hierarchical file structures, the tutorial covers merging and reshaping. It also introduces the very powerful for command and its variants as labor-saving devices. We have used this tutorial to teach a short course on Stata, many trainees have used it to teach themselves Stata, and it has been used in training programs overseas. The talk will show the format of the tutorial on the web and quickly review the range of commands that the tutorial covers. We will also talk about adding a "Rosetta Stone" to help SAS programmers convert their code to Stata.

Teaching Stata through guided practice

Estelle Young, Bowie State University
Stacy Gibbs, Bowie State University
Michael Wynn, Bowie State University

Bowie State University requires a software course in its 3-course research sequence. The course covers descriptive and inferential statistics and some data management using Stata and SPSS syntax. The professor provides a diskette with the datasets and all class syntax files and a course notebook containing: the syllabus; syntax/output files for each week's material; background notes on hypothesis testing; and in-class practice exercises, homework and a final presentation. Copies are posted on the course website. To log in, use the userid eyoung and the password marvin, and click on Data Analysis Main. Go to course documents to find the folders and files associated with the course. The structure of each lesson is as follows: I present a brief summary of the statistical material and corresponding syntax/output files. The students follow on their computers, using the syntax files on their diskettes and the hard copies in their notebooks. Students then practice the material and a mini-homework assignment via the in-class exercise. The following week, each student presents a part of the homework assignment to the rest of the class. For the Stata Users Group meeting, the student and I would present one mock lesson as well as distribute sample course notebooks.

Instrumental variables and GMM: Estimation and testing

Christopher F. Baum, Boston College
Mark E. Schaffer, Heriot-Watt University
Steven Stillman, New Zealand Department of Labour

We discuss instrumental variables (IV) estimation in the broader context of the generalized method of moments (GMM), and describe an extended IV estimation routine that provides GMM estimates as well as additional diagnostic tests. Stand-alone test procedures for heteroskedasticity, overidentification, and endogeneity in the IV context are also described.
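
A minimal Python sketch can show why an instrument matters in the simplest case. It is not the extended routine described in the talk. It assumes one endogenous regressor and one instrument, the just-identified case in which the IV and GMM estimators coincide, and it uses simulated data with hypothetical parameter values.

```python
import random
from statistics import fmean


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def ols_slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Just-identified IV slope of y on x using the single instrument z."""
    return covariance(z, y) / covariance(z, x)


def simulate(n=20000, beta=2.0, seed=1):
    """Simulated data in which x is correlated with the error term u."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # instrument, unrelated to u
    u = [rng.gauss(0, 1) for _ in range(n)]   # structural error
    x = [zi + ui for zi, ui in zip(z, u)]     # endogenous regressor
    y = [beta * xi + ui for xi, ui in zip(x, u)]
    return z, x, y


if __name__ == "__main__":
    z, x, y = simulate()
    print("OLS slope:", ols_slope(x, y))    # inconsistent: x is correlated with u
    print("IV slope: ", iv_slope(z, x, y))  # consistent for the true slope of 2
```

With more instruments than endogenous regressors the two estimators differ, and overidentification tests of the kind the talk describes become available.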

Building a collection of programs for Stata

Henrik Schmiediche, Texas A&M University

This talk gives an introduction to and overview of the technical aspects of building a collection of programs for Stata. In particular, we will focus on the collection of programs developed for nonlinear measurement error models to be presented at the workshop following the users group meeting.

Multivariate data exploration with Stata: Evaluation and wish list

Stephen Soldz, Boston Graduate School of Psychoanalysis

Stata is a general-purpose statistical package with especially strong data manipulation and regression modeling capabilities. It appears to be especially strong in statistical techniques used by econometricians and biostatisticians. As psychologists, among others, adopt it, certain relative weaknesses in the existing set of implemented procedures become apparent. In particular, multidimensional exploratory data analyses are a set of data analytic procedures (including principal components and factor analysis, correspondence analysis, optimal scaling, and multidimensional scaling) commonly used to explore the structure of data sets and derive variables (e.g., principal components or factors) that summarize the data in a small number of variables. While Stata, as delivered or through user add-ons, has many of the basic capabilities in these areas, many are implemented in a fairly rudimentary fashion and others are implemented in the Stata executable, without sufficient hooks for users to be able to expand them. This talk will discuss some of these procedures and will evaluate Stata capabilities in these areas. It is hoped that it will help stimulate StataCorp or the user community to expand Stata capabilities in these areas.

Session 4, 1530–1730

Stata Journal Editors' report

H. Joseph Newton, Texas A&M University
Nicholas J. Cox, University of Durham  

Report to Stata users: Stata 8

William W. Gould, StataCorp

Wishes and grumbles

William W. Gould, StataCorp
Chinh Nguyen, StataCorp  

Scientific organizers

Elizabeth Allred, Harvard School of Public Health

Kit Baum, Boston College

Nicholas J. Cox, Durham University

Marcello Pagano, Harvard School of Public Health