Last updated: 20 July 2005

2005 North American Stata Users Group meeting

11–12 July 2005

Longwood Galleria Conference Center
342 Longwood Avenue
Boston, Massachusetts


Analysis of multiple source/multiple informant data in Stata

Nicholas Horton
Department of Mathematics, Smith College
Coauthor: Garrett Fitzmaurice, Harvard University

We describe regression-based methods for analyzing multiple-source data arising from complex sample survey designs in Stata. We use the term multiple-source data to encompass all cases where data are simultaneously obtained from multiple informants, or raters (e.g., self-reports, family members, health care providers, administrators), or via different/parallel instruments, indicators, or methods (e.g., symptom rating scales, standardized diagnostic interviews, or clinical diagnoses). We review regression models for analyzing multiple-source risk factors or multiple-source outcomes and show that they can be considered special cases of generalized linear models, albeit with correlated outcomes. We show how these methods can be extended to handle the common survey features of stratification, clustering, and sampling weights, as well as missing reports, and how they can be fitted within Stata. The methods are illustrated using data from the Stirling County Study, a longitudinal community study of psychopathology and mortality.

Additional information

Horton.ppt (PowerPoint presentation)
sim-tutorial.pdf (documentation/PDF)

gologit2: Generalized logistic regression models for ordinal dependent variables

Richard Williams
Sociology Department, University of Notre Dame

gologit2 is a user-written program that fits generalized logistic regression models for ordinal dependent variables. The actual values taken on by the dependent variable are irrelevant, except that larger values are assumed to correspond to "higher" outcomes. A major strength of gologit2 is that it can also fit two special cases of the generalized model: the proportional odds model and the partial proportional odds model. Hence, gologit2 can fit models that are less restrictive than the proportional odds/parallel lines models fitted by ologit (whose assumptions are often violated) but are more parsimonious and interpretable than those fitted by a nonordinal method, such as multinomial logistic regression. The autofit option greatly simplifies the process of identifying partial proportional odds models that fit the data. Two alternative but equivalent parameterizations of the model that have appeared in the literature are both supported. Other key advantages of gologit2 include support for linear constraints, Stata 8.2 survey data (svy) estimation, and the computation of estimated probabilities via the predict command. gologit2 is inspired by Vincent Fu's gologit program and is backward compatible with it but offers several additional powerful options.
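
As a rough illustration of the model family described above (a sketch, not gologit2's code), the Python snippet below turns cutpoint-specific intercepts and coefficients into category probabilities. It assumes the parameterization P(Y > j) = invlogit(alpha_j + x * beta_j), which is only one of the two equivalent forms the abstract mentions; the function names and all numeric values are hypothetical. Setting every beta_j equal gives the proportional odds (parallel lines) special case.

```python
import math


def inv_logit(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def category_probabilities(x, alphas, betas):
    """Return P(Y = 1), ..., P(Y = M) for one observation.

    x      : list of covariate values
    alphas : M-1 cutpoint-specific intercepts (hypothetical values)
    betas  : M-1 coefficient lists, one per cutpoint (hypothetical values)
    """
    if len(alphas) != len(betas):
        raise ValueError("need one coefficient list per cutpoint")
    # Exceedance probabilities P(Y > j) for each of the M-1 cutpoints.
    exceed = [
        inv_logit(a + sum(b_k * x_k for b_k, x_k in zip(b, x)))
        for a, b in zip(alphas, betas)
    ]
    # Differences of adjacent exceedance probabilities give P(Y = j).
    bounds = [1.0] + exceed + [0.0]
    return [bounds[j] - bounds[j + 1] for j in range(len(bounds) - 1)]


# Parallel-lines case: the same slope at both cutpoints.
parallel = category_probabilities([0.0], [1.0, -1.0], [[0.5], [0.5]])
# Generalized case: the slope differs by cutpoint.
generalized = category_probabilities([2.0], [1.0, -1.0], [[0.5], [-0.25]])
```

Because the slopes are free to differ across cutpoints, nothing in this sketch prevents a fitted generalized model from implying a negative category probability for some covariate values; that is a property of the model family, not of this code.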

Additional information

gologit2.pdf (PDF)
Williams_NASUG.pdf (PDF)
Williams_NASUG_handout.pdf (PDF)

L-statistics, especially L-moments, for fun and profit

Nicholas J. Cox
Geography Department, Durham University

L-statistics are weighted combinations of order statistics. They have a long history under that or other names and are simple to analyze, easy to compute, and helpful in many applied problems. I give examples of their use for numerical summary, graphical representation, and distribution fitting. I discuss some Stata implementations for methods for quantile estimation developed by Harrell and Davis, Kaigh and Lachenbruch, and others, and for the method of L-moments formalized by Hosking. In particular, L-moments are more resistant to outliers, and for higher moments less biased, than classical estimators, yet they are in many ways less ad hoc than various alternatives based on order statistics. They are also helpful for fitting various distributions to data, particularly when maximum likelihood is impractical. Datasets used for illustration are principally environmental.
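
For readers unfamiliar with L-moments, the following Python sketch (not one of the Stata implementations discussed in the talk) computes the first four sample L-moments from the order statistics, using the usual unbiased probability-weighted moments. The function name and the example data are hypothetical.

```python
def sample_l_moments(values):
    """Return (l1, l2, l3, l4), the first four sample L-moments.

    Uses the unbiased probability-weighted moments
        b_r = (1/n) * sum_i [(i-1)...(i-r) / ((n-1)...(n-r))] * x_(i),
    where x_(1) <= ... <= x_(n) are the order statistics.
    """
    x = sorted(values)
    n = len(x)
    if n < 4:
        raise ValueError("need at least 4 observations")
    b = []
    for r in range(4):
        total = 0.0
        for i, xi in enumerate(x, start=1):
            weight = 1.0
            for j in range(1, r + 1):
                weight *= (i - j) / (n - j)
            total += weight * xi
        b.append(total / n)
    l1 = b[0]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    return l1, l2, l3, l4


# Hypothetical data: a symmetric sample, so the third L-moment is zero.
l1, l2, l3, l4 = sample_l_moments([1, 2, 3, 4, 5])
l_skewness = l3 / l2  # the ratio l3/l2 is the L-skewness
```

Each L-moment is a linear combination of the ordered observations, which is why a single extreme value moves these summaries less than it moves the classical third and fourth moments, where deviations are cubed or raised to the fourth power.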

Integrating Stata with database management systems

Ed Bassin
ProfSoft, Inc.

Stata offers a wide array of data-management functions that let users manipulate their data with power, flexibility, and ease. Relational database management systems (RDBMS), such as Oracle and MySQL, also offer many useful capabilities for storing and retrieving data, particularly when working with large datasets or confidential information. In many cases, indexing saves substantial time when retrieving just a portion of a dataset. Database joins can provide a much more efficient means of storing large datasets while also allowing users to combine datasets with greater flexibility than Stata's merge command. Furthermore, databases generally have password protection, allowing users of sensitive data, e.g., medical information, to store data securely. Stata 8's ODBC functionality makes it possible to have the best of both worlds. In this presentation, we show how Stata can work with MySQL. We highlight the performance gains that can be derived by retrieving data with indexed queries rather than "use if" commands. We also compare two different methods for loading Stata files into MySQL, the native MySQL command and Stata's ODBC command. Our presentation concludes with suggestions about how Stata and MySQL could be more tightly integrated in the future with substantial benefits to Stata users.

Mass producing appendices using Stata and word processor mail merge

Michael Blasnik
M. Blasnik & Associates

Confronted with the task of producing a large appendix to a report that involved a page of tables and 3 graphs for each of 186 panels, the author discovered an approach to automate this process using Stata combined with the mail-merge facilities of a word processor. A Stata do-file produces all 558 graphs and writes an ASCII file of data that also includes the graph file names for each panel. A one-page mail-merge document is set up in the word processor, and the Stata output is used as the data source to automatically create the entire 186-page appendix with all tables and graphs placed as desired. This session will outline how to employ this approach for such otherwise daunting tasks.
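
The data-source side of this workflow can be sketched as follows. This is a hypothetical Python illustration, not the author's do-file: it writes one tab-delimited row per panel, with a column for each graph file name, which is the general shape a word processor's mail merge expects. The field names, file-name pattern, and graph format are all assumptions, and the table values that the real file would also carry are omitted.

```python
import csv
import io

N_PANELS = 186          # from the abstract
GRAPHS_PER_PANEL = 3    # from the abstract


def build_merge_rows(n_panels=N_PANELS, graphs_per_panel=GRAPHS_PER_PANEL):
    """Return one record per panel, listing that panel's graph file names."""
    rows = []
    for panel in range(1, n_panels + 1):
        row = {"panel_id": panel}
        for g in range(1, graphs_per_panel + 1):
            # Hypothetical naming scheme for the exported graphs.
            row[f"graph{g}"] = f"panel{panel:03d}_graph{g}.png"
        rows.append(row)
    return rows


def write_merge_file(rows, handle):
    """Write the records as tab-delimited text with a header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=list(rows[0].keys()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


rows = build_merge_rows()
buffer = io.StringIO()  # stands in for the ASCII file on disk
write_merge_file(rows, buffer)
```

The one-page merge template would then reference the column names (here `panel_id`, `graph1`, and so on) as merge fields, so each record produces one page of the appendix.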

Additional information

mblasniknasug.ppt (PowerPoint presentation)

Reproducible research using Stata

Phil Schumm
Department of Health Studies, University of Chicago
Coauthor: Ronald A. Thisted

In this presentation, we shall address two issues concerning the research process: (1) the need for an efficient way to transfer results obtained using a statistical package into a written report, and (2) the need to organize and package one's work so that those results are easily reproducible. Previous work addressing these issues has concentrated on tools written specifically for use with S-Plus/R and LaTeX and has involved mixing code and written material in a single document (e.g., Sweave). In contrast, our approach involves a method for extracting results (e.g., estimation results, graphs, and even pieces of code and raw output) generated by a suitably organized do-file (or files) into a series of intermediary files that can then be easily imported into a second written document. Although the document could be constructed in a variety of formats, we shall demonstrate how this may be done using reStructuredText, an intuitive and easy-to-use markup syntax that can then be automatically translated into a variety of final formats (e.g., LaTeX, HTML). This system may be used not only for research but also when teaching statistics with Stata as a method for students to organize and to submit their work.

Additional information

Schumm_NASUG-presentation.pdf (PDF)

Collaborative data management for longitudinal studies

Stephen Brehm
University of Chicago
Coauthor: L. Philip Schumm

Efficient data cleaning and management are critical to the success of any large research project. This is particularly true in the case of longitudinal studies or those in which the data management tasks are shared among many individuals. Faced with several such projects, we developed a flexible, easy-to-use system for cleaning and managing research datasets. The system is modular, making it easy for different individuals to work on different parts of the process. This modularity also permits substantial code reuse over multiple waves of a longitudinal study. A central focus of the system is the idea of data testing; users write tests for specific variables that may then be rerun when a new wave of data becomes available or when changes to the data have been made. Although the basic ideas could be implemented in any statistical package or programming language, Stata is particularly well-suited to the task. In addition, we have written an ado-file to automate the process of building a dataset and another to generate basic tests automatically from an existing dataset. Although the system was designed for use by large, collaborative projects, individuals can also benefit from using it for personal research projects.
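
The idea of data testing can be illustrated with a small Python sketch. It is not the authors' Stata system; it only shows the pattern of writing per-variable tests once and rerunning them when a new wave arrives. The variable names, bounds, and records are hypothetical.

```python
def check_range(variable, low, high):
    """Build a test that flags values of `variable` outside [low, high]."""
    def test(rows):
        return [
            f"row {i}: {variable}={row[variable]!r} outside [{low}, {high}]"
            for i, row in enumerate(rows)
            if row[variable] is not None and not (low <= row[variable] <= high)
        ]
    return test


def check_unique(variable):
    """Build a test that flags repeated values of `variable`."""
    def test(rows):
        seen = set()
        failures = []
        for i, row in enumerate(rows):
            value = row[variable]
            if value in seen:
                failures.append(f"row {i}: duplicate {variable}={value!r}")
            seen.add(value)
        return failures
    return test


def run_tests(rows, tests):
    """Run every test against the data and collect all failure messages."""
    failures = []
    for test in tests:
        failures.extend(test(rows))
    return failures


# The same tests are reused, unchanged, for each wave (hypothetical data).
tests = [check_range("age", 0, 110), check_unique("id")]
wave1 = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
wave2 = wave1 + [{"id": 3, "age": 150}, {"id": 3, "age": 40}]

wave1_failures = run_tests(wave1, tests)  # clean wave: no failures
wave2_failures = run_tests(wave2, tests)  # one bad age, one duplicate id
```

Keeping the tests separate from the data is what allows them to be rerun whenever the data change, which is the point the abstract emphasizes.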

Additional information

CollaborativeDataManagementforLongitudinalStudies.ppt (PowerPoint presentation)

Creating valid and effective measures: Using -optifact- to create better summated rating scales

Paul Millar
University of Calgary

While scales are an integral part of the measurements involved in many research projects, few researchers make the effort to ensure that these scales are valid, reliable, unidimensional, and consistent with relevant correlates. This presentation outlines how the Stata routine optifact can be used to create summated rating scales that are better measures of the concept at issue, yet often require fewer questions. An example using a major Canadian survey is presented to illustrate the potential for better analysis and potentially substantial cost savings in survey research.

Additional information

BostonScales.ppt (PowerPoint presentation)

Using and teaching Stata in a semester-length introduction to biostatistics course

Clinton Thompson
Public Health Program, Department of Family & Preventive Medicine, University of Utah
Coauthors: Stephen C. Alder, Justin Brown, Laurie Johnson

The University of Utah's Public Health Program, housed in the Department of Family and Preventive Medicine, uses Stata to supplement the core course "Introduction to Biostatistics, I". All MPH/MSPH students are required to enroll in the lab component of the biostatistics course, which (1) uses Stata to juxtapose theory with practice, (2) provides students with the tools necessary to complete the homework successfully, and (3) provides them with a manual that is less dense than the Stata documentation and less expensive than the Stata user books, yet specific enough to address the problems encountered in the course. The material in the Stata Lab Manual parallels Pagano and Gauvreau's Principles of Biostatistics and uses many of the same examples and datasets. Each of the 16 lab sections has the following format: a brief introduction to the topic du jour, an example using a dataset, and an interpretation of the output. Although the Lab Manual has been tailored to meet the specific needs of the course, many sections rely heavily on explanations from the Stata online and print documentation, and every effort is made to acquaint the student with the help feature and how it is most efficiently accessed.

Additional information

Thompson.ppt (PowerPoint presentation)

Using Stata graphics as a method of understanding and presenting interaction effects

Joanne Garrett
Robert Wood Johnson Clinical Scholars Program, University of North Carolina at Chapel Hill

It is fairly simple to add an interaction term (known to epidemiologists as "effect modification") to a linear or logistic regression model and test whether that term is statistically significant. However, it is much more difficult to explain what a significant interaction means in an intuitive way. A graphical representation of the interaction effect may help. Stata graphics can be used to give students a better understanding of what is actually happening when interaction is present. This can be helpful before introducing the mathematical approach and interpretation in a model. Graphing interactions can also be used as a simple method of exploratory data analysis or for reporting final results in a nonstatistical way in presentations or journal articles.

Additional information

Garrett.ppt (PowerPoint presentation)

cron, perl and Stata: automated production and presentation of a business-daily index

Kit Baum
Boston College and RePEc
Coauthor: Atreya Chakraborty, UMass Boston

In a Unix-based environment such as Mac OS X, it is very simple to set up 'cron jobs' that run at any periodic interval. This presentation illustrates how 'cron', the perl scripting language, and Stata can be used to generate a database of daily stock quotes from a regional index, compute an investor sentiment index based on a Spearman rank correlation, and "publish" the results to a web page. Stata's file command is particularly useful in generating the various formats needed in the web page presentation.
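
The Spearman rank correlation mentioned above is the Pearson correlation computed on ranks. The Python sketch below shows that calculation, with tied values given the average of their ranks; it is an illustration only, and the abstract does not say which two series the sentiment index correlates, so the function names and inputs here are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of observations tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return cov / math.sqrt(var_x * var_y)
```

Because only the ordering of each series matters, any monotone increasing relationship yields a correlation of 1, whatever its shape.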

Additional information

RAIM-NASUG2005aug.pdf (PDF)

Selecting the appropriate statistical distribution for the primary analysis: a case study

Peter A. Lachenbruch

An article in The Lancet discussed a clinical trial of a product for a rare disease. The authors had modified the primary analysis from an unadjusted Wilcoxon rank-sum test to a Poisson regression. This led to several questions: Was a Poisson distribution appropriate? How were the covariates selected? What is the effect of outliers (there were some)? If the Poisson model is not appropriate, can a permutation test approach provide information regarding the effect of the treatment? The presentation will show how Stata was used to evaluate these assumptions and provide an alternative analysis. The implications of this analysis are discussed.
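
A permutation test of the kind raised in the last question can be sketched in a few lines of Python. This is a generic Monte Carlo version using the absolute difference in group means as the test statistic; the statistic, the number of permutations, and the seed are assumptions made for illustration and are not taken from the talk or from the trial.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(treated, control, n_perm=10000, seed=12345):
    """Two-sided Monte Carlo permutation p-value for a difference in means.

    Group labels are reshuffled n_perm times; the p-value is the share of
    reshuffles whose absolute mean difference is at least as large as the
    observed one, with the observed arrangement counted once.
    """
    rng = random.Random(seed)
    observed = abs(_mean(treated) - _mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_treated]) - _mean(pooled[n_treated:]))
        # Small tolerance so floating-point ties count as "at least as large".
        if diff >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)
```

The approach makes no distributional assumption about the counts, which is what makes it a natural check when the adequacy of a Poisson model is in doubt.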

Additional information

Lachenbruch.ppt (PowerPoint presentation)

Using Stata 9 to model complex nonlinear relationships with restricted cubic splines

William D. Dupont
Department of Biostatistics, Vanderbilt University School of Medicine
Coauthor: Dale Plummer

Restricted cubic splines (RCSs) are used with generalized linear models and other regression methods to model complex nonlinear relationships. An RCS with k knots is linear before the first knot and after the last knot, is a cubic polynomial between adjacent knots, and is continuous and smooth. An RCS model with k knots can be fitted with only k-1 covariates. rc_spline calculates these covariates from an independent covariate and the knot values. Default numbers of knots or knot values suggested by Harrell (2001) may be used. We can then use the full power of Stata 9 to build models, construct graphs, and perform residual analyses using these covariates. RCSs are illustrated by modeling length of stay (LOS) and discharge mortality as a function of admission blood pressure (BP) from the SUPPORT study. LOS and log-odds of death are highly nonlinear functions of BP. Multiple linear and logistic regression with RCSs are used to model these data. Plots of expected outcome with 95% confidence bands are easily overlaid on scatterplots using standard Stata graphics. These regression curves are little affected by the knot placements. This robust methodology is easily taught to nonstatisticians and greatly expands the modeling capacity of standard regression methods.
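
To make the "k knots, k-1 covariates" statement concrete, the Python sketch below builds a restricted cubic spline basis from the truncated-power form commonly associated with Harrell (2001). It is not the rc_spline source, and whether rc_spline scales the nonlinear terms the same way is an assumption; the function name, the knot values in the test, and the optional scaling by the squared knot range are illustrative.

```python
def rcs_basis(x, knots, normalize=True):
    """Return the k-1 restricted cubic spline covariates for a value x.

    The first covariate is x itself. Each of the remaining k-2 covariates
    combines truncated cubics so that the cubic and quadratic parts cancel
    beyond the last knot, leaving the fitted curve linear in both tails.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3 or len(set(t)) != k:
        raise ValueError("need at least 3 distinct knots")

    def cube_plus(u):
        # Truncated cubic: u^3 when u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    # Optional scaling keeps the covariates on roughly the scale of x.
    scale = (t[-1] - t[0]) ** 2 if normalize else 1.0
    gap = t[-1] - t[-2]
    basis = [float(x)]
    for j in range(k - 2):
        term = (
            cube_plus(x - t[j])
            - cube_plus(x - t[-2]) * (t[-1] - t[j]) / gap
            + cube_plus(x - t[-1]) * (t[-2] - t[j]) / gap
        )
        basis.append(term / scale)
    return basis
```

Regressing an outcome on these covariates with any linear or generalized linear model then yields a curve that is cubic between knots and linear before the first and after the last knot.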

Additional information

RCsplines.pdf (presentation slides/PDF)
RCsplines.ppt (PowerPoint presentation)
support.do (sample program; text/plain)
support.dta (sample dataset)
support.log (sample log file)

Adjusting for unequal selection probability in multilevel models: A comparison of software packages

Kim Chantala
Carolina Population Center, UNC at Chapel Hill
Coauthors: C. M. Suchindran, Dan Blanchette

Most surveys collect data using complex sampling plans that involve selection of both clusters and individuals with unequal probability of selection. Research in methods of using multilevel modeling (MLM) procedures to analyze such data is relatively new. Often sampling weights based on selection probabilities of individuals are used to fit population-based models. However, sampling weights used for fitting multilevel models need to be constructed differently than weights used for single-level (population-average) models. This paper compares the capabilities of MLwiN, Mplus, LISREL, PROC MIXED (SAS), and gllamm (Stata) for fitting MLM from data collected with a complex sampling plan. We illustrate how sampling weights for fitting multilevel models with these software packages can be constructed from population-average weights. Finally, we use data from the National Longitudinal Survey of Adolescent Health to contrast the results from these packages.

Additional information

Chantala.ppt (PowerPoint presentation)

Day 2: Training courses from StataCorp

Analysis of survey data and correlated data

Jeff Pitblado, StataCorp

This talk discusses Stata's features for analyzing survey data and correlated data, and will explain how and when to use the three major variance estimators for survey and correlated data: the linearization estimator, balanced repeated replications, and the clustered jackknife (the latter two having been added in Stata 9). The talk will also discuss sampling designs and stratification, including Stata's new features for estimation with data from multistage designs and for applying poststratification. A theme of the seminar will be how you can make inferences with correct coverage from data collected by single-stage or multistage surveys or from data with inherent correlation, such as data from longitudinal studies.

Additional information

svyNASUG2005.do (installer for talk materials; text/plain)

Mata — matrix programming language

William Gould, StataCorp

Mata is both an interactive environment for manipulating matrices and a full development environment that produces compiled and optimized code. This talk will cover both applications, with an emphasis on how you can use Mata to quickly program solutions and how you can easily create new Stata commands with Mata. (Mata is fully integrated with Stata.) As you learn how to use Mata, it will become clear why Stata developers chose to implement some of the major new features in Stata 9 using Mata, including linear mixed models and multinomial probit.

Additional information

bostonMataTalk.do (installer for Mata talk materials; text/plain)

Scientific organizers

Elizabeth Allred, Harvard School of Public Health

Kit Baum, Boston College

Rich Goldstein, Consultant

Logistics organizers

Chris Farrar, StataCorp

Gretchen Farrar, StataCorp