Last updated: 14 September 2004

2004 North American Stata Users Group meeting

23–24 August 2004

Longwood Galleria Conference Center
342 Longwood Avenue
Boston, Massachusetts


Use of Gaussian integration in Stata

Alan Feiveson
NASA—Johnson Space Center

Gaussian integration can be used to obtain surprisingly accurate evaluations of definite integrals with as few as 10 or 20 function evaluations. In this presentation, it will be shown how to incorporate tables of Gaussian integration weights into Stata datasets and use them to evaluate integrals for each observation. The same approach can be used to incorporate integrals involving model parameters as part of a maximum likelihood or nonlinear least-squares estimation process. Examples will be given using data from NASA's biomedical research for developing countermeasures to the adverse effects of prolonged spaceflight on astronauts.
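
As a rough illustration of the idea (not the presenter's Stata code), the sketch below keeps a small table of Gauss-Legendre nodes and weights and applies it to every observation in a dataset. It is written in Python; the five-point rule, the function name `gauss_legendre`, the example integrand (the standard normal density), and the list of upper limits are all assumptions chosen for illustration.

```python
import math

# Five-point Gauss-Legendre table on [-1, 1], built from the closed-form
# expressions for its nodes and weights.
_R = math.sqrt(10.0 / 7.0)
_S = math.sqrt(70.0)
NODES = [
    -math.sqrt(5.0 + 2.0 * _R) / 3.0,
    -math.sqrt(5.0 - 2.0 * _R) / 3.0,
    0.0,
    math.sqrt(5.0 - 2.0 * _R) / 3.0,
    math.sqrt(5.0 + 2.0 * _R) / 3.0,
]
WEIGHTS = [
    (322.0 - 13.0 * _S) / 900.0,
    (322.0 + 13.0 * _S) / 900.0,
    128.0 / 225.0,
    (322.0 + 13.0 * _S) / 900.0,
    (322.0 - 13.0 * _S) / 900.0,
]


def gauss_legendre(f, a, b):
    """Approximate the integral of f over [a, b] using the table above."""
    half = 0.5 * (b - a)
    mid = 0.5 * (a + b)
    # Map each tabulated node from [-1, 1] to [a, b] and form the weighted sum.
    return half * sum(w * f(mid + half * x) for x, w in zip(NODES, WEIGHTS))


def normal_pdf(t):
    """Standard normal density, used here only as an example integrand."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


# Hypothetical dataset: one upper integration limit per observation.
upper_limits = [0.5, 1.0, 1.5, 2.0]
integrals = [gauss_legendre(normal_pdf, 0.0, b) for b in upper_limits]
```

Each observation costs only five function evaluations here. A rule with more points would use a longer table with the same weighted sum.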

Additional information


Generating random variables from the N/I distributions

Peter A. Lachenbruch

An N/I distribution is the distribution of the ratio of a normal random variable to an independent random variable; the family includes the normal, Cauchy, t, and slash distributions for particular choices of the denominator distribution. The author has developed a program that generates draws from these distributions for use in simulations. Mixtures are also allowed, and one can obtain the distribution of the inverse of the I distribution by setting the numerator normal to have mean 1 and standard deviation 0. These distributions were originally developed for the robust estimation study of Andrews et al. (1972).
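
The construction can be sketched as follows. This is a Python illustration, not the author's program: the function names are hypothetical, and the denominator choices are the standard textbook constructions for each case (a constant for the normal, an independent standard normal for the Cauchy, the square root of a chi-squared over its degrees of freedom for the t, and a uniform for the slash).

```python
import math
import random


def draw_ni(denominator, mean=0.0, sd=1.0, rng=random):
    """One draw of a N(mean, sd) variate divided by an independent draw."""
    return rng.gauss(mean, sd) / denominator(rng)


def one(rng):
    """Constant denominator: N/1 is simply the normal."""
    return 1.0


def normal01(rng):
    """Independent standard normal denominator: the ratio is Cauchy."""
    return rng.gauss(0.0, 1.0)


def uniform01(rng):
    """Uniform denominator on (0, 1]: the ratio is the slash distribution."""
    return 1.0 - rng.random()


def t_denominator(df):
    """sqrt(chi-squared(df) / df) denominator: the ratio is Student's t."""
    def draw(rng):
        # A chi-squared(df) variate is a gamma with shape df/2 and scale 2.
        return math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)
    return draw


rng = random.Random(2004)  # seeded only so the sketch is reproducible
cauchy_draws = [draw_ni(normal01, rng=rng) for _ in range(5)]
slash_draws = [draw_ni(uniform01, rng=rng) for _ in range(5)]
t_draws = [draw_ni(t_denominator(5), rng=rng) for _ in range(5)]

# Numerator with mean 1 and standard deviation 0 is the constant 1, so the
# draw is the inverse of the denominator variable.
inverse_uniform = [draw_ni(uniform01, mean=1.0, sd=0.0, rng=rng) for _ in range(5)]
```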

Additional information


Econometric techniques for estimating treatment effects

Zhehui Luo
Department of Epidemiology, Michigan State University

One way to evaluate econometric techniques for estimating treatment effects is to use experimental data to gauge the results of different methods (LaLonde 1986). There has been heated debate since LaLonde's seminal paper as to whether propensity-score techniques overcome the selection problem (Smith and Todd 2003; Dehejia 1999, 2002). This study uses a randomized trial of a cognitive behavioral intervention aimed at reducing the severity of symptoms and their impact on emotional distress and physical function in cancer patients. We use several other datasets from which cancer patients were selected as comparison groups. We estimate the "true" treatment effect on physical function and mental health (SF-36) with the randomized trial and compare it with the results of the following econometric techniques applied to the comparison groups: (1) the difference-in-differences (DID) method, (2) instrumental variables, and (3) propensity-score matching estimators (including nearest neighbor, radius matching, stratification, and kernel matching) (Becker and Ichino 2002). The results show that the performance of propensity-score matching depends on the comparison sample and the outcome compared, and that the bias is larger the more the comparison sample differs from the treated group.

Additional information


Sample-size calculation for longitudinal studies

Phil Schumm
Department of Health Studies, University of Chicago

Consider a longitudinal study designed to estimate the difference in the rates of change in some outcome between two different groups. In this case, the variance of the estimator depends on several factors, including the variability in the outcome, the amount of missing data due to dropout, the distribution of additional covariates, and the degree and structure of the within-unit correlation across time. Although it is often possible to compute the variance (or an approximation to it) directly from a mathematical formula, this can be unwieldy for those unfamiliar with such computations. In this presentation, I will demonstrate (using real examples) how xtgee can be used to compute the variance, from which an estimate of power may be obtained. By creating an appropriate pseudo-dataset, it is possible to specify virtually any covariate distribution and pattern of dropout. In addition, because xtgee will accept an arbitrary fixed correlation matrix, it is easy to specify whatever correlation structure is considered most plausible. This method is intuitive and makes it easy for researchers to explore the effects that changes in their assumptions have on a study's power. A comparison of the results of this method with those generated by other sample size software will also be presented.
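
Once a variance for the estimated difference in slopes is in hand, the final step from variance to power can be sketched with the usual normal approximation for a two-sided Wald test. The Python snippet below is only an illustration of that last step, not the presenter's procedure: the function name `wald_power` and the numeric inputs are hypothetical, and the standard error is assumed to come from whatever variance calculation precedes it.

```python
from statistics import NormalDist


def wald_power(delta, se, alpha=0.05):
    """Approximate power of a two-sided Wald test of H0: delta = 0.

    delta is the assumed true difference and se the standard error of its
    estimator (the square root of the computed variance).
    """
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / se
    # Probability of rejecting in either tail under the alternative.
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


# Hypothetical inputs: a difference in slopes of 0.5 with standard error 0.18.
power = wald_power(delta=0.5, se=0.18)
```

Re-running the function with a standard error computed under different assumptions about dropout or correlation shows directly how those assumptions affect power.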

Using Stata for questionnaire development

Theodore Pollari
Phil Schumm
Department of Health Studies, University of Chicago

In studies that collect survey data, the investigator(s) often construct the questionnaire using a word processor and then deliver it to a survey organization, which translates it into an electronic data collection instrument (e.g., CAPI or CATI). Unfortunately, this approach suffers from the following problems: (1) a word processor is not well suited to the development of a complex questionnaire, (2) time is wasted and errors may occur when translating the questionnaire into CAPI, and (3) background information about the individual questions that is often relevant for analysis of the data (e.g., question source and rationale, scoring instructions, etc.) is not preserved in the final data file. We will describe a system that permits an investigator to construct a questionnaire in Stata by representing questions as variables and using labels and characteristics to specify attributes such as question text, response categories, and background information together with specifications regarding the structure of the interview (e.g., skip patterns and loops). The resulting .dta file is automatically translated into a variety of useful forms, including a human-readable version of the questionnaire and a format that may be imported directly into CAPI. The file also serves as a shell into which the actual data may be placed so that researchers analyzing the data have easy access to question attributes.

Translating data between MySQL and Stata

Michael Johnson
Phil Schumm
Department of Health Studies, University of Chicago

As web-based and other electronic data collection methods become more widely used in research, the opportunities to use statistical software in conjunction with conventional database systems are increasing. Among such systems, MySQL is particularly well suited for research purposes. For example, MySQL's ENUM and SET column types are ideal for storing data collected via the multiple choice questions typically used in social surveys. At the same time, Stata is uniquely suited for working in conjunction with a database; for example, its implementation of characteristics makes it possible to preserve (in a usable form) important information about how the database and front-end application are constructed (e.g., column types and other attributes). In this presentation, we shall describe a Python script we have developed for translating data from MySQL to Stata and will indicate briefly how we are using it in the development of tools for the collection and management of research data.

Working with ODBC data sources in Stata: tips and techniques

Joseph Coveney
Cobridge Co., Ltd., Tokyo

With its suite of ODBC-related commands, Stata can now be used directly with many popular database management systems (DBMSs) and other ODBC data sources. Stat/Transfer's ODBC capabilities have permitted indirect access for some time. ODBC has advantages over copy and paste or save as/insheet operations for reproducibility of the analysis and for documentation of its trail of events. The suite's ability to use Structured Query Language statements also facilitates use of DBMSs for storage and organization of massive sets of data, while economizing on memory during analysis with Stata by limiting what is loaded into the active dataset to only pertinent rows and columns. This presentation will briefly review the suite, and then, using a case study approach, it will illustrate the use of this suite in solutions to selected data management problems, for example, when clients deliver data for analysis in spreadsheets laid out in an unfavorable manner, or when datasets are delivered in ill-designed relational databases or in those that are subject to frequent updating. The presentation will also share tips and precautions from experience with its use with two popular spreadsheet and database packages, and give pointers on using ODBC data sources that contain text in double-byte character sets (Unicode).

Using Stata with large datasets in corporate America: lessons learned

Ed Bassin
ProfSoft, Inc.

While Stata has gained wide acceptance in academia, its use in corporate environments lags far behind. For academic Stata users, the product's inability to penetrate business has important consequences. Students trained in Stata have fewer opportunities than those trained in other tools, particularly SAS, that are widely used by businesses. Academic statisticians have fewer opportunities to collaborate and consult with colleagues in business. For the past five years, ProfSoft has been developing and marketing a medical claims analysis system that builds on the data management and data analysis capabilities in Stata. Our experience has shown that Stata can greatly enhance the analytic capabilities of health plans, provider organizations, and purchaser coalitions and that Stata is a very powerful tool for working with large corporate databases, sometimes in excess of 100 million records. During that time, we have learned many lessons about how Stata can gain acceptance in mainstream corporate America. In this presentation, we discuss factors that helped gain acceptance for a software product that is based on Stata. We discuss the features in Stata that are most important to our customers and those that are of little interest to them. We demonstrate business applications with web-based graphical user interfaces that unleash the power of Stata for users who have little or no interest in learning how to use Stata directly.

Additional information


Graphics for categories and compositions

Nicholas J. Cox
Department of Geography, Durham University, UK

Graphics and categorical data are odd bedfellows. A pie chart of the frequencies of a categorical variable may be the first statistical technique taught to young children, and there is a very substantial if self-contained literature on biplots and related methods. Yet in between, many texts and papers on categorical data make little or no use of graphical methods. Is this because appropriate graphs do not exist, or are they too trivial or too ineffective to be worth attention? I shall discuss various Stata implementations of graphs for categorical data, both familiar and unfamiliar, old and new, including bar and dot charts, cumulative and sliding plots, triangular plots, and tabular plots. Subsidiary themes will include, on the statistical side, support for logit and other appropriate non-linear scales, respect for ordinal structure, smoothers for categorical data, and transformations of the simplex; and, on the Stata side, the strategy and trickery of writing user-written graphics programs as wrappers for the new graphics of Stata 8, aiming both to maximize user choice and to minimize user-programmer effort.

Metagraphiti by Stata: Visuographical exploration and presentation of meta-analytic data using Stata

Ben Dwamena
University of Michigan Medical School

Meta-analysis is considered the highest level of evidence on effectiveness of healthcare interventions. It provides important information by capitalizing on the large numbers of studies performed to assess the impact of healthcare interventions, helps reduce variability and uncertainty among published reports of efficacy, produces summary estimates of effectiveness for clinical decision making, and evaluates the quality of the published evidence. However, a large proportion of meta-analyses pose a surprising challenge for the uninitiated user: in order to figure out what the researchers found, the user must struggle through a maze of textual jargon, statistical formulas, lengthy lists of actual studies, and extensive tables of overall average effect size and mean effect sizes for important subgroups of studies. On the premise that "a picture is worth more than a thousand words but a 'metagraphita' is worth more than a thousand words and statistical tests", the purpose of this presentation is to provide an idiot-proof overview of statistical graphics and diagnostic plots for exploring publication bias, data distribution, and heterogeneity and for summarizing overall datasets. Discussion will include the construction and interpretation of general graphical displays such as weighted histograms, normal quantile plots, forest plots, funnel graphs, and scatter diagrams, as well as plots unique to diagnostic meta-analysis (e.g., ROC plane graphic, Accuracy-Threshold regression plots, summary receiver operator characteristic curves, and likelihood-ratio scattergrams). The presentation will consist of a didactic slide presentation supplemented by handouts, an annotated bibliography, and illustrations of the derivation and interpretation of visual displays from published meta-analyses using Stata.

Additional information


Density-distribution sunflower plots in Stata 8

William D. Dupont
Department of Biostatistics, Vanderbilt University School of Medicine

Density-distribution sunflower plots are used to display high-density bivariate data. They are useful for data where a conventional scatterplot is difficult to read because of overstriking of the plot symbol. The x-y plane is subdivided into a lattice of regular hexagonal bins of width w specified by the user. The user also specifies the values of l, d, and k, which affect the plot as follows. Individual observations are plotted when there are fewer than l observations per bin, as in a conventional scatterplot. Each bin containing from l to d observations is drawn as a light sunflower. Other bins are drawn as dark sunflowers. In a light sunflower, each petal represents one observation. In a dark sunflower, each petal represents k observations. The user can control the sizes and colors of the sunflowers. By selecting appropriate colors and sizes for the light and dark sunflowers, plots can be obtained that give both an overall sense of the data density distribution and the number of data points in any given region. The use of this graphic is illustrated with data from the Framingham Heart Study. Stata version 8.2 contains a program, called sunflower, that draws these graphs.
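
The rule for how a single bin is drawn can be sketched as below. This Python snippet is not the sunflower program itself; it covers only the per-bin decision, not the hexagonal binning. Two details are assumptions, because the abstract does not settle them: a bin with exactly d observations is treated as light, and the number of petals in a dark sunflower is the bin count divided by k, rounded to the nearest integer. The function name `classify_bin` is hypothetical.

```python
def classify_bin(count, l, d, k):
    """Return (kind, n) describing how a bin with `count` observations is drawn.

    kind is "points", "light", or "dark"; n is the number of plotted points
    or petals.
    """
    if count < l:
        # Sparse bin: plot each observation individually.
        return ("points", count)
    if count <= d:  # assumption: the boundary d itself counts as light
        # Light sunflower: one petal per observation.
        return ("light", count)
    # Dark sunflower: one petal per k observations (rounding is an assumption).
    return ("dark", round(count / k))


# Hypothetical settings and bin counts.
l, d, k = 3, 13, 5
for count in (2, 3, 13, 52):
    print(count, classify_bin(count, l, d, k))
```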

Additional information


Replication methods for complex survey analysis in Stata

Nicholas Winter
Department of Government, Cornell University

This talk will discuss the svr suite of user-written commands in Stata. These commands facilitate the analysis of data from surveys with complex sampling plans and represent an alternative to official Stata's Taylor series linearization-based svy commands. I will touch briefly on the theoretical basis for these techniques and contrast them with Taylor series. The heart of the talk will present the commands. I will conclude with some observations of the joys and sorrows of constructing add-on commands to official Stata.

Additional information


Rolling regressions in Stata

Kit Baum
Department of Economics, Boston College and RePEc

This talk will describe some work underway to add a "rolling regression" capability to Stata's suite of time-series features. Although commands such as statsby permit analysis of non-overlapping subsamples in the time domain, they are not suited to the analysis of overlapping (e.g., "moving window") samples. Both moving-window and widening-window techniques are often used to judge the stability of time series regression relationships. We will present an implementation of a rolling regression command and illustrate with examples from the empirical literature.
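
The difference between the two windowing schemes can be sketched in a few lines. The Python code below is an illustration only and does not reflect the syntax or internals of the command described in the talk; the function names, the simple one-regressor least-squares fit, and the example series are assumptions.

```python
def ols(xs, ys):
    """Return (intercept, slope) of a one-regressor least-squares fit."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def rolling(xs, ys, window):
    """Moving window: a fixed-length sample that slides one period at a time."""
    return [ols(xs[i:i + window], ys[i:i + window])
            for i in range(len(xs) - window + 1)]


def expanding(xs, ys, min_obs):
    """Widening window: the start is fixed and the end advances each period."""
    return [ols(xs[:end], ys[:end]) for end in range(min_obs, len(xs) + 1)]


# Hypothetical series with a stable relationship y = 1 + 2x.
xs = [float(t) for t in range(8)]
ys = [1.0 + 2.0 * x for x in xs]
moving_fits = rolling(xs, ys, window=4)
widening_fits = expanding(xs, ys, min_obs=4)
```

With a stable relationship, every window returns the same coefficients. Drift in the sequence of estimates is what signals instability.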

Additional information


Implementation of quasi-least squares using xtgee in Stata

Justine Shults
Department of Biostatistics, University of Pennsylvania

Liang and Zeger's original formulation of generalized estimating equations (GEE) has been widely applied since its introduction in 1986 because it extends the application of generalized linear models to clustered data. In this presentation, we discuss a method, quasi-least squares (QLS), that is in the framework of GEE and builds on this popular approach by allowing for consideration of correlation matrices that were previously difficult to apply. In particular, we describe how to implement QLS in a straightforward fashion by making use of Stata's xtgee procedure. We also discuss some data analysis examples.

Additional information


To help others in teaching statistics using the Stata software

Susan Hailpern
Albert Einstein College of Medicine

This presentation will discuss the issues involved in teaching statistics with Stata to physicians in an MS program at Albert Einstein College of Medicine (AECOM). The Clinical Research Training Program (CRTP) at AECOM is a 2-year course of study for physicians wishing to earn a Master of Science degree in Clinical Research Methods. The program has two complementary components: (a) a didactic program with emphasis on epidemiology, biostatistics, study design, and ethics, and (b) a mentored clinical research experience. Since the program began in 1998, basic statistics has been taught using the SPSS statistical software. SPSS was felt to be easy to teach and learn because of its "pull-down" menus. However, as students advanced, SPSS was found to be too limited in its application to their clinical research. In particular, Stata can fit multinomial and ordinal logistic regressions and frailty models for multivariate survival analysis (semi-parametric and parametric), and it provides immediate commands, none of which SPSS offers. This summer, Stata 8 will be taught to CRTP students for the first time. Our experience with the new Stata has convinced us that Stata 8 will be easy to learn and use with the addition of "pull-down" menus. The fact that the instructors teaching statistics with Stata come from very different backgrounds will make this an interesting challenge. The senior instructor has had extensive experience using SPSS and is a relative newcomer to Stata. The other instructor has extensive experience with Stata, but mainly in writing Stata programs, and is unfamiliar with the "pull-down" menus available in version 8. This presentation will discuss the course changes planned in converting to Stata, as well as the successes and failures of teaching statistics with Stata to physicians in an MS program at Albert Einstein College of Medicine.

Additional information


Sensitivity analysis on traffic crash prediction models by using Stata

Deo Chimba
Department of Civil Engineering, Florida State University

Traffic accidents result from the interaction of different parameters, including highway geometrics, traffic characteristics, and human factors. Geometric variables include number of lanes, lane width, median width, shoulder width, roadway length, number of intersections, and access density, while traffic characteristics include annual average daily traffic (AADT) and speed. The effect of these parameters can be captured by models that predict crash rates on a particular roadway section. Stata commands can be used to test the sensitivity of crash rate to these variables after modeling. In current research sponsored by the Florida Department of Transportation, titled "Evaluation of Geometric and Operational Characteristics affecting the safety of Six-lane divided Roadways", we use these commands to determine the effect on crash rate of changes in these independent variables. We selected our model using the user-written command nbvargr, which gives the dispersion factor used to choose between Poisson and negative binomial models. Using Vuong's statistic, we were able to choose between zero-inflated and standard (non-inflated) models. With the listcoef, percent command, we determine the percent change in crash rate for a unit increase and for a standard deviation increase in each independent variable. Using the mfx, compute command, we determine numerically the marginal effects or the elasticities between crash rate and the independent variables. These commands, and other built-in commands, reveal whether an increase in the size or dimension of a roadway geometric feature will result in a higher or lower crash rate.
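
For count models with a log link, such as Poisson and negative binomial, the percent change in the expected count for an increase of delta in a covariate with coefficient beta is 100 × (exp(beta × delta) − 1). The Python sketch below shows that arithmetic only. The coefficient and standard deviation are made-up numbers, not results from the study, and it is an assumption that this standard interpretation is the quantity the abstract refers to.

```python
import math


def percent_change(beta, delta=1.0):
    """Percent change in the expected count when x increases by delta."""
    return 100.0 * (math.exp(beta * delta) - 1.0)


# Hypothetical coefficient on lane width and its sample standard deviation.
beta_lane_width = -0.08
sd_lane_width = 1.5

per_unit = percent_change(beta_lane_width)               # one-unit increase
per_sd = percent_change(beta_lane_width, sd_lane_width)  # one-SD increase
```

A negative coefficient gives a negative percent change, that is, a lower expected crash rate as the variable increases.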

Additional information


Tuesday, August 24, 2004

Stata Graphics

Vince Wiggins
StataCorp LP

This course will cover in detail the basic commands and concepts for building high-quality Stata graphs from scratch. You will learn new approaches to creating graphs, including organizing and managing your data, and creating custom schemes.

Additional information

There are annotated materials for this talk that can be viewed and run from within Stata. To find, install, and begin the materials, type the following commands in Stata:

        net from http://www.stata.com/users/vwiggins
        net describe boston04
        net install boston04

        whelp bgrtalk

Scientific organizers

Elizabeth Allred, Harvard School of Public Health

Kit Baum, Boston College

Nicholas J. Cox, Durham University

Marcello Pagano, Harvard School of Public Health

Rich Goldstein, Consultant

Peter A. Lachenbruch, Director of OBE/CBER/FDA

Logistics organizers

Chris Farrar, StataCorp

Gretchen Farrar, StataCorp