Last updated: 24 November 2008

2008 Fall North American Stata Users Group meeting

13–14 November 2008

The Handlery Union Square Hotel
351 Geary Street
San Francisco, CA 94102


Tricks of the trade: Getting the most out of xtmixed

Roberto G. Gutierrez
Stata’s xtmixed command can be used to fit mixed models, models that contain both fixed and random effects. The fixed effects are merely the coefficients from a standard linear regression. The random effects are not directly estimated but summarized by their variance components, which are estimated from the data. As such, xtmixed is typically used to incorporate complex and multilevel random-effects structures into standard linear regression. xtmixed’s syntax is complex but versatile, allowing it to be widely used, even for situations that do not fit the classical “mixed” framework. In this talk, I will give a tutorial on the uses of xtmixed not commonly considered, including examples of heteroskedastic errors, group structures on random effects, and smoothing via penalized splines.

Additional information

Multilevel modeling of educational longitudinal data with crossed random effects

Minjeong Jeon
University of California–Berkeley
Sophia Rabe-Hesketh
University of California–Berkeley
We consider multilevel models for longitudinal data where membership in the highest level units changes over time. The application is a four-year study of Korean students who are in middle school during the first two waves and in high school during the last two waves, where middle schools and high schools are not nested. The model includes crossed random effects for middle schools and high schools and can be estimated by using Stata’s xtmixed command. An important consideration is how the impact of the middle school and high school random effects on the response variable should change over time.

The consequences of misspecifying the random-effects distribution when fitting generalized linear mixed models

John M. Neuhaus
University of California–San Francisco
Charles E. McCulloch
University of California–San Francisco
Generalized linear mixed models provide effective analyses of clustered and longitudinal data and typically require the specification of the distribution of the random effects. The consequences of misspecifying this distribution are subject to debate; some authors suggest that large biases can arise, while others show that there will typically be little bias for the parameters of interest. Using analytic results, simulation studies, and example data, I summarize the results of extensive assessments of the bias in parameter estimates due to random-effects distribution misspecification. I also present assessments of the accuracy of random-effects predictions under misspecification. These assessments indicate that random-effects distribution misspecification often produces little bias when estimating slope coefficients but may yield biased intercepts and variance-components estimators as well as mildly inaccurate predicted random effects.

Additional information

Prediction in multilevel logistic regression

Sophia Rabe-Hesketh
University of California–Berkeley
Anders Skrondal
Norwegian Institute of Public Health
This presentation focuses on predicted probabilities for multilevel models for dichotomous or ordinal responses. For instance, in a three-level model with patients nested in doctors nested in hospitals, predictions for patients could be for new or existing doctors and, in the latter case, for new or existing hospitals. In a new version of gllamm, these different types of predicted probabilities can be obtained very easily. We will give examples of graphs that can be used to help interpret an estimated model. We will also introduce a program we have written to construct 95% confidence intervals for predicted probabilities.

Additional information

Heteroskedasticity, extremity, and moderation in heterogeneous choice models

Garrett Glasgow
University of California–Santa Barbara
Heterogeneous choice models are extensions of binary and ordinal regression models that explicitly model the determinants of heteroskedasticity. I show that, often, moderation (proximity to a choice threshold) will produce empirical results identical to heteroskedasticity in binary heterogeneous choice models, while extremity (a preference for endpoint categories) will produce empirical results identical to heteroskedasticity in ordinal heterogeneous choice models. I show how a simple extension of Williams’ user-written oglm command can create ordered heterogeneous choice models that can distinguish between heteroskedasticity, extremity, and moderation.

Additional information

A generalized meta-analysis model for binary diagnostic test performance

Ben Dwamena
University of Michigan Radiology and VA Nuclear Medicine
Methods for meta-analysis of diagnostic test accuracy studies must, in addition to unobserved heterogeneity, account for covariate heterogeneity, threshold effects, methodological quality, and small-study bias, which constitute the major threats to the validity of meta-analytic results. These have traditionally been addressed independently of each other. Two recent methodological advances include 1) the bivariate random-effects model for joint synthesis of sensitivity and specificity, which accounts for unobserved heterogeneity and threshold variation using random effects, and covariate and quality effects as independent variables in a metaregression; and 2) a linear regression test for funnel plot asymmetry in which the diagnostic odds ratio as an effect-size measure is regressed on effective sample size as a precision measure. I propose a generalized framework for diagnostic meta-analysis that integrates both developments based on a modification of the bivariate Dale’s model in which two univariate random-effects logistic models for sensitivity and specificity are associated through a log-linear model of odds ratios with the effective sample size as an independent variable. This framework unifies the estimation of the summary test performance and the assessment of the presence, extent, and sources of variability. Taking advantage of the ability of gllamm to model a mixture of discrete and continuous outcomes, I will discuss specification, estimation, diagnostics, and prediction of the model, using a motivating dataset of 43 studies investigating FDG-PET for staging the axilla in patients with newly diagnosed breast cancer.
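The funnel plot asymmetry regression mentioned in point 2 can be illustrated with a minimal Python sketch. It is not the author's gllamm-based framework. It assumes the commonly used form of the test, in which the log diagnostic odds ratio is regressed on 1/sqrt(ESS) with weights equal to ESS, where ESS = 4·n1·n2/(n1 + n2) for n1 diseased and n2 non-diseased subjects. The function names and the 2×2 counts are hypothetical and serve only as an illustration.

```python
import math

def effective_sample_size(n_diseased, n_healthy):
    """Effective sample size of one study: 4*n1*n2 / (n1 + n2)."""
    return 4.0 * n_diseased * n_healthy / (n_diseased + n_healthy)

def log_diagnostic_odds_ratio(tp, fp, fn, tn):
    """Log diagnostic odds ratio ln[(TP*TN)/(FP*FN)]; assumes no zero cells."""
    return math.log((tp * tn) / (fp * fn))

def weighted_line(x, y, w):
    """Weighted least-squares (intercept, slope) for a regression of y on x."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def asymmetry_slope(studies):
    """Slope of ln(DOR) on 1/sqrt(ESS), weighted by ESS (assumed test form).

    studies is a list of (TP, FP, FN, TN) tuples, one per study.
    """
    ess = [effective_sample_size(tp + fn, fp + tn) for tp, fp, fn, tn in studies]
    x = [1.0 / math.sqrt(e) for e in ess]
    y = [log_diagnostic_odds_ratio(*s) for s in studies]
    return weighted_line(x, y, ess)[1]

# Hypothetical counts (TP, FP, FN, TN), invented purely for illustration.
studies = [(45, 5, 5, 45), (18, 4, 2, 16), (90, 20, 10, 80), (8, 1, 2, 9)]
print(asymmetry_slope(studies))
```

A slope far from zero would suggest that smaller studies report systematically different odds ratios. A formal test would also need the slope's standard error, which this sketch does not compute.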

It’s a little different with survey data

Christine Wells
Analyzing survey data is different from analyzing data generated by experiments in several important ways. I will discuss these differences using the NHANES III adult dataset as an example. Topics will include specifying the survey elements, analysis of subpopulations, model diagnostics, and model comparison.

Additional information

Using Stata’s capabilities to assess the performance of Latin American students in mathematics, reading, and science

Roy Costilla
Stata is a very good tool for analyzing survey data. It accounts for many important aspects of complex survey design and offers alternative variance-estimation methods. Through the use of matrix and macro language, it also allows the user to store and manage output results conveniently to automate the entire estimation and testing process. I will discuss the estimation of the main results of the Second Regional Comparative and Explanatory Study, an assessment of the performance in the domains of mathematics, reading, and science of third- and sixth-grade students in 16 countries of Latin America in 2005–2006. In particular, I will consider the estimation of the mean scores and their variability by country, area, grade, and subpopulation. I will also present the comparisons made to check for the differences in performance among countries and subpopulations.

Additional information

3-way ANOVA interactions: Deconstructed

Phil Ender
I will present three approaches to understanding 3-way ANOVA interactions: 1) a conceptual approach, 2) an ANOVA approach, and 3) a regression approach using dummy coding. I will illustrate the three approaches through the use of a synthetic dataset with a significant 3-way interaction.
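As a rough illustration of the regression approach (not the speaker's own example), consider a 2×2×2 design with dummy (0/1) coding and a fully saturated model. In that case the coefficient on the three-way product term equals the difference between the simple A×B interaction at C = 1 and the simple A×B interaction at C = 0. The Python sketch below checks this on cell means built from hypothetical coefficients. All names and numbers are assumptions made for the illustration.

```python
from itertools import product

def simple_ab_interaction(cell_means, c):
    """A x B interaction (difference of differences) at a fixed level of C."""
    m = cell_means
    return (m[1, 1, c] - m[1, 0, c]) - (m[0, 1, c] - m[0, 0, c])

def three_way_interaction(cell_means):
    """Change in the simple A x B interaction when C moves from 0 to 1."""
    return simple_ab_interaction(cell_means, 1) - simple_ab_interaction(cell_means, 0)

# Hypothetical coefficients of a saturated dummy-coded model:
# y = cons + a*A + b*B + c*C + ab*A*B + ac*A*C + bc*B*C + abc*A*B*C
coef = {"cons": 10.0, "a": 2.0, "b": -1.0, "c": 0.5,
        "ab": 1.5, "ac": -0.5, "bc": 0.25, "abc": 3.0}

# Cell means implied by those coefficients for every (A, B, C) combination.
cell_means = {
    (a, b, c): (coef["cons"] + coef["a"] * a + coef["b"] * b + coef["c"] * c
                + coef["ab"] * a * b + coef["ac"] * a * c + coef["bc"] * b * c
                + coef["abc"] * a * b * c)
    for a, b, c in product((0, 1), repeat=3)
}

print(simple_ab_interaction(cell_means, 0))  # equals the "ab" coefficient
print(three_way_interaction(cell_means))     # equals the "abc" coefficient
```

The same decomposition underlies the conceptual reading of a 3-way interaction: a two-way interaction that differs across the levels of a third factor.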

Additional information

Some challenges in survival analysis with large datasets

Noori Akhtar-Danesh
McMaster University
In this presentation, I demonstrate some common challenges of survival analysis with large datasets. I investigate the relationship between the age of smoking initiation and some demographic factors in the Canadian Community Health Survey, Cycle 3.1 (CCHS-3.1) dataset. CCHS-3.1 is a large dataset that includes information for over 130,000 individuals. I use different techniques for model fitting and model checking. Test-based techniques for assessing the proportional-hazards (PH) assumption are not very useful because, with so many observations, even a small deviation from the theoretical model leads to rejection of the PH assumption. In contrast, graphical approaches seem to be more helpful. However, not every diagnostic graph can be drawn because of the size of the dataset. Preliminary results show that 63% of Canadians have ever smoked a whole cigarette. Therefore, it seems more appropriate to use a cure fraction model (Lambert, 2007, Stata Journal, 7: 351–375) to handle the large proportion of censored data. However, sampling weights cannot be used in this model. In conclusion, survival analysis for large datasets cannot be done easily. Some challenges include assessing the PH assumption and drawing diagnostic graphs. In addition, the cure fraction model may not be appropriate if sampling weights cannot be incorporated in the model estimation.

Additional information

Stata and the one-armed bandit

Martin Weiss
University of Tuebingen, Germany
Using Stata, I have researched the market efficiency of the German 6/49 parimutuel lottery game. I investigate the existence of profit opportunities for particularly unpopular combinations of numbers (Papachristou and Karamanis [1998]), employing the covariates proposed by Henze and Riedwyl (1998). Furthermore, I examine the time-series behavior of stakes bet in relation to the size of the jackpot in the respective draw. In particular, I attempt to verify the conjecture that the skewness of the payoff distribution drives bettors' appetite for participation (Golec and Tamarkin [1998]). Along the way, I show how one can set up Stata to retrieve data from the Internet, unpack them automatically, and shape them for the analysis. I also show how one can schedule tasks to automate the process further.


Henze, N. and H. Riedwyl. (1998). How to Win More: Strategies for Increasing a Lottery Win. Natick, MA: A K Peters.

Papachristou, G. and D. Karamanis. (1998). Investigating efficiency in betting markets: Evidence from the Greek 6/49 lotto. Journal of Banking & Finance 22: 1597–1615.

Golec, J. and M. Tamarkin. (1998). Bettors love skewness, not risk, at the horse track. Journal of Political Economy 106: 205–225.

Additional information

Semiparametric analysis of case–control genetic data in the presence of environmental factors

Yulia Marchenko
In the past decade, many statistical methods have been proposed for the analysis of case–control genetic data with an emphasis on haplotype-based disease association studies. Most of the methodology has concentrated on the estimation of genetic (haplotype) main effects. Most methods accounted for environmental and gene–environment interaction effects by utilizing prospective-type analyses that may lead to biased estimates when used with case–control data. Several recent publications have addressed the issue of retrospective sampling in the analysis of case–control genetic data in the presence of environmental factors by developing new efficient semiparametric statistical methods. I present a new Stata command, haplologit, that implements efficient profile-likelihood semiparametric methods for fitting gene–environment models in the very important special cases of 1) a rare disease, 2) a single candidate gene in Hardy–Weinberg equilibrium, and 3) the independence of genetic and environmental factors.

Additional information

Using Mata to work more effectively with Stata: A tutorial

Christopher F. Baum
Boston College and DIW Berlin
Stata’s matrix language, Mata, highlighted in Bill Gould’s Mata Matters columns in the Stata Journal, is very useful and powerful in its interactive mode. Stata users who write do-files or ado-files should gain an understanding of the Stata–Mata interface: how Mata can be called upon to do one or more tasks and return its results to Stata. Mata's broad and extensible menu of functions offers assistance with many programming tasks, including many that are not matrix-oriented. In this tutorial, I will present examples of how do-file and ado-file writers might effectively use Mata in their work.

Additional information

Mata utilities

Elliott Lowy
I will present a selection of user-written Mata functions that serve to streamline the process of writing other Mata functions, and I will demonstrate what makes them handy. I will present debugging/programming functions for the following: dropping and re-creating one or a few functions without clearing Mata of all other useful info; displaying the contents of a matrix in a compact and informative way; and copying private function information into the global space. I will present text-handling functions for the following: concatenating and dividing blocks of text; processing lists of file/directory paths; and converting between matrices of text and ASCII values. I will present more general-purpose functions for the following: combining matrices of different sizes; reading and writing Mata matrices to spreadsheet files; generating a map of matching values in two matrices; and returning an entire (small) matrix of values to Stata locals. I will finish with a combined Stata/Mata command for storing Stata command preferences.

Additional information
Package available by typing net from http://datadata.info/ado within Stata.

Estimating user-defined nonlinear regression models in Stata and in Mata

Colin Cameron
University of California–Davis
This talk will be an overview of how to estimate nonlinear regression models that are not covered by Stata’s many built-in estimation commands. The Mata optimize() function will be emphasized, and the Stata ml command will also be covered. The material is drawn from chapter 11 of Cameron and Trivedi’s (2009) Microeconometrics Using Stata, Stata Press.

Additional information

Automated stress tests for econometric models

Roy Epstein
Boston College
I will present a Stata program for improved quality control of econometric models. Reported econometric results often have unknown reliability because of selective reporting by the researcher. In particular, t-statistics are often uninformative or misleading when multiple models are estimated from the same dataset. Econometric best practices should include routine stress tests to assess the robustness of estimation results to reasonable perturbations of the model specification and underlying data. It is feasible to implement these tests as standard outputs from the statistical software. This information should lead to greater transparency and greater ability of others to interpret a given regression. The Stata program I will discuss can be used after commands that perform cross-section, time-series, and panel regression. It is easily extensible to include additional tests as desired.

Additional information

Data I/O commands

Elliott Lowy
While Stata, of course, comes with a serviceable set of I/O commands, I have found room for improvement. I will present a set of user-written commands for using, saving, appending, and merging. Highlights include wildcards in file paths, drastically reducing the amount that needs to be typed; options to change the working directory to match the file specified; quick reloading of the current analysis file; saving partial datasets; using/appending sets of multiple data files; transparent use of Stat/Transfer within all commands to use, save, append, and merge from and/or to other formats such as SAS and Excel; maintaining a “recent file” list through the command interface; and eliminating the irritating irregular need for quotes.

The merge command, in particular, has an even larger set of advantages, which together with the above advantages means never having to open and fiddle with a file before merging it. These advantages include merging on disparately named variables; automatic conversion of string/numeric variables; case-insensitive merging; renaming variables added to the current data; automatic tabulation of the (labeled) _merge variable or summarized merge information with automatic deletion of the _merge variable; automatic deletion of matched or unmatched records; merging with a single record from a multiply-matching merge file; and true many-to-many merging.

Additional information
Package available by typing net from http://datadata.info/ado within Stata.

Causal regression with imputed estimating equations

Joseph Schafer
Penn State
Joseph Kang
Penn State
Literature on causal inference has emphasized the average causal effect, defined as the mean difference in potential outcomes under different treatment conditions. We consider marginal regression models that describe how causal effects vary in relation to covariates. To estimate parameters, we replace missing potential outcomes in estimating functions with fitted values from imputation models that include confounders and prognostic variables as predictors. When the imputation and analytic models are linear, our procedure is equivalent to maximum likelihood for normally distributed outcomes and covariates. Robustness to misspecification of the imputation models is enhanced by including functions of propensity scores as regressors. In simulations where the analytic, imputation, and propensity models are misspecified, the method performs better than inverse-propensity weighting. Using data from the National Longitudinal Study of Adolescent Health, we analyze the effects of dieting on emotional distress in the population of girls who diet, taking into account the study's complex sample design.

Likelihood-ratio tests for multiply imputed datasets: Introducing milrtest

Rose Medeiros
Through the use of user-written programs, primarily mim (Carlin, Galati, and Royston, 2008, Stata Journal 8: 49–67), Stata users can analyze multiply imputed (MI) datasets. Among other capabilities, mim allows the user to estimate a range of regression models and to perform a multiparameter hypothesis test after model estimation using a Wald test. The program presented here allows the user to perform likelihood-ratio tests after fitting models with mim on MI datasets. This provides an additional means of testing nested models after estimation using MI data. The process used to perform the likelihood-ratio tests is described in Meng and Rubin (1992, Biometrika 79: 103–111). The test statistic is calculated based on two sets of likelihood-ratio tests. The first involves calculating the likelihood ratio for the null versus the alternative hypothesis in each of the MI datasets. The second involves calculating the likelihood for the null and the alternative hypotheses in each of the MI datasets, constraining the parameters to be the estimates based on combining coefficient estimates from the MI datasets (i.e., the average of the parameter estimates across the MI datasets). The current version allows testing for a limited number of regression commands (namely, regress, logit, and ologit), but subsequent versions may include compatibility with additional commands.
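The way the two sets of likelihood-ratio statistics are combined can be sketched in Python. This is an illustration of the Meng and Rubin (1992) combining rule as it is usually stated, not the milrtest code itself. The function name, argument names, and input values are hypothetical. The sketch returns the combined statistic and the estimated relative increase in variance. The statistic is referred to an F distribution with k numerator degrees of freedom, and the denominator degrees of freedom are not computed here.

```python
from statistics import mean

def meng_rubin_statistic(lr_own, lr_pooled, k):
    """Combine likelihood-ratio statistics across m imputed datasets.

    lr_own    : LR statistic in each dataset, evaluated at that dataset's
                own null and alternative estimates (first set of tests)
    lr_pooled : LR statistic in each dataset, with the null and alternative
                parameters fixed at the averaged estimates (second set)
    k         : number of parameters being tested
    Returns (D, r): the combined statistic and the relative increase
    in variance due to missing data.
    """
    m = len(lr_own)
    if m < 2 or len(lr_pooled) != m:
        raise ValueError("need the same number (at least 2) of statistics in both sets")
    d_bar = mean(lr_own)       # average of the first set
    d_tilde = mean(lr_pooled)  # average of the second set
    r = (m + 1) / (k * (m - 1)) * (d_bar - d_tilde)
    D = d_tilde / (k * (1 + r))
    return D, r

# Hypothetical LR statistics from m = 3 imputed datasets, testing k = 2 parameters.
D, r = meng_rubin_statistic([10.0, 12.0, 14.0], [9.0, 10.0, 11.0], k=2)
print(D, r)
```

When the two sets of statistics agree closely, r is near zero and D reduces to the average likelihood-ratio statistic divided by k.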

Additional information

UCLA ATS/Stat consulting service model for Stata users

Xiao Chen
The Statistical Consulting Group provides a variety of resources to Stata users on campus, from walk-in and email consulting to an extensive website on materials related to Stata. In this presentation I will explain how the group offers such services and will discuss the three major components of the consulting process: consulting, learning, and documenting. I will also discuss the benefits and challenges involved in sharing contributed Stata packages with clients, and the role the Internet has played in shaping the collaboration aspect of our consulting model.

Scientific organizers

Xiao Chen (cochair), UCLA

Sophia Rabe-Hesketh (cochair), UC Berkeley

Phil Ender, UCLA

Estie Hudes, UCSF

Tony Lachenbruch, Oregon State

Bill Mason, UCLA

Doug Steigerwald, UC Santa Barbara

Logistics organizers

Chris Farrar, StataCorp

Gretchen Farrar, StataCorp