Last updated: 21 September 2009

2009 Stata Conference DC

30–31 July 2009

Hotel Monaco
700 F St. NW
Washington, DC 20004


Generalized method of moments estimators in Stata

David Drukker
Stata 11 has a new command, gmm, for estimating parameters by the generalized method of moments (GMM). gmm can estimate the parameters of linear and nonlinear models for cross-sectional, panel, and time-series data. In this presentation, I provide an introduction to GMM and to the gmm command.
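
As a rough illustration of the GMM idea (written in Python for illustration only, and not the syntax or internals of the gmm command), the sketch below estimates a single slope from the moment conditions E[z(y - xb)] = 0 with two instruments. The function name, the made-up data, and the default identity weighting matrix are all assumptions.

```python
# Hypothetical helper for illustration; not part of Stata's gmm command.
def linear_gmm_scalar(y, x, instruments, weights=None):
    """One-parameter linear GMM with moments E[z * (y - x*b)] = 0.

    instruments: list of instrument columns, each as long as y.
    weights: diagonal of the weighting matrix W (assumed identity if omitted).
    """
    n = len(y)
    w = weights if weights is not None else [1.0] * len(instruments)
    # Sample moments are g(b) = a - c*b, with
    # a_j = mean(z_j * y) and c_j = mean(z_j * x).
    a = [sum(zi * yi for zi, yi in zip(z, y)) / n for z in instruments]
    c = [sum(zi * xi for zi, xi in zip(z, x)) / n for z in instruments]
    # Minimizing g(b)' W g(b) over b gives b = (c'Wa) / (c'Wc).
    num = sum(wj * cj * aj for wj, cj, aj in zip(w, c, a))
    den = sum(wj * cj * cj for wj, cj in zip(w, c))
    return num / den


# Made-up data: one regressor and two instruments.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
z1 = [1.0, 2.0, 2.5, 4.5, 5.0]
z2 = [0.5, 1.5, 3.5, 3.0, 6.0]
b_hat = linear_gmm_scalar(y, x, [z1, z2])
print(b_hat)
```

With more instruments than parameters, the choice of weighting matrix affects the estimate; an efficient two-step estimator would replace the identity weights with an estimate of the inverse moment covariance.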

Additional information

Mixed-process models with cmp

David Roodman
Center for Global Development
At the heart of many econometric models is a linear function and a normal error. Examples include the classical small-sample linear regression model and the probit, ordered probit, multinomial probit, tobit, interval regression, and truncated distribution regression models. Because the normal distribution has a natural multidimensional generalization, such models can be combined into multiequation systems in which the errors share a multivariate normal distribution. The literature has historically focused on multistage procedures for estimating mixed models, which are more efficient computationally, if less so statistically, than maximum likelihood (ML). But faster computers and simulated likelihood methods such as the Geweke, Hajivassiliou, and Keane (GHK) algorithm for estimating higher-dimensional cumulative normal distributions have made direct ML estimation practical. ML also facilitates a generalization to switching, selection, and other models in which the number and types of equations vary by observation. The Stata module cmp fits seemingly unrelated regressions models of this broad family. Its estimator is also consistent for recursive systems in which all endogenous variables appear on the right-hand sides as observed. If all the equations are structural, then estimation is full-information maximum likelihood. If only the final stage or stages are structural, then it is limited-information maximum likelihood. cmp can mimic a dozen built-in Stata commands and several user-written ones. It is also appropriate for a panoply of models previously hard to estimate. Heteroskedasticity, however, can render it inconsistent. In this presentation, I explain the theory and implementation of cmp and of a related Mata function, ghk2(), that implements the GHK algorithm.
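
To give a feel for the GHK idea mentioned above, here is a minimal Python sketch that simulates a bivariate normal rectangle probability by recursive conditioning. It is an illustration under stated assumptions (standardized errors, a single correlation rho with |rho| < 1, plain pseudorandom draws) and does not reproduce the interface or internals of ghk2() or cmp; the function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


# Hypothetical helper for illustration; not the ghk2() Mata function.
def ghk_bivariate(a1, a2, rho, draws=20000, seed=12345):
    """Simulate P(e1 < a1, e2 < a2) for standard bivariate normal errors
    with correlation rho, conditioning on e1 and then on e2 given e1."""
    rng = random.Random(seed)
    p1 = STD_NORMAL.cdf(a1)             # P(e1 < a1)
    scale = (1.0 - rho * rho) ** 0.5    # conditional s.d. of e2 given e1
    total = 0.0
    for _ in range(draws):
        # Draw e1 from the normal truncated to (-inf, a1) by inverse CDF;
        # the clamp keeps the argument strictly inside (0, 1).
        u = min(max(rng.random() * p1, 1e-15), 1.0 - 1e-15)
        e1 = STD_NORMAL.inv_cdf(u)
        # Probability that e2 < a2 given the drawn e1.
        p2 = STD_NORMAL.cdf((a2 - rho * e1) / scale)
        total += p1 * p2
    return total / draws


estimate = ghk_bivariate(0.0, 0.0, 0.5)
print(estimate)
```

For this particular case the exact value is known in closed form, 1/4 + arcsin(0.5)/(2*pi) = 1/3, which gives a convenient check on the simulator.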

Additional information

New multivariate time-series estimators in Stata

David Drukker
Stata 11 has new commands sspace and dvech for estimating the parameters of state-space models and diagonal-vech multivariate GARCH models, respectively. In this presentation, I provide an introduction to state-space models, diagonal-vech multivariate GARCH models, the implemented estimators, and the new Stata commands.

Additional information

Survey data analysis in Stata

Jeff Pitblado
In this presentation, I cover how to use Stata for survey data analysis assuming a fixed population. We will begin by reviewing the sampling methods used to collect survey data, and how they affect the estimation of totals, ratios, and regression coefficients. We will then cover the three variance estimators implemented in Stata's survey estimation commands. Strata with a single sampling unit, certainty sampling units, subpopulation estimation, and poststratification will also be covered in some detail.
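
As a small illustration of how the design enters the estimation of a total, the Python sketch below computes a weighted total and a linearization-type variance for a stratified design. It assumes primary sampling units (PSUs) are treated as sampled with replacement and applies no finite population correction; it is not the code behind Stata's survey commands, and the function name and data are made up.

```python
from collections import defaultdict


# Hypothetical helper for illustration; not Stata's svy implementation.
def survey_total(records):
    """records: iterable of (stratum, psu, weight, y) tuples.

    Returns (estimated total, variance estimate), treating PSUs as
    sampled with replacement within strata (no FPC)."""
    # Weighted total of y within each PSU.
    psu_totals = defaultdict(float)
    for stratum, psu, weight, y in records:
        psu_totals[(stratum, psu)] += weight * y
    # Group the PSU totals by stratum.
    by_stratum = defaultdict(list)
    for (stratum, _psu), z in psu_totals.items():
        by_stratum[stratum].append(z)
    total = sum(psu_totals.values())
    variance = 0.0
    for z_values in by_stratum.values():
        n_h = len(z_values)
        if n_h < 2:
            # A stratum with a single PSU contributes no between-PSU
            # variation, so this formula cannot be applied to it.
            raise ValueError("stratum with a single PSU")
        z_bar = sum(z_values) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in z_values)
    return total, variance


records = [
    ("A", 1, 10.0, 2.0), ("A", 1, 10.0, 3.0), ("A", 2, 10.0, 4.0),
    ("B", 1, 5.0, 1.0), ("B", 2, 5.0, 3.0),
]
total, variance = survey_total(records)
print(total, variance)
```

The explicit error for a one-PSU stratum mirrors why such strata need special handling: the between-PSU variance formula divides by n_h - 1.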

Additional information

Regression diagnostics for survey data

Rick Valliant
University of Maryland
Diagnostics for linear regression models are included as options in Stata and many other statistical packages and are now readily available to analysts. However, these tools are generally aimed at ordinary or weighted least-squares regression and do not account for stratification, clustering, and survey weights that are features of datasets collected in complex sample surveys. The ordinary least-squares diagnostics can mislead users because the variances of model parameter estimates will usually be estimated incorrectly by the standard procedures. The variance or standard-error estimates are an intimate part of many diagnostics. In this presentation, I summarize research that has been done to extend some of the existing diagnostics to complex survey data. Among the linear regression techniques I cover are leverages, DFBETAS, DFFITS, the forward search method for identifying influential points, and collinearity diagnostics, like variance inflation factors and variance decompositions.

Additional information

Using Stata for subpopulation analysis of complex sample survey data

Brady West
University of Michigan
In this presentation, I provide an overview of important considerations that analysts of large public-use survey datasets must keep in mind when attempting to make inferences for finite subpopulations of research interest. I will discuss several examples of possible subpopulation analysis approaches that analysts could take using the Stata svy: commands, and I will emphasize the implications of each approach for making inferences. Participants will have time for a question-and-answer session building upon the examples.

Additional information

Implementing econometric estimators with Mata

Christopher F. Baum
Boston College
I will discuss how econometric estimators may be efficiently programmed in Mata. The prevalence of matrix-based analytical derivations of estimation techniques and the computational improvements available from just-in-time compilation combine to make Mata the tool of choice for econometric implementation. I will give two examples: computing the seemingly unrelated regression estimator for an unbalanced panel, a multivariate linear approach, and computing the continuously updated GMM estimator (GMM-CUE) for a linear instrumental-variables model. The GMM-CUE estimator makes use of Mata’s optimize suite of functions. Both illustrate the power and effectiveness of a Mata-based approach.

Additional information

Estimating high-dimensional fixed-effects models

Paulo Guimaraes
University of South Carolina
In this presentation, I describe an alternative iterative approach for the estimation of linear regression models with high-dimensional fixed effects, such as those fit to large employer–employee datasets. This approach is computationally intensive but imposes minimal memory requirements. I also show that the approach can be extended to nonlinear models and potentially to more than two high-dimensional fixed effects. Note: The presentation is based on a paper that is currently under review at the Stata Journal.
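
One way such an iterative scheme can work is to cycle through the least-squares normal equations one block at a time, so that no matrix of dummy variables is ever built. The Python sketch below does this for one regressor and two sets of fixed effects. It is an assumption about the general flavor of the approach, not the presenter's algorithm, and the function name, labels, and data are hypothetical.

```python
# Hypothetical helper for illustration; not the presenter's implementation.
def two_way_fe_slope(y, x, worker, firm, tol=1e-12, max_iter=5000):
    """Fit y = b*x + a[worker] + g[firm] + e by updating the slope, the
    worker effects, and the firm effects in turn until b stops changing."""
    n = len(y)
    a = {w: 0.0 for w in worker}
    g = {f: 0.0 for f in firm}
    sxx = sum(v * v for v in x)
    b = 0.0
    for _ in range(max_iter):
        # Slope given the current fixed effects.
        b_new = sum(x[k] * (y[k] - a[worker[k]] - g[firm[k]])
                    for k in range(n)) / sxx
        # Each worker effect is the mean residual over that worker's rows.
        sums, counts = {}, {}
        for k in range(n):
            r = y[k] - b_new * x[k] - g[firm[k]]
            sums[worker[k]] = sums.get(worker[k], 0.0) + r
            counts[worker[k]] = counts.get(worker[k], 0) + 1
        a = {w: sums[w] / counts[w] for w in sums}
        # Same update for the firm effects.
        sums, counts = {}, {}
        for k in range(n):
            r = y[k] - b_new * x[k] - a[worker[k]]
            sums[firm[k]] = sums.get(firm[k], 0.0) + r
            counts[firm[k]] = counts.get(firm[k], 0) + 1
        g = {f: sums[f] / counts[f] for f in sums}
        if abs(b_new - b) < tol:
            return b_new
        b = b_new
    return b


# Made-up noise-free data: 3 workers observed at 3 firms, true slope 2.
worker = [0, 0, 0, 1, 1, 1, 2, 2, 2]
firm = [0, 1, 2, 0, 1, 2, 0, 1, 2]
x = [1.0, 4.0, 2.0, 7.0, 3.0, 8.0, 5.0, 0.0, 6.0]
a_true = [1.0, -2.0, 0.5]
g_true = [0.0, 3.0, -1.0]
y = [2.0 * x[k] + a_true[worker[k]] + g_true[firm[k]] for k in range(9)]
b_hat = two_way_fe_slope(y, x, worker, firm)
print(b_hat)
```

Only vectors of group means are stored, which is the memory saving; the price is many passes over the data, since convergence slows as the regressor becomes more collinear with the fixed effects.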

Additional information

Data envelopment analysis in Stata

Choonjoo Lee
Ji Yong-bae
Korea National Defense University
In this presentation, we present a procedure and an illustrative application of a user-written Data Envelopment Analysis (DEA) program in Stata. DEA is a linear programming method for assessing the efficiency and productivity of units and a popular managerial tool for measuring the performance of organizations. It has been used widely for assessing the efficiency of public and private sector organizations, such as banks, airlines, hospitals, universities, defense firms, and manufacturers. The DEA program in Stata will allow DEA users to easily access the Stata system and to conduct not only the standard optimization procedure but also more extended managerial analysis. We will discuss a Mata implementation, an extension of the DEA program code developed in the Stata programming language, for cases where data capacity matters. We will also discuss the returns-to-scale options in DEA. To date, Stata offers no built-in DEA command, although a stochastic frontier analysis (SFA) model is available. The user-written DEA approach in Stata will provide some possible future extensions of Stata programming in DEA.
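
In the special case of one input and one output under constant returns to scale, the DEA linear program reduces to comparing each unit's output/input ratio with the best observed ratio. The Python sketch below shows only that special case; the general multi-input, multi-output problem needs a linear programming solver, and nothing here reflects the interface of the user-written Stata program. The function name and data are hypothetical.

```python
# Hypothetical helper for illustration; not the user-written DEA command.
def crs_efficiency(inputs, outputs):
    """Efficiency scores for single-input, single-output units under
    constant returns to scale: each unit's output/input ratio divided
    by the best observed ratio."""
    ratios = [out / inp for inp, out in zip(inputs, outputs)]
    best = max(ratios)
    return [r / best for r in ratios]


# Made-up data for three units.
scores = crs_efficiency([2.0, 4.0, 5.0], [2.0, 6.0, 5.0])
print(scores)
```

A score of 1 marks a unit on the efficient frontier; scores below 1 give the proportional input reduction that would bring a unit to the frontier.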

Additional information

Estimating the fractional response model with an endogenous count variable

Hoa Nguyen
Michigan State University
Minh Nguyen
American University
In this presentation, we introduce the command frcount for estimating the fractional response model with an endogenous count variable. The endogeneity of the right-hand-side count variable is controlled for in the presence of unobserved heterogeneity. We briefly discuss the model, estimation method, and implementation of the frcount command in Stata. More importantly, we provide useful summary statistics of parameter estimates, adjusted standard errors, and average partial effects, which can be compared across nonlinear models.

Additional information

Threshold regression with threg

Mei-Ling Ting Lee
University of Maryland
In this presentation, I introduce a new Stata command called threg. The command estimates regression coefficients of a threshold regression model based on the first hitting time of a boundary by the sample path of a Wiener diffusion process. The regression methodology is well suited to applications involving survival and time-to-event data. This new command uses the MLE routine in Stata for calculating regression coefficient estimates, asymptotic standard errors, and p-values. An initialization option is also allowed, as in the conventional MLE routine. The threg command can be carried out with either calendar or analytical time scales. Hazard ratios at selected time points for specified scenarios (based on given categories or value settings of covariates) can also be calculated by this command. Furthermore, curves of estimated hazard functions, survival functions, and probability distribution functions of the first hitting time can be plotted. Function curves corresponding to different scenarios can be overlaid in the same plot for a comparative analysis to give added research insights.
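
The first hitting time of a boundary by a Wiener process has a closed-form distribution, which is what makes likelihood-based estimation tractable. The Python sketch below evaluates that distribution for a unit-variance process started at y0 > 0 and absorbed at zero. It illustrates the underlying probability model only; how threg parameterizes y0 and the drift in terms of covariates is not shown, and the function names are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


# Hypothetical helpers for illustration; not part of the threg command.
def hitting_time_cdf(t, y0, mu):
    """P(first hitting time <= t) for a unit-variance Wiener process that
    starts at y0 > 0 with drift mu and is absorbed at zero (t > 0)."""
    root_t = sqrt(t)
    return (PHI(-(y0 + mu * t) / root_t)
            + exp(-2.0 * y0 * mu) * PHI((mu * t - y0) / root_t))


def survival(t, y0, mu):
    """Probability that the boundary has not been reached by time t."""
    return 1.0 - hitting_time_cdf(t, y0, mu)


# Survival at a few time points for a process drifting toward the boundary.
for t in (0.5, 1.0, 2.0, 4.0):
    print(t, survival(t, y0=1.0, mu=-1.0))
```

When the drift points away from the boundary (mu > 0), the CDF levels off at exp(-2*y0*mu) instead of reaching 1, so some sample paths never experience the event.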

Additional information

Causal inference

Austin Nichols
Urban Institute
In this presentation, I provide a brief overview of quasiexperimental methods of estimating causal impacts using Stata: panel data, matching and reweighting, instrumental variables, and regression discontinuity designs, emphasizing practical considerations. I pay particular attention to the regression discontinuity method, which is the least widely known but the most well regarded of the quasiexperimental methods in those circumstances where it is appropriate.
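
To make the regression discontinuity idea concrete, the Python sketch below fits a separate least-squares line on each side of a cutoff, using only observations within a bandwidth, and takes the difference between the two fitted values at the cutoff. It assumes a sharp design, a uniform kernel, and a bandwidth chosen by hand; the function names and data are hypothetical, and the sketch is not taken from the presentation.

```python
# Hypothetical helpers for illustration; not from the presentation.
def ols_line(xs, ys):
    """Intercept and slope of a simple least-squares line."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((v - mx) ** 2 for v in xs)
    slope = sum((v - mx) * (w - my) for v, w in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def sharp_rd_estimate(running, outcome, cutoff=0.0, bandwidth=1.0):
    """Jump in the outcome at the cutoff from two local linear fits."""
    pairs = list(zip(running, outcome))
    # Center the running variable so each intercept is the fit at the cutoff.
    left = [(r - cutoff, o) for r, o in pairs
            if cutoff - bandwidth <= r < cutoff]
    right = [(r - cutoff, o) for r, o in pairs
             if cutoff <= r <= cutoff + bandwidth]
    a_left, _ = ols_line([p[0] for p in left], [p[1] for p in left])
    a_right, _ = ols_line([p[0] for p in right], [p[1] for p in right])
    return a_right - a_left


# Made-up noise-free data with a jump of 3 at zero.
running = [-0.9, -0.6, -0.3, -0.1, 0.1, 0.4, 0.7, 0.9]
outcome = [1.0 + 0.5 * r + (3.0 if r >= 0 else 0.0) for r in running]
effect = sharp_rd_estimate(running, outcome)
print(effect)
```

In practice the estimate can be sensitive to the bandwidth, so results are usually reported for several bandwidth choices.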

Additional information

New factor variables features in Stata

Jeff Pitblado
In this presentation, I cover how to use the new factor variables features in Stata 11. Stata’s new factor variables notation allows you to identify categorical covariates as factor variables, provides a convenient notation for specifying indicator variables without having to generate them, and allows interactions of factor variables with other factor variables or continuous covariates.

We will also cover the new margins postestimation command. margins is a powerful yet easy-to-use command for computing expected marginal means, predictive margins, adjusted predictions, average marginal effects, and conditional marginal effects. Standard errors in margins can be estimated conditionally on the observed/specified covariate values or unconditionally via linearization.

Additional information

Between tables and graphs

Nicholas J. Cox
Durham University (UK)
The display of data or of results often entails the preparation of a variety of table-like graphs showing both text labels and numeric values. I will present basic techniques, tips, and tricks using both official Stata and various user-written commands. The main message is that whenever graph bar, graph dot, or graph box commands fail to give what you want, then you can knit your own customized displays using twoway as a general framework.

Additional information

Easy and efficient data management in Stata

Bill Rising
There are many different ways to work in Stata depending on your desires: You can work using the menus, dialog boxes, Command window, or via the Do-file Editor. Stata 11 adds to this list with its new Variables Manager and much-improved Data Editor, both of which provide tools that make tasks such as managing value labels or entering and editing dates much easier. I will show off these new features and explain how they can be used to produce do-files for reproducibility through the use of command logs and the improved Do-file Editor.

Additional information

Stata in large-scale development

Michael Lokshin
The World Bank
I will present and discuss the development of the large software project ADePT, which combines the computation kernel of Stata and the user interface written in C#. ADePT is a software platform for applied economic analysis. It is used widely in the World Bank and in many research institutions around the world to produce a standardized set of tables and graphs in different areas of applied economic analysis. Currently, ADePT includes modules on poverty, labor market, inequality, gender, education, social protection, and health.

I will demonstrate various stages of the project development, discuss the software routines (both Stata and C#) developed for interaction between ADePT and Stata, and demonstrate various tools we developed in Stata and C#. Many of these routines are currently available for Stata users.

Additional information

Stata for microtargeting using C++ and ODBC

Masahiko Aida
Greenberg Quinlan Rosner
In U.S. political campaigns, the use of voter propensity scores (predicted attributes such as partisanship or turnout likelihood) has become quite popular in recent years. Such applications, often called microtargeting, range from survey sampling to voter contact via direct mail, phone, or canvassing. To create such models, analysts first import and recode the original dataset in statistical software and then build statistical models using data mining tools. Once the models have been validated against validation data, analysts need to append the propensity scores to a database of millions of voters (such databases typically contain information from voter files, census data, and consumer data). While database software offers a strong capacity to store and manipulate large volumes of data, basic data transformations such as recoding or creating an index by PCA are not easy to carry out in it. I will demonstrate an example of using Stata as a front-end tool to connect to database software, calculate propensity scores using a C++ plug-in, and return the propensity scores to the database. This approach combines the strengths of three different platforms: the flexibility of Stata as a general statistical package, the speed of C++ to conduct complex calculations, and the capacity of database software to manipulate gigabytes of data with relative ease.

Additional information

Stata commands for moving data between PHASE and HaploView

Chuck Huber
Texas A&M Health Science Center School of Rural Public Health
Genetic association studies often explore the relationship between diseases and collections of contiguous genetic markers located on the same chromosome (known as haplotypes). Haplotypes are usually not observed directly but are inferred statistically using a variety of algorithms. One of the most popular haplotype inference programs is PHASE (Stephens and Scheet 2005; Stephens, Smith, and Donnelly 2001), and one of the most popular programs for examining characteristics of the resulting haplotypes is HaploView (Barrett et al. 2005). I will present a set of Stata commands for exporting genotype data from Stata into PHASE, importing the resulting haplotypes back into Stata for association analysis, and exporting the haplotype data from Stata into HaploView for further exploration.

Barrett, J. C., B. Fry, J. Maller, and M. J. Daly. 2005. Haploview: Analysis and visualization of LD and haplotype maps. Bioinformatics 21: 263–265.
Stephens, M., and P. Scheet. 2005. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. American Journal of Human Genetics 76: 449–462.
Stephens, M., N. J. Smith, and P. Donnelly. 2001. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 68: 978–989.

Additional information

Meta-analytic depiction of ordered categorical diagnostic test accuracy in ROC space

Ben Dwamena
University of Michigan
Meta-analysis of diagnostic accuracy studies may be performed to provide a summary measure of diagnostic accuracy based on a collection of studies and their reported empirical or estimated smooth ROC curves. Statistical methodology for meta-analysis of diagnostic accuracy studies has largely been focused on the most common type of studies, those reporting estimates of test sensitivity and specificity. To meta-analyze studies with results in more than two categories, one approach is to dichotomize results by grouping them into two categories and then applying one of these binary methods. However, it is more efficient to take all thresholds into account. Existing methods require the same number and set of categories/thresholds, are computationally intensive adaptations of the binary methods, or are only implementable using Bayesian inference. In this presentation, I present a robust and flexible parametric algorithm that is invariant to the number and set of categories and is implementable with standard statistical software such as Stata, SPSS, or SAS. The method consists of 1) estimation of study-specific ROC and location-scale parameters by heteroskedastic ordinal (probit or logit) regression; 2) estimation of correlated or uncorrelated mean location and scale from study-specific estimates with linear mixed modeling by ML, REML, or method of moments; and 3) estimation of summary ROC (bilogistic versus binormal) and ROC functionals with mean location and scale estimates from step 2. The method is illustrated with two datasets (one with studies reporting the same set of categories and the other with disparately categorized outcomes). Steps 1 and 2 are performed with oglm (authored by Richard Williams) and mvmeta (authored by Ian White), respectively. The proposed meta-analytical algorithm may be implemented in Stata by using the midacat module.

Additional information

Automated individualized student assessment

Stas Kolenikov
University of Missouri
Statisticians routinely use Monte Carlo methods to simulate random data and run new estimation procedures on those simulated data. How about simulating data for students to use in their homework? Each student gets a unique copy of a dataset, which serves at least two purposes. First, each student has to interact with the software and interpret their own answers. Second, verbatim copying of answers is not meaningful. Because the random-number generator seeds are fixed, we can also generate the answer keys and match students’ answers to those keys. I will present a system that automatically manages all the student grading tasks with the Stata package aisa. Finally, I will discuss applications in the classroom and students’ reactions to the system.
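
The role of the fixed seeds can be sketched in a few lines of Python: a seed derived from the student's ID reproduces the same dataset on demand, so the answer key never has to be stored. This is an illustration of the idea only and not the aisa package; the function names, seed formula, and grading tolerance are hypothetical.

```python
import random


# Hypothetical helpers for illustration; not part of the aisa package.
def student_dataset(student_id, course_seed=2009, n=20):
    """Return (data, answer_key) for one student. The same inputs always
    reproduce the same dataset, so the key can be regenerated for grading."""
    rng = random.Random(course_seed * 100003 + student_id)
    data = [round(rng.gauss(50.0, 10.0), 1) for _ in range(n)]
    answer_key = {"mean": round(sum(data) / n, 2), "max": max(data)}
    return data, answer_key


def grade(student_id, submitted_mean, tolerance=0.01):
    """Check a submitted mean against the regenerated answer key."""
    _, key = student_dataset(student_id)
    return abs(submitted_mean - key["mean"]) <= tolerance


data, key = student_dataset(7)
print(key)
print(grade(7, key["mean"]))
```

Because each dataset differs, an answer copied from another student will generally fail the check even when the method used was correct.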

Additional information

Altruism squared: The economics of Statalist exchanges

Martin Weiss
University of Tuebingen (Germany)
I have researched the economics of interactions on Statalist, based on the full population of exchanges from 1 January to 30 April 2009. I will examine both the “demand side” (the questions asked on the list) and the “supply side” (the answers provided). I pay particular attention to the role of unsatisfied demand (“orphans”), i.e., questions that never attract a reply.

Additional information

Implementing custom graphics in Stata

Sergiy Radyakin
The World Bank
Stata provides a fairly extensive set of graphs. However, sometimes users need to implement custom graphs, which are not yet available. In some cases, it is possible to “tweak” a standard graph so that it results in the desired image; in other cases, it is not possible. Stata uses a complex system of objects implemented as classes and heavily relies on inheritance, polymorphism, and overriding to implement its graphics. While standard class programming is well described in the Stata manuals, the particulars of the design and implementation of the Stata graphics features are not documented by developers and thus are not easily accessible. In this presentation, I will briefly discuss the overall idea of how Stata graphics works and review some examples of custom graphics commands and their implementations. This part of the discussion will be most useful for skilled Stata programmers who want to know what is happening “under the hood” and, perhaps, optimize their graphic commands to improve performance or add features. Then we will look at the new command matrixplot, whose sample images generated quite a lot of interest on Statalist. matrixplot can be used to produce contour plots and heatmap-like plots, and is particularly useful when working with climate data as well as when displaying raster images for digital image processing.

Additional information

Scientific organizers

Austin Nichols (chair), Urban Institute

Frauke Kreuter, University of Maryland

Michael Lokshin, World Bank

Mei-Ling Ting Lee, University of Maryland

Logistics organizers

Chris Farrar, StataCorp

Gretchen Farrar, StataCorp