Stata Conference DC 09: Abstracts
Thursday, July 30, 2009
Generalized method of moments estimators in Stata
Stata 11 has a new command, gmm, for estimating parameters by the
generalized method of moments (GMM). gmm can estimate the
parameters of linear and nonlinear models for cross-sectional, panel, and
time-series data. In this presentation, I provide an introduction to GMM and to the
gmm command.
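As a rough illustration of the GMM idea (not of the gmm command itself), the sketch below fits a one-parameter linear model y = b*x + u with two instruments by two-step GMM, written in plain Python. The function names, the toy data, and the restriction to a single coefficient and two instruments are assumptions made for brevity.

```python
def mat2_inv(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def moments(b, y, x, z1, z2):
    """Sample moment conditions g(b) = (1/n) * sum z_i * (y_i - b * x_i)."""
    n = len(y)
    u = [yi - b * xi for yi, xi in zip(y, x)]
    return [sum(z * ui for z, ui in zip(zk, u)) / n for zk in (z1, z2)]


def gmm_step(y, x, z1, z2, W):
    """Minimize g(b)' W g(b) for a symmetric 2x2 weight matrix W (closed form)."""
    n = len(y)
    s_zy = [sum(z * v for z, v in zip(zk, y)) / n for zk in (z1, z2)]
    s_zx = [sum(z * v for z, v in zip(zk, x)) / n for zk in (z1, z2)]
    w_zx = [W[0][0] * s_zx[0] + W[0][1] * s_zx[1],
            W[1][0] * s_zx[0] + W[1][1] * s_zx[1]]
    num = w_zx[0] * s_zy[0] + w_zx[1] * s_zy[1]
    den = w_zx[0] * s_zx[0] + w_zx[1] * s_zx[1]
    return num / den


def two_step_gmm(y, x, z1, z2):
    """Step 1 uses the identity weight; step 2 uses the inverse moment covariance."""
    n = len(y)
    b1 = gmm_step(y, x, z1, z2, [[1.0, 0.0], [0.0, 1.0]])
    u = [yi - b1 * xi for yi, xi in zip(y, x)]
    zs = (z1, z2)
    S = [[sum(ui * ui * zs[i][k] * zs[j][k] for k, ui in enumerate(u)) / n
          for j in range(2)] for i in range(2)]
    W2 = mat2_inv(S)
    return gmm_step(y, x, z1, z2, W2), W2


# Toy data (hypothetical): the true coefficient is 3, plus small fixed errors.
z1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
x = [a + 0.5 * c for a, c in zip(z1, z2)]
errors = [0.1, -0.2, 0.05, 0.1, -0.1, 0.05]
y = [3.0 * xi + e for xi, e in zip(x, errors)]
```

Calling two_step_gmm(y, x, z1, z2) returns the second-step estimate together with the weight matrix used to obtain it.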
Mixed-process models with cmp
Center for Global Development
At the heart of many econometric models is a linear function and a normal
error. Examples include the classical small-sample linear regression model
and the probit, ordered probit, multinomial probit, tobit, interval
regression, and truncated distribution regression models. Because the normal
distribution has a natural multidimensional generalization, such models can
be combined into multiequation systems in which the errors share a
multivariate normal distribution. The literature has historically focused on
multistage procedures for estimating mixed models, which are more efficient
computationally, if less so statistically, than maximum likelihood (ML). But
faster computers and simulated likelihood methods such as the Geweke,
Hajivassiliou, and Keane (GHK) algorithm for estimating higher-dimensional
cumulative normal distributions have made direct ML estimation practical. ML
also facilitates a generalization to switching, selection, and other models
in which the number and types of equations vary by observation. The Stata command cmp
fits seemingly unrelated regressions models of this
broad family. Its estimator is also consistent for recursive systems in
which all endogenous variables appear on the right-hand sides as observed.
If all the equations are structural, then estimation is full-information
maximum likelihood. If only the final stage or stages are structural, then it is
limited-information maximum likelihood. cmp can mimic a dozen
built-in Stata commands and several user-written ones. It is also
appropriate for a panoply of models previously hard to estimate.
Heteroskedasticity, however, can render it inconsistent. In this
presentation, I explain the theory and implementation of cmp and of a
related Mata function, ghk2(), that implements the GHK algorithm.
New multivariate time-series estimators in Stata
Stata 11 has new commands sspace and dvech for estimating the
parameters of state-space models and diagonal-vech multivariate GARCH models,
respectively. In this presentation, I provide an introduction to state-space models,
diagonal-vech multivariate GARCH models, the implemented estimators, and the
new Stata commands.
Survey data analysis in Stata
In this presentation, I cover how to use Stata for survey data analysis assuming a fixed
population. We will begin by reviewing the sampling methods used to collect
survey data, and how they affect the estimation of totals, ratios, and
regression coefficients. We will then cover the three variance estimators
implemented in Stata's survey estimation commands. Strata with a single
sampling unit, certainty sampling units, subpopulation estimation, and
poststratification will also be covered in some detail.
Regression diagnostics for survey data
University of Maryland
Diagnostics for linear regression models are included as options in Stata
and many other statistical packages and are now readily available to
analysts. However, these tools are generally aimed at ordinary or weighted
least-squares regression and do not account for stratification, clustering,
and survey weights that are features of datasets collected in complex
sample surveys. The ordinary least-squares diagnostics can mislead users
because the variances of model parameter estimates will usually be estimated
incorrectly by the standard procedures. The variance or standard-error
estimates are an intimate part of many diagnostics. In this presentation, I
summarize research that has been done to extend some of the existing
diagnostics to complex survey data. Among the linear regression techniques
I cover are leverages, DFBETAS, DFFITS, the forward search method for
identifying influential points, and collinearity diagnostics, like variance
inflation factors and variance decompositions.
Using Stata for subpopulation analysis of complex sample survey data
University of Michigan
In this presentation, I provide an overview of important considerations that
analysts of large public-use survey datasets must keep in mind when
attempting to make inferences for finite subpopulations of research
interest. I will discuss several examples of possible subpopulation analysis approaches
that analysts could take using the Stata svy: commands, and I will
emphasize the implications of each approach for making inferences.
Participants will have time for a question-and-answer session
building upon the examples.
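One consideration often raised in this setting is that deleting out-of-subpopulation records before estimation can change the variance estimate, because primary sampling units (PSUs) with no subpopulation members disappear from the design. The Python sketch below illustrates this for a weighted total with the usual with-replacement linearization variance. It is not taken from the presentation; the record layout, function name, and toy data are assumptions.

```python
from collections import defaultdict


def subpop_total(rows):
    """Weighted subpopulation total and its linearized variance.

    Each row is (stratum, psu, weight, y, in_subpop). Records outside the
    subpopulation contribute zero to the total, but their PSUs still count
    toward the number of PSUs in the stratum.
    """
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, weight, y, in_subpop in rows:
        psu_totals[stratum][psu] += weight * y if in_subpop else 0.0

    total = 0.0
    variance = 0.0
    for psus in psu_totals.values():
        z = list(psus.values())
        n_h = len(z)
        if n_h < 2:
            raise ValueError("stratum with a single PSU: variance not estimable")
        z_bar = sum(z) / n_h
        total += sum(z)
        variance += n_h / (n_h - 1) * sum((v - z_bar) ** 2 for v in z)
    return total, variance


# Toy design (hypothetical): two strata with three PSUs each.
rows = [
    (1, 1, 10.0, 1.0, True),
    (1, 1, 10.0, 2.0, False),
    (1, 2, 10.0, 2.0, True),
    (1, 3, 10.0, 5.0, False),   # PSU 3 has no subpopulation members
    (2, 4, 20.0, 1.0, True),
    (2, 5, 20.0, 3.0, True),
    (2, 6, 20.0, 2.0, True),
]

full_design = subpop_total(rows)
after_dropping = subpop_total([r for r in rows if r[4]])
```

Both calls give the same point estimate, but the variance differs because PSU 3 vanishes from stratum 1 once its records are dropped. In Stata, the subpop() option of svy keeps the full design in the calculation.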
Implementing econometric estimators with Mata
Christopher F. Baum
I will discuss how econometric estimators may be efficiently programmed in Mata.
The prevalence of matrix-based analytical derivations of estimation
techniques and the computational improvements available from just-in-time
compilation combine to make Mata the tool of choice for econometric
implementation. I will give two examples: computing the seemingly unrelated
regression estimator for an unbalanced panel, a multivariate linear
approach, and computing the continuously updated GMM estimator (GMM-CUE) for
a linear instrumental-variables model. The GMM-CUE estimator makes use of
Mata’s optimize suite of functions. Both illustrate the power and
effectiveness of a Mata-based approach.
Estimating high-dimensional fixed-effects models
University of South Carolina
In this presentation, I describe an alternative iterative approach for the
estimation of linear regression models with high-dimensional fixed effects,
such as those fit to large employer–employee datasets. This approach is computationally
intensive but imposes minimum memory requirements. I also show that the
approach can be extended to nonlinear models and potentially to more than
two high-dimensional fixed effects. Note: The presentation is based on a
paper that is currently under review at the Stata Journal.
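To make the trade-off between computation and memory concrete, here is a minimal Python sketch of one iterative scheme of this general kind: it alternates between updating a slope coefficient and updating two sets of group effects, storing only vectors rather than a dummy-variable design matrix. It is an assumption-laden illustration (one regressor, two grouping variables, hypothetical names and toy data), not necessarily the algorithm presented in the talk.

```python
from collections import defaultdict


def group_means(values, groups):
    """Mean of values within each group label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, k in zip(values, groups):
        sums[k] += v
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


def fit_two_way_fe(y, x, g1, g2, tol=1e-12, max_iter=100_000):
    """Estimate b in y = b*x + effect(g1) + effect(g2) + e by cycling updates."""
    n = len(y)
    fe1 = [0.0] * n
    fe2 = [0.0] * n
    b = 0.0
    sxx = sum(v * v for v in x)
    for _ in range(max_iter):
        # Update the slope given the current fixed effects.
        b_new = sum(xi * (yi - a - c)
                    for xi, yi, a, c in zip(x, y, fe1, fe2)) / sxx
        # Update each set of fixed effects as group means of the residuals.
        m1 = group_means([yi - b_new * xi - c
                          for yi, xi, c in zip(y, x, fe2)], g1)
        fe1 = [m1[k] for k in g1]
        m2 = group_means([yi - b_new * xi - a
                          for yi, xi, a in zip(y, x, fe1)], g2)
        fe2 = [m2[k] for k in g2]
        if abs(b_new - b) < tol:
            return b_new
        b = b_new
    return b


# Toy data (hypothetical): 4 workers observed at 3 firms, true slope 2, no noise.
worker = [i for i in range(4) for j in range(3)]
firm = [j for i in range(4) for j in range(3)]
x = [float(i * j) for i in range(4) for j in range(3)]
worker_effect = [1.0, -1.0, 0.5, 3.0]
firm_effect = [0.0, 2.0, -1.0]
y = [2.0 * xv + worker_effect[i] + firm_effect[j]
     for xv, i, j in zip(x, worker, firm)]

b_hat = fit_two_way_fe(y, x, worker, firm)
```

Each pass costs a few sweeps over the data, and convergence can take many passes, which mirrors the computational intensity and low memory footprint described above.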
Data envelopment analysis in Stata
Korea National Defense University
In this presentation, we present a procedure and an illustrative application of a
user-written Data Envelopment Analysis (DEA) program in Stata. DEA is a
linear programming method for assessing the efficiency and productivity of
units and a popular managerial tool for measuring performance of organizations.
It has been used widely for assessing the efficiency of public and
private sectors, such as banks, airlines, hospitals, universities, defense
firms, and manufacturers. The DEA program in Stata will allow DEA users to
easily access the Stata system and to conduct not only the standard
optimization procedure but also more extended managerial analysis. The Mata
programming, an extension of the DEA program code developed in the Stata
programming language, will be discussed for the cases where the data
capacity matters. We will also discuss the returns to scale options in
DEA. Unfortunately, to date no DEA options are available in Stata, but an
SFA model is available. The user-written DEA approach in Stata will provide
some possible future extensions of Stata programming in DEA.
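For intuition only: in the simplest case of one input, one output, and constant returns to scale, the DEA linear program collapses to comparing each unit's output-to-input ratio with the best ratio observed. The Python sketch below covers that special case; the function name and data are hypothetical, and the multi-input, multi-output and variable-returns cases require solving a linear program for each unit, which is not shown.

```python
def crs_efficiency(inputs, outputs):
    """Constant-returns efficiency scores for single-input, single-output units.

    Each unit's score is its output/input ratio divided by the best ratio,
    so units on the frontier score 1.0.
    """
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)
    return [r / best for r in ratios]


# Hypothetical units: staff employed (input) and cases handled (output).
staff = [2.0, 4.0, 5.0]
cases = [2.0, 2.0, 5.0]
scores = crs_efficiency(staff, cases)
```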
Estimating the fractional response model with an endogenous count variable
Michigan State University
In this presentation, we introduce the command frcount, which estimates
the fractional response model with an endogenous count variable. The
endogeneity of the right-hand-side count variable is controlled for in
the presence of unobserved heterogeneity. We briefly discuss the model,
estimation method, and implementation of the frcount command in
Stata. More importantly, we provide useful summary statistics of parameter
estimates, adjusted standard errors, and average partial effects, which are
comparable across nonlinear models.
Threshold regression with threg
Mei-Ling Ting Lee
University of Maryland
In this presentation, I introduce a new Stata command called threg. This
command estimates regression coefficients of a threshold regression model
based on the first hitting time of a boundary by the sample path of a Wiener
diffusion process. The regression methodology is well suited to applications
involving survival and time-to-event data. This new command uses the MLE
routine in Stata for calculating regression coefficient estimates,
asymptotic standard errors, and p-values. An initialization option is
also allowed, as in the conventional MLE routine. The threg command
can be run with either calendar or analytical time scales.
Hazard ratios at selected time points for specified
scenarios (based on given categories or value settings of covariates) can
also be calculated by this command. Furthermore, curves of estimated hazard
functions, survival functions, and probability distribution functions of the
first hitting time can be plotted. Function curves corresponding to
different scenarios can be overlaid in the same plot for a comparative
analysis to give added research insights.
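The quantities mentioned above follow from a standard result: the first hitting time of a zero boundary by a Wiener process that starts at y0 > 0 with drift mu and variance sigma^2 has an inverse Gaussian-type distribution with closed-form density and survival function. The Python sketch below evaluates them and a hazard ratio between two scenarios. The function and parameter names are assumptions, and the sketch says nothing about how threg links covariates to the parameters.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fht_density(t, y0, mu, sigma=1.0):
    """Density of the first hitting time of 0, starting from y0 > 0."""
    return (y0 / math.sqrt(2.0 * math.pi * sigma ** 2 * t ** 3)
            * math.exp(-(y0 + mu * t) ** 2 / (2.0 * sigma ** 2 * t)))


def fht_survival(t, y0, mu, sigma=1.0):
    """Probability that the process has not yet hit 0 by time t."""
    s = sigma * math.sqrt(t)
    return (norm_cdf((mu * t + y0) / s)
            - math.exp(-2.0 * y0 * mu / sigma ** 2) * norm_cdf((mu * t - y0) / s))


def fht_hazard(t, y0, mu, sigma=1.0):
    """Hazard function: density divided by survival."""
    return fht_density(t, y0, mu, sigma) / fht_survival(t, y0, mu, sigma)


def hazard_ratio(t, scenario_a, scenario_b):
    """Hazard ratio at time t; each scenario is a (y0, mu) pair."""
    return fht_hazard(t, *scenario_a) / fht_hazard(t, *scenario_b)


# Hypothetical scenarios: same starting level, different drift toward the boundary.
ratio_at_3 = hazard_ratio(3.0, (2.0, -0.8), (2.0, -0.5))
```

Because the ratio depends on t, it is evaluated at selected time points rather than summarized by a single constant as in a proportional hazards model.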
In this presentation, I provide a brief overview of quasiexperimental
methods of estimating causal impacts using Stata: panel data, matching and
reweighting, instrumental variables, and regression discontinuity designs,
emphasizing practical considerations. I pay particular attention to the
regression discontinuity method, which is the least widely known but the
most well regarded of the quasiexperimental methods in those circumstances
where it is appropriate.
Friday, July 31, 2009
New factor variables features in Stata
In this presentation, I cover how to use the new factor variables features in Stata 11.
Stata’s new factor variables notation allows you to identify categorical
covariates as factor variables, provides a convenient notation for specifying
indicator variables without having to generate them, and allows interactions of
factor variables with other factor variables or continuous covariates.
We will also cover the new margins command. margins
is a powerful yet easy-to-use command for computing expected
marginal means, predictive margins, adjusted predictions, average marginal
effects, and conditional marginal effects. Standard errors in margins
can be estimated conditionally on the observed/specified covariate values or
unconditionally via linearization.
Between tables and graphs
Nicholas J. Cox
Durham University (UK)
The display of data or of results often entails the preparation of a variety
of table-like graphs showing both text labels and numeric values. I will
present basic techniques, tips, and tricks using both official Stata and
various user-written commands. The main message is that whenever graph,
graph dot, or graph box commands fail to give what
you want, then you can knit your own customized displays using twoway
as a general framework.
Easy and efficient data management in Stata
There are many different ways to work in Stata depending on your preferences: you
can work using the menus, dialog boxes, Command window, or Do-file
Editor. Stata 11 adds to this list with its new Variables Manager and much-improved
Data Editor, both of which provide tools that make tasks such as managing
value labels or entering and editing dates much easier. I will show off these
new features and explain how they can be used to produce do-files for
reproducibility through the use of command logs and the improved Do-file Editor.
Stata in large-scale development
The World Bank
I will present and discuss the development of the large software project
ADePT, which combines the computation kernel of Stata and the user interface
written in C#. ADePT is a software platform for applied economic analysis.
It is used widely in the World Bank and in many research institutions
around the world to produce a standardized set of tables and graphs in
different areas of applied economic analysis. Currently, ADePT includes
modules on poverty, labor market, inequality, gender, education, social
protection, and health.
I will demonstrate various stages of the project development, discuss the
software routines (both Stata and C#) developed for interaction between
ADePT and Stata, and demonstrate various tools we developed in Stata and C#.
Many of these routines are currently available for Stata users.
Stata for microtargeting using C++ and ODBC
Greenberg Quinlan Rosner
In U.S. political campaigns, the use of voter propensity scores (predicted
attributes such as partisanship or turnout likelihood) has become quite popular
in recent years. Such applications, often called microtargeting, range
from survey sampling to voter contacts via direct mail, phone, or canvassing.
To create such models, analysts first recode the original dataset in
statistical software and then build statistical models by using data mining
tools. Once the models have been validated against validation data,
analysts need to append propensity scores to a database of millions of
voters (such databases typically contain information from voter files, census
data, and consumer data). While database software offers a strong capacity to
store and manipulate a large volume of data, carrying out basic data
transformation such as recoding or creating an index by PCA is not easy using
database software. I will demonstrate an example of using Stata as a
front-end tool to connect to database software, calculate propensity scores
using a C++ plug-in, and return the propensity scores to the database. This
approach combines the strengths of three different platforms: the flexibility of
Stata as a general statistical package, the speed of C++ to conduct complex
calculations, and the capacity of database software to manipulate gigabytes of
data with relative ease.
Stata commands for moving data between PHASE and HaploView
Texas A&M Health Science Center School of Rural Public Health
Genetic association studies often explore the relationship between
diseases and collections of contiguous genetic markers located on the same
chromosome (known as haplotypes). Haplotypes are usually not observed directly
but are inferred statistically using a variety of algorithms. One of the
most popular haplotype inference programs is PHASE (Stephens and Scheet 2005;
Stephens, Smith, and Donnelly 2001) and one of the most popular programs for
examining characteristics of the resulting haplotypes is HaploView (Barrett,
et al. 2005). I will present a set of Stata commands for exporting genotype
data from Stata into PHASE, importing the resulting haplotypes back into
Stata for association analysis, and exporting the haplotype data from Stata
into HaploView for further exploration.
- Barrett, J. C., B. Fry, J. Maller, and M. J. Daly. 2005.
  Haploview: Analysis and visualization of LD and haplotype maps.
  Bioinformatics 21: 263–265.
- Stephens, M., and P. Scheet. 2005.
  Accounting for decay of linkage disequilibrium in haplotype inference
  and missing-data imputation. American Journal of Human Genetics.
- Stephens, M., N. J. Smith, and P. Donnelly. 2001.
  A new statistical method for haplotype reconstruction from population data.
  American Journal of Human Genetics 68: 978–989.
Meta-analytic depiction of ordered categorical diagnostic test accuracy in ROC space
University of Michigan
Meta-analysis of diagnostic accuracy studies may be performed to provide a
summary measure of diagnostic accuracy based on a collection of studies and
their reported empirical or estimated smooth ROC curves. Statistical
methodology for meta-analysis of diagnostic accuracy studies has largely
been focused on the most common type of studies—those reporting estimates
of test sensitivity and specificity. To meta-analyze studies with results in
more than two categories, one approach is to dichotomize results by grouping
them into two categories and then apply one of these methods. However,
it is more efficient to take all thresholds into account. Existing methods
require the same number and set of categories/thresholds, are
computationally intensive adaptations of the binary methods, or are only
implementable using Bayesian inference. In this presentation, I present a
robust and flexible parametric algorithm that is invariant to the
number and set of categories and is implementable with standard statistical
software such as Stata, SPSS, or SAS. The method consists of 1) estimation
of study-specific ROC and location-scale parameters by heteroskedastic
ordinal (probit or logit) regression; 2) estimation of correlated or
uncorrelated mean location and scale from study-specific estimates with
linear mixed modeling by ML, REML, or method of moments; and 3) estimation
of summary ROC (bilogistic versus binormal) and ROC functionals with mean
location and scale estimates from step 2. The method is illustrated with two
datasets (one with studies reporting the same set of categories and the other
with disparately categorized outcomes). Steps 1 and 2 are performed with a
user-written command authored by Richard Williams and with mvmeta (authored by Ian
White), respectively. The proposed meta-analytical algorithm may be
implemented in Stata by using the midacat command.
Automated individualized student assessment
University of Missouri
Statisticians routinely use Monte Carlo methods to simulate random data and
run new estimation procedures on those simulated data. How about simulating
data for students to use in their homework? Each student gets a unique copy
of a dataset, which serves at least two purposes. First, each student has to
interact with the software and interpret their own answers. Second,
verbatim copying of answers is not meaningful. Because the random-number
generator seeds are fixed, we can also generate the answer keys and match
students’ answers to those keys. I will present a system that
automatically manages all the student grading tasks with a Stata package.
Finally, I will discuss applications in the classroom and
students’ reactions to the system.
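The role of fixed seeds can be illustrated with a short Python sketch: a seed derived deterministically from the student and assignment identifiers reproduces the same dataset on demand, so the answer key can be regenerated at grading time. This is not the system presented in the talk; the hashing scheme, the simulated variable, the questions, and the tolerance are all assumptions.

```python
import hashlib
import random
import statistics


def student_seed(student_id, assignment):
    """Derive a reproducible integer seed from the two identifiers."""
    text = f"{assignment}:{student_id}".encode("utf-8")
    return int(hashlib.sha256(text).hexdigest()[:16], 16)


def make_dataset(student_id, assignment, n=50):
    """Simulate one student's copy of the homework data."""
    rng = random.Random(student_seed(student_id, assignment))
    return [round(rng.gauss(100.0, 15.0), 1) for _ in range(n)]


def answer_key(student_id, assignment):
    """Recompute the correct answers from the regenerated dataset."""
    data = make_dataset(student_id, assignment)
    return {"mean": round(statistics.mean(data), 2),
            "sd": round(statistics.stdev(data), 2)}


def grade(student_id, assignment, submitted, tol=0.01):
    """Mark each question right or wrong; missing answers count as wrong."""
    key = answer_key(student_id, assignment)
    return {q: abs(submitted.get(q, float("nan")) - v) <= tol
            for q, v in key.items()}


# Hypothetical identifiers.
key_for_s001 = answer_key("s001", "hw1")
marks = grade("s001", "hw1", key_for_s001)
```

Nothing needs to be stored per student beyond the identifiers, because the data and the key are both functions of the seed.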
Altruism squared: The economics of Statalist exchanges
University of Tuebingen (Germany)
I have researched the economics of interactions on Statalist, based on the
full population of exchanges from 1 January to 30 April 2009. I will examine
both the “demand side” (the questions asked on the list) and the
“supply side” (the answers provided). I pay particular attention
to the role of unsatisfied demand (“orphans”), i.e., questions
that never attract a reply.
Implementing custom graphics in Stata
The World Bank
Stata provides a fairly extensive set of graphs. However, sometimes users
need to implement custom graphs, which are not yet available. In some cases,
it is possible to “tweak” a standard graph so that it results in
the desired image; in other cases, it is not possible. Stata uses a complex
system of objects implemented as classes and heavily relies on inheritance,
polymorphism, and overriding to implement its graphics. While standard class
programming is well described in the Stata manuals, the particulars of the
design and implementation of the Stata graphics features are not documented
by developers and thus are not easily accessible. In this presentation, I
will briefly discuss the overall idea of how Stata graphics works and
review some examples of custom graphics commands and their implementations.
This part of the discussion will be most useful for skilled Stata programmers
who want to know what is happening “under the hood” and,
perhaps, optimize their graphic commands to improve performance or add
features. Then we will look at the new command matrixplot, sample
images from which generated quite a lot of interest on Statalist.
matrixplot can be used to produce contour plots and heatmap-like plots
and is particularly useful when working with climate data as well as when
displaying raster images for digital image processing.