Last updated: 28 July 2004
2004 UK Stata Users Group meeting
28–29 June 2004
Centre for Econometric Analysis
Cass Business School
106 Bunhill Row
London EC1 8TZ
School of Economics, University of Nottingham
The use of datasets that contain information on both workers and the
firms they work for is growing rapidly, especially in fields such as
applied econometrics and labor economics. Similar data structures
may also arise in the analysis of data on patients and doctors or
students and schools. Many of these datasets are extremely large,
some containing a substantial fraction of the population of firms and workers.
The analysis of this kind of data poses two related problems.
The first is a problem of computing power, memory, and storage. The
second is the statistical problem of how to control for and estimate
the "unobserved effects" (also known as "fixed effects") for both
workers and firms.
In this presentation, we explain the basic issues and how we have dealt
with them using Stata. We illustrate using both simulated data and a
large linked employer–employee panel collected by the Institut für
Arbeitsmarkt und Berufsforschung in Germany. We show how to implement
various potential methods and suggest problems and limitations that
the analyst using Stata may encounter.
Abowd, J. and F. Kramarz. 1999. The analysis of labor markets using
matched employer–employee data. In Ashenfelter, O. and Card, D. (eds.)
Handbook of Labor Economics Volume 3. North–Holland.
Giovanni S. F. Bruno
Università Commerciale Luigi Bocconi, Milano
It is well known that the LSDV estimator for dynamic panel data models is not
consistent for N large and finite T. Nickell (1981) derives an
expression for the inconsistency for N→∞, which is
O(1/T). Kiviet (1995) uses asymptotic expansion techniques to
approximate the small sample bias of the LSDV estimator to also include terms
of at most order 1/NT, thus offering a method to correct the LSDV
estimator for samples where N is small or only moderately large. In
Kiviet (1999) and Bun and Kiviet (2003), the bias expression is more accurate,
including higher order terms. Monte Carlo evidence in Judson and Owen (1999)
strongly supports the corrected LSDV estimator compared to more traditional
GMM estimators when N is only moderately large. Bruno (2004) extends
the bias approximation formulas in Bun and Kiviet (2003) to accommodate
unbalanced panels with a strictly exogenous selection rule.
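To give a sense of the size of the inconsistency, the simplest case can be written out. This is a minimal sketch that assumes a first-order autoregressive panel model with no other regressors; the symbols $\gamma$, $\eta_i$ and $\varepsilon_{it}$ are illustrative notation, not taken from the abstract. It gives the familiar large-T approximation to Nickell's (1981) expression.

```latex
y_{it} = \gamma\, y_{i,t-1} + \eta_i + \varepsilon_{it},
\qquad i = 1,\dots,N,\quad t = 1,\dots,T,
\\[4pt]
\operatorname*{plim}_{N\to\infty}\,(\hat\gamma_{\mathrm{LSDV}} - \gamma)
\;\approx\; -\,\frac{1+\gamma}{T-1}
\quad\text{for reasonably large } T .
```

Under this approximation, $\gamma = 0.5$ and $T = 10$ give an inconsistency of about $-1.5/9 \approx -0.17$, which does not shrink as N grows.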
This paper describes the Stata code used in Bruno (2004) to compute the
bias approximations and carry out the Monte Carlo experiment estimating the
actual LSDV bias for various data generating processes. The analysis covers
both balanced and unbalanced panels. The actual bias, as estimated by Monte
Carlo replications, follows the same patterns as in Bun and Kiviet (2003) and
turns out to be non-increasing in the degree of unbalancedness. Moreover, the
approximations are always accurate, with the higher order terms making a
decreasing contribution to the actual bias.
Bruno, G. S. F. 2004. Approximating the bias of the LSDV estimator for
unbalanced dynamic panel data models. mimeo.
Bun, M. J. G. and J. F. Kiviet. 2003. On the diminishing returns of higher
order terms in asymptotic expansions of bias. Economics Letters 79:
Judson, R. A. and A. L. Owen. 1999. Estimating dynamic panel data models: a
guide for macroeconomists. Economics Letters 65: 9–15.
Kiviet, J. F. 1995. On bias, inconsistency and efficiency of various
estimators in dynamic panel data models. Journal of Econometrics
------. 1999. Expectation of expansions for estimators in a dynamic
panel data model: some results for weakly exogenous regressors. In Hsiao, C.,
K. Lahiri, L-F Lee, and M. H. Pesaran. eds. Analysis of Panel Data and
Limited Dependent Variables. Cambridge: Cambridge University Press.
Nickell, S. J. 1981. Biases in dynamic models with fixed effects.
Econometrica 49: 1417–1426.
MRC Clinical Trials Unit, London
Following the seminal publications of Rubin starting about 30 years ago,
statisticians have become increasingly aware of the inadequacy of `complete
case' analysis of datasets with missing observations. In medicine, for
example, observations may be missing in a sporadic way for different
covariates; a complete-case analysis may omit as many as half of the
available cases. `Hotdeck' imputation was implemented in Stata by Mander and
Clayton (1999). However, this technique may perform poorly in the common case
when many rows of data have at least one missing value. In this talk, I will
describe an implementation for Stata of the `MICE' method of multiple
multivariate imputation described by van Buuren et al. (1999). MICE stands for Multivariate Imputation by
Chained Equations. The basic idea of data analysis with multiple imputation is
to create a small number (e.g., 3–5) copies of the data, each of which
has the missing values suitably imputed. Then, each complete dataset is
analyzed independently. Estimates of parameters of interest are averaged
across the copies to give a single estimate. Standard errors are computed
according to the `Rubin rules' (Rubin 1987), devised to allow for the between-
and within-imputation components of variation in the parameter estimates. In
the talk, I will briefly present five ado-files. mvis creates multiple
multivariate imputations. uvis imputes missing values for a single
variable as a function of several covariates, each with complete data.
micombine fits a wide variety of regression models to a multiply
imputed dataset, combining the estimates using Rubin's rules.
micombine supports survival analysis models (stcox and
streg), categorical data models, generalized linear models, and more.
Finally, misplit and mijoin are utilities to inter-convert
datasets created by mvis and by the miset routine of Carlin et al. (2003).
The use of the routines will be illustrated by example.
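The combination step described above can be stated compactly. In this sketch the notation is assumed for illustration: $m$ is the number of imputed copies, $\hat Q_j$ is the estimate from copy $j$, and $U_j$ is its estimated variance.

```latex
\bar Q = \frac{1}{m}\sum_{j=1}^{m} \hat Q_j, \qquad
\bar U = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m} \bigl(\hat Q_j - \bar Q\bigr)^2,
\\[4pt]
T = \bar U + \Bigl(1 + \frac{1}{m}\Bigr) B, \qquad
\operatorname{se}(\bar Q) = \sqrt{T}.
```

Here $\bar U$ is the within-imputation component and $B$ the between-imputation component of the total variance $T$.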
Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing
multiple imputed datasets. Stata Journal 3: 226–244.
Mander, A. and D. Clayton. 1999. Hotdeck imputation.
Stata Technical Bulletin 51: 32–34.
Rubin, D. B. 1987. Multiple Imputation for Non-response in Surveys.
New York: John Wiley.
van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999.
Multiple imputation of missing blood pressure covariates in survival
analysis. Statistics in Medicine 18: 681–694.
Department of Social Medicine, University of Bristol
In medical prognosis based on survival analysis, there is an interest in
visualizing the shape of the hazard function. In fully parametric models, the
shape of the hazard function is constrained by the properties of the chosen
distribution (Weibull, log-logistic, lognormal, Gompertz, gamma). The
semi-parametric Cox model only assumes proportional hazards and has no
specification of the baseline hazard. In Stata 8, a method of illustrating
hazard functions using a kernel smooth of the hazard contributions is
implemented for the Cox model, which allows more flexible shapes. However,
if the proportional hazards assumption is violated, an alternative method may
be useful: smooth the Nelson–Aalen cumulative hazard function, differentiate
it numerically to give the hazard function, and then apply further kernel
smoothing to the resulting function.
This method will be illustrated using data from ART-CC, an international
collaboration of 12 cohorts with data on over 19,000 HIV positive patients. The
hazard of AIDS or death by risk factor groups defined by initial CD4 count (a
measure of immune system functioning) or injection drug use (IDU) is modeled
from the time of starting antiretroviral therapy for up to 5 years.
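The three steps of the smoothing approach (smooth the cumulative hazard, differentiate, smooth again) might be sketched as follows. This is not the author's code: the data are simulated, the variable names are hypothetical, and lpoly is used as a stand-in smoother from a later Stata release than the one discussed in the talk.

```stata
* Simulated survival data: Weibull event times, administrative censoring at t = 5
clear
set seed 12345
set obs 500
generate double t = 2*(-ln(runiform()))^(1/1.5)
generate byte died = t < 5
replace t = 5 if !died
stset t, failure(died)

* Step 1: Nelson-Aalen cumulative hazard, then a local-linear smooth of it
sts generate H = na
lpoly H _t if died, degree(1) at(_t) generate(H_smooth) nograph

* Step 2: numerical differentiation of the smoothed cumulative hazard
dydx H_smooth _t if died, generate(h_raw)

* Step 3: further kernel smoothing of the differentiated curve
lpoly h_raw _t if died, degree(0) at(_t) generate(h_smooth) nograph

twoway line h_smooth _t if died, sort ytitle("Smoothed hazard") xtitle("Analysis time")
```

To compare risk groups, the same steps could be run separately within each group and the resulting curves overlaid; bandwidth choice will affect the apparent shape.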
Arnold School of Public Health, University of South Carolina, Columbia, SC
and Institute of Information Science and Technology, National Research Council, Pisa
Institute of Information Science and Technology,
National Research Council, Pisa and Institute of Environmental Medicine,
Karolinska Institutet, Stockholm
The Stata command xtreg estimates the random-effects linear
regression model, for which the random effects are assumed to be normally
distributed with zero mean and non-negative variance, $\sigma_u^2$.
Testing homogeneity across units is equivalent to testing the null
hypothesis $H_0\colon \sigma_u^2 = 0$, which is
a value on the boundary of the parameter space. The command xtreg
provides the upper-tail probability of the appropriate asymptotic distribution
of the likelihood ratio test statistic. However, such a method cannot be used
to construct confidence intervals for the parameter
$\sigma_u^2$. Besides, confidence intervals for the
random-effect variance that are based on a Wald-type test, too often used, can
be shown to be asymptotically wrong.
Based on the asymptotic theory for singular information problems, a method
is developed and implemented in the Stata command xtci, which
provides asymptotically-correct confidence intervals. Also,
when testing the hypothesis of homogeneity across units, the proposed method
is shown to have better small-sample properties than one based on the
likelihood ratio test statistic.
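For reference, the boundary problem changes the reference distribution of the likelihood ratio test. The sketch below states the standard result for a single variance component tested at zero, which is presumably the "appropriate asymptotic distribution" referred to above; $\ell_1$ and $\ell_0$ denote the maximized log likelihoods with and without the random effect (notation assumed here).

```latex
\mathrm{LR} = 2\,(\ell_1 - \ell_0)
\;\overset{H_0}{\sim}\;
\tfrac12\,\chi^2_0 + \tfrac12\,\chi^2_1,
\qquad
p = \tfrac12\,\Pr\bigl(\chi^2_1 \ge \mathrm{LR}_{\mathrm{obs}}\bigr)
\quad\text{for } \mathrm{LR}_{\mathrm{obs}} > 0 .
```

The p-value is therefore half the one obtained from an ordinary $\chi^2_1$ reference distribution.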
Bottai, M. 2003. Confidence regions when the Fisher
information is zero. Biometrika 90(1): 73–84.
Bottai, M. and N. Orsini. 2004. Confidence intervals for the variance
component of random-effects linear models. Stata Journal: in press.
Julian A. Fennema
Centre for Economic Reform and Transformation, Heriot–Watt University
This paper introduces the dhurdle command for Stata, a
maximum-likelihood routine (d2) to estimate the Cragg double hurdle
model with independent or dependent errors. We give a brief
description of the procedure and its application to durable goods
consumption and market participation models. We briefly demonstrate the
construction of the program and present evidence of its consistency. We
compare its efficiency to the results reported by Flood and Gråsjö
for the routine programmed in Gauss and repeat their tests of the effect of
misspecification on the parameter estimates. We also outline extensions
in the pipeline, particularly the inverse hyperbolic sine
heteroskedasticity correction, and also invite suggestions.
Flood, L. and U. Gråsjö. 2001.
A Monte Carlo Simulation of Tobit models.
Applied Economics Letters 8: 581–584.
Christopher F. Baum
Department of Economics, Boston College, Chestnut Hill, MA
This talk will discuss the use of a number of Stata commands, some "official"
and some user-contributed, in the context of working with time-series and
panel data. Testing for endogeneity/exogeneity of regressors,
heteroskedasticity in an instrumental variables context, and fitting
regression models with ARMA errors will be considered, as well as a number of
tests for stationarity of single or multiple time series, including
stationarity in the presence of structural breaks.
StataCorp, College Station, TX
Stata's graphics are more flexible than many realize. We will exploit this
flexibility and explore a potpourri of topics, some of interest to all
graphers and others primarily of interest to those creating highly customized
graphs or new graph commands. Among the topics will be creating custom
schemes to control the appearance of graphs, including an overview of the
format and contents of scheme files. We will also examine twoway
graphs as a platform for creating custom graphs, some of which are not readily
apparent. We will discuss techniques for managing data and leveraging
twoway's native plottypes. Along the way we will introduce some new
official and unofficial tools, and perhaps some downright dangerous, but
useful, undocumented tricks.
There are annotated materials for this talk that can be viewed and run from
within Stata. To find, install, and begin the materials, type the following
commands in Stata:
. net from http://www.stata.com/users/vwiggins
. net describe uk04
. net install uk04
. whelp ukgrtalk
Office for National Statistics, London
The recent addition of support for the CMYK color model to Stata 8 allows
graphs from Stata to be used where color separation is required for printing.
This paper outlines work being done at the Office for National Statistics to
use Stata for graphics in our flagship "Economic Trends" publication. This
work aims to reduce the burden on our design teams, allow later deadlines and
ensure that analysts have more control over the appearance of their graphics
in the final publication.
Nicholas J. Cox
Department of Geography, Durham University
Circular data are a large class of directional data, which are of interest to
scientists in many fields, including biologists (movements of migrating
animals), meteorologists (winds), geologists (directions of joints and faults),
and geomorphologists (landforms, oriented stones). These examples are all
recordable as compass bearings relative to North. Other examples include
phenomena that are periodic in time, including those dependent on time of day
(in biomedical statistics: hospital visits or times of birth) or time of year
(in applied economics: unemployment or sales variations). The analysis of
circular data is an odd corner of statistical science that many never visit,
even though it has a long and curious history. Moreover, it seems that no major
statistical language provides direct support for circular statistics. This talk
describes the development and use of some routines that have been written in
Stata, primarily to allow graphical and exploratory analyses. In 2004, such
routines are being rewritten, especially to allow use of the new graphics of
Stata 8.
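A short worked formula shows why circular data need their own summary statistics. The notation is assumed for illustration: $\theta_1,\dots,\theta_n$ are directions measured as angles.

```latex
\bar C = \frac{1}{n}\sum_{i=1}^{n}\cos\theta_i, \qquad
\bar S = \frac{1}{n}\sum_{i=1}^{n}\sin\theta_i, \qquad
\bar\theta = \operatorname{atan2}(\bar S,\ \bar C), \qquad
\bar R = \sqrt{\bar C^{2} + \bar S^{2}} .
```

For two bearings of 350° and 10°, the arithmetic mean is 180° (due South), whereas the mean direction $\bar\theta$ is 0° (due North), with mean resultant length $\bar R = \cos 10^\circ \approx 0.985$.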
Biplots display correlations and differences in means and
standard deviations of many variables on one graph, together
with the values of the plotted variables and approximations of
the Euclidean distance between the observations. Biplots are
useful for identifying clusters of observations, guiding
interpretation of factor analyses, detecting multivariate
outliers and getting an idea about the correlation structure
of the data. The talk will demonstrate the merits of biplots
and discuss the development of a new version of biplot
for Stata 8.2.
Soziologie, ETH Zürich
Although multiple response questions are quite common in survey
research, Stata's official release does not provide much possibility for
an effective analysis of multiple response variables. For example, in a
study on drug addiction an interview question might be, "Which
substances did you consume during the last four weeks?" The respondents
just list all the drugs they took if any, e.g., an answer could be
"cannabis, cocaine, heroin" or "ecstasy, cannabis" or "none", etc.
Usually, the responses to such questions are held as a set of variables
and, therefore, cannot be easily tabulated. I will address this issue
and present a new module to compute one- and two-way tables of multiple
responses. The module supports several types of data structure, provides
significance tests, and offers various options to control the computation
and display of the results.
Department of Social Medicine, University of Bristol
M. A. Hernán, Harvard School of Public Health
F. Wolfe, National Data Bank for Rheumatic Diseases, USA
K. Tilling, Department of Social Medicine, University of Bristol
H. Choi, Harvard Medical School
J. A. C. Sterne, Department of Social Medicine, University of Bristol
Longitudinal studies in which exposures, confounders, and outcomes
are measured repeatedly over time have the potential to allow
causal inferences about the effects of exposure on outcome. There
is particular interest in estimating the causal effects of medical
treatments (or other interventions) in circumstances in which a
randomized controlled trial is difficult or impossible. However,
standard methods for estimating exposure effects in longitudinal
studies are biased in the presence of time-dependent confounders
affected by prior treatment.
This talk describes the use of marginal structural models
(described by Robins et al.) to estimate exposure or treatment
effects in the presence of time-dependent confounders affected by
prior treatment. The method is based on deriving
inverse-probability-of-treatment weights, which are then
used in a pooled logistic regression model to estimate the causal
effect of treatment on outcome. We demonstrate the use of
marginal structural models to estimate the effect of methotrexate
on mortality in persons suffering from rheumatoid arthritis.
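The weighting idea can be illustrated in its simplest form. The sketch below is not the analysis from the talk: it uses simulated data with a single time point, hypothetical variable names (L for the confounder, A for treatment, Y for outcome), and unstabilized weights. In the longitudinal setting, each subject's weight would instead be the cumulative product of such terms over the visits.

```stata
* Simulated point-treatment data with one confounder
clear
set seed 2004
set obs 2000
generate double L = rnormal()                             // confounder
generate byte A = runiform() < invlogit(0.8*L)            // treatment depends on L
generate byte Y = runiform() < invlogit(-1 + 0.5*A + 0.7*L)

* Model the probability of treatment given the confounder
logit A L
predict double pA, pr

* Inverse probability of the treatment actually received
generate double w = cond(A, 1/pA, 1/(1 - pA))

* Weighted logistic regression for the marginal effect of treatment
logit Y A [pweight = w]
```

Specifying pweights makes Stata report robust standard errors, which is appropriate because the weights are estimated rather than known.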
Department of Social Medicine, University of Bristol
Systematic reviews of randomized trials are now widely recognized to be the
best way to summarize the evidence on the effects of medical interventions. A
systematic review may (though it need not) contain a meta-analysis, `a
statistical analysis that combines the results of several independent studies
considered by the analyst to be "combinable" '. The first researcher to
do a meta-analysis was probably Karl Pearson, in 1904. Sadly, Stata was not
available at this time. The first Stata command for meta-analysis — the
meta command — was published in the Stata Technical Bulletin
in 1997 and exploited a facility, introduced in Stata version 5, to program
graphics. It requires the user to derive an estimate of the effect of
intervention, together with its standard error, for each study. The
metan command, published in 1998, does analyses based on the
2×2 table for each study and provides more detailed graphical displays.
Facilities for cumulative meta-analysis and meta-regression and tools for
examining bias in meta-analysis have since been introduced.
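Given an effect estimate and standard error for each study, the simplest way of combining them is the fixed-effect inverse-variance method. The sketch below uses assumed notation ($\hat\theta_i$ and $s_i$ for the estimate and standard error from study $i$ of $k$) and is not a description of any particular command's internals.

```latex
w_i = \frac{1}{s_i^{2}}, \qquad
\hat\theta = \frac{\sum_{i=1}^{k} w_i\,\hat\theta_i}{\sum_{i=1}^{k} w_i}, \qquad
\operatorname{se}(\hat\theta) = \frac{1}{\sqrt{\sum_{i=1}^{k} w_i}} .
```

More precise studies (smaller $s_i$) therefore receive more weight in the pooled estimate.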
It is perhaps surprising that Stata commands for meta-analysis are still
entirely user-written. This means that the existing commands that produce
graphics (a major advantage of the Stata commands compared with those available
in other statistical packages) are outdated since the introduction of Stata 8
graphics. Possible ways forward will be discussed, and the talk will conclude
with a discussion of developments in meta-analysis that could usefully be
addressed by future Stata commands.
Lois G. Kim
MRC Biostatistics Unit, Cambridge
Ian R. White, MRC Biostatistics Unit, Cambridge
Time-to-event endpoints are a common outcome of interest in randomized
clinical trials. The primary analysis should usually be by
intention-to-treat, giving an indication of the effectiveness of the
intervention in a population as a whole. However, the benefit
specifically for an individual receiving the intervention is becoming
increasingly important as patient decisions become more evidence-based.
Effectiveness is defined as the benefit of intervention as actually
applied, and may be estimated from simple all-or-nothing compliance
data. Efficacy, on the other hand, is the benefit of intervention under
ideal circumstances, and requires more complex compliance data.
Intervention effectiveness and efficacy after accounting for
non-compliance can be estimated in various ways, some of which have
already been implemented in Stata (e.g., strbee).
Recently, Loeys and Goetghebeur (2003) provided new methodology using
proportional-hazards techniques in survival data where compliance is
all-or-nothing in the intervention arm and perfect in the control arm.
Here, their method is implemented in Stata. The output is a hazard ratio
for the effectiveness of intervention, adjusted for observed adherence
to intervention in the treated group. An example application is
discussed for a subset of a large, randomized trial of screening
where the average benefit of 26% risk reduction becomes a 34%
risk reduction for individuals attending screening.
Loeys, T. and E. Goetghebeur. 2003.
A causal proportional hazards estimator for the effect of treatment
actually received in a randomized trial with all-or-nothing compliance.
Biometrics 59: 100–105.
Department of Public Health Sciences, King's College, London
A resultsset is a Stata dataset created as output by a Stata program. It can
be used as input to other Stata programs, which may in turn output the results
as publication-ready plots or tables. Programs that create resultssets
include xcontract, xcollapse, parmest,
parmby, and descsave. Stata resultssets do a similar job to
SAS output datasets, which are saved to disk files. However, in Stata, the
user typically has the options of saving a resultsset to a disk file, writing
it to the memory (overwriting any pre-existing dataset), or simply listing
it. Resultssets are often saved to temporary files, using the
tempfile command. This lecture introduces programs that create
resultssets, and also programs that do things with resultssets after they have
been created. listtex outputs resultssets to tables that can be
inserted into a Microsoft Word, HTML, or TeX document.
eclplot inputs resultssets and creates confidence interval plots.
Other programs, such as sencode and tostring, process
resultssets after they are created and before they are listed, tabulated, or
plotted. These programs, used together, have a power not always appreciated
if the user simply reads the online help for each package.
Abdel G. A. Babiker
MRC Clinical Trials Unit, London
Mohamed M. Ali
In the presence of dependent competing risks in survival analysis, the Cox
proportional hazards model can be utilized to examine covariate effects on the
cause-specific hazard function for each type of failure. The method proposed
by Lunn and McNeil (1995) requires data augmentation. With k failure
types, the data would be duplicated k times, one record for each
failure type. Either a stratified or an unstratified analysis could be used,
depending on whether the assumption of proportional hazards holds. If the
proportional hazards assumption does not hold across the causes, the
stratified analysis should be used, which is equivalent to fitting a separate
model for each failure type. The unstratified analysis assumes a constant
hazard ratio between failure types and this could be fitted by including an
indicator variable as a covariate.
We will show how both approaches could be fitted on augmented data using
stcox. In addition to the parameter estimates and their standard
errors, the program has an option to produce cumulative incidence functions
with pointwise confidence limits.
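The data augmentation step and the two analyses might look like the following. This is a minimal sketch on simulated data with k = 2 failure types, not the authors' program; all variable names are hypothetical, and the cumulative incidence option mentioned above is not reproduced.

```stata
* Simulated competing-risks data: cause 0 = censored, 1 and 2 = failure types
clear
set seed 1995
set obs 300
generate long id = _n
generate double x = rnormal()
generate double time = -ln(runiform())/0.2
generate double u = runiform()
generate byte cause = cond(u < 0.3, 0, cond(u < 0.65, 1, 2))
drop u

* Data augmentation: one record per subject per failure type
expand 2
bysort id: generate byte type = _n
generate byte event = (cause == type)
generate double x_type2 = x*(type == 2)      // lets the effect of x differ by type
generate byte type2 = (type == 2)            // failure-type indicator

stset time, failure(event)

* Stratified analysis: a separate baseline hazard for each failure type
stcox x x_type2, strata(type) vce(cluster id)

* Unstratified analysis: baseline hazards proportional between failure types
stcox x x_type2 type2, vce(cluster id)
```

Clustering on id is one way to allow for each subject contributing k records; in the unstratified fit, the coefficient on type2 is the log of the constant hazard ratio between failure types.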
Lunn, M. and D. McNeil. 1995. Applying Cox regression to competing risks.
Biometrics 51: 524–532.
Twin & Genetic Epidemiology Research Unit, Department of Medicine,
St Thomas' Hospital
Searches for genes using linkage analyses with genetic markers placed across
the entire human genome are hypothesis-free experiments, which represent an
extreme form of multiple testing. As such, the low p-values required to obtain
nominal significance make accurate diagnostics essential to assess model fit
and to eliminate naive incorrect results. In hypothesis-driven single tests,
researchers usually take good care to assess model fit and the validity of
model assumptions, but such concerns are usually ignored when it comes to
linkage analysis. This is particularly problematic because low thresholds
(p < 0.0001) can result in extreme sensitivity to outlying observations and,
for some models (e.g., standard variance component analysis), greater
sensitivity to violation of model assumptions.
Here, we attempt to address these problems for genomic data based on 1,300
healthy sib-pairs (dizygotic twins) using modified Haseman–Elston
regression-based linkage analysis for quantitative traits, in which sib-pair
phenotypic covariance is correlated with genetic marker covariance. The
statistical theory underpinning the implementation of tests for linkage using
generalized linear models (GLM) (glm in Stata) is documented in
detail elsewhere. In brief, the advantage of analyzing sib-pairs using GLM is
that the approach shares all of the strengths of OLS and variance components,
but none of their weaknesses: (1) unlike OLS, the residual
errors are correctly specified with a gamma distribution and known
heteroskedasticity is accounted for; (2) unlike standard variance components,
GLM freely estimates the coefficient of variation and so is robust to phenotypic
deviations from multivariate normality.
Just as important are the practical advantages. With the release of
Stata 8/SE for large datasets, we have been able to store and check
genetic markers for all 22 pairs of autosomal chromosomes plus the sex
chromosomes. In addition, we have generated 2-point
and multipoint allele-sharing identical by descent (IBD) elsewhere and imported
this into Stata. Using Stata scripts with a simple loop structure that calls
on the glm command, we are able to perform genome-wide scans and save
any summary statistics to file. We have been able to utilize the following
features in Stata:
1. correct diagnostics on a genome-wide basis that are not normally made
available to users of applied linkage packages
2. robust estimates of significance, such as Huber sandwich estimates,
bootstrap routines, permutation tests, etc.
3. probability weighting to utilize the full probability distribution of the
number of alleles shared IBD
4. fast computation that is easy to implement
Finally, we also can perform basic, but powerful bioinformatics tasks such as:
1. using the xpose command to summarize marker information by
chromosome and sib-pair
2. resolving marker order more accurately, which is essential for
correct multipoint IBD generation, by interpolating genetic
distance using the latest physical and genetic marker maps
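The "simple loop structure" for a genome-wide scan might be sketched as below. Everything here is illustrative: the data are simulated noise, only five markers are used, the variable names (sqdiff, ibd1 to ibd5) and the output file name scan_results are hypothetical, and the log link is an assumption rather than the specification documented by the authors.

```stata
* Simulated sib-pair data: a positive trait outcome and IBD sharing at 5 markers
clear
set seed 26
set obs 200
generate double sqdiff = -ln(runiform())     // positive outcome (exponential draw)
forvalues m = 1/5 {
    generate double ibd`m' = runiform()      // hypothetical IBD-sharing proportion
}

* Loop over markers, fit a gamma GLM, and post summary statistics to a file
tempname scan
postfile `scan' marker b se p using scan_results, replace
forvalues m = 1/5 {
    quietly glm sqdiff ibd`m', family(gamma) link(log)
    local z = _b[ibd`m']/_se[ibd`m']
    post `scan' (`m') (_b[ibd`m']) (_se[ibd`m']) (2*normal(-abs(`z')))
}
postclose `scan'

* The saved results are themselves a dataset, ready for listing or plotting
use scan_results, clear
list, clean
```

Robust or bootstrap standard errors could be requested inside the loop through glm's vce() option without changing the structure.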
Barber, M. J., H. J. Cordell, A. J. MacGregor, and T. Andrew. 2004.
Gamma regression improves Haseman-Elston and variance components
linkage analysis for sib-pairs. Genetic Epidemiology 26(2):
Paul T. Seed
Dept of Obstetrics & Gynaecology, GKT School of Medicine, King's College, London
A diagnostic test is typically used because it is cheaper, quicker, or less
invasive than the reference standard, but it may not be as reliable.
Diagnostic tests are evaluated against a reference standard (sometimes
called the "gold standard"), regarded as completely accurate.
Commands diagt and diagti have been developed to evaluate binary
tests and provide all the standard measures of performance (including
sensitivity, specificity, likelihood ratios, and predictive values), with
appropriate confidence intervals. A prevalence option adjusts for
different case-mix, and evaluates the test result for a particular patient
with known pre-test risk.
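The standard measures have simple definitions in terms of the 2×2 table of test result against reference standard. The sketch below uses assumed notation (TP, FP, FN, TN for the four cell counts and $\pi$ for the pre-test risk, or prevalence) and shows the usual way of carrying sensitivity and specificity over to a different pre-test risk; whether the commands compute the adjustment in exactly this form is not stated in the abstract.

```latex
\text{sens} = \frac{TP}{TP + FN}, \qquad
\text{spec} = \frac{TN}{TN + FP}, \qquad
LR^{+} = \frac{\text{sens}}{1 - \text{spec}}, \qquad
LR^{-} = \frac{1 - \text{sens}}{\text{spec}},
\\[4pt]
PPV(\pi) = \frac{\text{sens}\cdot\pi}{\text{sens}\cdot\pi + (1 - \text{spec})(1 - \pi)}, \qquad
NPV(\pi) = \frac{\text{spec}\,(1 - \pi)}{\text{spec}\,(1 - \pi) + (1 - \text{sens})\,\pi} .
```

Equivalently, the post-test odds after a positive result equal the pre-test odds $\pi/(1-\pi)$ multiplied by $LR^{+}$.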
The use of ROC curves for ordered categorical and continuous data will be
considered, in particular the determining of a suitable cutoff value.
Where the distribution of a continuous measure can be adequately modeled,
the likelihood ratio can be used to determine the absolute risk of an
Appropriate Stata commands for these analyses will be demonstrated.
William W. Gould
StataCorp, College Station, TX
Bill Gould, who is President of StataCorp, and, more
importantly for this meeting, the head of development,
will ruminate about work at Stata over the last year
and about ongoing activity.
Nicholas J. Cox, Durham University
Patrick Royston, MRC Clinical Trials Unit
Timberlake Consultants, the official distributor
of Stata in the United Kingdom, Ireland, Spain, and Portugal.