Last updated: 7 June 2002
2002 UK Stata Users Group meeting
20–21 May 2002
Royal Statistical Society
12 Errol Street
London EC1Y 8LX
Creating plots and tables of estimation results using parmest and friends
Department of Public Health Medicine, King's College, London
Statisticians make their living mostly by producing confidence intervals and
p-values. However, those supplied in the Stata log are not in any fit
state to be delivered to the end user, who usually at least wants them
tabulated and formatted, and may appreciate them even more if they are
plotted on a graph for immediate impact. The parmest package was
developed to make this easy, and consists of two programs. These are
parmest, which converts the latest estimation results to a data set
with one observation per estimated parameter and data on confidence
intervals, p-values and other estimation results, and parmby,
a "quasi-byable" front end to parmest, which is like
statsby, but creates a data set with one observation per parameter
per by-group instead of a data set with one observation per by-group. The
parmest package can be used together with a team of other Stata
programs to produce a wide range of tables and plots of confidence intervals
and p-values. The programs descsave and factext can
be used with parmby to create plots of confidence intervals against
values of a categorical factor included in the fitted model, using dummy
variables produced by xi or tabulate. The user may easily
fit multiple models, produce a parmby output data set for each one,
and concatenate these output data sets using the program dsconcat to
produce a combined data set, which can then be used to produce tables or
plots involving parameters from all the models. For instance, the user might
tabulate or plot unadjusted and adjusted regression parameters side by side,
together with their confidence limits and/or p-values. The
parmest team is particularly useful when dealing with large volumes
of results derived from multiple multi-parameter models, which are
especially common in epidemiology.
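The idea of a data set with one observation per estimated parameter can be sketched outside Stata. The Python fragment below is only an illustration, not parmest itself: the function name parameter_table, the dictionary keys and the numbers are all hypothetical, and the confidence limits and p-values assume a normal (Wald) approximation.

```python
from statistics import NormalDist


def parameter_table(estimates, std_errors, level=0.95):
    """Return one row (a dict) per parameter with z, p-value and confidence limits."""
    std_normal = NormalDist()
    crit = std_normal.inv_cdf(0.5 + level / 2)  # two-sided critical value
    rows = []
    for name, est in estimates.items():
        se = std_errors[name]
        z = est / se
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided Wald p-value
        rows.append({"parm": name, "estimate": est, "stderr": se, "z": z,
                     "p": p, "min": est - crit * se, "max": est + crit * se})
    return rows


# Hypothetical estimates and standard errors (made-up numbers, for illustration only).
table = parameter_table({"age": 0.30, "sex": -0.10}, {"age": 0.10, "sex": 0.20})
for row in table:
    print("{parm:<6}{estimate:8.3f}{min:8.3f}{max:8.3f}{p:8.4f}".format(**row))
```

Once results are held as rows of this kind, tables from several models can be concatenated and formatted or plotted together, which is the workflow the abstract describes.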
Simulation studies comparing different genetic methodologies using Stata
Department of Epidemiology, Emory University
In a recent genetic case-control study of myocardial infarction (MI), cases'
children were used as controls. That paper described one method to analyze
such data. We describe two other methods for analyzing such data and compare
the three methods by simulation using Stata. Each subject is classified
according to three genotypes, MM, MN and NN, where M is the mutant allele and
N is the normal allele. The probability that a subject has each genotype
depends on the population allele frequency P, the relative risk of
disease for the MN genotype compared with the NN genotype
R1, and the relative risk for the MM genotype compared with
the NN genotype R2. The first analytic method ignores the
case/child pairings, the second method does not, and the third method
considers P a nuisance parameter and eliminates it by conditioning.
We randomly generated either 200 or 300 case/child pairs for various values of
P, R1, and R2. We generated 1,000
data sets and applied each of the three methods. All analyses were based on
likelihood procedures and were implemented using the maximum likelihood
(ml) procedure. The standard errors of MLEs from each method were
compared. We estimated power by comparing the likelihood of the full model with
the likelihood under the constraints R1 = 1 and R2 = 1,
using Stata's lrtest, and counting the number of
the 1,000 simulations that led to rejection of the null hypothesis. For
the simulations done under the null hypothesis, we counted the number of times
the null hypothesis was rejected and compared this number with an expectation
of 50 using an exact binomial test.
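The Type I error check described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the rejection count of 62 is made up, and the two-sided p-value is defined as the total probability of all outcomes no more likely than the observed one, which is one common convention for an exact binomial test.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials (0 < p < 1), computed via logs."""
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log(1 - p))


def exact_binomial_p(k, n, p):
    """Two-sided exact p-value: sum the probabilities of outcomes no more likely than k."""
    cutoff = binom_pmf(k, n, p) * (1 + 1e-7)  # small tolerance for rounding error
    total = sum(pm for pm in (binom_pmf(i, n, p) for i in range(n + 1)) if pm <= cutoff)
    return min(1.0, total)


# Hypothetical result: 62 rejections in 1,000 null simulations at the 5% level,
# where 50 rejections are expected.
rejections = 62
p_value = exact_binomial_p(rejections, 1000, 0.05)
print(f"rejections = {rejections}, exact two-sided p = {p_value:.4f}")
```

A small p-value here would suggest that the method's Type I error rate differs from the nominal 5%.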
The simulations showed that all methods provide unbiased estimates in
populations with a homogeneous P and have an appropriate Type I error
rate. The method based upon case/child pairs was generally more powerful than
the other two methods. In populations made up of sub-populations with different
values of P, only the conditional approach is unbiased, although the simulations
showed that method 2 was robust.
This paper illustrates the utility of using Stata for simulation studies
comparing different analytic approaches in case association studies of
genetics. It also illustrates how useful simulation studies can be in
estimating power. Stata is very well suited for simulation studies because of
its speed, the ease of posting the simulation findings, and its
Graphics before and after model fitting
Nicholas J. Cox,
Department of Geography, University of Durham
It is commonplace to compute various flavours of residual and predicted values
after fitting many different kinds of model. This allows production of a great
variety of diagnostic graphics, used to examine the general and specific fit
between data and model and to seek possible means of improving the model.
Several different graphs may be inspected in many modelling exercises, partly
because each kind may be best for particular purposes, and partly because in
many analyses a variety of models - in terms of functional form, choice of
predictors, and so forth - may be entertained, at least briefly. It is
therefore helpful to be able to produce such graphs very rapidly.
Official Stata supplies as built-ins a bundle of commands originally written
for use after regress: avplot, avplots,
cprplot, acprplot, lvr2plot, rvfplot and
rvpplot. These were introduced in Stata 3.0 in 1992 and are
documented at [R] regdiag. More recently, in an update to Stata 7.0 on
6 September 2001, all but the first two have been modified so that they may be
used after anova. Despite their many uses, this suite omits some
very useful kinds of plot, while none of the commands may be used after other
The presentation focuses on a new set of commands, which are biased to
graphics useful for models predicting continuous response variables. The
ideal, approachable asymptotically, is to make minimal assumptions about which
modelling command has been issued previously. The down-side for users is that
if the data and the previous model results do not match the assumptions, it is
possible to get either bizarre results or an error message.
The commands that have been written include the following.
anovaplot plots fitted or predicted values from an immediately previous one-, two-, or
three-way anova. By default the data for the response are also
plotted. In particular, anovaplot can show interaction plots.
indexplot plots estimation results (by default whatever
predict produces by default) from an immediately previous
regress or similar command versus a numeric index or identifier
variable, if that is supplied, or observation number, if that is not supplied.
Values are shown, by default, as vertical spikes starting at 0.
ovfplot plots observed vs fitted or predicted values for the response
from an immediately previous regress or similar command, with by
default a line of equality superimposed.
qfrplot plots quantile plots of fitted values, minus their mean, and
residuals from the previous estimation command. Fitted values are whatever
predict produces by default and residuals are whatever predict,
res produces. Comparing the distributions gives an overview of their
variability and some idea of their fine structure. By default plots are
side-by-side. Quantile plots may be observed vs normal (Gaussian).
rdplot graphs residual distributions. The residuals are, by default,
those calculated by predict, residuals or (if the previous estimation
command was glm) by predict, response. The graph by default
is a single or multiple dotplot, as produced by dotplot: histograms
or box plots may be selected by specifying either the histogram or
the box option.
regplot plots fitted or predicted values from an immediately previous
regress or similar command. By default the data for the response are
also plotted. With one syntax, no varname is specified.
regplot shows the response and predicted values on the y axis
and the covariate named first in the regress or similar command on
the x axis. Thus with this syntax the plot shown is sensitive to the
order in which covariates are specified in the estimation command. With
another syntax, a varname is supplied, which may name any numeric
variable. This is used as the variable on the x axis. Thus in practice
regplot is most useful when the fitted values are a smooth function
of the variable shown on the x axis, or a set of such functions given
also one or more dummy variables as covariates. However, other applications
also arise, such as plotting observed and predicted values from a time series
model versus time.
rvfplot2 graphs a residual-versus-fitted plot, a graph of the
residuals versus the fitted values. The residuals are, by default, those
calculated by predict, residuals or (if the previous estimation
command was glm) by predict, response. The fitted values are
those produced by predict by default after each estimation command.
rvfplot2 is offered as a generalisation of rvfplot in
Standardising anthropometric measures in children and adolescents with new
extensions to egen
Suzanna Vidmar, Kylie Hesketh, and John Carlin,
Murdoch Childrens Research Institute and University of Melbourne
Comparing crude anthropometric data from children of different ages is
complicated by the fact that children are still growing (we do not expect the
height of a 5-year-old to be the same as the height of a 10-year-old!).
Clinicians and researchers are often interested in the question "is this
child taller, shorter, or about average compared to other children their
age?" Two sets of population-based reference data are now widely used to
address this question: the 1990 British Growth Reference and the Centers for
Disease Control and Prevention (CDC) Growth Reference in the U.S. Both
references tabulate values obtained by the LMS method (Cole, Eur. J. Clin.
Nutr., 1990; Cole and Green, Statistics in Medicine, 1992) that
can be used to transform crude data to standard deviation (z) scores,
which are standardised to the reference population. The LMS transformation
reduces right skew and adjusts for physiological changes in anthropometric
measures that occur with age. Stata provides a convenient environment in
which to apply the age-specific (or height-specific) LMS values and generate
z-scores for each child in a dataset, using the egen command.
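For reference, the LMS method converts a measurement y to a z-score using three age- and sex-specific values: L (the Box-Cox power), M (the median) and S (the coefficient of variation). The score is z = ((y/M)^L - 1)/(L S) when L is not zero, and z = ln(y/M)/S when L is zero. A minimal Python sketch follows. It is not the egen implementation, and the L, M and S values in the example are made up rather than taken from either growth reference.

```python
from math import log


def lms_zscore(y, L, M, S):
    """z-score of measurement y given the LMS values for the child's age and sex."""
    if y <= 0 or M <= 0 or S <= 0:
        raise ValueError("y, M and S must be positive")
    if abs(L) < 1e-12:
        # Limiting case of the Box-Cox transformation as L tends to zero.
        return log(y / M) / S
    return ((y / M) ** L - 1) / (L * S)


# Hypothetical LMS values (NOT from the British or CDC reference tables).
z = lms_zscore(y=18.5, L=-1.6, M=16.0, S=0.11)
print(f"z-score = {z:.2f}")
```

In practice the L, M and S values would be looked up, or interpolated, from the reference table for each child's age and sex before the formula is applied.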
New functions of egen have been developed to allow transformation of
crude child anthropometric data to z-scores using the LMS method and
the reference data available from the British Growth Reference and the CDC
Growth Reference. Recently the Childhood Obesity Working Group of the
International Obesity Taskforce recommended use of BMI cut-off points to
categorise children as normal weight, overweight or obese based on age, gender
and BMI. An additional function of egen allows for children to be
categorised according to these international cut-off points. This talk will
provide brief background on growth standards and the LMS method, and describe
in detail how the new egen commands were created, with examples of
Using Stata at the Office for National Statistics: an overview
Matthew Barnes, Office for National Statistics
We describe the use of Stata in the UK Office for National Statistics (ONS).
We give examples of projects Stata has been used for that have fed into UK
official statistics. In addition we show some of the longer term research
under way at ONS that is using Stata to analyse linked business data sets.
Results from such work are fed into the evidence base for government policy.
Primarily we discuss work in the Economic Analysis and Satellite Accounts
Division of the Economic Statistics Directorate within ONS.
A note on ownership and productivity in UK businesses
Ralf Martin and Chiara Criscuolo, Office for National Statistics, CeRiBA, and London School of Economics
A series of studies in a number of countries have found that foreign-owned
firms are more productive than domestic firms. However, almost all this work
compares foreign firms — which are, by definition, multinationals
— with all domestic firms. This paper analyses for the first time in the
UK the relative productivity performance of foreign-owned manufacturing firms
and UK manufacturing firms split into UK Multinationals and UK pure domestic
firms. This was not possible before because none of the datasets used for
productivity analysis distinguished between domestic multinational and
non-multinational firms. We are able to make such a distinction. Our results
suggest that the foreign productivity advantage is by and large a
multinational effect. US multinational firms, however, seem to maintain a
productivity advantage with respect to both other foreign-owned firms and
domestic multinational firms. Interestingly, this UK result mirrors and
extends results for the US by Doms and Jensen. We use Stata to deal with
changes in raw data files in a consistent manner and to link together data
about the same units collected at different levels of aggregation in a
sensible way in order to do econometric analysis.
The Stata Technical Bulletin and the Stata Journal: editors' report
Joe Newton, Department of Statistics, Texas A&M University
Nicholas J. Cox, Department of Geography, University of Durham
The Stata Technical Bulletin (STB) started publication in
March 1991 and ceased in May 2001, after 61 bimonthly issues. It has been
succeeded by the Stata Journal (SJ), of which two quarterly
issues have so far appeared, 1(1) for the last quarter of 2001 and 2(1) for
the first of 2002. Although published by StataCorp, the SJ is
controlled by an international board including the Editor and Executive Editor
and 18 Associate Editors.
We believe that the STB was a great success, but by 2001 there was a
need for fairly radical change in its content and format. Its role in making
available new Stata programs and documentation, whether written by users or by
StataCorp, has largely been superseded by easy and rapid use of the Internet.
The SJ continues to be a vehicle for distributing valuable new
programs, but it will carry more, and more substantial, expository articles on
statistics, data management and graphics using Stata. The SJ is also
now a reviewed journal, which we believe is important both for its
contributors and for its readers. Finally, the SJ has been redesigned
and is now printed on better paper and in more durable covers.
We will talk briefly about the transition from the STB to the
SJ. Comments and questions about the SJ will be most welcome.
Programmable GLM: a collection of case studies
Roberto G. Gutierrez, StataCorp
With the release of Stata 7, the capabilities of glm were greatly
enhanced. Among the improvements was the ability for users to program their
own custom link and variance functions. Whereas previously glm was
used primarily as a platform on which to compare the results of standard
regression models (such as the logistic, probit, and Poisson), it may now be
utilised to perform generalized maximum pseudo-likelihood estimation in any
framework. Thus far, this has been an ability that for the most part has not
The method by which user-defined links and variance functions may be
incorporated is quite straightforward, as demonstrated in the companion text
to glm by Hardin and Hilbe (2001). In this talk, I present a few
examples of case studies from the literature where the science dictated the
fitting of a generalized linear model with special (non-standard) link and/or
variance function. I demonstrate how these models (which were typically fit
using SAS's GENMOD procedure) may be fit using Stata.
Hardin, J. and J. Hilbe. 2001. Generalized linear models and
extensions. Stata Press, College Station, TX.
BCa bootstrap confidence intervals
James Carpenter, Medical Statistics Unit, London School of Hygiene and Tropical Medicine
Patrick Royston, MRC Clinical Trials Unit, London
The existing Stata command bstrap takes a user-defined program and
calculates normal approximation, percentile and bias-corrected percentile
bootstrap confidence intervals. However, these intervals are not the most
accurate available. In this article, we describe a new command, bci,
which has a syntax similar to that of bstrap, but which additionally
calculates the more accurate BCa bootstrap confidence interval, as well as the
so-called 'basic' bootstrap confidence interval.
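The BCa (bias-corrected and accelerated) interval adjusts the percentile points of the bootstrap distribution using two quantities: a bias correction z0, taken from the share of bootstrap replicates below the original estimate, and an acceleration a, estimated here from jackknife (leave-one-out) values. The Python sketch below illustrates the calculation for a generic statistic. It is not the bci command: the function name, the simple quantile rule and the example data are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean


def bca_interval(data, stat, n_boot=2000, level=0.95, seed=1):
    """BCa bootstrap confidence interval for stat(data); data must be a list."""
    rng = random.Random(seed)
    n = len(data)
    norm = NormalDist()
    theta_hat = stat(data)

    # Bootstrap replicates of the statistic, sorted for quantile lookup.
    reps = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                  for _ in range(n_boot))

    # Bias correction z0, clamped so that inv_cdf never receives 0 or 1.
    share_below = sum(r < theta_hat for r in reps) / n_boot
    share_below = min(max(share_below, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = norm.inv_cdf(share_below)

    # Acceleration a from the skewness of the jackknife values.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = mean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted_quantile(alpha):
        z = norm.inv_cdf(alpha)
        adj = norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        index = min(max(int(round(adj * (n_boot - 1))), 0), n_boot - 1)
        return reps[index]

    alpha = (1 - level) / 2
    return adjusted_quantile(alpha), adjusted_quantile(1 - alpha)


# Made-up skewed sample, for illustration only.
sample_rng = random.Random(2002)
sample = [sample_rng.expovariate(1.0) for _ in range(30)]
lower, upper = bca_interval(sample, mean)
print(f"mean = {mean(sample):.3f}, 95% BCa interval = ({lower:.3f}, {upper:.3f})")
```

When z0 and a are both zero, the adjusted points reduce to the ordinary percentile interval, which shows how the BCa interval extends the intervals already available.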
The use of fractional polynomials to model interactions between treatment and
continuous covariates in clinical trials
Patrick Royston, MRC Clinical Trials Unit, London
W. Sauerbrei, IMBI, University Hospital of Freiburg
We consider modelling and testing for 'interaction' between a continuous
covariate X and a categorical covariate C in a regression model.
Here C represents two treatment arms in a parallel-group clinical trial
and X is a prognostic factor which may influence response to treatment.
Usually X is categorised into groups according to cut-point(s) and the
interaction is analysed in a model with main effects and multiplicative terms.
A trend test of the effect of C over the ordered categories from
X may be performed and is likely to have better power. The cut-point
approach raises several well-known and difficult issues for the analyst,
including dependency of the results on the choice of cut-point, loss of power
due to categorisation, and the danger of 'over-fitting' if several cut-points
are considered in a search for 'optimality' (Altman et al., 1994).
We will describe an approach to avoid such problems based on fractional
polynomial (FP) modelling of X, without categorisation, overall and at
each level of C (Royston and Sauerbrei, 2002). The first step is to
construct a multivariable adjustment model which may contain binary covariates
and FP transformations of continuous covariates other than X. The
second step involves FP modelling of X within the adjustment model.
Stata software to fit the models will be demonstrated using example datasets,
mainly from cancer studies. The examples show the power of the approach in
detecting and displaying interactions in real data from randomised controlled
trials with a survival-time outcome.
Altman, D. G., B. Lausen, W. Sauerbrei, M. Schumacher. 1994. The dangers of
using 'optimal' cutpoints in the evaluation of prognostic factors. Journal
of the National Cancer Institute 86: 829–835.
Royston, P. and W. Sauerbrei. 2002. A new approach to modelling interactions
between treatment and continuous covariates in clinical trials by using
fractional polynomials. Statistics in Medicine, to be submitted.
Applying the Cox proportional hazard regression model to competing risks
Mohamed Ali, Department of Epidemiology and Public Health, London School of Hygiene and Tropical Medicine
Abdel Babiker, MRC Clinical Trials Unit, London
In the presence of dependent competing risks in survival analysis, the Cox
proportional hazard model can be utilised to examine the covariate effects on
the cause-specific hazard function for each type of failure. The use of the
Cox model was proposed by Lunn and McNeil (1995). Their method requires data
augmentation. With k failure types, the data would be duplicated
k times, one record for each failure type. Either a stratified or an
unstratified analysis could be used, depending on whether the assumption of
proportional hazards holds. If the proportional hazards assumption does not
hold across the causes, the stratified analysis should be used, which is
equivalent to fitting a separate model for each failure type. The unstratified
analysis assumes a constant hazard ratio between failure types and this could
be fitted by including an indicator variable as a covariate.
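The data augmentation step can be sketched as follows. This Python fragment is only an illustration of the duplication described above, not the authors' program: the function name, the field names and the example records are hypothetical, and a cause of 0 is assumed to mean a censored observation.

```python
def augment_competing_risks(records, n_types):
    """Duplicate each subject's record once per failure type.

    records: list of (subject_id, time, cause), where cause is 0 for a
    censored observation or 1..n_types for the observed failure type.
    """
    rows = []
    for subject_id, time, cause in records:
        for failure_type in range(1, n_types + 1):
            rows.append({
                "id": subject_id,
                "time": time,
                "type": failure_type,
                # The event indicator is 1 only on the copy whose type matches
                # the observed cause; all other copies are treated as censored.
                "event": int(cause == failure_type),
            })
    return rows


# Made-up records: subject 1 fails from cause 1, subject 2 is censored,
# subject 3 fails from cause 2.
records = [(1, 5.0, 1), (2, 3.2, 0), (3, 7.1, 2)]
rows = augment_competing_risks(records, n_types=2)
for row in rows:
    print(row)
```

In the augmented data, the stratified analysis would stratify on the type field, while the unstratified analysis would include it as an indicator covariate.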
We will show how both approaches could be fitted on augmented data using
stcox. In addition to the parameter estimates and their standard
errors, the program has an option to produce cumulative incidences with
pointwise confidence intervals.
Lunn, M. and D. McNeil. 1995. Applying Cox regression to competing risks.
Biometrics 51: 524–532.
Adjusting for measurement error in explanatory variables
Ian White, Chris Frost, and Shoji Tokunaga, MRC Biostatistics Unit, Cambridge
In epidemiology, measurement error or within-individual variation in exposures
and confounders leads to attenuated effect estimates and inadequate control of
confounding. Adjustment for measurement error is possible if its magnitude may
be estimated from supplementary information, typically replicate measurements
or partial observation of an error-free value. Methods currently available in
Stata, including eivreg and ivreg, are of limited usefulness in
epidemiology, and regression calibration is more commonly used. I will
describe a new program, regcal, which implements regression
calibration, and incorporates a modification for replicated discrete variables
when the assumptions underlying regression calibration do not hold. This is
illustrated using observational data on the association between cholesterol
levels and green tea consumption.
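The attenuation and its correction can be illustrated in the simplest setting: a linear model with one continuous exposure measured twice per subject with independent error. In that case regression calibration amounts to dividing the naive slope by the reliability ratio estimated from the replicates. The Python sketch below uses simulated data and hypothetical function names. It is not the regcal program and does not cover confounders or the modification for discrete variables.

```python
import random


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def covariance(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def calibrated_slope(y, w1, w2):
    """Naive and corrected slopes of y on w1, using replicate w2 to estimate error."""
    n = len(y)
    # For independent errors, E[(w1 - w2)^2] is twice the error variance.
    error_var = sum((a - b) ** 2 for a, b in zip(w1, w2)) / (2 * n)
    total_var = variance(w1)
    reliability = 1 - error_var / total_var
    naive = covariance(y, w1) / total_var
    return naive, naive / reliability


# Simulated data (assumption): true slope 2, error variance equal to the
# variance of the true exposure, so the naive slope is attenuated towards 1.
rng = random.Random(42)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
w1 = [xi + rng.gauss(0, 1) for xi in x]
w2 = [xi + rng.gauss(0, 1) for xi in x]
y = [2 * xi + rng.gauss(0, 1) for xi in x]
naive, corrected = calibrated_slope(y, w1, w2)
print(f"naive slope = {naive:.3f}, corrected slope = {corrected:.3f}")
```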
Further steps towards a graphical interface for making tables and estimating
Michael Hills, London School of Hygiene and Tropical Medicine
At the last users' group meeting I presented a GI command for estimating
effects using linear models. The command relied on a classification of
variables according to their function in the linear model, and the user had to
select, from a list, the particular regression command to be used (e.g.
logistic). In this update I have concentrated on four types of response
variable (binary, metric, failure, count) which in turn imply which regression
command to use, so there is no longer a need to know which command is
appropriate. The same classification is used in a companion GI command to
present tables of means, medians, proportions, odds, or rates. Each of these
GI commands is capable of displaying an equivalent command which does not use
To obtain the ado-files covered in this meeting, in Stata type
. net from http://fmwww.bc.edu/repec/bocode/e/
. net describe effects
. net install effects
. net get effects
How to face lists with fortitude (Tutorial)
Nicholas J. Cox, Department of Geography, University of Durham
Among various structures in Stata for cycling through lists (whether lists of
variable names, numbers, or arbitrary strings) are foreach and
forvalues, introduced in Stata 7 in 2001, and for, introduced in
Stata 3.1 in 1992, and revised in 5.0 (1997) and 6.0 (1999). Typically, each
member of the list supplied is substituted in turn in one or more commands
within the structure being used.
This is a tutorial specifically designed for Stata users who do little or no
Stata programming. Despite being labelled as programming commands, these
structures have many uses either interactively or within do files and
help impart both speed and system to repetitive tasks.
One prerequisite for understanding foreach and forvalues is the
idea of a local macro, of which the most difficult part is understanding the
strange name. With that idea assimilated, it is relatively easy to see how
foreach and forvalues can be used in a large variety of
problems. Despite being newer than for in Stata, these structures are
recommended over for to new users, or to more experienced users who
have made little or no use of any of these to date. For completeness, there
is also comparison with for, and some comments on their relative
Report to users
William W. Gould, StataCorp
Bill Gould, who is President of StataCorp, and more importantly for this
meeting, the head of development, will ruminate about work at Stata over the
last year and about ongoing activity.
Nicholas J. Cox, Durham University
Patrick Royston, MRC Clinical Trials Unit
Timberlake Consultants, the official distributor
of Stata in the United Kingdom.