Last updated: 7 June 2002

2002 UK Stata Users Group meeting

20–21 May 2002


Royal Statistical Society
12 Errol Street
London EC1Y 8LX


Creating plots and tables of estimation results using parmest and friends

Roger Newson, Department of Public Health Medicine, King's College, London


Statisticians make their living mostly by producing confidence intervals and p-values. However, those supplied in the Stata log are not in any fit state to be delivered to the end user, who usually at least wants them tabulated and formatted, and may appreciate them even more if they are plotted on a graph for immediate impact. The parmest package was developed to make this easy, and consists of two programs. These are parmest, which converts the latest estimation results to a data set with one observation per estimated parameter and data on confidence intervals, p-values and other estimation results, and parmby, a "quasi-byable" front end to parmest, which is like statsby, but creates a data set with one observation per parameter per by-group instead of a data set with one observation per by-group. The parmest package can be used together with a team of other Stata programs to produce a wide range of tables and plots of confidence intervals and p-values. The programs descsave and factext can be used with parmby to create plots of confidence intervals against values of a categorical factor included in the fitted model, using dummy variables produced by xi or tabulate. The user may easily fit multiple models, produce a parmby output data set for each one, and concatenate these output data sets using the program dsconcat to produce a combined data set, which can then be used to produce tables or plots involving parameters from all the models. For instance, the user might tabulate or plot unadjusted and adjusted regression parameters side by side, together with their confidence limits and/or p-values. The parmest team is particularly useful when dealing with large volumes of results derived from multiple multi-parameter models, which are particularly common in the world of epidemiology.
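
The idea of one observation per estimated parameter can be illustrated outside Stata. The following Python sketch is not parmest itself: the function name, the field names and the use of normal-theory (Wald) intervals are all assumptions made for illustration.

```python
from statistics import NormalDist


def parameter_table(estimates, std_errors, level=0.95):
    """Return one row (dict) per parameter: estimate, SE, z, p-value, limits.

    Assumption: Wald-type intervals and two-sided p-values based on the
    standard normal distribution. Field names are illustrative only.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for level=0.95
    rows = []
    for name, est in estimates.items():
        se = std_errors[name]
        z = est / se
        rows.append({
            "parm": name,
            "estimate": est,
            "stderr": se,
            "z": z,
            "p": 2 * (1 - nd.cdf(abs(z))),
            "lower": est - z_crit * se,
            "upper": est + z_crit * se,
        })
    return rows


if __name__ == "__main__":
    # Hypothetical estimates, not taken from any real model.
    for row in parameter_table({"age": 0.12, "sex": -0.40},
                               {"age": 0.05, "sex": 0.25}):
        print(row)
```

Rows from several models could then be concatenated into one list and tabulated or plotted together, which mirrors the role the abstract gives to dsconcat.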



Simulation studies comparing different genetic methodologies using Stata

Harland Austin, Department of Epidemiology, Emory University


In a recent genetic case-control study of myocardial infarction (MI), cases' children were used as controls. That paper described one method to analyze such data. We describe two other methods for analyzing such data and compare the three methods by simulation using Stata. Each subject is classified according to three genotypes, MM, MN and NN, where M is the mutant allele and N is the normal allele. The probability that a subject has each genotype depends on the population allele frequency P, the relative risk of disease for the MN genotype compared with the NN genotype R1, and the relative risk for the MM genotype compared with the NN genotype R2. The first analytic method ignores the case/child pairings, the second method does not, and the third method considers P a nuisance parameter and eliminates it by conditioning.
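
As an illustration of how genotype probabilities among cases can depend on P, R1 and R2, the Python sketch below assumes Hardy-Weinberg proportions in the source population. That assumption is not stated in the abstract, and the function name is hypothetical. The sketch covers cases only; it does not model the dependence between a case's genotype and the child's.

```python
def case_genotype_probs(p, r1=1.0, r2=1.0):
    """Genotype probabilities among cases.

    Assumption (not from the abstract): the population is in Hardy-Weinberg
    equilibrium, so genotype frequencies are p^2, 2p(1-p) and (1-p)^2.
    Each frequency is weighted by the relative risk for that genotype
    (r2 for MM, r1 for MN, 1 for NN) and the weights are renormalised.
    """
    weights = {
        "MM": r2 * p * p,
        "MN": r1 * 2 * p * (1 - p),
        "NN": (1 - p) ** 2,
    }
    total = sum(weights.values())
    return {genotype: w / total for genotype, w in weights.items()}


if __name__ == "__main__":
    print(case_genotype_probs(0.5))            # no effect: 0.25, 0.5, 0.25
    print(case_genotype_probs(0.1, 1.5, 3.0))  # hypothetical relative risks
```

With R1 = R2 = 1 the function returns the population frequencies, which is the null situation the simulations compare against.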

We randomly generated either 200 or 300 case/child pairs for various values of P, R1, and R2. We generated 1,000 data sets and applied each of the three methods. All analyses were based on likelihood procedures and were implemented using the maximum likelihood (ml) procedure. The standard errors of MLEs from each method were compared. We estimated power by comparing the likelihood of the full model with the likelihood under the constraints that R1 and R2 both equal 1, using Stata's lrtest, and counting the number of the 1,000 simulations that led to rejection of the null hypothesis. For the simulations done under the null hypothesis, we counted the number of times the null hypothesis was rejected and compared this number with an expectation of 50 using an exact binomial test.
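
The Type I error check can be sketched as follows. An expectation of 50 rejections in 1,000 simulations corresponds to a nominal 5% level. The Python function below is one common definition of a two-sided exact binomial p-value (summing the probabilities of all outcomes no more likely than the one observed); whether the authors used this definition is an assumption, and the function names are hypothetical.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials, computed on the log scale."""
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log(1 - p))


def exact_binomial_p(k, n, p):
    """Two-sided exact p-value: total probability of outcomes that are
    no more likely than the observed count k (small tolerance for rounding)."""
    observed = binom_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        prob = binom_pmf(i, n, p)
        if prob <= observed * (1 + 1e-7):
            total += prob
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical rejection counts out of 1,000 null simulations.
    for rejections in (50, 62, 80):
        print(rejections, exact_binomial_p(rejections, 1000, 0.05))
```

A small p-value here would suggest that a method's actual Type I error rate differs from the nominal 5%.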

The simulations showed that all methods provide unbiased estimates in populations with a homogeneous P and have an appropriate Type I error rate. The method based upon case/child pairs was generally more powerful than the other two methods. In populations with sub-populations with different Ps only the conditional approach is unbiased, although the simulations showed that method 2 was robust.

This paper illustrates the utility of using Stata for simulation studies comparing different analytic approaches in genetic case-control association studies. It also illustrates how useful simulation studies can be in estimating power. Stata is very well suited for simulation studies because of its speed, the ease of posting the simulation findings, and its maximum-likelihood procedure.

Graphics before and after model fitting

Nicholas J. Cox, Department of Geography, University of Durham


It is commonplace to compute various flavours of residual and predicted values after fitting many different kinds of model. This allows production of a great variety of diagnostic graphics, used to examine the general and specific fit between data and model and to seek possible means of improving the model. Several different graphs may be inspected in many modelling exercises, partly because each kind may be best for particular purposes, and partly because in many analyses a variety of models - in terms of functional form, choice of predictors, and so forth - may be entertained, at least briefly. It is therefore helpful to be able to produce such graphs very rapidly.

Official Stata supplies as built-ins a bundle of commands originally written for use after regress: avplot, avplots, cprplot, acprplot, lvr2plot, rvfplot and rvpplot. These were introduced in Stata 3.0 in 1992 and are documented at [R] regdiag. More recently, in an update to Stata 7.0 on 6 September 2001, all but the first two have been modified so that they may be used after anova. Despite their many uses, this suite omits some very useful kinds of plot, while none of the commands may be used after other modelling commands.

The presentation focuses on a new set of commands, which are biased towards graphics useful for models predicting continuous response variables. The ideal, approachable asymptotically, is to make minimal assumptions about which modelling command has been issued previously. The downside for users is that if the data and the previous model results do not match the assumptions, it is possible to get either bizarre results or an error message.

The commands which have been written include:

anovaplot shows fitted or predicted values from an immediately previous one-, two-, or three-way anova. By default the data for the response are also plotted. In particular, anovaplot can show interaction plots.

indexplot plots estimation results (by default whatever predict produces by default) from an immediately previous regress or similar command versus a numeric index or identifier variable, if that is supplied, or observation number, if that is not supplied. Values are shown, by default, as vertical spikes starting at 0.

ovfplot plots observed vs fitted or predicted values for the response from an immediately previous regress or similar command, with by default a line of equality superimposed.

qfrplot plots quantile plots of fitted values, minus their mean, and residuals from the previous estimation command. Fitted values are whatever predict produces by default and residuals are whatever predict, res produces. Comparing the distributions gives an overview of their variability and some idea of their fine structure. By default plots are side-by-side. Quantile plots may be observed vs normal (Gaussian).

rdplot graphs residual distributions. The residuals are, by default, those calculated by predict, residuals or (if the previous estimation command was glm) by predict, response. The graph by default is a single or multiple dotplot, as produced by dotplot: histograms or box plots may be selected by specifying either the histogram or the box option.

regplot plots fitted or predicted values from an immediately previous regress or similar command. By default the data for the response are also plotted. With one syntax, no varname is specified. regplot shows the response and predicted values on the y axis and the covariate named first in the regress or similar command on the x axis. Thus with this syntax the plot shown is sensitive to the order in which covariates are specified in the estimation command. With another syntax, a varname is supplied, which may name any numeric variable. This is used as the variable on the x axis. Thus in practice regplot is most useful when the fitted values are a smooth function of the variable shown on the x axis, or a set of such functions given also one or more dummy variables as covariates. However, other applications also arise, such as plotting observed and predicted values from a time series model versus time.

rvfplot2 graphs a residual-versus-fitted plot, a graph of the residuals versus the fitted values. The residuals are, by default, those calculated by predict, residuals or (if the previous estimation command was glm) by predict, response. The fitted values are those produced by predict by default after each estimation command. rvfplot2 is offered as a generalisation of rvfplot in official Stata.


diag.html – graphs covered in the meeting

Standardising anthropometric measures in children and adolescents with new extensions to egen

Suzanna Vidmar, Kylie Hesketh, and John Carlin,
Murdoch Childrens Research Institute and University of Melbourne


Comparing crude anthropometric data from children of different ages is complicated by the fact that children are still growing (we do not expect the height of a 5-year-old to be the same as the height of a 10-year-old!). Clinicians and researchers are often interested in the question "is this child taller, shorter, or about average compared to other children their age?". Two sets of population-based reference data are now widely used to address this question: the 1990 British Growth Reference and the Centers for Disease Control and Prevention (CDC) Growth Reference in the U.S. Both references tabulate values obtained by the LMS method (Cole, Eur. J. Clin. Nutr., 1990; Cole and Green, Statistics in Medicine, 1992) that can be used to transform crude data to standard deviation (z) scores, which are standardised to the reference population. The LMS transformation reduces right skew and adjusts for physiological changes in anthropometric measures that occur with age. Stata provides a convenient environment in which to apply the age-specific (or height-specific) LMS values and generate z-scores for each child in a dataset, using the egen command. New functions of egen have been developed to allow transformation of crude child anthropometric data to z-scores using the LMS method and the reference data available from the British Growth Reference and the CDC Growth Reference. Recently the Childhood Obesity Working Group of the International Obesity Taskforce recommended use of BMI cut-off points to categorise children as normal weight, overweight or obese based on age, gender and BMI. An additional function of egen allows for children to be categorised according to these international cut-off points. This talk will provide brief background on growth standards and the LMS method, and describe in detail how the new egen commands were created, with examples of their application.
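
For readers unfamiliar with the LMS method, the z-score formula can be sketched in a few lines of Python. This is not the egen implementation described in the talk. The function name is hypothetical and the L, M and S values in the usage example are made up; in practice they would be looked up in the reference tables by age and sex.

```python
from math import log


def lms_zscore(y, L, M, S):
    """LMS z-score for measurement y.

    L is the Box-Cox power, M the median and S the coefficient of variation
    for the child's age and sex. When L is zero the transformation reduces
    to a log transformation.
    """
    if L == 0:
        return log(y / M) / S
    return ((y / M) ** L - 1) / (L * S)


if __name__ == "__main__":
    # Made-up reference values for illustration only.
    print(lms_zscore(18.2, L=-1.6, M=16.0, S=0.11))
```

A measurement equal to the reference median M always gives a z-score of zero, whatever the values of L and S.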

Using Stata at the Office for National Statistics: an overview

Matthew Barnes, Office for National Statistics


We describe the use of Stata in the UK Office for National Statistics (ONS). We give examples of projects Stata has been used for that have fed into UK official statistics. In addition we show some of the longer term research under way at ONS that is using Stata to analyse linked business data sets. Results from such work are fed into the evidence base for government policy. Primarily we discuss work in the Economic Analysis and Satellite Accounts Division of the Economic Statistics Directorate within ONS.



A note on ownership and productivity in UK businesses

Ralf Martin and Chiara Criscuolo, Office for National Statistics, CeRiBA, and London School of Economics


A series of studies in a number of countries have found that foreign-owned firms are more productive than domestic firms. However, almost all this work compares foreign firms — which are, by definition, multinationals — with all domestic firms. This paper analyses for the first time in the UK the relative productivity performance of foreign-owned manufacturing firms and UK manufacturing firms split into UK Multinationals and UK pure domestic firms. This was not possible before because none of the datasets used for productivity analysis distinguished between domestic multinational and non-multinational firms. We are able to make such a distinction. Our results suggest that the foreign productivity advantage is by and large a multinational effect. US multinational firms, however, seem to maintain a productivity advantage with respect to both other foreign-owned firms and domestic multinational firms. Interestingly, this UK result mirrors and extends results for the US by Doms and Jensen. We use Stata to deal with changes in raw data files in a consistent manner and to link together data about the same units collected at different levels of aggregation in a sensible way in order to do econometric analysis.

The Stata Technical Bulletin and the Stata Journal: editors' report

Joe Newton, Department of Statistics, Texas A & M University
Nicholas J. Cox, Department of Geography, University of Durham


The Stata Technical Bulletin (STB) started publication in March 1991 and ceased in May 2001, after 61 bimonthly issues. It has been succeeded by the Stata Journal (SJ), of which two quarterly issues have so far appeared, 1(1) for the last quarter of 2001 and 2(1) for the first of 2002. Although published by StataCorp, the SJ is controlled by an international board including the Editor and Executive Editor and 18 Associate Editors.

We believe that the STB was a great success, but by 2001 there was a need for fairly radical change in its content and format. Its role in making available new Stata programs and documentation, whether written by users or by StataCorp, has largely been superseded by easy and rapid use of the Internet. The SJ continues to be a vehicle for distributing valuable new programs, but it will carry more, and more substantial, expository articles on statistics, data management and graphics using Stata. The SJ is also now a reviewed journal, which we believe is important both for its contributors and for its readers. Finally, the SJ has been redesigned and is now printed on better paper and in more durable covers.

We will talk briefly about the transition from the STB to the SJ. Comments and questions about the SJ will be most welcome.

Programmable GLM: a collection of case studies

Roberto G. Gutierrez, StataCorp


With the release of Stata 7, the capabilities of glm were greatly enhanced. Among the improvements was the ability for users to program their own custom link and variance functions. Whereas previously glm was used primarily as a platform on which to compare the results of standard regression models (such as the logistic, probit, and Poisson), it may now be utilised to perform generalized maximum pseudo-likelihood estimation in any framework. Thus far, this has been an ability that for the most part has not been exploited.

The method by which user-defined links and variance functions may be incorporated is quite straightforward, as demonstrated in the companion text to glm by Hardin and Hilbe (2001). In this talk, I present a few examples of case studies from the literature where the science dictated the fitting of a generalized linear model with special (non-standard) link and/or variance function. I demonstrate how these models (which were typically fit using SAS's GENMOD procedure) may be fit using Stata.
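
To show what supplying a link and a variance function involves, here is a minimal Python sketch of iteratively reweighted least squares for a model with an intercept and one covariate. It is not Stata's glm and says nothing about how glm accepts user-written functions. The function name, the fixed number of iterations and the starting values are all assumptions made for illustration.

```python
from math import exp, log


def irls(x, y, link, inv_link, dlink, variance, iters=25):
    """Fit eta = b0 + b1*x by iteratively reweighted least squares.

    link(mu), its inverse and its derivative dlink(mu), plus the variance
    function variance(mu), are supplied by the caller. Assumptions: link(y)
    is defined for every y (used for starting values) and a fixed number of
    iterations is enough; there is no convergence check.
    """
    mu = list(y)
    eta = [link(m) for m in mu]
    b0 = b1 = 0.0
    for _ in range(iters):
        # Working response and weights for the current fitted means.
        z = [e + (yi - m) * dlink(m) for e, yi, m in zip(eta, y, mu)]
        w = [1.0 / (variance(m) * dlink(m) ** 2) for m in mu]
        # Weighted least squares of z on (1, x) via the 2x2 normal equations.
        sw = sum(w)
        sx = sum(wi * xi for wi, xi in zip(w, x))
        sz = sum(wi * zi for wi, zi in zip(w, z))
        sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        sxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * sxx - sx * sx
        b1 = (sw * sxz - sx * sz) / det
        b0 = (sz - b1 * sx) / sw
        eta = [b0 + b1 * xi for xi in x]
        mu = [inv_link(e) for e in eta]
    return b0, b1


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0, 4.0]
    y = [1.0, 3.0, 4.0, 9.0, 14.0]  # made-up counts
    # Log link with Poisson variance; swap these four functions to change the model.
    print(irls(x, y, log, exp, lambda m: 1.0 / m, lambda m: m))
```

Changing the model means changing only the four functions passed in, which is the sense in which a non-standard link or variance function can be plugged in.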


Hardin, J. and J. Hilbe. 2001. Generalized linear models and extensions. Stata Press, College Station, TX.

BCa bootstrap confidence intervals

James Carpenter, Medical Statistics Unit, London School of Hygiene and Tropical Medicine
Patrick Royston, MRC Clinical Trials Unit, London


The existing Stata command bstrap takes a user-defined program and calculates normal approximation, percentile and bias-corrected percentile bootstrap confidence intervals. However, these intervals are not the most accurate available. In this article, we describe a new command, bci, which has a syntax similar to that of bstrap, but which additionally calculates the more accurate BCa bootstrap confidence interval, as well as the so-called 'basic' bootstrap confidence interval.
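
The interval types named above can be sketched in Python as follows. This is not the bci command: the function name, the nearest-rank quantile rule, the jackknife estimate of the acceleration constant and the number of replications are assumptions chosen to keep the example short.

```python
import random
from statistics import NormalDist, mean


def bootstrap_intervals(data, stat, reps=2000, level=0.95, seed=1):
    """Percentile, basic and BCa bootstrap intervals for stat(data).

    data must be a list. Quantiles use a simple nearest-rank rule.
    """
    rng = random.Random(seed)
    n = len(data)
    theta = stat(data)
    boots = sorted(stat([rng.choice(data) for _ in range(n)])
                   for _ in range(reps))
    alpha = (1 - level) / 2

    def quantile(q):
        idx = min(reps - 1, max(0, int(round(q * (reps - 1)))))
        return boots[idx]

    percentile = (quantile(alpha), quantile(1 - alpha))
    # Basic interval: reflect the percentile limits about the estimate.
    basic = (2 * theta - percentile[1], 2 * theta - percentile[0])

    nd = NormalDist()
    # Bias correction z0 from the share of replicates below the estimate
    # (clamped away from 0 and 1 so the inverse normal is defined).
    prop = sum(b < theta for b in boots) / reps
    prop = min(max(prop, 0.5 / reps), 1 - 0.5 / reps)
    z0 = nd.inv_cdf(prop)
    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted(q):
        z = z0 + nd.inv_cdf(q)
        return nd.cdf(z0 + z / (1 - a * z))

    bca = (quantile(adjusted(alpha)), quantile(adjusted(1 - alpha)))
    return {"percentile": percentile, "basic": basic, "bca": bca}


if __name__ == "__main__":
    sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.9]  # made-up data
    print(bootstrap_intervals(sample, mean))
```

When the bias correction and the acceleration are both zero, the BCa limits coincide with the percentile limits.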

The use of fractional polynomials to model interactions between treatment and continuous covariates in clinical trials

Patrick Royston, MRC Clinical Trials Unit, London
W. Sauerbrei, IMBI, University Hospital of Freiburg


We consider modelling and testing for 'interaction' between a continuous covariate X and a categorical covariate C in a regression model. Here C represents two treatment arms in a parallel-group clinical trial and X is a prognostic factor which may influence response to treatment. Usually X is categorised into groups according to cut-point(s) and the interaction is analysed in a model with main effects and multiplicative terms. A trend test of the effect of C over the ordered categories from X may be performed and is likely to have better power. The cut-point approach raises several well-known and difficult issues for the analyst, including dependency of the results on the choice of cut-point, loss of power due to categorisation, and the danger of 'over-fitting' if several cut-points are considered in a search for 'optimality' (Altman et al., 1994).

We will describe an approach to avoid such problems based on fractional polynomial (FP) modelling of X, without categorisation, overall and at each level of C (Royston and Sauerbrei, 2002). The first step is to construct a multivariable adjustment model which may contain binary covariates and FP transformations of continuous covariates other than X. The second step involves FP modelling of X within the adjustment model.
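
For readers new to fractional polynomials, the basic transformations can be sketched in Python. The power set below is the conventional one from the fractional polynomial literature; it is not stated in this abstract, and the function names are hypothetical. The sketch covers only the transformation of X, not the model selection or the interaction test.

```python
from math import log

# Conventional FP power set (assumption: not given in the abstract).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """Single FP transformation of a positive x; power 0 denotes log(x)."""
    if x <= 0:
        raise ValueError("fractional polynomials require x > 0")
    return log(x) if p == 0 else x ** p


def fp2_terms(x, p1, p2):
    """Two-term (second-degree) FP basis.

    For a repeated power the second term is the first multiplied by log(x).
    """
    first = fp_term(x, p1)
    second = first * log(x) if p1 == p2 else fp_term(x, p2)
    return first, second


if __name__ == "__main__":
    for p in FP_POWERS:
        print(p, fp_term(4.0, p))
    print(fp2_terms(4.0, 0.5, 0.5))
```

Fitting the same FP terms separately at each level of C, and comparing that with a common fit, is one way to read the second step described above.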

Stata software to fit the models will be demonstrated using example datasets, mainly from cancer studies. The examples show the power of the approach in detecting and displaying interactions in real data from randomised controlled trials with a survival-time outcome.


Altman, D. G., B. Lausen, W. Sauerbrei, M. Schumacher. 1994. The dangers of using 'optimal' cutpoints in the evaluation of prognostic factors. Journal of the National Cancer Institute 86: 829–835.

Royston, P. and W. Sauerbrei. 2002. A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Statistics in Medicine, to be submitted.



Applying the Cox proportional hazard regression model to competing risks

Mohamed Ali, Department of Epidemiology and Public Health, London School of Hygiene and Tropical Medicine
Abdel Babiker, MRC Clinical Trials Unit, London


In the presence of dependent competing risks in survival analysis, the Cox proportional hazard model can be utilised to examine the covariate effects on the cause-specific hazard function for each type of failure. The use of the Cox model was proposed by Lunn and McNeil (1995). Their method requires data augmentation. With k failure types, the data would be duplicated k times, one record for each failure type. Either a stratified or an unstratified analysis could be used, depending on whether the assumption of proportional hazard holds. If the proportional hazard assumption does not hold across the causes, the stratified analysis should be used, which is equivalent to fitting a separate model for each failure type. The unstratified analysis assumes a constant hazard ratio between failure types and this could be fitted by including an indicator variable as a covariate.

We will show how both approaches could be fitted on augmented data using stcox. In addition to the parameter estimates and their standard errors, the program has an option to produce cumulative incidences with pointwise confidence intervals.
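
The data augmentation step can be sketched in Python. The field names (id, time, cause, type, event) and the coding of censoring as cause 0 are assumptions made for illustration, and the sketch stops short of fitting any model.

```python
def augment(records, k):
    """Duplicate each record k times, one copy per failure type.

    Each input record is a dict with 'id', 'time' and 'cause', where cause
    is 0 for a censored subject and 1..k for the observed failure type.
    In each copy, 'type' is the failure type the row stands for and 'event'
    is 1 only if the subject actually failed from that type.
    """
    rows = []
    for rec in records:
        for failure_type in range(1, k + 1):
            row = dict(rec)
            row["type"] = failure_type
            row["event"] = 1 if rec["cause"] == failure_type else 0
            rows.append(row)
    return rows


if __name__ == "__main__":
    subjects = [
        {"id": 1, "time": 5.0, "cause": 2},  # failed from cause 2
        {"id": 2, "time": 3.0, "cause": 0},  # censored
    ]
    for row in augment(subjects, k=2):
        print(row)
```

In the augmented data, type can serve either as a stratification variable or as an indicator covariate, matching the two analyses described above.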


Lunn, M. and D. McNeil. 1995. Applying Cox regression to competing risks. Biometrics 51: 524–532.

Adjusting for measurement error in explanatory variables

Ian White, Chris Frost, and Shoji Tokunaga, MRC Biostatistics Unit, Cambridge


In epidemiology, measurement error or within-individual variation in exposures and confounders leads to attenuated effect estimates and inadequate control of confounding. Adjustment for measurement error is possible if its magnitude may be estimated from supplementary information, typically replicate measurements or partial observation of an error-free value. Methods currently available in Stata, including eivreg and ivreg, are of limited usefulness in epidemiology, and regression calibration is more commonly used. I will describe a new program, regcal, which implements regression calibration, and incorporates a modification for replicated discrete variables when the assumptions underlying regression calibration do not hold. This is illustrated using observational data on the association between cholesterol levels and green tea consumption.
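
The attenuation and its correction can be illustrated in the simplest setting: a linear model with one exposure measured twice per person under classical measurement error. The Python sketch below is not regcal and omits confounders, discrete exposures and standard errors. The function names and the variance-components estimator are assumptions made for illustration.

```python
from statistics import mean, variance


def reliability_of_mean(pairs):
    """Reliability ratio for the mean of two replicate measurements.

    Assumption: classical error model, so the within-person variance is
    estimated by mean(d^2)/2 with d the difference between replicates, and
    the variance of the person means is (between variance) + (within / 2).
    """
    within = mean((a - b) ** 2 for a, b in pairs) / 2
    var_of_means = variance([(a + b) / 2 for a, b in pairs])
    between = var_of_means - within / 2
    return between / var_of_means


def corrected_slope(naive_slope, reliability):
    """Undo attenuation: divide the naive slope by the reliability ratio."""
    return naive_slope / reliability


if __name__ == "__main__":
    replicates = [(1.0, 3.0), (5.0, 3.0), (8.0, 10.0)]  # made-up data
    lam = reliability_of_mean(replicates)
    print(lam, corrected_slope(0.6, lam))
```

Because the reliability ratio is at most 1, the corrected slope is at least as large in magnitude as the naive slope, which reflects the attenuation described above.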

Further steps towards a graphical interface for making tables and estimating effects

Michael Hills, London School of Hygiene and Tropical Medicine


At the last users' group meeting I presented a GI command for estimating effects using linear models. The command relied on a classification of variables according to their function in the linear model, and the user had to select, from a list, the particular regression command to be used (e.g. logistic). In this update I have concentrated on four types of response variable (binary, metric, failure, count) which in turn imply which regression command to use, so there is no longer a need to know which command is appropriate. The same classification is used in a companion GI command to present tables of means, medians, proportions, odds, or rates. Each of these GI commands is capable of displaying an equivalent command which does not use menus.


To obtain the ado-files covered in this meeting, in Stata type
                . net from http://fmwww.bc.edu/repec/bocode/e/
                . net describe effects
                . net install effects
                . net get effects

How to face lists with fortitude (Tutorial)

Nicholas J. Cox, Department of Geography, University of Durham


Among various structures in Stata for cycling through lists (whether lists of variable names, numbers, or arbitrary strings) are foreach and forvalues, introduced in Stata 7 in 2001, and for, introduced in Stata 3.1 in 1992, and revised in 5.0 (1997) and 6.0 (1999). Typically, each member of the list supplied is substituted in turn in one or more commands within the structure being used.

This is a tutorial specifically designed for Stata users who do little or no Stata programming. Despite being labelled as programming commands, these structures have many uses either interactively or within do files and help impart both speed and system to repetitive tasks.

One prerequisite for understanding foreach and forvalues is the idea of a local macro, of which the most difficult part is understanding the strange name. With that idea assimilated, it is relatively easy to see how foreach and forvalues can be used in a large variety of problems. Despite being newer than for in Stata, these structures are recommended over for to new users, or to more experienced users who have made little or no use of any of these to date. For completeness, there is also comparison with for, and some comments on their relative merits.



Report to users

William W. Gould, StataCorp

Bill Gould, who is President of StataCorp, and more importantly for this meeting, the head of development, will ruminate about work at Stata over the last year and about ongoing activity.

Scientific organizers

Nicholas J. Cox, Durham University

Patrick Royston, MRC Clinical Trials Unit

Logistics organizers

Timberlake Consultants, the official distributor of Stata in the United Kingdom.