Last updated: 7 June 2001

2001 UK Stata Users Group meeting

14–15 May 2001


Royal Statistical Society
12 Errol Street
London EC1Y 8LX


Plotting graded data: a Tukey-ish approach

Nicholas J. Cox

Graded data are those possessing an inherent order but falling short of a metric scale. Examples are opinions on a five-point scale, such as strongly disagree, disagree, neutral, agree, strongly agree. Graded data are, like ranked data, one kind of ordinal data. They are common in many fields, especially as the record of some considered judgment, but little attention seems to have been given to methods for their easy and effective graphical display.

This presentation draws on suggestions made in various places by J.W. Tukey, using the principle that cumulative probabilities are a logical and practical way to represent graded data, which is after all the basis for many models for graded responses. Cumulative probability curves for different subsets of the data are found useful in initial description and exploration of the data. A Stata program ordplot offers various kinds of flexibility in showing such curves:

  1. cumulating to the bottom, the middle and the top of each class;
  2. complementary distribution functions (descending curves) may be shown as well as cumulative distribution functions (ascending curves);
  3. logit, folded root (more generally, folded power), loglog, cloglog, normal (Gaussian), percent and raw scales are all allowed for cumulative probabilities;
  4. for such scales, labels, lines and ticks may be in terms of the transformed units or in terms of probabilities or percents;
  5. different scores may be assigned to grades on the fly.

In practice, most datasets seem to reveal their basic structure either on raw or on logit scales. In some cases, the discrete response models fitted by previous authors appear, as a consequence, to be unnecessarily elaborate or indirect.
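As a rough illustration of cumulating to the bottom, middle or top of each class, and of showing the result on a logit scale, the following Python sketch may help. It is not the ordplot code: the function names and the example counts are invented for illustration.

```python
import math

def cumulative_probs(counts, where="middle"):
    """Cumulative probabilities for ordered grades.

    counts: frequencies per grade, lowest grade first.
    where:  cumulate to the "bottom", "middle" or "top" of each class.
    """
    total = sum(counts)
    probs = []
    below = 0  # total frequency in all lower grades
    for c in counts:
        if where == "bottom":
            cum = below
        elif where == "middle":
            cum = below + c / 2
        elif where == "top":
            cum = below + c
        else:
            raise ValueError("where must be bottom, middle or top")
        probs.append(cum / total)
        below += c
    return probs

def logit(p):
    """Logit transform of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))

# Hypothetical five-point scale, strongly disagree ... strongly agree
counts = [10, 20, 40, 20, 10]
mid = cumulative_probs(counts, "middle")
mid_logit = [logit(p) for p in mid]
```

Cumulating to the middle of each class keeps every probability strictly between 0 and 1 when all counts are positive, so the logit is defined for every grade; cumulating to the top gives 1 for the highest grade, where the logit is undefined.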

Fitting log-linear models with ignorable and non-ignorable missing data

Andrew Pickles

We describe Stata macros that implement the composite link approach to missing data in log-linear models first described by David Rindskopf (Psychometrika, 1992, V57, 29–42). When a missing value occurs among the variables that form a contingency table, the resulting observation contributes to the frequencies of a table of lower dimension than the full table, the full table being collapsed along the dimension of the missing variable. Our primary interest lies in constructing a model for the full-dimensional table. The composite link approach maps the observed cells of this collapsed table to the corresponding unobserved cells of the full-dimensional table. This mapping allows expected cell frequencies for observed cells to be obtained from the expected cell frequencies for the unobserved cells, the latter being derived from a near-standard log-linear model. A preliminary macro reorganizes the data from a file of individual records with possibly missing variable values to a file where each record represents either an observed cell frequency or an unobserved cell that contributes to an observed cell. The records also contain the necessary design variables and interaction terms to allow the second macro, an adaptation of Stata's original glm procedure, to fit log-linear models that assume the missing values are MCAR, MAR or conform to some non-ignorable model. We illustrate the use of the macros. The primary contributors to this work were Colin Taylor and Alan Taylor (Institute of Psychiatry, London) and Daphne Kounali (now MRC Environmental Epidemiology).

Parametric survival functions

Patrick Royston

Cox proportional-hazard regression has been essentially the automatic choice of analysis tool for modeling survival data in medical studies. However, the Cox model has several intrinsic features that may cause problems for the analyst or an interpreter of the data. These include the necessity of assuming proportional hazards and the very noisy estimate of the baseline hazard function that is typically obtained. I shall demonstrate flexible parametric models based on a proportional hazards or a proportional odds metric. I will show the utility of such models in helping one to visualize the hazard function and hazard ratio as functions of time, and in modeling data with non-proportional effects of some or all of the covariates.


Royston, P. 2001. Flexible parametric alternatives to the Cox model ... and more. Stata Technical Bulletin: 61.

Royston, P. and M. K. B. Parmar. 2001. Flexible parametric models for censored survival data with application to prognostic modeling and estimation of treatment effects. Submitted for publication.

Adjusting for cross-over in a trial with survival end-points

Ian White, Abdel Babiker, and Sarah Walker

We consider a two-group clinical trial with a survival outcome, in which some subjects may 'cross over' to receive the treatment of the other arm. Our command strbee adjusts for treatment cross-over in one or both arms. This is done by a randomization-respecting method which preserves the intention-to-treat P-value.

G-estimation of the effect of exposures in longitudinal studies

Jonathan Sterne and Kate Tilling

Stata's st suite of commands for the analysis of survival time data allow flexible modeling of the effect of exposures which vary over time. A potential problem in such analyses is that other risk factors may be both confounders (i.e., associated with both exposure and disease outcome) and also intermediate variables (on the causal pathway from exposure to disease). This phenomenon is known as "time-varying confounding". Standard statistical models for the analysis of cohort studies do not take such complex relationships into account and may produce biased estimates of the effect of risk factor changes. G-estimation of the effect of a time-varying exposure on outcome, allowing for confounders which are also on the causal pathway, has been proposed for the analysis of such inter-related data. We will present stgest, a Stata program which performs g-estimation, allowing the results to be compared to those from the more usual survival analysis. Using simulated data, we show that the usual analysis can under-estimate the effect of an exposure on disease where there is time-varying confounding, and that g-estimation produces a more accurate estimate. Applications of the method will be discussed.

A graphical interface for Stata estimation commands

Michael Hills and David Clayton

At the UK Stata Users Group meeting in 2000, we presented a series of linked commands which made it possible to declare exposure, stratifying and confounding variables, and to combine this information with Stata estimation commands such as regress, logistic, poisson, stcox, xtpois, etc., to produce maximum likelihood estimates of stratum-specific exposure effects, possibly controlled for other confounders.

Essentially, the idea was that the estimation techniques should be ML, but the output should be closer to Mantel–Haenszel than to the traditional table of main effects and interactions.

In this presentation, we demonstrate the use of a Graphical Interface as an alternative way of declaring the information which will guide the analysis. The GI was prepared using Stata's windowing commands.

We are not advocating the use of GIs in place of the command line as a general strategy, only where the information to be passed to a command is complex.

Additional resources:

A Short Introduction to Stata for Biostatistics: chap6-8.pdf

xtgraph: summary graphs of xt data

Paul Seed

xtgraph produces summary graphs of xt data, by time and by group. It is very flexible, allowing means based on any power or 3-parameter log transformation, and error bars for SE, CI, SD and Reference Range, as well as medians with IQR. Normally, points are estimated separately for each level of t and group. However, a model option will take values from the last model fitted. This allows for linear and non-linear effects and displays interactions. The main illustration deals with data about Vitamin C and E supplementation.

Triangular plots

Nicholas J. Cox

The Stata program triplot produces a triangular plot of three variables with constant sum. Most commonly, three fractions or proportions add to 1, or three percents add to 100. The constant sum constraint means that there are just two independent pieces of information. Hence, it is possible to plot observations in two dimensions within a triangle, which is a 2-simplex. Triangular plots appear under various names in the literature, including trilinear, triaxial, three-element maps, ternary, reference triangles, percentage triangles, mixture, barycentric. Common geological applications are to sedimentary facies or particle form; hence, more specific terms such as facies and form triangles. triplot has several options for tuning displays. A variety of data examples will be shown to indicate some of the flexibility of the program.
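The reduction from three constrained values to two plotting coordinates can be sketched in Python as follows. The vertex layout (first variable at the bottom left, second at the bottom right, third at the top of an equilateral triangle with unit sides) is an assumption made for this illustration and is not taken from triplot; the function name is invented.

```python
import math

def ternary_xy(a, b, c):
    """Map three non-negative values with a positive sum to (x, y).

    Assumed vertices: a -> (0, 0), b -> (1, 0), c -> (0.5, sqrt(3)/2).
    The values are rescaled by their sum, so fractions adding to 1
    and percents adding to 100 give the same point.
    """
    s = a + b + c
    if s <= 0:
        raise ValueError("the three values must have a positive sum")
    x = (b + 0.5 * c) / s
    y = (math.sqrt(3) / 2) * c / s
    return x, y

# Equal thirds plot at the centroid of the triangle
centre = ternary_xy(1, 1, 1)
# The same composition expressed in percents plots at the same point
same_point = ternary_xy(50, 30, 20) == ternary_xy(0.5, 0.3, 0.2)
```

Each point is the weighted average of the three vertices, with the rescaled values as weights, which is why the plot is also called barycentric.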

Extensions to gllamm

Sophia Rabe-Hesketh, Andrew Pickles, and Anders Skrondal

gllamm is a program to fit generalised linear latent and mixed models. Since gllamm6 appeared in the STB (sg129), a large number of new features have been added. Two important extensions will be discussed:

  1. More response processes can now be modelled including ordered and unordered categorical responses and rankings. Multilevel models for nominal data and rankings will be described and fitted in gllamm.

  2. Multilevel structural equation models can be fitted by specifying regressions of latent variables on other latent variables and on explanatory variables. Examples will be described and fitted in gllamm.

Other new features in gllamm include parameter constraints, and a 'post-estimation' program, gllapred, for estimating posterior means and probabilities.

Sample size calculation in complex studies with failure time outcome

Abdel Babiker and Patrick Royston

Stata includes just one program, sampsi, for calculating sample size. It deals only with comparisons between two groups in terms of binary or normally distributed outcomes. Many randomized controlled trials, however, are designed around a survival-time outcome measure, may compare more than two groups, and are subject to loss to follow-up, withdrawal from allocated treatment, and staggered entry. We provide a very flexible tool for determining sample size in such studies. Because inevitably there are many potential 'options' (in the Stata sense), the underlying conventional ado-file (calcssi) for the routine may be dauntingly complex. For this reason we have provided a menu-driven front end initiated by running the ado-file ssmenu.

The following study design features have been implemented:

  • Up to 6 treatment groups.
  • Arbitrary baseline time-to-event distribution.
  • Time-varying hazard ratios (i.e. non-proportional hazards).
  • Arbitrary allocation ratios across groups.
  • Loss to follow-up.
  • Staggered patient entry.
  • Cross-over from allocated treatment to alternative treatment.
  • Analysis by unweighted, Tarone–Ware, or Harrington–Fleming versions of the logrank test. In addition, a chi-square test comparing the proportion of failures at the end of the study is available.

Failure rates in the presence of competing risks

Mohamed Ali

A Stata routine for estimating the cumulative incidence rate (CIR) and its standard error in the presence of competing risks will be demonstrated. The program mtable will have the same features as the ltable command in Stata. In addition to the CIR estimates, the program will have an option to produce the Kaplan-Meier type cause-specific failure rate.

          mtable timevar outcomevar  [weight] [if exp] [in range]
                 [, by(groupvar) level(#) survival failure hazard 
                 intervals(interval) noadjust notab graph graph_options 
                 noconf saving(newvar) reason(#) at(#) ]
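For readers unfamiliar with cumulative incidence under competing risks, here is a minimal Python sketch of the usual product-limit form of the estimator. It is not mtable, which follows the interval-based layout of ltable and also reports standard errors; the function name, the coding of causes (0 for censored) and the example data are assumptions made for illustration.

```python
def cumulative_incidence(times, causes, cause):
    """Cumulative incidence of one cause of failure over time.

    times:  observed times
    causes: 0 = censored, 1, 2, ... = cause of failure
    cause:  the cause of interest
    Returns (time, cumulative incidence) at each distinct failure time.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    surv = 1.0  # probability of no failure from any cause so far
    cif = 0.0
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_all = d_k = n_here = 0
        # Gather everyone whose observation ends at time t
        while i < len(data) and data[i][0] == t:
            c = data[i][1]
            n_here += 1
            if c != 0:
                d_all += 1
            if c == cause:
                d_k += 1
            i += 1
        if d_all > 0:
            # Failures from this cause, weighted by all-cause survival
            cif += surv * d_k / n_at_risk
            surv *= 1 - d_all / n_at_risk
            out.append((t, cif))
        n_at_risk -= n_here
    return out

# Hypothetical data: five subjects, two causes, one censored at time 3
times = [1, 2, 3, 4, 5]
causes = [1, 2, 0, 1, 2]
cif1 = cumulative_incidence(times, causes, 1)
cif2 = cumulative_incidence(times, causes, 2)
```

In this example the two cause-specific cumulative incidences add up to one minus the all-cause survival at every failure time, which a Kaplan-Meier type estimate that treats the other cause as censoring would not guarantee.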

A discrete time split population survival (cure) model

Stephen Jenkins

In the standard survival model, the risk of failure is non-zero for all cases. A split-population (or cure) survival model relaxes this assumption and allows an (estimable) fraction of cases never to experience the event. This presentation reports on an implementation of a discrete time (or grouped survival data) version of this model, using ml method d0, and the problems with implementing a 'robust' option.
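The likelihood behind such a model can be sketched in a few lines of Python. This shows the usual split-population form for discrete time and is not the presenter's ml implementation; the function and argument names are invented, and hazards are passed in directly rather than modelled through covariates.

```python
import math

def split_population_loglik(durations, events, hazards, p_ever_fail):
    """Log likelihood for a discrete-time split-population model.

    durations:   number of intervals each case was observed (>= 1)
    events:      1 if the case failed in its last interval, 0 if censored
    hazards:     hazards[j] is the discrete hazard in interval j + 1
    p_ever_fail: probability that a case is ever at risk of the event
    """
    loglik = 0.0
    for t, d in zip(durations, events):
        # Survival through the first t - 1 intervals, given at risk
        surv_before = math.prod(1 - h for h in hazards[:t - 1])
        if d:
            # At-risk case that survives t - 1 intervals, then fails
            lik = p_ever_fail * surv_before * hazards[t - 1]
        else:
            # Either never at risk, or at risk and still surviving
            surv = surv_before * (1 - hazards[t - 1])
            lik = (1 - p_ever_fail) + p_ever_fail * surv
        loglik += math.log(lik)
    return loglik

# Hypothetical example: one failure and one censored case at interval 2
ll = split_population_loglik([2, 2], [1, 0], [0.5, 0.5], 0.8)
```

Setting p_ever_fail to 1 recovers the standard discrete-time survival likelihood, in which every case is assumed to fail eventually.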

Efficient management of multi-frequency panel data with Stata

Christopher F. Baum

This presentation discusses how the tasks involved with carrying out a sizable research project, involving panel data at both monthly and daily frequencies, could be efficiently managed by making use of built-in and user-contributed features of Stata. The project entails the construction of a dataset of cross-country monthly measures for 18 nations, and the evaluation of bilateral economic activity between each distinct pair of countries. One measure of volatility, at a monthly frequency, is calculated from daily spot exchange rate data, and effectively merged back to the monthly dataset. Nonlinear least squares models are estimated for every distinct bilateral relationship, and the results of those 300+ models organized for further analysis and production of summary tables and graphics using a postfile. The various labor-saving techniques used to carry out this research will be discussed, with emphasis on the generality that allows additional countries, time periods, and data to be integrated with the panel dataset with ease.

Splines with parameters that can be explained in words to non-mathematicians

Roger Newson

This contribution is based on my programs bspline and frencurv, which are used to generate bases for Schoenberg B-splines and splines parameterized by their values at reference points on the X-axis (presented in STB-57 as insert sg151). The program frencurv ("French curve") makes it possible for the user to fit a model containing a spline, whose parameters are simply values of the spline at reference points on the X-axis. For instance, if I am modeling a time series of daily hospital asthma admissions counts to assess the effect of acute pollution episodes, I might use a spline to model the long-term time trend (typically a gradual long-term increase superimposed on a seasonal cycle), and include extra parameters representing the short-term increases following pollution episodes. The parameters of the spline, as presented with confidence intervals, might then be the levels of hospital admissions, on the first day of each month, expected in the absence of pollution. The spline would then be a way of interpolating expected pollution-free values for the other days of the month. The advantage of presenting splines in this way is that the spline parameters can be explained in words to a non-mathematician (e.g., a medic), which is not easy with other parameterizations used for splines.

Propensity score matching

Barbara Sianesi

The typical evaluation problem aims at quantifying the impact of a 'treatment' (e.g., a training programme, a reform, or a medicine) on an outcome of interest (such as earnings, school attendance, or illness indicators), where a group of units, the 'treated', receive the 'treatment', while a second group remains untreated. Statistical matching involves pairing each treated unit with a non-treated unit with the 'same' observable characteristics, so that (under some assumptions) the outcome experienced by the matched pool of non-treated may be taken as the outcome the treated units would have experienced had they not been treated. Alternatively, one can associate with each treated unit a matched outcome given by the average of the outcome of all the untreated units, where each of their contributions is weighted according to their 'distance' to the treated unit under consideration. An interesting quantity which avoids the dimensionality problem is the 'propensity score', the conditional probability of being treated. This ado-file implements propensity score matching, in both its one-to-one and kernel-based versions. Additionally, it allows matching on two variables, as would be required, e.g., in the evaluation of multiple treatments.
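The two matching schemes can be illustrated with a short Python sketch. It assumes the propensity scores have already been estimated, matches with replacement, and uses a Gaussian kernel with an arbitrary bandwidth; these choices, the function names and the example data are assumptions for illustration and do not describe the ado-file.

```python
import math

def att_nearest_neighbour(treated, controls):
    """Mean treated-minus-matched outcome, one-to-one matching.

    treated, controls: lists of (propensity_score, outcome).
    Each treated unit is paired with the control whose score is
    closest; a control may be used more than once.
    """
    diffs = []
    for p_t, y_t in treated:
        _, y_c = min(controls, key=lambda c: abs(c[0] - p_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

def att_kernel(treated, controls, bandwidth=0.1):
    """Mean treated-minus-matched outcome, kernel matching.

    The matched outcome is a weighted average over all controls,
    with weights that fall as the score distance grows.
    """
    diffs = []
    for p_t, y_t in treated:
        w = [math.exp(-0.5 * ((p_c - p_t) / bandwidth) ** 2)
             for p_c, _ in controls]
        y_hat = sum(wi * y_c for wi, (_, y_c) in zip(w, controls)) / sum(w)
        diffs.append(y_t - y_hat)
    return sum(diffs) / len(diffs)

# Hypothetical (score, outcome) pairs
treated = [(0.8, 10.0), (0.6, 8.0)]
controls = [(0.75, 7.0), (0.55, 6.0), (0.2, 3.0)]
att_nn = att_nearest_neighbour(treated, controls)
att_k = att_kernel(treated, controls)
```

In both functions the result estimates the average effect on the treated only under the assumptions mentioned in the abstract, in particular that units with the same propensity score are comparable.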

Smoothed hazards

Ken Simons

Nonparametric estimates of hazard rates can be computed as functions of time (e.g., age or calendar time). Given random variations in survival times, estimates of the hazard typically must be smoothed to distinguish trends from noise. Left truncation (at a known age or time) and right censoring typically complicate estimation. Stata does not include routines to estimate smoothed hazards. Therefore, I will present a practical means to estimate smoothed hazards, allowing for possible left-truncation and right-censoring.

The presentation will consider the use of kernel density estimation methods. For discrete time intervals of fixed length, other approaches are available, and may be mentioned. Confidence intervals and choice of (constant and varying) smoothing window widths may also be discussed. Company exit ("death") rates for alternative types of firms will be used as illustration.
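One common way to obtain a smoothed hazard, sketched below in Python, is to kernel-smooth the increments of the Nelson-Aalen cumulative hazard, with risk sets that respect left truncation. This is offered as an assumed illustration of the general idea, not as the presenter's method; the Epanechnikov kernel, the fixed bandwidth, the function name and the data are all choices made here, and no boundary correction is applied.

```python
def smoothed_hazard(entry, exit_, event, grid, bandwidth):
    """Kernel-smoothed hazard with left truncation and right censoring.

    entry: time at which each subject comes under observation
    exit_: time of failure or censoring (must exceed entry)
    event: 1 if the subject failed at exit_, 0 if censored
    grid:  times at which to evaluate the smoothed hazard
    """
    def kernel(u):
        # Epanechnikov kernel on [-1, 1]
        return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

    event_times = sorted({t for t, e in zip(exit_, event) if e})
    increments = []
    for t in event_times:
        # At risk at t: already entered and not yet exited
        at_risk = sum(1 for a, b in zip(entry, exit_) if a < t <= b)
        deaths = sum(1 for b, e in zip(exit_, event) if e and b == t)
        increments.append((t, deaths / at_risk))

    return [
        sum(kernel((g - t) / bandwidth) * dh for t, dh in increments) / bandwidth
        for g in grid
    ]

# Hypothetical data: the fourth subject enters late, at time 2
entry = [0, 0, 0, 2]
exit_ = [1, 3, 4, 5]
event = [1, 1, 0, 1]
h = smoothed_hazard(entry, exit_, event, grid=[3.0], bandwidth=1.0)
```

A wider bandwidth averages over more failure times and gives a smoother but more biased curve, which is the trade-off behind the choice of window width.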

Predicting ordinal outcomes: options and assumptions

[Hand-outs/slides, part I, part II]
Mark Lunt

There are a number of methods of analyzing data that consist of several distinct categories, with the categories ordered in some manner. Analysis of such data is commonly based on a generalized linear model of the cumulative response probability, either the cumulative odds model (ologit) or the continuation ratio model (ocratio). However, these models assume a particular relationship between the predictor variables and the outcome. If these assumptions are not met, a multinomial model, which does not make such assumptions, can be fitted instead. This effectively ignores the ordering of the categories. It has the disadvantage that it requires more parameters than the above models, which makes it more difficult to interpret. An alternative model for ordinal data is the stereotype model. This has been little used in the past, as it is quite difficult to fit. It can be thought of as a constrained multinomial model, although some of the constraints applied are nonlinear. An ado-file to fit this model in Stata has recently been developed.

I will present analyses of a radiographic dataset, where the aim was to predict the severity of joint damage. All four of the above models were fitted to the data. The assumptions of the cumulative odds and continuation ratio models were not satisfied. A highly constrained stereotype model provided a good fit. Importantly, it showed that different variables were important for discriminating between different levels of the outcome variable.

Frailty in survival analysis models (parametric frailty, parametric shared frailty, and frailty in Cox models)

Bobby Gutierrez

Frailty models are used to model survival times in the presence of overdispersion or group-specific random effects. The latter are distinguished from the former by the term "shared" frailty models. With the release of Stata 7, estimation of parametric non-shared frailty models is now possible, and the new models appear as extensions to the six parametric survival models previously available. The overdispersion in this case is represented by an unobservable multiplicative effect on the hazard, or frailty. For purposes of estimation this frailty is then assumed to follow either a gamma or an inverse-Gaussian distribution.
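To show what a multiplicative frailty does to the survival function, the following Python sketch gives the population (frailty-averaged) survival for the two distributions. It assumes the common parameterization in which the frailty has mean 1 and variance theta, and takes the cumulative hazard for a subject with frailty 1 as given; the function names are invented for illustration.

```python
import math

def surv_gamma_frailty(cum_hazard, theta):
    """Population survival when frailty is gamma, mean 1, variance theta."""
    return (1 + theta * cum_hazard) ** (-1 / theta)

def surv_invgauss_frailty(cum_hazard, theta):
    """Population survival when frailty is inverse-Gaussian,
    mean 1, variance theta."""
    return math.exp((1 - math.sqrt(1 + 2 * theta * cum_hazard)) / theta)

# With cumulative hazard 1 and no frailty, survival is exp(-1), about 0.368
no_frailty = math.exp(-1.0)
with_gamma = surv_gamma_frailty(1.0, 1.0)        # 0.5
with_invgauss = surv_invgauss_frailty(1.0, 1.0)
```

As theta approaches zero both expressions approach exp(-H), the survival function without frailty; for positive theta the population survival is higher, because the frailest subjects tend to fail first and the survivors are increasingly robust.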

Parametric shared frailty models are the next logical step in development in this area, and will soon be available as an update to Stata 7. For these models, the random unobservable frailty effects are assumed to follow either a gamma or an inverse-Gaussian distribution, but are constrained to be equal over those observations from a given group or panel.

Frailty models and shared frailty models for parametric regression with survival data will be discussed, along with avenues for future development at StataCorp in this area, in particular, an application of the frailty principle to Cox regression.

Report to users / Wishes and grumbles

William Gould and Bobby Gutierrez

Report to users/Wishes and grumbles session

Scientific organizers

Stephen P. Jenkins, University of Essex

Bianca De Stavola, London School of Hygiene and Tropical Medicine

Logistics organizers

Timberlake Consultants, the official distributor of Stata in the United Kingdom.