Last updated: 7 June 2001
2001 UK Stata Users Group meeting
14–15 May 2001
Royal Statistical Society
12 Errol Street
London EC1Y 8LX
Plotting graded data: a Tukey-ish approach
Nicholas J. Cox
Graded data are those possessing an inherent order but falling short of a
metric scale. Examples are opinions on a five-point scale, such as strongly
disagree, disagree, neutral, agree, strongly agree. Graded data are, like
ranked data, one kind of ordinal data. They are common in many fields,
especially as the record of some considered judgment, but little attention
seems to have been given to methods for their easy and effective graphical
presentation.
This presentation draws on suggestions made in various places by J.W. Tukey,
using the principle that cumulative probabilities are a logical and practical
way to represent graded data, which is after all the basis for many models for
graded responses. Cumulative probability curves for different subsets of the
data are found useful in initial description and exploration of the data. A
Stata program ordplot offers various kinds of flexibility in showing
such curves:
- cumulating to the bottom, the middle and the top of each grade;
- complementary distribution functions (descending curves) may be
shown as well as cumulative distribution functions (ascending curves);
- logit, folded root (more generally, folded power), loglog,
cloglog, normal (Gaussian), percent and raw scales are all allowed for
the cumulative probabilities;
- for such scales, labels, lines and ticks may be
in terms of the transformed units or in terms of probabilities or percents;
- different scores may be assigned to grades on the fly.
In practice, most datasets seem to reveal their basic structure on either
raw or logit scales. In some cases, the discrete response models fitted by
previous authors appear, as a consequence, to be unnecessarily elaborate or
indirect.
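The cumulative-probability idea is easy to sketch outside Stata. The Python
fragment below is an illustration only (not part of ordplot; the five-point
counts are invented): it computes cumulative proportions and their logits for
a graded response.

```python
import math

# Invented five-point counts: strongly disagree ... strongly agree.
counts = [12, 30, 45, 25, 8]

def cumulative_logits(counts):
    """Cumulative proportions P(grade <= g) and their logits.
    The top grade is omitted: its cumulative proportion is always 1."""
    total = sum(counts)
    running = 0
    out = []
    for c in counts[:-1]:
        running += c
        p = running / total
        out.append((p, math.log(p / (1 - p))))
    return out

for p, lgt in cumulative_logits(counts):
    print(f"P = {p:.3f}   logit = {lgt:+.3f}")
```

Plotting the logit column against grade for each subgroup gives ascending
curves of the kind described above; roughly parallel curves suggest a
cumulative-logit (proportional odds) structure.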
Fitting log-linear models with ignorable and non-ignorable missing data
We describe Stata macros that implement the composite link approach to missing
data in log-linear models first described by David Rindskopf (Psychometrika,
1992, V57, 29–42). When a missing value occurs among the variables that form
a contingency table, the resulting observation contributes to the frequencies
of a table of lower dimension than the full table, collapsed along the
dimension of the missing variable. Our primary interest lies in constructing
a model for the full dimensional table. The composite link approach maps the
observed cells of this collapsed table to the corresponding unobserved cells
of the full dimensional table. This mapping allows expected cell frequencies
for observed cells to be obtained from the expected cell frequencies for the
unobserved cells, the latter being derived from a near standard log-linear
model. A preliminary macro reorganizes the data from a file of individual
records with possibly missing variable values to a file where each record
represents either an observed cell frequency or an unobserved cell that
contributes to an observed cell. The records also contain the necessary
design variables and interaction terms to allow the second macro, an
adaptation of Stata's original glm procedure, to fit log-linear models
that assume the missing values are MCAR, MAR or conform to some non-ignorable
model. We illustrate the use of the macros. The primary contributors to this
work were Colin Taylor and Alan Taylor (Institute of Psychiatry, London) and
Daphne Kounali (now MRC Environmental Epidemiology).
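The mapping can be sketched numerically. The Python fragment below is an
illustration of the composite link idea only, not the macros themselves: the
2x2 table and independence model are invented, and the scaling of the
fully observed and partially observed samples is simplified. Expected
frequencies for observed cells are sums, via a 0/1 composite matrix, of
expected full-table cell frequencies from a log-linear model.

```python
import math

# Toy 2x2 table for variables A and B; full-table cells in the order
# (A,B) = (0,0), (0,1), (1,0), (1,1). Invented log-linear model:
# log mu = b0 + bA*A + bB*B (independence model).
def expected_full(b0, bA, bB):
    return [math.exp(b0 + bA * a + bB * b) for a in (0, 1) for b in (0, 1)]

# Composite matrix C: one row per observed cell. The first four rows are
# the fully observed cells; the last two are records with B missing,
# which contribute to the A = 0 and A = 1 margins respectively.
C = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 0],  # B missing, A = 0
    [0, 0, 1, 1],  # B missing, A = 1
]

def composite_expected(b0, bA, bB):
    """Expected observed-cell frequencies: C times the full-table mu."""
    mu = expected_full(b0, bA, bB)
    return [sum(c * m for c, m in zip(row, mu)) for row in C]
```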
Parametric survival functions
Cox proportional-hazards regression has been essentially the automatic choice
of analysis tool for modeling survival data in medical studies. However, the
Cox model has several intrinsic features that may cause problems for the
analyst or an interpreter of the data. These include the necessity of
assuming proportional hazards and the very noisy estimate of the baseline
hazard function that is typically obtained. I shall demonstrate flexible
parametric models based on a proportional hazards or a proportional odds
metric. I will show the utility of such models in helping one to visualize
the hazard function and hazard ratio as functions of time, and in modeling
data with non-proportional effects of some or all of the covariates.
Royston, P. 2001. Flexible parametric alternatives to the Cox model ... and
more. Stata Technical Bulletin 61.
Royston, P. and M. K. B. Parmar. 2001. Flexible parametric models for
censored survival data with application to prognostic modeling and
estimation of treatment effects. Submitted for publication.
Adjusting for cross-over in a trial with survival end-points
Abdel Babiker, and
We consider a two-group clinical trial with a survival outcome, in which some
subjects may 'cross over' to receive the treatment of the other arm. Our
command strbee adjusts for treatment cross-over in one or both arms.
This is done by a randomization-respecting method which preserves the
intention-to-treat significance test.
G-estimation of the effect of exposures in longitudinal studies
and Kate Tilling
Stata's st suite of commands for the analysis of survival time data
allow flexible modeling of the effect of exposures which vary over time. A
potential problem in such analyses is that other risk factors may be both
confounders (i.e., associated with both exposure and disease outcome) and also
intermediate variables (on the causal pathway from exposure to disease). This
phenomenon is known as "time-varying confounding". Standard statistical models
for the analysis of cohort studies do not take such complex relationships into
account and may produce biased estimates of the effect of risk factor changes.
G-estimation of the effect of a time-varying exposure on outcome, allowing for
confounders which are also on the causal pathway, has been proposed for the
analysis of such inter-related data. We will present stgest, a Stata
program which performs g-estimation, allowing the results to be compared to
those from the more usual survival analysis. Using simulated data, we show
that the usual analysis can under-estimate the effect of an exposure on
disease where there is time-varying confounding, and that g-estimation
produces a more accurate estimate. Applications of the method will be
presented.
A graphical interface for Stata estimation commands
and David Clayton
At the UK Stata Users Group meeting in 2000, we presented a series of
linked commands which made it possible to declare exposure, stratifying
and confounding variables, and to combine this information with Stata
estimation commands such as regress, logistic, poisson,
stcox, xtpois, etc., to produce maximum likelihood estimates of
stratum-specific exposure effects, possibly controlled for other confounders.
Essentially, the idea was that the estimation techniques should be ML,
but the output should be closer to Mantel–Haenszel than to the
traditional table of main effects and interactions.
In this presentation, we demonstrate the use of a Graphical Interface
as an alternative way of declaring the information which will guide
the analysis. The GI was prepared using Stata's windowing commands.
We are not advocating the use of GIs in place of the command line as
a general strategy, only where the information to be passed to a
command is complex.
A Short Introduction to Stata for Biostatistics (handout: chap6-8.pdf)
xtgraph: summary graphs of xt data
xtgraph produces summary graphs of xt data, by time and by group.
It is very flexible, allowing means based on any power or 3-parameter log
transformation, and error bars for SE, CI, SD and Reference Range, as well as
medians with IQR. Normally, points are estimated separately for each level of
t and group. However, a model option will take values from the last
model fitted. This allows for linear and non-linear effects and displays
interactions. The main illustration deals with data about vitamins C and E.
Nicholas J. Cox
The Stata program triplot produces a triangular plot of three variables
with constant sum. Most commonly, three fractions or proportions add to 1, or
three percents add to 100. The constant sum constraint means that there are
just two independent pieces of information. Hence, it is possible to plot
observations in two dimensions within a triangle, which is a 2-simplex.
Triangular plots appear under various names in the literature, including
trilinear, triaxial, three-element maps, ternary, reference triangles,
percentage triangles, mixture, barycentric. Common geological applications are
to sedimentary facies or particle form; hence, more specific terms such as
facies and form triangles. triplot has several options for tuning
displays. A variety of data examples will be shown to indicate some of the
flexibility of the program.
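The underlying geometry is simple: each observation is the weighted average
of the triangle's vertices, with the three proportions as weights. A Python
sketch follows (illustrative only; this vertex placement is one common
convention, not necessarily the one triplot uses).

```python
import math

def ternary_xy(a, b, c):
    """Map a composition (a, b, c) with constant sum to Cartesian
    coordinates in an equilateral triangle with vertices
    A = (0, 0), B = (1, 0), C = (0.5, sqrt(3)/2). The point is the
    weighted average of the vertices, with weights a, b, c."""
    s = a + b + c              # rescales percents summing to 100
    a, b, c = a / s, b / s, c / s
    x = b + 0.5 * c
    y = c * math.sqrt(3) / 2
    return x, y
```

A pure-a observation lands on vertex A, and equal thirds land on the
centroid, as the weighted-average reading suggests.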
Extensions to gllamm
Sophia Rabe-Hesketh, Andrew Pickles, and Anders Skrondal
gllamm is a program to fit generalised linear latent and mixed models.
Since gllamm6 appeared in the STB (sg129), a large number of new
features have been added. Two important extensions will be discussed:
- More response processes can now be modelled including ordered and
unordered categorical responses and rankings. Multilevel models for
nominal data and rankings will be described and fitted in gllamm.
- Multilevel structural equation models can be fitted by specifying
regressions of latent variables on other latent variables and on
explanatory variables. Examples will be described and fitted in gllamm.
Other new features in gllamm include parameter constraints, and a
'post-estimation' program, gllapred, for estimating posterior means and
standard deviations.
Sample size calculation in complex studies with failure time outcome
Abdel Babiker and Patrick Royston
Stata includes just one program, sampsi, for calculating sample size.
It deals only with comparisons between two groups in terms of binary or
normally distributed outcomes. Many randomized controlled trials, however,
are designed around a survival-time outcome measure, may compare more than two
groups, and are subject to loss to follow-up, withdrawal from allocated
treatment and staggered entry. We provide a very flexible tool for
determining sample size in such studies. Because inevitably there are many
potential 'options' (in the Stata sense), the underlying conventional ado-file
(calcssi) for the routine may be dauntingly complex. For this reason we
have provided a menu-driven front end initiated by running the ado-file
The following study design features have been implemented:
- Analysis by unweighted, Tarone–Ware, or Harrington–Fleming
versions of the logrank test. In addition, a chi-square test comparing the
proportion of failures at the end of the study is available.
- Up to 6 treatment groups.
- Arbitrary baseline time-to-event distribution.
- Time-varying hazard ratios (i.e. non-proportional hazards).
- Arbitrary allocation ratios across groups.
- Loss to follow-up.
- Staggered patient entry.
- Cross-over from allocated treatment to alternative treatment.
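As a point of reference for the simplest special case (two groups, 1:1
allocation, unweighted logrank test, proportional hazards), the required
number of events is often approximated by Freedman's formula. The Python
sketch below illustrates that special case only; it is not the calcssi
routine, which handles far more general designs.

```python
from math import ceil
from statistics import NormalDist

def freedman_events(hr, alpha=0.05, power=0.9):
    """Freedman's approximate number of events for a two-sided logrank
    test with two groups, 1:1 allocation, and hazard ratio hr."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha / 2), z(power)
    return ceil(((za + zb) * (hr + 1) / (hr - 1)) ** 2)
```

For example, detecting a hazard ratio of 2 at 90% power and two-sided 5%
significance requires about 95 events; the total sample size then follows
from the anticipated event probability under accrual, follow-up and loss.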
Failure rates in the presence of competing risks
A Stata routine for estimating the cumulative incidence rate (CIR) and its
standard error in the presence of competing risks will be demonstrated. The
program mtable will have the same features as the ltable command in
Stata. In addition to the CIR estimates, the program will have an option to
produce the Kaplan-Meier type cause-specific failure rate.
mtable timevar outcomevar [weight] [if exp] [in exp]
[, by(groupvar) level(#) survival failure hazard
intervals(interval) noadjust notab graph graph_options
noconf saving(newvar) reason(#) at(#) ]
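The CIR itself has a simple product-limit form: at each event time, the
overall Kaplan–Meier survivor just before that time is multiplied by the
cause-specific hazard and accumulated. A Python sketch of this estimator
follows (illustrative only, not mtable; the data layout is invented).

```python
def cumulative_incidence(times, events, cause):
    """Cumulative incidence of one cause in the presence of competing
    risks. events: 0 = censored, otherwise a cause code. Returns the
    (time, CIR) path over the distinct observed times."""
    data = sorted(zip(times, events))
    n = len(data)
    surv, cir, at_risk = 1.0, 0.0, n
    out = []
    i = 0
    while i < n:
        t = data[i][0]
        d_all = d_cause = 0
        j = i
        while j < n and data[j][0] == t:       # handle ties at time t
            if data[j][1] != 0:
                d_all += 1
                if data[j][1] == cause:
                    d_cause += 1
            j += 1
        cir += surv * d_cause / at_risk        # KM survivor x cause hazard
        surv *= 1 - d_all / at_risk            # overall Kaplan-Meier step
        at_risk -= j - i
        out.append((t, cir))
        i = j
    return out
```

Summing the CIRs over all causes plus the overall survivor function
returns 1, which is the accounting identity that motivates the estimator.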
A discrete time split population survival (cure) model
In the standard survival model, the risk of failure is non-zero for all cases.
A split-population (or cure) survival model relaxes this assumption and allows
an (estimable) fraction of cases never to experience the event. This
presentation reports on an implementation of a discrete time (or grouped
survival data) version of this model, using ml method d0, and on the
problems of implementing a 'robust' option.
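The likelihood being maximized has a simple form: with cure fraction p, a
failure at time t contributes (1-p) times the failure density, while a
censored spell contributes p plus (1-p) times the survivor function. Below
is a Python sketch with a constant per-period hazard, a toy parameterization
for illustration; the presented model works through ml method d0 with
covariates.

```python
import math

def cure_loglik(p, hazard, spells):
    """Log likelihood of a discrete-time split-population (cure) model
    with cure fraction p and constant per-period hazard.
    spells: list of (duration, failed) pairs, durations in whole periods."""
    ll = 0.0
    for t, failed in spells:
        surv = (1 - hazard) ** (t - 1)        # survive the first t-1 periods
        if failed:
            ll += math.log((1 - p) * surv * hazard)
        else:
            ll += math.log(p + (1 - p) * surv * (1 - hazard))
    return ll
```

Setting p = 0 recovers the standard discrete-time survival likelihood,
which is why the cure fraction is estimable from long censored spells.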
Efficient management of multi-frequency panel data with Stata
Christopher F. Baum
This presentation discusses how the tasks involved with carrying out a sizable
research project, involving panel data at both monthly and daily frequencies,
could be efficiently managed by making use of built-in and user-contributed
features of Stata. The project entails the construction of a dataset of
cross-country monthly measures for 18 nations, and the evaluation of bilateral
economic activity between each distinct pair of countries. One measure of
volatility, at a monthly frequency, is calculated from daily spot exchange
rate data, and effectively merged back to the monthly dataset. Nonlinear
least squares models are estimated for every distinct bilateral relationship,
and the results of those 300+ models organized for further analysis and
production of summary tables and graphics using a postfile. The various
labor-saving techniques used to carry out this research will be discussed,
with emphasis on the generality that allows additional countries, time
periods, and data to be integrated with the panel dataset with ease.
Splines with parameters that can be explained in words to non-mathematicians
This contribution is based on my programs bspline and frencurv,
which are used to generate bases for Schoenberg B-splines and splines
parameterized by their values at reference points on the X-axis (presented in
STB-57 as insert sg151). The program frencurv ("French curve") makes it
possible for the user to fit a model containing a spline, whose parameters are
simply values of the spline at reference points on the X-axis. For instance,
if I am modeling a time series of daily hospital asthma admissions counts to
assess the effect of acute pollution episodes, I might use a spline to model
the long-term time trend (typically a gradual long-term increase superimposed
on a seasonal cycle), and include extra parameters representing the short-term
increases following pollution episodes. The parameters of the spline, as
presented with confidence intervals, might then be the levels of hospital
admissions, on the first day of each month, expected in the absence of
pollution. The spline would then be a way of interpolating expected
pollution-free values for the other days of the month. The advantage of
presenting splines in this way is that the spline parameters can be explained
in words to a non-mathematician (e.g., a medic), which is not easy with other
parameterizations used for splines.
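The idea can be illustrated with the piecewise-linear case, where the basis
functions are 'hat' functions and the regression coefficients are exactly
the spline's values at the reference points. The Python sketch below is
illustrative only; frencurv itself handles higher-order B-splines.

```python
def hat(xi, left, center, right):
    """Piecewise-linear 'hat' function: 1 at center, 0 beyond left/right."""
    if xi == center:
        return 1.0
    if left < xi < center:
        return (xi - left) / (center - left)
    if center < xi < right:
        return (right - xi) / (right - center)
    return 0.0

def hat_basis(x, knots):
    """Design matrix whose regression coefficients are the fitted values
    of a piecewise-linear spline at the reference points `knots`."""
    ext = [knots[0]] + list(knots) + [knots[-1]]   # repeat boundary knots
    return [[hat(xi, ext[j], ext[j + 1], ext[j + 2])
             for j in range(len(knots))]
            for xi in x]
```

Because the columns sum to 1 between adjacent knots, fitting a regression on
this basis interpolates linearly between the coefficients, so each
coefficient (and its confidence interval) is directly the expected level at
one reference point.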
Propensity score matching
The typical evaluation problem aims at quantifying the impact of a 'treatment'
(e.g., a training programme, a reform, or a medicine) on an outcome of interest
(such as earnings, school attendance, or illness indicators), where a group of
units, the 'treated', receive the 'treatment', while a second group remains
untreated. Statistical matching involves pairing each treated unit with a
non-treated unit with the 'same' observable characteristics, so that (under
some assumptions) the outcome experienced by the matched pool of non-treated
may be taken as the outcome the treated units would have experienced had they
not been treated. Alternatively, one can associate to each treated unit a
matched outcome given by the average of the outcome of all the untreated
units, where each of their contributions is weighted according to their
'distance' to the treated unit under consideration. An interesting quantity
which avoids the dimensionality problem is the 'propensity score', the
conditional probability of being treated. This ado-file implements propensity
score matching, in both its one-to-one and kernel-based versions.
Additionally, it allows matching on two variables, as would be required, e.g.,
in the evaluation of multiple treatments.
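The one-to-one version is easy to sketch: each treated unit is paired with
the untreated unit whose estimated propensity score is closest. The Python
fragment below is an illustration only, not the ado-file; the scores are
taken as given, whereas in practice they come from a logit or probit model.

```python
def nn_match(treated, controls):
    """Average treatment effect on the treated, by one-to-one nearest-
    neighbour matching (with replacement) on the propensity score.
    Both arguments are lists of (score, outcome) pairs."""
    diffs = []
    for score, outcome in treated:
        _, matched = min(controls, key=lambda c: abs(c[0] - score))
        diffs.append(outcome - matched)
    return sum(diffs) / len(diffs)
```

The kernel-based version replaces the single nearest neighbour with a
weighted average of all control outcomes, with weights decreasing in the
distance between propensity scores.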
Nonparametric estimates of hazard rates can be computed as functions of time
(e.g., age or calendar time). Given random variations in survival times,
estimates of the hazard typically must be smoothed to distinguish trends from
noise. Left truncation (at a known age or time) and right censoring typically
complicate estimation. Stata does not include routines to estimate smoothed
hazards. Therefore, I will present a practical means to estimate smoothed
hazards, allowing for possible left-truncation and right-censoring.
The presentation will consider the use of kernel density estimation methods.
For discrete time intervals of fixed length, other approaches are available,
and may be mentioned. Confidence intervals and choice of (constant and
varying) smoothing window widths may also be discussed. Company exit
("death") rates for alternative types of firms will be used as illustration.
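One standard construction smooths the Nelson–Aalen increments d_j/n_j with a
kernel. A Python sketch using an Epanechnikov kernel follows (illustrative
only; it has no boundary correction, one of the practical issues noted
above, so estimates near the ends of the time range are biased downwards).

```python
def smoothed_hazard(t, event_times, at_risk, deaths, bandwidth):
    """Kernel estimate of the hazard at time t: Nelson-Aalen increments
    deaths/at_risk smoothed with an Epanechnikov kernel of the given
    bandwidth."""
    total = 0.0
    for tj, nj, dj in zip(event_times, at_risk, deaths):
        u = (t - tj) / bandwidth
        if abs(u) <= 1:
            total += 0.75 * (1 - u * u) * dj / nj
    return total / bandwidth
```

Left truncation and right censoring enter only through the risk set counts
n_j, which is what makes this construction practical for such data.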
Predicting ordinal outcomes: options and assumptions
[Hand-outs/slides: part I, part II]
There are a number of methods of analyzing data that consist of several
distinct categories, with the categories ordered in some manner. Analysis of
such data is commonly based on a generalized linear model of the cumulative
response probability, either the cumulative odds model (ologit) or the
continuation ratio model (ocratio). However, these models assume a
particular relationship between the predictor variables and the outcome. If
these assumptions are not met, a multinomial model, which does not make such
assumptions, can be fitted instead. This effectively ignores the ordering of
the categories. It has the disadvantage that it requires more parameters than
the above models, which makes it more difficult to interpret. An alternative
model for ordinal data is the stereotype model. This has been little used in
the past, as it is quite difficult to fit. It can be thought of as a
constrained multinomial model, although some of the constraints applied are
nonlinear. An ado-file to fit this model in Stata has recently been developed.
I will present analyses of a radiographic dataset, where the aim was to
predict the severity of joint damage. All four of the above models were fitted
to the data. The assumptions of the cumulative odds and continuation ratio
models were not satisfied. A highly constrained stereotype model provided a
good fit. Importantly, it showed that different variables were important for
discriminating between different levels of the outcome variable.
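In one common parameterization (sign conventions vary), the stereotype model
sets the log odds of category k versus a reference category to
theta_k - phi_k * beta'x, so all categories share a single direction beta
scaled by (typically ordered) phi_k. A Python sketch of the implied
probabilities follows; it is an illustration of the model form only, not the
ado-file.

```python
import math

def stereotype_probs(x, theta, phi, beta):
    """Category probabilities under a one-dimensional stereotype model.
    Index 0 of the result is the reference category (theta = phi = 0);
    theta and phi give the intercepts and scores of the other categories.
    All categories share the single linear predictor beta . x."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    scores = [1.0] + [math.exp(t - p * eta) for t, p in zip(theta, phi)]
    total = sum(scores)
    return [s / total for s in scores]
```

Constraining two categories to share the same phi makes them
indistinguishable with respect to x, which is how the highly constrained
fits mentioned above arise.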
Frailty in survival analysis models (parametric frailty, parametric shared
frailty, and frailty in Cox models)
Frailty models are used to model survival times in the presence of
overdispersion or group-specific random effects. The latter are distinguished
from the former by the term "shared" frailty models. With the release of
Stata 7, estimation of parametric non-shared frailty models is now possible,
and the new models appear as extensions to the six parametric survival models
previously available. The overdispersion in this case is represented by an
unobservable multiplicative effect on the hazard, or frailty. For purposes of
estimation this frailty is then assumed to follow either a gamma or an
inverse-Gaussian distribution.
Parametric shared frailty models are the next logical step in the development
in this area, and will soon be available as an update to Stata 7. For these
models, the random unobservable frailty effects are assumed to follow either a
gamma or inverse-Gaussian distribution, but are constrained to be equal over
those observations from a given group or panel.
Frailty models and shared frailty models for parametric regression with
survival data will be discussed, along with avenues for future development at
StataCorp in this area, in particular, an application of the frailty
principle to Cox regression.
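For the gamma case the marginal (population) survivor function has a closed
form: integrating the frailty out of exp(-z H(t)) against a gamma density
with mean 1 and variance theta gives (1 + theta H(t))^(-1/theta). A small
Python illustration of this identity:

```python
import math

def gamma_frailty_survivor(cum_hazard, theta):
    """Population survivor function when each subject's hazard is
    multiplied by a gamma frailty with mean 1 and variance theta:
    E[exp(-Z * H(t))] = (1 + theta * H(t)) ** (-1 / theta).
    theta -> 0 recovers the frailty-free survivor exp(-H(t))."""
    if theta == 0:
        return math.exp(-cum_hazard)
    return (1 + theta * cum_hazard) ** (-1 / theta)
```

The heavier tail of the frailty-averaged survivor reflects selection: the
frailest subjects fail first, so the surviving population looks
progressively more robust.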
William Gould and
Report to users/Wishes and grumbles session
Nicholas J. Cox, Durham University
Patrick Royston, MRC Clinical Trials Unit
Timberlake Consultants, the official distributor
of Stata in the United Kingdom.