Last updated: 7 June 2001
2001 UK Stata Users Group meeting
14–15 May 2001
Royal Statistical Society
12 Errol Street
London EC1Y 8LX
Proceedings
Plotting graded data: a Tukeyish approach
[Handouts/slides]
Nicholas J. Cox

Graded data are those possessing an inherent order but falling short of a
metric scale. Examples are opinions on a five-point scale, such as strongly
disagree, disagree, neutral, agree, strongly agree. Graded data are, like
ranked data, one kind of ordinal data. They are common in many fields,
especially as the record of some considered judgment, but little attention
seems to have been given to methods for their easy and effective graphical
display.
This presentation draws on suggestions made in various places by J.W. Tukey,
using the principle that cumulative probabilities are a logical and practical
way to represent graded data, which is after all the basis for many models for
graded responses. Cumulative probability curves for different subsets of the
data are found useful in initial description and exploration of the data. A
Stata program ordplot offers various kinds of flexibility in showing
such curves:
 cumulating to the bottom, the middle and the top of each class;
 complementary distribution functions (descending curves) may be shown as
well as cumulative distribution functions (ascending curves);
 logit, folded root (more generally, folded power), loglog, cloglog, normal
(Gaussian), percent and raw scales are all allowed for cumulative
probabilities;
 for such scales, labels, lines and ticks may be in terms of the transformed
units or in terms of probabilities or percents;
 different scores may be assigned to grades on the fly.
In practice, most datasets seem to reveal their basic structure either on raw
or on logit scales. In some cases, the discrete response models fitted by
previous authors appear, as a consequence, to be unnecessarily elaborate or
indirect.
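As a rough illustration of the cumulative-probability idea (a sketch in Python, not the ordplot code; the function names and the counts are invented for the example), the following cumulates grade frequencies to the bottom, middle or top of each class and puts the result on a logit scale:

```python
import math

def cumulative_probabilities(counts, position="middle"):
    """Cumulative probability for each grade, with counts given in grade order.

    position selects how much of each class is included:
    "bottom" (none), "middle" (half) or "top" (all of it).
    """
    share = {"bottom": 0.0, "middle": 0.5, "top": 1.0}[position]
    total = sum(counts)
    below = 0  # observations in strictly lower grades
    result = []
    for count in counts:
        result.append((below + share * count) / total)
        below += count
    return result

def logit(p):
    """Logit transform; defined only for 0 < p < 1."""
    return math.log(p / (1 - p))

# Hypothetical counts for strongly disagree ... strongly agree.
counts = [10, 20, 40, 20, 10]
mid = cumulative_probabilities(counts, "middle")
mid_logit = [logit(p) for p in mid]
```

Cumulating to the middle of each class keeps every probability strictly between 0 and 1, so the logit is always defined; cumulating to the top gives 1 for the highest grade, which cannot be shown on a logit scale.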
Fitting loglinear models with ignorable and nonignorable missing data
Andrew Pickles

We describe Stata macros that implement the composite link approach to missing
data in loglinear models first described by David Rindskopf (Psychometrika,
1992, V57, 29–42). When a missing value occurs among the variables that form
a contingency table, the resulting observation contributes to the frequencies
of a table of lower dimension than the full table, namely the full table
collapsed along the dimension of the missing variable. Our primary interest
lies in constructing a model for the full dimensional table. The composite
link approach maps the observed cells of this collapsed table to the
corresponding unobserved cells of the full dimensional table. This mapping
allows expected cell frequencies for observed cells to be obtained from the
expected cell frequencies for the unobserved cells, the latter being derived
from a near-standard loglinear
model. A preliminary macro reorganizes the data from a file of individual
records with possibly missing variable values to a file where each record
represents either an observed cell frequency or an unobserved cell that
contributes to an observed cell. The records also contain the necessary
design variables and interaction terms to allow the second macro, an
adaptation of Stata's original glm procedure, to fit loglinear models
that assume the missing values are MCAR, MAR or conform to some nonignorable
model. We illustrate the use of the macros. The primary contributors to this
work were Colin Taylor and Alan Taylor (Institute of Psychiatry, London) and
Daphne Kounali (now MRC Environmental Epidemiology).
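The mapping from unobserved to observed cells can be sketched as follows. This is a minimal Python illustration of the composite link idea only, not the macros described above; the 2 x 2 table, the design matrix, the parameter values and the function names are all assumptions made for the example.

```python
import math

def full_table_means(design, beta):
    """Expected frequencies of the unobserved full-table cells from a
    loglinear model: mu_i = exp(x_i . beta)."""
    return [math.exp(sum(x * b for x, b in zip(row, beta))) for row in design]

def composite_link(mapping, mu):
    """Expected frequencies of the observed (collapsed) cells: each is the
    sum of the unobserved cells that map onto it."""
    return [sum(c * m for c, m in zip(row, mu)) for row in mapping]

# Hypothetical 2 x 2 table A x B, cells ordered (a1b1, a1b2, a2b1, a2b2),
# for records in which B is missing. Columns: constant, A = a2, B = b2.
design = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
beta = [math.log(10), math.log(2), math.log(3)]  # illustrative values only

# With B missing, only the level of A is observed, so each observed cell
# is the sum of the two unobserved cells that share that level of A.
mapping = [[1, 1, 0, 0], [0, 0, 1, 1]]

mu = full_table_means(design, beta)
observed_means = composite_link(mapping, mu)
```

Here the four unobserved cell means are 10, 30, 20 and 60, and the two observed cells have means 40 and 80.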
Parametric survival functions
[Handouts/slides]
Patrick Royston

Cox proportional-hazard regression has been essentially the automatic choice
of analysis tool for modeling survival data in medical studies. However, the
Cox model has several intrinsic features that may cause problems for the analyst or an interpreter of the data. These include the necessity of assuming proportional hazards and the very noisy estimate of the baseline hazard function that is typically obtained. I shall demonstrate flexible parametric models based on a proportional hazards or a proportional odds metric. I will show the utility of such models in helping one to visualize the hazard function and hazard ratio as functions of time, and in modeling data with nonproportional effects of some or all of the covariates.
References
Royston, P. 2001. Flexible parametric alternatives to the Cox model ... and
more. Stata Technical Bulletin: 61.
Royston, P. and M. K. B. Parmar. 2001. Flexible parametric models for censored
survival data with application to prognostic modeling and estimation of treatment effects. Submitted for publication.
Adjusting for crossover in a trial with survival endpoints
[Handouts/slides]
Ian White,
Abdel Babiker, and
Sarah Walker

We consider a two-group clinical trial with a survival outcome, in which some
subjects may 'cross over' to receive the treatment of the other arm. Our
command strbee adjusts for treatment crossover in one or both arms.
This is done by a randomization-respecting method which preserves the
intention-to-treat P-value.
G-estimation of the effect of exposures in longitudinal studies
Jonathan Sterne
and Kate Tilling

Stata's st suite of commands for the analysis of survival time data
allows flexible modeling of the effect of exposures which vary over time. A
potential problem in such analyses is that other risk factors may be both
confounders (i.e., associated with both exposure and disease outcome) and also
intermediate variables (on the causal pathway from exposure to disease). This
phenomenon is known as "time-varying confounding". Standard statistical models
for the analysis of cohort studies do not take such complex relationships into
account and may produce biased estimates of the effect of risk factor changes.
G-estimation of the effect of a time-varying exposure on outcome, allowing for
confounders which are also on the causal pathway, has been proposed for the
analysis of such interrelated data. We will present stgest, a Stata
program which performs g-estimation, allowing the results to be compared to
those from the more usual survival analysis. Using simulated data, we show
that the usual analysis can underestimate the effect of an exposure on
disease where there is time-varying confounding, and that g-estimation
produces a more accurate estimate. Applications of the method will be
discussed.
A graphical interface for Stata estimation commands
[Handouts/slides]
Michael Hills
and David Clayton

At the UK Stata Users Group meeting in 2000, we presented a series of
linked commands which made it possible to declare exposure, stratifying
and confounding variables, and to combine this information with Stata
estimation commands such as regress, logistic, poisson,
stcox, xtpois, etc., to produce maximum likelihood estimates of
stratum-specific exposure effects, possibly controlled for other confounders.
Essentially, the idea was that the estimation techniques should be ML,
but the output should be closer to Mantel–Haenszel than to the
traditional table of main effects and interactions.
In this presentation, we demonstrate the use of a Graphical Interface
as an alternative way of declaring the information which will guide
the analysis. The GI was prepared using Stata's windowing commands.
We are not advocating the use of GIs in place of the command line as
a general strategy, only where the information to be passed to a
command is complex.
Additional resources:
A Short Introduction to Stata for Biostatistics: chap68.pdf
xtgraph: summary graphs of xt data
[Handouts/slides]
Paul Seed

xtgraph produces summary graphs of xt data, by time and by group.
It is very flexible, allowing means based on any power or 3-parameter log
transformation, and error bars for SE, CI, SD and Reference Range, as well as
medians with IQR. Normally, points are estimated separately for each level of
t and group. However, a model option will take values from the last
model fitted. This allows for linear and nonlinear effects and displays
interactions. The main illustration deals with data about Vitamin C and E
supplementation.
Triangular plots
[Handouts/slides]
Nicholas J. Cox

The Stata program triplot produces a triangular plot of three variables
with constant sum. Most commonly, three fractions or proportions add to 1, or
three percents add to 100. The constant sum constraint means that there are
just two independent pieces of information. Hence, it is possible to plot
observations in two dimensions within a triangle, which is a 2-simplex.
Triangular plots appear under various names in the literature, including
trilinear, triaxial, three-element maps, ternary, reference triangles,
percentage triangles, mixture, barycentric. Common geological applications are
to sedimentary facies or particle form; hence, more specific terms such as
facies and form triangles. triplot has several options for tuning
displays. A variety of data examples will be shown to indicate some of the
flexibility of the program.
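To make the geometry concrete, here is a minimal Python sketch (not the triplot code; the function name and the choice of triangle vertices are assumptions) that converts three components with constant sum into plotting coordinates inside an equilateral triangle with vertices at (0, 0), (1, 0) and (0.5, sqrt(3)/2):

```python
import math

def ternary_to_xy(a, b, c):
    """Map three non-negative components to (x, y) in an equilateral triangle.

    The components are rescaled by their sum, so fractions adding to 1 and
    percents adding to 100 give the same point. Vertex for a is (0, 0),
    for b is (1, 0) and for c is (0.5, sqrt(3)/2).
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("components must have a positive sum")
    x = (b + 0.5 * c) / total
    y = (math.sqrt(3) / 2) * c / total
    return x, y

# Equal shares land at the centroid of the triangle.
centre = ternary_to_xy(1, 1, 1)
```

An observation made up entirely of one component falls on the matching vertex, and one with a zero component falls on the opposite edge.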
Extensions to gllamm
[Handouts/slides]
Sophia Rabe-Hesketh, Andrew Pickles, and Anders Skrondal

gllamm is a program to fit generalised linear latent and mixed models.
Since gllamm6 appeared in the STB (sg129), a large number of new
features have been added. Two important extensions will be discussed:
 More response processes can now be modelled including ordered and
unordered categorical responses and rankings. Multilevel models for
nominal data and rankings will be described and fitted in gllamm.
 Multilevel structural equation models can be fitted by specifying
regressions of latent variables on other latent variables and on
explanatory variables. Examples will be described and fitted in
gllamm.
Other new features in gllamm include parameter constraints, and a
'postestimation' program, gllapred, for estimating posterior means and
probabilities.
Sample size calculation in complex studies with failure time outcome
Abdel Babiker and Patrick Royston

Stata includes just one program, sampsi, for calculating sample size.
It deals only with comparisons between two groups in terms of binary or
normally distributed outcomes. Many randomized controlled trials, however,
are designed around a survival-time outcome measure, may compare more than two
groups, and are subject to loss to follow-up, withdrawal from allocated
treatment, and staggered entry. We provide a very flexible tool for
determining sample size in such studies. Because inevitably there are many
potential 'options' (in the Stata sense), the underlying conventional ado-file
(calcssi) for the routine may be dauntingly complex. For this reason we
have provided a menu-driven front end initiated by running the ado-file
ssmenu.
The following study design features have been implemented:
 Up to 6 treatment groups.
 Arbitrary baseline time-to-event distribution.
 Time-varying hazard ratios (i.e. non-proportional hazards).
 Arbitrary allocation ratios across groups.
 Loss to follow-up.
 Staggered patient entry.
 Crossover from allocated treatment to alternative treatment.
 Analysis by unweighted, Tarone–Ware, or Harrington–Fleming
versions of the logrank test. In addition, a chi-square test comparing the
proportion of failures at the end of the study is available.
Failure rates in the presence of competing risks
Mohamed Ali

A Stata routine for estimating the cumulative incidence rate (CIR) and its
standard error in the presence of competing risks will be demonstrated. The
program mtable will have the same features as the ltable command in
Stata. In addition to the CIR estimates, the program will have an option to
produce the Kaplan–Meier type cause-specific failure rate.
Syntax:
mtable timevar outcomevar [weight] [if exp] [in exp]
[, by(groupvar) level(#) survival failure hazard
intervals(interval) noadjust notab graph graph_options
noconf saving(newvar) reason(#) at(#) ]
A discrete time split population survival (cure) model
[Handouts/slides]
Stephen Jenkins

In the standard survival model, the risk of failure is nonzero for all cases.
A split-population (or cure) survival model relaxes this assumption and allows
an (estimable) fraction of cases never to experience the event. This
presentation reports on an implementation of a discrete time (or grouped
survival data) version of this model, using ml method d0, and the
problems with implementing a 'robust' option.
Efficient management of multifrequency panel data with Stata
[Handouts/slides]
Christopher F. Baum

This presentation discusses how the tasks involved with carrying out a sizable
research project, involving panel data at both monthly and daily frequencies,
could be efficiently managed by making use of built-in and user-contributed
features of Stata. The project entails the construction of a dataset of
cross-country monthly measures for 18 nations, and the evaluation of bilateral
economic activity between each distinct pair of countries. One measure of
volatility, at a monthly frequency, is calculated from daily spot exchange
rate data, and effectively merged back to the monthly dataset. Nonlinear
least squares models are estimated for every distinct bilateral relationship,
and the results of those 300+ models organized for further analysis and
production of summary tables and graphics using a postfile. The various
labor-saving techniques used to carry out this research will be discussed,
with emphasis on the generality that allows additional countries, time
periods, and data to be integrated with the panel dataset with ease.
Splines with parameters that can be explained in words to non-mathematicians
[Handouts/slides]
Roger Newson

This contribution is based on my programs bspline and frencurv,
which are used to generate bases for Schoenberg B-splines and splines
parameterized by their values at reference points on the X-axis (presented in
STB-57 as insert sg151). The program frencurv ("French curve") makes it
possible for the user to fit a model containing a spline, whose parameters are
simply values of the spline at reference points on the X-axis. For instance,
if I am modeling a time series of daily hospital asthma admissions counts to
assess the effect of acute pollution episodes, I might use a spline to model
the long-term time trend (typically a gradual long-term increase superimposed
on a seasonal cycle), and include extra parameters representing the short-term
increases following pollution episodes. The parameters of the spline, as
presented with confidence intervals, might then be the levels of hospital
admissions, on the first day of each month, expected in the absence of
pollution. The spline would then be a way of interpolating expected
pollution-free values for the other days of the month. The advantage of
presenting splines in this way is that the spline parameters can be explained
in words to a non-mathematician (e.g., a medic), which is not easy with other
parameterizations used for splines.
Propensity score matching
[Handouts/slides]
Barbara Sianesi

The typical evaluation problem aims at quantifying the impact of a 'treatment'
(e.g., a training programme, a reform, or a medicine) on an outcome of interest
(such as earnings, school attendance, or illness indicators), where a group of
units, the 'treated', receive the 'treatment', while a second group remains
untreated. Statistical matching involves pairing each treated unit with a
non-treated unit with the 'same' observable characteristics, so that (under
some assumptions) the outcome experienced by the matched pool of non-treated
may be taken as the outcome the treated units would have experienced had they
not been treated. Alternatively, one can associate to each treated unit a
matched outcome given by the average of the outcome of all the untreated
units, where each of their contributions is weighted according to their
'distance' to the treated unit under consideration. An interesting quantity
which avoids the dimensionality problem is the 'propensity score', the
conditional probability of being treated. This ado-file implements propensity
score matching, in both its one-to-one and kernel-based versions.
Additionally, it allows matching on two variables, as would be required, e.g.,
in the evaluation of multiple treatments.
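The one-to-one version can be illustrated with a minimal Python sketch. It is not the ado-file's implementation: the function names and data are invented, the propensity scores are taken as given (in practice they would be estimated, for example from a logit or probit model), and matching is done with replacement.

```python
def match_nearest(treated, controls):
    """One-to-one nearest-neighbour matching on the propensity score.

    treated and controls are lists of (score, outcome) pairs. Each treated
    unit is paired with the control whose score is closest; a control may
    be used more than once (matching with replacement).
    Returns a list of (treated_outcome, matched_control_outcome) pairs.
    """
    pairs = []
    for t_score, t_outcome in treated:
        nearest = min(controls, key=lambda c: abs(c[0] - t_score))
        pairs.append((t_outcome, nearest[1]))
    return pairs

def effect_on_treated(pairs):
    """Average difference between treated and matched control outcomes."""
    return sum(t - c for t, c in pairs) / len(pairs)

# Hypothetical (propensity score, outcome) data.
treated = [(0.80, 10.0), (0.60, 8.0)]
controls = [(0.75, 7.0), (0.55, 6.0), (0.20, 3.0)]

pairs = match_nearest(treated, controls)
estimate = effect_on_treated(pairs)
```

The kernel-based version would replace the single nearest control by a weighted average of all controls, with weights that fall as the score distance grows.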
Smoothed hazards
Ken Simons

Nonparametric estimates of hazard rates can be computed as functions of time
(e.g., age or calendar time). Given random variations in survival times,
estimates of the hazard typically must be smoothed to distinguish trends from
noise. Left truncation (at a known age or time) and right censoring typically
complicate estimation. Stata does not include routines to estimate smoothed
hazards. Therefore, I will present a practical means to estimate smoothed
hazards, allowing for possible left truncation and right censoring.
The presentation will consider the use of kernel density estimation methods.
For discrete time intervals of fixed length, other approaches are available,
and may be mentioned. Confidence intervals and choice of (constant and
varying) smoothing window widths may also be discussed. Company exit
("death") rates for alternative types of firms will be used as illustration.
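One common way to build such an estimate is to kernel-smooth the increments of the Nelson–Aalen cumulative hazard. The Python sketch below assumes that approach with an Epanechnikov kernel and a constant bandwidth; it is not necessarily the method presented, it applies no boundary correction, and the function names and data are invented for the example.

```python
def risk_set_size(subjects, t):
    """Number at risk at time t. Each subject is (entry, exit, event);
    a subject is at risk if entry < t <= exit, which handles left truncation."""
    return sum(1 for entry, exit_, _ in subjects if entry < t <= exit_)

def hazard_increments(subjects):
    """Nelson-Aalen increments d_j / n_j at each distinct event time."""
    event_times = sorted({exit_ for _, exit_, event in subjects if event})
    increments = []
    for t in event_times:
        deaths = sum(1 for _, exit_, event in subjects if event and exit_ == t)
        increments.append((t, deaths / risk_set_size(subjects, t)))
    return increments

def epanechnikov(u):
    """Epanechnikov kernel, zero outside (-1, 1)."""
    return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0

def smoothed_hazard(subjects, t, bandwidth):
    """Kernel-weighted sum of hazard increments around time t."""
    return sum(
        epanechnikov((t - tj) / bandwidth) * inc
        for tj, inc in hazard_increments(subjects)
    ) / bandwidth

# Hypothetical firms: (entry time, exit time, exited?); False = right censored.
subjects = [(0, 2, True), (0, 4, True), (1, 5, False), (3, 6, True)]
increments = hazard_increments(subjects)
hazard_at_4 = smoothed_hazard(subjects, 4, 1.0)
```

A wider bandwidth gives a smoother but more biased curve; near the ends of the observed time range the estimate is biased downwards unless a boundary correction is added.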
Predicting ordinal outcomes: options and assumptions
[Handouts/slides, part I, part II]
Mark Lunt

There are a number of methods of analyzing data that consist of several
distinct categories, with the categories ordered in some manner. Analysis of
such data is commonly based on a generalized linear model of the cumulative
response probability, either the cumulative odds model (ologit) or the
continuation ratio model (ocratio). However, these models assume a
particular relationship between the predictor variables and the outcome. If
these assumptions are not met, a multinomial model, which does not make such
assumptions, can be fitted instead. This effectively ignores the ordering of
the categories. It has the disadvantage that it requires more parameters than
the above models, which makes it more difficult to interpret. An alternative
model for ordinal data is the stereotype model. This has been little used in
the past, as it is quite difficult to fit. It can be thought of as a
constrained multinomial model, although some of the constraints applied are
nonlinear. An ado-file to fit this model in Stata has recently been developed.
I will present analyses of a radiographic dataset, where the aim was to
predict the severity of joint damage. All four of the above models were fitted
to the data. The assumptions of the cumulative odds and continuation ratio
models were not satisfied. A highly constrained stereotype model provided a
good fit. Importantly, it showed that different variables were important for
discriminating between different levels of the outcome variable.
Frailty in survival analysis models (parametric frailty, parametric shared
frailty, and frailty in Cox models)
[Handouts/slides]
Bobby Gutierrez

Frailty models are used to model survival times in the presence of
overdispersion or group-specific random effects. The latter are distinguished
from the former by the term "shared" frailty models. With the release of
Stata 7, estimation of parametric nonshared frailty models is now possible,
and the new models appear as extensions to the six parametric survival models
previously available. The overdispersion in this case is represented by an
unobservable multiplicative effect on the hazard, or frailty. For purposes of
estimation this frailty is then assumed to follow either a gamma or an
inverse-Gaussian distribution.
Parametric shared frailty models are the next logical step in the development
in this area, and will soon be available as an update to Stata 7. For these
models, the random unobservable frailty effects are assumed to follow either a
gamma or inverse-Gaussian distribution, but are constrained to be equal over
those observations from a given group or panel.
Frailty models and shared frailty models for parametric regression with
survival data will be discussed, along with avenues for future development at
StataCorp in this area, in particular, an application of the frailty
principle to Cox regression.
William Gould and
Bobby Gutierrez

Report to users/Wishes and grumbles session
Scientific organizers
Nicholas J. Cox, Durham University
Patrick Royston, MRC Clinical Trials Unit
Logistics organizers
Timberlake Consultants, the official distributor
of Stata in the United Kingdom.