6th UK Users Group meeting abstracts
Stata users meeting
RSS, London, 15 May 2000
Abstracts and notes
Academic organisers:
- Nicholas J. Cox, Durham University, UK
- Patrick Royston, MRC Clinical Trials Unit, London
Logistic organisers:
- Timberlake Consultants
Program
15 May 2000

0925  Introduction and welcome

0930  Fitting complex random effect models with Stata using data augmentation:
      an application to the study of male and female fecundability
      David Clayton (MRC Biostatistics Unit, Cambridge) and
      René Ecochard (DIM Hospices Civils de Lyon)

1000  Nonparametric regression modelling using MCMC methods
      Gareth Ambler (Medical Statistics and Evaluation, Imperial College)

1030  Analysis of cancer survival with Stata
      Andy Sloggett (Epidemiology and Population Health,
      London School of Hygiene and Tropical Medicine)

1100  Coffee

1130  Enhancing access to statistical software tools and datasets
      for research and instruction
      Christopher F. Baum (Economics, Boston College)

1200  A web-based "Survival Analysis using Stata" course to accompany a
      lecture course: what is it and was it worth doing?
      Stephen Jenkins (Institute for Social and Economic Research,
      University of Essex)

1215  Confidence intervals for rank order statistics: Somers' D,
      Kendall's tau_a and their differences
      Roger Newson (Department of Public Health Sciences,
      Guy's, King's and St Thomas' School of Medicine)

1230  Lunch

1345  Parametrizing Regression Models
      Michael Hills (consultant) and
      David Clayton (MRC Biostatistics Unit, Cambridge)

1415  Quantile plots for right-censored data
      Tony Brady (Medical Statistics and Evaluation, Imperial College) and
      Patrick Royston (MRC Clinical Trials Unit, London)

1445  Plotting and fitting univariate distributions with long or heavy tails
      Nicholas J. Cox (Geography, University of Durham)

1515  Tea

1545  Report to users, followed by open discussion: wishes and grumbles
      William Gould (StataCorp, College Station, TX)

1715  Formal end of meeting
Nicholas J. Cox (Geography, University of Durham)
-
The 6th annual London Stata users' meeting was held at the Royal Statistical
Society on 15th May. (Other users' meetings have been held in Spain, 1999,
2000, and the Netherlands, 2000.)
The logistics were organised by Timberlake Consultants
(http://www.timberlake.co.uk)
and the academic programme was organised by
Patrick Royston and myself.
Participants in a full day of papers and discussions included not
only Stata users from the UK and the Irish Republic, but also frequent
Statalist contributors William Gould, President of StataCorp, and
Kit Baum, Boston College, from the United States, and Jens Lauritsen
from Denmark.
It is usually invidious to single out any particular paper, but on this
occasion David Clayton's ingenious use of multiple Statas for parallel
processing, each Stata fitting one random effect in a complex model, was
particularly stimulating. We all look forward to seeing it written up
and the underlying ideas implemented more generally.
A copy of the original programme together with the abstracts and
pointers appears here,
-
http://www.stata.com/support/meeting/6uk
This includes some notes I made on Bill Gould's
Report to users and the ensuing
discussion.
Kit Baum has already drawn attention to photos and a copy of his paper on his
site at
-
http://fmwww.bc.edu/repec/docs/suguk2000.html
It may be that other photos will be added to the Stata website. Certainly,
Bill seemed to be taking lots of photos whenever he was not talking Stata.
It is likely that the software discussed will be added to the site.
To obtain the software discussed during the meeting, in Stata type
-
. net from http://www.stata.com
. net cd meetings
. net cd 6uk
or, using Stata's menus,
- pull down Help and select STB and User-written Programs
- click on http://www.stata.com
- click on meetings
- click on 6uk
Some of the software is already published (e.g. Roger Newson's program is just
out in STB-55) and some may appear elsewhere.
— Nick
n.j.cox@durham.ac.uk
David Clayton (MRC Biostatistics Unit, Cambridge), and
René Ecochard (DIM Hospices Civils de Lyon)
-
We discuss fitting of a complex random effect model using Stata to carry out
block-wise Gibbs sampling within a multi-processor computing environment. The
application involves a dataset concerning artificial insemination by donor
(AID). Success or failure at each of 12,100 menstrual cycles is modelled with
a mixed model with random effects due to woman, conception attempt within
woman, semen donor, donation within donor, and the treating physician. Given
the availability of software within Stata to fit a model with a single random
effect, the full model can be fitted by an alternating imputation algorithm
(Clayton and Rasbash, 1999) implemented with five copies of Stata running on
separate processors and communicating via disk files. Each process fits one
random effect plus all the fixed effects. The five processes may run in
synchronous or asynchronous mode. Process synchronisation and file locking are
implemented in a `toolkit' of Stata programs.
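The toolkit itself is not reproduced here, but a minimal sketch of the
underlying idea, in which Stata processes signal each other by creating and
watching for files on a shared disk, might look like the following (program
and file names are purely illustrative, not those of the actual toolkit):
-
* hypothetical sketch of file-based signalling between two Stata processes
program define waitforflag
    args flagfile
    * loop until some other process creates the flag file
    capture confirm file "`flagfile'"
    while _rc {
        sleep 1000                        // wait one second, then look again
        capture confirm file "`flagfile'"
    }
end

* process A: save its current random-effect draws, then signal readiness
save draws_a, replace
file open fh using "ready_a.flag", write replace
file close fh

* process B: wait for A's signal, then read the draws before updating its own effect
waitforflag ready_a.flag
use draws_a, clear
erase ready_a.flag                        // clear the flag for the next cycle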
Gareth Ambler (Medical Statistics and Evaluation,
Imperial College)
-
Nonparametric regression modelling may be used to estimate the relationship
between a response and a predictor when one wants to make few assumptions
about the form of the relationship. One approach is to estimate the
regression function using piecewise polynomials that are non-zero only between
adjacent knot points. A drawback of this approach is that the number and
location of the knots usually have to be chosen.
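To fix ideas, a conventional fixed-knot fit in Stata might look like the
following minimal sketch (variable names and knot positions are purely
illustrative):
-
* linear spline of y on x with knots placed by hand at 10, 20 and 30
mkspline x1 10 x2 20 x3 30 x4 = x
regress y x1 x2 x3 x4
predict yhat
* the fitted curve depends on how many knots are used and where they are put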
Denison and colleagues (1998) suggested a methodology that does not require us
to make this choice. They proposed treating the number and location of the
knots as random variables and using MCMC simulation techniques to sample from
their distribution. An average of the corresponding fits provides an estimate
of the regression function.
I will describe bcf, which implements this method, and will illustrate
its potential on both real and simulated data.
Denison, D., Mallick, B.K. and Smith, A.F.M. 1998. Automatic Bayesian
curve fitting. Journal of the Royal Statistical Society, Series B 60,
333-350.
Andy Sloggett (Epidemiology and Population Health, London School of
Hygiene and Tropical Medicine)
-
Two years ago, at the 4th User Group meeting, I presented a purpose-written
Stata routine for calculating relative survival in follow-up studies, usually
cancer survival studies. The routine strel (name registered with
Stata) has been further developed and also adapted for use with Stata v5 or
v6. A brief re-cap will be presented.
Some comparisons with the hitherto "gold standard" routine written by Timo
Hakulinen, and the advantages and disadvantages of each approach, will be
presented. The Esteve methodology has some foibles which will be
mentioned.
For many cancers the relative survival curve flattens to an asymptote
after some years. When relative survival estimates are available for a
series of times post-diagnosis the curve can be modelled with the Stata
non-linear procedure, specifying a mixture model which provides the
proportion "cured" — the proportion at asymptote, whose survival is no
worse than the general population. With a bit of magic the procedure also
provides the mean survival time of those who have died before "cure" was
attained.
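A minimal sketch of such a fit, assuming for illustration an exponential
survival distribution for the uncured group and hypothetical variable names
(the exact specification used in the talk may differ), is:
-
* relative survival at time t modelled as cured + (1 - cured)*exp(-lambda*t)
nl (relsurv = {cured=0.3} + (1 - {cured}) * exp(-{lambda=0.2} * t))
* under the exponential assumption, mean survival of those not cured is 1/lambda
nlcom (meansurv: 1/_b[/lambda])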
These two measures — proportion cured and mean survival of fatal cases —
can give interesting extra insights into trends in cancer survival. Some
results will be presented.
Christopher F. Baum (Economics, Boston College)
-
Statistical software tools have become more extensible, readily permitting
their users to extend functionality, while widespread access to the Internet
has made it possible to exchange those materials within the research
community. Stata has become particularly supportive of these trends with its
.ado architecture, in which user commands properly installed are
indistinguishable from built-in commands, and its net-aware facilities for
installation and archive access (such as net describe and
webseek).
This paper describes an initiative to enhance information flow in the
discipline of economics — the RePEc project — which has
been expanded from its original focus on preprints and published articles to
incorporate "metadata", or bibliographic information, on "software components"
such as user-authored additions to Stata. The use of a RePEc archive to house
these metadata provides greater visibility for these materials, and integrates
them into a broader set of software components that may be referenced to
enhance Stata's facilities. The SSC-IDEAS archive provides Web browser access
to over 400 Stata components, incorporating those published on Statalist, and
is mirrored by the new webseek facility. The archive's Stata-oriented
contents are accompanied by automatically generated package (.pkg) files that
render them installable in web-aware Stata.
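For example, with web-aware Stata a user might locate and install one of
these components along the following lines (the package name is used purely
for illustration, and the archive's directory layout is assumed):
-
. webseek somersd
. net from http://fmwww.bc.edu/RePEc/bocode/s
. net describe somersd
. net install somersd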
The RePEc metadata structures may be used to integrate a researcher's
preprints, her software, and her datasets that are to be shared with the
research community. This facility has clear advantages for instruction as well
as research. This paper demonstrates how three sets of instructional data,
made available by econometrics textbook authors, may be catalogued and made
directly accessible within web-aware Stata for classroom use.
Stephen Jenkins (Institute for Social and Economic Research,
University of Essex)
-
I teach a 10-hour lecture course on Survival Analysis to M.Sc. students in
Economics (though others sit in on it too). This year, for the first time,
the lectures were supplemented by web-based Survival Analysis with Stata
materials.
Topics covered are:
- Introduction to Stata
- The shapes of hazard and survival functions
- Preparing survival time data for analysis and estimation
- Estimation of the empirical (KM) hazard and survivor functions
- Estimation: (i) continuous time models
- Estimation: (ii) discrete time models.
Downloadable lessons provide worked examples plus exercises. This short talk
reviews the advantages and disadvantages of this venture, and hopes to
stimulate suggestions for improvements, as well as more general discussion
about teaching methods.
Roger Newson (Department of Public Health Sciences,
Guy's, King's and St Thomas' School of Medicine)
-
So-called "non-parametric" methods are in fact based on population
parameters, which are zero under the null hypothesis.
Two of these parameters are Kendall's tau_a and Somers'
D, defined respectively by
-
tau_XY = E[sign(X_1 - X_2) sign(Y_1 - Y_2)],
D_YX = tau_XY / tau_XX
where (X_1, Y_1) and (X_2, Y_2) are sampled independently from the
same bivariate population. If X is a binary variable, then Somers'
D is the parameter tested by a Wilcoxon rank-sum test.
It is more informative to have confidence limits for these parameters than
P-values alone, for three main reasons:
- It might discourage people from arguing that a high P-value proves
a null hypothesis.
- For continuous data, Kendall's tau_a is often related to
the classical Pearson correlation by Greiner's relation
rho = sin((pi/2) tau), so we can use Kendall's
tau_a to define robust confidence limits for Pearson's
correlation.
- We might want to know confidence limits for differences between two
Kendall's tau_a values or two Somers' D values, because a larger
Kendall's tau_a or Somers' D cannot be secondary
to a smaller one. That is to say, if Y is an outcome variable, and
W and X are two competing predictor variables, then the
difference tau_XY - tau_WY is positive or
negative, depending on the most likely direction of the difference between
two Y-values, assuming that the larger of the two W-values
is associated with the smaller of the two X-values. Therefore, if
tau_XY > tau_WY > 0, then the positive
correlation of X with Y cannot be caused by a positive
relationship of both variables with W.
The program somersd, which I have submitted to STB, calculates
confidence intervals for Somers' D or Kendall's tau_a,
using jackknife variances. There is a choice of transformations, including
Fisher's z, Daniels' arcsine, Greiner's rho, and the
z-transform of Greiner's rho. A cluster() option is
available, intended for measuring intra-class correlation (such as exists
between measurements on pairs of sisters). The estimation results are saved as
for a model fit, so that differences can be estimated using lincom.
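A minimal sketch of the intended usage, with hypothetical variable names (an
outcome y and two competing predictors w and x), might be:
-
* Somers' D of y with respect to each of w and x, with confidence
* intervals calculated on Fisher's z scale
somersd y w x, transf(z)
* confidence interval for the difference between the two Somers' D estimates
lincom w - x
* Kendall's tau_a instead of Somers' D
somersd y w x, taua transf(z)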
Michael Hills (consultant)
and David Clayton (MRC Biostatistics Unit, Cambridge)
-
The Mantel–Haenszel commands in Stata are still popular with epidemiologists
even though they are less efficient than their maximum likelihood
counterparts. The reason lies in the way the parameters are chosen: they show
effects of the variables of interest (`exposures') within levels of potential
confounding variables, possibly controlled for other stratifying variables,
followed by combined effects based on the assumption of no interaction. In the
conventional parametrization the parameters show the sizes of the interaction
terms — these create confusion and fill the screen without being of any
practical value.
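For orientation, the two existing approaches being contrasted might be
sketched as follows (variable names are hypothetical):
-
* Mantel-Haenszel odds ratio for an exposure, stratified by an age-group confounder
mhodds case exposed, by(agegrp)
* maximum likelihood counterpart: logistic regression with indicator variables
xi: logistic case i.exposed i.agegrp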
We will present a series of linked commands which makes it possible to combine
any single-equation regression model with declared stratifying and confounding
variables to produce maximum likelihood estimates of Mantel-Haenszel
parameters. The output is based on the kind of table required for publications
in the epidemiological literature.
For much epidemiological analysis these commands could replace the use of
xi.
Tony Brady (Medical Statistics and Evaluation, Imperial College)
and Patrick Royston (MRC Clinical Trials Unit, London)
-
Parametric survival models make assumptions about the distribution of survival
times that are not straightforward to check in practice. This might be one
reason why Cox regression is so often used to analyse survival data despite
some advantages of parametric models, such as the ability to make inferences
directly about survival times in addition to the hazard. We propose a simple
tool for checking the distribution of survival times, analogous to a normal
plot. Quantiles of the (right) censored survival times are estimated using the
method of Kaplan and Meier to account for censoring. These are plotted against
quantiles of the proposed parametric survival distribution. Departure of the
plotted points from the line of equality indicates departure from the proposed
distribution. We will illustrate cqplot using simulated datasets from
known survival distributions to show that it works in principle, and then go
on to demonstrate its use on real data.
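The idea can be sketched by hand in Stata as follows (variable names, the
exponential reference distribution, and its rate are purely illustrative;
cqplot automates and generalises this):
-
* declare the data as survival-time data
stset time, failure(died)
* Kaplan-Meier estimate of the survivor function, accounting for censoring
sts generate surv = s
* quantile of an exponential(0.1) distribution at the estimated probability
generate q_exp = -ln(surv)/0.1
* observed failure times against exponential quantiles, with the line of equality
twoway (scatter time q_exp if died) (line q_exp q_exp if died, sort), ///
    ytitle("Observed survival time") xtitle("Exponential quantile")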
Nicholas J. Cox (Geography, University of Durham)
-
Distributions with long or heavy tails are commonplace and in many fields are
more frequently encountered than (say) approximately Gaussian (normal)
distributions. Data examples for this presentation come from environmental
statistics, in which assessing the character of heavy tails of distributions
for such variables as rainfall or river discharge is a central problem. I
will survey some graphical and estimation programs written in the Stata
language for such distributions. Some of these programs are of use for many
kinds of data.
distplot and quantil2 (STB-51) show cumulative distribution
functions, survival functions, or quantile functions. skewplot
(SSC-IDEAS) is a Tukey-style plot for examining the degree and character of
skewness. mexplot and hillplot are more specific to data on
extreme events.
Last year Patrick Royston and I reported on a Stata program for calculating
L-moments (lmoments, SSC-IDEAS). This approach will be
revisited briefly and it will be shown how L-moments provide easily
calculated parameter estimates for fitting distributions such as the
generalised Pareto distribution and the generalised extreme value
distribution. Quantile-quantile plots can also be produced easily given such
estimates. In addition, plotting the third and fourth L-moments is an
alternative to plotting skewness and kurtosis.
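To give the flavour, the L-moment estimates of the generalised Pareto
parameters have simple closed forms. A minimal sketch, assuming Hosking's
parameterisation F(x) = 1 - {1 - k(x - xi)/alpha}^(1/k) and with illustrative
numbers standing in for the sample L-moments, is:
-
* sample L-moments l1, l2 and L-skewness t3 = l3/l2 (illustrative values only)
scalar l1 = 10.2
scalar l2 = 4.1
scalar t3 = 0.35
* closed-form L-moment estimates of the generalised Pareto parameters
scalar k     = (1 - 3*t3)/(1 + t3)        // shape
scalar alpha = (1 + k)*(2 + k)*l2         // scale
scalar xi    = l1 - (2 + k)*l2            // location
display "k = " k "   alpha = " alpha "   xi = " xi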
William Gould (StataCorp, College Station, TX)
-
Here are some of the highlights from Bill Gould's report to users and the
subsequent discussion of user "wishes and grumbles".
This summary was prepared by Nicholas Cox. Long-time Stata users will know
that StataCorp does not make promises or predictions about what will appear
when: it promises only to listen very carefully.
StataCorp has been suffering growing pains: the number of technical
developers has doubled in the last year. In the short term, output of new code
has slowed while the new developers come up to speed, but thereafter it should
come faster.
StataCorp will be moving into a new, larger custom-built building, probably
early next year.
Sales have been good!
The web-aware features of Stata 6.0 have had a major impact. Since 6.0 was
released in January 1999, there have been 51 updates to .ado files and 9
updates to executables. On average, an .ado file is updated every 2.7 days.
(One insight into StataCorp's practices is that it takes about 1.5 weeks to
certify a new executable.) The net command, which lets most users
update over the internet, allows quick bug fixes and the addition of new features.
The latter have included improvements in regression accuracy and a much
revised xtgee. The new webseek command has led to greater
trading of user-written programs.
The Stata Technical Bulletin is growing more slowly than Stata itself,
despite improved quality. In due course, the STB will be made available on
the web, but precisely in what way is not yet fixed.
Ventures like icd9 (STB-54) for handling disease codes are quite easy
for StataCorp and apparently useful for large groups. Suggestions of others?
[Audience members suggested zip and SIC.]
Net courses are going well. The latest course on Survival analysis is much
more statistical than any previous net course, but has had very good
feedback to date.
The programming language will remain stable in future releases, but more
structured programming commands, such as foreach, will be added.
Improving graphics is one major project under way.
Requests from the audience included
- being able to link C code to Stata
- being able to use more than one data set within Stata
- GMM
- data paths (like ado paths)
- more flexible merge
- better error diagnostics, better debugging
See
http://www.stata.com/meeting/proceedings.html for the proceedings of other
UK user group meetings.