2000 UK Stata Users Group meeting

15 May 2000


Royal Statistical Society
12 Errol Street
London EC1Y 8LX


Fitting complex random effect models with Stata using data augmentation: an application to the study of male and female fecundability

David Clayton (MRC Biostatistics Unit, Cambridge) and René Ecochard (DIM Hospices Civils de Lyon)

We discuss fitting of a complex random effect model using Stata to carry out block-wise Gibbs sampling within a multi-processor computing environment. The application involves a dataset concerning artificial insemination by donor (AID). Success or failure at each of 12,100 menstrual cycles is modelled with a mixed model with random effects due to woman, conception attempt within woman, semen donor, donation within donor, and the treating physician. Given the availability of software within Stata to fit a model with a single random effect, the full model can be fitted by an alternating imputation algorithm (Clayton and Rasbash, 1999) implemented with five copies of Stata running on separate processors and communicating via disk files. Each process fits one random effect plus all the fixed effects. The five processes may run in synchronous or asynchronous mode. Process synchronisation and file locking are implemented in a "toolkit" of Stata programs.

Nonparametric regression modelling using MCMC methods

Gareth Ambler (Medical Statistics and Evaluation, Imperial College)

Nonparametric regression modelling may be used to estimate the relationship between a response and a predictor when one wants to make few assumptions about the form of the relationship. One approach is to estimate the regression function using piecewise polynomials that are non-zero only between adjacent knot points. A drawback of this approach is that the number and location of the knots usually has to be chosen.
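To make the role of the knots concrete, here is a minimal Python sketch (not the bcf program, and not the method of Denison and colleagues) that fits a separate least-squares straight line on each interval between adjacent knots. The function names, the choice of degree 1, and the absence of any continuity constraint at the knots are assumptions made for illustration only. Note that the knot positions have to be supplied by the user, which is exactly the drawback mentioned above.

```python
def fit_piecewise_linear(x, y, knots):
    """Fit a separate least-squares line on each interval between adjacent knots.

    `knots` must be sorted and must span the data. Returns a list of
    (left, right, intercept, slope) tuples, one per interval.
    """
    pieces = []
    for left, right in zip(knots[:-1], knots[1:]):
        is_last = right == knots[-1]
        # Points in [left, right); the final interval also includes its right end.
        pts = [(xi, yi) for xi, yi in zip(x, y)
               if left <= xi < right or (is_last and xi == right)]
        n = len(pts)
        if n == 0:
            raise ValueError(f"no data between knots {left} and {right}")
        mx = sum(p[0] for p in pts) / n
        my = sum(p[1] for p in pts) / n
        sxx = sum((p[0] - mx) ** 2 for p in pts)
        sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
        # With no spread in x the slope is not identified; fall back to a constant.
        slope = sxy / sxx if sxx > 0 else 0.0
        pieces.append((left, right, my - slope * mx, slope))
    return pieces


def predict(pieces, x0):
    """Evaluate the fitted curve at x0 (at an interior knot the left piece is used)."""
    for left, right, intercept, slope in pieces:
        if left <= x0 <= right:
            return intercept + slope * x0
    raise ValueError("x0 lies outside the knot range")


# Synthetic V-shaped data with a single interior knot at x = 3.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [abs(v - 3) for v in xs]
pieces = fit_piecewise_linear(xs, ys, knots=[0, 3, 6])
```

Moving the interior knot away from 3 in this example changes the fit noticeably, which illustrates why an automatic choice of knots is attractive.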

Denison and colleagues (1998) suggested a methodology that does not require us to make this choice. They proposed treating the number and location of the knots as random variables and using MCMC simulation techniques to sample from their distribution. An average of the corresponding fits provides an estimate of the regression function.

I will describe bcf which implements this method and will illustrate its potential in both real and simulated data.

Denison, D., Mallick, B.K. and Smith, A.F.M. 1998. Automatic Bayesian curve fitting. Journal of the Royal Statistical Society, Series B 60, 333-350.

Analysis of cancer survival with Stata

Andy Sloggett (Epidemiology and Population Health, London School of Hygiene and Tropical Medicine)

Two years ago, at the 4th User Group meeting, I presented a purpose-written Stata routine for calculating relative survival in follow-up studies, usually cancer survival studies. The routine strel (name registered with Stata) has been further developed and also adapted for use with Stata v5 or v6. A brief re-cap will be presented.

Some comparisons with the hitherto "gold standard" routine written by Timo Hakulinen, and the advantages and disadvantages of each approach, will be presented. The Esteve methodology has some foibles which will be mentioned.

For many cancers the relative survival curve flattens to an asymptote after some years. When relative survival estimates are available for a series of times post-diagnosis the curve can be modelled with the Stata non-linear procedure, specifying a mixture model which provides the proportion "cured" — the proportion at asymptote, whose survival is no worse than the general population. With a bit of magic the procedure also provides the mean survival time of those who have died before "cure" was attained.
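The abstract does not state the functional form of the mixture model, so the following Python sketch assumes one simple possibility: relative survival R(t) = c + (1 - c) exp(-t / m), where c is the cured proportion (the asymptote) and m is the mean survival time of fatal cases under an assumed exponential distribution. The function names are hypothetical, the data are synthetic, and a crude grid search stands in for Stata's non-linear least-squares procedure.

```python
import math


def relative_survival(t, cured, mean_fatal):
    """Assumed mixture cure curve: asymptote `cured`, exponential decay for fatal cases."""
    return cured + (1.0 - cured) * math.exp(-t / mean_fatal)


def fit_cure_model(times, rel_surv, cured_grid, mean_grid):
    """Least-squares fit by exhaustive grid search (a stand-in for a proper optimiser)."""
    best = None
    for c in cured_grid:
        for m in mean_grid:
            sse = sum((relative_survival(t, c, m) - r) ** 2
                      for t, r in zip(times, rel_surv))
            if best is None or sse < best[0]:
                best = (sse, c, m)
    return best[1], best[2]


# Synthetic relative survival at years 1..10, generated from the assumed model
# with 40% cured and a mean survival of 2 years among fatal cases.
times = list(range(1, 11))
synthetic = [relative_survival(t, 0.4, 2.0) for t in times]

est_cured, est_mean = fit_cure_model(
    times,
    synthetic,
    cured_grid=[i / 100 for i in range(100)],     # 0.00, 0.01, ..., 0.99
    mean_grid=[i / 10 for i in range(1, 101)],    # 0.1, 0.2, ..., 10.0
)
```

Under this assumed form, both summary measures discussed in the talk fall out of the fit directly: the cured proportion is the parameter c and the mean survival of fatal cases is the parameter m.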

These two measures — proportion cured and mean survival of fatal cases — can give interesting extra insights into trends in cancer survival. Some results will be presented.

Enhancing access to statistical software tools and datasets for research and instruction

Christopher F. Baum (Economics, Boston College)

Statistical software tools have become more extensible, readily permitting their users to extend functionality, while widespread access to the Internet has made it possible to exchange those materials within the research community. Stata has become particularly supportive of these trends with its .ado architecture, in which user commands properly installed are indistinguishable from built-in commands, and its net-aware facilities for installation and archive access (such as net describe and webseek).

This paper describes an initiative to enhance information flow in the discipline of economics — the RePEc project — which has been expanded from its original focus on preprints and published articles to incorporate "metadata", or bibliographic information, on "software components" such as user-authored additions to Stata. The use of a RePEc archive to house these metadata provides greater visibility for these materials, and integrates them into a broader set of software components that may be referenced to enhance Stata's facilities. The SSC-IDEAS archive provides Web browser access to over 400 Stata components, incorporating those published on Statalist, and is mirrored by the new webseek facility. The archive's Stata-oriented contents are accompanied by automatically generated package (.pkg) files that render them installable in web-aware Stata.

The RePEc metadata structures may be used to integrate a researcher's preprints, her software, and her datasets that are to be shared with the research community. This facility has clear advantages for instruction as well as research. This paper demonstrates how three sets of instructional data, made available by econometrics textbook authors, may be catalogued and made directly accessible within web-aware Stata for classroom use.

A web-based "Survival Analysis using Stata" course to accompany a lecture course: what is it and was it worth doing?

Stephen Jenkins (Institute for Social and Economic Research,
University of Essex)

I teach a 10 hour lecture course on Survival Analysis to M.Sc. students in Economics (though others sit in on it too). This year, for the first time, the lectures were supplemented by web-based Survival Analysis with Stata materials.

Topics covered are:

  • Introduction to Stata
  • The shapes of hazard and survival functions
  • Preparing survival time data for analysis and estimation
  • Estimation of the empirical (KM) hazard and survivor functions
  • Estimation: (i) continuous time models and
  • Estimation: (ii) discrete time models.

Downloadable lessons provide worked examples plus exercises. This short talk reviews the advantages and disadvantages of this venture, and hopes to stimulate suggestions for improvements, as well as more general discussion about teaching methods.

Confidence intervals for rank order statistics: Somers' D, Kendall's tau_a and their differences

Roger Newson (Department of Public Health Sciences, Guy's, King's and St Thomas' School of Medicine)

So-called "non-parametric" methods are in fact based on population parameters, which are zero under the null hypothesis. Two of these parameters are Kendall's tau_a and Somers' D, defined respectively by

tau_XY = E[sign(X_1 - X_2) sign(Y_1 - Y_2)],     D_YX = tau_XY / tau_XX

where (X1,Y1) and (X2,Y2) are sampled independently from the same bivariate population. If X is a binary variable, then Somers' D is the parameter tested by a Wilcoxon rank-sum test.
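As an illustration of these definitions, the sample analogues can be computed by averaging over all pairs of observations. The Python sketch below is not the somersd program; the function names are chosen here for illustration, and the brute-force loop over pairs is only suitable for small samples.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau_a(x, y):
    """Sample Kendall's tau_a: mean of sign(x_i - x_j) * sign(y_i - y_j) over pairs i < j."""
    n = len(x)
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def somers_d(y, x):
    """Sample Somers' D of Y with respect to X: tau_a(X, Y) / tau_a(X, X)."""
    return kendall_tau_a(x, y) / kendall_tau_a(x, x)


# Binary predictor (two groups) and an outcome that is higher in the second group.
group = [0, 0, 1, 1]
outcome = [1, 3, 2, 4]
d = somers_d(outcome, group)
```

With a binary X, pairs from the same group contribute nothing to either tau_a, so D reduces to the difference between the proportions of concordant and discordant between-group pairs, which is the quantity underlying the rank-sum test.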

It is more informative to have confidence limits for these parameters than P-values alone, for three main reasons:

  1. It might discourage people from arguing that a high P-value proves a null hypothesis.

  2. For continuous data, Kendall's tau_a is often related to the classical Pearson correlation by Greiner's relation rho = sin((pi/2) tau), so we can use Kendall's tau_a to define robust confidence limits for Pearson's correlation.

  3. We might want to know confidence limits for differences between two Kendall's tau_a or Somers' D values, because a larger Kendall's tau_a or Somers' D cannot be secondary to a smaller one. That is to say, if Y is an outcome variable, and W and X are two competing predictor variables, then the difference tau_XY - tau_WY is positive or negative, depending on the most likely direction of the difference between two Y-values, assuming that the larger of the two W-values is associated with the smaller of the two X-values. Therefore, if tau_XY > tau_WY > 0, then the positive correlation of X with Y cannot be caused by a positive relationship of both variables with W.
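Greiner's relation in the second reason above is monotone increasing on [-1, 1], so confidence limits for tau can be mapped directly to limits for rho. A minimal Python sketch follows; the function names are hypothetical and the interval endpoints are made-up inputs, not results from any dataset.

```python
import math


def greiner_rho(tau):
    """Greiner's relation: rho = sin((pi / 2) * tau)."""
    return math.sin(math.pi / 2 * tau)


def rho_limits(tau_lower, tau_upper):
    """Map confidence limits for tau to limits for rho (valid because the map is monotone)."""
    return greiner_rho(tau_lower), greiner_rho(tau_upper)


# Illustrative limits for tau only.
rho_lo, rho_hi = rho_limits(0.2, 0.5)
```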

The program somersd, which I have submitted to STB, calculates confidence intervals for Somers' D or Kendall's tau_a, using jackknife variances. There is a choice of transformations, including Fisher's z, Daniels' arcsine, Greiner's rho, and the z-transform of Greiner's rho. A cluster() option is available, intended for measuring intra-class correlation (such as exists between measurements on pairs of sisters). The estimation results are saved as for a model fit, so that differences can be estimated using lincom.
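The general idea of a jackknife variance combined with a Fisher's z transformation can be sketched in Python as follows. This is an ordinary leave-one-out jackknife with a delta-method standard error on the z scale; whether somersd computes its jackknife in exactly this way is not stated in the abstract, so the details, the function names, and the small synthetic dataset are all assumptions.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def tau_a(x, y):
    """Sample Kendall's tau_a over all pairs i < j."""
    n = len(x)
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def jackknife_se(x, y):
    """Leave-one-out jackknife standard error of tau_a (x and y are lists)."""
    n = len(x)
    loo = [tau_a(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    return math.sqrt((n - 1) / n * sum((t - mean_loo) ** 2 for t in loo))


def fisher_z_ci(x, y, z_crit=1.959964):
    """Approximate 95% interval for tau_a, built on the Fisher z scale.

    Requires |tau_a| < 1, since atanh is infinite at +/-1.
    """
    t = tau_a(x, y)
    se = jackknife_se(x, y)
    z = math.atanh(t)
    se_z = se / (1.0 - t * t)  # delta method: d/dt atanh(t) = 1 / (1 - t^2)
    return t, math.tanh(z - z_crit * se_z), math.tanh(z + z_crit * se_z)


# Small synthetic example.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 3, 2, 4, 5, 8, 6, 7]
estimate, lower, upper = fisher_z_ci(x, y)
```

Working on the z scale keeps both limits inside (-1, 1), which an untransformed symmetric interval does not guarantee.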

Parametrizing Regression Models

Michael Hills (consultant) and David Clayton (MRC Biostatistics Unit, Cambridge)

The Mantel–Haenszel commands in Stata are still popular with epidemiologists even though they are less efficient than their maximum likelihood counterparts. The reason lies in the way the parameters are chosen: they show effects of variables of interest ("exposures") by potential confounding variables, possibly controlled for other stratifying variables, followed by combined effects based on the assumption of no interaction. In the conventional parametrization the parameters show the sizes of the interaction terms; these create confusion and fill the screen without being of any practical value.
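The layout described here, stratum-specific effects followed by a combined effect under the assumption of no interaction, can be illustrated with the classical Mantel-Haenszel odds ratio. The Python sketch below uses the standard closed-form estimator, not the maximum likelihood commands presented in the talk; the function names and the 2x2 counts are made up for illustration.

```python
def stratum_odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table.

    a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Classical Mantel-Haenszel combined odds ratio across strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Illustrative counts for two levels of a confounder.
strata = [(10, 20, 5, 40), (8, 12, 6, 30)]
by_stratum = [stratum_odds_ratio(*s) for s in strata]
combined = mantel_haenszel_or(strata)
```

The combined estimate is a weighted average of the stratum-specific odds ratios (with weights b*c/n), so it always lies between the smallest and largest of them.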

We will present a series of linked commands which makes it possible to combine any single equation regression model with declared stratifying and confounding variables to produce maximum likelihood estimates of Mantel-Haenszel parameters. The output is based on the kind of table required for publications in the epidemiological literature. For much epidemiological analysis these commands could replace the use of xi.

Quantile plots for right-censored data

Tony Brady (Medical Statistics and Evaluation, Imperial College) and Patrick Royston (MRC Clinical Trials Unit, London)

Parametric survival models make assumptions about the distribution of survival times that are not straightforward to check in practice. This might be one reason why Cox regression is so often used to analyse survival data despite some advantages of parametric models, such as the ability to make inferences directly about survival times in addition to the hazard. We propose a simple tool for checking the distribution of survival times, analogous to a normal plot. Quantiles of the (right) censored survival times are estimated using the method of Kaplan and Meier to account for censoring. These are plotted against quantiles of the proposed parametric survival distribution. Departure of the plotted points from the line of equality indicates departure from the proposed distribution. We will illustrate cqplot using simulated datasets from known survival distributions to show that it works in principle, and then go on to demonstrate its use on real data.
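The construction described above can be sketched in Python. This is not the cqplot program: the function names are hypothetical, the proposed distribution is assumed to be exponential with a known rate, and points where the Kaplan-Meier estimate reaches zero are dropped because the corresponding theoretical quantile is infinite.

```python
import math


def kaplan_meier(times, events):
    """Kaplan-Meier survivor estimate.

    `events[i]` is 1 if the i-th time is an observed event and 0 if it is
    right-censored. Returns [(t, S(t))] at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            out.append((t, surv))
        n_at_risk -= leaving
    return out


def exponential_qq_points(times, events, rate):
    """Pairs (theoretical quantile, observed time) for an exponential(rate) model."""
    points = []
    for t, s in kaplan_meier(times, events):
        if s > 0:
            # Exponential quantile at probability p = 1 - S(t) is -log(S(t)) / rate.
            points.append((-math.log(s) / rate, t))
    return points


# Tiny illustrative dataset: the second observation is censored.
km = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
qq = exponential_qq_points([1, 2, 3, 4], [1, 0, 1, 1], rate=0.5)
```

Plotting the second coordinate of each pair against the first gives the quantile plot; points close to the line of equality are consistent with the assumed distribution.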

Plotting and fitting univariate distributions with long or heavy tails

Nicholas J. Cox (Geography, University of Durham)

Distributions with long or heavy tails are commonplace and in many fields are more frequently encountered than (say) approximately Gaussian (normal) distributions. Data examples for this presentation come from environmental statistics, in which assessing the character of heavy tails of distributions for such variables as rainfall or river discharge is a central problem. I will survey some graphical and estimation programs written in the Stata language for such distributions. Some of these programs are of use for many kinds of data.

distplot and quantil2 (STB-51) show cumulative distribution functions, survival functions, or quantile functions. skewplot (SSC-IDEAS) is a Tukey-style plot for examining the degree and character of skewness. mexplot and hillplot are more specific to data on extreme events.

Last year Patrick Royston and I reported on a Stata program for calculating L-moments (lmoments, SSC-IDEAS). This approach will be revisited briefly and it will be shown how L-moments provide easily calculated parameter estimates for fitting distributions such as the generalised Pareto distribution and the generalised extreme value distribution. Quantile-quantile plots can also be produced easily given such estimates. In addition, plotting the third and fourth L-moments is an alternative to plotting skewness and kurtosis.
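For readers unfamiliar with L-moments, the following Python sketch computes the first four sample L-moments from unbiased probability-weighted moments and then derives generalised Pareto parameters from them. It is not the lmoments program. The function names are hypothetical, and the parameter formulas assume Hosking's parametrisation of the generalised Pareto distribution, F(x) = 1 - (1 - k(x - xi)/alpha)^(1/k), in which k = 0 gives the exponential distribution.

```python
def sample_l_moments(values):
    """First four sample L-moments via unbiased probability-weighted moments."""
    x = sorted(values)
    n = len(x)
    if n < 4:
        raise ValueError("need at least four observations")
    # With 0-based index i, the weights are i/(n-1), i(i-1)/((n-1)(n-2)), and so on.
    b0 = sum(x) / n
    b1 = sum(i * x[i] for i in range(n)) / (n * (n - 1))
    b2 = sum(i * (i - 1) * x[i] for i in range(n)) / (n * (n - 1) * (n - 2))
    b3 = (sum(i * (i - 1) * (i - 2) * x[i] for i in range(n))
          / (n * (n - 1) * (n - 2) * (n - 3)))
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3, l4


def gpd_from_l_moments(l1, l2, t3):
    """Generalised Pareto (xi, alpha, k) from l1, l2 and L-skewness t3 = l3 / l2.

    Inverts l1 = xi + alpha/(1+k), l2 = alpha/((1+k)(2+k)), t3 = (1-k)/(3+k).
    """
    k = (1 - 3 * t3) / (1 + t3)
    alpha = (1 + k) * (2 + k) * l2
    xi = l1 - (2 + k) * l2
    return xi, alpha, k


l1, l2, l3, l4 = sample_l_moments([1, 2, 3, 4, 5])
l_skewness = l3 / l2   # analogue of skewness
l_kurtosis = l4 / l2   # analogue of kurtosis
```

The ratios l3/l2 and l4/l2 are the quantities one would plot in place of conventional skewness and kurtosis.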

Report to users

William Gould (StataCorp, College Station, TX)

Here are some of the highlights from Bill Gould's report to users and the subsequent discussion of user "wishes and grumbles".

This summary was prepared by Nicholas Cox. Long-time Stata users will know that StataCorp does not make promises or predictions about what will appear when: it promises only to listen very carefully.

StataCorp has been suffering growing pains: the number of technical developers has doubled in the last year. In the short term, output of new code has slowed while developers come up to speed, but thereafter will come faster.

StataCorp will be moving into a new, larger custom-built building, probably early next year.

Sales have been good!

The web-aware features of Stata 6.0 have had a major impact. Since 6.0 was released in January 1999, there have been 51 updates to .ado files and 9 updates to executables. On average, an .ado file is updated every 2.7 days. (One insight into StataCorp's practices is that it takes about 1.5 weeks to certify a new executable.) The net command, which allows most users to update over the internet, allows quick bug fixes and addition of new features. The latter have included improvements in regression accuracy and a much revised xtgee. The new webseek command has led to greater trading of user-written programs.

The Stata Technical Bulletin is growing more slowly than Stata itself, despite improved quality. In due course, the STB will be made available on the web, but precisely in what way is not yet fixed.

Ventures like icd9 (STB-54) for handling disease codes are quite easy for StataCorp and apparently useful for large groups. Suggestions of others? [Audience members suggested zip and SIC.]

Net courses are going well. The latest course on Survival analysis is much more statistical than any previous net course, but has had very good feedback to date.

The programming language will remain stable into future releases. But more structured programming commands will be added, such as foreach. Improving graphics is one major project under way. Requests from the audience included

  • being able to link C code to Stata
  • being able to use more than one data set within Stata
  • GMM
  • data paths (like ado paths)
  • more flexible merge
  • better error diagnostics, better debugging

Scientific organizers

Nicholas J. Cox, Durham University

Patrick Royston, MRC Clinical Trials Unit

Logistics organizers

Timberlake Consultants, the official distributor of Stata in the United Kingdom.