
Roger B. Newson

National Heart and Lung Institute, Imperial College London

The **parmest** package is used with Stata estimation commands to produce
output datasets (or results-sets) with one observation per estimated
parameter, and data on parameter names, estimates, confidence limits,
*p*-values, and other parameter attributes. These results-sets can then
be input to other Stata programs to produce tables, listings, plots, and
secondary results-sets containing derived parameters. Three recently added
packages for post-**parmest** processing are **fvregen**,
**invcise**, and **qqvalue**.

**fvregen** is used when the parameters belong to models containing
factor variables, introduced in Stata version 11. It regenerates these
factor variables in the results-set, enabling the user to plot, list, or
tabulate factor levels with estimates and confidence limits of parameters
specific to these factor levels.

**invcise** calculates standard errors inversely from confidence limits
produced without standard errors, such as those for medians and for
Hodges–Lehmann median differences. These standard errors can then be
input, with the estimates, into the **metaparm** module of
**parmest** to produce confidence intervals for linear combinations of
medians or of median differences, such as those used in meta-analysis or
interaction estimation.
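The back-calculation involved is simple to state; the Python sketch below is a generic illustration of the idea (assuming a symmetric, normal-based interval), not the **invcise** code itself:

```python
from statistics import NormalDist

def se_from_ci(lower, upper, level=0.95):
    """Back-calculate a standard error from confidence limits,
    assuming the interval is symmetric and normal-based on this scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    return (upper - lower) / (2 * z)

# A median difference reported as 3.0 (95% CI: 1.0 to 5.0)
# yields a standard error of roughly (5 - 1) / 3.92:
se = se_from_ci(1.0, 5.0)
```

Such a standard error, together with the point estimate, is exactly the input that a weighted linear-combination routine needs.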

**qqvalue** inputs the *p*-values in a results-set and creates a new
variable containing the quasi-*q*-values, which are calculated by inverting a
multiple-test procedure designed to control the familywise error rate (FWER)
or the false discovery rate (FDR). The quasi-*q*-value for each
*p*-value is the minimum FWER or FDR for which that *p*-value
would be in the discovery set if the specified multiple-test procedure was
used on the full set of *p*-values. **fvregen**, **invcise**,
**qqvalue**, and **parmest** can be downloaded from SSC.
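For the FDR case, this inversion corresponds, under the Benjamini–Hochberg procedure, to the familiar q-value calculation. The Python sketch below is illustrative only (**qqvalue** itself supports several FWER and FDR procedures):

```python
def bh_qvalues(pvals):
    """Quasi-q-values under the Benjamini-Hochberg FDR procedure:
    for each p-value, the smallest FDR at which it would be in
    the discovery set."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, taking the running minimum
    # of m * p / rank so that q-values are monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q

qvals = bh_qvalues([0.01, 0.04, 0.03, 0.20])  # smallest q-value is 0.04
```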

**Additional information**

UKSUG10_newson1.zip

John D'Souza

National Centre for Social Research, London

Although survey data are sometimes weighted by their selection weights, it
is often preferable to use auxiliary information available on the whole
population to improve estimation. Calibration weighting (Deville and
Särndal, 1992, *Journal of the American Statistical Association* 87:
376–382) is one of the most common methods of doing this. This method
adjusts the selection weights so that known population totals for the
auxiliary variables are reproduced exactly, while ensuring that the
calibrated weights are as close as possible to the original sampling weights.
The simplest example of calibration is poststratification. This is the
special case where the auxiliary variable is a single categorical variable.
General calibration extends this to deal with more than one auxiliary
variable and allows the user to include both categorical and numerical
variables.

A typical example might occur in a population survey, where the selection weights could be calibrated to ensure that the sample weighted by the calibration weights has exactly the same distribution as the population on variables such as age, sex, and region.
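The poststratification special case is simple enough to sketch directly. The Python illustration below is hypothetical (not the program being presented): it rescales each design weight so that the weighted stratum totals reproduce known population totals:

```python
def poststratify(weights, strata, pop_totals):
    """Adjust design weights so weighted stratum counts match known
    population totals -- the single-categorical-variable special case
    of calibration weighting."""
    # Weighted sample total within each stratum.
    sample_totals = {}
    for w, g in zip(weights, strata):
        sample_totals[g] = sample_totals.get(g, 0.0) + w
    # Scale each weight by (population total / weighted sample total).
    return [w * pop_totals[g] / sample_totals[g]
            for w, g in zip(weights, strata)]

weights = [1.0, 1.0, 2.0, 2.0]
strata = ["m", "m", "f", "f"]
pop = {"m": 60.0, "f": 40.0}
calw = poststratify(weights, strata, pop)
# weighted totals now reproduce the population: m -> 60, f -> 40
```

General calibration replaces this exact per-stratum scaling with a minimal adjustment satisfying several totals at once.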

Many packages have routines for calibration. SAS has the macro CALMAR; GenStat has the procedure SVCALIBRATE; and R has the function **calibrate**. However, no such routine is publicly available in Stata. I
will introduce a user-written Stata program for calibration and will also
discuss a simple extension to show how it can incorporate a nonresponse
correction. I will also briefly discuss the program’s strengths and limitations when
compared to rival packages.

**Additional information**

UKSUG10.DSouza.ppt

Therese M.-L. Andersson

Department of Medical Epidemiology and Biostatistics,
Karolinska Institutet, Stockholm

Cure models can be used to simultaneously estimate the proportion of cancer
patients who are eventually cured of their disease and the survival of
those who remain “uncured”. One limitation of parametric cure models is
that the functional form of the survival of the “uncured” has to
be specified. It can sometimes be hard to fit survival functions flexible
enough to capture high mortality rates within a few months from a diagnosis
or a high cure proportion (e.g., over 90%).

If instead the flexible parametric survival models implemented in **stpm2** could be used, then these
problems could potentially be avoided. Flexible parametric survival models
are fit on the log cumulative hazard scale using restricted cubic splines
for the baseline. When cure is reached, the excess hazard rate (the
difference in the observed all-cause mortality rate among the patients
compared with that expected in the general population) is zero, and the
cumulative excess hazard is constant. By incorporating an extra constraint
on the log cumulative excess hazard after the last knot so that we force it
not only to be linear but also to have zero slope, we are able to estimate
the cure proportion. The flexible parametric survival model can be written
as a special case of a nonmixture cure model, but with a more flexible
distribution, which also enables estimation of the survival of
“uncured” patients.

We have updated the user-written **stpm2** command for flexible
parametric models and added a cure option as well as postestimation
predictions of the cure proportion and survival of the “uncured”. We will
compare the use of flexible parametric cure models implemented in
**stpm2** with standard parametric cure models implemented in
**strsmix** and **strsnmix**.

This is joint work with Sandra Eloranta and Paul W. Dickman (same institution) and Paul C. Lambert (same institution and Department of Health Sciences, University of Leicester).

**Additional information**

UKSUG10.Andersson.pptx

Catherine Welch

Department of Primary Care & Population Health, University College London

Most standard missing-data techniques have been designed for cross-sectional
data. A “forward-backward” multiple-imputation algorithm has
been developed to impute missing values in longitudinal data (Nevalainen,
Kenward, and Virtanen, 2009, *Statistics in Medicine* 28:
3657–3669). This technique will be applied to The Health
Improvement Network (THIN), a longitudinal primary-care database, to impute
variables associated with incidence of cardiovascular disease (CVD).

A sample of 483 patients was extracted from THIN to test the performance of the algorithm before it was applied to the whole dataset. This dataset included individuals with information available on age, sex, deprivation quintile, height, weight, systolic blood pressure, and total serum cholesterol for each age from 65 to 69 years. CVD was identified if the patient was diagnosed with one of a predefined list of conditions at any of these ages. They were then considered to have CVD at each subsequent age.

In this sample, measurements of weight, systolic blood pressure, and cholesterol were replaced with missing values such that the probability that data are missing decreases as age increases; i.e., the data are missing at random and the overall percentage of missing data is equivalent to that in THIN. We then applied the forward-backward algorithm, which imputes values at each time point by using measurements before and after the one of interest and updates values sequentially. Ten complete datasets were created. A Poisson regression was performed using data in each dataset, and estimates were combined using Rubin’s rules. These steps were repeated 200 times and the coefficients were averaged.
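The pooling step uses Rubin's rules; the generic Python sketch below (illustrative only, not the study code) combines per-imputation estimates and variances:

```python
import math

def rubin_combine(estimates, variances):
    """Combine point estimates and their variances from M imputed
    datasets using Rubin's rules."""
    m = len(estimates)
    qbar = sum(estimates) / m                    # pooled point estimate
    ubar = sum(variances) / m                    # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between
    total_var = ubar + (1 + 1 / m) * b           # total variance
    return qbar, math.sqrt(total_var)

est, se = rubin_combine([0.9, 1.1, 1.0], [0.04, 0.05, 0.03])
```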

I will explain in more detail how the forward-backward algorithm works and also will demonstrate the results following multiple imputation using this algorithm. I will compare these results with the analysis before data were replaced with missing values and a complete case analysis to assess the performance of the algorithm.

This is joint work with Irene Petersen (same institution) and James Carpenter (Medical Statistics Unit, London School of Hygiene and Tropical Medicine).

**Additional information**

UKSUG10.Welch.ppt

Nicholas J. Cox

Department of Geography, Durham University

Stata’s graphics were completely rewritten for Stata 8, with further
key additions in later versions. Its official commands have, as usual, been
supplemented by a variety of user-written programs. The resulting variety
presents even experienced users with a system that undeniably is large,
often appears complicated, and sometimes seems confusing. In this talk, I
provide a personal digest of graphics strategy and tactics for Stata users
emphasizing details large and small that, in my view, deserve to be known by
all.

**Additional information**

UKSUG10.Cox.zip

William Gould

StataCorp, College Station, Texas

Mata is Stata’s matrix programming language. StataCorp provides
detailed documentation on it, but so far has failed to give users—and
especially users who add new features to Stata—any guidance in when and
how to use the language. This talk provides what has been missing. In
practical ways, this talk shows how to include Mata code in Stata ado-files,
it reveals when to include Mata code and when not to, and it provides an
introduction to the broad concepts of Mata, the concepts that will make the
*Mata Reference Manual* approachable.

**Additional information**

UKSUG10.Gould.pdf

J. Charles Huber Jr.

Texas A&M Health Science Center School of Rural Public Health, College Station, Texas

Project Heartbeat! was a longitudinal study of metabolic and morphological
changes in adolescents aged 8–18 years and was conducted in the 1990s.
A study is currently being conducted to consider the relationship between a
collection of phenotypes (including BMI, blood pressure, and blood lipids) and
a panel of 1,500 candidate SNPs (single nucleotide polymorphisms).
Traditional genetics software such as PLINK and HelixTree lacks the ability
to model longitudinal phenotype data.

This talk will describe the use of Stata for a longitudinal genetic association study from the early stages of data checking (allele frequencies and Hardy–Weinberg equilibrium), modeling of individual SNPs, the use of false discovery rates to control for the large number of comparisons, exporting and importing data through PHASE for haplotype reconstruction, selection of tagSNPs in Stata, and the analysis of haplotypes. We will also discuss strategies for scaling up to an Illumina 100k SNP chip using Stata. All SNP and gene names will be de-identified, because this is a work in progress.

This is joint work with Michael Hallman, Ron Harrist, Victoria Friedel, Melissa Richard, and Huandong Sun (same institution).

**Additional information**

UKSUG10.Huber.ppt

Yulia Marchenko

StataCorp, College Station, Texas

In haplotype-association studies, the risk of a disease is often determined
not only by the presence of certain haplotypes but also by their
interactions with various environmental factors. The detection of such
interactions with case–control data is a challenging task and often requires
very large samples. This prompted the development of more efficient
estimation methods for analyzing case–control genetic data. The
**haplologit** command implements efficient semiparametric methods, recently
proposed in the literature, for fitting haplotype-environment models in the
very important special cases of 1) a rare disease, 2) a single candidate
gene in Hardy–Weinberg equilibrium, and 3) independence of genetic and
environmental factors. In this presentation, I will describe new
features of the **haplologit** command.

**Additional information**

UKSUG10.Marchenko.pdf

Patrick Royston

MRC Clinical Trials Unit, London

Fractional polynomial models are a simple yet very useful extension of
ordinary polynomials. They greatly increase the available range of
nonlinear functions and are often used in regression modeling, both in
univariate format (using Stata’s **fracpoly** command) and in
multivariable modeling (using **mfp**). The standard implementation in
**fracpoly** supports a wide range of single-equation regression models
but cannot cope with the more complex and varied syntaxes of other types of
multi-equation models. In this talk, I show that if you are willing to do
some straightforward do-file programming, you can apply fractional
polynomials in a bespoke manner to more complex Stata regression commands
and get useful results. I illustrate the approach in multilevel modeling
of longitudinal fetal-size data using **xtmixed** and in a seemingly
unrelated regression analysis of a dataset of academic achievement using
**sureg**.
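As background, the conventional FP transformation rules (powers drawn from a fixed set, power 0 meaning ln x, and a repeated power multiplied by ln x) can be sketched generically in Python. This is illustrative only, not the **fracpoly** code:

```python
import math

FP_POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]  # conventional FP power set

def fp_terms(x, powers):
    """Fractional-polynomial transformations of x > 0 for the given
    (sorted) powers; power 0 denotes ln(x), and a repeated power p
    contributes x^p * ln(x), the standard FP convention."""
    terms, prev = [], None
    for p in powers:
        t = math.log(x) if p == 0 else x ** p
        if p == prev:              # repeated power: multiply by ln(x)
            t = t * math.log(x)
        terms.append(t)
        prev = p
    return terms

fp2 = fp_terms(2.0, [1, 1])   # FP2 with repeated power 1: x and x*ln(x)
```

The do-file approach described in the talk amounts to generating such terms as variables and passing them to the regression command of choice.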

**Additional information**

UKSUG10.Royston.ppt

Robert A. Yaffee

Silver School of Social Work, New York University

Forecasters are expected to provide evaluations of their forecasts along with
their forecasts. These assessments demonstrate comparative, adequate, or
optimal accuracy according to common forecasting criteria, lending
credence to the forecasts. To assist the Stata user in
this process, Robert Yaffee has written Stata programs to evaluate ARIMA and
GARCH models. He explains how these assessment programs are applied to
one-step-ahead and dynamic forecasts, ex post and ex ante
forecasts, conditional and unconditional forecasts, as well as combinations
of forecasts. In his presentation, he will also demonstrate how assessment
can be applied to rolling origin forecasts of time-series models.
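The common point-forecast accuracy criteria involved are straightforward to state; the Python sketch below is a generic illustration, not the programs being presented:

```python
import math

def forecast_accuracy(actual, forecast):
    """Common point-forecast accuracy criteria: root mean squared
    error, mean absolute error, and mean absolute percentage error."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100 * sum(abs(e / a) for a, e in zip(actual, errors)) / n
    return rmse, mae, mape

rmse, mae, mape = forecast_accuracy([10.0, 20.0], [11.0, 18.0])
```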

**Additional information**

UKSUG10.Yaffee.pdf

Jonathan Sterne

Department of Social Medicine, University of Bristol

Roger Harbord

Department of Social Medicine, University of Bristol

Ian White

MRC Biostatistics Unit, Cambridge

A comprehensive range of user-written commands for meta-analysis is
available in Stata and documented in detail in the recent book
*Meta-Analysis in Stata* (Sterne, ed., 2009, Stata Press). The purpose of this session
is to describe these commands, with a focus on recent developments and
areas in which further work is needed. We will define systematic reviews and
meta-analyses and will introduce the **metan** command, which is the main
Stata meta-analysis command. We will distinguish between meta-analyses of
randomized controlled trials and observational studies, and we will discuss the
additional complexities inherent in systematic reviews of the latter.
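At its core, such a command pools study results by inverse-variance weighting; the minimal Python sketch below shows the fixed-effect case (illustrative only, not the **metan** implementation):

```python
import math

def fixed_effect_meta(effects, ses):
    """Inverse-variance fixed-effect pooling: weight each study's
    effect estimate by the reciprocal of its variance."""
    weights = [1 / s ** 2 for s in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Two studies: effects 0.5 and 0.7 with standard errors 0.1 and 0.2.
pooled, se = fixed_effect_meta([0.5, 0.7], [0.1, 0.2])
```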

Meta-analyses are often complicated by heterogeneity, variation between the results of different studies beyond that expected due to sampling variation alone. Meta-regression, implemented in the **metareg** command, can be
used to explore reasons for heterogeneity, although its utility in medical
research is limited by the modest numbers of studies typically included in
meta-analyses and the many possible reasons for heterogeneity.
Heterogeneity is a striking feature of meta-analyses of diagnostic-test
accuracy studies. We will describe how to use the **midas** and
**metandi** commands to display and meta-analyze the results of such
studies.

Many meta-analysis problems involve combining estimates of more than one quantity: for example, treatment effects on different outcomes or contrasts among more than two groups. Such problems can be tackled using multivariate meta-analysis, implemented in the **mvmeta** command. We
will describe how the model is fit, and when it may be superior to a set of
univariate meta-analyses. We will also illustrate its application in a variety of
settings.

**Additional information**

UKSUG10.Sterne.pdf

UKSUG10.White.ppt

UKSUG10.Harbord.pdf

Christopher F. Baum

Department of Economics, Boston College, Chestnut Hill, Massachusetts

In this presentation, I update Nichols and Schaffer’s 2007 UK Stata Users
Group talk on clustered standard errors. Although cluster–robust standard
errors are now recognized as essential in a panel-data context, official
Stata only supports clusters that are nested within panels. This
requirement rules out the possibility of defining clusters in the time
dimension and modeling contemporaneous dependence of panel units’
error processes. I build upon recent analytical developments that define
two-way (and conceptually, *n*-way) clustering and upon the 2010
implementation of two-way clustering in the widely used **ivreg2** and
**xtivreg2** packages. I present examples of the utility of one-way and
two-way clustering using Monte Carlo techniques, I present a comparison with
alternative approaches to modeling error dependence, and I consider tests
for clustering of errors.

This is joint work with Mark E. Schaffer (Heriot-Watt University) and Austin Nichols (Urban Institute).

**Additional information**

UKSUG10.Baum.pdf

Barbara Sianesi

Institute for Fiscal Studies, London

Matching, especially in its propensity-score flavors, has become an
extremely popular evaluation method. Matching is, in fact, the
best-available method for selecting a matched (or reweighted) comparison
group that looks like the treatment group of interest.

In this talk, I will introduce matching methods within the general problem of causal inference, highlight their strengths and weaknesses, and offer a brief overview of different matching estimators. Using **psmatch2**, I
will then step through a practical example in Stata based on real
data, show how to implement some of these estimators, and
highlight a number of implementation issues.

**Additional information**

UKSUG10.Sianesi.pdf

UKSUG10.Sianesi.zip

William Gould

StataCorp, College Station, Texas

William Gould, as President of StataCorp and Chief of Development, will
report on StataCorp activity over the last year. This will morph into the
traditional voicing from the audience of users’ wishes and grumbles
regarding Stata.