9th UK Stata Users Group meeting: proceedings, abstracts, notes
Monday, 19 May 2003
Graphics (and numerics) for comparison
Nick Cox,
Durham University

Abstract
Most statistical data analysis, and thus most graphical data analysis, is
directed towards modelling of relationships, but many statistical problems
have a different flavour: their focus is comparison, and the key question is
assessing agreement or disagreement between two or more datasets or subsets
with variables measured in the same units. I survey the range of official and
user-written graphical programs available in Stata 8 for such problems, with
emphasis on making use of all the information in the data. Recurrent themes
include (1) the use of reference lines, especially horizontal reference lines,
indicating benchmark cases; (2) the relative merits of superimposition and
juxtaposition; (3) how far methods work well at a range of sample sizes; (4)
standing on giants' shoulders by writing wrappers around existing Stata
commands; (5) use (and abuse) of summary statistics appropriate for such
problems.
Additional information
compare_gph.pdf
Instrumental variables and GMM: Estimation and testing
Mark Schaffer, Heriot-Watt University (presenter)
Kit Baum, Boston College
Steven Stillman, New Zealand Department of Labour

Abstract
We discuss instrumental variables (IV) estimation in the broader context of
the generalized method of moments (GMM), and describe a set of Stata commands
ivreg2, ivhettest, overid, and ivendog that allows
the user to apply linear single-equation IV and GMM estimators and to
apply diagnostic tests for heteroskedasticity, instrument relevance,
overidentification, and endogeneity.
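As a brief sketch of how these commands combine in practice (variable and
instrument names below are hypothetical; see the authors' materials for the
full syntax and options):

    * 2SLS with heteroskedasticity-robust standard errors: educ treated as
    * endogenous, fatheduc and motheduc as excluded instruments
    * (all variable names hypothetical)
    ivreg2 lwage exper expersq (educ = fatheduc motheduc), robust

    * feasible efficient GMM, exploiting heteroskedasticity of unknown form
    ivreg2 lwage exper expersq (educ = fatheduc motheduc), gmm

    * post-estimation diagnostics
    ivhettest        // Pagan-Hall test for heteroskedasticity
    ivendog educ     // Durbin-Wu-Hausman test of endogeneity of educ

As I understand the division of labour, ivreg2 reports the Sargan/Hansen
overidentification statistic itself, while overid computes it after the
official ivreg command.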
Additional information
IVGMM3316.pdf
wp545.pdf
IVandGMM.do
General score tests for regression models incorporating 'robust' variance estimates
David Clayton,
Diabetes and Inflammation Laboratory,
Cambridge Institute for Medical Research, Cambridge University
Joanna Howson,
Diabetes and Inflammation Laboratory,
Cambridge Institute for Medical Research, Cambridge University

Abstract
Stata incorporates commands for carrying out two of the three general
approaches to asymptotic significance testing in regression models, namely
likelihood-ratio (lrtest) and Wald tests (testparm). However,
the third approach, using "score" tests, has no such general implementation.
This omission is particularly serious when dealing with "clustered" data using
the Huber–White approach. Here the likelihood-ratio test is lost, leaving only
the Wald test. This has relatively poor asymptotic properties. Our paper
describes a general implementation of score tests which generalizes to the
clustered data case.
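In outline (the notation here is illustrative, not necessarily the authors'):
a score test of H_0: \psi = 0 in \theta = (\psi, \lambda) uses only the fit
under the null,

    S = U(\hat\theta_0)' \hat{V}^{-1} U(\hat\theta_0) \sim \chi^2_k under H_0,

where U is the score vector for \psi evaluated at the restricted estimate
\hat\theta_0. In the clustered case, \hat{V} can be replaced by a robust
("sandwich") estimate built from cluster-level score sums,
\hat{V} = \sum_c U_c U_c', which is what makes a general implementation
attractive when lrtest is unavailable.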
Additional information
Clayton_sug03.pdf
Semi-nonparametric estimation of extended ordered probit models
Mark Stewart, University of Warwick

Abstract
A semi-nonparametric estimator is presented for a series of generalized models
that nest the ordered probit model and thereby relax the distributional
assumptions in that model. A new Stata command for the estimation of such
models is presented. The approach is illustrated using examples.
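For orientation, the usual semi-nonparametric device (in the spirit of Gallant
and Nychka; whether this exact form is used here, see the paper) replaces the
normal error density of the ordered probit with a squared Hermite polynomial
expansion:

    f_K(\varepsilon) \propto \Big( \sum_{k=0}^{K} \gamma_k \varepsilon^k \Big)^2 \phi(\varepsilon),

where \phi is the standard normal density. Setting K = 0 recovers the ordered
probit, so the restrictive model is nested within, and testable against, its
more flexible generalizations.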
Additional information
snp_uksug.pdf
Diagnostics for generalised linear mixed models
Sophia Rabe-Hesketh,
King's College London
and Anders Skrondal,
Norwegian Institute of Public Health

Abstract
Generalized linear mixed models are generalized linear models that include
random effects varying between clusters or 'higher-level' units of
hierarchically structured data. Such models can be estimated using
gllamm. The prediction command gllapred can be used to obtain
empirical Bayes predictions of the random effects, interpretable as
higher-level residuals. Combined with approximate sampling standard
deviations, these residuals can be used for identifying unusual higher-level
units. However, since the distribution of these predictions is generally not
known, we recommend simulating responses from the model using gllasim
and comparing 'observed' and simulated residuals. We also discuss different
types of level 1 residuals and influence diagnostics.
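A minimal sketch of the workflow (variable names hypothetical, options
abbreviated; consult the gllamm documentation for details):

    * random-intercept logistic regression with adaptive quadrature
    * (variable names hypothetical)
    gllamm y x1 x2, i(clusterid) family(binomial) link(logit) adapt

    * empirical Bayes predictions of the random intercepts, with
    * approximate sampling standard deviations
    gllapred eb, u

    * simulate responses from the fitted model, for comparing 'observed'
    * with simulated higher-level residuals
    gllasim ysim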
Additional information
diag.pdf
Prognosis of survival for breast cancer patients using Stata
Kenneth Ryder, Breast Cancer Unit, Guy's Hospital
Patrick Royston, MRC Clinical Trials Unit, London

Abstract
All doctors treating patients with breast cancer know which key variables
indicate a good prognosis and which values decrease the chances of surviving.
However, because of complex interactions between the variables and survival,
doctors cannot give an individualized prognosis to a patient. The Breast
Cancer Unit at Guy's Hospital has data on just over 3000 patients, who were
diagnosed between 1975 and 1999, with operable breast cancer, and treated with
different adjuvant therapies. A system has been developed, using Stata, to
provide the doctors with graphs showing the overall survival curves (including
the risk of dying from other causes) for the tumour characteristics and the
treatments available for that patient. The paper will outline the steps and
functions used in the analysis to produce the predictions for survival and
illustrate how the patient's data are entered into a dialogue box opened via a
main menu option.
Additional information
ryder.pdf
FIML estimation of an endogenous switching model for count data
Alfonso Miranda Caso Luengo, Warwick University

Abstract
We develop FIML code for estimating a Poisson count data model with lognormal
unobserved heterogeneity and an endogenous dummy variable as proposed by Terza
(1998). Gauss–Hermite quadrature is used to calculate the log likelihood,
and Stata's ml method d0 is employed. We present an example and discuss the
problems found during the development of the code.
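In outline (notation mine, not the author's): each likelihood contribution
integrates the Poisson probability over the normal heterogeneity, and
Gauss–Hermite quadrature with nodes a_j and weights w_j approximates that
integral:

    L_i = \int f_P(y_i \mid \lambda_i e^{\sigma u}) \phi(u) \, du
        \approx \frac{1}{\sqrt{\pi}} \sum_{j=1}^{M} w_j \, f_P(y_i \mid \lambda_i e^{\sigma\sqrt{2} a_j}),

where f_P is the Poisson probability function and
\lambda_i = \exp(x_i'\beta + \gamma d_i) includes the endogenous dummy d_i;
the switching structure additionally ties d_i to the same heterogeneity term.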
References
Terza, J. V. 1998. Estimation of count data models with endogenous switching:
sample selection and endogenous treatment effects. Journal of
Econometrics 84: 129–154.
Additional information
esp_usug.pdf
Multiple imputation for missing data in life-course studies
Bianca L. De Stavola, London School of Hygiene and Tropical Medicine

Abstract
Multiple imputation (MI) is a method for dealing with data that are missing at
random (MAR). It is a Monte Carlo procedure in which missing values are
replaced by several (usually fewer than 10) simulated versions. It consists of
three steps (Schafer, 1999): i. generation of the imputed values for the
missing data; ii. analysis of each imputed dataset, with missing observations
replaced by imputed ones; and iii. combination of the results from all imputed
datasets.
The procedure is easily implemented in Stata for univariate normally
distributed missing variables. Extensions to the case of multivariate normal
variables — often encountered in life-course epidemiology — will be discussed.
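Step iii uses Rubin's combination rules; for a scalar parameter estimated as
\hat{Q}_j with variance \hat{U}_j in imputation j = 1, ..., m:

    \bar{Q} = \frac{1}{m} \sum_j \hat{Q}_j, \qquad
    T = \bar{U} + \left(1 + \frac{1}{m}\right) B,

where \bar{U} is the average within-imputation variance and
B = \frac{1}{m-1} \sum_j (\hat{Q}_j - \bar{Q})^2 is the between-imputation
variance, so that T reflects the extra uncertainty due to the missing data.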
Reference
Schafer, J. L. 1999. Multiple imputation: a primer. Statistical Methods in
Medical Research 8: 3–15.
On dynamically linked libraries (DLLs) in Stata
Roberto G. Gutierrez and Chinh Nguyen, StataCorp

Abstract
Dynamically linked libraries, or DLLs as they are commonly referred to,
can serve as useful and integral parts of Stata user-written commands.
Since they consist of compiled code, DLLs can speed up the execution of
computationally intensive portions of commands that are otherwise written
using Stata's ado language. In this talk, we outline a simple and
easily callable interface between Stata ado code and DLLs written in the
C programming language. An example of this process, as applied to a command
that performs local polynomial smoothing, will also be presented.
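For concreteness, the Stata side of such an interface can look roughly like
this (a sketch only, with hypothetical names, based on the plugin mechanism as
StataCorp documents it):

    program define mysmooth
        version 8
        syntax varlist(numeric) [if] [in]
        marksample touse
        * hand the data over to the compiled code (names hypothetical)
        plugin call lpsmooth `varlist' if `touse'
    end

    * declare the plugin; Stata loads the compiled DLL on first use
    program lpsmooth, plugin using("lpsmooth.dll")

On the C side, the DLL exports a single entry point that receives the
arguments and reads and writes Stata variables through the supplied interface
functions.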
Tuesday, 20 May 2003
Building a user-friendly front end to a survey using Stata
Dr. John P. Haisken-DeNew, RWI Essen, DIW Berlin, IZA Bonn

Abstract
SOEPMENU is a Stata-based tool intended to ease working with large panel
datasets when running retrievals. As the German Socio-Economic Panel (SOEP)
has 18 waves and more than 210 files, correct matching of information over
time and level (e.g., household vs. person) can be tedious. Further, the
variable naming scheme of the SOEP follows the question order in the
particular year's questionnaire, so that without an "item correspondence"
there is no systematic way of knowing the variable name for year "t", given
one knows the name at time "t-1". Therefore, for simplification, all dataset
contents are viewed as a collection of "items" in an "item correspondence"
and not as "variables".

One opens the data files from a drop-down menu system, and the "items" are
displayed for selection. Alternatively, one can browse ALL items in a browse
page, allowing one to select items to be saved into a "basket". By
"collecting" many items into the basket, one creates a list of items to be
pulled out of the dataset in the retrieval. Additionally, all SOEP
questionnaires have been translated into SMCL pages, with clickable variable
names in the questionnaire.

At the click of a button, the retrieval is run according to the options the
user has chosen. Not only is a "wide" file produced, but also a "long" file
(in "reshape" terminology). The long file is possible because all wide-file
variables have been renamed according to the "serial number" of the
particular item. There is also a checking procedure to examine whether value
labels have changed over time. For items that change their contents or
definition over time, there is a standardized interface that allows
"plugins" to recode old variables, generate new variables, etc.

Once one has created the wide and long data files, one can browse them
interactively with the browse tools provided. As the tool automatically pulls
out the appropriate weighting factors, these can be used at the click of a
button. The "SOEP project", or the collection of "items" in the "basket", can
be saved, reloaded, appended, etc. This allows the addition of modularized
baskets. All data can be dumped out directly for use in SPSS, SAS (keeping
all labels), and Excel. As the SOEP data are bilingual (German and English),
one can switch between languages for any input and output file (one can use
English-labeled input files and automatically produce German-labeled
retrieval files). SOEPMENU is written for Stata 8.
Multivariate probit regression using simulated maximum likelihood
Lorenzo Cappellari, Università del Piemonte Orientale and University of Essex
and Stephen P. Jenkins, University of Essex

Abstract
We discuss the application of the GHK simulation method to maximum likelihood
estimation of the multivariate probit regression model, and describe and
illustrate a Stata program mvprobit for this purpose.
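A minimal usage sketch (equation and variable names hypothetical):

    * three correlated binary outcomes estimated jointly by simulated ML,
    * using 50 GHK draws per observation (names hypothetical)
    mvprobit (union = age educ) (insured = age educ income) ///
             (owner = age income), draws(50)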
Additional information
Capelari_JenkinsUKSUG2003.pdf
Using Stata's ml method d2 to estimate a multistate Markov transition model
Thomas Büttner, London School of Economics

Abstract
I will discuss my experience with Stata's ml method d2 when coding an
estimator for a multistate Markov transition model with unobserved
heterogeneity. When analytical derivatives are available, programming a "d2"
estimator is in principle straightforward and offers potentially huge rewards
in terms of convergence and speed of convergence: when the likelihood is flat,
method "d0" may fail to converge (after many iterations) as numerical
derivatives cannot be computed, whereas convergence is often achieved quickly
with method "d2". However, when the likelihood function is nonstandard,
programming a "d2" estimator may be complicated by Stata's limited range of
matrix commands. In these cases, the researcher has to be inventive and may
have to take a significant "diversion" to compute blocks of the Hessian that
should have been straightforward with enhanced matrix capabilities. These
"diversions" may be difficult to code and increase evaluation time
significantly. With large datasets, this may also push the memory
requirements beyond the available limit.
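To illustrate the structure of a method-d2 evaluator, here is a minimal
sketch for a plain probit (not the Markov model of the talk), so that the
derivatives stay short:

    program define myprobit_d2
        version 8
        args todo b lnf g negH
        tempvar xb p s
        mleval `xb' = `b', eq(1)
        * probability of the observed outcome ($ML_y1 is the dependent var)
        quietly gen double `p' = cond($ML_y1==1, normprob(`xb'), normprob(-`xb'))
        mlsum `lnf' = ln(`p')
        if (`todo'==0 | `lnf'>=.) exit
        * score with respect to the index xb
        quietly gen double `s' = cond($ML_y1==1, normden(`xb')/`p', -normden(`xb')/`p')
        tempname d1
        mlvecsum `lnf' `d1' = `s', eq(1)
        matrix `g' = `d1'
        if (`todo'==1 | `lnf'>=.) exit
        * negative Hessian: per-observation contribution s*(s + xb)
        mlmatsum `lnf' `negH' = `s'*(`s' + `xb'), eq(1,1)
    end

It would be fitted with, e.g., ml model d2 myprobit_d2 (foreign = mpg weight)
followed by ml maximize (variables from the auto dataset, for illustration).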
Additional information
Buettner_stat.pdf
Calculation of average marginal effects using margin
Tamás Bartus, Budapest University of Economics and Public Administration

Abstract
Margin is a user-written program that estimates average marginal effects; i.e.,
the sample average of the effects of partial or discrete changes in the
explanatory variables. The presentation will compare the performance of margin
and the official mfx. Margin is quicker because it computes the
marginal effects and their standard errors analytically, using the appropriate
cumulative distribution and density functions. If the dependent variable is a
categorical or count variable, margin is easier to use because it computes
the marginal effects for each outcome. It will also be shown that, unlike
margin, mfx can produce misleading results after categorical models if
the regression model includes a set of dummy variables that refer to the
categories of a single categorical variable.
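The computational distinction, in probit notation (mine, not the author's):
for a continuous regressor x_j, the average marginal effect is

    AME_j = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i'\hat\beta)\, \hat\beta_j,

averaging the effect over the sample, whereas mfx evaluates the effect at a
single point (by default the means), \phi(\bar{x}'\hat\beta)\hat\beta_j. For
a dummy regressor, the discrete change
\Phi(\cdot \mid d=1) - \Phi(\cdot \mid d=0) is averaged instead.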
Using the Longitudinal Study database with Stata
Andy Sloggett, London School of Hygiene and Tropical Medicine

Abstract
The Office for National Statistics Longitudinal Study (ONS/LS) is a huge,
well-maintained database of linked census records for 1% of the population of
England and Wales. With the imminent addition of 2001 census records it
provides longitudinal data on nearly a million people over a 30-year period.
Although a wealth of research has flowed out of the LS since the mid-1970s, it
is probably still underutilised, given its potential. This may be due to
lack of awareness of the richness of the dataset, or because it has a
reputation of being "difficult" to work with. Access to the LS is indeed not
as straightforward as for some other studies, but an academic support team is
available to academic users free of charge; this takes much of the drudge
out of access, as well as providing very constructive support for projects.
The support team use Stata as the software medium of choice and academic Stata
users will therefore find the interchange of code between themselves and the
support team familiar. Procedures common to Stata users, such as stsetting
and stsplitting longitudinal data, are now in common use for LS data.
Release of LS data from ONS is subject to certain restrictions and these will
be explained using an example of survival analysis following diagnosis of
cancer.
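Such analyses typically begin with declarations of the following kind
(variable names hypothetical):

    * time from cancer diagnosis to death or censoring, measured in years
    * (variable names hypothetical)
    stset exitdate, origin(time diagdate) failure(died) scale(365.25) id(id)

    * split follow-up at 1, 5, and 10 years of attained survival time
    stsplit fuband, at(1 5 10)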
Additional information
sloggett2.pdf
Bootstrap CI and test statistics for kernel density estimates using Stata
Carlo Fiorio, London School of Economics and STICERD

Abstract
In recent years, nonparametric density estimation has been extensively
employed in several fields as a powerful descriptive tool, which is far more
informative and robust than histograms. Moreover, the increased computation
power of modern computers has made nonparametric density estimation a
relatively "cheap" computation, helping to easily detect unexpected aspects of
the distribution such as bimodality. However, it is also often neglected that
nonparametric methods can only provide an estimate of the true density,
whose reliability depends on various factors, such as the number of data
available and the bandwidth. We will focus here on kernel density estimation
and discuss the problem of computing bootstrap confidence intervals and test
statistics for pointwise density estimation using Stata. Construction of
confidence intervals and tests of hypotheses about the true density are
carried out using an asymptotically pivotal studentized statistic, after
computing a suitable estimator for its variance. The issue of asymptotic bias
correction is also discussed and tackled.
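In outline (notation mine): the studentized statistic at an evaluation point
x is

    t(x) = \frac{\hat{f}_h(x) - f(x)}{\widehat{se}\{\hat{f}_h(x)\}},

whose bootstrap analogues
t^*_b = \{\hat{f}^*_h(x) - \hat{f}_h(x)\} / \widehat{se}^* are asymptotically
pivotal; their quantiles give percentile-t confidence limits

    [\, \hat{f}_h(x) - t^*_{(1-\alpha/2)}\,\widehat{se}, \;\;
        \hat{f}_h(x) - t^*_{(\alpha/2)}\,\widehat{se} \,].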
Additional information
bsciker.pdf
Adaptive kernel density estimation
Philippe Van Kerm, CEPS/INSTEAD, Differdange, G.D. Luxembourg

Abstract
The talk illustrates a user-written command that extends the official
kdensity to estimate density functions by the kernel method. The
extensions are of two types. Firstly, the new command allows the use of an
'adaptive kernel' approach with varying, rather than fixed, bandwidths.
Secondly, estimates of pointwise variability bands around the estimated
density functions are computed.
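A common two-stage scheme of this kind (Silverman's; whether it is the exact
one implemented here, see the author's materials) rescales a global bandwidth
h observation by observation:

    h_i = h \lambda_i, \qquad
    \lambda_i = \left( \tilde{f}(x_i) / g \right)^{-1/2},

where \tilde{f} is a pilot fixed-bandwidth estimate and g is the geometric
mean of the \tilde{f}(x_i), so the kernel widens in sparse regions of the
data and narrows where data are dense.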
Additional information
uksug_slides_anim.pdf
akdensity.pdf
Multiple test procedures and smile plots
Roger Newson, King's College London

Abstract
Scientists often have good reasons for wanting to calculate multiple
confidence intervals and/or p-values, especially when scanning a
genome. However, if we do this, then the probability of not observing at
least one "significant" difference tends to fall, even if all null hypotheses
are true. A skeptical public will rightly ask whether a difference is
"significant" when considered as one of a large number of parameters
estimated. This presentation demonstrates some solutions to this problem,
using the unofficial Stata packages parmest and smileplot. The
parmest package allows the calculation of Bonferroni-corrected or
Sidak-corrected confidence intervals for multiple estimated parameters. The
smileplot package contains two programs, multproc (which carries
out multiple test procedures) and smileplot (which presents their
results graphically by plotting the p-value on a reverse log scale on
the vertical axis against the parameter estimate on the horizontal axis). A
multiple test procedure takes, as input, a set of estimates and
p-values, and rejects a subset (possibly empty) of the null hypotheses
corresponding to these p-values. Multiple test procedures have
traditionally controlled the familywise error rate (FWER), typically enabling
the user to be 95% confident that all the rejected null hypotheses are false,
and that all the corresponding "discoveries" are real. The price of this
confidence is that the power to detect a difference of a given size tends to
zero as the number of measured parameters becomes large. Therefore, recent
work has concentrated on procedures that control the false discovery rate
(FDR), such as the Simes procedure and the Yekutieli–Benjamini procedure.
FDRcontrolling procedures attempt to control the number of false discoveries
as a proportion of the number of true discoveries, typically enabling the user
to be 95% confident that some of the discoveries are real, or 90% confident
that most of the discoveries are real. This less stringent requirement causes
power to "bottom out" at a nonzero level as the number of tests becomes
large. The smileplot package offers a selection of multiple test
procedures of both kinds.
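For reference, the standard forms of the corrections mentioned: with k tests,
the Bonferroni-corrected level is \alpha/k and the Sidak-corrected level is
1 - (1-\alpha)^{1/k}, while the Simes (FDR) procedure orders the p-values
p_{(1)} \le \dots \le p_{(k)} and rejects the hypotheses corresponding to
p_{(1)}, \dots, p_{(r)}, where

    r = \max\{\, i : p_{(i)} \le i\,q/k \,\}

for a target FDR of q.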
Additional information
TRANSP1.pdf
Report to users
Bill Gould, StataCorp