2003 UK Stata Users Group meeting

Last updated: 9 June 2003

19–20 May 2003

Royal Statistical Society
12 Errol Street
London EC1Y 8LX

Materials documenting the meeting

Proceedings

Graphics (and numerics) for comparison

Nick Cox, Durham University

Abstract

Most statistical data analysis, and thus most graphical data analysis, is directed towards modelling of relationships, but many statistical problems have a different flavour: their focus is comparison, and the key question is assessing agreement or disagreement between two or more datasets or subsets with variables measured in the same units. I survey the range of official and user-written graphical programs available in Stata 8 for such problems, with emphasis on making use of all the information in the data. Recurrent themes include (1) the use of reference lines, especially horizontal reference lines, indicating benchmark cases; (2) the relative merits of superimposition and juxtaposition; (3) how far methods work well at a range of sample sizes; (4) standing on giants' shoulders by writing wrappers around existing Stata commands; (5) use (and abuse) of summary statistics appropriate for such problems.

Additional information
compare_gph.pdf

Instrumental variables and GMM: Estimation and testing

Mark Schaffer, Heriot-Watt University (presenter)
Kit Baum, Boston College
Steven Stillman, New Zealand Department of Labour

Abstract

We discuss instrumental variables (IV) estimation in the broader context of the generalized method of moments (GMM), and describe a set of Stata commands ivreg2, ivhettest, overid, and ivendog that allows the user to estimate linear IV and GMM single-equation estimators and to apply diagnostic tests for heteroskedasticity, instrument relevance, overidentification, and endogeneity.

Additional information
IVGMM3316.pdf
wp545.pdf
IVandGMM.do

General score tests for regression models incorporating 'robust' variance estimates

David Clayton, Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, Cambridge University
Joanna Howson, Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, Cambridge University

Abstract

Stata incorporates commands for carrying out two of the three general approaches to asymptotic significance testing in regression models, namely likelihood-ratio tests (lrtest) and Wald tests (testparm). However, the third approach, using "score" tests, has no such general implementation. This omission is particularly serious when dealing with "clustered" data using the Huber–White approach. Here the likelihood-ratio test is lost, leaving only the Wald test, which has relatively poor asymptotic properties. Our paper describes a general implementation of score tests that generalizes to the clustered data case.

Additional information
Clayton_sug03.pdf

Semi-nonparametric estimation of extended ordered probit models

Mark Stewart, University of Warwick

Abstract

A semi-nonparametric estimator is presented for a series of generalized models that nest the ordered probit model and thereby relax the distributional assumptions in that model. A new Stata command for the estimation of such models is presented. The approach is illustrated using examples.

Additional information
snp_uksug.pdf

Diagnostics for generalised linear mixed models

Sophia Rabe–Hesketh, King's College London
and Anders Skrondal, Norwegian Institute of Public Health

Abstract

Generalized linear mixed models are generalized linear models that include random effects varying between clusters or 'higher-level' units of hierarchically structured data. Such models can be estimated using gllamm. The prediction command gllapred can be used to obtain empirical Bayes predictions of the random effects, interpretable as higher-level residuals. Combined with approximate sampling standard deviations, these residuals can be used for identifying unusual higher-level units. However, since the distribution of these predictions is generally not known, we recommend simulating responses from the model using gllasim and comparing 'observed' and simulated residuals. We also discuss different types of level 1 residuals and influence diagnostics.

Additional information
diag.pdf

Prognosis of survival for breast cancer patients using Stata

Kenneth Ryder, Breast Cancer Unit, Guy's Hospital
Patrick Royston, MRC Clinical Trials Unit, London

Abstract

All doctors treating patients with breast cancer know which key variables indicate a good prognosis and which values decrease the chances of surviving. However, because of complex interactions between the variables and survival, doctors cannot give an individualized prognosis to a patient. The Breast Cancer Unit at Guy's Hospital has data on just over 3000 patients who were diagnosed with operable breast cancer between 1975 and 1999 and treated with different adjuvant therapies. A system has been developed, using Stata, to provide the doctors with graphs showing the overall survival curves (including the risk of dying from other causes) for the tumour characteristics and the treatments available for that patient. The paper will outline the steps and functions used in the analysis to produce the predictions for survival and illustrate how the patient's data are entered into a dialogue box opened via a main menu option.

Additional information
ryder.pdf

FIML estimation of an endogenous switching model for count data

Alfonso Miranda Caso Luengo, Warwick University

Abstract

We develop FIML code for estimating a Poisson count data model with lognormal unobserved heterogeneity and an endogenous dummy variable, as proposed by Terza (1998). Gauss–Hermite quadrature is used for calculating the log likelihood, and an ml d0 method is employed. We present an example and discuss the problems found during the development of the code.
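As a rough illustration of the quadrature step only, the sketch below approximates the marginal probability of a count when the Poisson mean is exp(xb + sigma*u) with u standard normal. It is written in Python rather than Stata, it leaves out the endogenous dummy variable entirely, and every function name in it is hypothetical; it is not the authors' code.

```python
import math


def hermite_value(n, x):
    """Physicists' Hermite polynomial H_n(x) from H_{k+1} = 2x H_k - 2k H_{k-1}."""
    h_prev, h = 0.0, 1.0
    for k in range(n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h


def gauss_hermite(n):
    """Nodes t and weights w with sum(w * f(t)) ~ integral of exp(-t^2) f(t) dt."""
    roots = []
    for k in range(1, n + 1):
        # Roots of H_k interlace with those of H_{k-1}, so bisect each gap.
        bound = math.sqrt(2.0 * k + 1.0)  # all roots of H_k lie inside (-bound, bound)
        edges = [-bound] + roots + [bound]
        new_roots = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            lo_positive = hermite_value(k, lo) > 0
            for _ in range(80):
                mid = 0.5 * (lo + hi)
                if (hermite_value(k, mid) > 0) == lo_positive:
                    lo = mid
                else:
                    hi = mid
            new_roots.append(0.5 * (lo + hi))
        roots = new_roots
    scale = 2.0 ** (n - 1) * math.factorial(n) * math.sqrt(math.pi) / n ** 2
    weights = [scale / hermite_value(n - 1, r) ** 2 for r in roots]
    return roots, weights


def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu, computed on the log scale."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))


def marginal_prob(y, xb, sigma, nodes, weights):
    """P(y) after integrating out u ~ N(0, 1), using the substitution u = sqrt(2) * t."""
    total = 0.0
    for t, w in zip(nodes, weights):
        mu = math.exp(xb + sigma * math.sqrt(2.0) * t)
        total += w * poisson_pmf(y, mu)
    return total / math.sqrt(math.pi)


def log_likelihood(ys, xbs, sigma, n_nodes=20):
    """Sum of log marginal probabilities over observations (heterogeneity only)."""
    nodes, weights = gauss_hermite(n_nodes)
    return sum(math.log(marginal_prob(y, xb, sigma, nodes, weights))
               for y, xb in zip(ys, xbs))
```

A likelihood evaluator of this kind returns only the log likelihood, which corresponds to what a d0-type method needs; derivatives would then be obtained numerically by the optimizer.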

References

Terza, J. 1998. Estimating count data models with endogenous switching: Sample selection and endogenous treatment effects. Journal of Econometrics 84: 129–154.

Additional information
esp_usug.pdf

Multiple imputations for missing data in lifecourse studies

Bianca L. De Stavola, London School of Hygiene and Tropical Medicine

Abstract

Multiple imputation (MI) is a method for dealing with data that are missing at random (MAR). It is a Monte Carlo procedure in which missing values are replaced by several (usually fewer than 10) simulated versions. It consists of three steps (Schafer, 1999): i. generation of the imputed values for the missing data; ii. analysis of each imputed dataset where missing observations are replaced by imputed ones; and iii. combination of the results from all imputed datasets.

The procedure is easily implemented in Stata for univariate normally distributed missing variables. Extensions to the case of multivariate normal variables, often encountered in life course epidemiology, will be discussed.

Reference

Schafer, J. L. 1999. Multiple imputation: a primer. Statistical Methods in Medical Research 8: 3–15.
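Step iii is commonly carried out with Rubin's combining rules. The abstract does not say which rules the talk used, so the sketch below is an assumption: a minimal Python illustration with hypothetical names, not the presenter's Stata code.

```python
import math


def combine_estimates(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = sum(estimates) / m
    within = sum(variances) / m  # average within-imputation variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)
```

For example, estimates of 1.0, 1.2, and 0.8 with variances 0.04, 0.05, and 0.03 pool to an estimate of 1.0, with total variance 0.04 + (4/3)(0.04).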

On dynamically linked libraries (DLL's) in Stata

Roberto G. Gutierrez and Chinh Nguyen, StataCorp

Abstract

Dynamically linked libraries, or DLL's as they are commonly referred to, can serve as useful and integral parts of Stata user-written commands. Since they consist of compiled code, DLL's can speed up the execution of computationally intensive portions of commands that are otherwise written using Stata's ado language. In this talk, we outline a simple and easily callable interface between Stata ado code and DLL's written in the C programming language. An example of this process, as applied to a command that performs local polynomial smoothing, will also be presented.

Building a user-friendly front end to a survey using Stata

Dr. John P. Haisken-DeNew, RWI Essen, DIW Berlin, IZA Bonn

Abstract

SOEPMENU is a Stata-based tool intended to ease working with large panel datasets when running retrievals. As the German Socio-Economic Panel (SOEP) has 18 waves and more than 210 files, correct matching of information over time and level (e.g., household vs. person) can be tedious. Further, the variable naming scheme of the SOEP follows the question order in the particular year's questionnaire, such that without an "item-correspondence", there is no systematic way of knowing the variable name from year "t", given one knows the name at time "t-1". Therefore, for simplification, all dataset contents are viewed as a collection of "items" in an "item correspondence" and not as "variables". One opens the data files from a drop-down menu system, and the "items" are displayed for selection. Alternatively, one can browse ALL items in a browse page, allowing one to select items to be saved into a "basket". By "collecting" many items into the basket, one creates a list of items to be pulled out of the dataset in the retrieval. Additionally, all SOEP questionnaires have been translated into SMCL pages, with clickable variable names in the questionnaire. At the click of a button, the retrieval is run, according to the options the user has chosen. Not only is a "wide" file produced, but also a "long" file (in "reshape" terminology). The long file is possible as all wide file variables have been renamed according to the "serial number" of the particular item. There is also a checking procedure to examine whether value labels have changed over time. For items that change their contents or definition over time, there is a standardized interface to allow "plugins" to recode old variables, generate new variables, etc. Once one has created the wide and long data files, one can browse them interactively with the browse tools provided. As the tool automatically pulls out the appropriate weighting factors, these can be used at the click of a button.
The "SOEP project", or the collection of "items" in the "basket", can be saved, reloaded, appended, etc. This allows the addition of modularized baskets. All data can be dumped out directly for use in SPSS, SAS (keeping all labels), and Excel. As the SOEP data are bilingual (German and English), one can switch between languages with any input and output file (one can use English labeled input files and automatically produce German labeled retrieval files). SOEPMENU is written for Stata 8.

Multivariate probit regression using simulated maximum likelihood

Lorenzo Cappellari, Università del Piemonte Orientale and University of Essex and Stephen P. Jenkins, University of Essex

Abstract

We discuss the application of the GHK simulation method to maximum likelihood estimation of the multivariate probit regression model, and describe and illustrate a Stata program mvprobit for this purpose.
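To give a feel for the GHK idea, the sketch below simulates a bivariate normal orthant probability in Python: it draws the first error from a truncated normal, then multiplies the first marginal probability by the conditional probability for the second error. This is only a two-equation illustration with hypothetical names and is not the mvprobit code.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ghk_bivariate(a, b, rho, draws=5000, seed=12345):
    """Simulate P(e1 < a, e2 < b) for standard normal errors with correlation rho."""
    rng = random.Random(seed)
    # Cholesky factor of [[1, rho], [rho, 1]]: e1 = eta1, e2 = l21*eta1 + l22*eta2.
    l21 = rho
    l22 = math.sqrt(1.0 - rho * rho)
    p1 = STD_NORMAL.cdf(a)  # P(eta1 < a)
    total = 0.0
    for _ in range(draws):
        # Draw eta1 from a standard normal truncated to (-inf, a) by inverse CDF.
        p = min(max(rng.random() * p1, 1e-15), 1.0 - 1e-15)
        eta1 = STD_NORMAL.inv_cdf(p)
        # Probability that eta2 satisfies the second inequality given eta1.
        p2 = STD_NORMAL.cdf((b - l21 * eta1) / l22)
        total += p1 * p2
    return total / draws
```

In a probit likelihood, a and b would be the signed linear predictors for one observation, and the simulated probability would replace the exact one before taking logs. With rho = 0.5 and a = b = 0 the exact probability is 1/3, which gives a convenient check on the simulator.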

Additional information
Capelari_Jenkins-UKSUG2003.pdf

Using Stata's -ml method d2- to estimate a multi-state Markov transition model

Thomas Büettner, London School of Economics

Abstract

I will discuss my experience with Stata's ml method d2 when coding an estimator for a multi-state Markov transition model with unobserved heterogeneity. When analytical derivatives are available, programming a "d2" estimator is in principle straightforward and offers potentially huge rewards in terms of convergence and speed of convergence: when the likelihood is flat, method "d0" may fail to converge (after many iterations) as numerical derivatives cannot be computed, whereas convergence is often achieved quickly with method "d2". However, when the likelihood function is nonstandard, programming a "d2" estimator may be complicated by Stata's limited range of matrix commands. In these cases, the researcher has to be inventive and may have to take a significant "diversion" to compute blocks of the Hessian that should have been straightforward with enhanced matrix capabilities. These "diversions" may be difficult to code and increase evaluation time significantly. With large datasets, this may also push the memory requirements beyond the available limit.

Additional information
Buettner_stat.pdf

Calculation of average marginal effects using -margin-

Tamás Bartus, Budapest University of Economics and Public Administration

Abstract

Margin is a user-written program that estimates average marginal effects; i.e., the sample average of the effects of partial or discrete changes in the explanatory variables. The presentation will compare the performance of margin and the official mfx. Margin is quicker because it computes the marginal effects and their standard errors analytically, using the appropriate cumulative distribution and density functions. If the dependent variable is a categorical or count variable, margin is easier to use because it computes the marginal effects for each outcome. It will also be shown that, unlike margin, mfx can produce misleading results after categorical models if the regression model includes a set of dummy variables that refer to the categories of a single categorical variable.
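The idea of averaging effects over the sample can be sketched for a binary logit model. The Python below is an illustration under that assumption, with hypothetical function names; it computes point estimates only (no standard errors) and is not the margin program.

```python
import math


def logistic_cdf(z):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_index(row, coefs, intercept):
    """Linear predictor for one observation."""
    return intercept + sum(b * x for b, x in zip(coefs, row))


def average_marginal_effects(rows, coefs, intercept):
    """Sample average of dP(y=1)/dx_j for each covariate, treated as continuous."""
    density_sum = 0.0
    for row in rows:
        p = logistic_cdf(linear_index(row, coefs, intercept))
        density_sum += p * (1.0 - p)  # logistic density at the linear index
    mean_density = density_sum / len(rows)
    return [b * mean_density for b in coefs]


def average_discrete_effect(rows, coefs, intercept, index):
    """Sample average of P(y=1 | x_index=1) - P(y=1 | x_index=0) for a dummy."""
    total = 0.0
    for row in rows:
        on = list(row)
        off = list(row)
        on[index], off[index] = 1.0, 0.0
        total += (logistic_cdf(linear_index(on, coefs, intercept))
                  - logistic_cdf(linear_index(off, coefs, intercept)))
    return total / len(rows)
```

The effect is evaluated at every observation and then averaged, in contrast to evaluating it once at the sample means of the covariates.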

Using the Longitudinal Study database with Stata

Andy Sloggett, London School of Hygiene and Tropical Medicine

Abstract

The Office for National Statistics Longitudinal Study (ONS/LS) is a huge, well-maintained database of linked census records for 1% of the population of England and Wales. With the imminent addition of 2001 census records, it provides longitudinal data on nearly a million people over a 30-year period.

Although a wealth of research has flowed out of the LS since the mid-1970s, it is probably still under-utilised, given its potential. This may be due to a lack of awareness of the richness of the dataset, or because it has a reputation for being "difficult" to work with. Access to the LS is indeed not as straightforward as for some other studies, but an academic support team is available to academic users free of charge, and this takes much of the drudgery out of access, as well as providing very constructive support for projects.

The support team use Stata as the software medium of choice and academic Stata users will therefore find the interchange of code between themselves and the support team familiar. Procedures common to Stata users, such as stsetting and stsplitting longitudinal data, are now in common use for LS data. Release of LS data from ONS is subject to certain restrictions and these will be explained using an example of survival analysis following diagnosis of cancer.

Additional information
sloggett2.pdf

Bootstrap CI and test statistics for kernel density estimates using Stata

Carlo Fiorio, London School of Economics and STICERD

Abstract

In recent years, nonparametric density estimation has been extensively employed in several fields as a powerful descriptive tool, which is far more informative and robust than histograms. Moreover, the increased computation power of modern computers has made nonparametric density estimation a relatively "cheap" computation, helping to easily detect unexpected aspects of the distribution such as bimodality. However, it is often neglected that nonparametric methods can only provide an estimate of the true density, whose reliability depends on various factors, such as the number of data points available and the bandwidth. We will focus here on kernel density estimation and discuss the problem of computing bootstrap confidence intervals and test statistics for point-wise density estimation using Stata. Construction of confidence intervals and tests of hypotheses about the true density are carried out using an asymptotically pivotal studentized statistic after computing a suitable estimator for its variance. The issue of asymptotic bias correction is also discussed and tackled.

Additional information
bsciker.pdf

Adaptive kernel density estimation

Philippe Van Kerm, CEPS/INSTEAD, Differdange, G.-D. Luxembourg

Abstract

The talk illustrates a user-written command that extends the official kdensity to estimate density functions by the kernel method. The extensions are of two types. Firstly, the new command allows the use of an 'adaptive kernel' approach with varying, rather than fixed, bandwidths. Secondly, estimates of pointwise variability bands around the estimated density functions are computed.
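One common way to obtain varying bandwidths is to scale a global bandwidth by local factors derived from a pilot fixed-bandwidth estimate, so that observations in sparse regions get wider kernels. Whether the command follows exactly this scheme is an assumption; the Python sketch below uses a Gaussian kernel and hypothetical names, and it omits the variability bands.

```python
import math


def gaussian_kernel(u):
    """Standard normal density used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def fixed_kde(x, data, h):
    """Ordinary kernel density estimate at x with a single bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)


def local_bandwidth_factors(data, h, alpha=0.5):
    """Factors (pilot density / geometric mean) ** (-alpha), one per observation."""
    pilot = [fixed_kde(xi, data, h) for xi in data]
    log_geometric_mean = sum(math.log(p) for p in pilot) / len(pilot)
    g = math.exp(log_geometric_mean)
    return [(p / g) ** (-alpha) for p in pilot]


def adaptive_kde(x, data, h, factors):
    """Density estimate at x where observation i uses bandwidth h * factors[i]."""
    total = 0.0
    for xi, lam in zip(data, factors):
        width = h * lam
        total += gaussian_kernel((x - xi) / width) / width
    return total / len(data)
```

With alpha = 0 every factor equals 1 and the estimate reduces to the fixed-bandwidth case; alpha = 0.5 is a frequently used choice.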

Additional information
uksug_slides_anim.pdf
akdensity.pdf

Multiple test procedures and smile plots

Roger Newson, King's College University of London

Abstract

Scientists often have good reasons for wanting to calculate multiple confidence intervals and/or p-values, especially when scanning a genome. However, if we do this, then the probability of not observing at least one "significant" difference tends to fall, even if all null hypotheses are true. A skeptical public will rightly ask whether a difference is "significant" when considered as one of a large number of parameters estimated. This presentation demonstrates some solutions to this problem, using the unofficial Stata packages parmest and smileplot. The parmest package allows the calculation of Bonferroni-corrected or Sidak-corrected confidence intervals for multiple estimated parameters. The smileplot package contains two programs, multproc (which carries out multiple test procedures) and smileplot (which presents their results graphically by plotting the p-value on a reverse log scale on the vertical axis against the parameter estimate on the horizontal axis). A multiple test procedure takes, as input, a set of estimates and p-values, and rejects a subset (possibly empty) of the null hypotheses corresponding to these p-values. Multiple test procedures have traditionally controlled the family-wise error rate (FWER), typically enabling the user to be 95% confident that all the rejected null hypotheses are false, and that all the corresponding "discoveries" are real. The price of this confidence is that the power to detect a difference of a given size tends to zero as the number of measured parameters becomes large. Therefore, recent work has concentrated on procedures that control the false discovery rate (FDR), such as the Simes procedure and the Yekutieli-Benjamini procedure. FDR-controlling procedures attempt to control the number of false discoveries as a proportion of the number of true discoveries, typically enabling the user to be 95% confident that some of the discoveries are real, or 90% confident that most of the discoveries are real.
This less stringent requirement causes power to "bottom out" at a non-zero level as the number of tests becomes large. The smileplot package offers a selection of multiple test procedures of both kinds.
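The contrast between the two kinds of procedure can be sketched in a few lines of Python. The thresholds below are the standard Bonferroni and Sidak ones, and the step-up rule is the usual Simes-based FDR procedure; the function names are hypothetical, and the multproc implementation may differ in detail.

```python
def bonferroni_threshold(alpha, m):
    """Per-test p-value threshold controlling the FWER at alpha over m tests."""
    return alpha / m


def sidak_threshold(alpha, m):
    """Per-test threshold that is exact for m independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def one_step_rejections(pvalues, threshold):
    """Indices of hypotheses whose p-value is at or below a fixed threshold."""
    return [i for i, p in enumerate(pvalues) if p <= threshold]


def simes_step_up(pvalues, fdr=0.05):
    """Indices rejected by the step-up rule: find the largest rank k such that
    the k-th smallest p-value is <= k * fdr / m, then reject the k smallest."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])
```

Because the step-up threshold grows with the rank of the p-value, the FDR procedure can reject hypotheses that a single fixed Bonferroni or Sidak threshold would retain.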

Additional information
TRANSP1.pdf

Report to users

Bill Gould, StataCorp

Discussion: Wishes and grumbles

Bill Gould, StataCorp

Scientific organizers

Sophia Rabe–Hesketh, Institute of Psychiatry, King's College London
Stephen Jenkins, University of Essex

Logistics organizers

Timberlake Consultants, the official distributor of Stata in the United Kingdom, Ireland, Spain, and Portugal.