10th UK Stata Users Group meetings: Abstracts

Analyzing linked employer–employee data with Stata

Richard Upward
School of Economics, University of Nottingham

Abstract

The use of datasets that contain information on both workers and the firms they work for is growing rapidly, especially in fields such as applied econometrics and labor economics. Similar data structures may also arise in the analysis of data on patients and doctors or students and schools. Many of these datasets are extremely large, some containing a substantial fraction of the population of firms and workers.

The analysis of this kind of data poses two related problems. The first is a problem of computing power, memory, and storage. The second is the statistical problem of how to control for and estimate the "unobserved effects" (also known as "fixed effects") for both workers and firms.

In this presentation, we explain the basic issues and how we have dealt with them using Stata. We illustrate using both simulated data and a large linked employer–employee panel collected by the Institut für Arbeitsmarkt und Berufsforschung in Germany. We show how to implement various potential methods and suggest problems and limitations that the analyst using Stata may encounter.
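As a rough illustration of the statistical problem (not the presenter's own code), the sketch below simulates a small worker-firm panel and estimates both sets of unobserved effects by sweeping out the worker effects and adding firm dummies. All variable names and parameter values are invented for the example, and the syntax assumes a recent Stata release (factor variables, rnormal()) run from a do-file.

```stata
* Simulated linked worker-firm panel; every name and value here is illustrative
clear
set seed 12345
set obs 500                            // 500 workers
generate worker = _n
generate alpha = rnormal()             // unobserved worker effect
expand 6                               // 6 yearly records per worker
bysort worker: generate year = _n
generate firm = ceil(25*runiform())    // employer in each year (about 25 firms)
generate psi = (firm - 13)/10          // unobserved firm effect, constant within firm
generate x = 0.5*alpha + rnormal()     // regressor correlated with the worker effect
generate y = 1 + 2*x + alpha + psi + rnormal()

* Within transformation removes worker effects; firm effects enter as dummies
xtset worker year
xtreg y x i.firm, fe
```

With 25 firms the dummy-variable approach is cheap. With a substantial fraction of all firms in a country it is not, which is the computing problem the abstract refers to.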

Reference

Abowd, J. and F. Kramarz. 1999. The analysis of labor markets using matched employer–employee data. In Ashenfelter, O. and Card, D. (eds.) Handbook of Labor Economics Volume 3. North-Holland.

Approximating the bias of the LSDV estimator for dynamic panel data models

Giovanni S. F. Bruno
Università Commerciale Luigi Bocconi, Milano

Abstract

It is well known that the LSDV estimator for dynamic panel data models is not consistent for N large and finite T. Nickell (1981) derives an expression for the inconsistency for N→∞, which is O(1/T). Kiviet (1995) uses asymptotic expansion techniques to approximate the small sample bias of the LSDV estimator to also include terms of at most order 1/NT, thus offering a method to correct the LSDV estimator for samples where N is small or only moderately large. In Kiviet (1999) and Bun and Kiviet (2003), the bias expression is more accurate, including higher order terms. Monte Carlo evidence in Judson and Owen (1999) strongly supports the corrected LSDV estimator compared to more traditional GMM estimators when N is only moderately large. Bruno (2004) extends the bias approximation formulas in Bun and Kiviet (2003) to accommodate unbalanced panels with a strictly exogenous selection rule.

This paper describes the Stata code used in Bruno (2004) to compute the bias approximations and to carry out the Monte Carlo experiment that estimates the actual LSDV bias for various data-generating processes. The analysis covers both balanced and unbalanced panels. The actual bias, as estimated by Monte Carlo replications, follows the same patterns as in Bun and Kiviet (2003) and turns out to be nonincreasing in the degree of unbalancedness. Moreover, the approximations are always accurate, with the higher-order terms making a decreasing contribution to the actual bias.
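The inconsistency that motivates these corrections is easy to reproduce. The following is a minimal single-replication sketch, not the code from Bruno (2004): it simulates a balanced AR(1) panel with a true autoregressive coefficient of 0.5 and fits the LSDV (within) estimator. The sample sizes, the burn-in length, and all variable names are assumptions chosen for illustration.

```stata
* One simulated balanced panel: N = 1000 units, T = 5 periods kept after burn-in
clear
set seed 1981
set obs 1000
generate id = _n
generate eta = rnormal()                   // unit-specific fixed effect
expand 25                                  // 20 burn-in periods + 5 retained
bysort id: generate t = _n
generate y = eta + rnormal() if t == 1
* replace works through observations in order, so y[_n-1] is already updated
bysort id (t): replace y = 0.5*y[_n-1] + eta + rnormal() if t > 1
drop if t <= 20

xtset id t
xtreg y L.y, fe                            // LSDV estimate of the coefficient on lagged y
```

With T this short, the estimated coefficient on L.y should fall well below the true value of 0.5 even though N is large. Wrapping the block in a program and calling it through simulate would give a Monte Carlo estimate of the bias.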

References

Bruno, G. S. F. 2004. Approximating the bias of the LSDV estimator for unbalanced dynamic panel data models. mimeo.

Bun, M. J. G. and J. F. Kiviet. 2003. On the diminishing returns of higher order terms in asymptotic expansions of bias. Economics Letters 79: 145–152.

Judson, R. A. and A. L. Owen. 1999. Estimating dynamic panel data models: a guide for macroeconomists. Economics Letters 65: 9–15.

Kiviet, J. F. 1995. On bias, inconsistency and efficiency of various estimators in dynamic panel data models. Journal of Econometrics 68: 53–78.

------. 1999. Expectation of expansions for estimators in a dynamic panel data model: some results for weakly exogenous regressors. In Hsiao, C., K. Lahiri, L.-F. Lee, and M. H. Pesaran. eds. Analysis of Panel Data and Limited Dependent Variables. Cambridge: Cambridge University Press.

Nickell, S. J. 1981. Biases in dynamic models with fixed effects. Econometrica 49: 1417–1426.

Multiple imputation of missing data: an implementation of van Buuren's MICE, and more

Patrick Royston
MRC Clinical Trials Unit, London

Abstract

Following the seminal publications of Rubin starting about 30 years ago, statisticians have become increasingly aware of the inadequacy of 'complete-case' analysis of datasets with missing observations. In medicine, for example, observations may be missing in a sporadic way for different covariates; a complete-case analysis may omit as many as half of the available cases. 'Hotdeck' imputation was implemented in Stata by Mander and Clayton (1999). However, this technique may perform poorly in the common case when many rows of data have at least one missing value.

In this talk, I will describe an implementation for Stata of the 'MICE' method of multiple multivariate imputation described by van Buuren et al. (1999) (see also www.multiple-imputation.com). MICE stands for Multivariate Imputation by Chained Equations. The basic idea of data analysis with multiple imputation is to create a small number (e.g., 3–5) of copies of the data, each of which has the missing values suitably imputed. Then, each complete dataset is analyzed independently. Estimates of parameters of interest are averaged across the copies to give a single estimate. Standard errors are computed according to the 'Rubin rules' (Rubin 1987), devised to allow for the between- and within-imputation components of variation in the parameter estimates.

In the talk, I will briefly present five ado-files. mvis creates multiple multivariate imputations. uvis imputes missing values for a single variable as a function of several covariates, each with complete data. micombine fits a wide variety of regression models to a multiply imputed dataset, combining the estimates using Rubin's rules. micombine supports survival analysis models (stcox and streg), categorical data models, generalized linear models, and more. Finally, misplit and mijoin are utilities to inter-convert datasets created by mvis and by the miset routine of Carlin et al. (2003). The use of the routines will be illustrated by example.
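The combination step under Rubin's rules can be sketched by hand. The block below does not use micombine; it only shows the arithmetic. The five estimates and standard errors are made-up numbers standing in for the results of fitting the same model to five imputed copies of a dataset.

```stata
* Hypothetical estimates and standard errors from m = 5 imputed datasets
clear
input est se
1.10 0.30
0.95 0.28
1.20 0.33
1.05 0.31
1.00 0.29
end

generate sesq = se^2
quietly summarize est
scalar m    = r(N)
scalar qbar = r(mean)                 // pooled point estimate
scalar bvar = r(Var)                  // between-imputation variance
quietly summarize sesq
scalar wvar = r(mean)                 // within-imputation variance
scalar tvar = wvar + (1 + 1/m)*bvar   // total variance under Rubin's rules

display "pooled estimate = " qbar "   pooled SE = " sqrt(tvar)
```

The factor (1 + 1/m) inflates the between-imputation component to allow for the use of a finite number of imputations.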

References

Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3: 226–244.

Mander, A. and D. Clayton. 1999. Hotdeck imputation. Stata Technical Bulletin 51: 32–34.

Rubin, D. B. 1987. Multiple Imputation for Non-response in Surveys. New York: John Wiley.

van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18: 681–694.

Smooth hazard functions for survival time data

Margaret May
Department of Social Medicine, University of Bristol

Abstract

In medical prognosis based on survival analysis, there is an interest in visualizing the shape of the hazard function. In fully parametric models, the shape of the hazard function is constrained by the properties of the chosen distribution (Weibull, log-logistic, lognormal, Gompertz, gamma). The semi-parametric Cox model only assumes proportional hazards and has no specification of the baseline hazard. In Stata 8, a method of illustrating hazard functions using a kernel smooth of the hazard contributions is implemented for the Cox model which will allow more flexible shapes. However, if the proportional hazards assumption is violated, a method based on smoothing the Nelson–Aalen cumulative hazard function followed by numerical differentiation to give the hazard function and further kernel-density smoothing of the resulting function may be useful.
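For readers who want to try the official tools mentioned above, here is a brief sketch using drugtr, an example dataset distributed by StataCorp that is already stset. It is an assumption of convenience and is unrelated to the ART-CC data. The stcurve call is written for recent releases; older releases required the baseline hazard contributions to be saved when fitting stcox.

```stata
webuse drugtr, clear                 // example data from StataCorp, already stset

* Kernel-smoothed hazard estimated separately in each group (no PH assumption)
sts graph, hazard by(drug)

* Smoothed hazard implied by a Cox model, which imposes proportional hazards
stcox drug age
stcurve, hazard at1(drug=0) at2(drug=1)
```

Comparing the two plots gives an informal impression of how much the proportional hazards assumption constrains the estimated shapes.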

This method will be illustrated using data from ART-CC, an international collaboration of 12 cohorts with data on over 19,000 HIV positive patients. The hazard of AIDS or death by risk factor groups defined by initial CD4 count (a measure of immune system functioning) or injection drug use (IDU) is modeled from the time of starting antiretroviral therapy for up to 5 years.

A new Stata command for estimating confidence intervals for the variance component of random-effects linear models

Matteo Bottai
Arnold School of Public Health, University of South Carolina, Columbia, SC and Institute of Information Science and Technology, National Research Council, Pisa

Nicola Orsini
Institute of Information Science and Technology, National Research Council, Pisa and Institute of Environmental Medicine, Karolinska Institutet, Stockholm

Abstract

The Stata command xtreg estimates the random-effects linear regression model, for which the random effects are assumed to be normally distributed with zero mean and non-negative variance, σ_u². Testing homogeneity across units is equivalent to testing the null hypothesis H0: σ_u² = 0, which is a value on the boundary of the parameter space. The command xtreg provides the upper-tail probability of the appropriate asymptotic distribution of the likelihood ratio test statistic. However, such a method cannot be used to construct confidence intervals for the parameter σ_u². Moreover, confidence intervals for the random-effect variance that are based on a Wald-type test, which are used too often, can be shown to be asymptotically wrong.

Based on the asymptotic theory for singular information problems, a method is developed and implemented in the Stata command xtci, which provides asymptotically-correct confidence intervals. Also, when testing the hypothesis of homogeneity across units, the proposed method is shown to have better small-sample properties than one based on the likelihood ratio test statistic.
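The boundary test that xtreg reports can be reproduced in a few lines. This sketch uses the nlswork example dataset from StataCorp and an arbitrary choice of regressors, both assumptions made for illustration. It does not call xtci, whose syntax is not given in the abstract.

```stata
webuse nlswork, clear
xtset idcode year

* ML random-effects fit; the footer reports the LR test of a zero variance component
xtreg ln_wage age tenure, re mle

* Under H0 the LR statistic follows a 50:50 mixture of chi2(0) and chi2(1),
* so the p-value is half the usual chi2(1) upper-tail probability
display "LR statistic = " e(chi2_c)
display "boundary-corrected p-value = " 0.5*chi2tail(1, e(chi2_c))
```

The halving applies only to the test of the null value on the boundary. It does not by itself yield a confidence interval, which is the gap the abstract addresses.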

References

Bottai, M. 2003. Confidence regions when the Fisher information is zero. Biometrika 90(1): 73–84.

Bottai, M. and N. Orsini. 2004. Confidence intervals for the variance component of random-effects linear models. Stata Journal: in press.

A comment on infrequency of purchase models in Stata

Julian A. Fennema
Centre for Economic Reform and Transformation, Heriot–Watt University

Abstract

This paper introduces the dhurdle command for Stata, a maximum-likelihood routine (d2) to estimate the Cragg double hurdle model with independent or dependent errors. We give a brief description of the procedure and its application to durable goods consumption and market participation models. We briefly demonstrate the construction of the program and present evidence of its consistency. We compare its efficiency to the results reported by Flood and Gråsjö for the routine programmed in Gauss and repeat their tests of the effect of misspecification on the parameter estimates. We also outline extensions in the pipeline, particularly the inverse hyperbolic sine heteroskedasticity correction, and also invite suggestions.

Reference

Flood, L. and U. Gråsjö. 2001. A Monte Carlo Simulation of Tobit models. Applied Economics Letters 8: 581–584.

Topics in time series regression modeling

Christopher F. Baum
Department of Economics, Boston College, Chestnut Hill, MA

Abstract

This talk will discuss the use of a number of Stata commands, some "official" and some user-contributed, in the context of working with time-series and panel data. Testing for endogeneity/exogeneity of regressors, heteroskedasticity in an instrumental variables context, and fitting regression models with ARMA errors will be considered, as well as a number of tests for stationarity of single or multiple time series, including stationarity in the presence of structural breaks.
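A few of the official commands in this area can be tried on wpi1, an example quarterly series distributed by StataCorp. The dataset and the lag choices below are assumptions for illustration and are not taken from the talk.

```stata
webuse wpi1, clear                   // quarterly wholesale price index example data
tsset t

* Augmented Dickey-Fuller tests on the log level and on its first difference
dfuller ln_wpi, trend lags(4)
dfuller D.ln_wpi, lags(4)

* Model with ARMA(1,1) disturbances; covariates could be added after D.ln_wpi
arima D.ln_wpi, ar(1) ma(1)
```

With no covariates the arima call reduces to a pure ARMA model around a constant, the simplest case of a regression with ARMA errors.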

Stata graphics, under the hood

Vince Wiggins
StataCorp, College Station, TX

Abstract

Stata's graphics are more flexible than many realize. We will exploit this flexibility and explore a potpourri of topics, some of interest to all graphers and others primarily of interest to those creating highly customized graphs or new graph commands. Among the topics will be creating custom schemes to control the appearance of graphs, including an overview of the format and contents of scheme files. We will also examine twoway graphs as a platform for creating custom graphs, some of which are not readily apparent. We will discuss techniques for managing data and leveraging twoway's native plottypes. Along the way we will introduce some new official and unofficial tools, and perhaps some downright dangerous, but useful, undocumented tricks.
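As a small taste of twoway as a platform, the sketch below overlays two plottypes and switches scheme in a single command. It uses the auto dataset shipped with Stata and the official s1mono scheme. The title and legend text are arbitrary choices, not material from the talk.

```stata
sysuse auto, clear

* Two overlaid plottypes drawn under a monochrome scheme
twoway (scatter mpg weight) (lfit mpg weight), scheme(s1mono) ///
    title("Mileage and weight") legend(order(1 "Observed" 2 "Linear fit"))
```

Replacing scheme(s1mono) with another installed scheme changes the overall look without touching the plot specification, which is the separation that custom scheme files build on.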

There are annotated materials for this talk that can be viewed and run from within Stata. To find, install, and begin the materials, type the following commands in Stata:
        . net from http://www.stata.com/users/vwiggins
        . net describe uk04
        . net install uk04

        . ukgrtalk
        . whelp ukgrtalk

Separation brings analysts and their graphs together

Matthew Barnes
Office for National Statistics, London

Abstract

The recent addition of support for the CMYK color model to Stata 8 allows graphs from Stata to be used where color separation is required for printing. This paper outlines work being done at the Office for National Statistics to use Stata for graphics in our flagship "Economic Trends" publication. This work aims to reduce the burden on our design teams, allow later deadlines and ensure that analysts have more control over the appearance of their graphics in the final publication.

Circular statistics in Stata, revisited

Nicholas J. Cox
Department of Geography, Durham University

Abstract

Circular data are a large class of directional data, which are of interest to scientists in many fields, including biologists (movements of migrating animals), meteorologists (winds), geologists (directions of joints and faults), and geomorphologists (landforms, oriented stones). These examples are all recordable as compass bearings relative to North. Other examples include phenomena that are periodic in time, including those dependent on time of day (in biomedical statistics: hospital visits or times of birth) or time of year (in applied economics: unemployment or sales variations). The analysis of circular data is an odd corner of statistical science that many never visit, even though it has a long and curious history. Moreover, it seems that no major statistical language provides direct support for circular statistics. This talk describes the development and use of some routines that have been written in Stata, primarily to allow graphical and exploratory analyses. In 2004, such routines are being rewritten, especially to allow use of the new graphics of Stata 8.
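A short worked example shows why compass bearings need their own summary statistics. The three bearings below are invented. The code computes the vector mean direction and the mean resultant length using only built-in functions, and it is not one of the routines described in the talk.

```stata
* Three hypothetical bearings in degrees, clustered around North
clear
input theta
350
10
30
end

generate s = sin(theta*_pi/180)
generate c = cos(theta*_pi/180)
quietly summarize s
scalar sbar = r(mean)
quietly summarize c
scalar cbar = r(mean)

scalar meandir = mod(atan2(sbar, cbar)*180/_pi, 360)   // vector mean direction
scalar rbar    = sqrt(sbar^2 + cbar^2)                 // mean resultant length
display "mean direction = " meandir "   mean resultant length = " rbar
```

The arithmetic mean of these bearings is 130 degrees, which points roughly southeast. The vector mean is close to 10 degrees, in line with where the observations actually lie.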

Biplots, revisited

Ulrich Kohler
WZB, Berlin

Abstract

Biplots display correlations and differences in means and standard deviations of many variables on one graph, together with the values of the plotted variables and approximations of the Euclidean distance between the observations. Biplots are useful for identifying clusters of observations, guiding interpretation of factor analyses, detecting multivariate outliers and getting an idea about the correlation structure of the data. The talk will demonstrate the merits of biplots and discuss the development of a new version of biplot for Stata 8.2.

Tabulation of multiple responses

Ben Jann
Soziologie, ETH Zürich

Abstract

Although multiple response questions are quite common in survey research, Stata's official release does not provide much possibility for an effective analysis of multiple response variables. For example, in a study on drug addiction an interview question might be, "Which substances did you consume during the last four weeks?" The respondents just list all the drugs they took if any, e.g., an answer could be "cannabis, cocaine, heroin" or "ecstasy, cannabis" or "none", etc. Usually, the responses to such questions are held as a set of variables and, therefore, cannot be easily tabulated. I will address this issue and present a new module to compute one- and two-way tables of multiple responses. The module supports several types of data structure, provides significance tests, and offers various options to control the computation and display of the results.
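To make the data structure concrete, the sketch below stores four hypothetical respondents' answers as one indicator variable per substance and tabulates them with official commands only. The variable names and values are invented, and this is not the new module described in the abstract.

```stata
* One 0/1 indicator per substance; respondent 3 answered "none"
clear
input id cannabis cocaine heroin ecstasy
1 1 1 1 0
2 1 0 0 1
3 0 0 0 0
4 1 0 0 0
end

* Number and share of respondents naming each substance
tabstat cannabis cocaine heroin ecstasy, statistics(sum mean) columns(statistics)

* Distribution of the number of substances named per respondent
egen nsubst = rowtotal(cannabis cocaine heroin ecstasy)
tabulate nsubst
```

This works for a one-way summary, but cross-tabulating the whole response set against another variable, with percentages based on respondents rather than responses, takes considerably more effort with official commands alone.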

Controlling for time-dependent confounding using marginal structural models

Zoe Fewell
Department of Social Medicine, University of Bristol

M. A. Hernán, Harvard School of Public Health

F. Wolfe, National Data Bank for Rheumatic Diseases, USA

K. Tilling, Department of Social Medicine, University of Bristol

H. Choi, Harvard Medical School

J. A. C. Sterne, Department of Social Medicine, University of Bristol

Abstract

Longitudinal studies in which exposures, confounders, and outcomes are measured repeatedly over time have the potential to allow causal inferences about the effects of exposure on outcome. There is particular interest in estimating the causal effects of medical treatments (or other interventions) in circumstances in which a randomized controlled trial is difficult or impossible. However, standard methods for estimating exposure effects in longitudinal studies are biased in the presence of time-dependent confounders affected by prior treatment.

This talk describes the use of marginal structural models (described by Robins et al.) to estimate exposure or treatment effects in the presence of time-dependent confounders affected by prior treatment. The method is based on deriving inverse-probability-of-treatment weights, which are then used in a pooled logistic regression model to estimate the causal effect of treatment on outcome. We demonstrate the use of marginal structural models to estimate the effect of methotrexate on mortality in persons suffering from rheumatoid arthritis.
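The weighting idea can be illustrated in a deliberately simplified setting with a single time point. The data are simulated and all names and coefficients are assumptions. In the longitudinal case described above, the weights would instead be built as cumulative products over follow-up intervals.

```stata
* Simulated point-treatment data: 'severity' confounds treatment and outcome
clear
set seed 2004
set obs 5000
generate severity = rnormal()
generate byte treat = runiform() < invlogit(0.8*severity)
generate byte died  = runiform() < invlogit(-1 - 0.5*treat + 0.7*severity)

* Denominator: probability of treatment given the confounder
quietly logit treat severity
predict double pden, pr
* Numerator: marginal probability of treatment (gives stabilized weights)
quietly logit treat
predict double pnum, pr

generate double sw = cond(treat, pnum/pden, (1 - pnum)/(1 - pden))

* Weighted logistic model for the marginal effect of treatment
logit died treat [pweight = sw]
```

Using pweights makes Stata report robust standard errors. With repeated records per person, clustering on the person identifier would also be needed.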

Meta-analysis in Stata: history, progress, and prospects

Jonathan Sterne
Department of Social Medicine, University of Bristol

Abstract

Systematic reviews of randomized trials are now widely recognized to be the best way to summarize the evidence on the effects of medical interventions. A systematic review may (though it need not) contain a meta-analysis, 'a statistical analysis that combines the results of several independent studies considered by the analyst to be "combinable"'. The first researcher to do a meta-analysis was probably Karl Pearson, in 1904. Sadly, Stata was not available at this time. The first Stata command for meta-analysis, the meta command, was published in the Stata Technical Bulletin in 1997 and exploited a facility, introduced in Stata version 5, to program graphics. It requires the user to derive an estimate of the effect of intervention, together with its standard error, for each study. The metan command, published in 1998, does analyses based on the 2×2 table for each study and provides more detailed graphical displays. Facilities for cumulative meta-analysis and meta-regression and tools for examining bias in meta-analysis have since been introduced.
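The calculation that takes an effect estimate and standard error per study as input is, in its simplest fixed-effect form, an inverse-variance weighted average. The sketch below does that arithmetic by hand on four invented log odds ratios. It does not call meta or metan.

```stata
* Hypothetical log odds ratios and standard errors from four trials
clear
input study logor se
1 -0.40 0.25
2 -0.10 0.30
3 -0.35 0.20
4  0.05 0.40
end

generate w  = 1/se^2                  // inverse-variance weights
generate wy = w*logor
quietly summarize wy
scalar num = r(sum)
quietly summarize w
scalar den = r(sum)

scalar pooled = num/den               // fixed-effect pooled log odds ratio
scalar pse    = sqrt(1/den)           // its standard error
display "pooled OR = " exp(pooled) "   95% CI " exp(pooled - 1.96*pse) " to " exp(pooled + 1.96*pse)
```

A random-effects analysis would add an estimate of the between-study variance to each study's variance before forming the weights.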

It is perhaps surprising that Stata commands for meta-analysis are still entirely user-written. This means that the existing commands that produce graphics (a major advantage of the Stata commands compared with those available in other statistical packages) are outdated since the introduction of Stata 8 graphics. Possible ways forward will be discussed, and the talk will conclude with a discussion of developments in meta-analysis that could usefully be addressed by future Stata commands.

Compliance-adjusted intervention effects in survival data

Lois G. Kim
MRC Biostatistics Unit, Cambridge

Ian R. White, MRC Biostatistics Unit, Cambridge

Abstract

Time-to-event endpoints are a common outcome of interest in randomized clinical trials. The primary analysis should usually be by intention-to-treat, giving an indication of the effectiveness of the intervention in a population as a whole. However, the benefit specifically for an individual receiving the intervention is becoming increasingly important as patient decisions become more evidence-based.

Effectiveness is defined as the benefit of intervention as actually applied, and may be estimated from simple all-or-nothing compliance data. Efficacy, on the other hand, is the benefit of intervention under ideal circumstances, and requires more complex compliance data. Intervention effectiveness and efficacy after accounting for non-compliance can be estimated in various ways, some of which have already been implemented in Stata (e.g., strbee).

Recently, Loeys and Goetghebeur (2003) provided new methodology using proportional-hazards techniques in survival data where compliance is all-or-nothing in the intervention arm and perfect in the control arm. Here, their method is implemented in Stata. The output is a hazard ratio for the effectiveness of intervention, adjusted for observed adherence to intervention in the treated group. An example application is discussed for a subset of a large, randomized trial of screening where the average benefit of 26% risk reduction becomes a 34% risk reduction for individuals attending screening.

Reference

Loeys, T. and E. Goetghebeur. 2003. A causal proportional hazards estimator for the effect of treatment actually received in a randomized trial with all-or-nothing compliance. Biometrics 59: 100–105.

From datasets to resultssets in Stata

Roger Newson
Department of Public Health Sciences, King's College, London

Abstract

A resultsset is a Stata dataset created as output by a Stata program. It can be used as input to other Stata programs, which may in turn output the results as publication-ready plots or tables. Programs that create resultssets include xcontract, xcollapse, parmest, parmby, and descsave. Stata resultssets do a similar job to SAS output datasets, which are saved to disk files. However, in Stata, the user typically has the options of saving a resultsset to a disk file, writing it to memory (overwriting any pre-existing dataset), or simply listing it. Resultssets are often saved to temporary files, using the tempfile command. This lecture introduces programs that create resultssets, and also programs that do things with resultssets after they have been created. listtex outputs resultssets to tables that can be inserted into a Microsoft Word, HTML, or TeX document. eclplot inputs resultssets and creates confidence interval plots. Other programs, such as sencode and tostring, process resultssets after they are created and before they are listed, tabulated, or plotted. These programs, used together, have a power not always appreciated if the user simply reads the online help for each package.
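The workflow can be mimicked with official commands alone. In the sketch below, statsby stands in for the user-written programs named above (an assumption made so that the example runs without installing anything), and the auto dataset and regression are arbitrary. The syntax is that of recent releases.

```stata
sysuse auto, clear
tempfile res

* One row of coefficients and standard errors per group, saved to a temporary file
statsby _b _se, by(foreign) saving("`res'"): regress mpg weight

* The saved results are now an ordinary dataset that other commands can process
use "`res'", clear
list
```

Once the estimates sit in a dataset, they can be sorted, merged, formatted, or plotted like any other data, which is the point of the resultsset approach.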

Applying the Cox proportional hazards regression model to competing risks

Abdel G. A. Babiker
MRC Clinical Trials Unit, London

Mohamed M. Ali
RHR/WHO, Geneva

Abstract

In the presence of dependent competing risks in survival analysis, the Cox proportional hazards model can be utilized to examine covariate effects on the cause-specific hazard function for each type of failure. The method proposed by Lunn and McNeil (1995) requires data augmentation. With k failure types, the data would be duplicated k times, one record for each failure type. Either a stratified or an unstratified analysis could be used, depending on whether the assumption of proportional hazards holds. If the proportional hazards assumption does not hold across the causes, the stratified analysis should be used, which is equivalent to fitting a separate model for each failure type. The unstratified analysis assumes a constant hazard ratio between failure types and this could be fitted by including an indicator variable as a covariate.

We will show how both approaches could be fitted on augmented data using stcox. In addition to the parameter estimates and their standard errors, the program has an option to produce cumulative incidence functions with pointwise confidence limits.
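The augmentation step can be sketched on simulated data. Everything in the block below (hazard rates, covariate effects, variable names) is an assumption made for illustration, and the code uses stcox directly rather than the program described in the abstract.

```stata
* Simulated competing-risks data: two latent failure times, censoring at t = 10
clear
set seed 1995
set obs 1000
generate id = _n
generate x  = rnormal()
generate t1 = -ln(runiform())/(0.10*exp(0.5*x))     // latent time to failure type 1
generate t2 = -ln(runiform())/(0.05*exp(-0.3*x))    // latent time to failure type 2
generate time = min(t1, t2, 10)
generate byte cause = cond(time >= 10, 0, cond(t1 < t2, 1, 2))

* Augmentation: one record per subject per failure type (k = 2)
expand 2
bysort id: generate byte type = _n
generate byte fail  = (cause == type)     // event indicator for this record's type
generate byte type2 = (type == 2)
generate x_type2    = x*type2             // lets the effect of x differ by type

stset time, failure(fail)

* Stratified analysis: a separate baseline hazard for each failure type
stcox x x_type2, strata(type)

* Unstratified analysis: baseline hazards proportional, type as a covariate
stcox type2 x x_type2
```

In the stratified fit, the coefficient on x is the log hazard ratio for failure type 1, and the sum of the coefficients on x and x_type2 is the log hazard ratio for failure type 2.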

Reference

Lunn, M. and D. McNeil. 1995. Applying Cox regression to competing risks. Biometrics 51: 524–532.

Genome-wide linkage scans and basic bioinformatics implemented using Stata/SE

Toby Andrew
Twin & Genetic Epidemiology Research Unit, Department of Medicine, St Thomas' Hospital

Abstract

Searches for genes using linkage analyses with genetic markers placed across the entire human genome are hypothesis-free experiments, which represent an extreme form of multiple testing. As such, the low p-values required to obtain nominal significance make accurate diagnostics essential to assess model fit and to eliminate naive incorrect results. In hypothesis-driven single tests, researchers usually take good care to assess model fit and the validity of model assumptions, but such concerns are usually ignored when it comes to linkage analysis. This is particularly problematic where low thresholds (p < 0.0001) can result in extreme sensitivity to outlying observations and for some models (e.g., standard variance component analysis), greater sensitivity to violation of model assumptions.

Here, we attempt to address these problems for genomic data based on 1,300 healthy sib-pairs (dizygotic twins) using modified Haseman–Elston regression-based linkage analysis for quantitative traits, in which sib-pair phenotypic covariance is correlated with genetic marker covariance. The statistical theory underpinning the implementation of tests for linkage using generalized linear models (GLM) (glm in Stata) is documented in detail elsewhere. In brief, the advantage of analyzing sib-pairs using GLM is that the approach shares all of the strengths of OLS and variance components, but none of their weaknesses. Specifically, (1) unlike OLS, the residual errors are correctly specified with a gamma distribution and known heteroskedasticity is accounted for; (2) unlike standard variance components, by freely estimating the coefficient of variation, GLM is robust to phenotypic deviations from multivariate normality.

Just as important are the practical advantages. With the release of Stata 8/SE for large datasets, we have been able to store and check genetic markers for all 22 pairs of autosomal chromosomes plus sex chromosomes. In addition, we have generated 2-point and multipoint allele-sharing identical by descent (IBD) elsewhere and imported this into Stata. Using Stata scripts with a simple loop structure that calls on the glm command, we are able to perform genome-wide scans and save any summary statistics to file. We have been able to utilize the following features in Stata:

1. correct diagnostics on a genome-wide basis that are not normally made available to users of applied linkage packages
2. robust estimates of significance, such as Huber sandwich estimates, bootstrap routines, permutation tests, etc.
3. probability weighting to utilize the full probability distribution of the number of alleles shared IBD
4. computationally fast and easy to implement
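The loop structure mentioned above might look something like the following sketch. The data are simulated, the variable names (ibd1 to ibd5, sqdiff) are hypothetical, and the log link is chosen only to keep the toy example numerically well behaved. The model actually used by the authors is specified in Barber et al. (2004).

```stata
* Toy data: 1,300 sib-pairs, 5 markers, squared trait difference as the response
clear
set seed 26
set obs 1300
forvalues j = 1/5 {
    generate ibd`j' = rbinomial(2, 0.5)/2         // proportion of alleles shared IBD
}
generate sqdiff = rgamma(1, 2)*exp(-0.3*ibd3)     // marker 3 is "linked" by construction

* Scan: fit one gamma GLM per marker and post the results to a file
tempname memhold
tempfile scan
postfile `memhold' str8 marker double b double se using "`scan'"
foreach m of varlist ibd1-ibd5 {
    quietly glm sqdiff `m', family(gamma) link(log)
    post `memhold' ("`m'") (_b[`m']) (_se[`m'])
}
postclose `memhold'

use "`scan'", clear
generate z = b/se
list
```

Adding options such as vce(robust) to the glm call, or wrapping it in a permutation or bootstrap prefix, would give the robust significance estimates listed above.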

Finally, we also can perform basic, but powerful bioinformatics tasks such as:

1. using the xpose command to summarize marker information by chromosome and sib-pair
2. resolving marker order more accurately, which is essential for correct multipoint IBD generation, by interpolating genetic distance using the latest physical and genetic marker maps

Reference

Barber, M. J., H. J. Cordell, A. J. MacGregor, and T. Andrew. 2004. Gamma regression improves Haseman-Elston and variance components linkage analysis for sib-pairs. Genetic Epidemiology 26(2): 97–107.

Evaluation of diagnostic tests for diseases in pregnancy: some statistical issues

Paul T. Seed
Dept of Obstetrics & Gynaecology, GKT School of Medicine, King's College, London

Abstract

A diagnostic test is used typically because it is cheaper, quicker, or less invasive than the reference standard but may not be as reliable. Diagnostic tests are evaluated against a reference standard (sometimes called "Gold Standard"), regarded as completely accurate.

Commands diagt and diagti have been developed to evaluate binary tests and provide all the standard measures of performance (including sensitivity, specificity, likelihood ratios, and predictive values, with appropriate confidence intervals). A prevalence option adjusts for different case-mix, and evaluates the test result for a particular patient with known pre-test risk.

The use of ROC curves for ordered categorical and continuous data will be considered, in particular the determining of a suitable cutoff value.

Where the distribution of a continuous measure can be adequately modeled, the likelihood ratio can be used to determine the absolute risk of an individual patient.
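The step from a likelihood ratio to an individual patient's risk is a short calculation on the odds scale. The counts and the pre-test risk below are invented for illustration, and the code uses scalars only rather than diagt or diagti.

```stata
* Hypothetical 2x2 table of test result against the reference standard
scalar tp = 45        // test positive, disease present
scalar fn = 5         // test negative, disease present
scalar fp = 30        // test positive, disease absent
scalar tn = 120       // test negative, disease absent

scalar sens  = tp/(tp + fn)
scalar spec  = tn/(tn + fp)
scalar lrpos = sens/(1 - spec)        // likelihood ratio of a positive result
scalar lrneg = (1 - sens)/spec        // likelihood ratio of a negative result

* Post-test risk for a patient with a 10% pre-test risk and a positive test
scalar pre      = 0.10
scalar postodds = pre/(1 - pre)*lrpos
display "sensitivity = " sens "   specificity = " spec "   LR+ = " lrpos
display "post-test risk = " postodds/(1 + postodds)
```

With these numbers the sensitivity is 0.9, the specificity is 0.8, and the positive likelihood ratio is 4.5, so a positive result raises this patient's risk from 10% to about one in three.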

Appropriate Stata commands for these analyses will be demonstrated.

Report to users

William W. Gould
StataCorp, College Station, TX

Abstract

Bill Gould, who is President of StataCorp, and, more importantly for this meeting, the head of development, will ruminate about work at Stata over the last year and about ongoing activity.