A fleet of packages for inputting United Kingdom primary care data
Abstract: The Clinical Practice Research Datalink (CPRD) is a centrally-managed data warehouse, storing data provided by the primary care sector of the United Kingdom (UK) National Health Service (NHS). Medical researchers request retrievals from this database, which take the form of a collection of text datasets whose format can be complicated. I have written a flagship package cprdutil with multiple modules to input into Stata the many text dataset types provided in a CPRD retrieval. These text datasets may be converted either to Stata value labels or to Stata datasets, which can be created complete with value labels, variable labels, and numeric Stata dates. I have also written a fleet of satellite packages to input into Stata the text datasets for retrievals of linked data, in which data are provided from non-CPRD sources, with CPRD identifier variables as a foreign key to allow data linkage. I introduce the modules of cprdutil and give a demonstration example in which I produce a minimal CPRD database in Stata, using cprdutil, and in which I illustrate some principles of sensible programming practice for creating large databases.
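
As a rough illustration of the date handling mentioned above: Stata stores a daily date as the number of days since 1 January 1960. The sketch below is not code from cprdutil. It assumes, purely for illustration, that a text dataset holds dates as dd/mm/yyyy strings, and the helper name to_stata_date is hypothetical.

```python
from datetime import date

# Stata daily dates count days elapsed since 01jan1960.
STATA_EPOCH = date(1960, 1, 1)


def to_stata_date(text, missing=None):
    """Convert a 'dd/mm/yyyy' string to a Stata daily date (an integer).

    The dd/mm/yyyy layout is an assumption made for this sketch.
    Empty strings are mapped to `missing`.
    """
    text = text.strip()
    if not text:
        return missing
    day, month, year = (int(part) for part in text.split("/"))
    return (date(year, month, day) - STATA_EPOCH).days


if __name__ == "__main__":
    for raw in ["01/01/1960", "02/01/1960", "01/01/1961", ""]:
        print(repr(raw), "->", to_stata_date(raw))
```

Storing the result as an integer and attaching a display format afterwards keeps the date usable in arithmetic, which is the usual reason for preferring numeric dates to strings.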

Roger B. Newson
Imperial College London
Multiarm, multistage randomised controlled trials with stopping boundaries for efficacy and lack of benefit: An update to nstage
Abstract: Multiarm multistage (MAMS) adaptive clinical trials offer several practical advantages over traditional two-arm designs. The framework proposed by Royston et al. (2011) uses intermediate outcomes at interim analyses to drop research arms demonstrating insufficient benefit prior to the final analysis on the primary outcome. To our knowledge, the nstage command developed for Stata (Barthel, Royston, and Parmar, 2009) is the only sample size software for MAMS trials with time-to-event outcomes, a common outcome measure in modern trials in cancer, cardiovascular disease, and other disease areas. We present an update to nstage to increase the efficiency and uptake of MAMS designs.

nstage can accommodate efficacy-stopping boundaries at interim analyses with a new option. Users choose a stopping rule, and the program estimates the operating characteristics for a design that can assess for early evidence of overwhelming efficacy on the primary outcome when interim analyses for lack of benefit occur on an intermediate outcome. The user specifies whether the trial is expected to terminate or continue with the remaining arms should an efficacious research arm be identified before the final analysis of the trial. Because the probability of a type I error is increased through such a design, the updated command offers an option to search for a design that strongly controls the maximum familywise error rate at the desired level, if it is required.

The command estimates the operating characteristics of the chosen design within a reasonable timeframe, allowing users to compare trial designs for different input parameters easily. We illustrate how the updates can be used to design a trial with the dropdown menu, using the MAMS trial STAMPEDE as an example. We hope the new functionality of the command will serve a broader range of trial objectives and thus increase adoption of the design in practice.

Babak Choodari-Oskooei
MRC Clinical Trials Unit at UCL, London
Alexandra Blenkinsop
MRC Clinical Trials Unit at UCL, London
Spaghetti, paella, and alternatives: Graphics for multiple series and groups
Abstract: Spaghetti plots show many tangled lines (say, for multiple time series or other functional traces) that are hard to distinguish and interpret. Paella plots show multiple point patterns for many groups, sufficiently mixed up so that comparisons are made difficult. The talk surveys several tactics and strategies for better, friendlier comparisons. Devices range from showing data several times over to selection, smoothing, and transformation.

Nicholas J. Cox
Durham University
Customizing Stata graphs made easy
Abstract: The overall look of Stata's graphs is determined by so-called scheme files. Scheme files are system components, that is, they are part of the local Stata installation. In this talk, I will argue that style settings deviating from default schemes should be part of the script producing the graphs rather than being kept in separate scheme files, and I will present software that supports such practice. In particular, I will present a command called grstyle that allows users to quickly change the overall look of graphs without having to fiddle around with external scheme files. I will also present a command called colorpalette that provides a wide variety of colour schemes for use in Stata graphics.

Ben Jann
University of Bern
Implementing machine learning methods in Stata
Abstract: In this presentation, I will discuss some popular supervised and unsupervised machine learning algorithms, and their recommended uses, and then I will present implementations in Stata. The emphasis is on prediction and causal inference, and how to tailor a method to a specific application.

Austin Nichols
Abt Associates
Nonlinear mixed-effects models
Abstract: Stata 15 introduced the new estimation command menl for fitting nonlinear mixed-effects models, also known as nonlinear multilevel models and nonlinear hierarchical models. These models can be thought of in two ways: as nonlinear models containing random effects or as linear mixed-effects models in which some or all fixed and random effects enter nonlinearly. The overall error distribution is assumed to be Gaussian. Nonlinear mixed-effects models have been used to model drug absorption in the body, intensity of earthquakes, and growth of plants, to name a few.

In my presentation, I will demonstrate how to use the menl command to fit nonlinear mixed-effects models in a variety of applications, including population pharmacokinetics and macroeconomics.

Yulia Marchenko
Analysing time-to-event data in the presence of competing risks within the flexible parametric modeling framework. What tools are available in Stata, which one to use, and when?
Abstract: In a typical survival analysis, researchers study the time to an event of interest. For example, in cancer studies, researchers often wish to analyse a patient's time to death since diagnosis. Similar applications also exist in economics and engineering. In such analyses, the event of interest is often not broken down by cause. Although this may sometimes be useful, in many situations it does not paint the entire picture and restricts the analysis. More commonly, the event may occur because of different causes, which better reflects real-world scenarios. For instance, if the event of interest is death due to cancer, it is also possible for the patient to die because of other causes. This means that the time at which the patient would have died because of cancer is never observed. These are known as competing causes of death or competing risks. In a competing-risks analysis, interest lies in the cause-specific cumulative incidence function (CIF). This can be calculated by either (1) transforming (all of) the cause-specific hazards, or (2) using a direct relationship with the subdistribution hazards.

Obtaining cause-specific CIFs within the flexible parametric modeling framework by adopting approach (1) is possible by using the stpm2 postestimation command, stpm2cif. Alternatively, because competing risks is a special case of a multistate model, an equivalent model can be fit using the multistate package. To estimate cause-specific CIFs using approach (2), one can use stpm2 by applying time-dependent censoring weights that are calculated on restructured data using stcrprep.

The above methods involve some form of data augmentation. Instead, estimation on individual-level data may be preferred because of computational advantages. This is possible using either approach, (1) or (2), with stpm2cr.

In this talk, I provide an overview of these various tools and discuss which of these to use and when.
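
Approach (1) can be sketched numerically. The CIF for cause k is the integral, from 0 to t, of the all-cause survival function multiplied by the cause-k hazard, where all-cause survival is the exponential of minus the sum of all cumulative cause-specific hazards. The code below is a minimal sketch of that relationship only. It is not how stpm2cif or stpm2cr are implemented, and the function name and the constant hazards in the usage example are hypothetical.

```python
import math


def cif_from_cause_specific_hazards(hazards, t_max, steps=10000):
    """Approximate each cause-specific CIF at t_max by midpoint integration.

    hazards: list of functions h_k(t), one per competing cause.
    Returns a list with CIF_k(t_max) for every cause k.
    """
    dt = t_max / steps
    cum_hazard = 0.0  # all-cause cumulative hazard up to the start of the step
    cif = [0.0] * len(hazards)
    for i in range(steps):
        u = (i + 0.5) * dt  # midpoint of the current step
        rates = [h(u) for h in hazards]
        total_rate = sum(rates)
        # All-cause survival at the midpoint of the step.
        surv = math.exp(-(cum_hazard + 0.5 * dt * total_rate))
        for k, rate in enumerate(rates):
            cif[k] += surv * rate * dt
        cum_hazard += total_rate * dt
    return cif


if __name__ == "__main__":
    # Hypothetical constant hazards: cancer death and other-cause death.
    hazards = [lambda t: 0.10, lambda t: 0.05]
    cancer, other = cif_from_cause_specific_hazards(hazards, t_max=5.0)
    print(f"CIF cancer: {cancer:.4f}, CIF other causes: {other:.4f}")
```

Because every cause-specific hazard enters the all-cause survival term, the CIF for one cause cannot be obtained from that cause's hazard alone, which is why approach (1) needs models for all causes.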

Sarwar Islam Mozumder
Biostatistics Research Group, University of Leicester
Making help files the easy way
Abstract: The command makehlp was released in July 2012, and it simplifies the construction of a help file by generating an SMCL help template. The command opens up the ado-file and produces a template help file from the syntax line. In the past, the user would need to edit this template and fill in the details such as the description, title, examples, etc. The new version of makehlp keeps the old functionality but also checks for the return codes to automatically produce a list of stored outputs. In addition, I introduce a new syntax so that all the necessary text can be included in the ado-file, for the various sections such as: description, title, examples, author, references, see also, and all the options and returns descriptions. An example of the new syntax is desc[], which will place all the text between the brackets into the help file description and will be formatted as it is written, so SMCL commands are allowed. This means that the ado-file can store the majority of the help file, and the help file can subsequently be created using this ado-file.

Adrian Mander
MRC Biostatistics Unit, University of Cambridge
admetan: A new, comprehensive meta-analysis command
Abstract: Meta-analysis (MA) is a statistical technique for combining results from multiple independent studies, with the aim of estimating a single overall effect with a size, direction, and precision consistent with the data. Traditionally, MA is performed on aggregated data (AD), where each observation represents the effect observed in a study, often derived from study publications. The community-contributed command metan (Harris et al., 2008) is by far the most popular Stata command for performing AD MA, but it was last updated in 2010 and has various flaws and limitations.

The alternative to AD MA is to obtain and analyse individual participant data (IPD), where the totality of data from all studies is stacked to form a single large dataset. I have previously described (Fisher 2015) a community-contributed command, ipdmetan, that facilitates so-called "two-stage" IPD MA. The two stages are fitting a given model to the data from each study in turn and combining the results using AD techniques. The second stage, performed using the AD command admetan, has now been expanded into a fully comprehensive AD MA command, with all the functionality of metan and much more besides. The co-author and maintainer of metan, Ross Harris, has confirmed to me that he is no longer in a position to maintain it and is happy for admetan to take its place.

Another important aspect of ipdmetan (and hence also admetan) is its forest plot capabilities. Not only is the forest plot engine much more efficient and capable of better plots "out of the box" when compared with metan; it also allows the user to save and edit "forestplot results sets", which are interpreted directly by the standalone command forestplot to produce fully flexible plots.

I will take you on a quick tour of admetan and forestplot and hope to encourage you (and your colleagues and collaborators!) to use them in preference to metan.
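
The second stage of a two-stage analysis can be illustrated with one standard AD technique, fixed-effect inverse-variance pooling: each study estimate is weighted by the reciprocal of its squared standard error. This is a minimal sketch of that one technique, not a description of what admetan does internally. The function name and the study estimates in the usage example are made up for illustration.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooling of per-study estimates.

    estimates:  effect estimates from stage one, one per study.
    std_errors: their standard errors, in the same order.
    Returns (pooled estimate, pooled standard error).
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


if __name__ == "__main__":
    # Hypothetical stage-one results (for example, log hazard ratios).
    log_hr = [-0.20, -0.35, -0.10]
    se = [0.10, 0.20, 0.15]
    pooled, pooled_se = pool_fixed_effect(log_hr, se)
    low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
    print(f"pooled = {pooled:.3f} (95% CI {low:.3f} to {high:.3f})")
```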


Fisher, D.J. 2015. Two-stage individual participant data meta-analysis and generalized forest plots. Stata Journal 15: 369–396.

Harris, R.J., J.D. Deeks, D.G. Altman, M.J. Bradburn, R.M. Harbord, and J.A.C. Sterne. 2008. metan: fixed- and random-effects meta-analysis. Stata Journal 8: 3–28.

David Fisher
MRC Clinical Trials Unit at UCL
Standardized survival curves and related measures from flexible parametric survival models
Abstract: In observational studies with time-to-event outcomes, we expect that there will be confounding and would usually adjust for these confounders in a survival model. From such models, an adjusted hazard ratio comparing exposed and unexposed subjects is often reported. This is fine, but hazard ratios can be difficult to interpret and are not collapsible. There are further problems when trying to interpret hazard ratios as causal effects. Risks are much easier to interpret than rates, so quantifying the difference on the survival scale can be desirable.

In Stata, stcurve gives survival curves after fitting a model where certain covariates can be given specific values, but those not specified are given mean values. Thus, it gives a prediction for an individual who happens to have the mean values of each covariate and may not reflect the average in the population. An alternative is to use standardization to estimate marginal effects, where the regression model is used to predict the survival curve for unexposed and exposed subjects at all combinations of other covariates included in the model. These predictions are then averaged to give marginal effects.

I will describe a command, stpm2_standsurv, that obtains various standardized measures after fitting a flexible parametric survival model. The command can estimate standardized survival curves, the marginal hazard function, the standardized restricted mean survival time, and centiles of the standardized survival curve. Contrasts can be made between any of these measures (differences, ratios). A user-defined function can be given for more complex contrasts.
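
The difference between prediction at the mean of the covariates and standardization can be shown with a toy model. The sketch below uses an exponential proportional-hazards model with made-up coefficients in place of a fitted flexible parametric model, so it illustrates the averaging step only and is not the stpm2_standsurv implementation. All function names and parameter values are hypothetical.

```python
import math


def survival(t, exposed, z, base_rate=0.1, log_hr_exposure=0.7, log_hr_z=0.5):
    """Survival at time t under a hypothetical exponential PH model."""
    rate = base_rate * math.exp(log_hr_exposure * exposed + log_hr_z * z)
    return math.exp(-rate * t)


def standardized_survival(t, exposed, z_values):
    """Set everyone's exposure to `exposed`, predict, then average."""
    return sum(survival(t, exposed, z) for z in z_values) / len(z_values)


def survival_at_mean(t, exposed, z_values):
    """Prediction for one individual whose covariate equals the sample mean."""
    z_bar = sum(z_values) / len(z_values)
    return survival(t, exposed, z_bar)


if __name__ == "__main__":
    z_values = [0, 0, 1, 4]  # hypothetical confounder values in the sample
    t = 5.0
    s0 = standardized_survival(t, 0, z_values)
    s1 = standardized_survival(t, 1, z_values)
    print(f"standardized survival, unexposed: {s0:.4f}")
    print(f"standardized survival, exposed:   {s1:.4f}")
    print(f"standardized difference:          {s1 - s0:.4f}")
    print(f"at-mean survival, unexposed:      {survival_at_mean(t, 0, z_values):.4f}")
```

Because survival is a nonlinear function of the covariates, the average of the predictions generally differs from the prediction at the average, which is the point made in the abstract.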

Paul C. Lambert
Biostatistics Research Group, University of Leicester
Karolinska Institutet
Mata and The Mata Book: What you want to know and why you should care
Abstract: Stepping back from Mata, and even stepping a little back from the book, I use its publication as an excuse to describe Mata, its features, and what programming in Mata can achieve.

William Gould
SJ Editors' prize presentation
Abstract: The 2017 Stata Journal Editors' Prize will be presented symbolically to Ben Jann.

Newton, H.J., and N.J. Cox. 2017. The Stata Journal Editors' Prize 2017: Ben Jann. Stata Journal 17: 781–785.

Implementing the Leybourne–Taylor test for seasonal unit roots in Stata
Abstract: We estimate response surface coefficients for a large range of quantiles of the Leybourne and Taylor (2003, Journal of Time Series Analysis 24: 441–460) test for the presence of seasonal unit roots. This test offers power gains over the familiar regression-based approach advocated by Hylleberg et al. (1990, Journal of Econometrics 44: 215–238), known as the HEGY approach. The HEGY approach is currently implemented in Stata via the command sroot, developed by Depalo (2009, Stata Journal 9: 422–438), and via the further extensions introduced by the command hegy by del Barrio Castro, Bodnar, and Sansó (2016, Stata Journal 16: 740–760). The main feature of the Leybourne and Taylor test is that it achieves power gains through the use of forward and reverse HEGY regressions. The estimated response surfaces allow for different combinations of number of observations T and lag order in the test regressions p, where the latter can be either specified by the user or endogenously determined by the underlying data. The critical values depend on the method used to select the number of lags. We introduce the new Stata command ltur and illustrate its use with an empirical example. The new command permits the computation of the Leybourne and Taylor test statistics along with their associated critical values and approximate probability values.

Jesús Otero
Universidad del Rosario
Kit Baum
Boston College
ardl: Estimating autoregressive distributed lag and equilibrium correction models
Abstract: Autoregressive distributed lag (ARDL) models are often used to analyse dynamic relationships with time-series data in a single-equation framework. The current value of the dependent variable is allowed to depend on its own past realisations—the autoregressive part—as well as current and past values of additional explanatory variables—the distributed lag part. The variables can be stationary, nonstationary, or a mixture of both. In its equilibrium correction (EC) representation, the ARDL model can be used to separate the long-run and short-run effects, and to test for cointegration or, more generally, for the existence of a long-run relationship among the variables of interest.

This talk serves as a tutorial for the ardl Stata command that can be used to fit an ARDL or EC model with the optimal number of lags based on the Akaike or Schwarz/Bayesian information criteria. I will address frequently asked questions and provide step-by-step instructions for the Pesaran, Shin, and Smith (2001, Journal of Applied Econometrics) bounds test for the existence of a long-run relationship. This test is implemented as the postestimation command estat ectest, which features newly computed finite-sample critical values and approximate p-values. These critical values cover many model configurations and supersede previous tabulations available in the literature. They account for the sample size, the chosen lag order, the number of explanatory variables, and the choice of unrestricted or restricted deterministic model components.

The ardl command uses Stata's regress command to fit the model. As a consequence, specification tests can be carried out with the standard postestimation commands for linear (time-series) regressions and the forecast command suite can be used to obtain dynamic forecasts.
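
The link between the ARDL and EC representations can be checked by hand for the simplest case. For an ARDL(1,1) model, y_t = c + phi*y_{t-1} + beta0*x_t + beta1*x_{t-1} + e_t, subtracting y_{t-1} from both sides gives the change in y as c - alpha*(y_{t-1} - theta*x_{t-1}) + beta0*(x_t - x_{t-1}) + e_t, with speed of adjustment alpha = 1 - phi and long-run coefficient theta = (beta0 + beta1)/(1 - phi). The sketch below verifies this algebra numerically. It is not the ardl command's code, the function names are hypothetical, and dating the long-run term at t-1 is a convention assumed here that may differ from the command's output.

```python
def ardl11_to_ec(phi, beta0, beta1):
    """Map ARDL(1,1) coefficients to (alpha, theta, short-run coefficient)."""
    alpha = 1.0 - phi  # speed of adjustment
    if abs(alpha) < 1e-12:
        raise ValueError("phi = 1: no long-run relationship is defined")
    theta = (beta0 + beta1) / alpha  # long-run effect of x on y
    return alpha, theta, beta0


def ardl11_step(c, phi, beta0, beta1, y_lag, x, x_lag):
    """One-step prediction of y_t from the ARDL(1,1) form (no error term)."""
    return c + phi * y_lag + beta0 * x + beta1 * x_lag


def ec_step(c, alpha, theta, short_run, y_lag, x, x_lag):
    """One-step prediction of y_t from the EC form (no error term)."""
    delta_y = c - alpha * (y_lag - theta * x_lag) + short_run * (x - x_lag)
    return y_lag + delta_y


if __name__ == "__main__":
    # Hypothetical coefficient values, for illustration only.
    c, phi, beta0, beta1 = 0.5, 0.6, 0.3, 0.1
    alpha, theta, short_run = ardl11_to_ec(phi, beta0, beta1)
    print(f"alpha = {alpha:.3f}, theta = {theta:.3f}, short run = {short_run:.3f}")
    print(ardl11_step(c, phi, beta0, beta1, y_lag=2.0, x=1.5, x_lag=1.0))
    print(ec_step(c, alpha, theta, short_run, y_lag=2.0, x=1.5, x_lag=1.0))
```

Both steps give the same prediction, which shows that the EC form is a reparameterization of the ARDL model rather than a different model.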

Sebastian Kripfganz
University of Exeter Business School
Daniel C. Schneider
Max Planck Institute for Demographic Research
multishell: Running simulations efficiently using Stata's shell command
Abstract: The package multishell is intended to speed up simulations by using multicore processors and Stata's shell command. In a first step, one or multiple do-files are converted into batch files and added to a queue. After starting the main command, the current instance of Stata acts as an organiser and works through the queue. It allocates the batch files to a preset number of parallel running Stata instances.

multishell has several distinct features. If do-files include forvalues and foreach loops, multishell dissects the loops and creates a new do-file for each combination, which is added to the queue. This allows for an efficient allocation and use of processor power. multishell can also be used to connect two or more computers to a cluster. multishell then allocates parts of the queue to each computer, and the simulation is run in parallel on multiple computers. Computational power is used efficiently and time is saved.
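
The idea of dissecting nested loops into independent tasks can be sketched in a few lines. The code below is only a conceptual illustration: it is not multishell's implementation, the function names and loop values are hypothetical, and the round-robin allocation is a simplifying assumption (an organiser working through a queue would hand out the next task whenever an instance becomes free).

```python
from itertools import product


def expand_loops(loops):
    """Turn nested loops into one task per combination of loop values.

    loops: dict mapping a loop-variable name to its list of values,
           ordered from the outermost to the innermost loop.
    """
    names = list(loops)
    return [dict(zip(names, combo))
            for combo in product(*(loops[name] for name in names))]


def allocate(tasks, n_workers):
    """Split the task queue across workers in round-robin order."""
    queues = [[] for _ in range(n_workers)]
    for i, task in enumerate(tasks):
        queues[i % n_workers].append(task)
    return queues


if __name__ == "__main__":
    # Hypothetical simulation design: sample sizes crossed with correlations.
    tasks = expand_loops({"n": [50, 100], "rho": [0.1, 0.5, 0.9]})
    for worker, queue in enumerate(allocate(tasks, n_workers=4)):
        print(f"worker {worker}: {queue}")
```

Splitting at the level of loop combinations gives many small tasks instead of a few large ones, which makes it easier to keep every processor busy.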

Jan Ditzen
Centre for Energy Economics Research and Policy, Heriot-Watt University
merlin: Mixed effects regression for linear and nonlinear models
Abstract: merlin can do a lot of things. From linear regression to a Weibull survival model, from a three-level logistic model to a multivariate joint model of multiple longitudinal outcomes, a recurrent event, and survival. merlin can do things I haven't even thought of yet. I'll take a single dataset, attempt to show you the full range of capabilities of merlin, and talk about some of the new features following its rise from the ashes of megenreg. There'll even be some surprises.

Michael J. Crowther
Biostatistics Research Group, University of Leicester
Latent class analysis
Abstract: Latent class analysis (LCA) allows us to identify and understand unobserved groups in our data. These groups may be consumers with different buying preferences, adolescents with different patterns of behaviour, or different health status classifications.

Stata 15 introduced new features for performing LCA. In this presentation, I will demonstrate how to use gsem with categorical latent variables to fit standard latent class models—models that identify unobserved groups based on a set of categorical outcomes. I will also show how we can extend the standard model to include additional equations and to identify groups using continuous, count, ordinal, and even survival-times outcomes. We will use the results of these models to determine who is likely to be in a group and how that group's characteristics differ from other groups.

Kristin MacDonald
Prediction, model selection, and causal inference with regularized regression
Abstract: The field of machine learning is attracting increasing attention among social scientists and economists. At the same time, Stata offers only a limited set of machine learning tools to date. This one-hour session introduces two Stata packages, lassopack and pdslasso, which implement regularized regression methods, including but not limited to the lasso (Tibshirani, 1996, Journal of the Royal Statistical Society Series B), for Stata. The packages include features intended for prediction, model selection, and causal inference, and are thus applicable in many settings. The commands allow for high-dimensional models, where the number of regressors may be large or even exceed the number of observations under the assumption of sparsity.

The package lassopack implements lasso, square-root lasso (Belloni et al., 2011, Biometrika; 2014, Annals of Statistics), elastic net (Zou and Hastie, 2005, Journal of the Royal Statistical Society Series B), ridge regression (Hoerl and Kennard, 1970, Technometrics), adaptive lasso (Zou, 2006, Journal of the American Statistical Association), and postestimation OLS. These methods rely on tuning parameters, which determine the degree and type of penalization. lassopack supports three approaches for selecting these tuning parameters: information criteria (implemented in lasso2), K-fold and h-step ahead rolling cross-validation (cvlasso), and theory-driven penalization (rlasso) due to Belloni et al. (2012, Econometrica). In addition, rlasso implements the Chernozhukov et al. (2013, Annals of Statistics) sup-score test of joint significance of the regressors.

The package pdslasso offers methods to facilitate causal inference in structural models. The package implements methods for selecting control variables (pdslasso), instruments (ivlasso), or both from a large set of variables in a setting where the researcher is interested in estimating the causal impact of one or more (possibly endogenous) causal variables of interest. pdslasso and ivlasso rely on the lasso and square-root-lasso estimator implemented in lassopack. ivlasso also supports weak-identification-robust hypothesis tests and confidence sets.
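
How a tuning parameter controls the degree of penalization can be seen in a small lasso example. The sketch below minimises (1/(2n)) times the residual sum of squares plus lambda times the sum of absolute coefficients by cyclic coordinate descent with soft-thresholding. It is not the lassopack implementation: the scaling of lambda is an assumption of this sketch and may differ from the packages' convention, there is no intercept or standardization, and the data are made up.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam, returning exactly zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Lasso by cyclic coordinate descent.

    Minimises (1/(2n)) * sum((y - X b)^2) + lam * sum(|b_j|).
    X is a list of rows, y a list of outcomes.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta while beta is all zeros
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            scale = sum(v * v for v in col) / n
            if scale == 0.0:
                continue  # a constant-zero column carries no information
            # Covariance of column j with the residual that excludes its own fit.
            rho = sum(v * r for v, r in zip(col, residual)) / n + scale * beta[j]
            new = soft_threshold(rho, lam) / scale
            delta = new - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for r, v in zip(residual, col)]
                beta[j] = new
    return beta


if __name__ == "__main__":
    # Made-up data with orthogonal columns; here y = 2*x1 + 1*x2 exactly.
    X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
    y = [3, 1, -1, -3]
    for lam in [0.0, 0.5, 1.5, 2.5]:
        print(f"lambda = {lam}: beta = {lasso_coordinate_descent(X, y, lam)}")
```

As lambda increases, coefficients are shrunk and then set exactly to zero, which is what makes the lasso usable for model selection as well as prediction.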

Achim Ahrens
Economic and Social Research Institute, Dublin
Christian B. Hansen
Booth School of Business, University of Chicago
Mark E. Schaffer
Heriot-Watt University, Edinburgh
Data-driven sensitivity analysis for matching estimators
Abstract: Matching is a popular estimator of the Average Treatment Effects (ATEs) within counterfactual observational studies. In recent years, however, many scholars have questioned the validity of this approach for causal inference because its reliability draws heavily upon the so-called selection-on-observables assumption.

When unobservable confounders are possibly at work, they say, it becomes hard to trust matching results, and the analyst should consider alternative methods suitable for tackling unobservable selection. Unfortunately, these alternatives require extra information that may be costly to obtain, or even not accessible.

For this reason, some scholars have proposed matching sensitivity tests for the possible presence of unobservable selection. The literature sets out two methods: the Rosenbaum (1987) and the Ichino, Mealli, and Nannicini (2008) tests. Both are implemented in Stata.

In this work, I propose a third and different sensitivity test for unobservable selection in matching estimation based on a "leave-covariates-out" (LCO) approach. Rooted in the machine learning literature, this sensitivity test recalls a bootstrap over different subsets of covariates and simulates various estimation scenarios to be compared with the baseline matching estimated by the analyst.

Finally, I will present sensimatch, the Stata routine I developed to run this method, and provide some instructional applications on real datasets.

Giovanni Cerulli
A sign-and-rank–based semiparametrically efficient estimator for regression analysis
Abstract: In regression analysis, it is well known that skewness and excessive tail heaviness affect the efficiency of classical estimators. In this work, we propose an estimator that is highly efficient for many distributions. More specifically, in accordance with standard Le Cam theory, we define a sign-and-rank–based estimator of the regression coefficients as a one-step update, based on a fully semiparametrically efficient central sequence, of an initial root-n-consistent estimator.

In the central sequence, the score function, initially defined on the basis of the exact underlying innovation density f, is estimated using the fact that f can be well adjusted by a Tukey g-and-h distribution. We present the results of some Monte Carlo simulations conducted to assess the finite sample performance of our estimator, compared with the ordinary least squares estimator and the approximated maximum-likelihood estimator. We propose a Stata command flexrank to implement it in practice. The procedure is very fast and has a low computational complexity.

Vincenzo Verardi
Université Libre de Bruxelles
Wishes and grumbles
Abstract: Stata developers present will carefully and cautiously consider wishes and grumbles from Stata users in the audience. Questions, and possibly answers, may concern reports of present bugs and limitations or requests for new features in future releases of the software.
StataCorp personnel

Scientific committee

Nicholas J. Cox
Durham University
Tim Morris
MRC Clinical Trials Unit at UCL
Patrick Royston
MRC Clinical Trials Unit at UCL

Logistics organizer

The logistics organizer for the 2018 London Stata Conference is Timberlake Consultants, the distributor of Stata in the UK and the Irish Republic.