The 30th UK Stata Conference will take place on 12–13 September 2024 at the London School of Economics Marshall Building.
This two-day international conference provides Stata users worldwide with the opportunity to exchange ideas, experiences, and information on new applications of the software. Experience what happens when new and longtime Stata users from across all disciplines gather to discuss real-world applications of Stata. Everyone interested in Stata is welcome. The UK Conference is the longest-running series of Stata conferences. The event attracts a global audience, and StataCorp will be represented.
All times are in BST (UTC+1).
9:30–10:00 | Registration |
10:00–10:10 | Welcome |
10:10–10:30 | Balance and variance inflation checks for completeness-propensity weights
Abstract:
Inverse treatment-propensity weights are standard methods for
adjusting for predictors of exposure to a treatment. Because a
treatment-propensity score is a balancing score, it makes sense
to do balance checks on the corresponding treatment-propensity
weights. It is also a good idea to do variance-inflation checks
to estimate how much the propensity weights might inflate the
variance of an estimated treatment effect, in the pessimistic
scenario in which the weights are not really necessary. In
Stata, the SSC package somersd can be used for balance
checks, and the SSC package haif can be used for
variance-inflation checks. It is argued that balance and
variance-inflation checks are also necessary in the case of
completeness-propensity weights, which are intended to remove
imbalance in predictors of completeness between the subsample
with complete data and the full sample of subjects with complete
or incomplete data. However, the usage of somersd,
scsomersd, and haif must be modified because we
are removing imbalance between the complete sample and the full
sample, instead of between the treated subsample and the
untreated subsample. An example will be presented from a
clinical trial in which the author was involved and in which
nearly a quarter of randomized subjects had no final outcome
data. A post hoc sensitivity analysis is presented using
inverse completeness-propensity weights.
Roger B. Newson
King's College London
|
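The reweighting idea in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the talk's method: the data are made up, the balance check is a plain comparison of means, and the variance-inflation measure is Kish's approximation, which is not necessarily the statistic that somersd or haif reports.

```python
# Illustrative sketch only: synthetic values, not data from the talk.

def completeness_weights(propensities, complete):
    """Inverse completeness-propensity weights for the complete cases."""
    return [1.0 / p for p, c in zip(propensities, complete) if c]

def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def kish_variance_inflation(weights):
    """Kish's approximation n * sum(w^2) / (sum w)^2; equals 1 for equal weights."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2

# Hypothetical full sample: a baseline covariate, a completeness flag,
# and an estimated probability of having complete data.
age        = [30, 40, 50, 60, 70, 80]
complete   = [1, 1, 1, 0, 1, 0]
propensity = [0.9, 0.9, 0.8, 0.8, 0.5, 0.5]

w = completeness_weights(propensity, complete)
age_complete = [a for a, c in zip(age, complete) if c]

full_mean = sum(age) / len(age)                          # target: the full sample
unweighted_mean = sum(age_complete) / len(age_complete)  # complete cases only
reweighted_mean = weighted_mean(age_complete, w)         # complete cases, weighted
inflation = kish_variance_inflation(w)
```

The balance check compares the weighted complete-case mean with the full-sample mean, rather than comparing treated with untreated subjects; the inflation factor indicates how much precision the weights could cost if they were not needed.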
10:30–10:50 | Using GitHub for collaborative analysis
Abstract:
Recent trends have led to an increased importance being placed
upon formal quality control processes for analysis conducted
within the pharmaceutical industry and beyond. While a key
feature of Stata is reproducibility through do-files and
automated reporting, there are limited built-in tools for
version control, code review, and collaborative analysis.
Git is a distributed version control system widely used by software development teams for collaborative programming, change tracking, and enforcement of best practices. Git keeps a record of all changes to a codebase over time, providing the ability to easily revert to a previous state, manage temporary branches, and combine code written by multiple people. Services such as GitHub build on the Git framework, providing tools to conduct code review, host source files, and manage projects. We present an overview of Git and GitHub and explain how we use them for Stata projects at Adelphi Real World, an organization specializing in the collection and analysis of real-world healthcare data from physicians, patients, and caregivers. We share an example project to outline the benefits of code review both for data integrity and as a training tool. We also discuss how, through implementing a software-development-like approach to the creation of ado-files, we can enhance the process of creating new programs in Stata and gain confidence in the robustness and quality of our commands.
Contributor:
Liane Gillespie-Akar
Adelphi Real World
Chloe Middleton-Dalby
Adelphi Real World
|
10:50–11:10 | My favorite overlooked lifesavers in Stata
Abstract:
Everyone loves a good testing, estimation, or graphical
community-contributed package. However, a successful empirical
project relies on many small and overlooked but priceless programs.
I will present three of my personal lifesavers.
1. adotools: adotools has four main uses. It allows the user to create and maintain a library of ado-paths. Paths can be dynamically added to and removed from a running Stata session. When removing an ado-path, all ado-programs located in the folder are cleared from memory. adotools can also reset all user specified ado-paths.
2. psimulate2: Ever wanted to run Monte Carlo simulations in parallel? You can with psimulate2 and there are (almost) no setup costs at all. psimulate2 splits the number of repetitions into equal chunks, spreads them over multiple instances of Stata, and reduces the time to run Monte Carlo simulations. It also allows macros to be returned and can save and append simulation results directly into a .dta file or frame. It can be run on Windows, Unix, and Mac.
3. xtgetpca: Extracting principal components in panel data is common. However, no Stata solution exists. xtgetpca fills this gap. It allows for different types of standardization, removal of fixed effects, and removal of unbalanced panels.
Jan Ditzen
Free University of Bozen
|
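The chunk-and-combine idea described for psimulate2 can be sketched as follows. This is a rough Python analogue, not the command's implementation: threads stand in for separate Stata instances, the toy simulation (the mean of 50 normal draws) is invented for illustration, and the handling of a remainder and of per-chunk seeds are assumptions of this sketch.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def split_repetitions(n_reps, n_workers):
    """Split n_reps into n_workers chunks whose sizes differ by at most one."""
    base, extra = divmod(n_reps, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]

def run_chunk(job):
    """Run one chunk of a toy Monte Carlo: the mean of 50 standard normal draws."""
    chunk_size, seed = job
    rng = random.Random(seed)  # each chunk gets its own seeded generator
    return [sum(rng.gauss(0, 1) for _ in range(50)) / 50 for _ in range(chunk_size)]

def parallel_simulate(n_reps, n_workers, base_seed=12345):
    """Spread the repetitions over workers and append the chunk results."""
    chunks = split_repetitions(n_reps, n_workers)
    jobs = [(size, base_seed + i) for i, size in enumerate(chunks)]
    # Threads are a stand-in for the multiple Stata instances in the abstract.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(run_chunk, jobs))
    return [draw for chunk in results for draw in chunk]

estimates = parallel_simulate(n_reps=1000, n_workers=4)
```

Giving each chunk its own seed keeps the combined results reproducible regardless of which worker finishes first; a production tool would also need to ensure that the random-number streams do not overlap.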
11:10–12:00 | Professional statistical development: What, why, and how
Abstract:
In this presentation, I will talk about professional statistical
software development in Stata and the challenges of producing
and supporting a statistical software package. I will share some
of my experience on how to produce high-quality software,
including verification, certification, and reproducibility of
the results, and on how to write efficient and stable Stata
code. I will also discuss some of the aspects of commercial
software development such as clear and comprehensive
documentation, consistent specifications, concise and
transparent output, extensive error checks, and more.
Yulia Marchenko
StataCorp LLC
|
12:00–1:00 | Lunch |
1:00–1:30 | Stata: A short history viewed through epidemiology
Abstract:
In this talk, I will use personal recollections to revisit the
challenges many public health researchers have faced since the
birth of Stata in 1985. I will discuss how, from the 1990s
onward, the increasing demands for data management and analysis
were met by Stata developers and the broader Stata community,
particularly Michael Hills. Additionally, I will review how
Stata's expansion in scope and capacity with each new version
has enhanced our ability to train new generations of medical
statisticians and epidemiologists. Finally, I will reflect on
current and future challenges.
Bianca de Stavola
University College London
|
1:30–1:50 | compmed: A new command for estimating causal mediation effects with nonadherence to treatment allocation
Abstract:
In clinical trials, a standard intention-to-treat analysis will
unbiasedly estimate the causal effect of treatment offer, though
this ignores the impact of participant nonadherence. To account for
this, one can estimate a complier-average causal effect (CACE),
the average causal effect of treatment receipt in the principal
strata of participants who would comply with their randomization
allocation. Evaluating how interventions lead to changes in the
outcome (the mechanism) is also key for the development of more
effective interventions. A mediation analysis aims to decompose
a total treatment effect into an indirect effect, which
operates via changing the mediator, and a direct effect. To
identify mediation effects with nonadherence, it has been shown
that the CACE can be decomposed into a direct effect, the
complier-average natural direct effect (CANDE), and a mediated
effect, the complier-average causal mediated effect (CACME).
These can be estimated with linear structural equation models
(SEMs) with instrumental variables.
However, obtaining estimates of the CACME and CANDE in Stata requires (1) correct fitting of the SEM in Stata and (2) correct identification of the pathways that correspond to the CACME and CANDE. To address these challenges, we introduce a new command, compmed, that allows users to perform the relevant SEM fitting for estimating the CACME and CANDE using a single more intuitive and user-friendly interface. compmed requires the user to specify only the continuous outcome, continuous mediator, treatment receipt, and randomization variables. Estimates, standard errors, and 95% confidence intervals are reported for all effects.
Contributors:
Sabine Landau, Richard Emsley
King's College London
Anca Chis Ster
King's College London
|
1:50–2:30 | Causal mediation
Abstract:
Causal inference aims to identify and quantify a causal effect.
With traditional causal inference methods, we can estimate the
overall effect of a treatment on an outcome. When we want to
better understand a causal effect, we can use causal mediation
analysis to decompose the effect into a direct effect of the
treatment on the outcome and an indirect effect through another
variable, the mediator. Causal mediation analysis can be
performed in many situations—the outcome and mediator
variables may be continuous, binary, or count, and the treatment
variable may be binary, multivalued, or continuous.
In this talk, I will introduce the framework for causal mediation analysis and demonstrate how to perform this analysis with the mediate command, which was introduced in Stata 18. Examples will include various combinations of outcome, mediator, and treatment types.
Kristin MacDonald
StataCorp LLC
|
2:30–2:50 | Imputation when data cannot be pooled
Abstract:
Distributed data networks are increasingly used to study human
health across different populations and countries. Analyses are
commonly performed at each study site to avoid the transfer of
individual data between study sites due to legal and logistical
barriers. Despite many benefits, however, a frequent challenge
in such networks is the absence of key variables of interest at
one or more study sites. Current imputation methods require the
availability of individual data from the involved studies to
impute missing data. This creates a need for methods that can
impute data in one study using only information that can be
easily and freely shared within a data network. To address this
need, we introduce a new Stata command, mi impute from,
designed to impute missing variables in a single study using a
linear predictor and the related variance/covariance matrix from
an imputation model fit from one or multiple external
studies. In this presentation, the syntax of mi impute
from will be presented along with motivating examples from
health-related research.
Contributors:
Robert Thiesmeier, Matteo Bottai
Karolinska Institutet
Nicola Orsini
Karolinska Institutet
|
2:50–3:00 | Break |
3:00–3:50 | Thirty graphical tips Stata users should know, revisited
Abstract:
In 2010, I gave a talk at the London conference presenting
30 graphical tips. The display materials remain accessible
on Stata's website but are awkward to view, because they are based
on a series of .smcl files. I will recycle the title and some of
the tips and add new ones, covering some of what you, your students,
or your research team should know when coding graphics for
mainstream tasks. The theme of "thirty" matches this 30th London
conference, and to a good enough approximation my 33 years as a
Stata user. The talk mixes examples from official and
community-contributed commands and details both large and small.
Nicholas J. Cox
Durham University
|
3:50–4:10 | Fancy graphics: Small multiples carpentry
Abstract:
Using “small multiples” in data visualization and
statistical graphics consists in combining repeated small
diagrams to display variations in data patterns or associations
across a series of units. Sometimes, the small multiples are mere
replications of identical plots but with different plot
elements highlighted. Small displays are typically arranged on a
grid, and the overall appearance is, as Tufte puts it, akin to
the sequence of frames of a movie when ordering follows a time
dimension. Creating diagrams for use in gridded “small
multiples” is easy with Stata's graphics combination
commands. However, the grid pattern can be limiting. This talk
will present tips and tricks for building small multiple
diagrams and illustrate some coding strategies for arranging
individual frames in the most flexible way, opening up some
creative possibilities of data visualization.
Philippe Van Kerm
University of Luxembourg
|
4:10–4:30 | Scalable high-dimensional nonparametric density estimation with Bayesian applications
Abstract:
Few methods have been proposed for flexible, nonparametric
density estimation, and they do not scale well to
high-dimensional problems. I describe a new approach based on
smoothed trees called the kudzu density (Grant 2022). This fits
the little-known density estimation tree (Ram and Gray 2011) to a
dataset and convolves the edges with inverse logistic functions,
which are in the class of computationally minimal smooth ramps.
New Stata commands provide tree fitting, kudzu tuning, estimates
of joint, marginal and cumulative densities, and pseudorandom
numbers.
Results will be shown for fidelity and computational cost. Preliminary results will also be shown for ensembles of kudzu under bagging and boosting. Kudzu densities are useful for Bayesian model updating where models have many unknowns and require rapid updating, datasets are large, and posteriors have no guarantee of convexity and unimodality. The input “dataset” is the posterior sample from a previous analysis. This is demonstrated with a real-life large dataset. A new command outputs code to use the kudzu prior in bayesmh evaluators, BUGS, and Stan.
Robert Grant
BayesCamp Ltd
|
9:00–9:20 | Robust testing for serial correlation in linear panel-data models
Abstract:
Serial correlation tests are essential parts of standard model
specification toolkits. For static panel models with strictly
exogenous regressors, a variety of tests are readily available.
However, their underlying assumptions can be very restrictive.
For models with predetermined or endogenous regressors,
including dynamic panel models, the Arellano–Bond (1991,
Review of Economic Studies) test is predominantly used,
but it has low power against certain alternatives. While more
powerful alternatives exist, they are underused in empirical
practice. The recently developed Jochmans (2020, Journal of
Applied Econometrics) portmanteau test yields substantial
power gains when the time horizon is very short, but it can
quickly lose its advantage even for time dimensions that are
still widely considered as small.
I propose a new test based on a combination of short and longer differences, which overcomes this shortcoming and can be shown to have superior power against a wide range of stationary and nonstationary alternatives. It does not lose power as the process under the alternative approaches a random walk—unlike the Arellano–Bond test—and it is robust to large variances of the unit-specific error component—unlike the Jochmans portmanteau test. I present a new Stata command that flexibly implements these (and more) tests for serial correlation in linear error component panel-data models. The command can be run as a postestimation command after a variety of estimators, including generalized method of moments, maximum likelihood, and bias-corrected estimation.
Sebastian Kripfganz
University of Exeter Business School
|
9:20–9:40 | Estimating the wage premia of refugee immigrants
Abstract:
In this case study, I examine the wage earnings of
fully employed previous refugee immigrants in Sweden. Using
administrative employer–employee data from 1990 onward, about
100,000 refugee immigrants who arrived between 1980 and 1996 and
were granted asylum are compared with a matched sample of
native-born workers using coarsened exact matching. Applying
recentered influence function (RIF) quantile regressions to wage
earnings for the period 2011–2015, the
occupational-task-based Oaxaca–Blinder decomposition
approach shows that refugees perform better than natives at the
median wage, controlling for individual and firm
characteristics. The RIF quantile approach provides better
insights for the analysis of these wage differentials than the
standard regression model employed in earlier versions of the
study.
Kit Baum
Boston College
|
9:40–10:00 | The Oaxaca–Blinder decomposition in Stata: An update
Abstract:
In 2008, I published the Stata command oaxaca, which
implements the popular Oaxaca–Blinder (OB) decomposition
technique. This technique is used to analyze differences in
outcomes between groups, such as the wage gap by gender or race.
Over the years, both the functionality of Stata and the
literature on decomposition methods have evolved, so an
update of the oaxaca command is now long overdue. I will
present a revised version of oaxaca that uses modern
Stata features, such as factor-variable notation, and supports
additional decomposition variants that have been proposed in the
literature (for example, reweighted decompositions or
decompositions based on recentered influence functions).
Ben Jann
University of Bern
|
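For readers new to the technique, the basic twofold Oaxaca–Blinder decomposition can be sketched with a single regressor. This is a minimal illustration, not the oaxaca command: the schooling and log-wage values are invented, group B's coefficients are used as the reference (one of several possible choices), and no standard errors are computed.

```python
def ols_simple(x, y):
    """Intercept and slope of a least-squares fit of y on one regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def twofold_decomposition(x_a, y_a, x_b, y_b):
    """Split mean(y_a) - mean(y_b) into explained and unexplained parts.

    Group B's coefficients serve as the reference (an assumption of this sketch).
    """
    a_a, b_a = ols_simple(x_a, y_a)
    a_b, b_b = ols_simple(x_b, y_b)
    mx_a, mx_b = sum(x_a) / len(x_a), sum(x_b) / len(x_b)
    explained = (mx_a - mx_b) * b_b                 # differences in endowments
    unexplained = (a_a - a_b) + mx_a * (b_a - b_b)  # differences in coefficients
    gap = sum(y_a) / len(y_a) - sum(y_b) / len(y_b)
    return gap, explained, unexplained

# Hypothetical data: years of schooling and log hourly wage for two groups.
school_a, wage_a = [10, 12, 14, 16], [2.0, 2.3, 2.5, 2.9]
school_b, wage_b = [9, 11, 12, 14], [1.8, 2.0, 2.2, 2.4]

gap, explained, unexplained = twofold_decomposition(school_a, wage_a, school_b, wage_b)
```

Because a least-squares line with an intercept passes through the group means, the explained and unexplained parts sum exactly to the raw gap in mean outcomes.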
10:00–10:30 | Women in fintech: A difference-in-differences approach to the gender diversity risk relationship
Abstract:
In recent years, gender diversity has received great political
and economic attention. This talk analyzes the intersection of
gender and fintech, focusing on the gender diversity risk
relationship. Several studies suggest the existence of a
positive relationship between board gender diversity and
corporate performance and of a negative one between gender
diversity and risk taking at the cross-sectional level. However,
when we employ a more sophisticated identification strategy to
investigate the impact of female director appointments on risk
measures using the difference-in-differences matching
estimator, the negative relationship disappears. My results
suggest that the director appointment process is not gender
neutral but the negative relationship between gender diversity
and equity risk is driven by between-firm heterogeneous factors.
Malvina Marchese
Bayes Business School
|
10:30–11:00 | Break |
11:00–11:20 | Visualizations to evaluate and communicate adverse event data in randomized controlled trials
Abstract:
Introduction: Well-designed visualizations are powerful ways to
communicate information to a range of audiences. In randomized
controlled trials (RCT) where there is an abundance of complex
data on harms (known as adverse events) visualizations can be a
highly effective means to summarize harm profiles and identify
potential adverse reactions. Trial reporting guidelines such as
the CONSORT extension for harms encourage the use of
visualizations for exploring harm outcomes, but research has
demonstrated that their uptake is extremely low.
Methods: To improve the communication of adverse event data collected in RCTs, we developed recommendations to help trialists decide which visualizations to use to present these data. We developed Stata commands (aedot and aevolcano) to produce two of the visualizations, the volcano and dot plots, to present adverse event data with the aim of easing implementation and promoting increased uptake.
Results: In this talk, using clinical examples, I will introduce and demonstrate the application of these commands. I will contrast the visual summaries produced by the volcano and dot plots with traditional nongraphical presentations of adverse event data, using examples from the published literature, with the aim of demonstrating the benefits of graphical displays.
Discussion: Visualizations offer an efficient means to summarize large amounts of adverse event data from RCTs, and statistical software eases the implementation of such displays. We hope that development of bespoke Stata commands to create visual summaries of adverse events will increase uptake of visualizations in this area by the applied clinical trial statistician.
Rachel Phillips
Imperial College London
|
11:20–11:40 | Optimizing adverse event analysis in clinical trials when dichotomising continuous harm outcomes
Abstract:
Introduction: The assessment of harm in randomized controlled
trials is vital to enable a risk-benefit assessment on the
intervention under evaluation. Many trials undertake regular
monitoring of continuous outcomes such as laboratory
measurements, for example, blood tests. Typical practice in a
trial analysis is to dichotomize this type of data into
abnormal/normal categories based on reference values. Frequently,
the proportions of participants with abnormal results in the
treatment arms are then compared using a chi-squared or
Fisher’s exact test reporting a p-value. Because
dichotomization results in substantial loss of information
contained in the outcome distribution, it increases the chance
of missing an opportunity to detect signals of harm.
Methods: A solution to this problem is to use the outcome distribution in each arm to estimate the between-arm difference in proportions of participants with an abnormal result. This approach has been developed by Sauzet et al. (2016), and it protects against a loss of information and retains statistical power.
Results: In this talk, I will introduce the distributional approach and associated Stata community-contributed command distdicho. I will compare the original analysis of blood test results from a small population drug trial in pediatric eczema with the results using the distributional approach and discuss inference from the trial based on these.
Contributor:
Odile Sauzet
Imperial College London
Victoria Cornelius
Imperial College London
|
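The contrast between the dichotomized and distributional analyses can be sketched as follows. This is a simplified illustration, not the distdicho command or the published estimator: it assumes the outcome is normally distributed and fits a separate normal to each arm, the measurements and the cutoff are invented, and no confidence interval is computed.

```python
from statistics import NormalDist, mean, stdev

def dichotomized_difference(arm_a, arm_b, cutoff):
    """Difference in observed proportions below the cutoff (the usual analysis)."""
    p_a = sum(v < cutoff for v in arm_a) / len(arm_a)
    p_b = sum(v < cutoff for v in arm_b) / len(arm_b)
    return p_a - p_b

def distributional_difference(arm_a, arm_b, cutoff):
    """Difference in proportions below the cutoff implied by a normal fit per arm."""
    fit_a = NormalDist(mean(arm_a), stdev(arm_a))
    fit_b = NormalDist(mean(arm_b), stdev(arm_b))
    return fit_a.cdf(cutoff) - fit_b.cdf(cutoff)

# Hypothetical laboratory measurements; values below 3.5 count as abnormal.
treated = [3.1, 3.6, 3.9, 4.2, 4.4, 4.8, 5.1, 5.5]
control = [3.8, 4.1, 4.5, 4.7, 5.0, 5.2, 5.6, 6.0]
cutoff = 3.5

observed_diff = dichotomized_difference(treated, control, cutoff)
model_diff = distributional_difference(treated, control, cutoff)
```

The dichotomized estimate depends only on how many values fall on each side of the cutoff, whereas the distributional estimate uses the mean and spread of every observation in each arm.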
11:40–12:00 | Implementing treatment-selection rules for multiarm multistage trials using nstage
Abstract:
Multiarm multistage (MAMS) randomized trial designs offer an
efficient and practical framework for addressing multiple
research questions. Typically, standard MAMS designs employ
prespecified interim stopping boundaries based on
lack of benefit and overwhelming efficacy. To facilitate
implementation, we have developed the nstage suite of
commands, which calculates the required sample sizes and trial
timelines for a MAMS design.
In this talk, we introduce the MAMS selection design, integrating an additional treatment selection rule to restrict the number of research arms progressing to subsequent stages in the event all demonstrate a promising treatment effect at interim analyses. The MAMS selection design streamlines the trial process by merging traditionally early-phase treatment selection with the late-phase confirmatory trial. As a result, it gains efficiency over the standard MAMS design by reducing overall trial timelines and required sample sizes. We present an update to the nstagebin Stata command that incorporates this additional layer of adaptivity and calculates required sample sizes, trial timelines, and overall familywise type I error rate and power for MAMS selection designs. Finally, we illustrate how a MAMS selection design can be implemented using the nstage suite of commands and outline its advantages using the ongoing trials in surgery (ROSSINI-2) and maternal health (WHO RED).
Contributors:
Alexandra Blenkinsop, Mahesh KB Parmar
MRC Clinical Trials Unit at UCL
Babak Choodari-Oskooei
MRC Clinical Trials Unit at UCL
|
12:00–12:20 | Poster lightning session
nmf: Implementation of nonnegative matrix factorization (NMF) in Stata
Jonathan Batty
University of Leeds
Difference in differences using constraints in Stata
Colin Birch
Animal and Plant Health Agency (APHA)
Machine-learning covariate adjustment in RCT
Lukas Fervers
German Institute for Life-Long Learning
|
12:20–1:30 | Lunch |
1:30–1:50 | Advanced Bayesian survival analysis with merlin and morgana
Abstract:
In this talk, I will describe our latest work to bring advanced
Bayesian survival analysis tools to Stata. Previously, we
introduced the morgana prefix command (bayesmh in
disguise), which provides a Bayesian wrapper for survival models
fit with stmerlin (which is merlin’s more
user-friendly wrapper designed for working with st data). We
have now begun the work to sync morgana with the much
more general merlin command to allow for Bayesian
multiple-outcome models. Within survival analysis, multiple
outcomes arise when we consider competing-risks or the more
general setting of multistate processes. Using an example in
breast cancer, I will show how to estimate competing-risks and
illness-death multistate models within a Bayesian framework,
incorporating prior information for covariate effects and
baseline hazard parameters. Importantly, we have also developed
the predict functionality to obtain a wide range of easily
interpretable predictions, such as cumulative incidence
functions and (restricted) life expectancy, along with their
credible intervals.
Michael Crowther
Red Door Analytics
|
1:50–2:10 | codefinder: Optimizing Stata for the analysis of large, routinely collected healthcare data
Abstract:
Routinely collected healthcare data (including electronic
healthcare records and administrative data) are increasingly
available at the whole-population scale and may span decades of
data collection. These data may be analyzed as part of clinical,
pharmacoepidemiologic and health services research, producing
insights that improve future clinical care. However, the
analysis of healthcare data on this scale presents a number of
unique challenges. These include the storage of diagnosis,
medication and procedure codes using a number of discordant
systems (including ICD-9 and 10, SNOMED-CT, Read codes, etc.)
and the inherently relational nature of the data (each patient
has multiple clinical contacts, during which multiple codes may
be recorded). Preprocessing and analyzing these data using
optimized methods has a number of benefits, including
minimization of computational requirements, analytic time,
carbon footprint, and cost.
We will focus on one of the main issues faced by the healthcare data analyst: how to most efficiently collapse multiple disparate diagnosis codes (stored as strings across a number of variables) into a discrete disease entity using a predefined code list. A number of approaches (including the use of Boolean logic, the inlist function, string functions, and regular expressions) will be sequentially benchmarked in a large, real-world healthcare dataset (n = 192 million hospitalization episodes during a 12-year period; approximately 1 terabyte of data). The time and space complexity of each approach (in addition to its carbon footprint) will be reported. The most efficient strategy has been implemented in our newly developed Stata command codefinder, which will be discussed.
Contributor:
Marlous Hall
University of Leeds
Jonathan Batty
University of Leeds
|
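The core task in this abstract, flagging an episode when any of its diagnosis fields matches a code list, can be sketched in two of the styles mentioned. This is an illustrative Python analogue, not codefinder or Stata's inlist function: the code list, field layout, and episodes are hypothetical.

```python
import re

# Hypothetical code list for one disease entity (ICD-10-style prefixes).
CODE_PREFIXES = ("I21", "I22")

# One regular expression built from the same code list.
PATTERN = re.compile("^(?:" + "|".join(re.escape(p) for p in CODE_PREFIXES) + ")")

def flag_by_prefix(record, prefixes=CODE_PREFIXES):
    """String-function approach: does any nonempty field start with a listed prefix?"""
    return any(code.startswith(prefixes) for code in record if code)

def flag_by_regex(record, pattern=PATTERN):
    """Regular-expression approach: does any nonempty field match the pattern?"""
    return any(pattern.match(code) is not None for code in record if code)

# Hypothetical episodes, each with three diagnosis fields stored as strings.
episodes = [
    ["I210", "E119", ""],
    ["J189", "", ""],
    ["K359", "I229", "N179"],
]

flags_prefix = [flag_by_prefix(e) for e in episodes]
flags_regex = [flag_by_regex(e) for e in episodes]
```

Both functions should agree on every episode, so a benchmark can compare them on speed and memory alone, for example with the standard library's timeit module on a larger sample.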
2:10–2:30 | Data-driven decision making using Stata
Abstract:
This presentation focuses on implementing a model in Stata for
making optimal decisions in settings with multiple actions or
options, commonly known as multiaction (or multiarm) settings.
In these scenarios, a finite set of decision options is
available. In the initial part of the presentation, I provide a
concise overview of the primary approaches for estimating the
reward or value function as well as the optimal policy within
the multiarm framework. I outline the identification assumptions
and statistical properties associated with optimal policy
learning estimators.
Moving on to the second part, I explore the analysis of decision risk. This examination reveals that the optimal choice can be influenced by the decision maker's risk attitude, specifically regarding the tradeoff between the reward conditional mean and conditional variance. The third part of the presentation presents a Stata implementation of the model, accompanied by an application to real data.
Giovanni Cerulli
CNR-IRCRES
|
2:30–2:50 | Pattern matching in Stata: Chasing the devil in the details
Abstract:
The vast majority of quantitative statistics now have to be
estimated through computer calculations. A computation script
strengthens the reproducibility of these studies but requires
carefulness from the researchers when writing their code to
avoid various mistakes. This presentation introduces a command
implementing some checks foreign to a dynamically typed language
such as Stata in the context of data analysis. This command uses
a new syntax, similar to switch or match expressions, to create
a variable based on other variables in place of chains of
“replace” statements with “if”
conditions. More than the syntax, the real interest of this
command lies in the two properties it checks for. The first one
is exhaustiveness: do the stated conditions cover all the
possible cases? The second one is usefulness: are all the
conditions useful, or is there redundancy between branches? I
borrow the present idea of pattern matching from the Rust
programming language and the earlier implementation in the OCaml
programming language of the algorithm detailed in Maranget
(2017). The command and source code are available on GitHub.
Mael Astruc-Le Souder
University of Bordeaux
|
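The two properties named in this abstract, exhaustiveness and usefulness, can be illustrated for variables with small finite domains. This is a brute-force sketch under stated assumptions, not the command's syntax and not the algorithm from the cited paper: it enumerates every combination of values, and the wildcard symbol, variable domains, and branches are invented.

```python
from itertools import product

WILDCARD = "_"  # hypothetical symbol meaning "any value"

def matches(pattern, case):
    """True if every position of the pattern is a wildcard or equals the case value."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern, case))

def check_branches(domains, patterns):
    """Return (cases no branch covers, indices of branches that are never reached)."""
    used = set()
    missing = []
    for case in product(*domains):
        for i, pattern in enumerate(patterns):
            if matches(pattern, case):
                used.add(i)   # the first matching branch wins
                break
        else:
            missing.append(case)  # no branch matched: not exhaustive
    redundant = [i for i in range(len(patterns)) if i not in used]
    return missing, redundant

# Hypothetical variables: sex and smoking status.
domains = [("male", "female"), ("yes", "no")]
branches = [("male", "yes"), ("male", WILDCARD), ("male", "no")]

missing, redundant = check_branches(domains, branches)
```

Here the branches are not exhaustive, because no branch covers the female cases, and the third branch is not useful, because the wildcard branch before it already covers the same case. Enumeration works only for small domains.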
2:50–3:10 | Break |
3:10–4:10 | Relationships among recent difference-in-differences estimators and how to compute them in Stata
Abstract:
I will provide an overview of the similarities and differences
among popular estimators in the context of staggered
interventions with panel data, illustrating how to compute and interpret
the estimates using built-in and community-contributed Stata commands.
Jeffrey Wooldridge
Michigan State University
|
4:10–5:00 | Open panel discussion with Stata developers
Contribute to the Stata community by sharing your feedback with StataCorp's developers. From feature improvements to bug fixes and new ways to analyze data, we want to hear how Stata can be made better for our users.
|
All participants are responsible for their own travel and accommodation expenses.
Conference fees (VAT not incl.) | Student | Other |
---|---|---|
Conference (both days) | £66 | £150 |
Conference (one day) | £48 | £96 |
Dinner (optional) | £65 |
There is an optional informal dinner at a London restaurant on Thursday, 12 September. The dinner will offer attendees a good opportunity to share their thoughts on the conference and network after the event.
Visit the official conference page for more information.
The logistics organizer for the 2024 UK Stata Conference is Timberlake Consultants, the Stata distributor to the United Kingdom and Ireland, France, Spain, Portugal, the Middle East and North Africa, Brazil, and Poland.
View the proceedings of previous Stata Conferences and international meetings.