The London Stata Users Group Meeting was held on Thursday, 7 September, and Friday, 8 September 2017; you can view the program and presentation slides below.
Abstract:
Given a random variable X, the ridit function R_X(·) specifies its
distribution. The SSC package wridit can compute ridits (possibly
weighted) for a variable. A ridit spline in a variable X is a spline in
the ridit R_X(X). The SSC package polyspline can be used with wridit to
generate an unrestricted ridit-spline basis for an X-variable, with the
feature that, in a regression model, the parameters corresponding to the
basis variables are equal to mean values of the outcome variable at a
list of percentiles of the X-variable. Ridit splines are especially
useful in propensity weighting. The user may define a primary propensity
score in the usual way, by fitting a regression model of the treatment
variable with respect to the confounders, then using the predicted
values of the treatment variable. A secondary propensity score is then
defined by regressing the treatment variable with respect to a
ridit-spline basis in the primary propensity score. We have found that
secondary propensity scores can predict the treatment variable as well as
the corresponding primary propensity scores, as measured using the
unweighted Somers' D with respect to the treatment variable. However,
secondary propensity weights frequently perform better than primary
propensity weights at standardizing out the treatment-propensity
association, as measured using the propensity-weighted Somers' D with
respect to the treatment variable. Also, when we measure the treatment
effect, secondary propensity weights may cause considerably less
variance inflation than primary propensity weights. This is because the
secondary propensity score is less likely to produce extreme propensity
weights than the primary propensity score.
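The primary propensity score step described above can be sketched in Stata as follows. This is a minimal illustration, not the author's code: the cattaneo2 example dataset, the logit model, and the variable names ps1 and ipw1 are assumptions chosen for the sketch, and the ridit-spline step (which relies on the SSC packages wridit and polyspline) is not shown.

```stata
* Illustrative data: Stata's cattaneo2 example dataset (an assumption)
webuse cattaneo2, clear

* Primary propensity score: regress the treatment on the confounders
logit mbsmoke mmarried mage fbaby medu
predict double ps1, pr

* Inverse-probability weights implied by the primary score
generate double ipw1 = cond(mbsmoke == 1, 1/ps1, 1/(1 - ps1))

* Extreme weights, if any, show up in the upper tail of this summary
summarize ipw1, detail
```

A long upper tail in the summary of ipw1 is the kind of extreme weighting that the abstract says a secondary propensity score is less likely to produce.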
Additional information: uk17_Newson.pdf
Roger Newson
Imperial College

Abstract:
Building on the papers by Abadie and Gardeazabal (2003) and Abadie,
Diamond, and Hainmueller (2010), I extend the Synthetic Control Method
for program evaluation to the case of a nonparametric identification of
the synthetic (or counterfactual) time pattern of the treated unit (for
instance: a country, region, city, etc.). I discuss the advantages of
this method over the method provided by previous authors and apply both to
the same example as Abadie, Diamond, and Hainmueller (2010), i.e. the
study of the effects of Proposition 99, a large-scale tobacco control
program that California implemented in 1988. I will also show the use
of the Stata command synth, provided by Abadie, Diamond, and Hainmueller
(2014), and I will show the use of npsynth for the nonparametric
synthetic control method I implemented in Stata.
Given that many policy interventions and events
of interest in social sciences take place at an aggregate level
(countries, regions, cities, etc.) and affect a small number of
aggregate units, the potential applicability of synthetic control
methods to comparative case studies is very large, especially in
situations where traditional regression methods are not appropriate.
Additional information: uk17_Cerulli.pdf
Giovanni Cerulli
IRCrES-CNR

Abstract:
There has recently been a tremendous amount of work in the area of joint
models. New extensions are constantly being developed
as methods become more widely accepted and used, especially as
the availability of software increases. In this talk, I will introduce
work focused on developing an overarching general framework and usable
software implementation, called (for now) nlmixed, for estimating many
different types of joint models. This will allow the user to fit a model
with any number of outcomes, each of which can be of various types
(continuous, binary, count, ordinal, survival), with any number of
levels, and with any number of random effects at each level. Random
effects can then be linked between outcomes in a number of ways. Of
course, all of this is nothing new and can be done (far better) with
gsem. My focus and motivation for writing my own simplified or extended
gsem is to extend the modeling capabilities to allow inclusion of
the expected value of an outcome (possibly time-dependent) or its
gradient, integral, or general function in the linear predictor
of another. Furthermore, I develop simple utility functions to allow the
user to extend to nonstandard distributions in an extremely simple way
with a short Mata function, while still providing the complex syntax
that users of gsem will be familiar with. I will focus on a special case of
the general framework: joint modeling of multivariate longitudinal
outcomes and survival. I will particularly discuss some challenges
faced in fitting such complex models, such as high-dimensional random
effects, and describe how we can relax the normally distributed random
effects assumption. I will also describe many new methodological
extensions, particularly in the field of survival analysis, each of
which is simple to implement in nlmixed.
Additional information: uk17_Crowther.pdf
Michael J Crowther
University of Leicester

Abstract:
Part of the art of coding is writing as little as possible to do as much
as possible. In this presentation, I expand on this truism and give examples
of Stata code to yield graphs and tables in which most of the real work
is delegated to workhorse commands. In graphics, a key
principle is that graph twoway is the most general command, even when
you do not want rectangular axes. Variations on scatter and line plots
are precisely that, variations on scatter and line plots. More
challenging illustrations include commands for circular and triangular
graphics, in which x and y axes are omitted with an inevitable but
manageable cost in recreating scaffolding, titles, labels, and other
elements. In tabulations and listings, the better known commands
sometimes seem to fall short of what you want. However, some
preparation commands (such as generate, egen, collapse or
contract) followed by list, tabdisp, or _tab can get
you a long way. The examples range in scope from a few lines of interactive code
to fully developed programs. This presentation is thus pitched at all
levels of Stata users.
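Two small sketches of the idea of delegating work to workhorse commands follow. They are illustrations written for this page, not code from the presentation; the variable names theta, x, y, and n are assumptions. The first draws a circle with twoway line after hiding the rectangular axes, and the second builds a two-way table from contract followed by tabdisp. Run the block as a do-file, because it uses /// line continuation.

```stata
* Circular graphic: compute the coordinates, then let twoway do the drawing
clear
set obs 361
generate double theta = (_n - 1) * _pi / 180
generate double x = cos(theta)
generate double y = sin(theta)
twoway line y x, aspectratio(1) xscale(off) yscale(off) ///
    ylabel(, nogrid) plotregion(style(none))            ///
    title("Unit circle drawn with twoway line")

* Tabulation: reduce the data with contract, then display with tabdisp
sysuse auto, clear
contract foreign rep78, freq(n)
tabdisp rep78 foreign, cellvar(n)
```

With the axes switched off, any scaffolding such as tick marks or labels has to be added back by hand, which is the manageable cost the abstract mentions.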
Additional information: uk17_Cox.pdf
Nicholas J. Cox
Durham University

Abstract:
Stata includes many options to change design elements of graphs.
Invoking these may be necessary to satisfy corporate branding guidelines
or journal formatting requirements, or may be desirable because of personal taste.
Whatever the reason, many options get used repeatedly (some in every
graph), and the code required to produce a single publication-ready
figure can run over tens of lines. Changing scheme can reduce the number
of options required. What many users are unaware of is that it is
simple to write your own personal graph scheme, greatly reducing the
number of lines of code needed for any given graph command. Opening a
graph scheme file reveals how unintimidating modifying a scheme is.
This presentation encourages users to "scheme scheme, plot plot",
showing both very simple and more complex examples, and showing how much
coding effort this can save.
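As a minimal sketch of how a scheme replaces repeated options, the lines below apply one of Stata's built-in schemes, first to a single graph and then to the whole session. The built-in s1mono scheme stands in here for a personal scheme, which is an assumption made for illustration; a user-written scheme file would be selected in the same way by name.

```stata
sysuse auto, clear

* One-off: choose a scheme for a single graph
twoway scatter mpg weight, scheme(s1mono)

* Session-wide: later graphs pick up the scheme without extra options
set scheme s1mono
twoway scatter mpg weight

* To keep the choice across sessions, uncomment the next line
* set scheme s1mono, permanently
```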
Additional information: uk17_Morris.pptx
Tim Morris
University College London

Abstract:
In the context of the instrumental variables (IV) approach, the control
function has been widely used in the applied econometrics literature.
The main objective is the same: to find (at least) one instrumental
variable that explains the variation in the endogenous explanatory
variable (EEV) of the structural equation. Once this goal is
accomplished, the researcher should regress the EEV on the exogenous
variables excluded from the structural equation (instrumental
variables). From this regression, usually denoted as first stage, one
should obtain the generalized residuals and plug them into the
structural equation (second stage). These residuals will then serve as a
control function to transform the EEV into an appropriate exogenous
variable. The main advantage of this method is that, unlike the
two-stage least squares approach (2SLS), it can be applied to nonlinear
models (Wooldridge 2015). Such situations arise when the outcome
variable of the structural equation is discrete, truncated, or censored.
The estimation of a nonlinear model, as opposed to the
typical ordinary least squares regression (OLS), may also be required in
the first stage. In this presentation, I provide an application to the latter by
fitting an accelerated failure-time model to explain the unemployment
duration (my EEV). In order to apply the control function to nonlinear
models, Stata currently offers only the etregress command, which
allows for a binary treatment variable. To complement this option, I
propose a user-written program that allows for a censored
treatment variable. Because the program is directed to duration models, the
user will be able to choose the type of survival analysis to perform in
the first stage. Because of the separate estimation of each stage, the
program calculates bootstrapped standard errors for the second stage.
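The two-stage logic described above can be sketched on simulated data. This is a simplified stand-in, not the program presented in the talk: it uses a linear first stage (so the generalized residual is the ordinary OLS residual) and a probit second stage, and the program name cf2stage and all variable names are hypothetical. Both stages are bootstrapped together so that the second-stage standard errors reflect the first-stage estimation error.

```stata
clear
set seed 2017
set obs 2000
generate double z = rnormal()                    // excluded instrument
generate double u = rnormal()                    // unobserved confounder
generate double x = 0.8*z + u + rnormal()        // endogenous explanatory variable
generate byte   y = (0.5*x - u + rnormal() > 0)  // binary outcome

capture program drop cf2stage
program define cf2stage, eclass
    tempname b
    capture drop vhat
    regress x z                      // first stage
    predict double vhat, residuals   // control function
    probit y x vhat                  // second stage with the control function
    matrix `b' = e(b)
    ereturn post `b'                 // hand the coefficients to bootstrap
end

* Resample the data and rerun both stages in every replication
bootstrap _b, reps(200) seed(2017) nowarn: cf2stage
```

A duration first stage, as in the abstract, would replace the regress step with a survival model and a matching generalized residual; that part is not attempted here.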
Additional information: uk17_Lopes.pdf
Marta C. Lopes
Nova School of Business and Economics

Abstract:
It is well known that likelihood functions may have multiple (local)
maxima. Unfortunately, the algorithms Stata uses to estimate some
popular nonlinear models can converge to local maxima of the likelihood
function, and in these cases, the results obtained are meaningless. This
is a serious problem, and users do not seem to be aware of it. In this
presentation, I use both the heckman and zinb commands to illustrate
this problem. As an aside, I also note that Stata uses the incorrect
version of Vuong's test to compare zero-inflated models with their
standard counterparts.
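One simple check for this problem is to refit the same model under a different maximization setting and compare the log likelihoods. The sketch below does this for heckman with the womenwk example dataset and the difficult option; the dataset, the model, and the scalar names are assumptions made for illustration and are not taken from the presentation.

```stata
webuse womenwk, clear

* Default maximization
heckman wage educ age, select(married children educ age)
scalar ll_default = e(ll)

* Same model with a different stepping rule in nonconcave regions
heckman wage educ age, select(married children educ age) difficult
scalar ll_difficult = e(ll)

display "log likelihood (default):   " ll_default
display "log likelihood (difficult): " ll_difficult
```

If the two values differ, at least one run stopped at a local maximum. Agreement is reassuring but does not prove that the global maximum was found.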
Additional information: uk17_Santos_Silva.pdf
João M.C. Santos Silva
University of Surrey

Abstract:
Environmental noise (linked to traffic, industrial activities, wind
farms, etc.) is a matter of increasing concern, as its association with
sleep deprivation and a variety of health conditions has been studied in
more detail. The framework used for noise assessments assumes that there
is a basic level of background noise that will often vary with time of
day and vary spatially across monitoring locations. There are additional noise
components from random sources such as vehicles, machinery, or wind
affecting trees. The question is whether,
and by how much, the noise at each location will be increased by the
addition of one or more new sources of noise such as a road, a factory
or a wind farm. This presentation adopts a mixtures specification to identify
heterogeneity in the sources and levels of background noise.
In particular, it is important to distinguish between sources of background
noise that may be associated with covariates of noise from a new source
and other sources that are independent of these covariates. A further
consideration is that noise levels are not additive, though sound
pressures are. The analysis uses an extended version of Deb's Stata
command (fmm) for fitting finite mixture models. The extended command
allows for imposing restrictions such as the restriction that not all
components are affected by the covariates or that the probabilities that
particular components are observed depend upon exogenous factors. These
extensions allow for a richer specification of the determinants of
observed noise levels. The extended command is supplemented by
postestimation commands that use Monte Carlo methods to estimate how a
new source will affect the noise exposure at different locations and how
outcomes may be affected by noise control measures. The goal is to
produce results that can be understood by decision makers with little or
no statistical background.
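The point that noise levels are not additive can be illustrated with a short calculation. The sketch below assumes two independent sources of 50 dB each and combines them by converting the decibel levels to a linear scale, adding, and converting back; the scalar names and the 50 dB values are hypothetical.

```stata
* Two independent sources of 50 dB each (illustrative values)
scalar L1 = 50
scalar L2 = 50

* Add on the linear scale, then convert back to decibels
scalar Ltotal = 10 * log10(10^(L1/10) + 10^(L2/10))
display "Combined level (dB): " %5.1f Ltotal
```

The combined level is about 53 dB rather than 100 dB, which is why a new source has to be assessed on the linear scale before its effect on observed noise levels is reported.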
Additional information: uk17_Hughes.pdf
Gordon Hughes
University of Edinburgh

Abstract:
I present the new Stata command xtseqreg, which implements sequential
(two-stage) estimators for linear panel data models. Generally, the
conventional standard errors are no longer valid in sequential
estimation when the residuals from the first stage are regressed on
another set of (often time-invariant) explanatory variables at a second
stage. xtseqreg computes the analytical standard-error correction of
Kripfganz and Schwarz (ECB Working Paper 1838, 2015), which accounts for
the first-stage estimation error. xtseqreg can be used to fit both
stages of a sequential regression or either stage separately. OLS and
2SLS estimation are supported, as well as one-step and two-step
"difference" GMM and "system" GMM estimation with a flexible choice of the
instruments and weighting matrix. Available postestimation statistics
include the Arellano-Bond test for absence of autocorrelation in the
first-differenced errors and Hansen's J-test for the validity of the
overidentifying restrictions. While it is not intended to introduce
xtseqreg as a competitor for existing commands, it can mimic part of
their behaviour. In particular, xtseqreg can replicate results obtained
with xtdpd and xtabond2. In that regard, I will illustrate some common
pitfalls in the estimation of dynamic panel models.
Additional information: uk17_Kripfganz.pdf
Sebastian Kripfganz
University of Exeter Business School

Abstract:
We present response surface coefficients for a large range of quantiles
of the Elliott, Rothenberg and Stock (Econometrica 1996) DF-GLS unit
root tests for different combinations of the number of observations and
the lag order in the test regressions, where the latter can be either
specified by the user or endogenously determined. The critical values
depend on the method used to select the number of lags. We also present the Stata
command ersur and illustrate its use with an empirical
example that tests the validity of the expectations hypothesis of the
term structure of interest rates.
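For orientation, the test whose critical values are discussed above can be run with Stata's built-in dfgls command. The sketch below is not an example of ersur; it uses the lutkepohl2 example dataset and a maximum lag of 4, both of which are assumptions made for illustration.

```stata
webuse lutkepohl2, clear
tsset qtr

* DF-GLS unit-root test for lags 1 through 4
dfgls ln_inv, maxlag(4)
```

The output reports the test statistic at each lag alongside critical values, which makes the dependence on the chosen lag order visible.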
Additional information: uk17_Baum.pdf
Kit Baum
Boston College
Jesús Otero
Universidad del Rosario, Bogota

Abstract:
In this talk, I will present a new matching package for Stata called
kmatch. The command matches treated and untreated observations with
respect to covariates and, if outcome variables are provided,
estimates treatment effects based on the matched observations,
optionally including regression adjustment bias correction. Multivariate
(Mahalanobis) distance matching and propensity score matching are
supported, using either kernel matching, ridge matching, or
nearest-neighbor matching. For kernel and ridge matching, several
methods for data-driven bandwidth selection such as cross-validation are
offered. The package also includes various commands for evaluating
balancing and common-support violations. A focus of the talk will be on
how kernel and ridge matching with automatic bandwidth selection compare
with nearest-neighbor matching.
Additional information: uk17_Jann.pdf
Ben Jann
University of Bern

Abstract:
Stata is the software of choice for many analysts of household surveys,
particularly for poverty and inequality analysis. No dedicated suite of
commands comes bundled with the software, but many user-written commands
are freely available for the estimation of various types of indices.
This talk will present a set of new tools that complement and
significantly upgrade some existing packages. The key feature of the new
packages is their ability to use Stata's built-in capacity for
dealing with survey design features (via the svy prefix), resampling
methods (via the bootstrap, jackknife, or permute prefixes),
multiply imputed data (via mi), and various postestimation commands
for testing purposes. I will review basic indices, outline
estimation and inference for such nonlinear statistics with survey
data, show programming tips, and illustrate various uses of the new
commands.
Additional information: uk17_VanKerm.pdf
Philippe Van Kerm
Luxembourg Institute for Social and Economic Research

Abstract:
Stata 15 introduces the new bayes prefix for fitting Bayesian regression
models more easily. It combines Bayesian features with Stata's intuitive
and elegant specification of regression models. For example, you fit
classical linear regression by using
    . regress y x1 x2
You can now fit Bayesian linear regression by using
    . bayes: regress y x1 x2
In addition to normal linear regression, the bayes prefix supports over
50 likelihood models, including models for continuous, binary, ordinal,
categorical, count, censored, survival outcomes, and more. All of Stata's
Bayesian features are supported with the bayes prefix. In my presentation,
I will demonstrate how to use the new bayes prefix to fit a variety of
Bayesian regression models, including survival and sample-selection
models.
Additional information: uk17_Marchenko.pdf
Yulia Marchenko
StataCorp

Abstract:
MR-Egger regression analyses are becoming increasingly common in
Mendelian randomization (MR) studies (Bowden, Smith, and Burgess 2015). MR-Egger
analyses use summary-level data as reported by genome-wide association
studies. Such data are conveniently available from the MR-Base platform
(Hemani et al. 2016). MR-Egger and related methods treat a multiple-instrument
MR analysis as a meta-analysis across the multiple genotypes.
In the MR-Egger approach, bias from the pleiotropic effects of the
multiple genotypes is treated as small-study reporting bias in
meta-analysis. These methods represent an important quality control check for
any MR analysis incorporating multiple genotypes. We implemented
several of these methods (inverse-variance weighted [IVW], MR-Egger, and
weighted median approaches, as well as a relevant plot) in a package for
Stata called mrrobust (pleiotropy robust methods for MR). There are also
implementations of these methods in R (Yavorska and Burgess 2016).
mrrobust is freely available from
https://github.com/remlapmot/mrrobust,
which includes instructions on how to install the package from within
Stata. We plan to add features over time.
References:
Bowden, J., G. D. Smith, and S. Burgess. 2015. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. International Journal of Epidemiology 44: 512–525.
Hemani, G., et al. 2016. MR-Base: a platform for systematic causal inference across the phenome using billions of genetic associations. bioRxiv, doi: https://doi.org/10.1101/078972; http://www.mrbase.org/.
Yavorska, O., and S. Burgess. 2016. MendelianRandomization: Mendelian Randomization Package. https://CRAN.R-project.org/package=MendelianRandomization
Additional information: uk17_Palmer.pdf
Tom Palmer
Lancaster University

Abstract:
In a group sequential trial, accumulated data are analyzed at numerous
time points to allow early decisions to be made about a
hypothesis of interest. These designs have historically been recommended
for their ethical, administrative, and economic benefits, and indeed have
a long history of use in clinical research. In this presentation, we
begin by discussing the theory behind these designs. Then we describe a
collection of new Stata commands for computing the stopping boundaries
and required group size of various classical group sequential designs,
assuming a normally distributed outcome variable. Following this, we
demonstrate how the performance of several designs can be compared
graphically. We conclude by discussing the many possible future
extensions of this work.
Additional information: uk17_Grayling.pdf
Michael Grayling
University of Cambridge
James Wason
University of Cambridge
Adrian Mander
University of Cambridge

Abstract:
In this presentation, we introduce piecewise_ginireg, an extension to Mark
Schaffer's ginireg command. Gini regressions are based on
Gini's Mean Difference as a measure of dispersion, and the estimator
can be interpreted as a weighted average of slopes (see for example Olkin,
I. and Yitzhaki, S. 1992. Gini Regression Analysis. International
Statistical Review 60: 185–196). Compared with a simple OLS
regression, the covariance is replaced by the Gini covariance.
piecewise_ginireg splits the dataset into subsets and allows an
estimation for each of these subsets, making it possible to obtain
estimated coefficients for each of the subsets and to test whether the
linearity assumption holds in the data. Compared with a regular
Gini regression, piecewise_ginireg runs several Gini regressions on
subsets of the data. As a first step, piecewise_ginireg runs a normal
Gini regression on the entire dataset (Iteration 0). The estimated
coefficients are saved, the residuals computed, and the LMA curve
calculated. The LMA allows an interpretation of how the Gini covariance is
composed. In the next iterations, the dataset is split into separate
parts defined by a rule. piecewise_ginireg allows different rules, such
as the min or max of the LMA, or where the LMA crosses the origin. On each
section, a Gini regression is performed, where the dependent
variable is the error term of the preceding iteration. After each
iteration, the coefficients are saved, and the residuals and LMA are calculated.
piecewise_ginireg allows the user to specify the maximum number of
iterations in several ways. It is possible to set a fixed number or
to iterate until the normality conditions of the error terms hold.
Additional information: uk17_Ditzen.pdf
Jan Ditzen
Heriot-Watt University
Shlomo Yitzhaki
The Hebrew University, Hadassah Academic College

Abstract:
The default method to calculate standard errors in regression models
requires idiosyncratic errors (uncorrelated on any dimension). More
general methods exist (e.g. HAC and clustered errors) but are not always
feasible, especially in smaller datasets or those with a complicated
(correlation) structure. However, if your residuals are uncorrelated,
the default standard errors might actually suffice and be more reliable
than their cluster-robust version. In this presentation, I present three
new panel serial correlation tests that can be used to look for
correlation along the first dimension ("within" groups). Likewise, I
present two relatively new commands to test for correlation in the second
dimension ("between" groups). These commands are faster, more
versatile, and more robust than existing ones (e.g. xtserial and abar).
Additional information: uk17_Wursten.pdf
Jesse Wursten
KU Leuven

Abstract:
Modern epidemiology has been able to identify significant limitations of
classic epidemiological methods, like outcome regression analysis, when
estimating causal quantities such as the average treatment effect (ATE)
for observational data. For example, using classic regression models
to estimate the ATE requires one to assume that the effect measure is constant
across levels of confounders included in the model, i.e. that there is
no effect modification. Other methods do not require this assumption,
including g-methods (e.g. the g-formula) and targeted maximum likelihood
estimation (TMLE). Many ATE estimators, but not all of them, rely on
parametric modeling assumptions. Therefore, the correct model
specification is crucial to obtain unbiased estimates of the true ATE.
TMLE is a semiparametric, efficient substitution estimator allowing for
data-adaptive estimation while obtaining valid statistical inference
based on the targeted minimum loss-based estimation. Being doubly
robust, TMLE allows inclusion of machine learning algorithms to minimize
the risk of model misspecification, a problem that persists for
competing estimators. Evidence shows that TMLE typically provides the
least biased estimates of the ATE compared with other doubly robust
estimators. eltmle is a Stata command implementing the targeted maximum
likelihood estimation for the ATE for a binary outcome and binary
treatment. eltmle uses a super learner called from the
Super Learner R package v.2.0-21 (Polley E., et al. 2011). The
Super Learner uses V-fold cross-validation (10-fold by default) to
assess the performance of prediction regarding the potential outcomes
and the propensity score as weighted averages of a set of machine
learning algorithms. We used the default Super Learner algorithms
implemented in the base installation of the tmle R package v.1.2.0-5
(Susan G. and Van der Laan M., 2017), which included the following: i)
stepwise selection, ii) generalized linear modelling (GLM), iii) a GLM
variant that includes second-order polynomials and two-by-two
interactions of the main terms included in the model. Additionally,
eltmle users will have the option to include Bayesian generalized linear
models and generalized additive models as additional Super Learner
algorithms. Future implementations will offer more advanced machine
learning algorithms.
Additional information: uk17_LuqueFernandez.pdf
Miguel Angel Luque Fernandez
London School of Hygiene and Tropical Medicine

Abstract:
I discuss how to use the new extended regression model (ERM) commands to
estimate average causal effects when the outcome is censored or when the sample is
endogenously selected. I also discuss how to use these commands to estimate causal
effects in the presence of endogenous explanatory variables, which these commands
also accommodate.
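As a minimal sketch of the endogenous-covariate case, the lines below simulate data in which the covariate w is correlated with the outcome error and then fit the model with eregress (Stata 15 or later). The data-generating process and all variable names are assumptions made for illustration; the censored-outcome and sample-selection cases mentioned in the abstract are not shown.

```stata
clear
set seed 2017
set obs 2000
generate double z = rnormal()                   // instrument
generate double e = rnormal()                   // shared unobservable
generate double x = rnormal()                   // exogenous covariate
generate double w = 0.7*z + e + rnormal()       // endogenous covariate
generate double y = 1 + 0.5*x + w + e + rnormal()

* Linear outcome model with w treated as endogenous and z as its instrument
eregress y x, endogenous(w = z x)
```

In this simulated setup the coefficient on w should be close to 1, whereas a plain regress of y on x and w would be biased by the shared unobservable.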
Additional information: uk17_Drukker.pdf
David Drukker
StataCorp

Abstract:
Among the many new features in Stata 14, arguably the most
exciting was completely unheralded: Stata graphs could now be exported
to SVG (Scalable Vector Graphics) format. SVG is a great option for
storing graphical outputs because it is compact, yet images can be
enlarged without becoming blurred or pixelated. It is also relatively
human-readable, particularly as Stata output, because the .svg files are
plain text XML code. We present three new commands as
the start of a larger SVG manipulation package, which amend an existing
Stata-generated .svg file to add features that are not available within
Stata. Individual objects such as markers or lines can be made
semitransparent, the SVG can be embedded within a web page with some
interactivity such as pop-up information, and a scatterplot can be
converted to a hexagonal bin (two-dimensional histogram).
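The starting point for this kind of manipulation is an SVG file written by Stata itself. A minimal sketch, assuming Stata 14 or later and a hypothetical output file name scatter.svg:

```stata
sysuse auto, clear
twoway scatter mpg weight

* The .svg suffix selects the SVG format
graph export "scatter.svg", replace
```

The resulting file is plain-text XML, so it can be opened in a text editor and its elements inspected or amended.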
Robert Grant
BayesCamp
Tim Morris
University College London

Abstract:
Statistical methods often rely on restrictive assumptions that are
expected to be (approximately) true in real-life situations. For
example, many classic statistical models ranging from descriptive
statistics to regression models or multivariate analysis are based
on the assertion that data are normally distributed. The main
justification for assuming a normal distribution is that it generally
approximates many real-life situations well and, more conveniently,
allows the derivation of explicit formulas for optimal statistical
methods such as maximum likelihood estimators. However, the normality
assumption may be violated in practice, and results obtained via
classic estimations may be uninformative or misleading. For example,
it can happen that the vast majority of the observations are
approximately normally distributed as assumed, but a small cluster of
so-called outliers is generated from a different distribution. In this
situation, classical estimation techniques may break down and not convey
the desired information. To deal with such limitations, robust
statistical techniques have been developed. In this talk, we will give a
brief overview of situations in which robust methods should be used. We
will start by "theoretically" describing such techniques in descriptive
analysis, regression models, and multivariate statistics. We will then
present some robust packages that have been implemented to make these
estimators available (and fast to compute) in Stata. This talk is
related to a forthcoming Stata Press book we are writing.
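To make the contrast between classical and robust estimation concrete, the sketch below fits the same model with ordinary least squares and with two estimators that ship with Stata. These built-in commands are used only for illustration and are not the packages presented in the talk; the auto dataset and the model are likewise assumptions.

```stata
sysuse auto, clear

regress price weight foreign   // classical OLS, sensitive to outliers
rreg    price weight foreign   // robust regression via iteratively reweighted least squares
qreg    price weight foreign   // median (quantile) regression
```

Large differences between the three sets of coefficients suggest that a few observations are driving the classical fit.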
Additional information: uk17_Jann2.pdf
Ben Jann
University of Bern
Vincenzo Verardi
University of Namur, Free University of Brussels, and FNRS

StataCorp

Organizers
Scientific committee
Stephen Jenkins
London School of Economics and Political Science
Roger Newson
Imperial College London
Michael Crowther
University of Leicester and Karolinska Institutet
Logistics organizer
The logistics organizer for the 2017 London Stata Users Group meeting is Timberlake Consultants, the distributor of Stata in the UK and Ireland.