# 2017 London Stata Users Group meeting

The London Stata Users Group Meeting takes place on September 7–8, 2017, at Cass Business School.

The meeting will give Stata users from the United Kingdom and around the world an opportunity to exchange ideas, experiences, and information on new applications of the software. Everyone interested in using Stata is welcome.

## Program

### Thursday, September 7

8:45–9:25
Registration and Coffee/Tea

9:25–9:30
Introduction and welcome

9:30–10:00
Ridit splines with applications to propensity weighting

Abstract: Given a random variable X, the ridit function R_X(·) specifies its distribution. The SSC package wridit can compute (possibly weighted) ridits for a variable. A ridit spline in a variable X is a spline in the ridit R_X(X). The SSC package polyspline can be used with wridit to generate an unrestricted ridit-spline basis for an X-variable, with the feature that, in a regression model, the parameters corresponding to the basis variables are equal to mean values of the outcome variable at a list of percentiles of the X-variable. Ridit splines are especially useful in propensity weighting. The user may define a primary propensity score in the usual way, by fitting a regression model of the treatment variable with respect to the confounders and using the predicted values of the treatment variable. A secondary propensity score is then defined by regressing the treatment variable with respect to a ridit-spline basis in the primary propensity score. We have found that secondary propensity scores can predict the treatment variable as well as the corresponding primary propensity scores do, as measured using the unweighted Somers' D with respect to the treatment variable. However, secondary propensity weights frequently perform better than primary propensity weights at standardizing out the treatment–propensity association, as measured using the propensity-weighted Somers' D with respect to the treatment variable. Also, when we measure the treatment effect, secondary propensity weights may cause considerably less variance inflation than primary propensity weights, because the secondary propensity score is less likely to produce extreme propensity weights than the primary propensity score.

Roger Newson
Imperial College

10:00–10:30
Nonparametric synthetic control method for program evaluation: Model and Stata implementation

Abstract: Building on the papers by Abadie and Gardeazabal (2003) and Abadie, Diamond, and Hainmueller (2010), I extend the synthetic control method for program evaluation to the case of a nonparametric identification of the synthetic (or counterfactual) time pattern of the treated unit (for instance, a country, region, or city). I discuss the advantages of this method over the method provided by previous authors and apply it to the same example as Abadie, Diamond, and Hainmueller (2010), i.e. the study of the effects of Proposition 99, a large-scale tobacco control program that California implemented in 1988. I will also show the use of the Stata command synth, provided by Abadie, Diamond, and Hainmueller (2014), and of npsynth, my Stata implementation of the nonparametric synthetic control method. Given that many policy interventions and events of interest in the social sciences take place at an aggregate level (countries, regions, cities, etc.) and affect a small number of aggregate units, the potential applicability of synthetic control methods to comparative case studies is very large, especially in situations where traditional regression methods are not appropriate.

Giovanni Cerulli
IRCrES-CNR

10:30–11:00
A general multilevel estimation framework: Multivariate joint models and more

Abstract: There has recently been a tremendous amount of work in the area of joint models. New extensions are constantly being developed as the methods become more widely accepted and used, especially as the availability of software increases. In this talk, I will introduce work focused on developing an overarching general framework, and a usable software implementation called (for now) nlmixed, for estimating many different types of joint models. This will allow the user to fit a model with any number of outcomes, each of which can be of various types (continuous, binary, count, ordinal, survival), with any number of levels, and with any number of random effects at each level. Random effects can then be linked between outcomes in a number of ways. Of course, all of this is nothing new and can be done (far better) with gsem. My focus and motivation for writing my own simplified or extended gsem is to extend the modeling capabilities to allow inclusion of the expected value of an outcome (possibly time-dependent), or its gradient, integral, or a general function of it, in the linear predictor of another. Furthermore, I develop simple utility functions to allow the user to extend to nonstandard distributions in an extremely simple way with a short Mata function, while still providing the complex syntax that users of gsem will be familiar with. I will focus on a special case of the general framework: joint modeling of multivariate longitudinal outcomes and survival. I will particularly discuss some challenges faced in fitting such complex models, such as high-dimensional random effects, and describe how we can relax the assumption of normally distributed random effects. I will also describe many new methodological extensions, particularly in the field of survival analysis, each of which is simple to implement in nlmixed.

Michael J. Crowther
University of Leicester

11:00–11:30
Coffee/Tea

11:30–12:00
On the shoulders of giants, or not reinventing the wheel

Abstract: Part of the art of coding is writing as little as possible to do as much as possible. In this presentation, I expand on this truism and give examples of Stata code to yield graphs and tables in which most of the real work is delegated to workhorse commands. In graphics, a key principle is that graph twoway is the most general command, even when you do not want rectangular axes. Variations on scatter and line plots are precisely that: variations on scatter and line plots. More challenging illustrations include commands for circular and triangular graphics, in which x and y axes are omitted, with an inevitable but manageable cost in re-creating scaffolding, titles, labels, and other elements. In tabulations and listings, the better-known commands sometimes seem to fall short of what you want. However, some preparation commands (such as generate, egen, collapse, or contract) followed by list, tabdisp, or _tab can get you a long way. The examples range in scope from a few lines of interactive code to fully developed programs. This presentation is thus pitched at all levels of Stata users.

Nicholas J. Cox
Durham University

12:00–12:20
Scheme scheme, plot plot: DIY graph schemes in Stata

Abstract: Stata includes many options to change design elements of graphs. Invoking these may be necessary to satisfy corporate branding guidelines or journal formatting requirements, or may be desirable because of personal taste. Whatever the reason, many options get used repeatedly (some in every graph), and the code required to produce a single publication-ready figure can run to tens of lines. Changing the scheme can reduce the number of options required. What many users are unaware of is that it is simple to write your own personal graph scheme, greatly reducing the number of lines of code needed for any given graph command. Opening a graph scheme file reveals how unintimidating modifying a scheme is. This presentation encourages users to "scheme scheme, plot plot", showing both very simple and more complex examples, and how much coding effort this can save.

Tim Morris
University College London

12:20–12:40
Unemployment duration and re-employment wages: A control function approach

Abstract: In the context of the instrumental variables (IV) approach, the control function has been widely used in the applied econometrics literature. The main objective is the same: to find (at least) one instrumental variable that explains the variation in the endogenous explanatory variable (EEV) of the structural equation. Once this goal is accomplished, the researcher regresses the EEV on the exogenous variables excluded from the structural equation (the instrumental variables). From this regression, usually denoted the first stage, one obtains the generalized residuals and plugs them into the structural equation (the second stage). These residuals then serve as a control function that transforms the EEV into an appropriately exogenous variable. The main advantage of this method is that, unlike the two-stage least-squares (2SLS) approach, it can be applied to nonlinear models (Wooldridge 2015). Such situations arise when the outcome variable of the structural equation is discrete, truncated, or censored. The estimation of a nonlinear model, as opposed to the typical ordinary least-squares (OLS) regression, may also be required in the first stage. In this presentation, I provide an application of the latter by fitting an accelerated failure-time model to explain unemployment duration (my EEV). To apply the control function to nonlinear models, Stata currently offers only the etregress command, which allows for a binary treatment variable. To complement this option, I propose a user-written program that allows for a censored treatment variable. Because the program is directed at duration models, the user can choose the type of survival analysis to perform in the first stage. Because of the separate estimation of each stage, the program calculates bootstrapped standard errors for the second stage.

Marta C. Lopes
Nova School of Business and Economics

12:40–1:00
eltmle: Ensemble learning targeted maximum likelihood estimation

Abstract: Modern epidemiology has identified significant limitations of classic epidemiological methods, such as outcome regression analysis, when estimating causal quantities such as the average treatment effect (ATE) from observational data. For example, using classic regression models to estimate the ATE requires one to assume that the effect measure is constant across levels of confounders included in the model, i.e. that there is no effect modification. Other methods do not require this assumption, including g-methods (e.g. the g-formula) and targeted maximum likelihood estimation (TMLE). Many ATE estimators, but not all of them, rely on parametric modeling assumptions, so correct model specification is crucial to obtain unbiased estimates of the true ATE. TMLE is a semiparametric, efficient substitution estimator that allows for data-adaptive estimation while obtaining valid statistical inference based on targeted minimum loss-based estimation. Being doubly robust, TMLE allows inclusion of machine learning algorithms to minimize the risk of model misspecification, a problem that persists for competing estimators. Evidence shows that TMLE typically provides the least biased estimates of the ATE compared with other doubly robust estimators. eltmle is a Stata command implementing targeted maximum likelihood estimation of the ATE for a binary outcome and binary treatment. eltmle uses a super learner called from the SuperLearner R package v.2.0-21 (Polley E. et al., 2011). The super learner uses V-fold cross-validation (10-fold by default) to assess the performance of predictions of the potential outcomes and the propensity score as weighted averages of a set of machine learning algorithms. We used the default super learner algorithms implemented in the base installation of the tmle R package v.1.2.0-5 (Susan G. and Van der Laan M., 2017), which include i) stepwise selection, ii) generalized linear modeling (GLM), and iii) a GLM variant that includes second-order polynomials and two-by-two interactions of the main terms included in the model. Additionally, eltmle users have the option to include Bayesian generalized linear models and generalized additive models as additional super learner algorithms. Future implementations will offer more advanced machine learning algorithms.

Miguel-Angel Luque Fernandez
London School of Hygiene and Tropical Medicine

1:00–2:00
Lunch

2:00–2:30
Estimating mixture models for environmental noise assessment

Abstract: Environmental noise (linked to traffic, industrial activities, wind farms, etc.) is a matter of increasing concern, because its associations with sleep deprivation and a variety of health conditions have been studied in increasing detail. The framework used for noise assessments assumes that there is a basic level of background noise that will often vary with time of day and vary spatially across monitoring locations. There are additional noise components from random sources such as vehicles, machinery, or wind affecting trees. The question is whether, and by how much, the noise at each location will be increased by the addition of one or more new sources of noise, such as a road, a factory, or a wind farm. This presentation adopts a mixture specification to identify heterogeneity in the sources and levels of background noise. In particular, it is important to distinguish between sources of background noise that may be associated with covariates of noise from a new source and other sources independent of these covariates. A further consideration is that noise levels are not additive, though sound pressures are. The analysis uses an extended version of Deb's Stata command fmm for fitting finite mixture models. The extended command allows for imposing restrictions, such as the restriction that not all components are affected by the covariates, or that the probabilities that particular components are observed depend on exogenous factors. These extensions allow for a richer specification of the determinants of observed noise levels. The extended command is supplemented by postestimation commands that use Monte Carlo methods to estimate how a new source will affect the noise exposure at different locations and how outcomes may be affected by noise control measures. The goal is to produce results that can be understood by decision makers with little or no statistical background.

Gordon Hughes
University of Edinburgh

2:30–3:00
Sequential (two-stage) estimation of linear panel data models

Abstract: I present the new Stata command xtseqreg, which implements sequential (two-stage) estimators for linear panel data models. Generally, the conventional standard errors are no longer valid in sequential estimation when the residuals from the first stage are regressed on another set of (often time-invariant) explanatory variables at a second stage. xtseqreg computes the analytical standard-error correction of Kripfganz and Schwarz (ECB Working Paper 1838, 2015), which accounts for the first-stage estimation error. xtseqreg can be used to fit both stages of a sequential regression or either stage separately. OLS and 2SLS estimation are supported, as are one-step and two-step "difference" GMM and "system" GMM estimation with a flexible choice of the instruments and weighting matrix. Available postestimation statistics include the Arellano–Bond test for absence of autocorrelation in the first-differenced errors and Hansen's J test for the validity of the overidentifying restrictions. While xtseqreg is not intended as a competitor for existing commands, it can mimic part of their behaviour. In particular, xtseqreg can replicate results obtained with xtdpd and xtabond2. In that regard, I will illustrate some common pitfalls in the estimation of dynamic panel models.

Sebastian Kripfganz
University of Exeter Business School

3:00–3:30
Response surface models for the Elliott, Rothenberg, Stock DF-GLS unit root test

Abstract: We present response surface coefficients for a large range of quantiles of the Elliott, Rothenberg, and Stock (Econometrica, 1996) DF-GLS unit root test for different combinations of the number of observations and the lag order in the test regressions, where the latter can be either specified by the user or endogenously determined. The critical values depend on the method used to select the number of lags. We also present the Stata command ersur and illustrate its use with an empirical example that tests the validity of the expectations hypothesis of the term structure of interest rates.

Kit Baum
Boston College
Jesús Otero
Universidad del Rosario, Bogotá

3:30–4:00
Coffee/Tea

4:00–4:30
kmatch: Kernel matching with automatic bandwidth selection

Abstract: In this talk, I will present a new matching package for Stata called kmatch. The command matches treated and untreated observations with respect to covariates and, if outcome variables are provided, estimates treatment effects based on the matched observations, optionally with regression-adjustment bias correction. Multivariate (Mahalanobis) distance matching and propensity-score matching are supported, using kernel matching, ridge matching, or nearest-neighbor matching. For kernel and ridge matching, several methods for data-driven bandwidth selection, such as cross-validation, are offered. The package also includes various commands for evaluating balancing and common-support violations. A focus of the talk will be on how kernel and ridge matching with automatic bandwidth selection compare with nearest-neighbor matching.

Ben Jann
University of Bern

4:30–5:30
Estimation and inference for quantiles and indices of inequality and poverty with survey data: Leveraging built-in support for complex survey design and multiply imputed data

Abstract: Stata is the software of choice for many analysts of household surveys, particularly for poverty and inequality analysis. No dedicated suite of commands comes bundled with the software, but many user-written commands are freely available for the estimation of various types of indices. This talk will present a set of new tools that complement and significantly upgrade some existing packages. The key feature of the new packages is their ability to use Stata's built-in capacity for dealing with survey design features (via the svy prefix), resampling methods (via the bootstrap, jackknife, or permute prefixes), multiply imputed data (via mi), and various postestimation commands for testing purposes. I will review basic indices, outline estimation and inference for such nonlinear statistics with survey data, show programming tips, and illustrate various uses of the new commands.

Philippe Van Kerm
Luxembourg Institute for Social and Economic Research

5:30–5:45
Presentation of the Stata Journal Editors' Prize 2016
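As a rough illustration of the ridit transform behind Newson's talk above: the sample ridit of a value x is the proportion of observations below x plus half the proportion equal to x. This is a minimal Python sketch of that definition only, not the wridit package (which also supports weights):

```python
def ridit(xs):
    """Sample ridits: R(x) = P(X < x) + 0.5 * P(X = x), estimated from xs itself."""
    n = len(xs)
    return [(sum(v < x for v in xs) + 0.5 * sum(v == x for v in xs)) / n
            for x in xs]

# Ridits map any sample onto (0, 1) and always average exactly 0.5,
# which is why a spline in R_X(X) gives percentile-anchored basis terms.
print(ridit([1, 2, 3, 4]))  # → [0.125, 0.375, 0.625, 0.875]
```

A ridit-spline basis would then be built in ridit(X) rather than in X itself, so knots correspond to percentiles of X.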
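The two-stage control-function logic described in Lopes's abstract can be sketched for the simplest linear case. Here both stages are plain OLS (her program uses a duration model in the first stage, and bootstrapped second-stage standard errors, neither of which is shown); all variable names and the simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
z = rng.normal(size=n)                       # instrument, excluded from structural eq.
u = rng.normal(size=n)                       # unobserved confounder
x = z + u                                    # endogenous explanatory variable (EEV)
y = 2.0 * x + u + 0.1 * rng.normal(size=n)   # structural equation, true coefficient 2

def ols(cols, y):
    """OLS with intercept; returns [const, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_naive = ols([x], y)[1]                     # ignores endogeneity: biased upward

# First stage: regress the EEV on the instrument, keep the residuals.
resid = x - np.column_stack([np.ones(n), z]) @ ols([z], x)

# Second stage: the residuals enter as a control function alongside the EEV.
b_cf = ols([x, resid], y)[1]                 # close to the true value 2
```

The residual absorbs the confounding variation in x, so the coefficient on x in the second stage is consistent, which is the same idea the duration-model first stage generalizes to censored treatments.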
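The kernel-matching idea in Jann's kmatch abstract can be sketched as follows: each treated unit's counterfactual outcome is a kernel-weighted average of control outcomes, with weights shrinking as propensity-score distance grows. This toy sketch fixes an Epanechnikov kernel and a hand-picked bandwidth; the data-driven bandwidth selection, ridge, and nearest-neighbor variants the talk covers are omitted:

```python
def epanechnikov(u):
    # Epanechnikov kernel: 0.75 * (1 - u^2) on |u| < 1, else 0
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kernel_match(ps_treated, ps_controls, y_controls, bandwidth):
    """Counterfactual outcome for one treated unit: kernel-weighted
    average of control outcomes by propensity-score distance."""
    w = [epanechnikov((p - ps_treated) / bandwidth) for p in ps_controls]
    total = sum(w)
    if total == 0.0:
        raise ValueError("no controls within bandwidth (common-support violation)")
    return sum(wi * yi for wi, yi in zip(w, y_controls)) / total

# Controls at scores 0.4 and 0.5 get weights 3/7 and 4/7; the control at 0.9
# falls outside the kernel window and gets weight 0.
est = kernel_match(0.5, [0.4, 0.5, 0.9], [1.0, 2.0, 5.0], bandwidth=0.2)
```

The bandwidth governs the bias–variance trade-off, which is why the automatic selection methods compared in the talk (such as cross-validation) matter in practice.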

## Registration

Participants are asked to travel at their own expense. The meeting fee covers costs for refreshments, lunch, and all meeting materials.

| Meeting fees (VAT incl.) | Price |
| --- | --- |
| Nonstudents, both days | £96.00 |
| Nonstudents, one day only | £66.00 |
| Students, both days | £66.00 |
| Students, one day only | £48.00 |
| Dinner (optional) | £36.00 |

The dinner is an informal meal at a London restaurant on Thursday evening.

## Organizers

#### Scientific committee

Stephen Jenkins
London School of Economics and Political Science

Roger Newson
Imperial College London

Michael Crowther
University of Leicester and Karolinska Institutet

#### Logistics organizer

The logistics organizer for the 2017 London Stata Users Group meeting is Timberlake Consultants, the distributor of Stata in the UK and Ireland.

For more information on the 2017 London Users Group meeting, visit the official meeting page.

View the proceedings of previous Stata Users Group meetings.