Ridit splines with applications to propensity weighting
Abstract: Given a random variable X, the ridit function R_X(·) specifies its distribution. The SSC package wridit can compute ridits (possibly weighted) for a variable. A ridit spline in a variable X is a spline in the ridit R_X(X). The SSC package polyspline can be used with wridit to generate an unrestricted ridit-spline basis for an X-variable, with the feature that, in a regression model, the parameters corresponding to the basis variables are equal to mean values of the outcome variable at a list of percentiles of the X-variable. Ridit splines are especially useful in propensity weighting. The user may define a primary propensity score in the usual way, by fitting a regression model of the treatment variable with respect to the confounders, then using the predicted values of the treatment variable. A secondary propensity score is then defined by regressing the treatment variable with respect to a ridit-spline basis in the primary propensity score. We have found that secondary propensity scores can predict the treatment variable as well as the corresponding primary propensity scores, as measured using the unweighted Somers' D with respect to the treatment variable. However, secondary propensity weights frequently perform better than primary propensity weights at standardizing out the treatment-propensity association, as measured using the propensity-weighted Somers' D with respect to the treatment variable. Also, when we measure the treatment effect, secondary propensity weights may cause considerably less variance inflation than primary propensity weights. This is because the secondary propensity score is less likely to produce extreme propensity weights than the primary propensity score.
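
As a rough illustration of what a ridit is, the sketch below computes unweighted ridits by hand with official Stata commands only. It does not use wridit or polyspline, whose syntax is not given in the abstract. It assumes the ridit is defined as P(X < x) + 0.5 P(X = x) in the sample; the auto dataset and the variable names midrank and ridit_mpg are chosen purely for illustration.

```stata
* Illustration only: unweighted ridits computed by hand (not the wridit package)
sysuse auto, clear

* egen's rank() assigns mid-ranks to tied values by default, so
* (mid-rank - 0.5)/N equals P(X < x) + 0.5*P(X = x) in the sample
egen double midrank = rank(mpg)
generate double ridit_mpg = (midrank - 0.5) / _N

* These ridits lie strictly between 0 and 1 and have a mean of 0.5
summarize ridit_mpg
```

A ridit spline in mpg would then be a spline in ridit_mpg rather than in mpg itself, so its knots sit at percentiles of the variable.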

Roger Newson
Imperial College
Nonparametric synthetic control method for program evaluation: Model and Stata implementation
Abstract: Building on the papers by Abadie and Gardeazabal (2003) and Abadie, Diamond, and Hainmueller (2010), I extend the Synthetic Control Method for program evaluation to the case of a nonparametric identification of the synthetic (or counterfactual) time pattern of the treated unit (for instance: a country, region, city, etc.). I discuss the advantages of this method over the method provided by previous authors and apply it to the same example as Abadie, Diamond, and Hainmueller (2010), i.e. the study of the effects of Proposition 99, a large-scale tobacco control program that California implemented in 1988. I will also show the use of the Stata command synth, provided by Abadie, Diamond, and Hainmueller (2014), and I will show the use of npsynth for the nonparametric synthetic control method I implemented in Stata. Given that many policy interventions and events of interest in social sciences take place at an aggregate level (countries, regions, cities, etc.) and affect a small number of aggregate units, the potential applicability of synthetic control methods to comparative case studies is very large, especially in situations where traditional regression methods are not appropriate.

Giovanni Cerulli
A general multilevel estimation framework: Multivariate joint models and more
Abstract: There has recently been a tremendous amount of work in the area of joint models. New extensions are constantly being developed as methods become more widely accepted and used, especially as the availability of software increases. In this talk, I will introduce work focused on developing an overarching general framework and usable software implementation, called (for now) nlmixed, for estimating many different types of joint models. This will allow the user to fit a model with any number of outcomes, each of which can be of various types (continuous, binary, count, ordinal, survival), with any number of levels, and with any number of random effects at each level. Random effects can then be linked between outcomes in a number of ways. Of course, all of this is nothing new and can be done (far better) with gsem. My focus and motivation for writing my own simplified or extended gsem is to extend the modeling capabilities to allow inclusion of the expected value of an outcome (possibly time-dependent) or its gradient, integral, or general function in the linear predictor of another. Furthermore, I develop simple utility functions to allow the user to extend to nonstandard distributions in an extremely simple way with a short Mata function, while still providing the complex syntax that users of gsem will be familiar with. I will focus on a special case of the general framework and joint modeling of multivariate longitudinal outcomes and survival. I will particularly discuss some challenges faced in fitting such complex models, such as high dimensional random effects, and describe how we can relax the normally distributed random effects assumption. I will also describe many new methodological extensions, particularly in the field of survival analysis, each of which is simple to implement in nlmixed.

Michael J Crowther
University of Leicester
On the shoulders of giants, or not reinventing the wheel
Abstract: Part of the art of coding is writing as little as possible to do as much as possible. In this presentation, I expand on this truism and give examples of Stata code to yield graphs and tables in which most of the real work is delegated to workhorse commands. In graphics, a key principle is that graph twoway is the most general command, even when you do not want rectangular axes. Variations on scatter and line plots are precisely that, variations on scatter and line plots. More challenging illustrations include commands for circular and triangular graphics, in which x and y axes are omitted with an inevitable but manageable cost in re-creating scaffolding, titles, labels, and other elements. In tabulations and listings, the better known commands sometimes seem to fall short of what you want. However, some preparation commands (such as generate, egen, collapse or contract) followed by list, tabdisp, or _tab can get you a long way. The examples range in scope from a few lines of interactive code to fully developed programs. This presentation is thus pitched at all levels of Stata users.
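
The "preparation command followed by a display command" pattern mentioned for tabulations can be sketched in a few lines. This is only an illustration with the auto dataset, not an example taken from the talk; the frequency variable name n is arbitrary.

```stata
* Illustration only: prepare with contract, then display with tabdisp
sysuse auto, clear

* contract reduces the data to one observation per combination of
* foreign and rep78, storing the count in the new variable n
contract foreign rep78, freq(n)

* tabdisp does no calculation of its own; it only lays out the stored values
tabdisp rep78 foreign, cellvar(n)
```

Because the counts now live in an ordinary variable, they can be transformed with generate or egen (for example, into percentages) before being displayed.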

Nicholas J. Cox
Durham University
Scheme scheme, plot plot: DIY graph schemes in Stata
Abstract: Stata includes many options to change design elements of graphs. Invoking these may be necessary to satisfy corporate branding guidelines or journal formatting requirements, or may be desirable because of personal taste. Whatever the reason, many options get used repeatedly (some in every graph), and the code required to produce a single publication-ready figure can run over tens of lines. Changing scheme can reduce the number of options required. What many users are unaware of is that it is simple to write your own personal graph scheme, greatly reducing the number of lines of code needed for any given graph command. Opening a graph scheme file reveals how unintimidating modifying a scheme is. This presentation encourages users to "scheme scheme, plot plot", showing both very simple and more complex examples, and showing how much coding effort this can save.

Tim Morris
University College London
Unemployment duration and re-employment wages: A control function approach
Abstract: In the context of the instrumental variables (IV) approach, the control function has been widely used in the applied econometrics literature. The main objective is the same: to find (at least) one instrumental variable that explains the variation in the endogenous explanatory variable (EEV) of the structural equation. Once this goal is accomplished, the researcher should regress the EEV on the exogenous variables excluded from the structural equation (instrumental variables). From this regression, usually denoted as first stage, one should obtain the generalized residuals and plug them into the structural equation (second stage). These residuals will then serve as a control function to transform the EEV into an appropriate exogenous variable. The main advantage of this method is that, unlike the two-stage least squares approach (2SLS), it can be applied to nonlinear models (Wooldridge 2015). Such situations arise when the outcome variable of the structural equation is discrete, truncated, or censored. The estimation of a nonlinear model, as opposed to the typical ordinary least squares regression (OLS), may also be required in the first stage. In this presentation, I provide an application to the latter by fitting an accelerated failure-time model to explain the unemployment duration (my EEV). In order to apply the control function to nonlinear models, Stata currently offers only the etregress command, which allows for a binary treatment variable. To complement this option, I propose a user-written program that allows for a censored treatment variable. Because the program is directed to duration models, the user will be able to choose the type of survival analysis to perform in the first stage. Because of the separate estimation of each stage, the program calculates bootstrapped standard errors for the second stage.
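
The two-stage logic described above can be sketched with official commands. This is not the author's program: for simplicity it assumes simulated data, a linear first stage with ordinary residuals (rather than a duration model with generalized residuals), and a probit second stage. The program name cf_twostep and all variable names are hypothetical.

```stata
* Illustration only: simulated data, linear first stage, probit second stage
clear
set seed 2017
set obs 2000
generate double z = rnormal()                        // instrument
generate double u = rnormal()                        // unobserved confounder
generate double x = 0.8*z + u                        // endogenous regressor
generate byte   y = (0.5*x - 0.7*u + rnormal() > 0)  // binary outcome

capture program drop cf_twostep
program define cf_twostep, rclass
    tempvar vhat
    regress x z                          // first stage
    predict double `vhat', residuals     // residuals act as the control function
    probit y x `vhat'                    // second stage
    return scalar b_x = _b[x]
end

* Resample both stages together so that the standard error of the
* second-stage coefficient reflects the first-stage estimation error
bootstrap b_x = r(b_x), reps(200) seed(2017) nodots: cf_twostep
```

Wrapping both stages in a single program is what lets bootstrap re-estimate the first stage in every replication, which is the reason given in the abstract for bootstrapping the second-stage standard errors.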

Marta C. Lopes
Nova School of Business and Economics
Local maxima in the estimation of the ZINB and sample selection models
Abstract: It is well known that likelihood functions may have multiple (local) maxima. Unfortunately, the algorithms Stata uses to estimate some popular nonlinear models can converge to local maxima of the likelihood function, and in these cases, the results obtained are meaningless. This is a serious problem, and users do not seem to be aware of it. In this presentation, I use both the heckman and zinb commands to illustrate this problem. As an aside, I also note that Stata uses the incorrect version of Vuong's test to compare zero-inflated models with their standard counterparts.

João M.C. Santos Silva
University of Surrey
Estimating mixture models for environmental noise assessment
Abstract: Environmental noise (linked to traffic, industrial activities, wind farms, etc.) is a matter of increasing concern, because its association with sleep deprivation and a variety of health conditions has been studied in more detail. The framework used for noise assessments assumes that there is a basic level of background noise that will often vary with time of day and vary spatially across monitoring locations. There are additional noise components from random sources such as vehicles, machinery, or wind affecting trees. The question is whether, and by how much, the noise at each location will be increased by the addition of one or more new sources of noise such as a road, a factory, or a wind farm. This presentation adopts a mixtures specification to identify heterogeneity in the sources and levels of background noise. In particular, it is important to distinguish between sources of background noise that may be associated with covariates of noise from a new source and other sources that are independent of these covariates. A further consideration is that noise levels are not additive, though sound pressures are. The analysis uses an extended version of Deb's Stata command (fmm) for fitting finite mixture models. The extended command allows for imposing restrictions such as the restriction that not all components are affected by the covariates or that the probabilities that particular components are observed depend upon exogenous factors. These extensions allow for a richer specification of the determinants of observed noise levels. The extended command is supplemented by postestimation commands that use Monte Carlo methods to estimate how a new source will affect the noise exposure at different locations and how outcomes may be affected by noise control measures. The goal is to produce results that can be understood by decision makers with little or no statistical background.

Gordon Hughes
University of Edinburgh
Sequential (two-stage) estimation of linear panel data models
Abstract: I present the new Stata command xtseqreg, which implements sequential (two-stage) estimators for linear panel data models. Generally, the conventional standard errors are no longer valid in sequential estimation when the residuals from the first stage are regressed on another set of (often time-invariant) explanatory variables at a second stage. xtseqreg computes the analytical standard-error correction of Kripfganz and Schwarz (ECB Working Paper 1838, 2015), which accounts for the first-stage estimation error. xtseqreg can be used to fit both stages of a sequential regression or either stage separately. OLS and 2SLS estimation are supported, as well as one-step and two-step "difference"-GMM and "system"-GMM estimation with a flexible choice of the instruments and weighting matrix. Available postestimation statistics include the Arellano-Bond test for absence of autocorrelation in the first-differenced errors and Hansen's J-test for the validity of the overidentifying restrictions. While it is not intended to introduce xtseqreg as a competitor for existing commands, it can mimic part of their behaviour. In particular, xtseqreg can replicate results obtained with xtdpd and xtabond2. In that regard, I will illustrate some common pitfalls in the estimation of dynamic panel models.

Sebastian Kripfganz
University of Exeter Business School
Response surface models for the Elliott, Rothenberg, Stock DF-GLS unit root test
Abstract: We present response surface coefficients for a large range of quantiles of the Elliott, Rothenberg and Stock (Econometrica 1996) DF-GLS unit root tests for different combinations of the number of observations and the lag order in the test regressions, where the latter can be either specified by the user or endogenously determined. The critical values depend on the method used to select the number of lags. We also present the Stata command ersur and illustrate its use with an empirical example that tests the validity of the expectations hypothesis of the term structure of interest rates.

Kit Baum
Boston College
Jesús Otero
Universidad del Rosario, Bogota
kmatch: Kernel matching with automatic bandwidth selection
Abstract: In this talk, I will present a new matching package for Stata called kmatch. The command matches treated and untreated observations with respect to covariates and, if outcome variables are provided, estimates treatment effects based on the matched observations, optionally including regression adjustment bias correction. Multivariate (Mahalanobis) distance matching and propensity score matching are supported, either using kernel matching, ridge matching, or nearest-neighbor matching. For kernel and ridge matching, several methods for data-driven bandwidth selection such as cross-validation are offered. The package also includes various commands for evaluating balancing and common-support violations. A focus of the talk will be on how kernel and ridge matching with automatic bandwidth selection compare with nearest-neighbor matching.

Ben Jann
University of Bern
Estimation and inference for quantiles and indices of inequality and poverty with survey data: Leveraging built-in support for complex survey design and multiply imputed data
Abstract: Stata is the software of choice for many analysts of household surveys, particularly for poverty and inequality analysis. No dedicated suite of commands comes bundled with the software, but many user-written commands are freely available for the estimation of various types of indices. This talk will present a set of new tools that complement and significantly upgrade some existing packages. The key feature of the new packages is their ability to use Stata's built-in capacity for dealing with survey design features (via the svy prefix), resampling methods (via the bootstrap, jackknife, or permute prefixes), multiply imputed data (via mi), and various postestimation commands for testing purposes. I will review basic indices, outline estimation and inference for such nonlinear statistics with survey data, show programming tips, and illustrate various uses of the new commands.

Philippe Van Kerm
Luxembourg Institute for Social and Economic Research
Fitting Bayesian regression models using the bayes prefix
Abstract: Stata 15 introduces the new bayes prefix for fitting Bayesian regression models more easily. It combines Bayesian features with Stata's intuitive and elegant specification of regression models. For example, you fit classical linear regression by using

. regress y x1 x2

You can now fit Bayesian linear regression by using

. bayes: regress y x1 x2

In addition to normal linear regression, the bayes prefix supports over 50 likelihood models, including models for continuous, binary, ordinal, categorical, count, censored, survival outcomes, and more. All of Stata's Bayesian features are supported with the bayes prefix. In my presentation, I will demonstrate how to use the new bayes prefix to fit a variety of Bayesian regression models, including survival and sample-selection models.
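
For readers who want to try the two commands shown above, here is a minimal sketch on the auto dataset shipped with Stata. The dataset, the choice of variables, and the seed are illustrative assumptions; the default priors are used as they come.

```stata
* Illustration only: the classical and Bayesian fits side by side
sysuse auto, clear

* Classical linear regression
regress mpg weight length

* Bayesian linear regression with the default priors;
* rseed() makes the MCMC sample reproducible
bayes, rseed(2017): regress mpg weight length

* Posterior means, standard deviations, and credible intervals
bayesstats summary
```

Options for the Bayesian fit, such as rseed() here, go before the colon, while the regression itself is specified exactly as in the classical command.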

Yulia Marchenko
mrrobust: A Stata package for MR-Egger regression type analyses
Abstract: MR-Egger regression analyses are becoming increasingly common in Mendelian randomization studies (MR) (Bowden, Smith, and Burgess 2015). MR-Egger analyses use summary-level data as reported by genome-wide association studies. Such data are conveniently available from the MR-base platform (Hemani et al. 2016). MR-Egger and related methods treat a multiple-instrument MR analysis as a meta-analysis across the multiple genotypes. In the MR-Egger approach, bias from the pleiotropic effects of the multiple genotypes is treated as small-study reporting bias in meta-analysis. These methods represent an important quality control check for any MR analysis incorporating multiple genotypes. We implemented several of these methods (inverse-variance weighted [IVW], MR-Egger, and weighted median approaches, as well as a relevant plot) in a package for Stata called mrrobust (pleiotropy robust methods for MR). There are also implementations of these methods in R (Yavorska and Burgess 2016). mrrobust is freely available from https://github.com/remlapmot/mrrobust, which includes instructions on how to install the package from within Stata. We plan to add features over time.


Bowden, J., G. D. Smith, and S. Burgess. 2015. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. International Journal of Epidemiology, 44: 512–525.

Hemani, G., et al. 2016. MR-Base: a platform for systematic causal inference across the phenome using billions of genetic associations. bioRxiv, doi: https://doi.org/10.1101/078972; http://www.mrbase.org/.

Yavorska, O., and S. Burgess. 2016. MendelianRandomization: Mendelian Randomization Package. https://CRAN.R-project.org/package=MendelianRandomization

Tom Palmer
Lancaster University
Group sequential clinical trial designs for normally distributed outcome variables
Abstract: In a group sequential trial, accumulated data are analyzed at numerous time points to allow early decisions to be made about a hypothesis of interest. These designs have historically been recommended for their ethical, administrative, and economic benefits, and indeed have a long history of use in clinical research. In this presentation, we begin by discussing the theory behind these designs. Then we describe a collection of new Stata commands for computing the stopping boundaries and required group size of various classical group sequential designs, assuming a normally distributed outcome variable. Following this, we demonstrate how the performance of several designs can be compared graphically. We conclude by discussing the many possible future extensions of this work.

Michael Grayling
University of Cambridge
James Wason
University of Cambridge
Adrian Mander
University of Cambridge
piecewise_ginireg: A Stata package to run piecewise Gini regressions
Abstract: In this presentation, we introduce piecewise_ginireg, an extension to Mark Schaffer's ginireg command. Gini regressions are based on Gini's mean difference as a measure of dispersion, and the estimator can be interpreted as a weighted average of slopes (see for example Olkin, I. and Yitzhaki, S. 1992. Gini Regression Analysis. International Statistical Review 60: 185–196). Compared with a simple OLS regression, the covariance is replaced by the Gini covariance. piecewise_ginireg splits the dataset into subsets and allows an estimation for each of these subsets, making it possible to obtain estimated coefficients for each of the subsets and to test whether the linearity assumption is supported by the data. Compared with a regular Gini regression, piecewise_ginireg runs several Gini regressions on subsets of the data. As a first step, piecewise_ginireg runs a normal Gini regression on the entire dataset (Iteration 0). The estimated coefficients are saved, the residuals computed, and the LMA curve calculated. The LMA allows an interpretation of how the Gini covariance is composed. In the next iterations, the dataset is split into separate parts defined by a rule. piecewise_ginireg allows different rules, such as the min or max of the LMA, or where the LMA crosses the origin. On each section, a Gini regression is performed, where the dependent variable is the error term of the preceding iteration. After each iteration, the coefficients are saved, and the residuals and LMA are calculated. piecewise_ginireg allows the user to specify the maximum number of iterations in several ways. It is possible to set a fixed number or to iterate until the normality conditions of the error terms hold.

Jan Ditzen
Heriot-Watt University
Shlomo Yitzhaki
The Hebrew University, Hadassah Academic College
Three serial correlation tests for panel data regression models
Abstract: The default method to calculate standard errors in regression models requires idiosyncratic errors (uncorrelated on any dimension). More general methods exist (e.g. HAC and clustered errors) but are not always feasible, especially in smaller datasets or those with a complicated (correlation) structure. However, if your residuals are uncorrelated, the default standard errors might actually suffice and be more reliable than their cluster robust version. In this presentation, I present three new panel serial correlation tests that can be used to look for correlation along the first dimension ("within" groups). Likewise, I present two relatively new commands to test for correlation in the second dimension ("between" groups). These commands are faster, more versatile, and more robust than existing ones (e.g. xtserial and abar).

Jesse Wursten
KU Leuven
eltmle: Ensemble learning targeted maximum likelihood estimation
Abstract: Modern epidemiology has been able to identify significant limitations of classic epidemiological methods, like outcome regression analysis, when estimating causal quantities such as the average treatment effect (ATE) for observational data. For example, using classic regression models to estimate the ATE requires one to assume that the effect measure is constant across levels of confounders included in the model, i.e. that there is no effect modification. Other methods do not require this assumption, including g-methods (e.g. the g-formula) and targeted maximum likelihood estimation (TMLE). Many ATE estimators, but not all of them, rely on parametric modeling assumptions. Therefore, the correct model specification is crucial to obtain unbiased estimates of the true ATE. TMLE is a semiparametric, efficient substitution estimator allowing for data-adaptive estimation while obtaining valid statistical inference based on the targeted minimum loss-based estimation. Being doubly robust, TMLE allows inclusion of machine learning algorithms to minimize the risk of model misspecification, a problem that persists for competing estimators. Evidence shows that TMLE typically provides the least biased estimates of the ATE compared with other doubly robust estimators. eltmle is a Stata command implementing the targeted maximum likelihood estimation for the ATE for a binary outcome and binary treatment. eltmle uses a super learner called from the Super Learner R-package v.2.0-21 (Polley E., et al. 2011). The Super Learner uses V-fold cross-validation (10-fold by default) to assess the performance of prediction regarding the potential outcomes and the propensity score as weighted averages of a set of machine learning algorithms. We used the default Super Learner algorithms implemented in the base installation of the tmle-R package v.1.2.0-5 (Gruber S. and van der Laan M., 2017), which included the following: i) stepwise selection, ii) generalized linear modelling (GLM), iii) a GLM variant that includes second order polynomials and two-by-two interactions of the main terms included in the model. Additionally, eltmle users will have the option to include Bayesian generalized linear models and generalized additive models as additional Super Learner algorithms. Future implementations will offer more advanced machine learning algorithms.

Miguel-Angel Luque Fernandez
London School of Hygiene and Tropical Medicine
Estimating effects from extended regression models
Abstract: I discuss how to use the new extended regression model (ERM) commands to estimate average causal effects when the outcome is censored or when the sample is endogenously selected. I also discuss how to use these commands to estimate causal effects in the presence of endogenous explanatory variables, which these commands also accommodate.

David Drukker
Extending Stata graphics via SVG manipulation commands
Abstract: Among the many new features in Stata 14, arguably the most exciting was completely unheralded: Stata graphs could now be exported to SVG (Scalable Vector Graphics) format. SVG is a great option for storing graphical outputs because it is compact, but images can be enlarged without becoming blurred or pixelated. It is also relatively human-readable, particularly as Stata output, because the .svg files are plain text XML code. We present three new commands as the start of a larger SVG manipulation package, which amend an existing Stata-generated .svg file to add features that are not available within Stata. Individual objects such as markers or lines can be made semitransparent, the SVG can be embedded within a web page with some interactivity such as popup information, and a scatterplot can be converted to a hexagonal bin (two-dimensional histogram).
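
The starting point for such manipulation is an ordinary Stata-generated .svg file, which can be produced as sketched below. The dataset and the file name scatter.svg are illustrative assumptions; the three new commands themselves are not named in the abstract and are not shown.

```stata
* Illustration only: produce a Stata-generated .svg file
sysuse auto, clear
scatter price mpg

* The export format is inferred from the .svg suffix (Stata 14 or later)
graph export "scatter.svg", replace

* Check that the file was written; it is plain-text XML and opens in any text editor
confirm file "scatter.svg"
```
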
Robert Grant
Tim Morris
University College London
Robust Statistics in Stata
Abstract: Statistical methods often rely on restrictive assumptions that are expected to be (approximately) true in real-life situations. For example, many classic statistical models ranging from descriptive statistics to regression models or multivariate analysis are based on the assertion that data are normally distributed. The main justification for assuming a normal distribution is that it generally approximates many real-life situations well and, more conveniently, allows the derivation of explicit formulas for optimal statistical methods such as maximum likelihood estimators. However, the normality assumption may be violated in practice, and results obtained via classic estimations may be uninformative or misleading. For example, it can happen that the vast majority of the observations are approximately normally distributed as assumed, but a small cluster of so-called outliers is generated from a different distribution. In this situation, classical estimation techniques may break down and not convey the desired information. To deal with such limitations, robust statistical techniques have been developed. In this talk, we will give a brief overview of situations in which robust methods should be used. We will start by "theoretically" describing such techniques in descriptive analysis, regression models, and multivariate statistics. We will then present some robust packages that have been implemented to make these estimators available (and fast to compute) in Stata. This talk is related to a forthcoming Stata Press book we are writing.

Ben Jann
University of Bern
Vincenzo Verardi
University of Namur, Free University of Brussels, and FNRS
Report to users & Wishes and grumbles


Scientific committee

Stephen Jenkins
London School of Economics and Political Science

Roger Newson
Imperial College London

Michael Crowther
University of Leicester and Karolinska Institutet

Logistics organizer

The logistics organizer for the 2017 London Stata Users Group meeting is Timberlake Consultants, the distributor of Stata in the UK and Ireland.