2015 UK Stata Users Group meeting

10–11 September 2015

Centre for Econometric Analysis
Cass Business School

106 Bunhill Row
London EC1Y 8TZ
United Kingdom


Somers' D: A common currency for associations

Roger B. Newson
Imperial College London
Somers' D(Y | X) is an asymmetric measure of ordinal association between two variables Y and X, on a scale from –1 to 1. It is defined as the difference between the conditional probabilities of concordance and discordance between two randomly sampled (X, Y) pairs, given that the two X values are ordered. The somersd package enables the user to estimate Somers' D for a wide range of sampling schemes, allowing clustering or sampling probability weighting or restriction to comparisons within strata. Somers' D has the useful feature that a larger D(Y | X) cannot be secondary to a smaller D(W | X) with the same sign, enabling us to make scientific statements that the first ordinal association cannot be caused by the second. An important practical example, especially for public health scientists, is the case where Y is an outcome, X an exposure, and W a propensity score. However, an audience accustomed to other measures of association may be culture-shocked if we present associations measured using Somers' D. Fortunately, under some commonly used models, Somers' D is related monotonically to an alternative association measure, which may be more clearly related to the practical question of how much good we can do. These relationships are nearly linear (or log-linear) over Somers' D values from –0.5 to 0.5. We present examples with X and Y binary, with X binary and Y survival time, with X binary and Y conditionally normal, and with X and Y bivariate normal. Somers' D can, therefore, be used as a common currency for comparing numerous associations between variables not limited to a particular model.
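
As a rough illustration of the definition above, the following Python sketch counts concordant and discordant pairs among all pairs whose X values differ. It is not the somersd implementation (which also handles clustering, sampling weights, and strata and provides confidence intervals); the function name somers_d is hypothetical, and the sketch assumes a simple unweighted sample.

```python
from itertools import combinations


def somers_d(x, y):
    """Sample Somers' D(Y | X) for two equal-length numeric sequences.

    D(Y | X) = (concordant - discordant) / (number of pairs with distinct X).
    Pairs tied on X are excluded; pairs tied on Y only count in the denominator.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = x_ordered = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pair tied on X: excluded by the conditioning
        x_ordered += 1
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if x_ordered == 0:
        raise ValueError("all X values are tied, so D(Y | X) is undefined")
    return (concordant - discordant) / x_ordered
```

For X = (0, 0, 1, 1) and Y = (1, 3, 2, 4), three of the four pairs with distinct X values are concordant and one is discordant, so the sketch returns D(Y | X) = 0.5.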

rscore: A Stata module to compute responsiveness scores

Giovanni Cerulli
Research Institute on Sustainable Economic Growth
This paper presents rscore, a Stata module to compute unit-responsiveness scores using an iterated random-coefficient regression (RCR). The basic econometrics of this model can be found in Wooldridge (2002, pp. 638–642). The model estimated by rscore starts from a classical regression of Y, the target variable, on a series of factors X (the regressors) by assuming a different reaction (or responsiveness) of each unit to each factor contained in X. This is done using a random-coefficient regression (RCR), an approach in which the usual regression coefficients vary across units. The application of such an approach can convey new and interesting analytical findings compared with the traditional regression approach. In particular, by measuring a unit-specific regression coefficient for each regressor, this model allows for: (i) ranking units according to the level of the responsiveness score obtained; (ii) detecting factors that are more influential in driving unit performance; (iii) studying the general distribution (variety) of the factors' responsiveness scores across units. The knowledge of these idiosyncratic scores can also be exploited to test the presence of increasing, constant, or decreasing returns of Y to X in a straightforward and graphically easy-to-read way.

Use of simulation with ipdpower in designing a randomized cluster study of an oral health intervention in care homes

David Boniface
University College London
This paper illustrates the use of the recently developed Stata procedure ipdpower (by E. Kontopantelis) in designing a cluster randomized trial. The trial compared change (pre and post) between intervention and non-intervention care homes. Forty-nine residential care homes ranging in size from 3 to 112 beds (median 27 beds) were available to take part. Primary outcome measures were tooth cleaning (a dichotomy) and the Geriatric Oral Health Assessment Index (GOHAI, a continuous score). As is common in this situation, it was required to explore the effect on sample size and power of a range of values of cluster sizes, within-cluster correlation, between-group variation, and intraclass correlation. Ranges of parameter values for multiple runs of the simulation procedure were obtained from published results of studies with similar features, transformed where necessary through standard formulae. The final design resulted in a recommendation of use of 16 homes with estimated statistical power of 80% for comparison of intervention with non-intervention participants, adjusting for baseline values. Simulation can be recommended as a valuable approach because it accounts for all features of the design, it facilitates communication among members of the study team in balancing design features, and it provides a clear sense of the size required for the necessary statistical power.
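
The general idea of simulation-based power estimation for a cluster randomized design can be sketched in Python as follows. This is not ipdpower and does not reproduce the study's design: it assumes equal cluster sizes, a continuous outcome with unit total variance, no baseline adjustment, and a normal-approximation test on cluster means. The function name simulate_power and all parameter names are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def simulate_power(n_clusters, cluster_size, icc, effect,
                   n_sims=500, alpha=0.05, seed=1):
    """Estimate power by simulating a two-arm cluster randomized trial.

    Outcomes are cluster effect + individual noise with total variance 1,
    so `icc` is the intraclass correlation and `effect` is in SD units.
    Assumption: arms are compared with a z-test on cluster-level means.
    """
    half = n_clusters // 2
    if half < 2:
        raise ValueError("need at least two clusters per arm")
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    sd_between = icc ** 0.5
    sd_within = (1 - icc) ** 0.5
    rejections = 0
    for _ in range(n_sims):
        arm_means = ([], [])  # cluster means for control (0) and intervention (1)
        for c in range(2 * half):
            arm = c % 2
            cluster_effect = rng.gauss(0, sd_between)
            outcomes = [arm * effect + cluster_effect + rng.gauss(0, sd_within)
                        for _ in range(cluster_size)]
            arm_means[arm].append(mean(outcomes))
        diff = mean(arm_means[1]) - mean(arm_means[0])
        se = (variance(arm_means[0]) / half + variance(arm_means[1]) / half) ** 0.5
        if abs(diff) > z_crit * se:
            rejections += 1
    return rejections / n_sims
```

Calling the function over a grid of values for icc, cluster_size, and effect mimics the multiple simulation runs described above; with few clusters the z-test is somewhat liberal, so a t-based or mixed-model analysis would be preferable in practice.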

Efficient multivariate-normal distribution calculations in Stata

Michael J. Grayling
MRC Biostatistics Unit Cambridge
Adrian P. Mander
MRC Biostatistics Unit Cambridge
The normal distribution holds significant importance in statistics. Much gathered real-world data either are, or are assumed to be, normally distributed. Today though, a considerable amount of statistical analysis performed is not univariate, but multivariate in nature. Consequently, the multivariate normal distribution is of increasing importance. However, the complexity of this distribution makes computational analysis almost certainly necessary, and thus there has been much research in developing efficient algorithms for its numerical analysis. Here we discuss our implementation of a specific algorithm in Mata that allows its distribution function and equi-coordinate quantiles to be identified seamlessly for any choice of location vector and positive semi-definite covariance matrix. Moreover, we detail new commands to efficiently compute its density and to generate pseudorandom variables. We then discuss the performance of our commands relative to the presently available alternatives, and we present how they provide greater generalization and improved computational speed. Finally, through the example of designing a group sequential clinical trial, we demonstrate how our commands can be used easily to solve real-world problems facing Stata users.

Between and beyond: Irregular series, interpolation, variograms, and smoothing

Nicholas J. Cox
Durham University
Time series (and similar one-dimensional series) are more often irregularly spaced than many methods texts or courses admit. Even with a plan of regular measurements, gaps can arise for many human or inhuman reasons, while some series are naturally irregular. Interpolation of values between known values is a centuries-old need, but one neglected by official Stata, which offers only linear interpolation and cubic spline interpolation (in Mata). I review additional user-written commands for interpolation, including those for cubic, nearest-neighbour, and piecewise cubic Hermite methods available from SSC.

Beyond interpolation of irregular series lie the questions of characterizing the structure of such series and smoothing in various ways. One useful tool standard in spatial statistics is the variogram, which relates dissimilarity as squared differences between values to their separation in time or distance in space. Diggle and others have shown uses for variograms in time series and longitudinal data analysis. I discuss user-written Stata commands for variogram calculation, plotting, and use in relation to exploratory data analysis on the one hand and smoothing on the other.
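
To make the variogram idea concrete, here is a minimal Python sketch of an empirical variogram for an irregularly spaced series. It is not one of the user-written Stata commands discussed in the talk; the function name empirical_variogram and the simple equal-width binning of lags are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def empirical_variogram(times, values, bin_width):
    """Return {lag-bin midpoint: mean of half squared differences}.

    Every pair of observations contributes 0.5 * (v2 - v1) ** 2, grouped
    by the separation |t2 - t1| into bins of width `bin_width`.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (t1, v1), (t2, v2) in combinations(zip(times, values), 2):
        lag_bin = int(abs(t2 - t1) // bin_width)
        sums[lag_bin] += 0.5 * (v2 - v1) ** 2
        counts[lag_bin] += 1
    return {(b + 0.5) * bin_width: sums[b] / counts[b] for b in sorted(sums)}
```

For times (0, 1, 3), values (0, 2, 2), and a bin width of 2, the sketch returns {1.0: 2.0, 3.0: 1.0}: one pair falls in the first lag bin and two pairs in the second.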

Using MICE to investigate loss to follow-up in a 10-year cohort of HIV-positive patients in Haiti

Deanna Jannat-Khah
Weill Cornell Medical College
Michelle Unterbrink
Weill Cornell Medical College
Margaret McNairy
Weill Cornell Medical College
Dan Fitzgerald
Weill Cornell Medical College
Arthur Evans
Weill Cornell Medical College
Samuel Pierre
Jean Pape
Weill Cornell Medical College and GHESKIO
Loss to follow-up is unavoidable in many public health studies. Tracing all subjects may be impractical or prohibitively expensive. Traditional methods, including Kaplan–Meier analysis and inverse-probability weighting (IPW), produce biased estimates if loss is not independent of survival. Multiple imputation with chained equations (MICE) provides an acceptable, robust, and cost-saving solution to this problem for HIV research in developing countries with limited resources. To illustrate its utility, we applied MICE to ascertain outcome status of people who were lost to follow-up within a cohort of N=910 HIV-positive people followed for 10 years in Port-au-Prince, Haiti. In this study, 17% (n=156) were lost to follow-up and 8% (n=71) transferred facilities. Contact tracing was performed and 45 of the 156 subjects identified as lost to follow-up were found: 37 alive and 8 deceased. Analysis using IPW based on the traced subjects predicted that 63% of all subjects were alive at 10 years (95% CI 0.59, 0.67).

Results from MICE predicted that within 6 months, 12% (95% CI 0.10-0.14) of those who were lost to follow-up or transferred were dead and 88% (95% CI 0.86-0.90) were alive. At 10 years, 33% were predicted to be dead (95% CI 0.29-0.36) and 67% (95% CI 0.64-0.71) were predicted to be alive. We found MICE to be more robust in predicting status because it allowed us to impute missing data so that we had the maximum number of observations to perform regression analyses. Additionally, the results were easier to interpret, less likely to be biased, and provided an interesting insight into a problem that is often commented upon in the extant literature. Overall, MICE is a useful, cost-saving method for studying survival compared with contact tracing for HIV research in developing countries.

Ethnicity recording in primary care: Multiple imputation of missing data in ethnicity recording using The Health Improvement Network (THIN) database

Tra Pham
University College London
Irene Petersen
University College London
Tim P. Morris
University College London
Ethnicity is an important factor to be considered in many epidemiological studies because of its association with inequality in disease prevalence and the utilization of healthcare. Ethnicity recording has been incorporated in primary care electronic health records and, therefore, is available in many large UK primary care databases, such as The Health Improvement Network (THIN). However, because primary care data are routinely collected to serve clinical purposes, a large amount of data that are relevant for research purposes, including ethnicity, is often missing. A popular approach is to use multiple imputation, but standard multiple imputation does not give plausible estimates of the ethnicity distribution in THIN compared with the general UK population. However, census data can be used to form weights to use in multiple imputation such that the correct ethnicity distribution is recovered. We will describe how the method of weighted multiple imputation of missing data is implemented using Stata's mi impute suite, note some issues, and introduce a new procedure to implement the method for multiple incomplete variables that require different imputation weights. Finally, we will give an example showing how the method works when ethnicity is used as an explanatory variable in a cohort study.

Logistic regression: Why we often can do what we think we can do

Maarten Buis
University of Konstanz
There is increasing criticism of the ways in which the raw coefficients and odds ratios from logistic regression have been used. The argument is that logistic regression models offer a latent propensity of success and that the scale of that latent variable is fixed by fixing the variance of the error term. If one adds a variable to a model, the variance of the residual is likely to decrease, and the scale of the dependent variable thus changes. Comparing models with and without that additional variable thus becomes problematic. Similarly, a comparison of models in groups that are likely to have different residual variances will also be problematic. However, I will argue that logistic regression has an unusual dependent variable: a probability that measures how certain we are that an event of interest happens. This degree of certainty is a function of how much information we have, which in the case of logistic regression is captured by the variables we add to the model. If the dependent variable is interpreted in that way, many of the problems with logistic regression turn out to be desirable properties of the logistic regression model.

Big Data in Stata

Andrew Maurer
Quantitative Risk Management
With more and more data being stored by organizations across industries (from academia to health care to banking), along with plummeting storage and RAM costs, there is a growing need for tools to analyze "big data". The world is moving from needing to analyze megabytes of data to needing to analyze many gigabytes. While Stata is very user friendly, many of the most basic commands (summarize, sample, collapse, encode, and others) are not optimized for speed. As of Stata 14, these commands all rely on sorting, which makes them tens, or even hundreds (in the case of sample), of times slower than what is possible with better algorithms. In this presentation, I illustrate alternative algorithms along with coded examples in Stata, Mata, and C++ plugins that may be used to more quickly analyze big data. fastsample and fastcollapse are available from the SSC.
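
One well-known way to draw a random sample without sorting is sequential selection sampling (Knuth's Algorithm S), sketched below in Python. The abstract does not say which algorithm fastsample uses, so this is only an illustration of the kind of single-pass approach that avoids a sort; the function name sample_without_sorting is hypothetical.

```python
import random


def sample_without_sorting(n_rows, k, seed=None):
    """Choose exactly k of n_rows row indices in one pass, with no sort.

    Row i is selected with probability (still needed) / (rows remaining),
    which gives every k-subset the same probability of being drawn.
    """
    if not 0 <= k <= n_rows:
        raise ValueError("k must be between 0 and n_rows")
    rng = random.Random(seed)
    chosen = []
    needed = k
    for i in range(n_rows):
        if needed == 0:
            break  # sample complete; remaining rows need not be visited
        remaining = n_rows - i
        if rng.random() * remaining < needed:
            chosen.append(i)
            needed -= 1
    return chosen
```

The pass costs time proportional to the number of rows, whereas assigning a random key to each row and sorting on it costs on the order of n log n; the selected indices also come out in their original order.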

Robust covariance estimation for quantile regression

João Santos Silva
University of Surrey
Quantile regression is increasingly used by practitioners, but there are still some misconceptions about how difficult it is to obtain valid standard errors in this context. In this presentation, I discuss the estimation of the covariance matrix of the quantile regression estimator, focusing special attention on the case where the regression errors may be heteroskedastic or clustered. I discuss specification tests to detect heteroskedasticity and intra-cluster correlation, and I present small simulation studies to illustrate the finite-sample performance of the tests and of the covariance matrix estimators. I conclude the presentation with a brief description of qreg2, which is a wrapper for qreg that implements all the methods discussed in the presentation.

Influence functions at work

Philippe Van Kerm
Luxembourg Institute of Socio-Economic Research (LISER)
This presentation illustrates three practical uses of influence functions (IFs) in Stata. First (and most obviously), inspection of IFs helps detect influential sample observations. I show how this can be done in practice and how similar this is to examining jackknife replicates. Second, IFs make it easy to calculate (asymptotic) standard errors and confidence intervals for a wide range of statistics. I illustrate how this can be done in Stata with the total command to account for complex survey design easily. Third, application of "recentered influence function (RIF) regression" has recently been advocated to approximate the impact of covariates on (unconditional) distribution statistics. I demonstrate this use of IFs in Stata and discuss interpretation of RIF regression model coefficients. Empirical applications are to income distribution analysis. Several user-written utilities and commands are illustrated along the way.
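
The second use (standard errors from influence functions) can be sketched in Python for two simple statistics, assuming a simple random sample with no survey design. The function names are hypothetical, and this is not the Stata workflow with total shown in the talk; it only illustrates that the standard error of a statistic is approximately the standard deviation of its influence-function values divided by the square root of the sample size.

```python
from math import sqrt
from statistics import mean, stdev


def if_mean(xs):
    """Influence function values of the sample mean: x_i - mean."""
    m = mean(xs)
    return [x - m for x in xs]


def if_variance(xs):
    """Influence function values of the variance: (x_i - mean)^2 - variance."""
    m = mean(xs)
    v = mean([(x - m) ** 2 for x in xs])
    return [(x - m) ** 2 - v for x in xs]


def se_from_if(if_values):
    """Asymptotic standard error: sd of the IF values over sqrt(n)."""
    return stdev(if_values) / sqrt(len(if_values))
```

For the mean, se_from_if(if_mean(xs)) reproduces the familiar standard error, the sample standard deviation divided by the square root of n. Sorting observations by the absolute value of their influence-function values gives a quick screen for influential observations.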

A large-scale application of Stata's forecast suite: Challenges and potential

Christopher F. Baum
Boston College and DIW Berlin
Stata 13 added a very important feature for macroeconomists: the forecast suite of commands that implements the definition of a model, consisting of various estimated equations and potentially nonlinear identities. Stata's features include model solution, dynamic forecasting, scenario analysis, and stochastic simulation. I report on my attempt to apply the forecast suite to a well-known large-scale macroeconomic model. I discuss the challenges related to use of these features in a much more complex context than that illustrated in the manual's examples. I also suggest enhancements that would improve forecast's capabilities in comparison with other popular forecasting tools.

Log-linear models for cross-tabulations using Stata

Maarten Buis
University of Konstanz
Log-linear models for cross-tabulations are models for describing and testing patterns in cross-tabulations. These cross-tabulations could have two dimensions (e.g. father's occupation versus son's occupation) or more than two dimensions (e.g. father's occupation versus son's occupation for different cohorts and different countries). A wide range of patterns can be investigated and tested with these models. For example, one can investigate whether the dimensions are independent (e.g. father's occupation has no relevance for the son's occupation); whether the dimensions are independent except for the diagonals (e.g. sons are more likely to enter the occupation of their father, but the father has no influence once the son chooses an occupation other than his father's); or assume that the categories are ordinal, estimate a scale for each dimension, and summarize the strength of the association with one number, which can be compared across cohorts or countries. The purpose of this talk is to give an overview of this family of models, discuss how to trick Stata (in particular, poisson and gsem) into estimating these models, and show how to get interpretable parameters out of these models.

Bayesian analysis using Stata

Yulia Marchenko
Stata 14 provides a suite of commands for performing Bayesian analysis. Bayesian analysis is a statistical paradigm that answers research questions about unknown parameters using probability statements. For example, what is the probability that a person accused of a crime is guilty? What is the probability that there is a positive effect of schooling on wage? What is the probability that the odds ratio is between 0.3 and 0.5? And many more. In my presentation, I will describe Stata's Bayesian suite of commands and demonstrate its use in various applications.

Fast Bayesian modeling in Stan using the StataStan program

Robert Grant
St George's, University of London & Kingston University
Over the last three years, a new package for Bayesian modeling called Stan (after Stanislaw Ulam, co-inventor of the Monte Carlo method) has been developing quickly and making an impact on computing for complex Bayesian models. By translating the model into C++ and then compiling that, it can run much faster than BUGS. A particular benefit is for simulation studies, because the model needs to be compiled only once. Furthermore, it includes a much faster and better-mixing algorithm (NUTS: the No-U-Turn Sampler), especially for correlated parameters that Gibbs samplers like BUGS cope with badly. I present the program StataStan, which sends your data and specifications to Stan, displays results, and can read the chains of samples back into Stata. There are also specific commands to run the commonly used models in the BUGS and Stan user manuals with your own data, avoiding the need to write the Stan model.

Stata for Internet applications: A web interface for Stata user-written commands

Alexander Zlotnik
Technical University of Madrid
The integration of Stata with web applications can be of great use in some contexts. One such scenario is to make user-written Stata commands available directly through a webpage from any web-enabled device, such as a smartphone, tablet computer, personal digital assistant (PDA), or any personal computer with a web browser. This would allow reaching a large and diverse audience. Another scenario is the integration of subroutines written in Stata or Mata in web applications, which is desirable in organizations where statistical applications are developed by one team with Stata, while the rest of the business logic and front-end applications are developed by another team using different technologies. If Stata programs can be used directly, the often costly translation from Stata into other programming languages can be avoided, saving development resources and time and eliminating the errors and discrepancies that arise from translation mistakes and limitations of target languages.

I demonstrate an approach for executing user-written commands on Stata/IC, Stata/SE, and Stata/MP through a web application based on the WAMP stack (Microsoft Windows, Apache, MySQL, PHP). Then, I introduce the adjustments needed for other operating systems, web servers, and server-side scripting languages. I describe the requirements for Stata user-written commands accessible through web applications, their limitations, the bidirectional communication between Stata and generic web applications, possible solutions for concurrent execution scenarios, as well as the transformation of Stata dialog-box (DLG) files into web-ready HTML, CSS, and JavaScript interfaces. Finally, I mention web application security principles, Stata-based web services, and software licensing approaches.

A new Stata command for computing and graphing percentile shares

Ben Jann
University of Bern
Percentile shares provide an intuitive and easy-to-understand way of analyzing income or wealth distributions. A celebrated example is the top income shares featured in the works of Thomas Piketty and colleagues. Moreover, series of percentile shares, defined as differences between Lorenz ordinates, can be used to visualize whole distributions or changes in distributions. In this talk, I present a new command called pshare that computes and graphs percentile shares (or changes in percentile shares) from individual-level data. The command also provides confidence intervals and supports survey estimation.
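
The definition of percentile shares as differences between Lorenz ordinates can be sketched in Python as follows. This is not the pshare command: it ignores weights, confidence intervals, and survey design, and the linear interpolation used for the Lorenz curve is an assumption that may differ from pshare's conventions. The function names are hypothetical.

```python
def lorenz_ordinate(sorted_incomes, p):
    """Share of total income held by the poorest fraction p (0 <= p <= 1).

    Uses linear interpolation within the observation that straddles p.
    """
    n = len(sorted_incomes)
    total = sum(sorted_incomes)
    position = p * n
    whole = int(position)
    cumulative = sum(sorted_incomes[:whole])
    if whole < n:
        cumulative += (position - whole) * sorted_incomes[whole]
    return cumulative / total


def percentile_shares(incomes, cutoffs):
    """Income shares of the groups between consecutive cutoffs.

    `cutoffs` are increasing population proportions, e.g. [0, 0.5, 0.9, 1].
    Each share is the difference between two Lorenz ordinates.
    """
    ordered = sorted(incomes)
    ordinates = [lorenz_ordinate(ordered, p) for p in cutoffs]
    return [hi - lo for lo, hi in zip(ordinates, ordinates[1:])]
```

With incomes 1, 2, 3, and 4 and cutoffs 0, 0.5, and 1, the bottom half holds 3 of the 10 income units and the top half holds 7, so the shares are 0.3 and 0.7.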

How used are user-released commands? Introducing ssccount

Tim Morris
University College London
Babak Choodari-Oskooei
University College London
Statisticians and econometricians developing new methods are keen for their methods to be adopted, and releasing user-friendly software plays an important role in uptake. Methods that were not initially applied much, and became so after software implementations, include Cox's proportional-hazards model, multiple imputation, and propensity-score matching. It is easy to release packages to the Stata community via the Boston College Statistical Software Components (SSC) archive, but gauging the uptake can be difficult. Stata's ssc whatshot command lists the number of hits for a recent month for packages available on SSC. The new ssccount command goes further, obtaining monthly files of hits (from July 2007, when records began) for specified authors and packages, and optionally plots the number of hits over time. This can give authors an impression of how widely their commands are being used. Funders are increasingly asking for evidence of impact, and thus ssccount provides a useful soft measure.

Frequentist inference in spatial discrete choice models with endogenous congestion effects and club-correlated random effects

Arnab Bhattacharjee
Heriot-Watt University
Robert L. Hicks
College of William and Mary
Kurt E. Schnier
University of California Merced
Agents may consider information and other signals from their peers (especially close peers) when making their spatial site choices. However, the presence of other agents in a spatial location may generate congestion or agglomeration effects. Disentangling the potential peer effects from congestion effects is difficult because it is hard to ascertain whether the observed congestion effects are a result of observing others' behavior or the influence of peer effects within the same network encouraging a fisherman to visit a site even in the presence of congestion. The research develops an empirical framework to decompose both motivations in a spatial discrete choice model in an effort to synthesize the congestion and agglomeration literature with the peer effects literature. Using Monte Carlo analysis, we investigate the robustness of our proposed estimation routine relative to the conventional random utility model (RUM), which ignores both peer and congestion and agglomeration effects, as well as the spatial sorting equilibrium model, which ignores peer effects. Our results indicate that both the RUM and sorting equilibrium models can be used to successfully investigate the presence of peer effects. However, the estimates of congestion effects are poor because of ignored correlated random effects. Recent literature has largely used Bayesian methods for this hard problem. We also explore the use of fixed-effects multinomial logit estimates to first estimate the base model and then extract generalized residuals to estimate the peer effects.

Who has won Rugby Union World Cups, and why? A sequential approach based on multinomial probit

Vincenzo Verardi
Université Libre de Bruxelles and Université de Namur
Brian O'Rourke
Final economic outcomes are often determined over consecutive process stages. The most prevalent approach is to model internodal transition and event probabilities using techniques such as sequential logit. Transition success for survivors at each stage is then regressed on explanatory variables using standard logit (allowing for correlation in the error terms). This seemingly unrelated approach benefits from methodological convenience. It crucially depends, however, on the assumption that at each stage, any unobservable factors are independent. We believe that error-term independence may often be an excessively strong assumption. We propose an alternative approach based on multinomial probit that does not rely on that very restrictive assumption. Implementation is no more demanding. We describe the procedure using Stata 13. To illustrate the usefulness of the method, we estimate the determinants of success for each stage at the Rugby World Cup.

Social network analysis using Stata

Thomas Grund
Linköping University
Social network analysis is one of the most rapidly growing fields of the social sciences. Social network analysis focuses on the relationships that exist between individuals (or other units of analysis), such as friendship, advice, trust, or trade relationships. Network analysis is concerned with the visualization and analysis of network structures, as well as with the importance of networks for individuals' propensities to adopt different behaviours. Until now, such analyses have been possible to perform using specialized software for network analysis only. This tutorial introduces the nwcommands, a software suite with over 80 Stata commands for social network analysis. The software includes commands (and dialog boxes) for importing, exporting, loading, saving, handling, manipulating, replacing, generating, visualizing, and animating networks. It also includes commands for measuring various properties of the networks and the individual nodes, for detecting network patterns and measuring the similarity of different networks, as well as advanced statistical techniques for network analysis including MR-QAP and ERGM.

Wishes and grumbles

William Gould & colleagues
William Gould, president of StataCorp and chief developer of Stata, and colleagues will be happy to receive wishes for developments in Stata and almost as happy to receive grumbles about the software.

Scientific organizers

Stephen P. Jenkins, London School of Economics

Roger B. Newson, Imperial College London

Logistics organizers

Timberlake Consultants, the official distributor of Stata in the United Kingdom, Brazil, Ireland, Middle East, Poland, Portugal, and Spain.