2015 UK Stata Users Group meeting

Centre for Econometric Analysis

Cass Business School

106 Bunhill Row

London EC1Y 8TZ

United Kingdom

Roger B. Newson

Imperial College London

Somers' D(Y | X) is an asymmetric measure of ordinal association between two variables Y and X, on a scale from –1 to 1. It is defined as the difference between the conditional probabilities of concordance and discordance between two randomly sampled (X, Y) pairs, given that the two X values are ordered. The **somersd** package enables the user to estimate Somers' D for a wide range of sampling schemes, allowing clustering or sampling probability weighting or restriction to comparisons within strata. Somers' D has the useful feature that a larger D(Y | X) cannot be secondary to a smaller D(W | X) with the same sign, enabling us to make scientific statements that the first ordinal association cannot be caused by the second. An important practical example, especially for public health scientists, is the case where Y is an outcome, X an exposure, and W a propensity score. However, an audience accustomed to other measures of association may be culture-shocked if we present associations measured using Somers' D. Fortunately, under some commonly used models, Somers' D is related monotonically to an alternative association measure, which may be more clearly related to the practical question of how much good we can do. These relationships are nearly linear (or log-linear) over Somers' D values from –0.5 to 0.5. We present examples with X and Y binary, with X binary and Y survival time, with X binary and Y conditionally normal, and with X and Y bivariate normal. Somers' D can, therefore, be used as a common currency for comparing numerous associations between variables not limited to a particular model.
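
The definition above can be made concrete with a small sketch. The Python function below is an illustration only, not the **somersd** package: it computes an unweighted, unclustered sample version of Somers' D(Y | X) by counting concordant and discordant pairs among pairs whose X values differ. The function name `somers_d` is made up for this example.

```python
from itertools import combinations


def somers_d(x, y):
    """Sample Somers' D(Y | X): (concordant - discordant) / pairs with distinct X."""
    concordant = discordant = ordered = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on X are outside the conditioning event
        ordered += 1
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # pairs tied on Y count in the denominator but in neither numerator term
    if ordered == 0:
        raise ValueError("no pairs with distinct X values")
    return (concordant - discordant) / ordered


# With X and Y both binary, D(Y | X) is the difference between two proportions:
# here 3/4 of the exposed and 1/4 of the unexposed have Y = 1.
exposure = [0, 0, 0, 0, 1, 1, 1, 1]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
print(somers_d(exposure, outcome))  # 0.5
```

The binary case shows one of the monotonic relationships mentioned in the abstract: when both variables are binary, Somers' D coincides with the risk difference.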

**Additional information**

uk15_newson.pdf

newson_examples1.do

Giovanni Cerulli

Research Institute on Sustainable Economic Growth

This paper presents **rscore**, a Stata module to compute unit-responsiveness scores using an iterated random-coefficient regression (RCR). The basic econometrics of this model can be found in Wooldridge (2002, pp. 638–642). The model estimated by **rscore** starts from a classical regression of Y, the target variable, on a series of factors X (the regressors) by assuming a different reaction (or responsiveness) of each unit to each factor contained in X. This is done using a random-coefficient regression (RCR), an approach in which the usual regression coefficients vary across units. The application of such an approach can convey new and interesting analytical findings compared with the traditional regression approach. In particular, by measuring a unit-specific regression coefficient for each regressor, this model allows for: (i) ranking units according to the level of the responsiveness score obtained; (ii) detecting factors that are more influential in driving unit performance; (iii) studying the general distribution (variety) of the factors' responsiveness scores across units. The knowledge of these idiosyncratic scores can also be exploited to test the presence of increasing, constant, or decreasing returns of Y to X in a straightforward and graphically easy-to-read way.

**Additional information**

uk15_cerulli.pdf

David Boniface

University College London

This paper illustrates the use of the recently developed Stata procedure **ipdpower** (by E. Kontopantelis) in designing a cluster randomized trial. The trial compared change (pre and post) between intervention and non-intervention care homes. Forty-nine residential care homes ranging in size from 3 to 112 beds (median 27 beds) were available to take part. Primary outcome measures were tooth cleaning (a dichotomy) and the Geriatric Oral Health Assessment Index (GOHAI, a continuous score). As is common in this situation, it was required to explore the effect on sample size and power of a range of values of cluster sizes, within-cluster correlation, between-group variation, and intraclass correlation. Ranges of parameter values for multiple runs of the simulation procedure were obtained from published results of studies with similar features, transformed where necessary through standard formulae. The final design resulted in a recommendation of use of 16 homes with estimated statistical power of 80% for comparison of intervention with non-intervention participants, adjusting for baseline values. Simulation can be recommended as a valuable approach because it accounts for all features of the design, it facilitates communication among members of the study team in balancing design features, and it provides a clear sense of the size required for the necessary statistical power.
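
The general idea of simulation-based power estimation for a cluster randomized trial can be sketched as follows. This is a deliberately simplified Python illustration, not **ipdpower** and not the design used in the study: it assumes equal cluster sizes, a continuous outcome with total variance 1, no baseline adjustment, and a normal-approximation test on cluster means. The function name and all parameter names are hypothetical.

```python
import math
import random
import statistics


def simulate_power(clusters_per_arm, cluster_size, effect, icc,
                   n_sims=500, seed=1):
    """Estimate power by simulation for a two-arm cluster randomized trial.

    Outcomes have total variance 1, split into a between-cluster part (icc)
    and a within-cluster part (1 - icc). Each simulated trial is analysed by
    comparing cluster means with a normal-approximation test at the 5% level,
    which is only approximate when the number of clusters is small.
    """
    rng = random.Random(seed)
    sd_between = math.sqrt(icc)
    sd_within = math.sqrt(1.0 - icc)
    rejections = 0
    for _sim in range(n_sims):
        means_by_arm = []
        for arm in (0, 1):
            cluster_means = []
            for _cluster in range(clusters_per_arm):
                cluster_effect = rng.gauss(0.0, sd_between)
                outcomes = [arm * effect + cluster_effect + rng.gauss(0.0, sd_within)
                            for _unit in range(cluster_size)]
                cluster_means.append(statistics.fmean(outcomes))
            means_by_arm.append(cluster_means)
        control, treated = means_by_arm
        difference = statistics.fmean(treated) - statistics.fmean(control)
        std_error = math.sqrt(statistics.variance(control) / clusters_per_arm
                              + statistics.variance(treated) / clusters_per_arm)
        if abs(difference / std_error) > 1.96:
            rejections += 1
    return rejections / n_sims
```

A call such as `simulate_power(8, 27, 0.5, 0.05)` would correspond to 16 homes of 27 residents each; the effect size and intraclass correlation in that call are placeholders, not values from the study. Rerunning the function over a grid of cluster counts, cluster sizes, and correlations gives the kind of sensitivity exploration the abstract describes.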

**Additional information**

uk15_boniface.pptx

Michael J. Grayling

MRC Biostatistics Unit Cambridge

Adrian P. Mander

MRC Biostatistics Unit Cambridge

The normal distribution holds significant importance in statistics. Much gathered real-world data either are, or are assumed to be, normally distributed. Today though, a considerable amount of statistical analysis performed is not univariate, but multivariate in nature. Consequently, the multivariate normal distribution is of increasing importance. However, the complexity of this distribution makes computational analysis almost certainly necessary, and thus there has been much research in developing efficient algorithms for its numerical analysis. Here we discuss our implementation of a specific algorithm in Mata that allows its distribution function and equi-coordinate quantiles to be identified seamlessly for any choice of location vector and positive semi-definite covariance matrix. Moreover, we detail new commands to efficiently compute its density and to generate pseudorandom variables. We then discuss the performance of our commands relative to the presently available alternatives, and we present how they provide greater generalization and improved computational speed. Finally, through the example of designing a group sequential clinical trial, we demonstrate how our commands can be used easily to solve real-world problems facing Stata users.

**Additional information**

uk15_grayling.pdf

Nicholas J. Cox

Durham University

Time series (and similar one-dimensional series) are more often irregularly spaced than many methods texts or courses admit. Even with a plan of regular measurements, gaps can arise for many human or inhuman reasons, while some series are naturally irregular. Interpolation of values between known values is a centuries-old need, but one neglected by official Stata, which offers only linear interpolation and cubic spline interpolation (in Mata). I review additional user-written commands for interpolation, including those for cubic, nearest-neighbour, and piecewise cubic Hermite methods available from SSC.
Beyond interpolation of irregular series lie the questions of characterizing the structure of such series and smoothing in various ways. One useful tool standard in spatial statistics is the variogram, which relates dissimilarity as squared differences between values to their separation in time or distance in space. Diggle and others have shown uses for variograms in time series and longitudinal data analysis. I discuss user-written Stata commands for variogram calculation, plotting, and use in relation to exploratory data analysis on the one hand and smoothing on the other.
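
To illustrate the variogram idea for an irregularly spaced series, here is a minimal Python sketch. It is not one of the Stata commands discussed in the talk. It uses the common semivariogram convention (half the mean squared difference) and groups pairs of observations into bins of separation; the function name and the binning rule are assumptions made for this example.

```python
from collections import defaultdict


def empirical_variogram(times, values, bin_width):
    """Return {bin midpoint: half the mean squared difference} over all pairs.

    Pairs are grouped by their separation in time, so the observation times
    do not need to be regularly spaced.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            lag = abs(times[j] - times[i])
            k = int(lag // bin_width)  # index of the separation bin
            sums[k] += 0.5 * (values[j] - values[i]) ** 2
            counts[k] += 1
    return {(k + 0.5) * bin_width: sums[k] / counts[k] for k in sorted(counts)}


# Three observations at irregular times 0, 1, and 3.
print(empirical_variogram([0, 1, 3], [1, 2, 4], bin_width=2))
# {1.0: 0.5, 3.0: 3.25}
```

Plotting the returned values against the bin midpoints shows how dissimilarity grows with separation, which is the exploratory use mentioned above.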

**Additional information**

uk15_cox.ppt

Deanna Jannat-Khah

Weill Cornell Medical College

Michelle Unterbrink

Weill Cornell Medical College

Margaret McNairy

Weill Cornell Medical College

Dan Fitzgerald

Weill Cornell Medical College

Arthur Evans

Weill Cornell Medical College

Samuel Pierre

GHESKIO

Jean Pape

Weill Cornell Medical College and GHESKIO

Loss to follow-up is unavoidable in many public health studies. Tracing all subjects may be impractical or prohibitively expensive. Traditional methods, including Kaplan–Meier analysis and inverse-probability weighting (IPW), produce biased estimates if loss is not independent of survival. Multiple imputation with chained equations (MICE) provides an acceptable, robust, and cost-saving solution to this problem for HIV research in developing countries with limited resources. To illustrate its utility, we applied MICE to ascertain the outcome status of people who were lost to follow-up within a cohort of N=910 HIV-positive people followed for 10 years in Port-au-Prince, Haiti. In this study, 17% (n=156) were lost to follow-up and 8% (n=71) transferred facilities. Contact tracing was performed and 45 of the 156 subjects identified as lost to follow-up were found: 37 alive and 8 deceased. Analysis using IPW based on the traced subjects predicted that 63% of all subjects were alive at 10 years (95% CI 0.59, 0.67).
Results from MICE predicted that within 6 months, 12% (95% CI 0.10–0.14) of those who were lost to follow-up or transferred were dead and 88% (95% CI 0.86–0.90) were alive. At 10 years, 33% were predicted to be dead (95% CI 0.29–0.36) and 67% (95% CI 0.64–0.71) were predicted to be alive. We found MICE to be more robust in predicting status because it allowed us to impute missing data so that we had the maximum number of observations to perform regression analyses. Additionally, the results were easier to interpret, less likely to be biased, and provided an interesting insight into a problem that is often commented upon in the extant literature. Overall, MICE is a useful cost-saving method for studying survival compared with contact tracing for HIV research in developing countries.

**Additional information**

uk15_jannatkhah.pdf

Tra Pham

University College London

Irene Petersen

University College London

Tim P. Morris

University College London

Ethnicity is an important factor to be considered in many epidemiological studies because of its association with inequality in disease prevalence and the utilization of healthcare. Ethnicity recording has been incorporated in primary care electronic health records and, therefore, is available in many large UK primary care databases, such as The Health Improvement Network (THIN). However, because primary care data are routinely collected to serve clinical purposes, a large amount of data that are relevant for research purposes, including ethnicity, is often missing. A popular approach is to use multiple imputation, but standard multiple imputation does not give plausible estimates of the ethnicity distribution in THIN compared with the general UK population. However, census data can be used to form weights to use in multiple imputation such that the correct ethnicity distribution is recovered. We will describe how the method of weighted multiple imputation of missing data is implemented using Stata's **mi impute** suite, note some issues, and introduce a new procedure to implement the method for multiple incomplete variables that require different imputation weights. Finally, we will give an example showing how the method works when ethnicity is used as an explanatory variable in a cohort study.

**Additional information**

uk15_pham.pdf

Maarten Buis

University of Konstanz

There is increasing criticism of the ways in which the raw coefficients
and odds ratios from logistic regression have been used. The argument is
that logistic regression models offer a latent propensity of success and that
the scale of that latent variable is fixed by fixing the variance of the
error term. If one adds a variable to a model, the variance of the
residual is likely to decrease, and the scale of the dependent variable
thus changes. Comparing models with and without that additional variable
thus becomes problematic. Similarly, a comparison of models in groups
that are likely to have different residual variances will also be
problematic. However, I will argue that logistic regression has an
unusual dependent variable: a probability that measures how certain we
are that an event of interest happens. This degree of certainty is a
function of how much information we have, which in the case of logistic
regression is captured by the variables we add to the model. If the
dependent variable is interpreted in that way, many of the problems with
logistic regression turn out to be desirable properties of the logistic
regression model.

**Additional information**

uk15_buis_cando.pdf

Andrew Maurer

Quantitative Risk Management

With more and more data being stored by organizations across industries, from academia to health care to banking, along with plummeting storage and RAM costs, there is a growing need for tools to analyze "big data". The world is moving from needing to analyze megabytes of data to needing to analyze many gigabytes. While Stata is very user friendly, many of the most basic commands (**summarize**, **sample**, **collapse**, **encode**, and others) are not optimized for speed. As of Stata 14, these commands all rely on sorting, which makes them tens or even hundreds (in the case of **sample**) of times slower than what is possible with better algorithms. In this presentation, I illustrate alternative algorithms along with coded examples in Stata, Mata, and C++ plugins that may be used to more quickly analyze big data. **fastsample** and **fastcollapse** are available from the SSC.

João Santos Silva

University of Surrey

Quantile regression is increasingly used by practitioners, but there are still some misconceptions about how difficult it is to obtain valid standard errors in this context. In this presentation, I discuss the estimation of the covariance matrix of the quantile regression estimator, focusing special attention on the case where the regression errors may be heteroskedastic or clustered. I discuss specification tests to detect heteroskedasticity and intra-cluster correlation, and I present small simulation studies to illustrate the finite-sample performance of the tests and of the covariance matrix estimators. I conclude the presentation with a brief description of **qreg2**, which is a wrapper for **qreg** that implements all the methods discussed in the presentation.

**Additional information**

uk15_santossilva.pdf

Philippe Van Kerm

Luxembourg Institute of Socio-Economic Research (LISER)

This presentation illustrates three practical uses of influence functions (IFs) in Stata. First (and most obviously), inspection of IFs helps detect influential sample observations. I show how this can be done in practice and how similar this is to examining jackknife replicates. Second, IFs make it easy to calculate (asymptotic) standard errors and confidence intervals for a wide range of statistics. I illustrate how this can be done in Stata with the **total** command to account for complex survey design easily. Third, application of "recentered influence function (RIF) regression" has recently been advocated to approximate the impact of covariates on (unconditional) distribution statistics. I demonstrate this use of IFs in Stata and discuss interpretation of RIF regression model coefficients. Empirical applications are to income distribution analysis. Several user-written utilities and commands are illustrated along the way.
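
The first two uses can be sketched in a few lines. The Python code below is an illustration under simple assumptions (independent observations, no survey weights or design features) and is not one of the utilities shown in the talk. It uses the textbook influence functions of the mean and of the variance, and the function names are made up for this example.

```python
import math


def influence_mean(x):
    """Influence function of the mean, evaluated at each observation."""
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]


def influence_variance(x):
    """Influence function of the (population) variance at each observation."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((xi - mean) ** 2 for xi in x) / n
    return [(xi - mean) ** 2 - variance for xi in x]


def se_from_influence(if_values):
    """Asymptotic standard error: sample SD of the IF values divided by sqrt(n).

    The empirical IF values average to zero, so their sum of squares over
    n - 1 is their sample variance.
    """
    n = len(if_values)
    return math.sqrt(sum(v * v for v in if_values) / (n * (n - 1)))


incomes = [1, 2, 3, 4, 5]
if_mean = influence_mean(incomes)
print(se_from_influence(if_mean))           # equals sd(incomes) / sqrt(n)
print(max(if_mean, key=abs))                # IF value of the most influential observation
print(se_from_influence(influence_variance(incomes)))
```

For the mean, this reproduces the usual standard error, which is a convenient check before applying the same recipe to statistics whose variance is harder to derive directly.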

**Additional information**

uk15_vankarm.pdf

Christopher F. Baum

Boston College and DIW Berlin

Stata 13 added a very important feature for macroeconomists: the **forecast** suite of commands that implements the definition of a model, consisting of various estimated equations and potentially nonlinear identities. Stata's features include model solution, dynamic forecasting, scenario analysis, and stochastic simulation. I report on my attempt to apply the **forecast** suite to a well-known large-scale macroeconomic model. I discuss the challenges related to use of these features in a much more complex context than that illustrated in the manual's examples. I also suggest enhancements that would improve **forecast**'s capabilities in comparison with other popular forecasting tools.

**Additional information**

uk15_baum.pdf

Maarten Buis

University of Konstanz

Log-linear models for cross-tabulations are models for describing and testing patterns in cross-tabulations. These cross-tabulations could have two dimensions (e.g. father's occupation versus son's occupation) or more than two dimensions (e.g. father's occupation versus son's occupation for different cohorts and different countries). A wide range of patterns can be investigated and tested with these models. For example, one can investigate whether the dimensions are independent (e.g. father's occupation has no relevance for the son's occupation); whether the dimensions are independent except for the diagonals (e.g. sons are more likely to enter the occupation of their father, but the father has no influence once the son chooses to do something other than his father's occupation); or assume that the categories are ordinal, estimate a scale for each dimension, and summarize the strength of the association with one number, which can be compared across cohorts or countries. The purpose of this talk is to give an overview of this family of models, discuss how to trick Stata (in particular, **poisson** and **gsem**) into estimating these models, and show how to get interpretable parameters out of these models.

**Additional information**

uk15_buis_loglinear.pdf

Yulia Marchenko

StataCorp

Stata 14 provides a suite of commands for performing Bayesian analysis. Bayesian analysis is a statistical paradigm that answers research questions about unknown parameters using probability statements. For example, what is the probability that a person accused of a crime is guilty? What is the probability that there is a positive effect of schooling on wage? What is the probability that the odds ratio is between 0.3 and 0.5? And many more. In my presentation, I will describe Stata's Bayesian suite of commands and demonstrate its use in various applications.

**Additional information**

uk15_marchenko.pdf

Robert Grant

St George's, University of London & Kingston University

Over the last three years, a new package for Bayesian modeling called Stan (after Stanislaw Ulam, co-inventor of the Monte Carlo method) has been developing quickly and making an impact on computing for complex Bayesian models. By translating the model into C++ and then compiling that, it can run much faster than BUGS. A particular benefit is for simulation studies, because the model needs to be compiled only once. Furthermore, it includes a much faster and better mixing algorithm (NUTS: the No U-Turn Sampler), especially for correlated parameters that Gibbs samplers like BUGS cope with badly. I present the program **StataStan**, which sends your data and specifications to Stan, displays results, and can read the chains of samples back into Stata. There are also specific commands to run the commonly used models in the BUGS and Stan user manuals with your own data, avoiding the need to write the Stan model.

**Additional information**

uk15_grant.pdf

Alexander Zlotnik

Technical University of Madrid

The integration of Stata with web applications can be of great use in some contexts. One such scenario is to make user-written Stata commands available directly through a webpage from any web-enabled device, such as a smartphone, tablet computer, personal digital assistant (PDA), or any personal computer with a web browser. This would allow reaching a large and diverse audience. Another scenario is the integration of subroutines written in Stata or Mata in web applications, which is desirable in organizations where statistical applications are developed by one team with Stata, while the rest of the business logic and front-end applications are developed by another team using different technologies. If Stata programs can be used directly, the often costly translation from Stata into other programming languages can be avoided, thus saving development resources and time and eliminating the errors and discrepancies due to translation mistakes and limitations of target languages.
I demonstrate an approach for executing user-written commands on Stata/IC, Stata/SE, and Stata/MP through a web application based on the WAMP stack (Microsoft Windows, Apache, MySQL, PHP). Then, I introduce the adjustments needed for other operating systems, web servers and server-side scripting programming languages. I describe the requirements for Stata user-written commands accessible through web applications, their limitations, the bidirectional communication between Stata and generic web applications, possible solutions for concurrent execution scenarios, as well as the transformation of Stata dialog-box (DLG) files into web-ready HTML, CSS, and JavaScript interfaces. Finally, I mention web application security principles, Stata-based web services, and software licensing approaches.

**Additional information**

uk15_zlotnik.pdf

Ben Jann

University of Bern

Percentile shares provide an intuitive and easy-to-understand way to analyze income or wealth distributions. A celebrated example is the top income shares reported in the works of Thomas Piketty and colleagues. Moreover, series of percentile shares, defined as differences between Lorenz ordinates, can be used to visualize whole distributions or changes in distributions. In this talk, I present a new command called **pshare** that computes and graphs percentile shares (or changes in percentile shares) from individual-level data. The command also provides confidence intervals and supports survey estimation.
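
The definition of percentile shares as differences between Lorenz ordinates can be sketched as follows. This Python code is an illustration, not the **pshare** command: it ignores weights and survey design, and it assumes that when a cutoff falls inside an observation, that observation's value is split proportionally. The function names are made up for this example.

```python
def lorenz_ordinate(sorted_values, p):
    """Share of the total held by the poorest fraction p (values sorted ascending)."""
    n = len(sorted_values)
    total = sum(sorted_values)
    position = p * n
    whole = int(position)         # observations lying entirely below the cutoff
    partial = position - whole    # fraction of the next observation below the cutoff
    cumulative = sum(sorted_values[:whole])
    if whole < n:
        cumulative += partial * sorted_values[whole]
    return cumulative / total


def percentile_shares(values, cutoffs):
    """Shares of the total between consecutive cutoffs (differences of Lorenz ordinates)."""
    ordered = sorted(values)
    ordinates = [lorenz_ordinate(ordered, p) for p in cutoffs]
    return [upper - lower for lower, upper in zip(ordinates, ordinates[1:])]


incomes = [1, 2, 3, 4, 10]
# Bottom 50%, next 40%, and top 10% of the distribution.
shares = percentile_shares(incomes, [0.0, 0.5, 0.9, 1.0])
print([round(s, 3) for s in shares])  # [0.225, 0.525, 0.25]
```

Because the shares are differences of ordinates on one Lorenz curve, they sum to 1 whenever the cutoffs run from 0 to 1.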

**Additional information**

uk15_jann.pdf

Tim Morris

University College London

Babak Choodari-Oskooei

University College London

Statisticians and econometricians developing new methods are keen for their methods to be adopted, and releasing user-friendly software plays an important role in uptake. Methods that were not initially applied much, and became so after software implementations, include Cox's proportional-hazards model, multiple imputation, and propensity-score matching. It is easy to release packages to the Stata community via the Boston College Statistical Software Components (SSC) archive, but gauging the uptake can be difficult. Stata's **ssc whatshot** command lists the number of hits for a recent month for packages available on SSC. The new **ssccount** command goes further, obtaining monthly files of hits (from July 2007 when records began) for specified authors and packages, and optionally plots the number of hits over time. This can give authors an impression of how well their commands are being used. Funders are increasingly asking for evidence of impact, and thus **ssccount** provides a useful soft measure.

**Additional information**

uk15_morris.pdf

Arnab Bhattacharjee

Heriot-Watt University

Robert L. Hicks

College of William and Mary

Kurt E. Schnier

University of California Merced

Agents may consider information and other signals from their peers
(especially close peers) when making their spatial site choices.
However, the presence of other agents in a spatial location may
generate congestion or agglomeration effects. Disentangling the
potential peer effects from issues of congestion is difficult because it
is hard to ascertain whether the observed congestion effects are a
result of observing others' behavior or the influence of peer effects
within the same network encouraging a fisherman to visit a site even
in the presence of congestion. The research develops an empirical
framework to decompose both motivations in a spatial discrete choice
model in an effort to synthesize the congestion and agglomeration
literature with the peer effects literature. Using Monte Carlo
analysis we investigate the robustness of our proposed estimation
routine to the conventional random utility model (RUM), which ignores
both peer and congestion and agglomeration effects, as well as the spatial
sorting equilibrium model, which ignores peer effects. Our results
indicate that both the RUM and sorting equilibrium models can be used
to successfully investigate the presence of peer effects. However,
the estimates of congestion effects are poor because of ignored correlated
random effects. Recent literature has largely used Bayesian
methods for this hard problem. We also explore the use of
fixed-effects multinomial logit estimates to first estimate the base model
and then extract generalized residuals to estimate the peer effects.

**Additional information**

uk15_bhattacharjee.pdf

Vincenzo Verardi

Université Libre de Bruxelles and Université de Namur

Brian O'Rourke

Final economic outcomes are often determined over consecutive process stages. The most prevalent approach is to model internodal transition and event probabilities using techniques such as sequential logit. Transition success for survivors at each stage is then regressed on explanatory variables using standard logit (allowing for correlation in the error terms). This seemingly unrelated approach benefits from methodological convenience. It crucially depends, however, on the assumption that at each stage, any unobservable factors are independent. We believe that error-term independence may often be an excessively strong assumption. We propose an alternative approach based on multinomial probit that does not rely on that very restrictive assumption. Implementation is no more demanding. We describe the procedure using Stata 13. To illustrate the usefulness of the method, we estimate the determinants of success for each stage at the Rugby World Cup.

**Additional information**

uk15_verardi.pdf

Thomas Grund

Linköping University

Social network analysis is one of the most rapidly growing fields of the social sciences. Social network analysis focuses on the relationships that exist between individuals (or other units of analysis), such as friendship, advice, trust, or trade relationships. Network analysis is concerned with the visualization and analysis of network structures, as well as with the importance of networks for individuals' propensities to adopt different behaviours. Until now, such analyses have been possible to perform using specialized software for network analysis only. This tutorial introduces the **nwcommands**, a software suite with over 80 Stata commands for social network analysis. The software includes commands (and dialog boxes) for importing, exporting, loading, saving, handling, manipulating, replacing, generating, visualizing, and animating networks. It also includes commands for measuring various properties of the networks and the individual nodes, for detecting network patterns and measuring the similarity of different networks, as well as advanced statistical techniques for network analysis including MR-QAP and ERGM.

**Additional information**

uk15_grund.pdf

William Gould & colleagues

StataCorp

William Gould, president of StataCorp and
chief developer of Stata, and colleagues will be happy to
receive wishes for developments in Stata and almost as happy to
receive grumbles about the software.

Stephen P. Jenkins, London School of Economics

Roger B. Newson, Imperial College London

Timberlake Consultants, the official distributor of Stata in the United Kingdom, Brazil, Ireland, Middle East, Poland, Portugal, and Spain.