*Last updated: 27 September 2007*

Centre for Econometric Analysis

Cass Business School

106 Bunhill Row

London EC1 8TZ

United Kingdom

Roger Newson

Imperial College

The cendif module is part of the somersd package, and calculates
confidence intervals for the Hodges–Lehmann median difference
between values of a variable in two subpopulations. The traditional
Lehmann formula, unlike the formula used by cendif, assumes that the two
subpopulation distributions are different only in location, and that the
subpopulations are therefore equally variable. The cendif formula
therefore contrasts with the Lehmann formula as the unequal-variance
t-test contrasts with the equal-variance t-test. In a simulation study,
designed to test cendif to destruction, the performance of cendif was
compared to that of the Lehmann formula, using coverage probabilities and
median confidence interval width ratios. The simulations involved sampling
from pairs of Normal or Cauchy distributions, with subsample sizes ranging
from 5 to 40, and between-subpopulation variability scale ratios ranging
from 1 to 4. If the sample numbers were equal, then both methods gave
coverage probabilities close to the advertised confidence level. However,
if the sample numbers were unequal, then the Lehmann coverage
probabilities were over-conservative if the smaller sample was from the
less variable population, and over-liberal if the smaller sample was from
the more variable population. The cendif coverage probability was usually
closer to the advertised level, if the smaller sample was not very small.
However, if the sample sizes were 5 and 40, and the two populations were
equally variable, then the Lehmann coverage probability was close to its
advertised level, while the cendif coverage probability was over-liberal.
The cendif confidence interval, in its present form, is therefore robust
both to non-Normality and to unequal variability, but may be less robust to
the possibility that the smaller sample size is very small. Possibilities
for improvement are discussed.
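
As a hedged illustration of the quantity cendif estimates (the talk concerns the confidence interval, not the point estimate), the Hodges–Lehmann median difference is the median of all between-sample pairwise differences. A minimal Python sketch, not the somersd code:

```python
from itertools import product
from statistics import median

def hodges_lehmann_diff(x, y):
    """Median of all pairwise differences x_i - y_j: the
    Hodges-Lehmann estimate of the median difference between
    the two subpopulations that cendif builds a CI around."""
    return median(xi - yj for xi, yj in product(x, y))

# Example: pairwise differences are (1, 1, 2, 2, 3, 3), median 2.
print(hodges_lehmann_diff([1, 2, 3], [0, 0]))  # prints 2.0
```

The confidence interval itself, which is where the equal- versus unequal-variance formulas differ, requires inverting a rank statistic and is the package's contribution.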

**Additional information**

newson-ohp1.pdf (presentation slides)


Adrian Mander

MRC Human Nutrition Research

In an attempt to learn Mata I have translated the LARS package, written for
R by Trevor Hastie and Brad Efron, into Mata. The LARS package is an
efficient implementation of an entire lasso sequence with the cost of a
single least-squares estimation. Mata and R/S+ are remarkably similar in
syntax and on the whole can be translated by altering the syntax
“wording”; however, there was an occasional need for additional
functions. Translation is certainly not the best approach to learning a new
language. I shall describe the new Stata command and apply this
model-selection approach to some nutrition data.
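
LARS itself is intricate; as a rough illustration of what the lasso computes, here is plain cyclic coordinate descent in Python — not LARS, and not the Mata translation described in the talk:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator, the core of each lasso update."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic
    coordinate descent: an alternative to LARS for computing a
    single point on the lasso path (LARS yields the whole path)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with predictor j's contribution removed.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return beta
```

For lam = 0 this converges to ordinary least squares; for large lam every coefficient is shrunk exactly to zero, which is the model-selection behaviour exploited in the talk.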

Tom Palmer

Department of Health Sciences, Leicester University

WinBUGS is a program for Bayesian model fitting by Gibbs sampling. WinBUGS
has limited facilities for data handling, whereas Stata has excellent data
handling but no routines for Bayesian analysis; therefore, much can be
gained by running Stata and WinBUGS together. This talk explains the use of
the winbugsfromstata package, described in Thompson et al. (2006), a set of
programs that enable data to be processed in Stata and then passed to
WinBUGS for model fitting. Finally, the results can be read back into Stata
for further processing. Examples will be chosen to illustrate the range of
models that can be fitted within WinBUGS and where possible the results will
be compared with frequentist analyses in Stata. Issues to consider when
fitting models by Markov chain Monte Carlo methods will be discussed,
including assessment of convergence, length of burn-in, and the form and
impact of prior distributions. Reference: Thompson, J., T. Palmer, and S.
Moreno. 2006. Bayesian analysis in Stata with WinBUGS. The Stata Journal
6(4): 530–549.

**Additional information**

palmer_winbugsfromstata.presentation.pdf (presentation)

palmer_winbugsfromstata.slides.pdf (presentation slides)


Neil Shephard

University of Sheffield

An overview of using Stata to perform candidate gene association analysis
will be presented. Areas covered will include data manipulation,
Hardy–Weinberg equilibrium, calculating and plotting linkage
disequilibrium, estimating haplotypes, and interfacing with external
programs.
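
As a hedged sketch of one of the calculations covered, a Hardy–Weinberg equilibrium check compares observed genotype counts with those expected from the allele frequencies; the function below is an illustrative Python rendering, not the Stata commands from the talk:

```python
def hwe_chi2(n_aa, n_ab, n_bb):
    """Pearson chi-square statistic (1 df) comparing observed genotype
    counts (AA, Aa, aa) with Hardy-Weinberg expected counts n*p^2,
    2*n*p*q, n*q^2, where p is the sample frequency of allele A."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # allele A frequency
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Counts exactly at equilibrium (p = 0.5) give a statistic of zero.
print(hwe_chi2(25, 50, 25))  # prints 0.0
```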

Maarten Buis

Department of Social Research Methodology,
Vrije Universiteit Amsterdam

Stata has long had the capability of imposing the constraint that
parameters are a linear function of one another. It does not have the
capability to impose the constraint that if a set of parameters changes (due
to interaction terms), the parameters maintain the relative differences among
them. Such a proportionality constraint has a nice interpretation: the
constrained variables together measure some latent concept. For instance, if
a proportionality constraint is imposed on the variables father’s
education, mother’s education, father’s occupational status, and
mother’s occupational status, then together they might be thought to
measure the latent variable family socioeconomic status. With the
proportionality constraint one can estimate the effect of the latent
variable and how strongly each observed variable loads on the latent variable
(i.e., does the mother, the father, or the highest-status parent matter
most?). Such a model is a special case of a so-called MIMIC model. In
principle these models can be estimated using standard
ml algorithms; however, as the parameters are rather
strongly correlated, ml has a hard time finding the
maximum. An EM algorithm is proposed that will find the maximum. This
maximum is then fed into ml to get the right
standard errors.

**Additional information**

buis_propcnsreg.pdf (presentation slides)


Alfonso Miranda

Department of Economics, Keele University

Three different methods have been suggested in the econometrics literature
to deal with the initial conditions problem in dynamic Probit models for
panel data. Heckman (1981) suggests approximating the reduced-form marginal
probability of the initial state with a Probit model and allowing free
correlation between unobserved individual heterogeneity entering the initial
conditions and the main dynamic equations. Alternatively, Wooldridge (2002)
suggests writing a dynamic model conditional on the first observation and
specifying a distribution for the unobserved individual heterogeneity term
conditional on the initial state and any other exogenous explanatory
variables. Finally, Orme (1996) introduces a two-step bias-corrected
procedure that is locally valid when the correlation between unobserved
individual heterogeneity determining the initial state and the dynamic
Probit equations is close to zero. Orme suggests that this two-step
procedure can perform well even when such correlation is strong. I present
some results from a Monte Carlo simulation study comparing the performance
of these three methods using small and medium sample sizes and low and
high correlation among unobservables.

**Additional information**

miranda_Dprob_pe.pdf (presentation slides)


Ian White

MRC Biostatistics Unit, Cambridge

A new command metamiss performs meta-analysis when
some or all studies have missing data. A variety of assumptions are
available, including missing-at-random, missing=failure, worst and best
cases, and incorporating a user-specified prior distribution for the degree
of informative missingness. This is joint work with Julian Higgins.

**Additional information**

Ian_White.ppt (presentation slides)


Tommaso Nannicini

Universidad Carlos III de Madrid

This article presents a Stata program (sensatt)
that implements the sensitivity analysis for matching estimators proposed by
Ichino, Mealli and Nannicini (2007). The analysis simulates a potential
confounder in order to assess the robustness of the estimated treatment
effects with respect to deviations from the Conditional Independence
Assumption (CIA). The program makes use of the commands for propensity-score
matching (att*) developed by Becker and Ichino
(2002). An example is given by using the National Supported Work (NSW)
demonstration, widely known in the program evaluation literature.

**Additional information**

pres_stata_2.pdf (presentation slides)


Shuk-Li Man

Center for Sexual Health and HIV Research,
University College London

Using loops and macros in Stata holds many advantages: reducing the
length of your do-files, allowing errors to be tracked and fixed quickly
and efficiently, making do-files run faster, and providing re-usable
programs that can be used in subsequent data analyses with similar
scenarios. In this presentation we shall cover the following areas:

- Storing global and local macros within Stata, with applied examples including storing categories of a variable, storing data summaries, and storing the names of files within a directory.
- The commands foreach, forval, and while, with applied examples.
- Applied examples of how to combine macros with loops, showing why this can be useful.

**Additional information**

man_Stata_user_groupoct2007v5.ppt (presentation slides)

Carlo Fiorio

Department of Economic Sciences,
Università degli Studi di Milano

Stephen P. Jenkins

Institute for Social and Economic Research,
University of Essex

This talk discusses ineqrbd, a program for OLS
regression-based decomposition suggested by G.S. Fields (“Accounting
for Income Inequality and Its Change: A New Method, with Application to the
Distribution of Earnings in the United States”, Research in Labor
Economics, 2003). It provides an exact decomposition of the inequality of
total income into inequality contributions from each of the factor
components (or determinants) of total income.
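
As a hedged sketch of the decomposition (in Python, not the ineqrbd code itself): each factor's share of the inequality of total income is s_k = b_k·cov(x_k, y)/var(y), and by construction the factor shares plus the residual share sum to exactly one:

```python
import numpy as np

def fields_shares(X, y):
    """Fields (2003) regression-based inequality shares: the proportion
    of var(y) attributed to regressor k is s_k = b_k*cov(x_k, y)/var(y).
    Returns one share per column of X plus the residual share;
    the shares sum to 1 exactly (an exact decomposition)."""
    Xc = np.column_stack([np.ones(len(y)), X])    # add intercept
    b = np.linalg.lstsq(Xc, y, rcond=None)[0]
    resid = y - Xc @ b
    var_y = np.var(y)
    shares = [b[k + 1] * np.cov(X[:, k], y, bias=True)[0, 1] / var_y
              for k in range(X.shape[1])]
    shares.append(np.cov(resid, y, bias=True)[0, 1] / var_y)
    return shares
```

The exactness follows from cov(y, y) = sum_k b_k·cov(x_k, y) + cov(e, y), since fitted value plus residual reproduces y.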

**Additional information**

fiorio_ineqrbd_UKSUG07.pdf (presentation slides)


Ben Jann

ETH Zürich

A new package called adolist is presented.
adolist is a tool to create, install, and uninstall
lists of user ado-packages (“adolists”). For example,
adolist can create a list of all user packages
installed on a system and then install the same packages on another system.
Moreover, adolist can be used to put together
thematic lists of packages such as, say, a list on income inequality
analysis or time-series add-ons, or the list of “41 user ados
everyone should know”. Such lists can then be shared with others,
who can easily install and uninstall the listed packages using the
adolist command.

**Additional information**

jann_London07_adolist.pdf


Bill Rising

StataCorp

One of Stata’s great strengths is its data management abilities. When
either building or sharing datasets, some of the most time-consuming
activities are validating the data and writing documentation for the data.
Much of this effort could be avoided if datasets were self-contained,
i.e., if they could validate themselves. I will show how to achieve this
goal within Stata. I will demonstrate a package of commands for attaching
validation rules to the variables themselves, via characteristics, along
with commands for running error checks and marking suspicious observations
in the dataset. The validation system is flexible enough that simple checks
continue to work even if variable names change or if the data are reshaped,
and it is rich enough that validation may depend on other variables in the
dataset. Since the validation is at the variable level, the self-validation
also works if variables are recombined with data from other datasets. With
these tools, Stata’s datasets can become truly self-contained.

**Additional information**

rising_ckvarTalk.beamer.pdf (presentation slides)


Austin Nichols

Urban Institute

A brief survey of clustered errors, focusing on estimating cluster–robust
standard errors: when and why to use the cluster
option (nearly always in panel regressions), and implications. Additional
topics may include using svyset to specify
clustering, multidimensional clustering, clustering in meta-analysis, how
many clusters are required for asymptotic approximations, testing
coefficients when the variance-covariance matrix has less than full rank, and
testing for clustering of errors.
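
As a hedged sketch of the estimator under discussion (Python, not Stata's implementation, and omitting finite-sample corrections such as Stata's G/(G-1) factor), the cluster-robust sandwich sums scores within clusters before forming the middle "meat" matrix:

```python
import numpy as np

def cluster_robust_se(X, y, cluster):
    """OLS point estimates plus cluster-robust (sandwich) standard
    errors: V = (X'X)^-1 [sum_g X_g' u_g u_g' X_g] (X'X)^-1, where
    the middle sum runs over clusters g rather than observations,
    allowing arbitrary correlation of errors within a cluster."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        mask = cluster == g
        score = X[mask].T @ u[mask]      # cluster-level score (k-vector)
        meat += np.outer(score, score)
    V = bread @ meat @ bread
    return b, np.sqrt(np.diag(V))
```

With every observation in its own cluster this collapses to the ordinary heteroskedasticity-robust (HC0) estimator, which is one way to see what clustering adds.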

**Additional information**

nichols_crse.pdf (presentation slides)


Nick Cox

Department of Geography, Durham University

Describing batches of data in terms of their order statistics or quantiles
has long roots but remains underrated in graphically based exploration,
data reduction, and data reporting. Hosking in 1990 proposed L-moments based
on quantiles as a unifying framework for summarizing distribution
properties, but despite several advantages they still appear to be very
little known outside their main application areas of hydrology and
climatology. Similarly, the mode can be traced to the prehistory of
statistics, but it is often neglected or disparaged despite its value as a
simple descriptor and even as a robust estimator of location. This paper
reviews and exemplifies these approaches with detailed reference to Stata
implementations. Several graphical displays are discussed, some novel.
Specific attention is given to the use of Mata for programming core
calculations directly and rapidly.
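
As a hedged illustration of Hosking's sample L-moments (a Python sketch, not the Mata routines the talk describes): they are linear combinations of probability-weighted moments of the order statistics:

```python
from math import comb

def sample_lmoments(data):
    """First two sample L-moments via probability-weighted moments
    b_r = (1/n) * sum_i [C(i-1, r)/C(n-1, r)] * x_(i) over the sorted
    sample: l1 = b0 (the mean) and l2 = 2*b1 - b0, a dispersion
    measure equal to half the mean absolute pairwise difference."""
    x = sorted(data)
    n = len(x)
    b0 = sum(x) / n
    b1 = sum(comb(i, 1) / comb(n - 1, 1) * x[i] for i in range(n)) / n
    return b0, 2 * b1 - b0
```

For the sample {1, 2, 3} the pairwise absolute differences are 1, 2, 1 with mean 4/3, so l2 = 2/3, matching the formula.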

**Additional information**

njctalkNASUG2007.zip (presentation in smcl, plus ado- and do-files and datasets)


Philippe Van Kerm

CEPS/INSTEAD, G.-D. Luxembourg

Distributive analysis typically consists of estimating summary measures
capturing aspects of the distribution of sample points beyond central
tendency. Stochastic dominance analysis is also central for comparisons of
distributions. Unfortunately, data contamination, and extreme data more
generally, are known to be highly influential in both types of
analyses, much more so than for central-tendency analysis, and
potentially jeopardize the validity of one’s conclusions even with
relatively large sample sizes. This presentation illustrates the problems
raised by extreme data in distributive analysis and describes robust
parametric and semi-parametric approaches for addressing them. The methods are
based on the use of “optimal B-robust” (OBRE) estimators as an
alternative to maximum likelihood. A prototype Stata implementation of
these estimators is described, and empirical examples in income distribution
analysis show how robust inequality estimates and dominance checks can be
derived from these parametric or semiparametric models.

**Additional information**

vankerm-uksug_slides.pdf (presentation slides)


Vince Wiggins

StataCorp

We will take a quick tour of the graph editor, covering the basic concepts:
adding text, lines, and markers; changing the defaults for added objects;
changing properties; working quickly by combining the contextual toolbars
with the more detailed object dialogs; and using the object browser
effectively. Leveraging these concepts, we’ll discuss how and when to use
the grid editor and techniques for combined graphs and by-graphs. Finally,
we will look at some tricks and features that aren’t apparent at first
blush.

Kit Baum

Boston College

The talk will present the instrumental variables (IV) regression estimator,
a key tool for the estimation of relationships incorporating
endogeneity/two-way causality or measurement error, focusing on the
Baum/Schaffer/Stillman ivreg2 package and Stata
10’s new ivregress command. The IV or
two-stage least squares estimator is a special case of a Generalized Method
of Moments (GMM) estimator. GMM techniques are appropriate when non-i.i.d.
disturbances are encountered. We will discuss tests of overidentification,
weak instruments, endogeneity/exogeneity and recently developed tools for
testing functional form specification (ivreset) and
autocorrelation in the IV context (ivactest).
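
As a hedged sketch of the estimator the talk covers (illustrative Python, not ivreg2 or ivregress): two-stage least squares projects the regressors onto the instrument space before the final regression. The simulated data below, with a common shock u driving both x and y, are purely illustrative:

```python
import numpy as np

def tsls(y, X, Z):
    """Two-stage least squares: b = (X'Pz X)^-1 X'Pz y, where
    Pz = Z (Z'Z)^-1 Z' projects onto the column space of the
    instruments Z (Pz is symmetric and idempotent)."""
    Pz_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return np.linalg.solve(Pz_X.T @ X, Pz_X.T @ y)

# Endogeneity: u enters both x and y, so OLS on x is biased upward;
# z is correlated with x but independent of the error, so IV is consistent.
rng = np.random.default_rng(42)
n = 5000
u = rng.normal(size=n)
z = rng.normal(size=n)
x = z + u + rng.normal(size=n)
y = 2.0 * x + u + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
print(tsls(y, X, Z)[1])   # close to the true slope of 2
```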

**Additional information**

baumUKSUG2007.pdf (presentation slides)

baumUKSUG2007smcltalk.zip (presentation in smcl)


Patrick Royston

MRC Clinical Trials Unit, London

There has been a considerable growth of interest among Stata users and more
widely in the practical use of multiple imputation as a principled route to
the analysis of datasets with missing covariate values. Sophisticated Stata
software (ice) is available for creating multiply
imputed datasets. However, equally sophisticated and flexible tools are
required to carry out the analyses. The MI Tools package of Carlin et al.
(2003) and Royston’s micombine command (packaged
with ice) made a start. We present a new set of
tools, called mim, which carries the postimputation
process a step further. mim defines a standardized
architecture for MI datasets and has features for manipulating MI data. More
importantly, it supports a wide range of regression models, including those
for panel and survey data. Limited facilities for postestimation analysis
are provided, and these are expected to be developed further. The package is
in beta-testing form and has been submitted for publication in the Stata
Journal.
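
As a hedged sketch of the pooling step that post-imputation tools of this kind apply (Python illustrating Rubin's rules, not mim's actual code):

```python
from statistics import mean, variance

def rubin_combine(estimates, variances):
    """Combine results from m completed-data analyses by Rubin's
    rules: the pooled point estimate is the mean of the m estimates,
    and the total variance is the average within-imputation variance
    plus (1 + 1/m) times the between-imputation variance."""
    m = len(estimates)
    q_bar = mean(estimates)
    w_bar = mean(variances)      # within-imputation component
    b = variance(estimates)      # between-imputation component (ddof=1)
    return q_bar, w_bar + (1 + 1 / m) * b

# Three imputations of one coefficient, each with variance 0.1.
print(rubin_combine([1.0, 1.2, 0.8], [0.1, 0.1, 0.1]))
```

The between-imputation term is what inflates the standard errors to reflect uncertainty about the missing values.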

**Additional information**

Royston_SUG_2007.ppt (presentation slides)


Tim Collier, London School of Hygiene & Tropical Medicine

Stephen Jenkins, University of Essex

Timberlake Consultants, the official distributor of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.