*Last updated: 9 October 2009*

Centre for Econometric Analysis

Cass Business School

106 Bunhill Row

London EC1Y 8TZ

United Kingdom

Massimiliano Bratti

University of Milan

Alfonso Miranda

Institute of Education, University of London

In this presentation we define two qualitative response models: 1) a Selection
Endogenous Dummy Ordered Probit model (SED-OP); and 2) a Selection Endogenous
Dummy Dynamic Selection Ordered Probit model (SED-DOP). The SED-OP model is
a three-equation model consisting of an endogenous dummy equation, a
selection equation, and a main equation which has an ordinal response form.
The main feature of the model is that the endogenous dummy enters both the
selection equation and the main equation. The dynamic SED-DOP model allows
both the selection equation and the ordered equation to be dynamic by
including lagged individual behaviour. Initial conditions are properly
accounted for and free correlation among unobservables entering each of the
three equations is allowed. We show how these models can be estimated in
Stata using Maximum Simulated Likelihood.

**Additional information**

uk09_bratti_miranda.pdf

Vincenzo Verardi

University of Brussels and University of Namur

In data analysis, when some observations are outlying in one or several
dimensions, principal component analysis (PCA) is distorted and may lead to
questionable results. I therefore propose a simple solution to tackle this
problem by providing a short ado-file that is based on a robust estimation
of the covariance matrix. To illustrate the importance of this type of
approach, I present a PCA based on the variables used to rank
universities according to academic excellence (as measured by the scores in
the Shanghai ARWU ranking).

**Additional information**

uk09_verardi.ppt

Maarten Buis

University of Tuebingen

Sometimes we have multiple measures of the same concept. Combining the
information of these multiple measures would allow us to improve the
measurement. When combining the information from different indicators, one
needs to distinguish between two types of relationships between the observed
indicators and the underlying latent variable: either the latent variable
influences the indicators or the indicators influence the latent variable.
To distinguish between these two situations, some authors, following Bollen
(*Quality and Quantity*, 1984) and Bollen and Lennox (*Psychological Bulletin*,
1991), call the observed variables “effect indicators” when they are
influenced by the latent variable, while they call the observed variables
“causal indicators” when they influence the latent variable.
Distinguishing between these two is important as they require very different
strategies for recovering the latent variable. In a basic (exploratory)
factor analysis, which is a model for effect indicators, one assumes that
the only thing that the observed variables have in common is the latent
variable, so any correlation between the observed variables must be due to
the latent variable, and it is this correlation that is used to recover the
latent variable. In the models for causal indicators that I will discuss in
this talk, I assume that the latent variable is a weighted sum of the
observed variables (and optionally an error term), and the weights are
estimated such that they are optimal for predicting the dependent variable.
The three models for dealing with causal indicators that will be discussed
are a model with “sheaf coefficients” (Heise, *Sociological Methods &
Research*, 1972), a model with “parametrically weighted covariates”
(Yamaguchi, *Sociological Methodology*, 2002), and a multiple indicators and
multiple causes (MIMIC) model (Hauser and Goldberger, *Sociological Methodology*,
1971). The latter two can be estimated using **propcnsreg**, while the first
can be estimated using **sheafcoef**. Both are available from the SSC archive.

**Additional information**

uk09_buis.pdf

Nicholas J. Cox

Durham University

Circular statistics are needed when one or more variables have the circle as
their outcome space, as is true, for example, of data measured with reference to
compass, clock, or calendar. Applications abound in the earth and
environmental sciences, not to mention economic and medical fields well
represented among Stata users and other disciplines such as music. Previous
talks on circular statistics were given to the UK Stata Users Group meetings in 1997
and 2004. This update will survey the field with special reference to
recently revised or newly written programs for graphics, summary, testing,
and modeling.
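As a small illustration of why ordinary summaries fail on the circle, the sketch below computes a mean direction from summed sines and cosines. It is written in Python rather than Stata, and the function name `mean_direction_deg` is hypothetical; it is not taken from the programs surveyed in the talk.

```python
import math

def mean_direction_deg(angles_deg):
    """Mean direction of angles in degrees, returned in [0, 360)."""
    # Sum the unit vectors pointing in each direction.
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    # The direction of the resultant vector is the circular mean.
    return math.degrees(math.atan2(s, c)) % 360.0

# Bearings of 350 and 10 degrees average to (approximately) north,
# whereas their arithmetic mean of 180 degrees points due south.
north_ish = mean_direction_deg([350, 10])
```

The mean direction is undefined when the resultant vector has length zero (for example, for 0 and 180 degrees), a case this sketch does not guard against.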

Chuck Huber

Texas A&M University

Genetic association studies often explore the relationship between diseases
and haplotypes, which are collections of contiguous genetic markers located on
the same chromosome. Haplotypes are usually not observed directly but are
inferred statistically using a variety of algorithms. One of the most
popular haplotype-inference programs is PHASE, and one of the most popular
programs for examining characteristics of the resulting haplotypes is
HaploView. I have developed a set of Stata commands for
exporting genotype data from Stata into PHASE, importing the resulting
haplotypes back into Stata for association analysis, and exporting the
haplotype data from Stata into HaploView.

**Additional information**

uk09_huber.ppt

Adam Jacobs

Dianthus Medical Limited, London

Stata’s capabilities for statistical analysis, graphics, and data management
are world class, but its ability to produce well-presented textual output is
considerably more limited. Some problems that are particularly annoying are
a lack of appropriate page breaks or repetition of column headers in large
tables, Unicode support, and many of the other features taken for granted in
word processors, such as automatically generated tables of contents. But all
is not lost. Open Document Format (ODF) is an open ISO standard for
office-type documents, including word processing documents, and is the
default file format of the popular open source office software suite
OpenOffice.org. It is an XML-based format, which means that ODF files can be
written in a text editor, or with software that can produce output in
plain-text format. Happily, Stata is more than equal to the task of
producing plain-text output. In this talk, I shall explain how I have used
Stata to produce output in ODF XML files, thus making the appearance of
output considerably more user-friendly than native Stata output.

**Additional information**

uk09_jacobs.ppt

Martin Weiss

University of Tuebingen

I have researched the economics of interactions on Statalist, based on the
full population of exchanges from January 1 to June 30, 2009. Both
the “demand side”—the questions asked on the
list—and the “supply side”—the answers
provided—are examined. Along the way, I have paid particular attention
to the role of unsatisfied demand (“orphans”), i.e. questions
that never attract a reply.

**Additional information**

uk09_weiss.pdf

Ian White

MRC Biostatistics Unit, Cambridge University

Simulation studies are a powerful tool, but their analyses are not always
done well; in particular, Monte Carlo standard errors are often not
reported. I present a Stata program, **simsum**, which can output a range
of summaries, including bias, precision of one method relative to another,
percentage difference between model-based and empirical standard error,
power, and coverage. Monte Carlo standard errors are computed for all these
quantities, using exact or approximate formulae.
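To make the idea of a Monte Carlo standard error concrete, here is a minimal sketch of two of the summaries listed above, bias and coverage, using the usual textbook formulae. It is written in Python rather than Stata, the function names are hypothetical, and the formulae used by **simsum** itself may differ in detail.

```python
import math
import statistics

def bias_with_mcse(estimates, true_value):
    """Return (bias, Monte Carlo SE of the bias) over simulation replicates."""
    n_sim = len(estimates)
    bias = statistics.fmean(estimates) - true_value
    # Empirical standard error: sample SD of the point estimates.
    empirical_se = statistics.stdev(estimates)
    return bias, empirical_se / math.sqrt(n_sim)

def coverage_with_mcse(lower, upper, true_value):
    """Return (coverage, Monte Carlo SE) for a list of confidence intervals."""
    n_sim = len(lower)
    covered = sum(lo <= true_value <= hi for lo, hi in zip(lower, upper))
    p = covered / n_sim
    # Coverage is a binomial proportion, so its MCSE is sqrt(p(1-p)/n_sim).
    return p, math.sqrt(p * (1 - p) / n_sim)
```

Reporting each summary alongside its Monte Carlo standard error shows whether an apparent difference between methods could simply reflect too few replicates.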

**Additional information**

uk09_white.pdf

Michael Glencross

Community Agency for Social Enquiry, Johannesburg

In many research studies, respondents’ beliefs and opinions about
various concepts are measured by means of five-, six-, and seven-point
scales. The widely used five-point scale is commonly known as a Likert scale
(Likert, 1932, “A technique for the measurement of attitudes”,
*Archives of Psychology*, 22, 1–55). In such situations, it is desirable to
have a test statistic that provides a measure of the amount of agreement or
disagreement in the sample, that is, whether a particular item ‘pole’
is characteristic of the respondents. This is preferable to making arbitrary
decisions about the extremeness or otherwise of the sample responses. A
suitable test for this purpose, the Cooper *z*, was designed by Cooper (1976,
“An exact probability test for use with Likert-type scales”, *Educational and
Psychological Measurement*, 36, 647–655), with modifications suggested by
Whitney (1978, “An alternative test for use with Likert-type scales”,
*Educational and Psychological Measurement*, 38, 15–19) giving the Whitney
*t*. Cooper showed that for large samples, the Cooper *z*
statistic has a sampling distribution that is approximately normal. The
alternative Whitney *t* statistic has a sampling distribution that is
approximately *t* with (n−1) degrees of freedom and is suitable for small
samples. Between them, these two statistics, although rarely used, provide a
quick, straightforward, and objective way of analyzing rating scales.
In this presentation, I will describe the Stata syntax used to calculate the
Cooper *z* and Whitney *t* statistics and create the related bar
graphs. An illustrative example will be used to demonstrate their use in a
survey.

**Additional information**

uk09_glencross.ppt

Rosa Gini

Regional Agency for Public Health of Tuscany

Sylvia Forni

Regional Agency for Public Health of Tuscany

We introduce **funnelcompar**, a Stata routine that performs the analysis
suggested by David J. Spiegelhalter (2005, “Funnel plots for comparing
institutional performance”, *Statistics in Medicine*, 24,
1185–1202). The basic idea in funnel plots is to plot performance
indicators against a measure of their precision in order to detect outliers.
A scatterplot of the indicator is drawn together with a baseline and
control limits, which narrow as the sample size gets bigger. Our command
draws funnel plots for binomial (proportions), Poisson (crude and
standardized rates), and normal (means) variables. The baseline
(and standard errors in case of normal variables) can either be specified by
the user (for instance, from a literature reference) or be estimated from the
data as a weighted or nonweighted mean of the data. By default, confidence
limits are plotted at two and three standard errors, to detect alarm and
alert signals, as recommended by statistical process control theory. Options
have been implemented to mark single institutions, groups of institutions or
those institutions lying outside control limits. These plots are
increasingly used to report performance indicators at the institutional level.
Classical league tables imply the existence of ranking between institutions
and implicitly support the idea that some of them are worse/better than
others. A different approach is possible using statistical process control
theory: all institutions are part of a single system and perform at the same
level. Observed differences can never be completely eliminated and are
explained by chance (common-cause variation). If observed variation exceeds
that expected, special-cause variation exists and requires further
explanation to identify its cause.
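The limits described above can be sketched for the binomial case with a normal approximation around a baseline proportion. This is an illustrative Python sketch, not the **funnelcompar** implementation: the function names are hypothetical, and the command may use exact rather than approximate limits.

```python
import math

def binomial_funnel_limits(p0, n, z):
    """Approximate limits at z standard errors around baseline p0 for size n."""
    se = math.sqrt(p0 * (1 - p0) / n)  # shrinks as n grows, giving the funnel
    return max(0.0, p0 - z * se), min(1.0, p0 + z * se)

def flag_institution(observed, p0, n):
    """Classify an observed proportion against the 2-SE and 3-SE limits."""
    lo2, hi2 = binomial_funnel_limits(p0, n, 2)
    lo3, hi3 = binomial_funnel_limits(p0, n, 3)
    if observed < lo3 or observed > hi3:
        return "outside 3 SE"
    if observed < lo2 or observed > hi2:
        return "outside 2 SE"
    return "within limits"
```

Because the standard error depends on `n`, the same observed proportion can be unremarkable for a small institution and a signal for a large one.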

**Additional information**

uk09_gini_forni.pdf

Stephen P. Jenkins

University of Essex

Philippe Van Kerm

CEPS/INSTEAD, Luxembourg

In this short talk, we describe the module **dsginideco**, which decomposes the
change in income inequality between two time periods into two components:
one representing the progressivity (pro-poorness) of income growth, and the
other representing reranking. Inequality is measured using the generalized
Gini coefficient, also known as the S-Gini, G(v). This is a
distributionally-sensitive inequality index, with larger values of v placing
greater weight on inequality differences among poorer (lower ranked)
observations. The conventional Gini coefficient corresponds to the case v =
2. The decomposition is of the form: final-period inequality −
initial-period inequality = R − P, where R is a measure of reranking,
and P is a measure of the progressivity of income growth. For full details
of the decomposition and an application, see S.P. Jenkins and P. Van Kerm
(2006), “Trends in income inequality, pro-poor income growth and income
mobility”, *Oxford Economic Papers*, 58: 531–548.
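For readers unfamiliar with the S-Gini, the sketch below computes G(v) from its covariance representation, G(v) = −v · Cov(y, (1 − F)^(v−1)) / mean(y), where F is the fractional rank of each income. It is an illustrative Python sketch under an assumed mid-point rank convention with unweighted data; the function name `s_gini` is hypothetical and this is not the **dsginideco** code, nor does it perform the R − P decomposition.

```python
def s_gini(incomes, v=2.0):
    """Generalized (S-)Gini G(v) for unweighted, positive-mean incomes."""
    y = sorted(incomes)
    n = len(y)
    mu = sum(y) / n
    # Mid-point fractional ranks F_i = (i - 0.5)/n; weights are (1 - F_i)^(v - 1).
    w = [(1 - (i + 0.5) / n) ** (v - 1) for i in range(n)]
    w_bar = sum(w) / n
    cov = sum((yi - mu) * (wi - w_bar) for yi, wi in zip(y, w)) / n
    return -v * cov / mu

# v = 2 reproduces the conventional Gini coefficient; larger v puts more
# weight on differences among lower-ranked observations.
gini_conventional = s_gini([1, 2, 3, 4], v=2)
gini_sensitive = s_gini([1, 2, 3, 4], v=3)
```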

**Additional information**

uk09_jenkins_vankerm.pdf

Roy Costilla

LLECE/UNESCO, Santiago

A socioeconomic gradient describes the relationship between a social outcome
and socioeconomic status for individuals in a specific jurisdiction, such as
a school, a province or state, or a country (Willms [2003], “Ten hypotheses
about socioeconomic gradients and community differences in children’s
developmental outcomes”). Within this framework, I will
analyze the relationship between students’ achievement in mathematics and
reading and their socioeconomic and cultural status for the Latin
American and Caribbean primary school students who were assessed by the
SERCE study (OREALC/UNESCO Santiago [2008]). It is shown that there is considerable variation in
the strength of this relationship among countries, suggesting different
degrees of success in reducing the disparities associated with socioeconomic
and cultural status.

**Additional information**

uk09_costilla.pdf

Yulia Marchenko

StataCorp

Stata 11’s **mi** command can be used to perform
multiple-imputation analysis, including imputation, data management, and
estimation. **mi impute** provides 5 univariate and 2 multivariate imputation
methods. **mi estimate** combines the estimation and pooling steps of the
multiple-imputation procedure into one easy step. **mi** also provides an
extensive ability to manage multiply-imputed data. I will give a brief
overview of all of **mi**’s capabilities with emphasis on **mi
impute** and **mi estimate**, and will also demonstrate examples of some of
**mi**’s unique data management features.
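The pooling step mentioned above follows Rubin's combination rules, which can be sketched for a single coefficient as follows. This is an illustrative Python sketch, not Stata code: the function name `pool_rubin` is hypothetical, and the degrees-of-freedom calculations that accompany pooled inference are omitted.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool one coefficient across M completed-data analyses.

    estimates: point estimates from each imputed dataset
    variances: squared standard errors from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total
```

The between-imputation term is what carries the extra uncertainty due to the missing data into the pooled standard error.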

**Additional information**

uk09_marchenko.pdf

Tom Palmer

University of Bristol

Funnel plots are commonly used to investigate publication and related biases
in meta-analysis. Although asymmetry in the appearance of a funnel plot is
often interpreted as being caused by publication bias, in reality the
asymmetry could be due to other factors that cause systematic differences in
the results of large and small studies, for example, confounding factors
such as differential study quality. Funnel plots can be enhanced by adding
contours of statistical significance to aid in interpreting the funnel plot.
If studies appear to be missing in areas of low statistical significance,
then it is possible that the asymmetry is due to publication bias. If
studies appear to be missing in areas of high statistical significance, then
publication bias is a less likely cause of the funnel asymmetry. Examples
will be given using the user-written **confunnel** command in conjunction with
some of the other user-written commands for meta-analysis.
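The significance contours can be thought of as boundaries between regions of the (effect estimate, standard error) plane defined by a two-sided test of no effect. The sketch below assigns a study to such a region; it is an illustrative Python sketch assuming a normal (Wald) test and thresholds of 1%, 5%, and 10% chosen for illustration, with hypothetical function names, and it is not the **confunnel** code.

```python
import math

def two_sided_p(estimate, se):
    """Two-sided p-value for H0: effect = 0 under a normal approximation."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2))

def significance_region(estimate, se, levels=(0.01, 0.05, 0.10)):
    """Name the contour region in which a study's (estimate, se) pair falls."""
    p = two_sided_p(estimate, se)
    for level in sorted(levels):
        if p < level:
            return f"p < {level:.2f}"
    return f"p >= {max(levels):.2f}"
```

If the studies that seem to be missing would fall in the "p >= 0.10" region, publication bias becomes a more plausible explanation for the asymmetry.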

**Additional information**

uk09_palmer_presentation.pdf

uk09_palmer_handouts.pdf

Roger B. Newson

Imperial College, London

Insufficient confounder adjustment is viewed as a common source of “false
discoveries”, especially in the epidemiology sector. However,
adjustment for “confounders” that are correlated with the exposure, but
which do not independently predict the outcome, may cause loss of power to
detect the exposure effect. On the other hand, choosing confounders based on
“stepwise” methods is subject to many hazards, which imply that the
confidence interval eventually published is likely not to have the
advertised coverage probability for the effect that we wanted to know. We
would like to be able to find a model in the data on exposures and
confounders, and then to estimate the parameters of that model from the
conditional distribution of the outcome, given the exposures and
confounders. The **haif** package, downloadable from the SSC archive, calculates the
homoskedastic adjustment inflation factors (HAIFs), by which the variances
and standard errors of coefficients for a matrix of X-variables are scaled
(or inflated), if a matrix of unnecessary confounders A is also included in
a regression model, assuming equal variances (homoskedasticity). These can
be calculated from the A- and X-variables alone, and can be used to inform
the choice of a set of models eventually fitted to the outcome data,
together with the usual criteria involving causality and prior opinion.
Examples are given of the use of HAIFs and their ratios.

**Additional information**

uk09_newson.pdf

Christopher F. Baum

Boston College

Mark E. Schaffer

Heriot-Watt University

We discuss how econometric estimators may be efficiently programmed in Mata.
The prevalence of matrix-based analytical derivations of estimation
techniques and the computational improvements available from just-in-time
compilation combine to make Mata the tool of choice for econometric
implementation. Two examples are given: computing the seemingly unrelated
regression (SUR) estimator for an unbalanced panel, a multivariate linear
approach, and computing the continuously updated GMM estimator (GMM-CUE) for
a linear instrumental variables model. The GMM-CUE estimator makes use
of Mata’s **optimize** suite of functions. Both illustrate the
power and effectiveness of a Mata-based approach.

**Additional information**

uk09_baum.pdf

Paul Lambert

University of Leicester

Patrick Royston

MRC Clinical Trials Unit, London

The Cox model is the most popular method for the modeling of time-to-event
data. The fact that it does not directly estimate the baseline hazard
function is both an advantage and a disadvantage. This tutorial will
describe various aspects of flexible parametric alternatives to the Cox
model by describing a new command, **stpm2**. We will cover the following
areas:

- the general idea of the flexible parametric approach
- proportional hazards and proportional odds models
- model selection for the baseline hazard
- modeling time-dependent effects
- using age as the time-scale
- modeling with multiple time-scales
- using absolute or relative differences (hazard ratios or differences in hazard rates)
- multiple events
- time-varying covariates
- adjusted survival curves
- relative survival (incorporating expected mortality)
- estimating crude and net mortality (based on competing risks)

**Additional information**

uk09_lambert_royston.pdf

Ben Jann

ETH, Zurich

This tutorial will show how results from various Stata commands can be
processed efficiently for inclusion in customized reports. A two-step
procedure is proposed in which results are gathered and archived in the
first step and then tabulated in the second step. Such an approach
disentangles the tasks of computing results (which may take a long time) and
preparing results for inclusion in presentations, papers, and reports (which
you may have to do over and over). Examples using results from model
estimation commands and also various other Stata commands such as
**tabulate**, **summarize**, or **correlate** are presented.
Furthermore, this tutorial shows how to dynamically link results into word
processors or into LaTeX documents.

**Additional information**

uk09_jann.pdf

Roger Newson, Imperial College London

Stephen Jenkins, University of Essex

Timberlake Consultants, the official distributor of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.