12th UK Stata Users Group meeting: Abstracts
Monday, September 11, 2006
Automating the production of large reports from Stata
Department of Social Medicine, University of Bristol
We have been undertaking a systematic review of the literature on diet and
cancer, which included all study types reporting on any dietary exposure.
The data were presented in a mixture of category, mean difference, and
regression coefficients, which we analyzed in Stata to produce dose–response
estimates and other statistics for all results.
The resulting tables were large (more than 3000 results). To
rapidly produce formatted tables, we wrote the xtable command, which arranges
data for exporting with formatting tags. These tags are then recognized by
an Excel macro, which creates headings, merges across cells, and performs
other formatting actions as required. In this way the data are compact, as
study-level information is merged across cells to reduce duplication, and
neatly organized. The process allows users to arrange the data as they
wish, to sort the data according to other variables within the command, or
to do a mix of both. The data are exported in text format and imported into
Excel in one intermediate step, after which a single key press formats the
table. In this way complex tables can be produced with duplicate information
merged across cells at more than one level, and multiple levels of headings
can be incorporated. After the initial specification of the xtable command,
it is easy to rerun the procedure, which simplifies updates and
modifications to the analysis.
After developing these techniques, we wrote a program to form
simple sentences based on our data, e.g.: “The Iowa Women’s
Health study, a prospective cohort, reported an unadjusted OR of 1.09
(95% CI: 0.98, 1.21) per cup per day increase of coffee.” A program was then
created that produced a series of short texts for each exposure in a log
file, consisting of a title, subtitles, a small frequency table, and a
sentence summarizing each result. The log file was then opened in Word, where
tags were used as before to format the document, creating titles and aligning
the frequency tables. This proved a massive labor-saving device, as much of the
report was rather repetitious, and had the added benefit of
creating a structure for the report and preventing typing errors and
accidental omission of results. The code for this method is too specific to
produce a general command, but the techniques will be discussed.
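The sentence-building step might be sketched as follows. This is a minimal illustration rather than the authors' program: the variable names (study, design, or_est, ci_lo, ci_hi) and the single example row are assumptions made for the sketch.

```stata
* Hypothetical one-row dataset; names and values are for illustration only
clear
input str40 study str25 design or_est ci_lo ci_hi
"The Iowa Women's Health study" "prospective cohort" 1.09 0.98 1.21
end

* Build one summary sentence per result from the stored fields
generate str244 sentence = study + ", a " + design +              ///
    ", reported an unadjusted OR of " + string(or_est, "%4.2f") + ///
    " (" + string(ci_lo, "%4.2f") + ", " + string(ci_hi, "%4.2f") + ")"

* Write each sentence to the log, one per line
forvalues i = 1/`=_N' {
    display sentence[`i']
}
```

Because every sentence is assembled from the data, rerunning the do-file after a correction regenerates the text without retyping.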
Automatic generation of documents
Agenzia Regionale di Sanità Toscana
This paper describes a natural interaction between Stata and markup
languages. Stata’s programming and analysis features, together with the
flexibility in output formatting of markup languages, allow generation
and/or update of whole documents (reports, presentations on screen or web,
etc.). Examples are given for both LaTeX and HTML.
Stata’s commands are mainly dedicated to analysis of data on a computer
screen and output of analysis stored in a log file available to researchers
for later reading. However, users may need to produce output in different
formats and to cooperate with professionals who are not familiar with log
files. An elegant solution to this problem is exporting output in the format
of a markup language, such as LaTeX or HTML.
The most common means for presenting the results of one or several analyses
are text on paper, screen presentations, and websites. While it is common to
generate such outputs by visual programs, such as MS Office or OpenOffice,
it is impossible for Stata to produce documents this way, as it lacks eyes
to format a table and hands to hold a mouse to cut and paste
graphs. Nevertheless, each of those presentation formats can also be
obtained with use of a markup language. Wikipedia defines a markup language
as “a kind of text encoding that represents text as well as details about
the structure and appearance of the text”.
To publish on the web, HTML is one of the best and most compatible formats.
On the other hand, LaTeX is a complete language for editing and text formatting
on either paper or screen (most commonly via PDF files). Both
languages are easy to learn, free, and well documented.
Now Stata happens to be perfectly capable of writing text, such as the
instructions for a markup language to write a report, a sequence of
slides, or the pages of a website containing tables and graphs.
The problem of formatting the output of a command in LaTeX and/or HTML has
been addressed in various ways by several authors. The most comprehensive
reference to this issue is Newson (2003), who also provides a suite of tools
for listing a Stata dataset in a markup language in such a way that variable
labels, value labels, significant figures, and so forth are formatted as one
would wish.
More generally, we can exploit Stata’s ability to write text files
to make it produce virtually any piece of markup language code: tables and
graphs, but also other kinds of objects, like lists, trees, etc.
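A minimal sketch of this idea, using Stata's built-in file commands to write a small HTML table. The output filename report.html and the choice of the auto dataset are assumptions made for illustration; this is not the author's do-file.

```stata
* Write a small HTML table from the auto dataset shipped with Stata
sysuse auto, clear
tempname fh
file open `fh' using "report.html", write replace text
file write `fh' "<html><body>" _n
file write `fh' "<h1>Car prices</h1>" _n "<table>" _n
file write `fh' "<tr><th>Make</th><th>Price</th></tr>" _n

* One table row per observation (first five cars only)
forvalues i = 1/5 {
    file write `fh' "<tr><td>" (make[`i']) "</td><td>" %9.0fc (price[`i']) "</td></tr>" _n
}

file write `fh' "</table>" _n "</body></html>" _n
file close `fh'
```

The same pattern applies to LaTeX: only the strings written to the file change.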
Finally, by further printing some code putting together all of the
ingredients, we make Stata produce a whole document, which is then
browsable, printable, or showable on a screen, according to the kind of
The key feature of this method is that the document automatically
produced can be completely updated as soon as the figures in the data
change. This is particularly suitable when the user needs to produce a large
amount of output or routinely performs analyses on the same dataset
structure, such as administrative databases or collections of data from a
As an example of these facilities, we describe a do-file that automatically
constructs a website for the Regional Agency for Public Health of
Tuscany. Finally, we remark that to apply this method, Stata
commands must store their results in memory, or at least as many of them as
are necessary to reproduce the screen output. This is generally the case, with
some notable counterexamples (dstdize, svyprop, ...).
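As a brief illustration of stored results, the sketch below inspects what two built-in commands leave in memory and reuses those values in custom output. The dataset and variables are chosen only for the example.

```stata
sysuse auto, clear

* r-class command: results are left in r()
summarize price
return list
display "Mean price: " %9.2f r(mean) " (N = " r(N) ")"

* e-class command: results are left in e(), coefficients in _b[] and _se[]
regress price mpg
ereturn list
display "Slope for mpg: " %9.3f _b[mpg] ", SE " %9.3f _se[mpg]
```

Any quantity listed by return list or ereturn list can be written into a markup file instead of being displayed.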
- Newson, R. 2003. Confidence intervals and p-values for delivery to the end
user. Stata Journal 3: 245–269.
Marginal effects and extending the Blinder–Oaxaca decomposition
for nonlinear models
Institute of Sociology and Social Policy, Corvinus University, Budapest
Students of racial and gender inequalities are often interested in knowing
to what extent an observed group difference can be attributed to differences
in returns to productive abilities (discrimination effect) or to
differences in the average of productive abilities (endowment effect). The
standard Blinder–Oaxaca decomposition technique, which applies to continuous
outcomes, measures the discrimination (endowment) effect in terms of
differences in group-specific regression parameters (means), weighted by
group-specific means (regression parameters). This article shows that the
standard decomposition technique can be meaningfully extended to categorical
outcomes if the regression coefficients are substituted with marginal
effects. A user-written program, gdecomp (working title), is also presented,
which basically processes marginal effects obtained from another
user-written program, margeff.
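For reference, one common way of writing the standard decomposition for a continuous outcome is given below. The notation is an assumption made here, not taken from the abstract: groups A and B, covariate mean vectors \bar{X}_A and \bar{X}_B, and group-specific OLS coefficient vectors \hat{\beta}_A and \hat{\beta}_B (including the intercept). Other weighting choices exist.

```latex
\bar{Y}_A - \bar{Y}_B
  = \underbrace{(\bar{X}_A - \bar{X}_B)'\hat{\beta}_B}_{\text{endowment effect}}
  + \underbrace{\bar{X}_A'(\hat{\beta}_A - \hat{\beta}_B)}_{\text{discrimination effect}}
```

The identity holds because each group mean satisfies \bar{Y}_g = \bar{X}_g'\hat{\beta}_g under OLS with an intercept. The extension described in the abstract replaces the regression coefficients with marginal effects.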
A comparison analysis of dynamic panel-data estimators in the
presence of endogenous regressors
Giovanni S. F. Bruno
Istituto di Economia Politica, Università Bocconi, Milano
Data used in applied econometrics are typically nonexperimental in nature.
This makes the assumption of exogeneity of regressors untenable and poses a
serious identification issue in the estimation of economic structural
As long as the source of endogeneity is confined to unobserved heterogeneity
between groups (for example, time-invariant managerial ability in firm-level
labor demand equations), the availability of panel data can identify the
parameters of interest. If endogeneity, instead, is more pervasive, stemming
also from unobserved within-group variation (for example, a transitory
technology shock hitting at the same time both the labor demand of the firm
and the wage paid), then standard panel data estimators are biased and
instrumental variable or generalized method of moments estimators provide
valid alternative techniques.
This paper extends the analysis in Bruno (2005) focusing on dynamic
panel-data (DPD) models with endogenous regressors.
Various Monte Carlo experiments are carried out through my Stata code
xtarsim to assess the relative finite-sample performances of
some popular DPD estimators, such as Arellano and Bond (xtabond,
xtabond2), Blundell and Bond (xtabond2), Anderson and Hsiao
(ivreg, ivreg2, xtivreg, xtivreg2), and LSDVC (xtlsdvc).
New versions of the commands xtarsim and xtlsdvc are also presented.
- Bruno, G. S. F. 2005. Estimation and inference in dynamic unbalanced panel
data models with a small number of individuals. Stata Journal 5: 473–500.
On the central role of Somers’ D
National Heart and Lung Institute, Imperial College London
Somers’ D and Kendall’s tau-a are parameters behind rank or
nonparametric statistics, interpreted as differences between proportions.
Given two bivariate data pairs (X_1, Y_1) and (X_2, Y_2), Kendall’s tau-a
parameter τ_XY is the difference between
the probability that the two X–Y pairs are concordant and the
probability that the two X–Y pairs are discordant, and Somers’ D
parameter D_YX is the difference between the corresponding conditional
probabilities, given that the X-values are ordered. The somersd package
computes confidence intervals for both parameters. The Stata 9 version of
somersd uses Mata to increase computing speed and greatly extends the
definition of Somers’ D, allowing the X and/or Y variables to be
left- or right-censored and allowing multiple versions of Somers’ D
for multiple sampling schemes for the X–Y pairs. In particular, we may
define stratified versions of Somers’ D, in which we compare only
X–Y pairs from the same stratum. The strata may be defined by grouping
a Rubin–Rosenbaum propensity score, based on the values of multiple
confounders for an association between an exposure variable X and an outcome
variable Y. Therefore, rank statistics can have not only confidence
intervals but also confounder-adjusted confidence intervals. Usually, we either
estimate D_YX as a measure of the effect of X on Y, or we estimate D_XY as a
measure of the performance of X as a predictor of Y, compared with other
predictors. Alternative rank-based measures of the effect of X on Y include
the Hodges–Lehmann median difference and the Theil–Sen median
slope, both of which are defined in terms of Somers’ D.
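In symbols, Kendall's tau-a can be written as τ_XY = E[sign(X_1 − X_2) sign(Y_1 − Y_2)], and Somers' D as the ratio D_YX = τ_XY / τ_XX. The sketch below computes sample versions of both with the built-in ktau command only. It gives point estimates without confidence intervals and does not use the somersd package; the variables are chosen purely for illustration.

```stata
sysuse auto, clear

* Kendall's tau-a between X (weight) and Y (mpg)
quietly ktau weight mpg
scalar tau_xy = r(tau_a)

* tau-a of X with itself, computed against a copy of X
generate weight_copy = weight
quietly ktau weight weight_copy
scalar tau_xx = r(tau_a)

* Somers' D of Y with respect to X is the ratio of the two
scalar d_yx = tau_xy / tau_xx
display "Kendall's tau-a       = " %6.4f tau_xy
display "Somers' D(mpg|weight) = " %6.4f d_yx
```

The two measures differ only when there are ties in X, since τ_XX is the probability that two X-values are unequal.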
Consistency checking with assertk
MRC Clinical Trials Unit, London
We introduce the assertk command, beginning with a motivation and a
comparison with the built-in assert command. We will then show some examples
demonstrating the various options that can be used to produce customized
output and to perform more complex checks.
assertk is a simple utility that makes data consistency checking and
reporting on data quality easy.
The built-in Stata command assert checks each observation for a specified
condition and halts do-files and ado-files when the specified condition is
not satisfied. For example:
. assert age_entry < .
2 contradictions in 149 observations
assertion is false;
end of do-file
Thus assert is a useful tool for checking important assumptions about the
data you are about to process; your do-file will simply not continue if
these assumptions do not pass the checks. The principle of the assert
command also lends itself to consistency checking, i.e., performing a suite
of checks on a dataset to identify potential errors. This is an important
part of the process of data cleaning. However, in this application, the
halting of do-files is a hindrance, and assert gives no detailed output
showing which observations failed the check.
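With built-in commands only, both limitations can be worked around along the following lines. This is a sketch of the general idea, not the assertk implementation, and the dataset and variable are chosen only for illustration.

```stata
sysuse auto, clear

* Run the check without halting the do-file; _rc is nonzero if it fails
capture noisily assert rep78 < .
if _rc != 0 {
    display as error "rep78 is missing for the following cars:"
    list make rep78 if rep78 >= ., noobs
}

display "do-file continues after the check"
```

Each check needs several lines when written this way, which is the repetition a one-line-per-check command avoids.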
In assertk, a condition is specified, and each observation is checked
against this condition. If any data do not pass the check, the
irregularities are output (with the output customizable by various options)
and the do-file continues. For example:
. assertk age_ent < ., mess(Age at entry is missing) vars(id age_ent)
Age at entry is missing (1 obs)
id age_ent
Thus a suite of checks can be programmed easily, with one line per check,
and a meaningful log of data errors can be produced for use by data managers
Variance estimation for quantile group shares, cumulative shares,
and Gini coefficient
Stephen P. Jenkins
This short talk introduces and illustrates svylorenz, a Stata 9
program for computing variance estimates for quantile group shares of total
varname, cumulative quantile group shares (i.e., Lorenz curve ordinates), and
the Gini coefficient. The program implements the linearization methods
proposed by Kovačević and Binder (Journal of Official Statistics,
Econometric analysis of panel data using Stata
David M. Drukker
StataCorp, College Station, TX
This talk discusses estimation, inference, and interpretation of panel-data
models using Stata. The talk usually covers the linear RE and FE models,
linear RE and FE models with AR(1) errors, linear RE and FE models with
general within-panel correlation structures, Hausman–Taylor estimation,
linear RE and FE with endogenous variables, linear FE dynamic models, linear
mixed models, FE and RE nonlinear models, FE and RE logit models, FE and RE
Poisson models, and stochastic frontier models for panel data. The talk
briefly introduces each model discussed.
Graphs for all seasons
Nicholas J. Cox
Seasonal effects are dominant in many environmental time series, and are
important or notable in many economic and biomedical time series. In several
fields, using anything other than basic line graphs of responses versus time
to display series showing seasonality is rare. This presentation will focus
on a variety of tricks for graphically examining seasonality. Some of these
tricks have long histories in climatology and related sciences, but are
little known outside. I will discuss some original code, but the greater
emphasis will be on users needing to know Stata functions and commands well
to exploit the full potential of its graphics.
Scheming your way to consistent graphs
Vincent L. Wiggins
StataCorp, College Station, TX
If you find yourself repeatedly specifying the same options on graph
commands, you should write a graphics scheme. A scheme is nothing more than
a file containing a set of rules specifying how you want your graphs to
look. From the size of fonts used in titles and the color of lines and
markers in plots to the placement of legends and the number of default ticks
on axes, almost everything about a graph is controlled by the rules in a
graphics scheme. We will look at how to create your own graphics schemes and
where to find out more about all the rules available in schemes. The first
scheme we create will be only a few lines long, yet will produce graphs
distinctly different from any existing scheme.
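A scheme file of that size might look like the sketch below. The scheme name mywhite is hypothetical, and the two entries are an assumption about a plausible minimal scheme rather than the one shown in the talk: the first line inherits every rule from the s2color scheme, and the second overrides the background color.

```
* scheme-mywhite.scheme (hypothetical name)
#include s2color
color background white
```

If the file is saved as scheme-mywhite.scheme somewhere on the ado-path, such as the current directory, a graph command should be able to use it through the scheme(mywhite) option.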
Tuesday, September 12, 2006
Estimating and modeling the proportion cured of disease in
population-based cancer studies
Paul C. Lambert
Centre for Biostatistics & Genetic Epidemiology, University of Leicester
In population-based cancer studies, cure is said to occur when the mortality
(hazard) rate in the diseased group of individuals returns to the same level
as that expected in the general population. The cure fraction (the
proportion of patients cured of disease) is of interest to patients and a
useful measure to monitor trends in survival of curable disease. I will
describe two types of cure model, namely, the mixture and nonmixture cure
model (Sposto 2002); explain how they can be extended to incorporate the
expected mortality rate (obtained from routine data sources); and discuss
their implementation in Stata using the strsmix and strsnmix
commands. In both commands there is the choice of parametric distribution
(Weibull, generalized gamma, and log–logistic) and link function for
the cure fraction (identity, logit, and log(–log)). As well as modeling
the cure fraction it is possible to include covariates for the ancillary
parameters for the parametric distributions. This ability is important, as it allows
for departures from proportional excess hazards (typical in many
population-based cancer studies). Both commands incorporate delayed entry
and can therefore be used to obtain up-to-date estimates of the cure
fraction by using period analysis (Smith et al. 2004). There is also an
associated predict command that allows prediction of the cure fraction,
relative survival, and the excess mortality rate with associated confidence
intervals. For some cancers the parametric distributions listed above do not
fit the data well, and I will describe how finite mixture distributions can
be used to overcome this limitation. I will use examples from international cancer
registries to illustrate the approach.
- Smith, L. K., P. C. Lambert, J. L. Botha, and D. R. Jones. 2004. Providing
more up-to-date estimates of patient survival: A comparison of
standard survival analysis with period analysis using life-table methods and
proportional hazards models. Journal of Clinical Epidemiology 57:
- Sposto, R. 2002. Cure model analysis in cancer: An application to data from
the Children’s Cancer Group. Statistics in Medicine 21: 293–312.
Two postestimation commands for assessing confounding effects
in medical and epidemiological studies
Centre for Chronic Disease, School of Medicine, University of Queensland
Controversy exists regarding proper methods for the selection of variables
in confounder control in epidemiological studies. Various approaches have
been proposed for selecting a subset of confounders among many
possible subsets. This paper describes the use of two practical tools, Stata
postestimation commands written by the author, to identify the presence and
direction of confounding.
One command, confall, plots all possible effect estimates against
a statistical value such as the p-value or Akaike information
criterion. This computing-intensive procedure allows researchers to inspect
the variability of effect estimates from different possible models. Another
command, confnd, uses a stepwise approach to identify confounders
that have caused substantial changes in the effect measurement.
Using three examples, the author illustrates the use of those programs in
different situations. When all possible effect estimates are similar,
indicating little confounding, the investigator can confidently report the
presence and direction of the association between exposure and disease
regardless of which variable selection method is used. On the other hand,
when all possible effect estimates vary substantially, indicating the
presence of confounding, a change-in-estimate plot and its corresponding
table are helpful for identifying important confounders. Both commands can
be used after most commonly used estimation commands for epidemiological
Problems with infinite solutions in logistic regression
MRC Biostatistics Unit, Cambridge
In teaching logistic regression for case–control studies, I ask
master’s students in epidemiology to assess an interaction between a 2-level
exposure and a 4-level exposure using a likelihood-ratio test. Theory
suggests that the test statistic has 3 degrees of freedom, but Stata uses 2
degrees of freedom. The explanation turns out to be that one exposure
combination contains controls but no cases, so that one parameter goes to
infinity. It is hard to convince the students (and myself) that this
combination contributes no degrees of freedom.
I will review how Stata handles situations in which parameters go to infinity.
Although asymptotics for likelihood-ratio tests may not work well in this
situation, I will argue that lrtest should be modified to reflect the
true number of degrees of freedom.
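The simplest version of the situation can be reproduced with a dataset shipped with Stata. In the sketch below, the indicator heavy is an assumption made for illustration: no foreign car in the auto data weighs more than 4,000 lbs, so one cell of the two-way table is empty and the corresponding log odds ratio is not finite.

```stata
sysuse auto, clear

* Indicator for heavy cars; no foreign car in these data exceeds 4,000 lbs
generate byte heavy = weight > 4000
tabulate heavy foreign

* One cell of the table is empty, so the coefficient on heavy is not finite
logit foreign heavy
```

Stata is expected to note that heavy predicts the outcome perfectly and to drop the variable together with the affected observations, which is how a parameter can silently disappear from the degrees of freedom.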
Visualizing and analyzing time to event data: Lifting the veil of
MRC Clinical Trials Unit, London
Most survival data are analyzed by using the Cox proportional hazards
model (in Stata: the stcox command). Almost by definition, a proportion of
the observations will be right-censored. Analysis of covariate effects in
the Cox model is couched in terms of (log) hazard ratios, and the
distribution of time itself is essentially ignored. This practice is totally
different from standard analysis of a continuous outcome variable, where
multiple (linear) regression is the technique most often used. Hazard ratios
are difficult to interpret and give little insight into how a
covariate affects the time to an event. Furthermore, the assumption of
proportional hazards is strong, and when there is long-term follow-up,
is often breached. I will illustrate how the censored lognormal model can be
used to good effect to remedy some of these deficiencies and give better
insight into the data. Multiple imputation of the censored observations may
be followed by use of familiar exploratory graphical tools, such as
dotplots, scatterplots, and scatterplot smoothers. Analyses using standard
linear regression methods may be done on the log time scale, leading to
simple interpretations and informative graphs of effect size. I will explore
these ideas in the context of a familiar breast cancer dataset and will
show how a treatment/covariate interaction is easily conveyed graphically.
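A censored lognormal model can be fitted with built-in commands, as in the sketch below. It covers only the model-fitting step, not the multiple imputation or graphics described in the abstract, and it uses the small cancer dataset shipped with Stata rather than the breast cancer data from the talk.

```stata
sysuse cancer, clear

* Declare survival-time data: studytime is follow-up time, died marks events
stset studytime, failure(died)

* Accelerated failure-time model with a lognormal distribution
streg age, distribution(lognormal)

* Coefficients are on the log-time scale; predict median survival time
predict med_time, median time
summarize med_time
```

Because the coefficients act on log time, exponentiating one gives a time ratio, which is often easier to explain than a hazard ratio.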
Modeling for response variables that are proportions
Maarten L. Buis
Department of Social Research Methodology, Vrije Universiteit Amsterdam
When dealing with response variables that are proportions, people often use
regress. This approach can be problematic since the model can lead to
predicted proportions less than zero or more than one and errors that are
likely to be heteroskedastic and nonnormally distributed. This talk will
discuss three more appropriate methods for proportions as response
variables: betafit, dirifit, and glm.
betafit is a maximum likelihood estimator using a beta likelihood,
dirifit is a maximum likelihood estimator using a Dirichlet
likelihood, and glm can be used to create a quasi–maximum likelihood
estimator using a binomial likelihood. On an applied level, a difference
between dirifit and the others is that the others can handle only one
response variable, whereas dirifit can handle multiple response
variables. For instance, betafit and glm can model the
proportion of city budget spent on the category security (police and fire
department), whereas dirifit can simultaneously model the proportions
spent on categories security, social policy, infrastructure, and other.
Another difference between betafit and glm is that glm
can handle a proportion of exactly zero and one, whereas
betafit can handle only proportions between zero and one.
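The glm approach needs only built-in commands, as the sketch below shows. The response share is an artificial proportion constructed from the auto data purely for illustration; betafit and dirifit are user-written and are not shown.

```stata
sysuse auto, clear

* Artificial proportion for illustration only: headroom as a share of 5 inches
generate share = headroom / 5

* Quasi-maximum likelihood: binomial family, logit link, robust standard errors
glm share mpg weight, family(binomial) link(logit) vce(robust)

* Predicted proportions stay inside the unit interval
predict share_hat, mu
summarize share share_hat
```

The constructed response includes values equal to one, which this model accepts but a beta likelihood would not.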
Special attention will be given to how to fit these models in Stata
and how to interpret the results. This presentation will end with a
warning not to use any of these techniques for ecological inference, i.e.,
using aggregated data to infer about individual units. To use a classic
example: In the United States in the 1930s, states with a high proportion of immigrants
also had a high literacy rate (in the English language), whereas immigrants
were on average less literate than nonimmigrants. Regressing state level
literacy rate on state level proportion of immigrants would thus give a
completely wrong picture about the relationship between individual immigrant
status and literacy.
A brief introduction to Mata
David M. Drukker
StataCorp, College Station, TX
After presenting a general introduction to the Mata matrix programming
language, this talk discusses Mata’s many simple links to the Stata
dataset and other important objects in Stata’s memory. An application
to maximum simulated likelihood illustrates the programming techniques.
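As a small illustration of those links, the sketch below copies variables from the Stata dataset into Mata, computes OLS coefficients by hand, and passes them back as a Stata matrix. The example is an assumption chosen for brevity and is unrelated to the simulated-likelihood application in the talk.

```stata
sysuse auto, clear

mata:
// Copy Stata variables into Mata matrices
X = st_data(., ("mpg", "weight"))
y = st_data(., "price")

// Append a constant column and solve the normal equations
X = X, J(rows(X), 1, 1)
b = invsym(cross(X, X)) * cross(X, y)

// Return the coefficients to Stata as a row vector
st_matrix("b_ols", b')
end

matrix list b_ols
regress price mpg weight
```

The hand-computed coefficients should agree with the regress output, with the constant in the last column.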
Time-series filtering techniques in Stata
Department of Economics, Boston College
I will describe several time-series filtering techniques, including the
Hodrick–Prescott, Baxter–King, and bandpass filters and variants,
and present new Mata-coded versions of these routines, which are considerably
more efficient than previous ado-code routines. Applications to several
economic and financial time series will be discussed.
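For readers with a recent Stata, a Hodrick–Prescott decomposition can be sketched with the built-in tsfilter command, which was added to Stata after this meeting and is not the Mata routine presented in the talk. The smoothing value of 1600 is the conventional choice for quarterly data, and the new variable names are assumptions.

```stata
sysuse gnp96, clear
tsset date

* Work with the log of real GNP
generate ln_gnp = ln(gnp96)

* Hodrick-Prescott filter: cyclical component in gnp_cycle, trend in gnp_trend
tsfilter hp gnp_cycle = ln_gnp, smooth(1600) trend(gnp_trend)

tsline gnp_cycle
```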