Ross Harris

*Department of Social Medicine, University of Bristol*

**Abstract**

We have been undertaking a systematic review of the literature on diet and cancer, which included all study types reporting on any dietary exposure. The data were presented as a mixture of categorical results, mean differences, and regression coefficients, which we analyzed in Stata to produce dose–response estimates and other statistics for all results.

The resulting tables were large (more than 3000 results). To
rapidly produce formatted tables, we wrote the **xtable** command, which arranges
data for exporting with formatting tags. These tags are then recognized by
an Excel macro, which creates headings, merges across cells, and performs
other formatting actions as required. In this way the data are compact, as
study-level information is merged across cells to reduce duplication, and
neatly organized. The process allows users to arrange the data as they
wish, or the data can be sorted according to other variables within the
command, or a mix of both. The data are exported as text, imported
into Excel in one intermediate step, and then formatted with a
single key press. In this way complex tables can be
produced with duplicate information merged across cells at more than one
level, and multiple levels of headings can be incorporated. After the initial
specification of the **xtable** command, rerunning the procedure is simple,
which makes it easy to update and modify the analysis.

After developing these techniques, we wrote a program to form simple sentences based on our data, e.g.: “The Iowa Women’s Health study, a prospective cohort, reported an unadjusted OR of 1.09 (95% CI: 0.98, 1.21) per cup per day increase of coffee.” A program was then created that produced a series of short texts for each exposure in a log file, consisting of a title, subtitles, a small frequency table, and a sentence summarizing each result. The log file was then opened in Word, and tags were used to format the document as before, creating titles and aligning the frequency tables. This proved a massive labor-saving device, as much of the report was rather repetitious, and had the added benefit of creating a structure for the report and preventing typing errors and accidental omission of results. The code for this method is too specific to produce a general command, but the techniques will be discussed.
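
The sentence-building step might look something like the following minimal Stata sketch. The one-row dataset, the variable names (`study`, `design`, `est`, `lb`, `ub`), and the reading of the interval as a 95% confidence interval are assumptions made for illustration; this is not the authors' program.

```stata
* Hypothetical data: one row per reported result (all names are illustrative)
clear
input str30 study str20 design est lb ub
"Iowa Women's Health study" "prospective cohort" 1.09 0.98 1.21
end

* Display one summary sentence per result (it lands in the log if one is open)
forvalues i = 1/`=_N' {
    display "The " study[`i'] ", a " design[`i'] ///
        ", reported an unadjusted OR of " %4.2f est[`i'] ///
        " (95% CI: " %4.2f lb[`i'] ", " %4.2f ub[`i'] ")" ///
        " per cup per day increase of coffee."
}
```

Because the sentence template lives in one place, a change of wording or number format propagates to every result when the do-file is rerun.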

Rosa Gini

Jacopo Pasquini

*Agenzia Regionale di Sanità Toscana*

**Abstract**

This paper describes a natural interaction between Stata and markup languages. Stata’s programming and analysis features, together with the flexibility in output formatting of markup languages, allow generation and/or update of whole documents (reports, presentations on screen or web, etc.). Examples are given for both LaTeX and HTML.

Stata’s commands are mainly dedicated to analysis of data on a computer screen and output of analysis stored in a log file available to researchers for later reading. However, users may need to produce output in different formats and to cooperate with professionals who are not familiar with log files. An elegant solution to this problem is exporting output in the format of a markup language, such as LaTeX or HTML.

The most common means for presenting the results of one or several analyses are text on paper, screen presentations, and websites. While it is common to generate such outputs by visual programs, such as MS Office or OpenOffice, it is impossible for Stata to produce documents this way, as it lacks eyes to format a table and hands to hold a mouse to cut and paste graphs. Nevertheless, each of those presentation formats can also be obtained with use of a markup language. Wikipedia defines a markup language as “a kind of text encoding that represents text as well as details about the structure and appearance of the text”.

To publish on the web, HTML is one of the best and most compatible formats. On the other hand, LaTeX is a complete language for editing and text formatting on either paper or screen (most commonly via PDF files). Both languages are easy to learn, free, and well documented.

Now Stata happens to be perfectly capable of writing text, such as the instructions for a markup language to write a report, a sequence of slides, or the pages of a website containing tables and graphs.

The problem of formatting the output of a command in LaTeX and/or HTML has been addressed in various ways by several authors. The most comprehensive reference to this issue is Newson (2003), who also provides a suite of tools aimed at printing in markup language the list of a Stata dataset, in such a way that variable labels, value labels, significant figures, and so forth are formatted the way one would wish.

More generally, we can exploit Stata’s ability to write text files to make it produce virtually any piece of markup language code: tables and graphs, but also other kinds of objects, such as lists and trees.

Finally, by further printing some code putting together all of the ingredients, we make Stata produce a whole document, which is then browsable, printable, or showable on a screen, according to the kind of document.
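
As a rough illustration of the idea, the following sketch uses Stata's built-in `file` command to write a tiny HTML page containing one computed figure. The output file name `report.html`, the page layout, and the choice of the `auto` example dataset are assumptions for illustration, not the authors' do-file.

```stata
* Load an example dataset shipped with Stata and compute a result
sysuse auto, clear
quietly summarize mpg

* Write a small HTML page; rerunning the do-file refreshes the figure
tempname fh
file open `fh' using "report.html", write replace text
file write `fh' "<html><body>" _n
file write `fh' "<h1>Mileage summary</h1>" _n
file write `fh' "<table>" _n
file write `fh' "<tr><td>Mean mpg</td><td>" %6.2f (r(mean)) "</td></tr>" _n
file write `fh' "</table>" _n
file write `fh' "</body></html>" _n
file close `fh'
```

The same pattern applies to LaTeX: only the strings passed to `file write` change.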

The key feature of this method is that the automatically produced document can be completely updated as soon as the figures in the data change. This is particularly suitable when the user needs to produce a large amount of output or routinely performs analyses on the same dataset structure, such as administrative databases or data collected from a long-running study.

As an example of these facilities, we describe a do-file that automatically
constructs a website for the Regional Agency for Public Health of
Tuscany. Finally, we remark that to apply this method, Stata
commands must store their results in memory, at least as many as are
necessary to reproduce the screen output. This is generally the case, with
some notable counterexamples (**dstdize**, **svyprop**, ...).

- Newson, R. 2003. Confidence intervals and p-values for delivery to the end user. *Stata Journal* 3: 245–269.

**Additional information**

gini_pasquini.pdf

gini_pasquini_handout.pdf

gini_pasquini.zip

Tamás Bartus

*Institute of Sociology and Social Policy, Corvinus University, Budapest*

**Abstract**

Students of racial and gender inequalities are often interested in knowing
to what extent an observed group difference can be attributed to differences
in returns to productive abilities (discrimination effect) or to
differences in the average of productive abilities (endowment effect). The
standard Blinder–Oaxaca decomposition technique, which applies to continuous
outcomes, measures the discrimination (endowment) effect in terms of
differences in group-specific regression parameters (means), weighted by
group-specific means (regression parameters). This article shows that the
standard decomposition technique can be meaningfully extended to categorical
outcomes if the regression coefficients are substituted with marginal
effects. A user-written program, **gdecomp** (working title), is also presented,
which basically processes marginal effects obtained from another
user-written program, **margeff**.
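
For reference, one common way of writing the standard twofold decomposition for groups $A$ and $B$ is sketched below. The notation is an assumption made here, and which group's coefficients and means serve as weights is a matter of choice rather than something fixed by the abstract.

$$
\bar{y}_A - \bar{y}_B
  = \underbrace{(\bar{x}_A - \bar{x}_B)'\hat{\beta}_B}_{\text{endowment effect}}
  + \underbrace{\bar{x}_A'(\hat{\beta}_A - \hat{\beta}_B)}_{\text{discrimination effect}}
$$

Here $\bar{x}_g$ is the vector of covariate means and $\hat{\beta}_g$ the vector of regression coefficients in group $g$. The extension described above replaces $\hat{\beta}_g$ with the group-specific marginal effects when the outcome is categorical.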

Giovanni S. F. Bruno

*Istituto di Economia Politica, Università Bocconi, Milano*

**Abstract**

Data used in applied econometrics are typically nonexperimental in nature. This makes the assumption of exogeneity of regressors untenable and poses a serious identification issue in the estimation of economic structural relationships.

As long as the source of endogeneity is confined to unobserved heterogeneity between groups (for example, time-invariant managerial ability in firm-level labor demand equations), the availability of panel data can identify the parameters of interest. If endogeneity, instead, is more pervasive, stemming also from unobserved within-group variation (for example, a transitory technology shock hitting at the same time both the labor demand of the firm and the wage paid), then standard panel data estimators are biased and instrumental variable or generalized method of moments estimators provide valid alternative techniques.

This paper extends the analysis in Bruno (2005) focusing on dynamic panel-data (DPD) models with endogenous regressors.

Various Monte Carlo experiments are carried out through my Stata code
**xtarsim** to assess the relative finite-sample performances of
some popular DPD estimators, such as Arellano and Bond (**xtabond**,
**xtabond2**), Blundell and Bond (**xtabond2**), Anderson and Hsiao
(**ivreg**, **ivreg2**, **xtivreg**, **xtivreg2**), and LSDVC
(**xtlsdvc**).

New versions of the commands **xtarsim** and **xtlsdvc** are also presented.
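
As a reminder of the setting, a first-order dynamic panel-data model is usually written as follows; the symbols are assumed here for illustration and are not taken from the talk.

$$
y_{it} = \gamma\, y_{i,t-1} + x_{it}'\beta + \eta_i + \epsilon_{it},
\qquad i = 1, \dots, N, \quad t = 1, \dots, T_i
$$

Here $\eta_i$ is the unobserved group effect and $\epsilon_{it}$ the idiosyncratic error. The lagged dependent variable is correlated with the within-transformed error, which is why the least-squares dummy-variable estimator is biased when $T_i$ is small, and the estimators listed above address that bias in different ways.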

- Bruno, G. S. F. 2005. Estimation and inference in dynamic unbalanced panel data models with a small number of individuals. *Stata Journal* 5: 473–500.

Roger Newson

*National Heart and Lung Institute, Imperial College London*

**Abstract**

Somers’ D and Kendall’s tau-a are parameters behind rank or
nonparametric statistics, interpreted as differences between proportions.
Given two bivariate data pairs $(X_1, Y_1)$ and $(X_2, Y_2)$, Kendall’s tau-a
parameter $\tau_{XY}$ is the difference between
the probability that the two X–Y pairs are concordant and the
probability that the two X–Y pairs are discordant, and Somers’ D
parameter $D_{YX}$ is the difference between the corresponding conditional
probabilities, given that the X-values are ordered. The **somersd** package
computes confidence intervals for both parameters. The Stata 9 version of
**somersd** uses Mata to increase computing speed and greatly extends the
definition of Somers’ D, allowing the X and/or Y variables to be
left- or right-censored and allowing multiple versions of Somers’ D
for multiple sampling schemes for the X–Y pairs. In particular, we may
define stratified versions of Somers’ D, in which we compare only
X–Y pairs from the same stratum. The strata may be defined by grouping
a Rubin–Rosenbaum propensity score, based on the values of multiple
confounders for an association between exposure variable X and an outcome
variable Y. Therefore, rank statistics can have not only confidence
intervals but also confounder-adjusted confidence intervals. Usually, we either
estimate $D_{YX}$ as a measure of the effect of X on Y, or we estimate $D_{XY}$ as a
measure of the performance of X as a predictor of Y, compared with other
predictors. Alternative rank-based measures of the effect of X on Y include
the Hodges–Lehmann median difference and the Theil–Sen median
slope, both of which are defined in terms of Somers’ D.
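
In symbols, the two parameters described above can be written as follows for the uncensored, unstratified case. This is a compact restatement in standard notation, not a quotation from the talk.

$$
\tau_{XY} = E\left[\operatorname{sign}(X_1 - X_2)\,\operatorname{sign}(Y_1 - Y_2)\right],
\qquad
D_{YX} = \frac{\tau_{XY}}{\tau_{XX}}
$$

Since $\tau_{XX} = \Pr(X_1 \neq X_2)$, dividing by it conditions on the two X-values being unequal, which gives the difference between the conditional probabilities of concordance and discordance.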

Krishnan Bhaskaran

Hannah Green

*MRC Clinical Trials Unit, London*

**Abstract**

We introduce the **assertk** command, beginning with a motivation and a
comparison with the built-in **assert** command. We will then show some examples
demonstrating the various options that can be used to produce customized
output and to perform more complex checks.

**assertk** is a simple utility that makes data consistency checking and
reporting on data quality easy.

The built-in Stata command **assert** checks each observation for a specified
condition and halts do-files and ado-files when the specified condition is
not satisfied. For example:

    . assert age_entry < .
    2 contradictions in 149 observations
    assertion is false
    end of do-file
    r(9);

Thus **assert** is a useful tool for checking important assumptions about the
data you are about to process; your do-file will simply not continue if
these assumptions do not pass the checks. The principle of the **assert**
command also lends itself to consistency checking, i.e., performing a suite
of checks on a dataset to identify potential errors. This is an important
part of the process of data cleaning. However, in this application, the
halting of do-files is a hindrance, and there is a lack of detailed output
showing which observations failed the check.

In **assertk**, a condition is specified, and each observation is checked
against this condition. If any data do not pass the check, the
irregularities are output (with the output customizable by various options)
and the do-file continues. For example:

    . assertk age_ent < ., mess(Age at entry is missing) vars(id age_ent)
    Age at entry is missing (1 obs)
    id       age_ent
    38048    .
    40352    .

Thus a suite of checks can be programmed easily, with one line per check, and a meaningful log of data errors can be produced for use by data managers and statisticians.
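
The non-halting behavior can be approximated with built-in commands alone, as in the sketch below. It is not **assertk** itself; the four-row dataset and its values are hypothetical and serve only to make the example self-contained.

```stata
* Hypothetical data with two missing ages at entry (values are illustrative)
clear
input id age_ent
1 34
2 .
3 51
4 .
end

* capture suppresses the error, so the do-file carries on after a failed check
capture assert age_ent < .
if _rc != 0 {
    display as text "Age at entry is missing"
    list id age_ent if !(age_ent < .)
}
```

Each check written this way takes several lines, which is the repetition a one-line-per-check command is meant to remove.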

Stephen P. Jenkins

**Abstract**

This short talk introduces and illustrates **svylorenz**, a Stata 9
program for computing variance estimates for quantile group shares of total
varname, cumulative quantile group shares (i.e., Lorenz curve ordinates), and
the Gini coefficient. The program implements the linearization methods
proposed by Kovačević and Binder (*Journal of Official Statistics*,
1997).

David M. Drukker

*StataCorp, College Station, TX*

**Abstract**

This talk discusses estimation, inference, and interpretation of panel-data models using Stata. The talk usually covers the linear RE and FE models, linear RE and FE models with AR(1) errors, linear RE and FE models with general within-panel correlation structures, Hausman–Taylor estimation, linear RE and FE with endogenous variables, linear FE dynamic models, linear mixed models, FE and RE nonlinear models, FE and RE logit models, FE and RE Poisson models, and stochastic frontier models for panel data. The talk briefly introduces each model discussed.

Nicholas J. Cox

*Durham University*

**Abstract**

Seasonal effects are dominant in many environmental time series, and are important or notable in many economic and biomedical time series. In several fields, using anything other than basic line graphs of responses versus time to display series showing seasonality is rare. This presentation will focus on a variety of tricks for graphically examining seasonality. Some of these tricks have long histories in climatology and related sciences, but are little known outside. I will discuss some original code, but the greater emphasis will be on users needing to know Stata functions and commands well to exploit the full potential of its graphics.

Vincent L. Wiggins

*StataCorp, College Station, TX*

**Abstract**

If you find yourself repeatedly specifying the same options on graph commands, you should write a graphics scheme. A scheme is nothing more than a file containing a set of rules specifying how you want your graphs to look. From the size of fonts used in titles and the color of lines and markers in plots to the placement of legends and the number of default ticks on axes, almost everything about a graph is controlled by the rules in a graphics scheme. We will look at how to create your own graphics schemes and where to find out more about all the rules available in schemes. The first scheme we create will be only a few lines long, yet will produce graphs distinctly different from any existing scheme.

Paul C. Lambert

*Centre for Biostatistics & Genetic Epidemiology, University of Leicester*

**Abstract**

In population-based cancer studies, cure is said to occur when the mortality
(hazard) rate in the diseased group of individuals returns to the same level
as that expected in the general population. The cure fraction (the
proportion of patients cured of disease) is of interest to patients and a
useful measure to monitor trends in survival of curable disease. I will
describe two types of cure model, namely, the mixture and nonmixture cure
model (Sposto 2002); explain how they can be extended to incorporate the
expected mortality rate (obtained from routine data sources); and discuss
their implementation in Stata using the **strsmix** and **strsnmix**
commands. In both commands there is the choice of parametric distribution
(Weibull, generalized gamma, and log-logistic) and link function for
the cure fraction (identity, logit, and log(–log)). As well as modeling
the cure fraction it is possible to include covariates for the ancillary
parameters for the parametric distributions. This ability is important, as it allows
for departures from proportional excess hazards (typical in many
population-based cancer studies). Both commands incorporate delayed entry
and can therefore be used to obtain up-to-date estimates of the cure
fraction by using period analysis (Smith et al. 2004). There is also an
associated predict command that allows prediction of the cure fraction,
relative survival, and the excess mortality rate with associated confidence
intervals. For some cancers the parametric distributions listed above do not
fit the data well, and I will describe how finite mixture distributions can
be used to overcome this limitation. I will use examples from international cancer
registries to illustrate the approach.
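
For orientation, the two model types are commonly written as below once the expected survival is incorporated. The notation is assumed here for illustration: $S^*(t)$ is the expected survival in the general population, $\pi$ the cure fraction, $S_u(t)$ the survival function of the uncured, and $F_z(t)$ a cumulative distribution function.

$$
\text{Mixture:}\quad S(t) = S^*(t)\left\{\pi + (1 - \pi)\,S_u(t)\right\}
\qquad
\text{Nonmixture:}\quad S(t) = S^*(t)\,\pi^{F_z(t)}
$$

In both forms the relative survival $S(t)/S^*(t)$ tends to $\pi$ as $t \to \infty$, which is what allows $\pi$ to be read as the proportion cured.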

- Smith, L. K., P. C. Lambert, J. L. Botha, and D. R. Jones. 2004. Providing more up-to-date estimates of patient survival: A comparison of standard survival analysis with period analysis using life-table methods and proportional hazards models. *Journal of Clinical Epidemiology* 57: 14–20.
- Sposto, R. 2002. Cure model analysis in cancer: An application to data from the Children’s Cancer Group. *Statistics in Medicine* 21: 293–312.

Zhiqiang Wang

*Centre for Chronic Disease, School of Medicine, University of Queensland*

**Abstract**

Controversy exists regarding proper methods for the selection of variables in confounder control in epidemiological studies. Various approaches have been proposed for selecting a subset of confounders among many possible subsets. This paper describes the use of two practical tools, Stata postestimation commands written by the author, to identify the presence and direction of confounding.

One command, **confall**, plots all possible effect estimates against
a statistical value such as the *p*-value or Akaike information
criterion. This computing-intensive procedure allows researchers to inspect
the variability of effect estimates from different possible models. Another
command, **confnd**, uses a stepwise approach to identify confounders
that have caused substantial changes in the effect measurement.

Using three examples, the author illustrates the use of those programs in different situations. When all possible effect estimates are similar, indicating little confounding, the investigator can confidently report the presence and direction of the association between exposure and disease regardless of which variable selection method is used. On the other hand, when all possible effect estimates vary substantially, indicating the presence of confounding, a change-in-estimate plot and its corresponding table are helpful for identifying important confounders. Both commands can be used after most commonly used estimation commands for epidemiological data.
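
The change-in-estimate idea can be illustrated by hand with built-in commands, as in the sketch below. It does not use **confall** or **confnd**; the `auto` example dataset and the choice of `weight` as the candidate confounder are assumptions for illustration only.

```stata
* Example dataset shipped with Stata; variables chosen only for illustration
sysuse auto, clear

* Crude estimate: log odds ratio for mpg without adjustment
quietly logit foreign mpg
scalar b_crude = _b[mpg]

* Adjusted estimate: add one candidate confounder
quietly logit foreign mpg weight
scalar b_adj = _b[mpg]

* Relative change in the effect estimate caused by the adjustment
display "Change in log odds ratio for mpg after adjusting for weight: " ///
    %5.1f (100*(b_adj - b_crude)/b_crude) "%"
```

Repeating this comparison over every subset of candidate confounders is what makes the all-possible-estimates approach computing-intensive.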

Ian White

*MRC Biostatistics Unit, Cambridge*

**Abstract**

In teaching logistic regression for case–control studies, I ask master’s students in epidemiology to assess an interaction between a 2-level exposure and a 4-level exposure using a likelihood-ratio test. Theory suggests that the test statistic has 3 degrees of freedom, but Stata uses 2 degrees of freedom. The explanation turns out to be that one exposure combination contains controls but no cases, so that one parameter goes to infinity. It is hard to convince the students (and myself) that this combination contributes no degrees of freedom.

I will review how Stata handles situations in which parameters go to infinity.
Although asymptotics for likelihood-ratio tests may not work well in this
situation, I will argue that **lrtest** should be modified to reflect the
true number of degrees of freedom.

Patrick Royston

*MRC Clinical Trials Unit, London*

**Abstract**

Most survival data are analyzed by using the Cox proportional hazards
model (in Stata: the **stcox** command). Almost by definition, a proportion of
the observations will be right-censored. Analysis of covariate effects in
the Cox model is couched in terms of (log) hazard ratios, and the
distribution of time itself is essentially ignored. This practice is totally
different from standard analysis of a continuous outcome variable, where
multiple (linear) regression is the technique most often used. Hazard ratios
are difficult to interpret and give little insight into how a
covariate affects the time to an event. Furthermore, the assumption of
proportional hazards is strong, and when there is long-term follow-up,
is often breached. I will illustrate how the censored lognormal model can be
used to good effect to remedy some of these deficiencies and give better
insight into the data. Multiple imputation of the censored observations may
be followed by use of familiar exploratory graphical tools, such as
dotplots, scatterplots, and scatterplot smoothers. Analyses using standard
linear regression methods may be done on the log time scale, leading to
simple interpretations and informative graphs of effect size. I will explore
these ideas in the context of a familiar breast cancer dataset and will
show how a treatment/covariate interaction is easily conveyed graphically.
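
As a minimal sketch of the model-fitting step only, a censored lognormal model can be fitted with the built-in **streg** command. The `cancer` example dataset shipped with Stata stands in for the breast cancer data mentioned above, the factor-variable notation assumes Stata 11 or later, and the multiple-imputation and graphical steps are not shown.

```stata
* Example survival dataset shipped with Stata (not the data used in the talk)
sysuse cancer, clear
stset studytime, failure(died)

* Censored lognormal model: coefficients are on the log time scale
streg i.drug age, distribution(lognormal)

* Predicted median survival time for each patient
predict med_time, median time
```

Because the coefficients act on log time, exponentiating one gives a multiplicative effect on survival time rather than a hazard ratio.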

Maarten L. Buis

*Department of Social Research Methodology, Vrije Universiteit Amsterdam*

**Abstract**

When dealing with response variables that are proportions, people often use
**regress**. This approach can be problematic since the model can lead to
predicted proportions less than zero or more than one and errors that are
likely to be heteroskedastic and nonnormally distributed. This talk will
discuss three more appropriate methods for proportions as response
variables: **betafit**, **dirifit**, and **glm**.

**betafit** is a maximum likelihood estimator using a beta likelihood,
**dirifit** is a maximum likelihood estimator using a Dirichlet
likelihood, and **glm** can be used to create a quasi–maximum likelihood
estimator using a binomial likelihood. On an applied level, a difference
between **dirifit** and the others is that the others can handle only one
response variable, whereas **dirifit** can handle multiple response
variables. For instance, **betafit** and **glm** can model the
proportion of city budget spent on the category security (police and fire
department), whereas **dirifit** can simultaneously model the proportions
spent on categories security, social policy, infrastructure, and other.
Another difference between **betafit** and **glm** is that **glm**
can handle proportions of exactly zero or one, whereas
**betafit** can handle only proportions strictly between zero and one.
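
The **glm** variant can be sketched with built-in commands as follows. The `census` example dataset and the derived variable `propurban` are assumptions chosen for illustration, and `vce(robust)` assumes Stata 10 or later; **betafit** and **dirifit** are user-written and are not shown.

```stata
* Example dataset shipped with Stata; build a proportion as the response
sysuse census, clear
generate propurban = popurban / pop

* Quasi-maximum likelihood with a binomial likelihood and robust standard errors
glm propurban medage, family(binomial) link(logit) vce(robust)

* Predicted proportions, which stay between zero and one by construction
predict phat, mu
```

The logit link keeps predictions inside the unit interval, and the robust standard errors allow for the fact that the response is not literally binomial.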

Special attention will be given to how to fit these models in Stata and how to interpret the results. This presentation will end with a warning not to use any of these techniques for ecological inference, i.e., using aggregated data to infer about individual units. To use a classic example: In the United States in the 1930s, states with a high proportion of immigrants also had a high literacy rate (in the English language), whereas immigrants were on average less literate than nonimmigrants. Regressing state-level literacy rate on state-level proportion of immigrants would thus give a completely wrong picture of the relationship between individual immigrant status and literacy.

David M. Drukker

*StataCorp, College Station, TX*

**Abstract**

After presenting a general introduction to the Mata matrix programming language, this talk discusses Mata’s many simple links to the Stata dataset and other important objects in Stata’s memory. An application to maximum simulated likelihood illustrates the programming techniques.

Kit Baum

*Department of Economics, Boston College*

**Abstract**

I will describe several time-series filtering techniques, including the Hodrick–Prescott, Baxter–King, and bandpass filters and variants, and present new Mata-coded versions of these routines, which are considerably more efficient than previous ado-code routines. Applications to several economic and financial time series will be discussed.
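
For reference, the Hodrick–Prescott filter mentioned above extracts a trend $\tau_t$ from a series $y_t$ by solving the penalized least-squares problem below. The notation is standard but assumed here, and the smoothing parameter $\lambda$ is chosen by the user.

$$
\min_{\tau_1, \dots, \tau_T}\;
\sum_{t=1}^{T} (y_t - \tau_t)^2
+ \lambda \sum_{t=2}^{T-1} \left[(\tau_{t+1} - \tau_t) - (\tau_t - \tau_{t-1})\right]^2
$$

The cyclical component is then $y_t - \tau_t$. Larger values of $\lambda$ penalize changes in the trend's slope more heavily and so give a smoother trend.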
