Last updated: 20 September 2006
Centre for Econometric Analysis
Cass Business School
106 Bunhill Row
London EC1 8TZ
Department of Social Medicine, University of Bristol
We have been undertaking a systematic review of the literature on diet and cancer, which included all study types reporting on any dietary exposure. The data were presented in a mixture of category, mean difference, and regression coefficients, which we analyzed in Stata to produce dose–response estimates and other statistics for all results.
The resulting tables were large (more than 3000 results). To produce formatted tables rapidly, we wrote the xtable command, which arranges data for export with formatting tags. These tags are then recognized by an Excel macro, which creates headings, merges across cells, and performs other formatting actions as required. The resulting tables are compact and neatly organized, because study-level information is merged across cells to reduce duplication. Users can arrange the data as they wish, have the command sort the data by other variables, or combine both approaches. The data are exported as text; there is one intermediate step as they are imported into Excel, and then a single key press formats the table. In this way complex tables can be produced with duplicate information merged across cells at more than one level, and multiple levels of headings can be incorporated. Once the xtable command has been specified, rerunning the procedure is straightforward, which makes updates and modifications to the analysis easy.
After developing these techniques, we wrote a program to form simple sentences based on our data, e.g.: “The Iowa Women’s Health study, a prospective cohort, reported an unadjusted OR of 1.09 (95% CI 0.98, 1.21) per cup per day increase of coffee.” A program was then created that produced a series of short texts for each exposure in a log file, consisting of a title, subtitles, a small frequency table, and a sentence summarizing each result. The log file was then opened in Word and tags used to format the document as before to create titles and align the frequency tables. This proved a massive labor-saving device, as much of the report was rather repetitious, and had the added benefit of creating a structure for the report and preventing typing errors and accidental omission of results. The code for this method is too specific to produce a general command, but the techniques will be discussed.
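The sentence-building idea can be illustrated with ordinary string functions. The following is only a minimal sketch, not the authors' program: the variable names (study, design, measure, est, lo, hi, unit) are hypothetical, and the single data row is illustrative.

```stata
* Sketch only: all variable names are hypothetical, not the authors' own.
* Run from a do-file (the /// line continuation is not valid interactively).
clear
input str30 study str20 design str14 measure est lo hi str40 unit
"Iowa Women's Health study" "prospective cohort" "unadjusted OR" 1.09 0.98 1.21 "cup per day increase of coffee"
end

* Glue the pieces together; string() formats each number to two decimals
generate str244 sentence = "The " + study + ", a " + design + ", reported an " ///
    + measure + " of " + string(est, "%4.2f") + " (" + string(lo, "%4.2f")     ///
    + ", " + string(hi, "%4.2f") + ") per " + unit + "."

list sentence, noobs
```

With one row per result, the same generate statement produces one sentence per result, which is where the protection against typing errors and omissions would come from.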
Agenzia Regionale di Sanità Toscana
This paper describes a natural interaction between Stata and markup languages. Stata’s programming and analysis features, together with the flexibility in output formatting of markup languages, allow generation and/or update of whole documents (reports, presentations on screen or web, etc.). Examples are given for both LaTeX and HTML.
Stata’s commands are mainly dedicated to analysis of data on a computer screen and output of analysis stored in a log file available to researchers for later reading. However, users may need to produce output in different formats and to cooperate with professionals who are not familiar with log files. An elegant solution to this problem is exporting output in the format of a markup language, such as LaTeX or HTML.
The most common means for presenting the results of one or several analyses are text on paper, screen presentations, and websites. While it is common to generate such outputs by visual programs, such as MS Office or OpenOffice, it is impossible for Stata to produce documents this way, as it lacks eyes to format a table and hands to hold a mouse to cut and paste graphs. Nevertheless, each of those presentation formats can also be obtained with use of a markup language. Wikipedia defines a markup language as “a kind of text encoding that represents text as well as details about the structure and appearance of the text”.
To publish on the web, HTML is one of the best and most compatible formats. On the other hand, LaTeX is a complete language for editing and text formatting on either paper or screen (most commonly via PDF files). Both languages are easy to learn, free, and well documented.
Now Stata happens to be perfectly capable of writing text, such as the instructions for a markup language to write a report, a sequence of slides, or the pages of a website containing tables and graphs.
The problem of formatting the output of a command in LaTeX and/or HTML has been addressed in various ways by several authors. The most comprehensive reference to this issue is Newson (2003), who also provides a suite of tools aimed at printing in markup language the list of a Stata dataset, in such a way that variable labels, value labels, significant figures, and so forth are formatted the way one would wish.
More generally, we can exploit Stata’s ability to write text files to make it produce virtually any piece of markup language code: tables and graphs, but also other kinds of objects, such as lists and trees.
Finally, by further printing some code putting together all of the ingredients, we make Stata produce a whole document, which is then browsable, printable, or showable on a screen, according to the kind of document.
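As a rough illustration of Stata writing markup, the sketch below uses the built-in file commands to write a small HTML page containing one table. The file name report.html, the choice of the auto example dataset, and the table layout are assumptions made for this example; they are not taken from the paper.

```stata
* Sketch only: file name, dataset, and table layout are illustrative.
sysuse auto, clear
tempname fh
file open `fh' using "report.html", write replace text

file write `fh' "<html><body>" _n
file write `fh' "<h1>Mean price by repair record</h1>" _n
file write `fh' "<table>" _n
file write `fh' "<tr><th>rep78</th><th>Mean price</th></tr>" _n

* One table row per nonmissing level of rep78
levelsof rep78, local(levels)
foreach l of local levels {
    quietly summarize price if rep78 == `l'
    file write `fh' "<tr><td>`l'</td><td>" %9.1f (r(mean)) "</td></tr>" _n
}

file write `fh' "</table>" _n "</body></html>" _n
file close `fh'
```

Because the page is regenerated from the data each time the do-file runs, rerunning it after the data change updates every figure in the document.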
The key feature of this method is that the automatically produced document can be completely updated as soon as the figures in the data change. This is particularly suitable when the user needs to produce a large amount of output or routinely performs analyses on the same dataset structure, such as administrative databases or collections of data from a long-lasting study.
For an example of those facilities, we describe a do-file that automatically constructs a website for the Regional Agency for Public Health of Tuscany. Finally, we remark that to apply this method, Stata commands must store their results in memory, at least as many as are necessary to reproduce the screen output. This is generally the case, with some notable counterexamples (dstdize, svyprop, ...).
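To see which results a command leaves behind, one can inspect r() and e() after running it. A minimal sketch using the auto example dataset (chosen here purely for illustration):

```stata
* Sketch only: the auto dataset is used just to have something to run.
sysuse auto, clear

* r-class command: results are left in r()
quietly summarize mpg
return list
display "Mean mpg: " %5.2f r(mean)

* e-class command: results are left in e(), coefficients in _b[]
quietly regress price mpg
ereturn list
display "Coefficient on mpg: " %9.3f _b[mpg]
```

Anything listed by return list or ereturn list can be passed to a file-writing step; anything that appears only on screen cannot.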
Institute of Sociology and Social Policy, Corvinus University, Budapest
Students of racial and gender inequalities are often interested in knowing to what extent an observed group difference can be attributed to differences in returns to productive abilities (discrimination effect) or to differences in the average of productive abilities (endowment effect). The standard Blinder–Oaxaca decomposition technique, which applies to continuous outcomes, measures the discrimination (endowment) effect in terms of differences in group-specific regression parameters (means), weighted by group-specific means (regression parameters). This article shows that the standard decomposition technique can be meaningfully extended to categorical outcomes if the regression coefficients are substituted with marginal effects. A user-written program, gdecomp (working title), is also presented, which basically processes marginal effects obtained from another user-written program, margeff.
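For reference, one common form of the standard decomposition for a continuous outcome can be written as follows. The notation (groups A and B, with group A's coefficients as the reference) is an assumption made for this sketch; other weightings are equally standard, and this is not necessarily the form used in the article.

```latex
% Assumes OLS with an intercept in each group, so that \bar{Y}_g = \bar{X}_g' \hat{\beta}_g
\[
\bar{Y}_A - \bar{Y}_B
  = \underbrace{(\bar{X}_A - \bar{X}_B)'\hat{\beta}_A}_{\text{endowment effect}}
  + \underbrace{\bar{X}_B'(\hat{\beta}_A - \hat{\beta}_B)}_{\text{discrimination effect}}
\]
```

The extension described in the abstract replaces the regression coefficients with marginal effects when the outcome is categorical.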
Giovanni S. F. Bruno
Istituto di Economia Politica, Università Bocconi, Milano
Data used in applied econometrics are typically nonexperimental in nature. This makes the assumption of exogeneity of regressors untenable and poses a serious identification issue in the estimation of economic structural relationships.
As long as the source of endogeneity is confined to unobserved heterogeneity between groups (for example, time-invariant managerial ability in firm-level labor demand equations), the availability of panel data can identify the parameters of interest. If endogeneity, instead, is more pervasive, stemming also from unobserved within-group variation (for example, a transitory technology shock hitting at the same time both the labor demand of the firm and the wage paid), then standard panel data estimators are biased and instrumental variable or generalized method of moments estimators provide valid alternative techniques.
This paper extends the analysis in Bruno (2005) focusing on dynamic panel-data (DPD) models with endogenous regressors.
Various Monte Carlo experiments are carried out through my Stata code xtarsim to assess the relative finite-sample performances of some popular DPD estimators, such as Arellano and Bond (xtabond, xtabond2), Blundell and Bond (xtabond2), Anderson and Hsiao (ivreg, ivreg2, xtivreg, xtivreg2), and LSDVC (xtlsdvc).
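As background, a textbook first-order dynamic panel-data model and the first-difference transformation underlying the Anderson and Hsiao estimator can be sketched as below. The symbols are generic notation chosen for this sketch, not necessarily the paper's.

```latex
% Generic notation: \eta_i is the group effect, \varepsilon_{it} the idiosyncratic error
\[
y_{it} = \gamma\, y_{i,t-1} + x_{it}'\beta + \eta_i + \varepsilon_{it}
\]
\[
\Delta y_{it} = \gamma\, \Delta y_{i,t-1} + \Delta x_{it}'\beta + \Delta \varepsilon_{it}
\]
% \Delta y_{i,t-1} is correlated with \Delta\varepsilon_{it} through \varepsilon_{i,t-1};
% if \varepsilon_{it} is serially uncorrelated, y_{i,t-2} is a valid instrument for it.
```

Differencing removes the group effect but makes the lagged difference endogenous, which is why instruments are needed. The GMM estimators listed above exploit further lags as additional instruments.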
New versions of the commands xtarsim and xtlsdvc are also presented.
National Heart and Lung Institute, Imperial College London
Somers’ D and Kendall’s tau-a are parameters behind rank or nonparametric statistics, interpreted as differences between proportions. Given two bivariate data pairs (X1, Y1) and (X2, Y2), Kendall’s tau-a parameter τXY is the difference between the probability that the two X–Y pairs are concordant and the probability that the two X–Y pairs are discordant, and Somers’ D parameter DYX is the difference between the corresponding conditional probabilities, given that the X-values are ordered. The somersd package computes confidence intervals for both parameters. The Stata 9 version of somersd uses Mata to increase computing speed and greatly extends the definition of Somers’ D, allowing the X and/or Y variables to be left- or right-censored and allowing multiple versions of Somers’ D for multiple sampling schemes for the X–Y pairs. In particular, we may define stratified versions of Somers’ D, in which we compare only X–Y pairs from the same stratum. The strata may be defined by grouping a Rubin–Rosenbaum propensity score, based on the values of multiple confounders for an association between an exposure variable X and an outcome variable Y. Therefore, rank statistics can have not only confidence intervals but also confounder-adjusted confidence intervals. Usually, we either estimate DYX as a measure of the effect of X on Y, or we estimate DXY as a measure of the performance of X as a predictor of Y, compared with other predictors. Alternative rank-based measures of the effect of X on Y include the Hodges–Lehmann median difference and the Theil–Sen median slope, both of which are defined in terms of Somers’ D.
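In symbols, the two definitions given in words above can be written as follows (uncensored case only; the sign-function notation is a convenience chosen for this sketch):

```latex
\[
\tau_{XY} = E\!\left[\operatorname{sign}(X_1 - X_2)\,\operatorname{sign}(Y_1 - Y_2)\right]
          = \Pr(\text{concordant}) - \Pr(\text{discordant})
\]
\[
D_{YX} = \frac{\tau_{XY}}{\tau_{XX}},
\qquad \tau_{XX} = \Pr(X_1 \neq X_2)
\]
```

Dividing by \(\tau_{XX}\) is what conditions on the two X-values being ordered, so \(D_{YX}\) and \(D_{XY}\) generally differ even though \(\tau_{XY}\) is symmetric.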
MRC Clinical Trials Unit, London
We introduce the assertk command, beginning with a motivation and a comparison with the built-in assert command. We will then show some examples demonstrating the various options that can be used to produce customized output and to perform more complex checks.
assertk is a simple utility that makes data consistency checking and reporting on data quality easy.
The built-in Stata command assert checks each observation for a specified condition and halts do-files and ado-files when the specified condition is not satisfied. For example:
    . assert age_entry < .
    2 contradictions in 149 observations
    assertion is false
    end of do-file
    r(9);
Thus assert is a useful tool for checking important assumptions about the data you are about to process; your do-file will simply not continue if these assumptions do not pass the checks. The principle of the assert command also lends itself to consistency checking, i.e., performing a suite of checks on a dataset to identify potential errors. This is an important part of the process of data cleaning. However, in this application, the halting of do-files is a hindrance, and there is a lack of detailed output showing which observations failed the check.
In assertk, a condition is specified, and each observation is checked against this condition. If any data do not pass the check, the irregularities are output (with the output customizable by various options) and the do-file continues. For example:
    . assertk age_ent < ., mess(Age at entry is missing) vars(id age_ent)
    Age at entry is missing (1 obs)
         id   age_ent
      38048         .
      40352         .
Thus a suite of checks can be programmed easily, with one line per check, and a meaningful log of data errors can be produced for use by data managers and statisticians.
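For comparison, similar non-halting behavior can be approximated with built-in commands only. The sketch below is not the assertk implementation; it simply combines capture, assert, and list on a small made-up dataset (the two ids with missing ages echo the example above; the other rows are invented).

```stata
* Sketch only: approximates a non-halting check with built-in commands.
* The dataset is made up for illustration.
clear
input id age_ent
38048  .
40352  .
10001 54
10002 61
end

* capture suppresses the error so that the do-file continues
capture assert age_ent < .
if _rc {
    display as text "Age at entry is missing"
    list id age_ent if !(age_ent < .), noobs
}
```

Each check written this way takes several lines, which is the repetition a one-line-per-check utility is meant to remove.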
Stephen P. Jenkins
This short talk introduces and illustrates svylorenz, a Stata 9 program for computing variance estimates for quantile group shares of total varname, cumulative quantile group shares (i.e., Lorenz curve ordinates), and the Gini coefficient. The program implements the linearization methods proposed by Kovačević and Binder (Journal of Official Statistics, 1997).
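The quantities being estimated can be stated compactly. In the sketch below, Q denotes the quantile function of the variable and μ its mean; this is generic textbook notation, not necessarily that of the program or of Kovačević and Binder.

```latex
% Lorenz curve ordinate at cumulative population share p
\[
L(p) = \frac{1}{\mu}\int_0^{p} Q(u)\,du , \qquad 0 \le p \le 1
\]
% Share of the total held by the quantile group (p_{k-1}, p_k]
\[
s_k = L(p_k) - L(p_{k-1})
\]
% Gini coefficient
\[
G = 1 - 2\int_0^{1} L(p)\,dp
\]
```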
David M. Drukker
StataCorp, College Station, TX
This talk discusses estimation, inference, and interpretation of panel-data models using Stata. The talk usually covers the linear RE and FE models, linear RE and FE models with AR(1) errors, linear RE and FE models with general within-panel correlation structures, Hausman–Taylor estimation, linear RE and FE with endogenous variables, linear FE dynamic models, linear mixed models, FE and RE nonlinear models, FE and RE logit models, FE and RE Poisson models, and stochastic frontier models for panel data. The talk briefly introduces each model discussed.
Nicholas J. Cox
Seasonal effects are dominant in many environmental time series, and are important or notable in many economic and biomedical time series. In several fields, using anything other than basic line graphs of responses versus time to display series showing seasonality is rare. This presentation will focus on a variety of tricks for graphically examining seasonality. Some of these tricks have long histories in climatology and related sciences, but are little known outside. I will discuss some original code, but the greater emphasis will be on users needing to know Stata functions and commands well to exploit the full potential of its graphics.
Vincent L. Wiggins
StataCorp, College Station, TX
If you find yourself repeatedly specifying the same options on graph commands, you should write a graphics scheme. A scheme is nothing more than a file containing a set of rules specifying how you want your graphs to look. From the size of fonts used in titles and the color of lines and markers in plots to the placement of legends and the number of default ticks on axes, almost everything about a graph is controlled by the rules in a graphics scheme. We will look at how to create your own graphics schemes and where to find out more about all the rules available in schemes. The first scheme we create will be only a few lines long, yet will produce graphs distinctly different from any existing scheme.
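A very short scheme file might look like the sketch below. The file name scheme-mine.scheme and the single rule shown are assumptions for illustration, not the scheme built in the talk; the rule names should be checked against the scheme documentation.

```stata
* Contents of a hypothetical file scheme-mine.scheme,
* saved somewhere along the ado-path.
* Start from an existing scheme and override selected rules.
#include s2color

color background white
```

If the file is found, the scheme could then be requested with the scheme(mine) option on a graph command, or made the default with set scheme mine.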
Paul C. Lambert
Centre for Biostatistics & Genetic Epidemiology, University of Leicester
In population-based cancer studies, cure is said to occur when the mortality (hazard) rate in the diseased group of individuals returns to the same level as that expected in the general population. The cure fraction (the proportion of patients cured of disease) is of interest to patients and a useful measure to monitor trends in survival of curable disease. I will describe two types of cure model, namely, the mixture and nonmixture cure model (Sposto 2002); explain how they can be extended to incorporate the expected mortality rate (obtained from routine data sources); and discuss their implementation in Stata using the strsmix and strsnmix commands. In both commands there is the choice of parametric distribution (Weibull, generalized gamma, and log–logistic) and link function for the cure fraction (identity, logit, and log(–log)). As well as modeling the cure fraction it is possible to include covariates for the ancillary parameters for the parametric distributions. This ability is important, as it allows for departures from proportional excess hazards (typical in many population-based cancer studies). Both commands incorporate delayed entry and can therefore be used to obtain up-to-date estimates of the cure fraction by using period analysis (Smith et al. 2004). There is also an associated predict command that allows prediction of the cure fraction, relative survival, and the excess mortality rate with associated confidence intervals. For some cancers the parametric distributions listed above do not fit the data well, and I will describe how finite mixture distributions can be used to overcome this limitation. I will use examples from international cancer registries to illustrate the approach.
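In outline, both models split all-cause survival into expected and relative survival and then model the relative part. The notation below (π for the cure fraction, starred quantities for the general population) is chosen for this sketch and may differ from the commands' documentation.

```latex
% All-cause hazard and survival; starred terms come from population life tables
\[
h(t) = h^{*}(t) + \lambda(t), \qquad S(t) = S^{*}(t)\,R(t)
\]
% Mixture cure model: a proportion \pi is cured, the rest follow survival S_u(t)
\[
R(t) = \pi + (1 - \pi)\,S_u(t)
\]
% Nonmixture cure model: F_z(t) is a distribution function, so R(t) \to \pi as t \to \infty
\[
R(t) = \pi^{F_z(t)} = \exp\{\ln(\pi)\,F_z(t)\}
\]
```

In both cases relative survival levels off at π, which is the sense in which the excess mortality rate returns to zero for the cured.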
Centre for Chronic Disease, School of Medicine, University of Queensland
Controversy exists regarding proper methods for the selection of variables in confounder control in epidemiological studies. Various approaches have been proposed for selecting a subset of confounders among many possible subsets. This paper describes the use of two practical tools, Stata postestimation commands written by the author, to identify the presence and direction of confounding.
One command, confall, plots all possible effect estimates against a statistical value such as the p-value or Akaike information criterion. This computing-intensive procedure allows researchers to inspect the variability of effect estimates from different possible models. Another command, confnd, uses a stepwise approach to identify confounders that have caused substantial changes in the effect measurement.
Using three examples, the author illustrates the use of those programs in different situations. When all possible effect estimates are similar, indicating little confounding, the investigator can confidently report the presence and direction of the association between exposure and disease regardless of which variable selection method is used. On the other hand, when all possible effect estimates vary substantially, indicating the presence of confounding, a change-in-estimate plot and its corresponding table are helpful for identifying important confounders. Both commands can be used after most commonly used estimation commands for epidemiological data.
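The basic change-in-estimate idea can be tried by hand before turning to dedicated commands. The sketch below does not use confall or confnd; it compares a crude and an adjusted odds ratio with built-in commands, assuming the lbw example dataset is available through webuse (exposure smoke, candidate confounder age).

```stata
* Sketch only: a manual change-in-estimate check, not confall or confnd.
* Assumes the lbw example dataset can be downloaded with webuse.
webuse lbw, clear

quietly logit low smoke
scalar or_crude = exp(_b[smoke])

quietly logit low smoke age
scalar or_adj = exp(_b[smoke])

scalar pct_change = 100 * (or_adj - or_crude) / or_crude

display "Crude OR:    " %6.2f or_crude
display "Adjusted OR: " %6.2f or_adj
display "Change (%):  " %6.1f pct_change
```

Repeating this over every subset of candidate confounders is what makes the all-possible-models approach computing-intensive.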
MRC Biostatistics Unit, Cambridge
In teaching logistic regression for case–control studies, I ask master’s students in epidemiology to assess an interaction between a 2-level exposure and a 4-level exposure using a likelihood-ratio test. Theory suggests that the test statistic has 3 degrees of freedom, but Stata uses 2 degrees of freedom. The explanation turns out to be that one exposure combination contains controls but no cases, so that one parameter goes to infinity. It is hard to convince the students (and myself) that this combination contributes no degrees of freedom.
I will review how Stata handles situations in which parameters go to infinity. Although asymptotics for likelihood-ratio tests may not work well in this situation, I will argue that lrtest should be modified to reflect the true number of degrees of freedom.
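The arithmetic behind the two competing answers is brief. The notation is generic and assumed for this sketch:

```latex
% Likelihood-ratio statistic comparing models with and without the interaction
\[
LR = 2\left\{\ell(\hat{\theta}_{\text{interaction}}) - \ell(\hat{\theta}_{\text{main effects}})\right\}
\]
% Nominal degrees of freedom for a 2-level by 4-level interaction
\[
\text{df} = (2 - 1)(4 - 1) = 3
\]
% With one interaction parameter diverging, only two are estimated,
% which corresponds to the 2 degrees of freedom reported by Stata.
```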
MRC Clinical Trials Unit, London
Most survival data are analyzed by using the Cox proportional hazards model (in Stata: the stcox command). Almost by definition, a proportion of the observations will be right-censored. Analysis of covariate effects in the Cox model is couched in terms of (log) hazard ratios, and the distribution of time itself is essentially ignored. This practice is totally different from standard analysis of a continuous outcome variable, where multiple (linear) regression is the technique most often used. Hazard ratios are difficult to interpret and give little insight into how a covariate affects the time to an event. Furthermore, the assumption of proportional hazards is strong, and when there is long-term follow-up, is often breached. I will illustrate how the censored lognormal model can be used to good effect to remedy some of these deficiencies and give better insight into the data. Multiple imputation of the censored observations may be followed by use of familiar exploratory graphical tools, such as dotplots, scatterplots, and scatterplot smoothers. Analyses using standard linear regression methods may be done on the log time scale, leading to simple interpretations and informative graphs of effect size. I will explore these ideas in the context of a familiar breast cancer dataset and will show how a treatment/covariate interaction is easily conveyed graphically.
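A censored lognormal model can be fitted with the built-in streg command. The sketch below assumes the brcancer example dataset (variables rectime, censrec, hormon) is available through webuse; it is not necessarily the dataset used in the talk, and it does not cover the multiple-imputation step.

```stata
* Sketch only: assumes the brcancer example dataset and its variable names.
webuse brcancer, clear

* Declare survival time and the event indicator
stset rectime, failure(censrec)

* Lognormal accelerated failure-time model for the treatment effect
streg hormon, distribution(lognormal)
```

Coefficients are on the log-time scale, so exponentiating a coefficient gives a ratio of survival times rather than a hazard ratio.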
Maarten L. Buis
Department of Social Research Methodology, Vrije Universiteit Amsterdam
When dealing with response variables that are proportions, people often use regress. This approach can be problematic since the model can lead to predicted proportions less than zero or more than one and errors that are likely to be heteroskedastic and nonnormally distributed. This talk will discuss three more appropriate methods for proportions as response variables: betafit, dirifit, and glm.
betafit is a maximum likelihood estimator using a beta likelihood, dirifit is a maximum likelihood estimator using a Dirichlet likelihood, and glm can be used to create a quasi–maximum likelihood estimator using a binomial likelihood. On an applied level, a difference between dirifit and the others is that the others can handle only one response variable, whereas dirifit can handle multiple response variables. For instance, betafit and glm can model the proportion of city budget spent on the category security (police and fire department), whereas dirifit can simultaneously model the proportions spent on categories security, social policy, infrastructure, and other. Another difference between betafit and glm is that glm can handle a proportion of exactly zero and one, whereas betafit can handle only proportions between zero and one.
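The glm approach needs no user-written commands, so it is easy to sketch. The data below are simulated purely for illustration; betafit and dirifit are user-written and are not shown.

```stata
* Sketch only: simulated data; variable names are illustrative.
clear
set seed 20060920
set obs 200
generate x = runiform()
generate share = invlogit(-1 + 2*x + rnormal(0, 0.5))

* Quasi-maximum likelihood: binomial family, logit link, robust standard errors
glm share x, family(binomial) link(logit) vce(robust)

* Predicted proportions stay inside the unit interval
predict share_hat, mu
summarize share share_hat
```

Stata notes that the response has noninteger values; the robust standard errors are what make the quasi-likelihood interpretation appropriate.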
Special attention will be given to how to fit these models in Stata and how to interpret the results. This presentation will end with a warning not to use any of these techniques for ecological inference, i.e., using aggregated data to infer about individual units. To use a classic example: In the United States in the 1930s, states with a high proportion of immigrants also had a high literacy rate (in the English language), whereas immigrants were on average less literate than nonimmigrants. Regressing state-level literacy rate on state-level proportion of immigrants would thus give a completely wrong picture of the relationship between individual immigrant status and literacy.
David M. Drukker
StataCorp, College Station, TX
After presenting a general introduction to the Mata matrix programming language, this talk discusses Mata’s many simple links to the Stata dataset and other important objects in Stata’s memory. An application to maximum simulated likelihood illustrates the programming techniques.
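As a small illustration of the links between Mata and Stata's memory, the sketch below copies variables into Mata with st_data(), computes OLS coefficients, and posts them back as a Stata matrix with st_matrix(). It uses the auto example dataset and is unrelated to the maximum simulated likelihood application in the talk.

```stata
* Sketch only: OLS by hand to show st_data() and st_matrix().
sysuse auto, clear

mata:
X = st_data(., ("mpg", "weight"))   // copy two variables into a matrix
y = st_data(., "price")
X = X, J(rows(X), 1, 1)             // append a constant column
b = invsym(cross(X, X)) * cross(X, y)
st_matrix("b_ols", b')              // post the result as a Stata matrix
end

matrix list b_ols
regress price mpg weight            // should agree with b_ols
```

st_data() makes a copy of the data; st_view() creates a view onto the dataset instead and avoids the copy when the dataset is large.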
Department of Economics, Boston College
I will describe several time-series filtering techniques, including the Hodrick–Prescott, Baxter–King, and bandpass filters and variants, and present new Mata-coded versions of these routines, which are considerably more efficient than previous ado-code routines. Applications to several economic and financial time series will be discussed.
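To give a flavor of a Mata-coded filter, the sketch below computes a Hodrick–Prescott trend by solving the linear system (I + λK'K)τ = y, where K is the second-difference matrix. It is a dense-matrix illustration on simulated data, not the routine presented in the talk; the function name hp_trend and the variable names are hypothetical.

```stata
* Sketch only: dense-matrix HP trend; names and data are illustrative.
clear
set obs 120
set seed 1
generate t = _n
generate y = 0.05*t + sin(2*_pi*t/12) + rnormal()

mata:
real colvector hp_trend(real colvector y, real scalar lambda)
{
    real scalar T, i
    real matrix K

    T = rows(y)
    K = J(T-2, T, 0)                // second-difference operator
    for (i=1; i<=T-2; i++) {
        K[i, i]   = 1
        K[i, i+1] = -2
        K[i, i+2] = 1
    }
    // The trend minimizes sum (y - tau)^2 + lambda * sum (second diff of tau)^2
    return(lusolve(I(T) + lambda*(K'*K), y))
}

y = st_data(., "y")
st_store(., st_addvar("double", "y_trend"), hp_trend(y, 1600))
end

twoway line y y_trend t
```

The value 1600 is the smoothing parameter conventionally used for quarterly data and is used here only as an example. For long series a banded solver would be preferable to the dense matrix built above.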
Nicholas J. Cox, Durham University
Patrick Royston, MRC Clinical Trials Unit
Timberlake Consultants, the official distributor of Stata in the United Kingdom, Ireland, Spain, and Portugal.