Last updated: 25 September 2013
Centre for Econometric Analysis
Cass Business School
106 Bunhill Row
London EC1 8TZ
Factor variables are defined as categorical variables with integer values, which may represent values of some other kind, specified by a value label. We frequently want to generate such variables in Stata datasets, especially resultssets, which are output Stata datasets produced by Stata programs such as the official Stata statsby command and the SSC packages parmest and xcontract. This is because categorical string variables can only be plotted after conversion to numeric variables and because these numeric variables are also frequently used in defining a key of variables, which identify observations in the resultsset uniquely in a sensible sort order. The sencode package is downloadable, and frequently downloaded, from SSC and is a “super” version of encode, which inputs a string variable and outputs a numeric factor variable. Its added features include a replace option allowing the output numeric variable to replace the input string variable, a gsort() option allowing the numeric values to be ordered in ways other than the alphabetical order of the input string values, and a manyto1 option allowing multiple output numeric values to map to the same input string value. The sencode package is well established and has existed since 2001. However, some tips will be given on ways of using it that are not immediately obvious but which the author has found very useful over the years when mass-producing resultssets. These applications use sencode with other commands, such as the official Stata command split and the SSC packages factmerg, factext, and fvregen.
Reweighting is a popular statistical technique to deal with inference in presence of a nonrandom sample. In the literature, various reweighting estimators have been proposed. This paper presents the author-written Stata routine treatrew, which implements the reweighting on the propensity-score estimator as proposed by Rosenbaum and Rubin (1983) in their seminal article, where they show that parameters’ standard errors can be obtained analytically (Wooldridge 2010, 920–930) or via bootstrapping. Because an implementation in Stata of this estimator with analytic standard errors was still missing, this paper, and the ado-file and help-file accompanying it, aims at filling this gap by providing an easy-to-use implementation of the reweighting on the propensity-score method, as a valuable tool for estimating treatment-effects under “selection-on-observables” (or “overt bias”). Finally, a Monte Carlo experiment to check the reliability of treatrew and to compare its results with other treatment effect estimators will also be provided.
Multiple imputation (MI) is a popular approach to handling missing data, and an extensive range of MI commands is now available in official Stata. A common problem is that of missing values in covariates of regression models. When the substantive model for the outcome contains nonlinear covariate effects or interactions, correctly specifying an imputation model for covariates becomes problematic. We present simulation results illustrating the biases that can occur when standard imputation models are used to impute covariates in linear regression models with a quadratic effect or interaction effect. We then describe a modification of the full conditional specification (FCS) or chained equations approach to MI, which ensures that covariates are imputed from a model which is compatible with a user-specified substantive model. We present the smcfcs Stata command, which implements substantive model compatible FCS and illustrate its application to a dataset.
Social scientists are increasingly fitting multilevel models to datasets in which a large number of individuals (N ~ several thousands) are nested within each of a small number of countries (C ~ 25). The researchers are particularly interested in “country effects”, as summarized by either the coefficients on country-level predictors (or cross-level interactions) or the variance of the country-level random effects. Although questions have been raised about the potentially poor performance of estimators of these “country effects” when C is “small”, this issue appears not to be widely appreciated by many social scientist researchers. Using Monte Carlo analysis, I examine the performance of two estimators of a binary-dependent two-level model using a design in which C = 5(5)50 100 and N = 1000 for each country. The results point to i) the superior performance of adaptive quadrature estimators compared with PQL2 estimators, and ii) poor coverage of estimates of “country effects” in models in which C ~ 25, regardless of estimator. The analysis makes extensive use of xtmelogit and simulate and user-written commands such as runmlwin, parmby, and eclplot. Issues associated with having extremely long runtimes are also discussed.
Multilevel mixed-effects survival models are used in the analysis of clustered survival data, such as repeated events, multicenter clinical trials, or individual patient data meta-analyses, to investigate heterogeneity in baseline risk and treatment effects. I present the stmixed command for the parametric analysis of clustered survival data with two levels. Mixed-effects parametric survival models available include the exponential, Weibull and Gompertz proportional-hazards models, the Royston–Parmar flexible-parametric model, and the log–logistic, log–normal, and generalized gamma-accelerated failure-time models. Estimation is conducted using maximum likelihood, with both adaptive and nonadaptive Gauss–Hermite quadrature available. I will illustrate the command through simulation and application to clinical datasets.
Electronic health records are increasingly used for epidemiological and health service research. However, missing data are often an issue when dealing with electronic records. Up to now, various approaches have been used to overcome these issues, including complete case analysis, last observation carried forward, and multiple imputation. In this presentation, we will first highlight the issues of missing data in longitudinal records and provide examples of the limitations of standard methods of multiple imputation. We will then demonstrate the new twofold user-written Stata command that implements the twofold fully conditional specification (FCS) multiple-imputation algorithm in Stata (Nevalainen, Kenward, and Virtanen, 2009. Stat Med. 28: 3657–3669.)
In the application of the twofold FCS algorithm, we divide time into equal size time blocks. The algorithm then imputes missing values in the longitudinal data, imputing one time block, and then the next. The defining characteristic is that when one imputes missing values at a particular time block, only measurements at that time block and adjacent time blocks are used. This obviates some of the principal difficulties that are typically encountered when attempting to apply a standard MI approach to imputing such longitudinal data.
We illustrate how the twofold FCS MI algorithm works in practice and maximizes the use of data available, even in situations where measurements are only made on a relatively small proportion of individuals in each time block. We discuss some of the strengths and limitations of the twofold FCS MI algorithm and contrast it with existing approaches to imputing longitudinal data. Lastly, we present results demonstrating the potential for gains in efficiency through use of the twofold approach compared with a more conventional “baseline MI” approach.
Residual confounding is a major problem in analysis of observational data, occurring when a confounding variable is measured coarsely (censored, heaped, missing, etc.) and hence cannot be fully adjusted to obtain a causal estimate by usual means such as multiple regression. The analysis of coarse data has been investigated by Heitjan and Rubin, but methods for coarse covariates are lacking.
A fully conditional-specification multiple-imputation approach is possible if we are able to model i) the confounding variable conditional on other information in the dataset and ii) the coarsening mechanism. This provides a very flexible framework for removing residual confounding under our assumptions, including sensitivity analysis. An added complexity over missing data is that it may not be known which observations are coarsened.
Programming this method is presented in Stata for various combinations of i) and ii) above, using the ml and mi functions. In the simplest case of a normally distributed confounder subject to known interval-censoring, intreg and mi can be applied. The method is illustrated with simulated data and the true causal effect is recovered in each instance.
In medical research, particularly in the field of cancer, it is often important to evaluate the impact of treatments and other factors on a composite outcome based on survival and quality-of-life data, such as a Quality Adjusted Life Year (QALY). We present a Stata program, stiqsp, which determines the mean QALY using the Integrated Quality Survival Product. In this nonparametric approach, the survival function is estimated using the Kaplan–Meier method and the quality-of-life function is derived from the mean quality-of-life score at the unique death times. Confidence intervals for the QALY score are determined using the bootstrap method. We illustrate the features of the command with a large dataset of patients with lung cancer.
Econometricians have begun to devote more attention to spatial interactions when carrying out applied econometric studies. In part, this is motivated by an explicit focus on spatial interactions in policy formulation or market behavior, but it may also reflect concern about the role of omitted variables that are or may be spatially correlated.
The Stata user-written procedure xsmle has been designed to estimate a wide range of spatial panel models, including spatial autocorrelation, spatial Durbin, and spatial error models using maximum likelihood methods. It relies upon the availability of balanced panel data with no missing observations. This requirement is stringent, but it arises from the fact that in principle, the values of the dependent variable for any panel unit may depend upon the values of the dependent and independent variables for all the other panel units. Thus even a single missing data point may require that all data for a time period, panel unit, or variable be discarded.
The presence of missing data is an endemic problem for many types of applied work, often because of the creation or disappearance of panel units. At the macro level, the number and composition of countries in Europe or local government units in the United Kingdom has changed substantially over the last three decades. In longitudinal household surveys, new households are created and old ones disappear all the time. Restricting the analysis to a subset of panel units that have remained stable over time is a form of sample selection whose consequences are uncertain and that may have statistical implications that merit additional investigation.
The simplest mechanisms by which missing data may arise underpin the missing-at-random (MAR) assumption. When this is appropriate, it is possible to use two approaches to estimation with missing data. The first is either simple or, preferably, multiple imputation, which involves the replacement of missing data by stochastic imputed values. The Stata procedure mi can be combined with xsmle to implement a variety of estimates that rely upon multiple imputation. While the combination of procedures is relatively simple to estimate, practical experience suggests that the results can be quite sensitive to the specification that is adopted for the imputation phase of the analysis. Hence, this is not a one-size-fits-all method of dealing with unbalanced panels, because the analyst must give serious consideration to the way in which imputed values are generated.
The second approach has been developed by Pfaffermayr. It relies upon the spatial interactions in the model, which means that the influence of the missing observations can be inferred from the values taken by nonmissing observations. In effect, the missing observations are treated as latent variables whose distribution can be derived from the values of the nonmissing data. This leads to a likelihood function that can be partitioned between missing and nonmissing data and thus used to estimate the coefficients of the full model. The merit of the approach is that it takes explicit account of the spatial structure of the model. However, the procedure becomes computationally demanding if the proportion of missing observations is too large and, as one would expect, the information provided by the spatial interactions is not sufficient to generate well-defined estimates of the structural coefficients. The missing-at-random assumption is crucial for both of these approaches, but it is not reasonable to rely upon it when dealing with the birth or death of distinct panel units.
A third approach, which is based on methods used in the literature on statistical signal processing, relies upon reducing the spatial interactions to immediate neighbors. Intuitively, the basic unit for the analysis becomes a block consisting of a central unit (the dependent variable) and its neighbors (the spatial interactions). Because spatial interactions are restricted to within-block effects, the population of blocks can vary over time and standard nonspatial panel methods can be applied.
The presentation will describe and compare the three approaches to estimating spatial panel models as implemented in Stata as extensions to xsmle. It will be illustrated by analyses of i) state data on electricity consumption in the U.S. and ii) gridded historical data on temperature and precipitation to identify the effects of El Niño (ENSO) and other major weather oscillations.
This presentation illustrates Stata’s implementation of the repeated half-sample bootstrap proposed by Saigo et al. (2001, Survey Methodology). This resampling scheme is easy to implement and is appropriate for complex survey designs, even with small stratum sizes. The user-written command rhsbsample mimicks the official bsample command and can be used for bootstrap inference in a wide range of settings.
The simsam package, released earlier this year, allows versatile sample-size calculation (calculating sample size required to achieve given statistical power) using simulation for any method of analysis under any statistical model that can be programmed in Stata. Simulation is particularly helpful in situations where formulae for sample size are approximate or unavailable. The usefulness of the simsam package will depend in part, however, on the quality and variety of simsam applications developed by Stata users.
I will discuss simsam applications for clinical trials with time-to-event or survival outcomes. Here sample size formulas are the subject of ongoing research (available methods for cluster-randomized trials with variable cluster size, for example, are only approximate). Using such examples, I will illustrate general simsam programming considerations such as dealing with analyses that fail, and the advantages of modular programming. Simulation forces us to think carefully about trial definitions—for example, whether recruitment ends after a fixed time or a fixed number of recruits—and to relate the sample-size calculation to a detailed or contingent analysis plan. Such careful attention to detail may be lost in a formulaic approach to sample-size calculation. Simulation also allows us to check for bias in a test as well as power. But simulation is not without problems: while a rare (say, one in a million) failure of an analysis in Stata may not worry the pragmatic statistician, a simsam application must anticipate this failure or risk stumbling on it in the course of many simulations.
Many, perhaps most, useful graphs compare two or more sets of values. Examples are two or more groups or variables (as distributions, time series, etc.) or observed and fitted values for one or more model fits. Often there can be a fine line in such comparisons between richly detailed graphics and busy, unintelligible graphics that lead nowhere. In this presentation, I survey strategy and tactics for developing good graphic multiples in Stata.
Details include the use of over() and by() options and graph combine; the relative merits of super(im)posing and juxtaposing; backdrops of context; killing the key or losing the legend if you can; transforming scales for easier comparison; annotations and self-explanatory markers; linear reference patterns; plotting both data and summaries; plotting different versions or reductions of the data.
Datasets visited or revisited include James Short’s collation of observations from the transit of Venus; John Snow’s data on mortality in relation to water supply in London; Florence Nightingale’s data on deaths in the Crimea; deaths from the Titanic sinking; admissions to Berkeley; hostility in response to insult; and advances and retreats of East Antarctic ice sheet glaciers.
Specific programs discussed include graph dot; graph bar; sparkline (SSC); qplot (SJ) and its relatives; devnplot (SSC); stripplot (SSC); and tabplot (SJ/SSC).
Testing for the presence of autocorrelation in a time series, either in the univariate setting or with the residuals from the estimation of some model, is one of the most common tasks researchers face in the time-series setting. The standard Q test statistic is that introduced by Box and Pierce (1970), subsequently refined by Ljung and Box (1978). The original L-B-P test is applicable to univariate time series and to testing for residual autocorrelation under the assumption of strict exogeneity. Breusch (1978) and Godfrey (1978) in effect extended the L-B-P approach to testing for autocorrelations in residuals in models with weakly exogenous regressors. Both the L-B-P test and the Breusch–Godfrey test are available in Stata, the former for univariate time series via the wntestq command and the latter for postestimation testing following OLS via the estat bgodfrey command.
All the above tests have important limitations: i) the tests are for autocorrelation up to order p, where under the null hypothesis the series or residuals are i.i.d.; ii) when applied to residuals from single-equation estimation, the regressors must all be at least weakly exogenous; iii) the tests are for single-equation models and do not cover panel data.
We use the results of Cumby and Huizinga (1992) to extend the implementation of the Q test statistic of L-B-P-B-G to cover a much wider range of hypotheses and settings: i) tests for the presence of autocorrelation of order p through q, where under the null hypothesis, there may be autocorrelation of order p-1 or less; ii) tests following estimation in which regressors are endogenous and estimation is by IV or GMM methods; iii) tests following estimation using panel data. For iii), we show that the Cumby–Huizinga test, developed for the large-T setting, is, when applied to the large-N panel-data setting and limited to testing for second-order serial correlation, formally identical to the test presented by Arellano and Bond (1991) and available in Stata via Roodman’s abar command.
Semiparametric regression deals with the introduction of some very general nonlinear functional forms in regression analyses. This class of regression models is generally used to fit a parametric model in which the functional form of a subset of the explanatory variables is not known and/or in which the distribution of the error term cannot be assumed of being of a specific type beforehand. To fix ideas, consider the partial linear model y = zb + f(x) + e, in which the shape of the potentially nonlinear function of predictor x is of particular interest. Two approaches to modeling f(x) are to use splines or fractional polynomials. This talk reviews other more general approaches, and the commands available in Stata to fit such models.
The main topic of the talk will be partial linear regression models, with some brief discussion also of so-called single index and generalized additive models. Though several semiparametric regression methods have been proposed and developed in the literature, these are probably the most popular ones.
The general idea of partial linear regression models is that a dependent variable is regressed on i) a set of explanatory variables entering the model linearly and ii) a set of variables entering the model nonlinearly but without assuming any specific functional form. Several estimators have been proposed in the literature and are available in Stata. For example, the semipar command makes available what is called the double residuals estimator introduced by Robinson (1988), which is consistent and efficient. Similarly, the plreg command fits an alternative difference-based estimator proposed by Yatchew (1998) that has similar statistical properties to Robinson’s estimator. These estimators will be briefly compared to identify some drawbacks and pitfalls of both methods.
A natural concern of researchers is how these estimators could be modified to deal with heteroskedasticity, serial correlation, and endogeneity in cross-sectional data or how they could be adapted in the context of panel data to control for unobserved heterogeneity. As a consequence, a substantial part of the talk will be devoted to explaining i) how the plreg and semipar commands can be used to tackle these very common violations of the Gauss–Markov assumptions in cross-sectional data and ii) how the user-written xtsemipar command makes a semiparametric regression easy to fit in the context of panel data.
Because it is sometimes possible to move toward pure parametric models, a test proposed by Hardle and Mammen (1993) and built to check whether the nonparametric fit can be satisfactorily approximated by a parametric polynomial adjustment of order p will be described.
Stata 13’s new power command performs power and sample-size analysis. The power command expands the statistical methods that were previously available in Stata’s sampsi command. I will demonstrate the power command and its additional features, including the support of multiple study scenarios and automatic and customizable tables and graphs. I will also present new functionality allowing users to add their own methods to the power command.
Within drug development, it is crucial to find the right dose that is going to be safe and efficacious; this is often done within early phase II clinical trials. The aim of the dose-finding trial is to understand the relationship between the dose of drug and the potential effect of the drug. Increasingly, adaptive designs are being used in this area because they allow greater flexibility for dose exploration in comparison with traditional fixed-dose designs.
An adaptive dose-finding design usually assumes a true nonlinear dose–response model and select doses that either maximize the determinant of information matrix of the design (D-optimality) or minimize the variance of the predicted dose that gives a targeted response. Our design extends the predicted dose methodology, in a limited number of patients (40), to finding two targeted doses: a minimally effective dose and a therapeutic dose. In our trial, doses are given intravenously, so theoretically, doses are continuous and the response is assumed to be a normally distributed continuous outcome.
Our design has an initial learning phase where pairs of patients are assigned to five preassigned doses. The next phase is fully sequential with an interim analysis after each patient to determine the choice of dose based on the optimality criterion to minimize the determinant of the covariance of the estimated target doses. The dose–choice algorithm assumes that a specific parametric dose–response model is the true relationship, and so a variety of models are considered at the interim, and human judgment involved in the overall decision.
I will introduce a Mata command that uses the optimize function to find the estimated parameters of the model and to subsequently find the optimal design. Simulated results show that assuming a model with a small number of parameters (=3) leads to a choice of doses that are not near to the target doses and overrely on interpolation under the modeling assumptions. Fitting models with greater flexibility with more parameters (=4) results in a choice of doses near to the two target doses. Overall, the design is efficient and seamlessly combines the initial learning and subsequent confirmatory stages.
Hierarchical datasets are commonly a product of the popular CSPro system developed by the U.S. Census Bureau. CSPro became widely popular and a de facto standard for data collection in many countries; some agencies supply data exclusively in CSPro format. While CSPro can export the data into Stata format on its own, the procedure compromises on some features and requires the user to run CSPro, which operates in MS Windows only. The new Stata module usecspro allows easy import of the hierarchical datasets into Stata by automatically parsing the CSPro dictionary files. The conversion is implemented in Mata and allows importing data from any level and any record of the hierarchical CSPro dataset. It also preserves the variable and value labels and takes care of the missing values and other common concerns during data conversion.
This presentation explains how to exploit Stata to run multilevel multiprocess regressions with aML (software downloadable for free from applied-ml.com). I show how a single do-file can prepare the dataset, write the control files, input the starting values, and run the regressions without the need to manually open the aML’s Command Prompt window. In this sense, Stata helps to avoid the difficulties of running complicated regressions with aML by automatically generating the necessary files, which avoids typos and easily allows changes in model specification. The paper contains an example of how well Stata and aML work together.
Stata has a wide range of tools for performing meta-analysis, but presently not individual participant data (IPD) meta-analysis, in which the analysis units are within-study observations (for example, patients) rather than aggregate study results.
I present ipdmetan, a command that facilitates two-stage IPD meta-analysis by fitting a specified model to the data of each study in turn and storing the results in a matrix. Features include subgroups, inclusion of aggregate (for example, published) data, iterative estimates and confidence limits for the tau-squared measure of heterogeneity, and the analysis of treatment-covariate interactions. This last is a great benefit of IPD collection and is a subject on which my colleagues and I have published previously (Fisher et al., 2011, Journal of Clinical Epidemiology 64: 949–967). I shall discuss how ipdmetan facilitates our recommended approach and its strengths and weaknesses, in particular one-stage versus two-stage modeling and within- and between-trial information.
In addition, the graphics subroutine written for the metan package (Harris et al., 2008, Stata Journal 8: 3–28) has been greatly expanded to enable flexible, generalized forest plots for a variety of settings. I shall demonstrate some of the possibilities and encourage feedback on how this may be developed further.
Examples will be given using real-world IPD meta-analyses of survival data in cancer, although the programs are applicable generally.
Network meta-analysis involves synthesising the scientific literature comparing several treatments. Typically, two-arm and three-arm randomized trials are synthesized, and the aim is to compare treatments that have not been directly compared and often to rank the treatments. A difficulty is that the network may be inconsistent, and ways to assess this are required.
In the past, network meta-analysis models have been fitted using Bayesian methods, typically in WinBUGS. I have recently shown how they may be expressed as multivariate meta-analysis models and hence fitted using mvmeta. However, various challenges remain, including getting the dataset in the correct format, parameterizing the inconsistency model, and making good graphical displays of complex data and results. I will show how a new suite of Stata programs, network, meets these challenges, and I will illustrate its use with examples.
Math, when written carefully, can represent a model perfectly. Stata syntax, when written correctly, can represent a model perfectly. Everyone is supposed to understand math, but I have encountered those who do not. I personally think everyone should understand Stata syntax, but again, I have found a few who do not. Path diagrams provide an alternative formalism that is easily accessible to a wide audience. With a few conventions, we can use path diagrams to represent a variety of models and perhaps help explain those models. These diagrams are easy enough to create in Stata, so we will look at some models—both simple and difficult—to see what we think of this emerging formalism.
Competing risks occur in survival analysis when a subject is at risk of more than one type of event. A classic example is when there is consideration of different causes of death. Interest may lie in the cause-specific hazard rates, which can be estimated using standard survival techniques by censoring competing events. An alternative measure is the cumulative incidence function (CIF) which gives an estimate of absolute or crude risk of death accounting for the possibility that individuals may die of other causes. Geskus (2011 Biometrics, 67: 39–49) has recently proposed an alternative way for the estimation and modeling of the CIF that can use weighted versions of standard estimators.
I will describe a Stata command, stcrprep, that restructures survival data and calculates weights based on the censoring distribution. The command is based on the R command crprep, but I will describe a number of extensions that enable the CIF to be modeled directly using parametric models on large datasets. After restructuring the data and incorporating the weights, one can use sts graph to plot the CIF and stcox can be used to fit a Fine and Gray model for the CIF. An advantage of fitting models in this way is that it is possible to use a number of the standard features of the Cox model, for example, using Schoenfeld residuals to visualize and test the proportional subhazards assumption.
I will also describe some additional options that are useful for fitting parametric models and useful for large datasets. In particular, I will describe how the flexible parametric survival models estimated with stpm2 can be used to directly model the cumulative-incidence function. An important advantage is that all the predictions built into stpm2 can be used to directly predict the CIF, subdistribution hazards, etc.
The “workhorse” model for analysing discrete choice data, the conditional logit model, can be implemented in Stata using the official clogit and asclogit commands. While widely used, this model has several well-known limitations that have led researchers in various disciplines to consider more flexible alternatives. The mixed logit model extends the standard conditional logit model by allowing one or more of the parameters in the model to be randomly distributed. When one models the choices of individuals (as is common in several disciplines, including economics, marketing, and transport), this allows for preference of heterogeneity among respondents. Other advantages of the mixed logit model include the ability to allow for correlations across observations in cases where an individual made more than one choice, and relaxing the restrictive independence from the irrelevant alternatives property of the conditional logit model.
There are a range of commands that can be used to estimate mixed logit models in Stata. With the exception of xtmelogit, the official Stata command for estimating binary mixed logit models, all of them are userwritten. The module that is probably best known is gllamm, but while very flexible, it can be slow when the model includes several random parameters. This talk will focus on alternative commands for estimating logit models, with focus on the mixlogit module. We will also look at alternatives and extensions to mixlogit, including the recent lclogit, bayesmlogit, and gmnl commands. The talk will review the theory behind the methods implemented by these commands and present examples of their use.
Stephen P. Jenkins, London School of Economics and Political Science
Roger B. Newson, Imperial College London
Timberlake Consultants, the official distributor of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.