## 2013 UK Stata Users Group meeting: Abstracts

### Creating factor variables in resultssets and other datasets

Factor variables are defined as categorical variables with integer values,
which may represent values of some other kind, specified by a value label.
We frequently want to generate such variables in Stata datasets, especially
resultssets, which are output Stata datasets produced by Stata programs
such as the official Stata **statsby** command and the SSC packages
**parmest** and **xcontract**.
This is because categorical string variables can only be plotted
after conversion to numeric variables and because these numeric variables
are also frequently used in defining a key of variables, which identify
observations in the resultsset uniquely in a sensible sort order. The
**sencode** package is downloadable, and frequently downloaded, from SSC and is
a “super” version of **encode**, which inputs a string
variable and outputs a numeric factor variable. Its added features include a
replace option allowing the output numeric variable to replace the input
string variable, a **gsort()** option allowing the numeric values to be ordered
in ways other than the alphabetical order of the input string values, and a
**manyto1** option allowing multiple output numeric values to map to the same
input string value. The **sencode** package is well established and has
existed since 2001. However, some tips will be given on ways of using it
that are not immediately obvious but which the author has found very useful
over the years when mass-producing resultssets. These applications use
**sencode** with other commands, such as the official Stata command **split**
and the SSC packages **factmerg**, **factext**, and **fvregen**.

**Additional materials:**

uk13_newson.pdf

uk13_newson_examples1.do

### treatrew: A user-written Stata routine for estimating average treatment effects by reweighting on propensity score

Reweighting is a popular statistical technique to deal with inference in
presence of a nonrandom sample. In the literature, various reweighting
estimators have been proposed. This paper presents the author-written Stata
routine **treatrew**, which implements the reweighting on the propensity-score estimator
as proposed by Rosenbaum and Rubin (1983) in their seminal article, where
they show that parameters’ standard errors can be obtained analytically
(Wooldridge 2010, 920–930) or via bootstrapping. Because an implementation
in Stata of this estimator with analytic standard errors was still
missing, this paper, and the ado-file and help-file accompanying it, aims at
filling this gap by providing an easy-to-use
implementation of the reweighting on the propensity-score method, as a valuable
tool for estimating treatment-effects under “selection-on-observables” (or
“overt bias”). Finally, a Monte Carlo experiment to check the reliability
of **treatrew** and to compare its results with other treatment effect
estimators will also be provided.

**Additional materials:**

uk13_cerulli.pdf

### Multiple imputation of covariates in the presence of interactions and nonlinearities

Multiple imputation (MI) is a popular approach to handling missing data, and
an extensive range of MI commands is now available in official Stata. A
common problem is that of missing values in covariates of regression models.
When the substantive model for the outcome contains nonlinear covariate
effects or interactions, correctly specifying an imputation model for
covariates becomes problematic. We present simulation results illustrating
the biases that can occur when standard imputation models are used to
impute covariates in linear regression models with a quadratic effect or
interaction effect. We then describe a modification of the full conditional
specification (FCS) or chained equations approach to MI, which ensures that
covariates are imputed from a model which is compatible with a
user-specified substantive model. We present the **smcfcs** Stata command, which
implements substantive model compatible FCS and illustrate its application
to a dataset.

**Additional materials:**

uk13_bartlett.pdf

### A Monte Carlo analysis of multilevel binary logit model estimator performance

Social scientists are increasingly fitting multilevel models to datasets in
which a large number of individuals (N ~ several thousands) are nested within
each of a small number of countries (C ~ 25). The researchers are
particularly interested in “country effects”, as summarized
by either the coefficients on country-level predictors (or cross-level
interactions) or the variance of the country-level random effects. Although
questions have been raised about the potentially poor performance of
estimators of these “country effects” when C is “small”,
this issue appears not to be widely appreciated by many social scientist
researchers. Using Monte Carlo analysis, I examine the performance of two
estimators of a binary-dependent two-level model using a design in which C =
5(5)50 100 and N = 1000 for each country. The results point to i) the
superior performance of adaptive quadrature estimators compared with PQL2
estimators, and ii) poor coverage of estimates of “country effects” in
models in which C ~ 25, regardless of estimator. The analysis makes
extensive use of **xtmelogit** and **simulate** and user-written
commands such as **runmlwin**, **parmby**, and **eclplot**.
Issues associated with having extremely long runtimes are also discussed.

**Additional materials:**

uk13_jenkins.pdf

### Multilevel mixed-effects parametric survival analysis

Multilevel mixed-effects survival models are used in the analysis of
clustered survival data, such as repeated events, multicenter clinical
trials, or individual patient data meta-analyses, to investigate
heterogeneity in baseline risk and treatment effects. I present the
**stmixed** command for the parametric analysis of clustered survival data
with two levels. Mixed-effects parametric survival models available include
the exponential, Weibull and Gompertz proportional-hazards models, the
Royston–Parmar flexible-parametric model, and the log–logistic,
log–normal, and generalized gamma-accelerated failure-time models.
Estimation is conducted using maximum likelihood, with both adaptive and
nonadaptive Gauss–Hermite quadrature available. I will illustrate the
command through simulation and application to clinical datasets.

**Additional materials:**

uk13_crowther.pdf

### Multiple imputation of missing data in longitudinal health records

Electronic health records are increasingly used for epidemiological and
health service research. However, missing data are often an issue when
dealing with electronic records. Up to now, various approaches have been used
to overcome these issues, including complete case analysis, last observation
carried forward, and multiple imputation. In this presentation, we will first
highlight the issues of missing data in longitudinal records and provide
examples of the limitations of standard methods of multiple imputation. We
will then demonstrate the new twofold user-written Stata command that
implements the twofold fully conditional specification (FCS) multiple-imputation
algorithm in Stata (Nevalainen, Kenward, and Virtanen, 2009.
*Stat Med.* 28: 3657–3669.)

In the application of the twofold FCS algorithm, we divide time into equal size time blocks. The algorithm then imputes missing values in the longitudinal data, imputing one time block, and then the next. The defining characteristic is that when one imputes missing values at a particular time block, only measurements at that time block and adjacent time blocks are used. This obviates some of the principal difficulties that are typically encountered when attempting to apply a standard MI approach to imputing such longitudinal data.

We illustrate how the twofold FCS MI algorithm works in practice and maximizes the use of data available, even in situations where measurements are only made on a relatively small proportion of individuals in each time block. We discuss some of the strengths and limitations of the twofold FCS MI algorithm and contrast it with existing approaches to imputing longitudinal data. Lastly, we present results demonstrating the potential for gains in efficiency through use of the twofold approach compared with a more conventional “baseline MI” approach.

**Additional materials:**

uk13_welch.pptx

### A multiple imputation and coarse data approach to residually confounded regression models

Residual confounding is a major problem in analysis of observational data, occurring when a confounding variable is measured coarsely (censored, heaped, missing, etc.) and hence cannot be fully adjusted to obtain a causal estimate by usual means such as multiple regression. The analysis of coarse data has been investigated by Heitjan and Rubin, but methods for coarse covariates are lacking.

A fully conditional-specification multiple-imputation approach is possible if we are able to model i) the confounding variable conditional on other information in the dataset and ii) the coarsening mechanism. This provides a very flexible framework for removing residual confounding under our assumptions, including sensitivity analysis. An added complexity over missing data is that it may not be known which observations are coarsened.

Programming this method is presented in Stata for various combinations of
i) and ii) above, using the **ml** and **mi** functions. In the
simplest case of a normally distributed confounder subject to known
interval-censoring, **intreg** and **mi** can be applied. The method
is illustrated with simulated data and the true causal effect is recovered
in each instance.

**Additional materials:**

uk13_grant.pdf

### The stiqsp command: A nonparametric approach for the simultaneous analysis of quality-of-life and survival data using the integrated quality survival product

In medical research, particularly in the field of cancer, it is often
important to evaluate the impact of treatments and other factors on a
composite outcome based on survival and quality-of-life data, such as a
Quality Adjusted Life Year (QALY). We present a Stata program, **stiqsp**,
which determines the mean QALY using the Integrated Quality Survival
Product. In this nonparametric approach, the survival function is estimated
using the Kaplan–Meier method and the quality-of-life function is derived
from the mean quality-of-life score at the unique death times. Confidence
intervals for the QALY score are determined using the bootstrap method. We
illustrate the features of the command with a large dataset of patients with
lung cancer.

**Additional materials:**

uk13_gaunt.pdf

### Estimating spatial panel models using unbalanced panels

Econometricians have begun to devote more attention to spatial interactions when carrying out applied econometric studies. In part, this is motivated by an explicit focus on spatial interactions in policy formulation or market behavior, but it may also reflect concern about the role of omitted variables that are or may be spatially correlated.

The Stata user-written procedure **xsmle** has been designed to estimate
a wide range of spatial panel models, including spatial autocorrelation,
spatial Durbin, and spatial error models using maximum likelihood methods.
It relies upon the availability of balanced panel data with no missing
observations. This requirement is stringent, but it arises from the
fact that in principle, the values of the dependent variable for any panel
unit may depend upon the values of the dependent and independent variables
for all the other panel units. Thus even a single missing
data point may require that all data for a time period, panel unit, or
variable be discarded.

The presence of missing data is an endemic problem for many types of applied work, often because of the creation or disappearance of panel units. At the macro level, the number and composition of countries in Europe or local government units in the United Kingdom has changed substantially over the last three decades. In longitudinal household surveys, new households are created and old ones disappear all the time. Restricting the analysis to a subset of panel units that have remained stable over time is a form of sample selection whose consequences are uncertain and that may have statistical implications that merit additional investigation.

The simplest mechanisms by which missing data may arise underpin the
missing-at-random (MAR) assumption. When this is appropriate, it is possible to use
two approaches to estimation with missing data. The first is either simple
or, preferably, multiple imputation, which involves the replacement of
missing data by stochastic imputed values. The Stata procedure **mi** can be
combined with **xsmle** to implement a variety of estimates that rely upon
multiple imputation. While the combination of procedures is relatively
simple to estimate, practical experience suggests that the results can be
quite sensitive to the specification that is adopted for the imputation
phase of the analysis. Hence, this is not a one-size-fits-all method of
dealing with unbalanced panels, because the analyst must give serious
consideration to the way in which imputed values are generated.

The second approach has been developed by Pfaffermayr. It relies upon the spatial interactions in the model, which means that the influence of the missing observations can be inferred from the values taken by nonmissing observations. In effect, the missing observations are treated as latent variables whose distribution can be derived from the values of the nonmissing data. This leads to a likelihood function that can be partitioned between missing and nonmissing data and thus used to estimate the coefficients of the full model. The merit of the approach is that it takes explicit account of the spatial structure of the model. However, the procedure becomes computationally demanding if the proportion of missing observations is too large and, as one would expect, the information provided by the spatial interactions is not sufficient to generate well-defined estimates of the structural coefficients. The missing-at-random assumption is crucial for both of these approaches, but it is not reasonable to rely upon it when dealing with the birth or death of distinct panel units.

A third approach, which is based on methods used in the literature on statistical signal processing, relies upon reducing the spatial interactions to immediate neighbors. Intuitively, the basic unit for the analysis becomes a block consisting of a central unit (the dependent variable) and its neighbors (the spatial interactions). Because spatial interactions are restricted to within-block effects, the population of blocks can vary over time and standard nonspatial panel methods can be applied.

The presentation will describe and compare the three approaches to
estimating spatial panel models as implemented in Stata as extensions to
**xsmle**. It will be illustrated by analyses of i) state data on
electricity consumption in the U.S. and ii) gridded historical data on
temperature and precipitation to identify the effects of El Niño
(ENSO) and other major weather oscillations.

**Additional materials:**

uk13_hughes.pdf

### Repeated half-sample bootstrap resampling

This presentation illustrates Stata’s implementation of the repeated
half-sample bootstrap proposed by Saigo et al. (2001, *Survey Methodology*).
This resampling scheme is easy to implement and is appropriate for complex
survey designs, even with small stratum sizes. The user-written command
**rhsbsample** mimicks the official **bsample** command and can be used for
bootstrap inference in a wide range of settings.

**Additional materials:**

uk13_vankerm.pdf

### Sample size by simulation for clinical trials with survival outcomes: The simsam package in action

The **simsam** package, released earlier this year, allows versatile sample-size
calculation (calculating sample size required to achieve given statistical
power) using simulation for any method of analysis under any statistical
model that can be programmed in Stata. Simulation is particularly helpful
in situations where formulae for sample size are approximate or unavailable.
The usefulness of the **simsam** package will depend in part, however, on the
quality and variety of **simsam** applications developed by Stata users.

I will discuss **simsam** applications for clinical trials with time-to-event or
survival outcomes. Here sample size formulas are the subject of ongoing
research (available methods for cluster-randomized trials with variable
cluster size, for example, are only approximate). Using such examples, I will
illustrate general **simsam** programming considerations such as dealing with
analyses that fail, and the advantages of modular programming. Simulation
forces us to think carefully about trial definitions—for example, whether recruitment
ends after a fixed time or a fixed number of recruits—and to
relate the sample-size calculation to a detailed or contingent analysis
plan. Such careful attention to detail may be lost in a formulaic approach
to sample-size calculation. Simulation also allows us to check for bias in a
test as well as power. But simulation is not without problems: while a rare
(say, one in a million) failure of an analysis in Stata may not worry the
pragmatic statistician, a **simsam** application must anticipate this failure
or risk stumbling on it in the course of many simulations.

**Additional materials:**

uk13_hooper.pdf

### Strategy and tactics for graphic multiples in Stata

Many, perhaps most, useful graphs compare two or more sets of values. Examples are two or more groups or variables (as distributions, time series, etc.) or observed and fitted values for one or more model fits. Often there can be a fine line in such comparisons between richly detailed graphics and busy, unintelligible graphics that lead nowhere. In this presentation, I survey strategy and tactics for developing good graphic multiples in Stata.

Details include the use of **over()** and **by()** options and
**graph combine**;
the relative merits of super(im)posing and juxtaposing; backdrops of
context; killing the key or losing the legend if you can; transforming
scales for easier comparison; annotations and self-explanatory markers;
linear reference patterns; plotting both data and summaries; plotting
different versions or reductions of the data.

Datasets visited or revisited include James Short’s collation of observations from the transit of Venus; John Snow’s data on mortality in relation to water supply in London; Florence Nightingale’s data on deaths in the Crimea; deaths from the Titanic sinking; admissions to Berkeley; hostility in response to insult; and advances and retreats of East Antarctic ice sheet glaciers.

Specific programs discussed include
**graph dot**;
**graph bar**;
**sparkline** (SSC);
**qplot** (SJ) and its relatives;
**devnplot** (SSC);
**stripplot** (SSC); and
**tabplot** (SJ/SSC).

**Additional materials:**

uk13_cox.ppt

uk13_cox_materials.zip

### A general approach to testing for autocorrelation

Testing for the presence of autocorrelation in a time series, either in the
univariate setting or with the residuals from the estimation of some model,
is one of the most common tasks researchers face in the time-series setting.
The standard Q test statistic is that introduced by Box and Pierce (1970),
subsequently refined by Ljung and Box (1978). The original L-B-P test is
applicable to univariate time series and to testing for residual
autocorrelation under the assumption of strict exogeneity. Breusch (1978)
and Godfrey (1978) in effect extended the L-B-P approach to testing for
autocorrelations in residuals in models with weakly exogenous regressors.
Both the L-B-P test and the Breusch–Godfrey test are available in Stata, the
former for univariate time series via the **wntestq** command and the latter for
postestimation testing following OLS via the **estat bgodfrey** command.

All the above tests have important limitations: i) the tests are for
autocorrelation up to order *p*, where under the null hypothesis the
series or residuals are i.i.d.; ii) when applied to residuals from
single-equation estimation, the regressors must all be at least weakly
exogenous; iii) the tests are for single-equation models and do not cover
panel data.

We use the results of Cumby and Huizinga (1992) to extend the implementation
of the Q test statistic of L-B-P-B-G to cover a much wider range of
hypotheses and settings: i) tests for the presence of autocorrelation of
order *p* through *q*, where under the null hypothesis, there may be
autocorrelation of order *p*-1 or less; ii) tests following estimation in
which regressors are endogenous and estimation is by IV or GMM methods; iii)
tests following estimation using panel data. For iii), we show that the
Cumby–Huizinga test, developed for the large-T setting, is, when applied to the
large-N panel-data setting and limited to testing for second-order serial
correlation, formally identical to the test presented by Arellano and Bond
(1991) and available in Stata via Roodman’s **abar** command.

**Additional materials:**

uk13_baumschaffer.pdf

### Semiparametric regression in Stata

Semiparametric regression deals with the introduction of some very general nonlinear functional forms in regression analyses. This class of regression models is generally used to fit a parametric model in which the functional form of a subset of the explanatory variables is not known and/or in which the distribution of the error term cannot be assumed of being of a specific type beforehand. To fix ideas, consider the partial linear model y = zb + f(x) + e, in which the shape of the potentially nonlinear function of predictor x is of particular interest. Two approaches to modeling f(x) are to use splines or fractional polynomials. This talk reviews other more general approaches, and the commands available in Stata to fit such models.

The main topic of the talk will be partial linear regression models, with some brief discussion also of so-called single index and generalized additive models. Though several semiparametric regression methods have been proposed and developed in the literature, these are probably the most popular ones.

The general idea of partial linear regression models is that a dependent
variable is regressed on i) a set of explanatory variables entering the
model linearly and ii) a set of variables entering the model nonlinearly
but without assuming any specific functional form. Several estimators have
been proposed in the literature and are available in Stata. For example, the
**semipar** command makes available what is called the double residuals
estimator introduced by Robinson (1988), which is consistent and efficient.
Similarly, the **plreg** command fits an alternative difference-based estimator
proposed by Yatchew (1998) that has similar statistical properties to
Robinson’s estimator. These estimators will be briefly compared to identify
some drawbacks and pitfalls of both methods.

A natural concern of researchers is how these estimators could be modified
to deal with heteroskedasticity, serial correlation, and endogeneity in
cross-sectional data or how they could be adapted in the context of panel
data to control for unobserved heterogeneity. As a consequence, a
substantial part of the talk will be devoted to explaining i) how the
**plreg** and **semipar** commands can be used to tackle these very common
violations of the Gauss–Markov assumptions in cross-sectional data and ii)
how the user-written **xtsemipar** command makes a semiparametric regression
easy to fit in the context of panel data.

Because it is sometimes possible to move toward pure parametric models, a
test proposed by Hardle and Mammen (1993) and built to check whether the
nonparametric fit can be satisfactorily approximated by a parametric
polynomial adjustment of order *p* will be described.

**Additional materials:**

uk13_verardi.pdf

### Power of the power command

Stata 13’s new **power** command performs power and sample-size
analysis. The **power** command expands the statistical methods that were
previously available in Stata’s **sampsi** command. I will
demonstrate the **power** command and its additional features, including the
support of multiple study scenarios and automatic and customizable tables
and graphs. I will also present new functionality allowing users to add
their own methods to the **power** command.

**Additional materials:**

uk13_marchenko.pdf

### Adaptive dose-finding designs to identify multiple doses that achieve multiple response targets

Within drug development, it is crucial to find the right dose that is going to be safe and efficacious; this is often done within early phase II clinical trials. The aim of the dose-finding trial is to understand the relationship between the dose of drug and the potential effect of the drug. Increasingly, adaptive designs are being used in this area because they allow greater flexibility for dose exploration in comparison with traditional fixed-dose designs.

An adaptive dose-finding design usually assumes a true nonlinear dose–response model and select doses that either maximize the determinant of information matrix of the design (D-optimality) or minimize the variance of the predicted dose that gives a targeted response. Our design extends the predicted dose methodology, in a limited number of patients (40), to finding two targeted doses: a minimally effective dose and a therapeutic dose. In our trial, doses are given intravenously, so theoretically, doses are continuous and the response is assumed to be a normally distributed continuous outcome.

Our design has an initial learning phase where pairs of patients are assigned to five preassigned doses. The next phase is fully sequential with an interim analysis after each patient to determine the choice of dose based on the optimality criterion to minimize the determinant of the covariance of the estimated target doses. The dose–choice algorithm assumes that a specific parametric dose–response model is the true relationship, and so a variety of models are considered at the interim, and human judgment involved in the overall decision.

I will introduce a Mata command that uses the optimize function to find the estimated parameters of the model and to subsequently find the optimal design. Simulated results show that assuming a model with a small number of parameters (=3) leads to a choice of doses that are not near to the target doses and overrely on interpolation under the modeling assumptions. Fitting models with greater flexibility with more parameters (=4) results in a choice of doses near to the two target doses. Overall, the design is efficient and seamlessly combines the initial learning and subsequent confirmatory stages.

**Additional materials:**

uk13_mander.pdf

### usecspro: Importing CSPro hierarchical datasets to Stata

Hierarchical datasets are commonly a product of the popular CSPro system
developed by the U.S. Census Bureau. CSPro became widely popular and a de
facto standard for data collection in many countries; some agencies supply
data exclusively in CSPro format. While CSPro can export the data into
Stata format on its own, the procedure compromises on some features and
requires the user to run CSPro, which operates in MS Windows only. The new
Stata module **usecspro** allows easy import of the hierarchical datasets
into Stata by automatically parsing the CSPro dictionary files. The
conversion is implemented in Mata and allows importing data from
any level and any record of the hierarchical CSPro dataset. It also
preserves the variable and value labels and takes care of the missing values
and other common concerns during data conversion.

**Additional materials:**

uk13_radyakin.pdf

### From Stata to aML

This presentation explains how to exploit Stata to run multilevel multiprocess regressions with aML (software downloadable for free from applied-ml.com). I show how a single do-file can prepare the dataset, write the control files, input the starting values, and run the regressions without the need to manually open the aML’s Command Prompt window. In this sense, Stata helps to avoid the difficulties of running complicated regressions with aML by automatically generating the necessary files, which avoids typos and easily allows changes in model specification. The paper contains an example of how well Stata and aML work together.

**Additional materials:**

uk13_ayllon.pdf

### Two-stage individual participant data meta-analysis and forest plots

Stata has a wide range of tools for performing meta-analysis, but presently not individual participant data (IPD) meta-analysis, in which the analysis units are within-study observations (for example, patients) rather than aggregate study results.

I present **ipdmetan**, a command that facilitates two-stage IPD
meta-analysis by fitting a specified model to the data of each study in turn
and storing the results in a matrix. Features include subgroups, inclusion
of aggregate (for example, published) data, iterative estimates and confidence
limits for the tau-squared measure of heterogeneity, and the analysis of
treatment-covariate interactions. This last is a great benefit of IPD
collection and is a subject on which my colleagues and I have published
previously (Fisher et al., 2011, *Journal of Clinical Epidemiology* 64:
949–967). I shall discuss how **ipdmetan** facilitates our
recommended approach and its strengths and weaknesses, in particular
one-stage versus two-stage modeling and within- and between-trial information.

In addition, the graphics subroutine written for the **metan** package (Harris
et al., 2008, *Stata Journal* 8: 3–28) has been greatly expanded
to enable flexible, generalized forest plots for a variety of settings. I
shall demonstrate some of the possibilities and encourage feedback on how
this may be developed further.

Examples will be given using real-world IPD meta-analyses of survival data in cancer, although the programs are applicable generally.

**Additional materials:**

uk13_fisher.pptx

### A suite of Stata programs for network meta-analysis

Network meta-analysis involves synthesising the scientific literature comparing several treatments. Typically, two-arm and three-arm randomized trials are synthesized, and the aim is to compare treatments that have not been directly compared and often to rank the treatments. A difficulty is that the network may be inconsistent, and ways to assess this are required.

In the past, network meta-analysis models have been fitted using Bayesian
methods, typically in WinBUGS. I have recently shown how they may be
expressed as multivariate meta-analysis models and hence fitted using
**mvmeta**. However, various challenges remain, including getting the dataset
in the correct format, parameterizing the inconsistency model, and
making good graphical displays of complex data and results. I will show how
a new suite of Stata programs, **network**, meets these challenges, and I will
illustrate its use with examples.

**Additional materials:**

uk13_white.pptx

### Path diagrams as a notational formalism—noodling around with pictures

Math, when written carefully, can represent a model perfectly. Stata syntax, when written correctly, can represent a model perfectly. Everyone is supposed to understand math, but I have encountered those who do not. I personally think everyone should understand Stata syntax, but again, I have found a few who do not. Path diagrams provide an alternative formalism that is easily accessible to a wide audience. With a few conventions, we can use path diagrams to represent a variety of models and perhaps help explain those models. These diagrams are easy enough to create in Stata, so we will look at some models—both simple and difficult—to see what we think of this emerging formalism.

**Additional materials:**

uk13_wiggins.pdf

### Estimating and modeling cumulative incidence functions using time-dependent weights

Competing risks occur in survival analysis when a subject is at risk of more
than one type of event. A classic example is when there is consideration of
different causes of death. Interest may lie in the cause-specific hazard
rates, which can be estimated using standard survival techniques by
censoring competing events. An alternative measure is the cumulative
incidence function (CIF) which gives an estimate of absolute or crude risk
of death accounting for the possibility that individuals may die of other
causes. Geskus (2011 *Biometrics*, 67: 39–49) has recently proposed
an alternative way for the
estimation and modeling of the CIF that can use weighted versions of
standard estimators.

I will describe a Stata command, **stcrprep**, that restructures survival
data and calculates weights based on the censoring distribution. The command
is based on the R command **crprep**, but I will describe a number of
extensions that enable the CIF to be modeled directly using parametric
models on large datasets. After restructuring the data and incorporating the
weights, one can use **sts graph** to plot the CIF and **stcox** can be used to fit
a Fine and Gray model for the CIF. An advantage of fitting models in this
way is that it is possible to use a number of the standard features of
the Cox model, for example, using Schoenfeld residuals to visualize and test
the proportional subhazards assumption.

I will also describe some additional options that are useful for fitting
parametric models and useful for large datasets. In particular, I will
describe how the flexible parametric survival models estimated with **stpm2**
can be used to directly model the cumulative-incidence function. An
important advantage is that all the predictions built into **stpm2** can be used
to directly predict the CIF, subdistribution hazards, etc.

**Additional materials:**

uk13_lambert.pdf

### Mixed logit modeling in Stata—an overview

The “workhorse” model for analysing discrete choice data, the
conditional logit model, can be implemented in Stata using the official
**clogit** and **asclogit** commands. While widely used, this model
has several well-known limitations that have led researchers in various
disciplines to consider more flexible alternatives. The mixed logit model
extends the standard conditional logit model by allowing one or more of the
parameters in the model to be randomly distributed. When one models the
choices of individuals (as is common in several disciplines, including
economics, marketing, and transport), this allows for preference
of heterogeneity among respondents. Other advantages of the mixed logit model
include the ability to allow for correlations across observations in cases
where an individual made more than one choice, and relaxing the restrictive
independence from the irrelevant alternatives property of the conditional logit
model.

There are a range of commands that can be used to estimate mixed logit
models in Stata. With the exception of **xtmelogit**, the official Stata
command for estimating binary mixed logit models, all of them are
userwritten. The module that is probably best known is **gllamm**, but while
very flexible, it can be slow when the model includes several random
parameters. This talk will focus on alternative commands for estimating
logit models, with focus on the **mixlogit** module. We will also look at
alternatives and extensions to **mixlogit**, including the recent **lclogit**,
**bayesmlogit**, and **gmnl** commands. The talk will review the theory behind
the methods implemented by these commands and present examples of their use.

**Additional materials:**

uk13_hole.pdf