## 2011 Nordic and Baltic Stata Users Group meeting: Abstracts

### Quantile imputation of missing data

Matteo Bottai
Unit of Biostatistics, Institute of Environmental Medicine, Karolinska Institutet, Sweden
Multiple imputation is an increasingly popular approach for the analysis of data with missing observations. It is implemented in Stata's mi suite of commands. I present a new Stata command for imputation of missing values based on prediction of conditional quantiles of missing observations given the observed data. The command does not require making distributional assumptions and can be applied to impute dependent, bounded, censored, and count data.

bottai_nordic11.pdf

### Comparing observed and theoretical distributions

Maarten L. Buis
Institut fuer Soziologie, Universitaet Tuebingen, Germany
In this presentation, I aim to introduce graphical tools for comparing the distribution of a variable in your dataset with a theoretical probability distribution, like the normal distribution or the Poisson distribution. The presentation will consist of two parts. In the first part, I will consider univariate distributions, with a particular emphasis on hanging and suspended rootograms (hangroot). Looking at univariate distributions is not very common in a lot of (sub-(sub-))disciplines, but there are situations where this can be very useful: For example, if we have a count of accidents and we want to know whether these are occurring randomly, then we can compare this variable with a Poisson distribution. Another example would be simulations, where it is often the case that parameters or test statistics should follow a certain distribution when the model that is being checked is working as expected.
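As a rough illustration of the idea behind a hanging rootogram for count data, the following Python sketch computes the quantities such a plot displays when the comparison distribution is Poisson. It is not the hangroot implementation; the function names and the choice of the sample mean as the Poisson parameter are assumptions made for the example.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Poisson probability of observing the count k when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def hanging_rootogram(counts):
    """Return one row (k, sqrt_expected, sqrt_observed, bar_bottom) per count k.

    Each bar hangs from the square root of the expected frequency and has
    length equal to the square root of the observed frequency, so its bottom
    sits at sqrt_expected - sqrt_observed. Bottoms close to zero suggest a
    good fit at that count.
    """
    n = len(counts)
    lam = sum(counts) / n  # assumption: Poisson mean estimated by the sample mean
    observed = Counter(counts)
    rows = []
    for k in range(max(counts) + 1):
        sqrt_expected = math.sqrt(n * poisson_pmf(k, lam))
        sqrt_observed = math.sqrt(observed.get(k, 0))
        rows.append((k, sqrt_expected, sqrt_observed, sqrt_expected - sqrt_observed))
    return rows


if __name__ == "__main__":
    accidents = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]  # made-up counts for illustration
    for k, e, o, bottom in hanging_rootogram(accidents):
        print(f"k={k}  sqrt(expected)={e:.3f}  sqrt(observed)={o:.3f}  bottom={bottom:+.3f}")
```

The square-root scale is what distinguishes a rootogram from a plain histogram overlay: it makes deviations in the sparsely populated tails easier to see.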

In the second part of the talk, I will focus on the more common situation where models assume a certain distribution for the explained/dependent/y variable, and I will estimate how one or more parameters, often the mean, change when one or more explanatory/independent/x variables change. The challenge now is that the dependent variable no longer follows the theoretical distribution, but rather a mixture of these theoretical distributions. In the case of a linear regression, we can circumvent this difficulty by looking at the residuals, which should follow a normal distribution. However, this circumvention does not generalize to other models. I will show how to graphically compare the distribution of the dependent variable with the theoretical mixture distribution. The focus will be on a trick to sample new dependent variables under the assumption that the model is true. Graphing the distribution of the actual dependent variable together with these sampled variables will give an idea of whether deviations from the theoretical distribution could have occurred by chance. This idea will be applied to checking the distributional assumption in beta regression (betafit) and to choosing between different parametric survival models (streg).
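The sampling idea can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the model is taken to be an exponential regression whose fitted means are already available, and `simulate_envelope` is a hypothetical helper, not part of any command mentioned in the talk.

```python
import random


def simulate_envelope(fitted_means, n_rep=200, seed=1):
    """Draw n_rep replicate outcome vectors from the assumed model.

    Each replicate draws one outcome per observation from an exponential
    distribution with that observation's fitted mean, then sorts the draws.
    The function returns, for every rank, the smallest and largest simulated
    value, which gives a crude envelope for the sorted observed outcomes.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_rep):
        draw = sorted(rng.expovariate(1.0 / m) for m in fitted_means)
        replicates.append(draw)
    n = len(fitted_means)
    lower = [min(rep[i] for rep in replicates) for i in range(n)]
    upper = [max(rep[i] for rep in replicates) for i in range(n)]
    return lower, upper


if __name__ == "__main__":
    means = [0.5, 1.0, 1.5, 2.0, 4.0]      # hypothetical fitted means
    observed = [0.3, 0.9, 1.1, 2.5, 3.0]   # hypothetical observed outcomes
    lower, upper = simulate_envelope(means)
    for y, lo, hi in zip(sorted(observed), lower, upper):
        flag = "" if lo <= y <= hi else "  <- outside simulated range"
        print(f"y={y:.2f}  range=[{lo:.2f}, {hi:.2f}]{flag}")
```

Observed values that fall outside the range spanned by the simulated datasets point to deviations that are unlikely to have arisen by chance if the model were true.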

buis_nordic11.pdf

### Simulating complex survival data

Michael J. Crowther
Department of Health Sciences, University of Leicester, Leicester, United Kingdom
Paul C. Lambert
Department of Health Sciences, University of Leicester, Leicester, United Kingdom and Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
Simulation studies are essential for understanding and evaluating both current and new statistical models. When simulating survival times, often an exponential or Weibull distribution is assumed for the baseline hazard function, but these distributions can be considered too simplistic and lack biological plausibility in many situations. We will describe a new user-written command, survsim, that allows the user to simulate survival times from two-component mixture models, allowing much more flexibility in the underlying hazard. Standard parametric models can also be used, including the exponential, Weibull, and Gompertz models. Furthermore, survival times can be simulated from the all-cause distribution of cause-specific hazards for competing risks. A multinomial distribution is used to create the event indicator, whereby the probability of experiencing each event at a simulated time, t, is the cause-specific hazard divided by the all-cause hazard evaluated at time t. Baseline covariates and non-proportional hazards can be included in all scenarios. Finally, we will discuss the complex extension of simulating joint longitudinal and survival data.
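The competing-risks step described above can be sketched in Python as follows. This is not the survsim code: it assumes Weibull cause-specific hazards of the form lam * gamma * t^(gamma - 1), inverts the all-cause cumulative hazard by bisection, and uses a hypothetical function name and made-up parameter values.

```python
import math
import random


def weibull_hazard(t, lam, gamma):
    """Cause-specific hazard lam * gamma * t**(gamma - 1)."""
    return lam * gamma * t ** (gamma - 1)


def weibull_cumhaz(t, lam, gamma):
    """Cause-specific cumulative hazard lam * t**gamma."""
    return lam * t ** gamma


def simulate_competing_risks(n, causes, seed=2011):
    """Simulate n (time, cause) pairs; causes is a list of (lam, gamma) pairs.

    The event time is drawn from the all-cause distribution by solving
    H(t) = -log(1 - U) for t, where H is the sum of the cause-specific
    cumulative hazards. The cause is then drawn with probability
    h_k(t) / sum_j h_j(t), evaluated at the simulated time.
    """
    rng = random.Random(seed)

    def all_cause_cumhaz(t):
        return sum(weibull_cumhaz(t, lam, gamma) for lam, gamma in causes)

    results = []
    for _ in range(n):
        target = -math.log(1.0 - rng.random())
        lo, hi = 0.0, 1.0
        while all_cause_cumhaz(hi) < target:  # widen the bracket until it contains t
            hi *= 2.0
        for _ in range(100):  # bisection on the monotone cumulative hazard
            mid = (lo + hi) / 2.0
            if all_cause_cumhaz(mid) < target:
                lo = mid
            else:
                hi = mid
        t = (lo + hi) / 2.0
        hazards = [weibull_hazard(t, lam, gamma) for lam, gamma in causes]
        cause = rng.choices(range(1, len(causes) + 1), weights=hazards)[0]
        results.append((t, cause))
    return results


if __name__ == "__main__":
    sims = simulate_competing_risks(5, [(0.1, 1.5), (0.2, 0.8)])
    for t, cause in sims:
        print(f"time={t:.3f}  cause={cause}")
```

Administrative censoring, covariates, and non-proportional hazards are left out of the sketch to keep the event-indicator logic visible.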

crowther_nordic11.pdf

### Quantiles of the survival time from inverse probability weighted Kaplan–Meier estimates

Andrea Discacciati
Unit of Biostatistics and Nutritional Epidemiology, Institute of Environmental Medicine, Karolinska Institutet, Sweden
The official Stata command stci indirectly estimates quantiles of the survival time for different exposure levels from the Kaplan–Meier estimates. However, stci does not take into account possible confounding effects. Therefore, we introduce a new Stata command, stqkm, that indirectly estimates quantiles of the survival time from inverse probability weighted Kaplan–Meier estimates. Confidence intervals for the quantile estimates are obtained using the bootstrap method. We present a simulation study to assess the performance of the stqkm command in the presence of confounding, and we present a case study.
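A minimal Python sketch of the underlying calculation is given below. It is not the stqkm implementation: the weights are assumed to have been estimated beforehand (for example, as the inverse of each subject's estimated probability of the exposure level actually received), the bootstrap step is omitted, and both function names are hypothetical.

```python
def weighted_km(times, events, weights):
    """Weighted Kaplan-Meier estimate.

    times   -- follow-up times
    events  -- 1 if the event was observed, 0 if censored
    weights -- one nonnegative weight per subject (assumed precomputed)

    Returns a list of (t, S(t)) pairs at the distinct event times.
    """
    data = sorted(zip(times, events, weights))
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        at_risk = sum(w for _, _, w in data[i:])  # weighted number still at risk
        d = 0.0                                   # weighted number of events at t
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                d += data[i][2]
            i += 1
        if d > 0:
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
    return curve


def km_quantile(curve, p):
    """Smallest event time at which S(t) <= 1 - p, or None if never reached."""
    for t, s in curve:
        if s <= 1.0 - p:
            return t
    return None


if __name__ == "__main__":
    times = [2, 3, 3, 5, 8, 9]
    events = [1, 1, 0, 1, 0, 1]
    weights = [1.2, 0.8, 1.0, 1.5, 0.9, 1.1]  # made-up weights for illustration
    curve = weighted_km(times, events, weights)
    print(curve)
    print("median:", km_quantile(curve, 0.5))
```

With all weights equal to 1, the function reduces to the ordinary Kaplan–Meier estimator, which is a convenient check on the weighting logic.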

discacciati_nordic11.pdf

### An example of competing-risks analysis using Stata

Christel Häggström
Umeå University, Sweden
Competing-risks analysis is of special importance in epidemiological survival analysis when studying the elderly and when the exposure is related to early death. In a cohort study, I investigated the association between metabolic factors (obesity, hypertension, high glucose levels, etc.) and prostate cancer (mean age at diagnosis 70 years). Using these data, I will present an analysis in which I plotted cumulative incidence curves to visualize the risk of prostate cancer in comparison with the competing risk, all-cause mortality, for different levels of metabolic factors, using the Stata commands stcompet and stpepemori. I also used Fine and Gray regression (the stcrreg command) to calculate subdistribution hazard ratios for both prostate cancer incidence and all-cause mortality.

haggstrom_nordic11.pdf

### Using Stata for agent-based simulations

Peter Hedström
Institute for Futures Studies, Stockholm, Sweden
Thomas Grund
ETH, Zürich, Switzerland
Agent-based modeling (ABM) is an analytical tool that is becoming increasingly important in the social sciences. The core idea behind ABM is to use computational models to analyze the macro- or aggregate-level outcomes that groups of agents, in interaction with one another, bring about. In this presentation, we briefly discuss why ABM is important and show how Stata can be used for such analyses. We also present a suite of programs. Some of these commands are used for generating, visualizing, or measuring various properties of the networks within which the agents are embedded, and others are used for analyzing the collective outcomes that agents are likely to bring about when embedded in such networks.

### A command for Laplace regression

Nicola Orsini
Unit of Biostatistics and Nutritional Epidemiology, Institute of Environmental Medicine, Karolinska Institutet, Sweden
I present an estimation command for Laplace regression to model conditional quantiles of a response variable given a set of covariates. The laplace command is similar to the official qreg command except that it can account for censored data. I illustrate its applicability and use through examples from health-related fields.

orsini_nordic11.pdf

### Using meta-analysis to inform the design of subsequent studies

Sally R. Hinchliffe, Michael J. Crowther, Alison Donald, and Alex J. Sutton
Department of Health Sciences, University of Leicester, Leicester, United Kingdom
In this presentation, we describe a suite of programs (metasim, metapow, metapowplot) that enable the user to estimate the probability that the conclusions of a meta-analysis will change with the inclusion of one or more new studies, as described previously by Sutton et al. (2007). Using the metasim program, we take a simulation approach to estimating the effects in future studies. The method assumes that the effect sizes of future studies are consistent with those observed previously, as represented by the current meta-analysis. The contexts of both two-arm randomized controlled trials and studies of diagnostic test accuracy are considered for a variety of outcome measures. Calculations are possible under both fixed- and random-effects assumptions, and several approaches to inference, including statistical significance and limits of clinical significance, are possible. Calculations for specific sample sizes can be conducted (using metapow), and plots, akin to traditional power curves, indicating the probability that one or more new studies will change inferences for a range of sample sizes can be produced (using metapowplot). Finally, plots of the simulation results are overlaid on a previously described macro, extfunnel, which can help to intuitively explain the results of such calculations of sample size. We hope the macro will be useful to trialists who want to assess the impact potential new trials will have on the overall evidence base and to meta-analysts who want to assess the robustness of the current meta-analysis to the inclusion of future data.

Reference:
Sutton, A. J., N. J. Cooper, D. R. Jones, P. C. Lambert, J. R. Thompson, and K. R. Abrams. 2007. Evidence-based sample size calculations based upon updated meta-analysis. Statistics in Medicine 27: 471–490.

hinchcliffe_nordic11.pdf

### Taking the pain out of looping and storing

Patrick Royston
MRC Clinical Trials Unit, United Kingdom
Quite a common task in Stata is to run some sequence of commands under the control of a looping parameter and store the corresponding results in one or more new variables. Over the years, I have written many such loops, some of greater complexity than others. I finally became fed up with it and decided to write a simple command to automate the repetitive parts. The result is looprun, which I shall describe in this presentation.

royston_nordic11.ppt

### Projecting cancer incidence using restricted cubic splines

Mark J. Rutherford, Paul C. Lambert, and John R. Thompson
Department of Health Sciences, University of Leicester, Leicester, United Kingdom
Age–period–cohort models provide a useful method for modeling cancer incidence and mortality rates. There is great interest in estimating the rates of disease at given future time points so that plans can be made for the provision of the required future services. In the setting of using age–period–cohort models incorporating restricted cubic splines, we propose a new technique for projecting incidence. The method is validated via a comparison with existing methods in the setting of Finnish Cancer Registry data. The reasons for the improvements seen in the newly proposed method are twofold. First, improvements are seen because of the finer splitting of the timescale to give a more continuous estimate of the incidence rate. Second, the new method uses more-recent trends to dictate the future projections than previously proposed methods. The output will be produced via the user-written command apcfit. The functionality of the command will be illustrated throughout the talk.

The talk will comprise an introduction of the use of restricted cubic splines for model fitting before describing their use for age–period–cohort models. A description of the new method for projecting cancer incidence will be given prior to showing the results of the application of the method to Finnish Cancer Registry data. The talk will conclude with a description of the potential problems and issues when making projections.

rutherford_nordic11.pdf

### Time to dementia onset: Competing-risks analysis with Laplace regression

Giola Santoni, Debora Rizzuto, and Laura Fratiglioni
Aging Research Center, Karolinska Institutet, Sweden
We want to quantify the protective effect of education on time to dementia onset using longitudinal data from a population study. We consider dropout due to death of the subject as a competing event of the outcome of interest. We show an adaptation of the Laplace regression method to the case of competing-risks analysis. The first 20% of highly educated people will develop dementia 2.5 years (p<.01) later than those with a lower education level. The effect on all-cause mortality is negligible. We show that the results derived through Laplace regression are comparable with those derived with the Stata command stcrreg.

santoni_nordic11.pdf

### Doubly robust estimation in generalized linear models with Stata

Arvid Sjölander
Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Sweden
Nicola Orsini
Units of Biostatistics and Nutritional Epidemiology, Institute of Environmental Medicine, Karolinska Institutet, Sweden
The aim of epidemiological research is typically to estimate the association between a particular exposure and a particular outcome, adjusted for a set of additional covariates. This is commonly done by fitting a regression model for the outcome, given exposure and covariates. If the regression model is misspecified, then the resulting estimator may be inconsistent. Recently, a new class of estimators has been developed, so-called “doubly robust” (DR) estimators. These estimators use two regression models: one for the outcome and one for the exposure. A DR estimator is consistent if either model is correct, not necessarily both. Thus, DR estimators give the analyst two chances instead of only one to make valid inference. In this presentation, we describe a new package for Stata that implements the most common DR estimators.
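To make the "two chances" property concrete, here is a small Python sketch of one widely used DR estimator, the augmented inverse-probability-weighted (AIPW) estimator of a mean difference for a binary exposure. It is not the package described in the talk: the function name is hypothetical, and the predictions from the outcome model and the exposure model are assumed to have been fitted elsewhere and passed in.

```python
def aipw_mean_difference(y, a, m1, m0, ps):
    """AIPW estimate of E[Y(1)] - E[Y(0)] for a binary exposure.

    y  -- observed outcomes
    a  -- exposure indicators (1 = exposed, 0 = unexposed)
    m1 -- outcome-model predictions for each subject if exposed
    m0 -- outcome-model predictions for each subject if unexposed
    ps -- exposure-model predictions P(a = 1 | covariates), strictly in (0, 1)
    """
    n = len(y)
    total1 = 0.0
    total0 = 0.0
    for yi, ai, m1i, m0i, pi in zip(y, a, m1, m0, ps):
        # Outcome-model prediction plus an inverse-probability-weighted
        # residual correction, applied only to subjects observed at that level.
        total1 += m1i + ai * (yi - m1i) / pi
        total0 += m0i + (1 - ai) * (yi - m0i) / (1.0 - pi)
    return total1 / n - total0 / n


if __name__ == "__main__":
    y = [1.0, 3.2, 2.1, 4.3]
    a = [0, 1, 0, 1]
    m1 = [3.0, 3.0, 4.0, 4.0]   # hypothetical outcome-model predictions
    m0 = [1.0, 1.0, 2.0, 2.0]
    ps = [0.4, 0.6, 0.5, 0.7]   # hypothetical exposure-model predictions
    print(aipw_mean_difference(y, a, m1, m0, ps))
```

If the outcome model is correct, the residual corrections average to zero whatever the exposure model says; if the exposure model is correct, the weighting removes the bias of a wrong outcome model. That is the sense in which either model being correct suffices.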

sjolander_nordic11.pdf

### Chained equations and more in multiple imputation in Stata 12

Yulia Marchenko
StataCorp LP
I present the new Stata 12 command, mi impute chained, to perform multivariate imputation using chained equations (ICE), also known as sequential regression imputation. ICE is a flexible imputation technique for imputing various types of data. The variable-by-variable specification of ICE allows you to impute variables of different types by choosing the appropriate method for each variable from several univariate imputation methods. Variables can have an arbitrary missing-data pattern. By specifying a separate model for each variable, you can incorporate certain important characteristics, such as ranges and restrictions within a subset, specific to each variable. I also describe other new features in multiple imputation in Stata 12.

marchenko_nordic11.pdf

### SEM for those who think they don’t care

Vince Wiggins
StataCorp LP
We will discuss SEM (structural equation modeling), not from the perspective of the models for which it is most often used (measurement models, confirmatory factor analysis, and the like) but from the perspective of how it can extend other estimators. From a wide range of choices, we will focus on extensions of mixed models (random- and fixed-effects regression). Extensions include conditional effects (not completely random), endogenous covariates, and others.