Last updated: 9 December 2004

2004 Australian and New Zealand Users Group meeting

10 October 2004

Holiday Inn
65 Hindley Street
Adelaide, South Australia


Programming for further processing of postestimation results

Ian Watson
Australian Centre for Industrial Relations Research and Training, University of Sydney


Do you find yourself regularly cutting and pasting your postestimation results, such as regression coefficients, into a spreadsheet? If so, you should consider trying to program wherever possible. Using Stata's matrix commands, this presentation will show you how to process postestimation results, such as manipulating a vector of regression coefficients. By crafting your own small ado-file, you can save yourself from the tedious and repetitive job of using spreadsheets.

The presentation will be illustrated with an example based on decomp, an ado-file by Ian Watson that decomposes earnings results. The spreadsheet approach will be contrasted with the ado approach.
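
As a rough illustration of the kind of arithmetic that otherwise ends up in a spreadsheet, the sketch below takes two coefficient vectors and two vectors of covariate means and splits a mean earnings gap into an "endowments" part and a "coefficients" part. It is written in Python rather than Stata, it assumes a two-fold Blinder-Oaxaca-style decomposition (the abstract does not say which decomposition decomp implements), and the function name and all inputs are hypothetical.

```python
def twofold_decomposition(beta_a, beta_b, xbar_a, xbar_b):
    """Split the mean outcome gap between groups A and B into two parts.

    Group A's coefficients are used as the reference (an assumption):
      gap          = xbar_a . beta_a - xbar_b . beta_b
      endowments   = (xbar_a - xbar_b) . beta_a
      coefficients = xbar_b . (beta_a - beta_b)
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    gap = dot(xbar_a, beta_a) - dot(xbar_b, beta_b)
    endowments = dot([xa - xb for xa, xb in zip(xbar_a, xbar_b)], beta_a)
    coefficients = dot(xbar_b, [ba - bb for ba, bb in zip(beta_a, beta_b)])
    return {"gap": gap, "endowments": endowments, "coefficients": coefficients}


# Hypothetical inputs: a constant and one covariate (e.g. years of schooling).
result = twofold_decomposition(
    beta_a=[1.0, 0.5], beta_b=[0.8, 0.4],
    xbar_a=[1.0, 12.0], xbar_b=[1.0, 10.0],
)
```

The two parts sum to the gap by construction, so the same function can be rerun on every new set of estimates without any manual copying.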

Australia's firm-level productivity — a new perspective

Robert Breunig
Research School of Social Sciences, Australian National University
Marn–Heong Wong
Australia–Japan Research Centre, Australian National University

Not all firms contributed to Australia's impressive productivity growth in the 1990s. Some performed better than others, and entrants arrived even as incumbents exited. If firms make decisions on input demand and liquidation based on their productivity, the latter known to them but unobserved by the econometrician, this gives rise to simultaneity and selection problems that bias the traditional estimators of production function coefficients. We apply a semiparametric technique that endogenizes input choices and firm exit decisions to obtain production function estimates on Australian firms. Estimation is carried out using the Business Longitudinal Survey, Australia's only business longitudinal micro-dataset that tracks firm entry and exit.

Simulating two- and three-generation pedigree data for genetic epidemiology research

Jisheng S. Cui
Department of Public Health, University of Melbourne, and Mathematical and Information Sciences Division, Commonwealth Scientific and Industrial Research Organisation, Clayton, Victoria

Alongside real pedigree data, simulated pedigree data play an important role in genetic epidemiology research. Simulated data can be used to compare the efficiency of different statistical models and to investigate questions that cannot be answered with real data. Simulating pedigree data in Stata has advantages over using general-purpose programming languages (e.g., C++ or Fortran) because random numbers from common probability distributions are easy to generate in the software. Here we introduce two Stata programs, simuped2 and simuped3, which simulate two- and three-generation pedigree data, respectively. Variables generated by these programs include family ID, individual ID, generation, age, gender, and genotype.
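
The following Python sketch shows, under simplifying assumptions, what a two-generation simulation of this kind might look like. It is not the simuped2 program: the function names, the age ranges, the single biallelic locus, and the Hardy-Weinberg assumption for founders are all hypothetical choices made for illustration.

```python
import random


def draw_founder_genotype(rng, allele_freq):
    """Number of risk alleles (0, 1 or 2), assuming Hardy-Weinberg proportions."""
    return sum(rng.random() < allele_freq for _ in range(2))


def transmit_allele(rng, genotype):
    """Return 1 if a parent with this genotype passes on the risk allele."""
    return int(rng.random() < genotype / 2)


def simulate_two_generation(n_families, n_children, allele_freq, seed=0):
    """Simulate nuclear families: two parents plus n_children offspring each."""
    rng = random.Random(seed)
    rows = []
    for family_id in range(1, n_families + 1):
        father = draw_founder_genotype(rng, allele_freq)
        mother = draw_founder_genotype(rng, allele_freq)
        person_id = 0
        for gender, genotype in (("M", father), ("F", mother)):
            person_id += 1
            rows.append({"family": family_id, "id": person_id, "generation": 1,
                         "age": rng.randint(50, 70), "gender": gender,
                         "genotype": genotype})
        for _ in range(n_children):
            person_id += 1
            # Mendelian inheritance: one allele from each parent.
            child = transmit_allele(rng, father) + transmit_allele(rng, mother)
            rows.append({"family": family_id, "id": person_id, "generation": 2,
                         "age": rng.randint(20, 40), "gender": rng.choice("MF"),
                         "genotype": child})
    return rows
```

Each returned row carries the variables listed in the abstract (family ID, individual ID, generation, age, gender, and genotype).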

Modeling intensive care unit outcome in a large data base: analysis of the institutional effect

John Moran
ANZICS Adult Data Base Committee, Carlton, Victoria
P. Bristow, N. Bishop, and C. George
ANZICS Adult Data Base Committee, Carlton, Victoria
Patty Solomon
School of Applied Mathematics, University of Adelaide

Within the intensive care environment, large databases exist, recording patient, ICU, and hospital details. Over the last 20 years, a number of competing algorithms have been developed to generate risk-adjusted outcomes for patients; the best known is the APACHE II (Acute Physiology and Chronic Health Evaluation) algorithm. Standardized mortality rates (SMR) for individual ICUs have subsequently been generated (the "league-tables" paradigm). The method of calculating the SMR using, say, the APACHE II algorithm, whereby "mortality ratios are calculated by projecting the APACHE II score-specific mortalities of the total group on case mix ...of individual ICUs", amounts to an indirect standardization, which (quoting Yule and Rothman) "is not fully a method of standardization at all". It has been recommended (Fidler 1997) to use direct standardization by either (a) logistic regression ... with separate intercepts for each ICU, where the intercepts are simply the logits of directly standardized mortality rates and can be used for rankings (this approach assumes constant slopes for all ICUs... and can be tested), or (b) modeling the differences between ICUs as random effects (DeLong et al. 1997).

The above matters will be addressed using data from the ANZICS (Australia and New Zealand Intensive Care Society) national data base, 1993–2003, recording APACHE II data and hospital outcomes for 280,000 patients in 201 ICUs. Implications for the use of Stata will be illustrated.
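
To make the indirect standardization concrete, the Python sketch below computes an observed-to-expected mortality ratio per ICU from patient-level predicted risks. It assumes the predicted risk for each patient has already been obtained from a risk model such as APACHE II; the function name, the record layout, and the toy data are hypothetical and are not drawn from the ANZICS data base.

```python
from collections import defaultdict


def smr_by_icu(records):
    """Indirectly standardized mortality ratio per ICU.

    records: iterable of (icu_id, predicted_risk, died) tuples, where
    predicted_risk is a model-based probability of death and died is 0 or 1.
    Returns {icu_id: observed deaths / expected deaths}.
    """
    observed = defaultdict(int)
    expected = defaultdict(float)
    for icu_id, predicted_risk, died in records:
        observed[icu_id] += int(died)
        expected[icu_id] += predicted_risk  # expected deaths = sum of risks
    return {icu: observed[icu] / expected[icu]
            for icu in expected if expected[icu] > 0}


# Toy data for two ICUs (hypothetical values).
toy = [("A", 0.5, 1), ("A", 0.5, 0), ("B", 0.25, 1), ("B", 0.25, 1)]
ratios = smr_by_icu(toy)
```

Because the expected deaths depend on each ICU's own case mix, ratios computed this way are not strictly comparable across ICUs, which is the concern the abstract raises.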


Fidler, V. 1997. The effect of case mix adjustment on mortality as predicted by APACHE II. Intensive Care Medicine 23: 711.

DeLong, E. R., E. D. Peterson, D. M. DeLong, L. H. Muhlbaier, S. Hackett, and D. B. Mark. 1997. Comparing risk-adjustment methods for provider profiling. Statistics in Medicine 16: 2645–2664.

Generalized partially linear models

Roberto Gutierrez

Partially linear models are linear regression models where one component is allowed to vary nonparametrically. Generalized partially linear models extend this case from linear regression to the quasi-likelihood setting of standard GLIMs, thus encompassing a larger class of models including logistic, Poisson, and gamma regression. Although estimation for these models is possible in official Stata via fractional polynomials, the approach presented here is entirely nonparametric and uses a local-linear smooth to estimate the "nonlinear" component. The Stata command gplm for fitting generalized partially linear models is discussed and demonstrated.
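
In one common notation (assumed here, not taken from the talk), with response y, linear covariates x, a covariate t entering nonparametrically through an unspecified smooth function m, and a link function g, the two model classes can be written as:

```latex
% Partially linear model
y = x^{\top}\beta + m(t) + \varepsilon, \qquad E(\varepsilon \mid x, t) = 0

% Generalized partially linear model with link function g
g\{E(y \mid x, t)\} = x^{\top}\beta + m(t)
```

Taking g to be the identity link recovers the partially linear model as a special case.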

The effect of missing data on covariates in survival analysis

Irit Aitkin
Department of Psychology, University of Melbourne

We deal with the problem of missing data on covariates in the context of survival analysis. More specifically, we examine the factors affecting the duration of breastfeeding in Western Australia. Duration was studied in 556 women delivering at two maternity hospitals in Perth, Australia. The study was carried out over the period September 1992 to April 1993. Of these women, 466 were breastfeeding when they left the hospital. In a previous analysis, the Cox proportional hazards model was fitted to determine the factors affecting duration of breastfeeding. However, because of missing data, a covariate known to be important, smoking, could not be used as it would have resulted in a loss of almost 50% of the available sample. In this analysis, we incorporate the incomplete data on smoking omitted from the previous analysis.

We deal with the missing data on covariates in survival analysis in two ways — the first is by maximum likelihood and the second by multiple imputation.

Direct maximization of the likelihood with missing data is complicated, and most methods that perform maximum likelihood estimation (for example, the EM algorithm) use some form of data augmentation, which augments the observed data with latent (unobserved) data, so that very complicated calculations are replaced by much simpler ones given the "complete data".

The distribution of response time for cases with smoking missing is no longer a Cox model but a mixture of two such models, in proportions given by the population proportions of smokers and non-smokers. The likelihood function is therefore different for complete and incomplete cases, and so maximizing it is more complicated in having to allow for this difference.
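
In symbols, and using notation that is assumed here rather than taken from the talk (t_i for the duration, x_i for the fully observed covariates, s for smoking status, and π for the population proportion of smokers), the contribution of a case with smoking missing is a two-component mixture:

```latex
f(t_i \mid x_i) \;=\; \pi \, f(t_i \mid x_i, s = 1) \;+\; (1 - \pi) \, f(t_i \mid x_i, s = 0)
```

A case with smoking observed contributes only the single component f(t_i | x_i, s_i) for its own smoking status, which is why the likelihood takes a different form for complete and incomplete cases.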

We carried out the ML analysis in Stata using GLLAMM (Generalized Linear Latent And Mixed Models) routines (Rabe-Hesketh, Pickles, and Skrondal 2001). In the GLLAMM procedure, a latent smoking variable is defined for the cases with smoking missing, and the breastfeeding durations are regressed on the explanatory variables and smoking: the covariate when it is observed and the latent variable when not. The model for the smoking covariate is a "measurement model" when the covariate is observed and a "structural model" when it is not.

We compared ML using GLLAMM with multiple imputation using the program written by J. L. Schafer, mainly for S-Plus/R. It is based on the data augmentation algorithm (Tanner and Wong 1987).


Rabe-Hesketh, S., A. Pickles, and A. Skrondal. 2001. GLLAMM: A general class of multilevel models and a Stata program. Multilevel Modelling Newsletter 13: 17–23.

Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.

Tanner, M. A. and W. H. Wong. 1987. The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association 82: 528–550.

Tools for using multiple imputation for missing data in Stata

John Carlin
Departments of Paediatrics & Public Health, University of Melbourne, and Murdoch Childrens Research Institute, Melbourne
Philip Greenwood, John Galati, and Joe Schafer
Departments of Paediatrics & Public Health, University of Melbourne, and Murdoch Childrens Research Institute, Melbourne

A major analytic challenge in epidemiological studies is the threat to validity and precision of conclusions raised by missing data. It is still commonly accepted practice to analyze data containing missing values by "complete-case" methods, where entire individuals are omitted from the analysis if they have a missing value on any of the variables required for the analysis in question. This approach can lead to biases in conclusions, by excluding individuals in whom patterns of association may be different than among those retained, and at best leads to loss of precision due to the reduction in sample size available for analysis. The method of multiple imputation is gaining popularity as an approach for dealing with missing data. It involves the production of multiple complete datasets based on a statistical model for the missing values given the observed data. Each of the imputed datasets is then analyzed using standard methods, and valid inferences are obtained by combining these estimates appropriately. Given tools for (a) imputing the missing values, and (b) analyzing the multiple imputed datasets, the method offers great flexibility. In this talk I will review currently available tools for task (a), ranging from fully model-based methods provided in software developed by Schafer and now available in packages such as SAS and S-PLUS to more pragmatic but flexible techniques such as the use of chained equations. Stata commands for performing the latter technique have recently been developed by Patrick Royston, and we are working to develop Stata interfaces for some of Schafer's methods. Tools for task (b) have been fairly limited but we have recently published a flexible package of commands in Stata, which allows a wide range of data manipulations as well as combined analyses to be performed on multiple imputed datasets with minimal effort. 
We have used multiple imputation to address missing data problems in the Victorian Adolescent Health Cohort Study (VAHCS), which began in 1992 with participants aged 15 and has recently completed an 8th wave of data collection. Analyses of data from this study will be used in the talk to illustrate the methods and to highlight outstanding issues, both statistical and computational.
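
The combining step in task (b) is usually done with Rubin's rules. The abstract does not name the method, so treating it as Rubin's rules is an assumption, and the Python function below is only a minimal sketch of those rules for a single scalar estimate. The function name and inputs are hypothetical.

```python
from statistics import mean, variance


def combine_imputations(estimates, variances):
    """Combine one scalar estimate across m >= 2 imputed datasets.

    estimates: point estimates, one per imputed dataset.
    variances: the corresponding squared standard errors.
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from three imputed datasets.
pooled, total = combine_imputations([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the extra uncertainty due to the missing values; analyzing a single imputed dataset would omit it.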

Analyzing multiply imputed datasets: separate or stacked

Philip Greenwood
Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Melbourne

The method of multiple imputation provides an attractive approach to handling missing data in large studies. A variety of software is now available to produce multiply imputed (MI) datasets, and we have published a set of Stata commands, "MI tools", that facilitate the manipulation and analysis of MI datasets. MI datasets can be either a set of separate data files or a single (stacked) data file with some extra information to index the datasets. For the purpose of writing Stata commands to analyze these data, what are the benefits of each format? The stacked format seems to offer greater efficiency and elegance and can make better use of existing syntax structures. However, separate data files seem to offer greater overall flexibility and some important tasks can only be implemented in that format. It seems that a combined approach might give the best of both worlds. This talk will describe our current work on a revised version of MI tools.
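
The two storage formats can be pictured with a small Python sketch. It is not how MI tools is implemented: the index column name `_imp` and both function names are hypothetical, and each dataset is simply a list of row dictionaries.

```python
from collections import defaultdict


def stack(datasets):
    """Combine separate imputed datasets into one table with an index column."""
    return [dict(row, _imp=k)
            for k, data in enumerate(datasets, start=1)
            for row in data]


def unstack(stacked):
    """Split a stacked table back into separate datasets, ordered by index."""
    groups = defaultdict(list)
    for row in stacked:
        row = dict(row)              # copy so the input is left unchanged
        groups[row.pop("_imp")].append(row)
    return [groups[k] for k in sorted(groups)]


# Two hypothetical imputed datasets that differ only in the imputed value.
separate = [[{"x": 1}, {"x": 2}], [{"x": 1}, {"x": 3}]]
stacked = stack(separate)
```

Converting in both directions is cheap, which is one reason a combined approach, as suggested above, is plausible.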

Using plugins and COM servers in Stata for handling multiple datasets

John Galati
Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Melbourne

Computational efficiency and flexibility in a statistical package may be enhanced by enabling the package to communicate directly with other programs. A model of particular interest at the moment is the component object model (COM). This model provides a uniform mechanism for programs running under Microsoft Windows to share data and functionality. Recently, statistical routines for imputing values in multiple datasets have been packaged by Joe Schafer as COM servers, making them available to a wide variety of statistical analysis packages. (The routines themselves were also originally written by Joe.) In the first part of this talk, I will discuss using Stata plugins to access these multiple imputation routines from within Stata. Techniques for handling missing data invariably involve processing multiple datasets. Since Stata is fundamentally geared towards processing a single dataset at any given time, a natural question that arises is how best to handle multiple datasets in Stata in a general, flexible, and efficient manner. In the remainder of the talk, I will discuss using COM servers and Stata plugins for this purpose, and I will highlight the advantages of this approach from the perspective of computational efficiency, flexibility, and elegance.

Propensity score matching using -psmatch-

Adrian Esterman
Flinders Centre for Epidemiology and Biostatistics, Flinders University, Adelaide

In observational studies, the researcher has no control over treatment assignment. Control and intervention groups are therefore often unbalanced with respect to confounding variables, and even covariate adjustment doesn't always fully eliminate bias. The propensity score is the conditional probability of being in the treatment group given the covariates, and it can be used to balance the covariates in the two groups. The score is derived from a logistic regression model of treatment group on the covariates, with the propensity score being the predicted probability of being in the treated group.

Once calculated, the propensity score can be used to reduce bias by matching, stratification, or by using it as a covariate in the regression model. In this presentation, I will briefly present some of the theory behind the use of propensity scores, and demonstrate the Stata procedure psmatch, which facilitates propensity score matching.
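
The matching step can be sketched in a few lines of Python. This is a greedy one-to-one nearest-neighbor match within a caliper, chosen purely for illustration; it is not claimed to be the algorithm psmatch uses, and the function name, the caliper value, and the scores are hypothetical. The propensity scores are assumed to have been estimated already, for example as predicted probabilities from a logistic regression.

```python
def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping unit id -> estimated propensity score.
    A treated unit is left unmatched if no remaining control lies within
    the caliper. Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]      # each control is used at most once
    return pairs


# Hypothetical scores.
pairs = match_nearest({"t1": 0.30, "t2": 0.70},
                      {"c1": 0.32, "c2": 0.69, "c3": 0.10})
```

After matching, covariate balance between the matched groups would normally be checked before estimating the treatment effect.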

Simulating a control pool to economize control recruitment in a matched case–control study

Rory Wolfe
Monash University, Melbourne

Control recruitment in case–control studies is problematic if no register for the study base exists. If random telephone contact is used and study base members comprise a relatively small proportion of the population, then control recruitment can be resource-intensive. In a matched case–control study with prospective case recruitment, the study base is accessed repeatedly for control recruitment, and in this context we propose a dynamic pool to economize control recruitment. A study base member who is contacted but does not match the current case is added to the pool. The pool is then searched for a match to each future case before resorting to random telephone contact again.

Using Stata, we simulate the operation of a control pool to quantify the possible economies under a range of likely scenarios. These simulation results are compared with early experience in the Farm Injury Risk among Males study, which found modest efficiency gains (4 controls recruited from the pool at a saving of approximately 90 telephone contacts per control).
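
A stripped-down version of such a simulation is sketched below in Python. It is not the authors' Stata code: it assumes one control per case, matching on a single categorical stratum, equally likely strata, and a fixed probability that a telephone contact reaches a study base member. The function and parameter names are hypothetical.

```python
import random


def simulate_control_pool(n_cases, n_strata, p_member, use_pool=True, seed=0):
    """Count telephone contacts needed to recruit one matched control per case."""
    rng = random.Random(seed)
    pool = []        # strata of contacted members who did not match at the time
    contacts = 0
    from_pool = 0
    for _ in range(n_cases):
        case_stratum = rng.randrange(n_strata)
        if use_pool and case_stratum in pool:
            pool.remove(case_stratum)    # reuse a previously contacted member
            from_pool += 1
            continue
        while True:                      # random telephone contact
            contacts += 1
            if rng.random() >= p_member:
                continue                 # not a study base member
            stratum = rng.randrange(n_strata)
            if stratum == case_stratum:
                break                    # matched control recruited
            if use_pool:
                pool.append(stratum)     # keep the non-match for future cases
    return {"contacts": contacts, "from_pool": from_pool,
            "from_calls": n_cases - from_pool}


with_pool = simulate_control_pool(200, 6, 0.2, use_pool=True, seed=1)
without_pool = simulate_control_pool(200, 6, 0.2, use_pool=False, seed=1)
```

Comparing the `contacts` totals of the two runs over many seeds and parameter settings gives a rough idea of the economies a pool might offer under these assumptions.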

Risk ratio estimation with the logistic model

Leigh Blizzard
Menzies Research Institute, University of Tasmania
David W. Hosmer
School of Public Health and Health Sciences, University of Massachusetts

The log-binomial model (the generalized linear model with binomial errors and log link) makes it possible to directly estimate the relative risk from cohort follow-up data, or the prevalence ratio from cross-sectional data, with adjustment for confounders. One of the problems with the use of this model is that the iterative estimation algorithm may fail to converge. Schouten et al. recognized this problem and proposed a clever solution to it. Their approach involves defining a dichotomous outcome variable (D) coded as D=1 for occurrence and D=0 for non-occurrence, and augmenting the original data by replicating the observations on subjects with the outcome (D=1) but with the outcome variable coded as D=0 in the second instance. (In the language of a case–control study, each case is included both as a case and as a control.) Schouten et al. show that a logistic regression model fitted to the expanded data set has the same parameters as the log-binomial model. They derive a consistent "information sandwich" estimator of the covariance matrix of the estimated coefficients that, with some data manipulation, can be obtained from the output of the logistic regression. The problem is that while a solution for the parameter vector can be obtained from nearly any set of data, it is not guaranteed to be admissible for the log-binomial model. We use Stata to demonstrate the method of Schouten et al., including the calculations required to obtain standard error estimates, and describe the frequency of inadmissible solutions in simulated data.
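
The reason the augmentation works can be seen in a single exposure group. With a cases among n subjects, the expanded data contain a observations with D=1 and n observations with D=0, so the odds in the expanded data equal the risk a/n in the original data. The Python sketch below checks this on toy data. The function names and counts are hypothetical, and the sketch covers only the data expansion, not the sandwich variance estimator.

```python
def augment_cases(rows, outcome="D"):
    """Return the rows plus a copy of every case (D=1) recoded as D=0."""
    extra = [dict(r, **{outcome: 0}) for r in rows if r[outcome] == 1]
    return list(rows) + extra


def risk(rows, outcome="D"):
    """Proportion of subjects with the outcome."""
    return sum(r[outcome] for r in rows) / len(rows)


def odds(rows, outcome="D"):
    """Odds of the outcome: cases divided by non-cases."""
    cases = sum(r[outcome] for r in rows)
    return cases / (len(rows) - cases)


# Toy cohort: 20 cases among 100 exposed, 10 cases among 100 unexposed.
exposed = [{"D": 1}] * 20 + [{"D": 0}] * 80
unexposed = [{"D": 1}] * 10 + [{"D": 0}] * 90

risk_ratio = risk(exposed) / risk(unexposed)
odds_ratio_expanded = odds(augment_cases(exposed)) / odds(augment_cases(unexposed))
```

In this toy cohort the odds ratio from the expanded data equals the risk ratio from the original data, which is the property that lets a logistic fit stand in for the log-binomial model.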


Schouten, E. G., J. M. Dekker, F. J. Kok, S. le Cessie, H. C. van Houwelingen, J. Pool, and J. P. Vandenbroucke. 1993. Risk ratio and rate ratio estimation in case-cohort designs: hypertension and cardiovascular mortality. Statistics in Medicine 12: 1733–1745.

Scientific organizers

John Carlin, University of Melbourne and Royal Children's Hospital

Adrian Esterman, Flinders University

Paul Hakendorf, Flinders University

Karl Keesman, Survey Design and Analysis Services Pty Ltd

Claire Rickards, SAPMEA

Malcolm Rosier, Survey Design and Analysis Services Pty Ltd

Philip Ryan, University of Adelaide

Steven Stillman, New Zealand Department of Labour

Logistics organizers

Survey Design and Analysis Services Pty Ltd, the official distributor of Stata in Australia and New Zealand, and the South Australian Postgraduate Medical Education Association Inc. (SAPMEA) Conventions.