2015 Oceania Stata Users Group meeting

24–25 September 2015


The Australian National University
Canberra ACT 0200


Text analysis using WordStat 7 within Stata

Normand Péladeau
Provalis Research
WordStat for Stata offers advanced text analytics features, allowing Stata 13 and 14 users to analyze text stored in both short- and long-string variables using numerous text-mining features, such as topic modeling, document clustering, automatic classification, and state-of-the-art dictionary-based content analysis. Extracted themes may then be related to structured data using various statistics and graphic displays. WordStat also offers a tool to create a Stata project from lists of documents (including .DOC, HTML, and PDF files) and to automatically extract numerical data, categorical data, and dates from them.

Treatment effects for survival-time outcomes: Theory and applications using Stata 14

Rebecca Pope

The potential-outcomes framework for estimating treatment effects from observational data treats the unobserved outcome as a missing data problem. When we extend this framework to the analysis of survival-time outcomes, we also allow for data that are missing because of censoring. This requires us to make additional assumptions and changes the properties of some of the estimators.

Beginning with a brief review of key concepts of survival-time data, I discuss potential outcomes in the context of survival analysis. I also explain some of the advantages of using treatment-effects analysis relative to traditional survival analysis. Alongside a brief overview of some of the estimators that are implemented in Stata 14, I demonstrate the application of survival treatment-effects analysis. Examples include analysis of single- and multivalued treatments and postestimation checking of model assumptions.

Additional information

Causal inference and treatment effect: An integrative framework for evaluation research

Bill Tyler
Charles Darwin University
The increased popularity of quasi-experimental designs with observational data in policy-oriented evaluation studies, while enriching the environment of Stata applications, has complicated the options available to health and other social science researchers. In cross-cultural policy-related research, the tensions between multilevel and counterfactual modeling present particular problems for satisfying evidential criteria for both efficacy and effectiveness within what is often viewed as a homogeneous field for educational and child development policy. This presentation offers a comparative framework for interrogating the options for extending propensity-score analysis and other counterfactual approaches to multilevel modeling. The utility of this framework is illustrated with issues arising from ongoing evaluation projects on Indigenous school-based interventions in remote community settings in Northern Australia.

Additional information

Applications of -margins- in social science

Philip Morrison
Victoria University of Wellington
Although introduced in Stata 11 and 12, margins and marginsplot are not as widely used in social science as they could be. This presentation advocates wider use of these tools. I introduce the basic ideas and illustrate their application to several different types of research questions from my own research. margins and associated commands greatly expand our ability to assess the effects (associations) of the (usually categorical) attributes of respondents on outcomes of policy interest. I focus on the additional insights gained, especially when margins is combined with marginsplot and user-written graphical displays such as coefplot.

Additional information

Pneumonia prevention using topical antibiotics in the intensive care unit (ICU): Another variation on control group variability

James Hurley
Ballarat Health Services

There are over 200 published studies of methods to prevent infections acquired in the intensive care unit (ICU) such as pneumonia and bacteremia. The application of combinations of various antibiotics topically to the upper airway appears to be the most effective method (over 40 studies). Surprisingly, within these studies of topical antibiotics as the prevention method, the incidence of pneumonia and bacteremia among the control groups is as much as double that among control groups within studies of methods other than topical antibiotics. Why?

Graphics such as those obtained with metandi for meta-analysis of diagnostic tests offer a "novel" approach to modeling the relationship between control group rate and intervention effect size within controlled trials. Stata offers a broad range of commands to study statistical relationships, but an outstanding feature is the range of graphical commands available that enable the data to be "eyeballed". In this presentation, I will demonstrate, using graphs produced by metan, metandi, funnelcompar, ellip, and good old twoway (scatter), that the relationship between control group incidence and effect size in this context is not simple. Is it cause and effect or the other way around?

Hurley, J.C. 2014. Topical antibiotics as a major contextual hazard toward bacteremia within selective digestive decontamination studies: A meta-analysis. BMC Infect Dis 14:714. http://www.biomedcentral.com/1471-2334/14/714/.

Additional information

Structural equation models with a binary outcome using Stata and Mplus

Richard J. Woodman
Flinders University
Xiuqin Hong
Central South University
Shuiyuan Xiao
Central South University
Arduino A. Mangoni
Flinders University

Structural equation modeling (SEM) is a powerful technique for examining complex relational structures and potential causal pathways. Although many software packages, including AMOS, Stata, Mplus, LISREL, and R, provide routines for SEM with continuous outcomes, not all are capable of handling categorical data. In addition, there are differences between packages in the availability of desirable SEM features, including model fit indices, tests of group invariance, direct- and indirect-effect estimates, modification indices, and estimation approaches. Mplus is widely used in the social sciences and is considered by many to be the gold-standard software for SEM. Stata introduced SEM in version 12 and implemented SEM for categorical outcomes in version 13.

This presentation will describe and compare the available estimation options of Stata and Mplus for SEM using a clinical dataset that includes the binary outcome of coronary artery disease (CAD). We used cross-sectional data on 242 individuals with CAD and 218 individuals without CAD to examine the potential causal pathways and direct and indirect effects of homocysteine on CAD. Data were available for systolic blood pressure, triglycerides, and cholesterol subfractions. Body mass index, blood urea nitrogen, C-reactive protein, and uric acid were used as markers of insulin sensitivity, renal function, inflammation, and oxidative stress, respectively. In addition to discussing the available estimation features of the two packages, this presentation compares their respective syntaxes and path diagramming features.

Additional information

An assessment of current software: Parameter estimate accuracy for Generalized Linear Mixed Models with binary outcome data

Tyman Stanford
The University of Adelaide
Generalized linear mixed models (GLMMs) are a widely used class of models that assume the expected value of an outcome variable is determined by a linear combination of predictor variables, via an invertible link function, with both fixed and random model coefficients. Estimation of the model coefficients has improved with increased computational power; the current gold standard to estimate GLMM coefficients requires adaptive Gauss-Hermite quadrature approximation of the profiled likelihood function, usually a multidimensional integral, to obtain (approximate) maximum likelihood solutions. The performance of widely used software packages in estimating fixed and random coefficients with a Bernoulli outcome variable is the focus of this work. The packages surveyed, many with multiple routines available to perform GLMM parameter estimation, are Stata, R, SAS, ADMB, SPSS, and Matlab. The GLMM routines in these packages are applied to multiple simulated datasets with known parameters to determine the accuracy of parameter estimates of both fixed effects and the variance components. The effect of increasing the number of adaptive Gauss-Hermite quadrature integral approximation points on the bias and precision of the estimates, as well as the effect on model selection using AIC, will be presented. The computational time taken to generate model parameter estimates using simulated data is also presented, an additional consideration in practice.
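
As a rough illustration of the quadrature idea mentioned in this abstract, the following Python sketch integrates over a normally distributed random intercept with a fixed 3-point Gauss-Hermite rule. It is a minimal sketch, not the routine used by any of the surveyed packages: it uses the plain (nonadaptive) rule, and the function names `normal_expectation` and `marginal_success_probability` are hypothetical.

```python
import math

# Nodes and weights of the 3-point Gauss-Hermite rule for the weight exp(-x^2).
GH3_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH3_WEIGHTS = (
    math.sqrt(math.pi) / 6,
    2 * math.sqrt(math.pi) / 3,
    math.sqrt(math.pi) / 6,
)


def normal_expectation(f, sigma):
    """Approximate E[f(u)] for u ~ N(0, sigma^2) with the 3-point rule."""
    # The change of variables u = sqrt(2) * sigma * x maps the normal density
    # onto the exp(-x^2) weight, leaving a factor of 1 / sqrt(pi).
    total = sum(
        w * f(math.sqrt(2) * sigma * x)
        for x, w in zip(GH3_NODES, GH3_WEIGHTS)
    )
    return total / math.sqrt(math.pi)


def marginal_success_probability(eta, sigma):
    """Marginal P(y = 1) in a random-intercept logit model.

    eta is the fixed-effects linear predictor (an illustrative input);
    sigma is the standard deviation of the random intercept.
    """
    return normal_expectation(lambda u: 1 / (1 + math.exp(-(eta + u))), sigma)
```

More quadrature points make the approximation of the marginal probability more accurate at higher computational cost, which is the trade-off the abstract proposes to measure. Adaptive quadrature additionally recenters and rescales the nodes for each cluster, which this sketch does not attempt.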

Additional information

Model comparison for analysis of population surveillance data

Rosie Meng
Flinders University
Richard Woodman
Flinders University
Stephen R. Cole
Flinders University
Erin Symonds
Repatriation General Hospital

In this presentation, I evaluate the relative merits of different approaches to the analysis of population-level bowel cancer surveillance data using available Stata routines. The focus is on selecting models to suit the research questions and the ease of interpretation.

Outcomes of colonoscopies for colorectal cancer surveillance were obtained from the South Australian Southern Cooperative Program for the Prevention of Colorectal Cancer (SCOOP). The research questions asked whether patient and adenoma characteristics were associated with the degree of neoplasia advancement at the next surveillance colonoscopy. Among 379 patients with a diagnosis of low- or high-risk adenoma at index colonoscopy, the first surveillance colonoscopy was performed between 06-Dec-2001 and 21-Dec-2010. Five regression models were constructed: 1) Cox cause-specific model (stcox); 2) Cox model with stratification; 3) parametric survival model (streg); 4) competing-risks survival model (stcrreg); and 5) multinomial logistic regression (mlogit). The four survival models generally agreed well and were also consistent with Kaplan-Meier curves, but results from mlogit differed significantly from the rest.

Survival analysis is preferred for surveillance data especially when follow-up time varies considerably between individuals. A cause-specific Cox model may be preferred over a competing-risks model to ease result interpretation.

Additional information

rdecompose: Outcome decomposition for aggregate data

JinJing Li
University of Canberra
Yohannes Kinfu
University of Canberra
Social, behavioral, and health scientists frequently apply methods for decomposing changes or differences in outcome variables into components of change. A number of Stata commands, such as those based on the Blinder-Oaxaca approach, have been developed over the years to facilitate this exercise using unit-level data. However, despite the abundance of aggregate data and wide use of corresponding aggregate data decomposition techniques, there are no comparable user-developed Stata commands for decomposing changes or differences using aggregate-level data. In this presentation, we introduce a new Stata command for aggregate data decomposition, based on Gupta's reformulation, and demonstrate applications from a wide range of settings that include demography, epidemiology, and health economics. Our command also extends existing approaches to allow any number of factors and various functional relationships that are not available on any other platform.
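
To make the idea of aggregate-level decomposition concrete, here is a minimal Python sketch of the simplest case: an outcome that is the product of two factors, compared between two populations. It assumes that the reformulation referred to above is the standardization-and-decomposition approach usually attributed to Das Gupta, in which each factor's effect is its change weighted by the average of the other factor. The function name `decompose_two_factor` is hypothetical, and this is not the rdecompose command itself.

```python
def decompose_two_factor(a1, b1, a2, b2):
    """Split the change in R = a * b between populations 1 and 2.

    Returns (effect of factor a, effect of factor b). Each effect is the
    change in one factor weighted by the average level of the other, so
    the two effects add up exactly to a2 * b2 - a1 * b1.
    """
    effect_a = (a2 - a1) * (b1 + b2) / 2
    effect_b = (b2 - b1) * (a1 + a2) / 2
    return effect_a, effect_b
```

For example, with a1 = 0.2, b1 = 0.5, a2 = 0.3, and b2 = 0.4, the total change of 0.02 splits into 0.045 attributed to the first factor and -0.025 attributed to the second.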

Additional information

Bayesian analysis using Stata

Bill Rising
Bayesian analysis made its official Stata debut with the release of Stata 14. In this presentation, we will explore some simple applications to demonstrate the basics of Stata's user interface and suite of commands for Bayesian analysis.

Additional information

A practical introduction to Stata 14 item response theory (IRT)

Malcolm Rosier

Stata 14 includes a module on item response theory (IRT). I discuss basic characteristics of measurement in the social sciences, show how traditional measurement techniques and IRT are related, and discuss merits, constraints, and uses of IRT.

The IRT procedure produces a calibrated scale of the underlying (latent) dimension at the interval level of measurement. The same scale is used to obtain a measure of the difficulty of each item and of the ability of each person. I illustrate the one-parameter and two-parameter logistic models by analyzing a mathematics achievement test with dichotomous responses, scored correct or incorrect. I then introduce the IRT procedures for ordered categorical data and apply the rating scale model (RSM) and the graded response model (GRM) to attitude scale data.
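
The logistic models for dichotomous items can be summarized by a single response function. The Python sketch below is illustrative only: the function name `prob_correct` and its parameter names are assumptions, not part of Stata's irt commands. It gives the probability of a correct answer for a person with ability `theta` on an item with a given difficulty and discrimination.

```python
import math


def prob_correct(theta, difficulty, discrimination=1.0):
    """Probability of a correct response under a logistic IRT model.

    With one discrimination value shared by all items this is the
    one-parameter model; letting discrimination vary by item gives the
    two-parameter model. Ability and difficulty are on the same scale.
    """
    return 1 / (1 + math.exp(-discrimination * (theta - difficulty)))
```

When ability equals difficulty, the probability is exactly one half, which is the sense in which persons and items are placed on the same scale.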

Additional information

Identifying biomarkers in epidemiological studies using a fusion of data mining and traditional statistical techniques in Stata

Jo Dipnall
Deakin University
Julie A. Pasco
The University of Melbourne
Michael Berk
The University of Melbourne
Lana J. Williams
Deakin University
Seetal Dodd
Orygen Youth Health Research Centre
Felice N. Jacka
Murdoch Children's Research Institute
Denny Meyer
Swinburne University of Technology

Epidemiological studies generally incorporate vast numbers of variables. There are a multitude of techniques for variable selection in data mining, machine learning, and traditional statistics with varying accuracy. The aim of this study was to incorporate these techniques in Stata to identify key biomarkers, from a large number measured, and explore their associations with depression.

Data from the National Health and Nutrition Examination Survey (2009-2010) were utilized (n=5,227, mean age=43 yr). Depressive symptoms were measured using the Patient Health Questionnaire-9. Blood and urine samples were taken, and large numbers of biomarkers measured (n=67). Anthropometric measurements, demographics, and medications were determined. Lifestyle and health conditions were obtained via a questionnaire. A four-step analysis process was performed incorporating multiple imputation, a Stata boosted regression plugin, and traditional statistical techniques. Covariates included sex, age, race, smoking, food security, PIR, BMI, diabetes, inactivity, and medications. The final model controlled for confounders and effect moderators. All analysis was managed within Stata's project and macro do environment.

Out of a possible 67 biomarkers, 4 were identified as being associated with depressive symptoms. Implementing this research's complex analysis strategy entirely from within Stata eliminated cross platform errors and ensured easy replication of the results.

The Hjort-Hosmer goodness-of-fit statistic for binary regression

Steve Quinn
Flinders University
D.W. Hosmer
University of Massachusetts, Amherst
The statistic most commonly used to evaluate the adequacy of a logistic regression model is the Hosmer-Lemeshow statistic. Hosmer and Lemeshow proposed a goodness-of-fit test based on partitioning the fitted probabilities into a number of groups and comparing observed events with expected events within each group. They showed via simulations that the resulting statistic follows a chi-squared distribution with degrees of freedom approximately equal to the number of groups minus two. The Hjort-Hosmer statistic also assesses model adequacy and is based on partial sums of residuals that are sorted by their corresponding fitted values. The basic idea is that if a model is correctly fitted, then the partial sums should vary randomly about zero, and better model fit should correspond to smaller maximal partial sums. In this presentation, the Hosmer-Lemeshow and Hjort-Hosmer statistics are compared in binary regression models with different links, and we describe hjorthos, which calculates the Hjort-Hosmer statistic.
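
The two ideas described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the hjorthos command: `hosmer_lemeshow` uses equal-sized groups of sorted fitted probabilities, and `max_abs_partial_sum` returns only the unscaled largest absolute partial sum of residuals. The scaling and reference distribution of the published Hjort-Hosmer statistic are not reproduced here. Both function names are hypothetical.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-squared statistic.

    y holds 0/1 outcomes and p the fitted probabilities. Observations are
    sorted by p and split into `groups` nearly equal-sized groups. A group
    whose mean fitted probability is exactly 0 or 1 would cause a division
    by zero; this sketch does not guard against that case.
    """
    order = sorted(range(len(p)), key=lambda i: p[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue  # fewer observations than groups
        observed = sum(y[i] for i in idx)
        expected = sum(p[i] for i in idx)
        mean_p = expected / len(idx)
        stat += (observed - expected) ** 2 / (len(idx) * mean_p * (1 - mean_p))
    return stat


def max_abs_partial_sum(y, p):
    """Largest absolute partial sum of residuals y - p, sorted by p."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    running = 0.0
    largest = 0.0
    for i in order:
        running += y[i] - p[i]
        largest = max(largest, abs(running))
    return largest
```

The Hosmer-Lemeshow value would be compared with a chi-squared distribution with approximately `groups - 2` degrees of freedom, as the abstract notes. The partial-sum quantity is small when residuals fluctuate around zero along the range of fitted values.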

Additional information

xtcluster: A partially heterogeneous framework for short panel-data models

Demetris Christodoulou
Vasilis Sarafidis
Monash University
xtcluster implements the partially heterogeneous framework proposed by Sarafidis and Weber (2015). The algorithm classifies individuals into panel-data regression clusters, such that within each cluster, the slope coefficients are homogeneous, and intracluster heterogeneity is attributed to the presence of individual- and time-specific effects. The slope coefficients differ across clusters. The optimal number of clusters and the associated optimal partition are determined using a model information criterion that is consistent for T fixed as N grows large. The proposed method relies on the data to suggest any clustering structure that might exist. Hence, it can be particularly useful when there is no a priori information about a potential clustering structure, or when one is interested in examining how far a structure that might be meaningful according to some economic measure lies from the structure that is optimal from a statistical point of view.

Additional information

Statdoc: Document and explore

Markus Schaffner
Queensland University of Technology

Statdoc is a small utility program written in Java that automatically documents data analysis projects. It is modeled after similar tools used in software development and as such supports good coding standards. The program can run stand-alone or from within Stata and produces a set of static HTML files that reveal information about the files in a given folder structure.

Statdoc automatically discovers as much information as possible about the data, variables, script files, and output files that it can identify and highlights the links between them. It features an enhanced documenting comment type, which allows it to record supporting meta-information. In this way, it allows the user to organize projects with ease and helps to uncover information about other people's projects. The utility is aimed at real-world research projects where a multitude of data sources, script files, and outputs are not uncommon. Because the documentation is produced as static HTML files, it also facilitates sharing the complete information about a project on the web, helping efforts to make the data analysis process more transparent. Statdoc is available as an open source project on GitHub (for more information and examples, see https://github.com/mas802/statdoc).

Additional information

Using interrupted time-series analysis to examine the effectiveness of the comprehensive stroke unit model

Susan Kim
Flinders University
Daniel Verma
Flinders Medical Centre
Chris Horwood
South Australia Department of Health
Paul Hakendorf
Flinders Medical Centre
Andrew Lee
Flinders University

Stroke care on the comprehensive stroke unit (CSU) is the gold standard. Care for stroke patients often involves neurologists as well as other physicians with stroke care expertise and training, that is, stroke physicians. The aim of this study is to examine whether the CSU results in better outcomes irrespective of the physician.

Patients' data from a single center with ischemic stroke admitted between 2000 and 2014 were analyzed. Three system changes were made during this time: (1) patients were initially seen by a neurologist and transferred to a stroke physician from 2004 onward; (2) advent of a stroke-trained neurologist in 2007; and (3) a CSU model with care by a single stroke physician led by a stroke director from 2010 onward. Interrupted time-series analysis was used to model the changes in patients' outcomes and complication rates over time using monthly aggregated data.

The percentage of patients discharged to rehabilitation facilities changed significantly after each implementation (p<0.01), and significantly fewer patients developed aspiration pneumonia after 2010 (p=0.045). More patients were sent to rehabilitation facilities and fewer had complications with the CSU model, so better outcomes can be achieved via the CSU model of care even when staffed by nonneurologist stroke physicians.

Additional information

Count model selection and postestimation to evaluate composite flour technology adoption in Senegal (West Africa)

Kodjo Kondo
University of New England
This presentation examines Stata estimation and postestimation analyses in identifying determinants of the probability and extent of adoption of composite flour technology in bread baking in the Dakar region of Senegal (West Africa). The technology is promoted to limit dependency on imported wheat. A hurdle regression model is estimated using socioeconomic and production data collected from 150 bakers in 2014. The hurdle model, which was preferred over the negative binomial and the zero-inflated negative binomial models, allows us to disentangle factors affecting the adoption decisions from those influencing the quantities used. Findings indicate that the ownership of a 50 kg mixer, training programs on composite flour production, and the number of bakeries owned positively affect adoption decisions, while the quantity decisions are influenced by membership in the baker federation and the expected output. The wheat and millet flour price ratio positively affects both decisions. These results imply that efforts to increase the adoption rate and its extent should promote the 50 kg mixers, intensify the professional training on composite flour production, institutionalize the use of composite flour, and contribute to making local flour cheaper than wheat flour by intensifying local cereal production.
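
The two-part structure of a hurdle model, which is what allows the adoption decision to be separated from the quantity decision, can be sketched as follows. This is a minimal Python illustration that assumes a zero-truncated Poisson for the positive counts. The abstract does not state which count distribution was used, and the function and parameter names are hypothetical.

```python
import math


def hurdle_poisson_pmf(k, p_adopt, lam):
    """P(Y = k) under a hurdle model with a zero-truncated Poisson part.

    p_adopt is the probability of crossing the hurdle (adopting at all);
    lam is the Poisson rate governing the positive counts (assumed here).
    """
    if k == 0:
        return 1 - p_adopt
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    # Rescale so that the positive counts carry total probability p_adopt.
    return p_adopt * poisson / (1 - math.exp(-lam))
```

Because zeros arise only from the first part, covariates can enter the adoption probability and the count rate separately, which matches the distinction the abstract draws between adoption decisions and quantities used.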

Additional information

Wishes and grumbles

Bill Rising & Rebecca Pope
StataCorp will be happy to receive wishes for developments in Stata and almost as happy to receive grumbles about the software.

Scientific organizers

Demetris Christodoulou (chair), University of Sydney

Yohannes Kinfu, University of Canberra, CeRAPH

Ghada Gleeson, The Australian National University, ACERH

JinJing Li, University of Canberra

Con Menictas, University of Newcastle

Logistics organizers

Survey Design and Analysis Services Pty Ltd, the official distributor of Stata in Australia and New Zealand.