Stata as a numerical tool for scientific thought experiments: A tutorial with worked examples
Department of Public Health–Department of Biostatistics, Aarhus University
Thought experiments based on simulation can be used to explain to fellow
researchers the impact of the chosen study design or statistical analysis
strategy, or the sensitivity of results. In this talk, I will present two examples
showing how quantitative thought experiments may be implemented in Stata. The
first example uses a large-sample approach to study the impact on the estimated
effect size of dichotomizing an exposure variable at different values. The
second example uses simulations of realistic-size datasets to illustrate the
necessity of using sampling fractions as inverse probability weights in the
statistical analysis for protection against bias in a complex sampling design.
I will also briefly outline the general steps needed for implementing
quantitative thought experiments in Stata. The main purpose is to highlight
that Stata provides programming facilities for conveniently implementing such
thought experiments, and exploiting those may save researchers precious time,
futile speculation, and disruptive debates and thus improve communication in
interdisciplinary research groups.
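The first example might be sketched in a do-file along the following lines. This is a minimal sketch under stated assumptions, not the presenter's own code: the sample size, the linear data-generating model with a slope of 0.5, the cut points, and the variable names are all illustrative choices.

```stata
* Large-sample thought experiment: how does the cut point used to
* dichotomize a continuous exposure change the estimated effect?
* All names and parameter values below are illustrative assumptions.
clear
set seed 20140101
set obs 1000000                         // large sample, so sampling noise is small
generate double x = rnormal()           // continuous exposure
generate double y = 0.5*x + rnormal()   // outcome; assumed true slope is 0.5

foreach cut of numlist -1 0 1 {
    generate byte xhigh = x > `cut'     // dichotomized exposure (x has no missing values)
    quietly regress y xhigh
    display "cut point = " %4.1f `cut' "   mean difference = " %6.3f _b[xhigh]
    drop xhigh
}
```

Under these assumptions, the estimated mean difference between the "high" and "low" groups changes with the cut point even though the underlying slope is held fixed, which is the kind of design effect such a thought experiment is meant to expose.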
Studying coincidences with network analysis and other statistical tools
Department of Sociology and Communication, Universidad de Salamanca
The aim of this paper is to introduce a new framework to study data structures
that is based on a combination of statistical and social network analysis and
that is called coincidence analysis. The purpose of this procedure is to
ascertain the most frequent events in a given set of scenarios and to study the
relationships between them. In accordance with this procedure, the concurrence
of persons, objects, attributes, characteristics, or events within the same
temporally or spatially delineated set can be classified in the following ways:
(a) as simple, if both occur at least once in the same set;
(b) as likely, if there is more than a single coincidence and if it is more
probable than a concurrence produced by mere chance; and
(c) as statistically probable.
In cases where samples of events are the subject of analysis, a confidence
interval should be established to determine the statistical meaning of the
combination of events.
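The comparison between observed coincidences and those expected by mere chance can be illustrated with built-in Stata commands. The sketch below is not the coincidence analysis program described in the abstract; the ten scenarios and the indicator variables a and b are hypothetical, and independence of the two events is assumed as the chance benchmark.

```stata
* Two events, a and b, recorded as 0/1 indicators across ten scenarios.
* The data are hypothetical and serve only to illustrate the comparison.
clear
input a b
1 1
1 1
1 0
0 1
1 1
0 0
0 0
1 1
0 0
0 1
end

quietly count
scalar n_total = r(N)
quietly count if a == 1 & b == 1
scalar n_ab = r(N)                      // observed coincidences of a and b
quietly summarize a
scalar p_a = r(mean)                    // share of scenarios in which a occurs
quietly summarize b
scalar p_b = r(mean)                    // share of scenarios in which b occurs
scalar n_expected = n_total * p_a * p_b // coincidences expected under independence

display "observed coincidences      = " n_ab
display "expected under independence = " n_expected
tabulate a b, exact                     // Fisher's exact test of association
```

In these made-up data, a and b coincide in 4 scenarios, against 3 expected under independence, so the coincidence is more frequent than chance alone would suggest; whether the difference is statistically meaningful is a separate question that the exact test addresses.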
This mode of analysis can be applied to the exploratory analysis of
questionnaires, the study of textual networks, the review of the content of
databases, and the comparison of different statistical analyses of
interdependence. The following techniques can be used for analyzing the same
data: multidimensional scaling, principal component analysis, correspondence
analysis, biplot representations, agglomeration techniques, and network analysis.
The statistical bases of this analysis are described, as is the Stata program
that performs the analyses. As an example of its use, the photograph albums of
the following people who were famous in the early twentieth century are
analyzed: Miguel de Unamuno (1864–1936), Rafael Masó (1880–1935), Joaquín
Turina (1882–1949), and Antonia Mercé (1890–1936), stage name La Argentina.
Social network analysis using Stata
Thomas Grund and Peter Hedström
Institute of Analytical Sociology, Linköping University
Social network analyses investigate the relationships (arcs/edges) between
individuals or organizations, such as friendship, advice, or trust. In contrast
to many other statistical approaches, one models the interdependencies between
entities explicitly. Such a perspective allows the visualization and study of
structural features of networks, such as the centrality of network nodes.
This talk introduces the nwcommands, a software suite of over 40 Stata commands
for social network analyses in Stata. The software
includes programs for importing and exporting, loading and saving, handling,
manipulating and replacing, generating, and visualizing and animating networks.
It also includes commands for measuring the importance of network nodes, the
detection of network patterns and features, the similarity of multiple networks,
node attributes, and the advanced statistical analysis of networks. This
presentation gives several examples using these programs, provides instructions
for the installation, use, and support of the software, and introduces a
platform for developers of additional programs to perform social network
analyses using Stata.
Floating point numbers: A visit through the looking glass
Researchers do not adequately appreciate that floating-point (FP) numbers are a
simulation of real numbers and that, as with all simulations, some features are
preserved and others are not. Writing code, or even do-files, and treating the
computer's floating-point numbers as if they were real numbers can lead to
substantive problems and to numerical inaccuracy. In this, the relationship
between computers and real numbers is not entirely unlike the relationship
between tea and Douglas Adams's Nutrimatic drink dispenser. The Nutrimatic
produces a concoction that is "almost, but not quite, entirely unlike tea".
In this presentation, I will show what the universe would be like if it were
implemented in FP rather than real numbers. The FP universe turns out to be
nothing like the real universe and probably could not be made to function. The
point of the talk is to build your intuition about the floating-point world so
that you as a researcher can predict when calculations might go awry, know how
to think about the problem, and determine how to fix it.
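A few lines in a do-file illustrate the kind of surprise the talk refers to. This is a minimal sketch, not taken from the presentation; the variable name x is arbitrary, and the example assumes Stata's default storage type (float) has not been changed with set type.

```stata
* Decimal fractions such as 0.1 have no exact binary representation.
display 0.1 + 0.2 == 0.3             // displays 0 (false) in double precision
display %21.18f 0.1                  // the stored value is close to, but not exactly, 0.1
display reldif(0.1 + 0.2, 0.3) < 1e-12   // displays 1: comparing with a tolerance works

* Mixing float storage with double-precision literals causes silent mismatches.
clear
set obs 1
generate x = 0.1                     // stored as float unless the default type was changed
count if x == 0.1                    // counts 0: float 0.1 differs from double 0.1
count if x == float(0.1)             // counts 1: both sides rounded to float precision
```

The usual remedies are to compare with a tolerance, to round both sides to the same precision with float(), or to store the variable as double.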
Tweaking -khb- to control for post-treatment confounders in mediation analysis
Department of Sociology, University of Copenhagen
Mediation analyses and their ensuing effect decompositions are widespread in
the social sciences. For example, in stratification research, researchers may
be interested in gauging the extent to which the black-white gap in earnings
can be attributed to the unequal distribution of schooling among the races.
However, methodological research shows that such mediation analyses often fail
to control for the potential endogeneity of the mediator. In the example,
academic ability may be a confounder of the education-earnings association. Yet
controlling for such confounders to eliminate the endogeneity bias of the
mediator is not as straightforward as it may appear. Whenever these control
variables are a function of the predictor variable of interest (race in the
example), standard regression methods for the calculation of direct and
indirect effects no longer apply. Put differently, standard methods cannot
control for post-treatment confounders.
In this presentation, I show how to tweak the Stata command khb (implementing
the decomposition method developed by Karlson, Holm, and Breen [2012,
Sociological Methodology 42:274-301]) to control for these confounders in the
estimation of direct and indirect effects in regression models using logit.
Under the assumption of linearity, I exploit the residualization or
orthogonalization approach that underlies khb to derive the bias of omitted
post-treatment confounders, and I show how to control for them by tweaking the
use of khb. I also discuss
how to obtain standard errors of the effects. To illustrate the approach, I
give an example of the role of education in social mobility.
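The residualization idea behind the decomposition can be sketched with built-in commands only. The sketch below shows the basic residualization step on simulated data; it is not the tweak for post-treatment confounders presented in the talk, it does not call the user-written khb command, and the variable names and coefficients are assumptions made for illustration.

```stata
* Simulated data: x is the key predictor, z the mediator, y a binary outcome.
clear
set seed 2012
set obs 5000
generate x = rnormal()
generate z = 0.6*x + rnormal()                      // mediator depends on x
generate y = runiform() < invlogit(0.5*x + 0.8*z)   // binary outcome

* Residualize the mediator: keep only the part of z that is unrelated to x.
quietly regress z x
predict double zres, residuals

* The "reduced" model includes the residualized mediator, so its coefficients
* are on the same scale as those of the full model.
quietly logit y x zres
scalar b_total = _b[x]
quietly logit y x z
scalar b_direct = _b[x]

display "total effect    = " %6.3f b_total
display "direct effect   = " %6.3f b_direct
display "indirect effect = " %6.3f b_total - b_direct
```

Because the two logit models fit the data equally well, the difference between the two coefficients on x reflects mediation through z rather than the rescaling that affects naive comparisons of nested logit models.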
Working sideways in Stata
Department of Cardiology, Aarhus University Hospital
Conceptually, Stata is commendably simple, dealing with only one rectangular
data-grid at a time (variables column-wise and observations row-wise). Within
this simple concept, statistics are (usually) operations performed on the
vertical axis, that is, column-wise, for example, when obtaining the mean age
of a number of subjects/observations. Data management (besides loading,
appending, and merging data, etc.) is the discipline of preparing the rectangular
data-grid for the statistics, for example, by creating derived variables, that is,
working row-wise (or sideways) in the data-grid. Mainly, derived variables are
recodings or simple calculations based on existing variables, all nicely
supported by easy-to-use, built-in, standalone Stata commands and functions.
Sometimes, however, when a mix of conditions and calculations is required in
the creation of derived variables, things tend to get slightly more complicated
and may require customized “loops” to traverse and handle selected
variables individually row-wise. Various aspects of working sideways in the
Stata data-grid will be presented and discussed with a strict focus on
transparent, safe and robust data-handling.
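A customized loop of this kind might look as follows. This is a minimal sketch with hypothetical data: the variables sbp1 to sbp3 (repeated blood pressure readings) and the threshold of 140 are assumptions made for illustration.

```stata
* Hypothetical data: up to three blood pressure readings per subject.
clear
input id sbp1 sbp2 sbp3
1 120 145 150
2 138   . 142
3   .   .   .
end

* Work sideways: for each row, count the readings and those above 140.
generate byte n_meas = 0
generate byte n_high = 0
foreach v of varlist sbp1-sbp3 {
    replace n_meas = n_meas + !missing(`v')
    * Missing values compare as larger than any number, so exclude them explicitly.
    replace n_high = n_high + (`v' > 140 & !missing(`v'))
}
replace n_high = . if n_meas == 0      // no readings at all: result is unknown, not zero
list
```

Subject 1 ends up with two high readings out of three, subject 2 with one out of two, and subject 3 with a missing count. Without the explicit !missing() guard, every missing reading would be counted as high, which is the kind of silent error that careful row-wise handling is meant to prevent.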
A short story about Danish register research and Statalist
Department of Public Health, Aarhus University
A PhD student is studying health problems among children born to mothers with
type 1 diabetes. In a clinical database, the student identified 1,300 such
children (index children), and Statistics Denmark delivered information
concerning 100 control children per index child, matched by gender and date of
birth. Health outcomes are mortality, hospital admissions (by diagnosis), and
medications (by ATC groups).
We used a mixed-effects negative binomial regression (Stata's menbreg command)
to analyze hospital admissions. menbreg is computationally intensive, and we
wanted some 200 analyses (5 age groups, 20 diagnostic
groups, etc.). Some analyses would take several hours. I tried to find out if
there was a way to automatically stop an analysis that took too long and
proceed with the next analysis. Some of the SUG participants will know how to
do that, but I didn't know at the time.
I sent the question to Statalist, and within five minutes, I had two good
answers: Use the iterate() option. See help maximize. It works,
and the analyses are proceeding.
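A loop over analyses that caps the number of iterations might be sketched as follows. The data are simulated and the variable names (id, agegrp, x, y), the three subgroups, and the limit of 20 iterations are assumptions for illustration; the register data and the actual model specification are not reproduced here.

```stata
* Simulated clustered count data (hypothetical, for illustration only).
clear
set seed 1300
set obs 200
generate id = _n
generate u = rnormal(0, 0.5)                // cluster-level random intercept
expand 5                                    // five observations per cluster
generate x = rnormal()
generate byte agegrp = 1 + mod(id, 3)       // three hypothetical subgroups
generate y = rpoisson(exp(0.2*x + u) * rgamma(2, 0.5))   // overdispersed counts

* Fit one model per subgroup, stopping each fit after at most 20 iterations.
forvalues g = 1/3 {
    capture noisily menbreg y x if agegrp == `g' || id:, iterate(20)
    if _rc == 0 & e(converged) == 1 {
        display "subgroup `g': converged"
    }
    else {
        display "subgroup `g': not converged within 20 iterations (or failed), moving on"
    }
}
```

The capture prefix keeps the loop running if a fit exits with an error, and e(converged) distinguishes fits that finished from fits that were cut off by iterate(), so unfinished analyses can be flagged and rerun later with a higher limit.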
Reproducible research in Stata
Writing a document that contains statistical results in its narrative,
including inline results, can take too much effort. Typically, users have a
separate series of do-files whose results must then be pulled into the
document. This is a very high-maintenance way to work because updates to
the data, changes to the do-files, updates to the statistical software, and,
especially, updates to inline results all require work and careful checking of results.
Reproducible research greatly lessens document-maintenance chores by putting
code and results directly into the document; this means that only one document
is used; thus it remains consistent and is easily maintained.
In this presentation, I will show you how to put Stata code directly into a
LaTeX or HTML document and run it through a preprocessor to create the document
containing results. While this is useful for creating self-contained documents,
it is especially useful for creating periodic reports, class notes, solution sets,
and other documents that get used over a long period of time.