Fourth German Stata Users Group meeting: Abstracts
Friday, 31 March 2006
Resultssets, resultsspreadsheets, and resultsplots in Stata
Roger Newson
Imperial College London
Abstract
Most Stata users make their living producing results in a form
accessible to end users. Most of these end users cannot immediately
understand Stata logs. However, they can understand tables (in paper,
PDF, HTML, spreadsheet, or word processor documents) and plots (produced
by using Stata or non-Stata software). Tables are produced by Stata as
resultsspreadsheets, and plots are produced by Stata as resultsplots.
Sometimes (but not always), resultsspreadsheets and resultsplots are
produced using resultssets. Resultssets, resultsspreadsheets, and
resultsplots are all produced, directly or indirectly, as output by
Stata commands. A resultsset is a Stata dataset, which is a table whose
rows are Stata observations and whose columns are Stata variables. A
resultsspreadsheet is a table in generic text format, conforming to a
TeX or HTML convention, or to another convention with a column separator
string and possibly left and right row delimiter strings. A resultsplot
is a plot produced as output, using a resultsset or a resultsspreadsheet
as input. Resultsset-producing programs include statsby,
parmby, parmest, collapse, contract,
xcollapse, and xcontract.
Resultsspreadsheet-producing programs include outsheet,
listtex, estout, and estimates table.
Resultsplot-producing programs include eclplot and
mileplot. There are two main approaches (or dogmas) for
generating resultsspreadsheets and resultsplots. The
resultsset-centered dogma is followed by parmest and
parmby users and states: “Datasets make resultssets,
which make resultsplots and resultsspreadsheets”. The
resultsspreadsheet-centered dogma is followed by estout and
estimates table users and states: “Datasets make
resultsspreadsheets, which make resultssets, which make
resultsplots”. The two dogmas are complementary, and each dogma
has its advantages and disadvantages. The resultsspreadsheet dogma is
much easier for the casual user to learn to apply in a hurry and is
therefore probably preferred by most users most of the time. The
resultsset dogma is more difficult for most users to learn but is more
convenient for users who wish to program everything in do-files, with
little or no manual cutting and pasting.
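For readers unfamiliar with this workflow, the resultsset-centered dogma might
look like the following minimal sketch (the dataset and model are purely
illustrative; the variable names follow the parmest and eclplot defaults):

    sysuse auto, clear
    regress mpg weight foreign
    parmest, saving(results.dta, replace)  // one observation per parameter
    use results.dta, clear
    encode parm, generate(parmid)          // eclplot needs a numeric axis variable
    eclplot estimate min95 max95 parmid    // plot estimates with confidence limits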
Additional information
newson_ohp1.pdf
Intervention evaluation using gllamm
Andrew Pickles
University of Manchester
Abstract
The gllamm procedure provides a framework in which to undertake many of
the more difficult analyses required for trials and intervention studies.
Treatment effect estimation in the presence of noncompliance can be
undertaken using instrumental variable (IV) methods. I illustrate how
gllamm can be used for IV estimation for the full range of types of
treatment and outcome measures and describe how missing data may be
tackled under an assumption of latent ignorability. I also describe other
approaches to account for clustering and the analysis of
cluster-randomized studies.
Examples from studies of alcohol consumption of primary-care patients,
cognitive behavior therapy of depression patients, and a school-based
smoking intervention are discussed.
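As a flavor of the syntax, a random-intercept model for a cluster-randomized
trial with a binary outcome might be specified as follows (the variable and
cluster names are hypothetical):

    * random intercept at the level of the randomization cluster
    gllamm outcome treatment, i(clinic) family(binomial) link(logit) adapt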
Additional information
pickles_Germany2006_gllamm.pdf
Estimating IRT models with gllamm
Herbert Matschinger
University of Leipzig
Abstract
Within the framework of economic evaluation, health econometricians
are interested in constructing a meaningful health index that is
consistent with individual or societal preferences. One way to
derive such an index is based on the EQ-5D description and valuation
of health-related quality of life (HRQOL). The purpose of this study
was to analyze how well the EQ-5D reflects one latent construct of
HRQOL and how large the potential impact of measurement variance
is with respect to six different countries. Data came from the European
Study of the Epidemiology of Mental Disorders (ESEMeD), a
cross-sectional survey of a representative random sample (N = 21,425)
in Belgium, France, Germany, Italy, The Netherlands, and Spain. At
least in psychology, much attention is paid to different forms of
item response theory (IRT) models, particularly the Rasch model,
since it is the only model
featuring specific objectivity, which enables what is called a
“fair comparison” with respect to the latent dimension to be
measured. Therefore the dimensionality of the construct is evaluated by
means of one-parameter and two-parameter IRT.
Differential item functioning is tested with respect to the six
countries and both the difficulty and discrimination parameters.
Results show that a unidimensional one-parameter IRT model holds for
all countries only if the item “anxiety/depression” is
omitted. If both the physical and the mental components
of HRQOL should be represented, the questionnaire should be
extended to a two-dimensional construct. Consequently, more items
portraying the mental component are needed. This presentation will
focus on the possibilities and restrictions in estimating these models
with gllamm. It will be shown how these models can be established
and tested. Problems regarding the structure of the data and the
assignment of incidental parameters to individual observations will be
discussed.
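As a sketch of how such a model can be set up, a one-parameter (Rasch) model
is typically estimated in gllamm by stacking the item responses in long form
and entering item dummies as fixed effects (the variable names here are
hypothetical, and the five items stand in for the EQ-5D dimensions):

    reshape long resp, i(person) j(item)   // one record per person-item pair
    tabulate item, generate(d)             // item dummies d1, d2, ...
    gllamm resp d1-d5, nocons i(person) link(logit) family(binomial) adapt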
Additional information
Matschinger_glamm.ppt
Variance estimation for generalized entropy and Atkinson inequality
indices: The complex survey data case
Martin Biewen
University of Frankfurt
Abstract
We derive the sampling variances of generalized entropy and Atkinson
indices when estimated from complex survey data, and we show how they can
be calculated straightforwardly by using widely available software. We
also show that, when the same approach is used to derive variance
formulae for the i.i.d. case, it leads to estimators that are simpler
than those proposed before. Both cases are illustrated with a comparison
of income inequality in Britain and Germany.
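For reference, the indices under discussion have the standard forms (these are
the textbook definitions, not formulas taken from the talk), with mean income
$\bar y$:

    GE(\alpha) = \frac{1}{\alpha(\alpha-1)}
                 \left[ \frac{1}{n} \sum_{i=1}^{n}
                 \left( \frac{y_i}{\bar y} \right)^{\alpha} - 1 \right],
    \qquad \alpha \neq 0, 1,

    A(\varepsilon) = 1 - \frac{1}{\bar y}
                     \left[ \frac{1}{n} \sum_{i=1}^{n}
                     y_i^{1-\varepsilon} \right]^{1/(1-\varepsilon)},
    \qquad \varepsilon \neq 1.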
Additional information
biewen.ppt
Linear mixed models in Stata
Roberto Gutierrez
StataCorp
Abstract
Stata version 9 includes the new command xtmixed, for fitting
linear mixed models. Mixed models contain both fixed and random
effects. The fixed effects are analogous to standard regression
coefficients and are estimated directly. The random effects are not
directly estimated but are summarized according to the unique elements
of their respective variance–covariance matrices, known as variance
components. xtmixed syntax is summarized and demonstrated with several
examples. Also, xtmixed and its postestimation routines may be
used to perform nonparametric smoothing by means of penalized splines.
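A minimal sketch of the syntax, assuming longitudinal data with repeated
weight measurements on subjects identified by id (the variable names are
illustrative):

    xtmixed weight week || id: week, mle  // random intercept and random slope for week
    estat recovariance                    // display the estimated variance components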
Additional information
gutierrez_mannheim.pdf
Implementing restricted least squares in linear models
J. Haisken-DeNew
RWI Essen
Abstract
The presentation illustrates the user-written program hds97,
which implements the restricted least squares procedure as
described by Haisken-DeNew and Schmidt (1997). Log wages are
regressed on a group of k-1 industry/region/job/etc. dummies.
The kth dummy is the omitted reference dummy. Using RLS, all
k dummy coefficients and standard errors are reported. The
coefficients are interpreted as percent-point deviations from the
industry weighted average. An overall measure of dispersion is
also reported.
This ado-file corrects two problems with the methodology of Krueger and
Summers (1988, Econometrica): overstated differential standard
errors and understated overall dispersion.
General comments: The coefficients of continuous variables are
not affected by hds97. Also, all results calculated in hds97
are independent of the choice of the reference category. Note that
for all dummy-variable sets having only two outcomes (e.g., male/female),
the t-values of the hds97-adjusted coefficients are always
equal in magnitude but opposite in sign.
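The core of the procedure can be sketched as a weighted-deviation
normalization (the notation here is ours, not taken from the talk): with
estimated dummy coefficients $\beta_1, \dots, \beta_K$, the reference category
fixed at $\beta_K = 0$, and sample shares $s_j$, the reported coefficients are

    \tilde{\beta}_k = \beta_k - \sum_{j=1}^{K} s_j \beta_j ,

so that $\sum_k s_k \tilde{\beta}_k = 0$. This centering is what makes the
reported results invariant to the choice of reference category.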
Additional information
RLS_Haisken_DeNew.ppt
Sequence analysis using Stata
Christian Brzinsky-Fay
Ulrich Kohler
WZB
Abstract
Sequences are ordered lists of elements. A typical example
is the sequence of bases in DNA. Other
examples are sequences of employment stages during a lifetime or
individual party preferences over time. Sequence analysis includes
techniques to handle, describe, and, most importantly, compare
sequences.
Sequences are most commonly used by geneticists but not as commonly by
social scientists. This disparity is surprising, as sequence data are
readily available for the social sciences. In fact,
all data from panel studies can be regarded as sequence data.
Nevertheless, social scientists relatively seldom use panel data for
sequence analysis. The first aim of the presentation is therefore to
illustrate a typical research topic that can be addressed with sequence analysis. The
second part will then describe a bundle of user-written Stata programs
for sequence analysis, including a Mata algorithm for performing optimal
matching with the so-called Needleman–Wunsch algorithm.
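To give a flavor of such an implementation, here is a minimal Mata sketch of
the dynamic-programming recursion behind optimal matching (cost-minimizing
alignment with user-supplied indel and substitution costs; the function name
is hypothetical and this is not the authors' code):

    mata:
    real scalar align_cost(string rowvector a, string rowvector b,
                           real scalar indel, real scalar sub)
    {
        real scalar n, m, i, j, cost
        real matrix D

        n = cols(a); m = cols(b)
        D = J(n+1, m+1, 0)                 // D[i+1,j+1]: cost of aligning a[1..i], b[1..j]
        for (i=1; i<=n; i++) D[i+1,1] = i*indel
        for (j=1; j<=m; j++) D[1,j+1] = j*indel
        for (i=1; i<=n; i++) {
            for (j=1; j<=m; j++) {
                cost = (a[i] == b[j] ? 0 : sub)
                D[i+1,j+1] = min((D[i,j] + cost,     // match or substitute
                                  D[i,j+1] + indel,  // delete from a
                                  D[i+1,j] + indel)) // insert into a
            }
        }
        return(D[n+1, m+1])
    }
    end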
Additional information
sum_brzinsky_kohler.pdf
sum_brzinsky_kohler_demo.zip
New tools for evaluating the results of cluster analyses
Hildegard Schaeper
HIS
Abstract
Clustering methods are designed for finding groups in data, i.e., for
grouping similar objects (variables or observations) into the same
cluster and dissimilar objects into separate clusters. Although the
main idea is rather simple, carrying out a cluster analysis remains a
challenging task. The number of different clustering methods is huge
and clustering includes many choices, such as the decision between basic
approaches (e.g., hierarchical and partitioning methods), the choice of
a dissimilarity or similarity measure, the selection of a particular
linkage method when performing a hierarchical agglomerative cluster
analysis, the choice of an initial partition when carrying out a
partitioning cluster analysis, and the determination of the
appropriate number of clusters. Each of these decisions
can affect the classification results.
Apart from two commands for determining the number of clusters
(cluster stop, cluster dendrogram), Stata has no built-in tools that
allow examination of clustering results. We therefore developed
some simple tools that provide further evaluation criteria:
- programs assisting in determining the number of clusters
(Mojena’s stopping rules for hierarchical clustering techniques,
PRE coefficient, F-Max statistic, and Beale’s F values for
a partitioning cluster analysis),
- a program for testing the stability of classifications produced
by different cluster analyses (Rand index; see the definition below), and
- a program that computes ETA2 to assess how well the
clustering variables separate the clusters.
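For reference, the Rand index used for the stability check is the standard
pair-counting measure

    R = \frac{a + d}{\binom{n}{2}} ,

where, over the $\binom{n}{2}$ pairs of objects, $a$ counts pairs assigned to
the same cluster by both classifications and $d$ counts pairs assigned to
different clusters by both; $R = 1$ indicates identical partitions.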
The presentation will compare these programs with other cluster-analysis
tools (agglomeration schedule, scree diagram).
Additional information
schaeper_pres_short.ppt
Stata goes BUGS (via R)
Susumu Shikano
University of Mannheim
Abstract
Recently, Bayesian methods such as Markov chain Monte Carlo (MCMC)
techniques have found more use in the social sciences, with (Win)BUGS
being one of the most widely applied programs for this kind of analysis.
Unfortunately, because Stata offers neither MCMC techniques nor any interface
to WinBUGS or BUGS, Stata users who apply MCMC techniques have
to perform such painful tasks as reformatting data themselves. As a
preliminary solution to this problem, one can call the statistical software
package R from within Stata and use it as an interface to (Win)BUGS.
This presentation outlines this solution and illustrates it with a thorough analysis.
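In outline, the bridge might look like this (the file names and the contents
of the R script are hypothetical; runbugs.R would call bugs() from the
R2WinBUGS package):

    outsheet y x id using bugsdata.csv, comma replace  // export the data for R
    ! R CMD BATCH runbugs.R                            // R runs WinBUGS via R2WinBUGS
    insheet using posterior.csv, comma clear           // read posterior summaries back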
Additional information
shikano_StataMeeting2006.pdf
Optimal large package administration for Stata
Markus Hahn
RWI Essen
Abstract
Installing smaller ado-packages stored on user web pages is quite
simple in Stata. However, when the number of files in a package
becomes large and the files need to be updated regularly, this task
becomes cumbersome, and package updates can take a long time to complete.
Here a method of storing packages as compressed archives on the host
server is outlined, whereby the user sends a query to the update server
to check for a new version. If a new version is available, the package
archive is downloaded in its entirety and is then extracted and
installed locally. This approach is far more efficient with respect to
installation times (typically only 1/10 of the time needed) than
downloading many text files individually. For large packages, the
bottleneck is most often the download time. Currently this automated
updating can be achieved with a Stata ado-file and the aid of additional
binaries (such as tar, gzip, and zip). The usability of this technique
would be enhanced dramatically if the functionality of an archiving
format (such as tar, gzip, zip) were directly integrated into the Stata
binary. Even encrypted files could be distributed in this manner as
well. Ado-files inside the package archive can be configured to make an
automatic call to the host server to check for available updates.
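In outline, such an update might look like this (the host name, file names,
and the external unzip binary are all assumptions for illustration):

    copy http://example.host/mypkg.version version.txt, replace  // query latest version
    copy http://example.host/mypkg.zip mypkg.zip, replace        // fetch the whole archive
    ! unzip -o mypkg.zip -d "`c(sysdir_plus)'"                   // extract into the ado-path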
Additional information
presentation_hahn.pdf
Report to users
Alan Riley
StataCorp
Additional information
riley_mannheim.pdf