Last updated: 25 May 2006
University of Mannheim
Room W 117
Most Stata users make their living producing results in a form accessible to end users. Most of these end users cannot immediately understand Stata logs. However, they can understand tables (in paper, PDF, HTML, spreadsheet, or word processor documents) and plots (produced by using Stata or non-Stata software). Tables are produced by Stata as resultsspreadsheets, and plots are produced by Stata as resultsplots. Sometimes (but not always), resultsspreadsheets and resultsplots are produced using resultssets. Resultssets, resultsspreadsheets, and resultsplots are all produced, directly or indirectly, as output by Stata commands. A resultsset is a Stata dataset, which is a table whose rows are Stata observations and whose columns are Stata variables. A resultsspreadsheet is a table in generic text format, conforming to a TeX or HTML convention, or to another convention with a column separator string and possibly left and right row delimiter strings. A resultsplot is a plot produced as output, using a resultsset or a resultsspreadsheet as input. Resultsset-producing programs include statsby, parmby, parmest, collapse, contract, xcollapse, and xcontract. Resultsspreadsheet-producing programs include outsheet, listtex, estout, and estimates table. Resultsplot-producing programs include eclplot and mileplot. There are two main approaches (or dogmas) for generating resultsspreadsheets and resultsplots. The resultsset-centered dogma is followed by parmest and parmby users and states: “Datasets make resultssets, which make resultsplots and resultsspreadsheets”. The resultsspreadsheet-centered dogma is followed by estout and estimates table users and states: “Datasets make resultsspreadsheets, which make resultssets, which make resultsplots”. The two dogmas are complementary, and each dogma has its advantages and disadvantages. 
The resultsspreadsheet dogma is much easier for the casual user to learn to apply in a hurry and is therefore probably preferred by most users most of the time. The resultsset dogma is more difficult for most users to learn but is more convenient for users who wish to program everything in do-files, with little or no manual cutting and pasting.
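The definition of a resultsspreadsheet above (a column separator string plus optional left and right row delimiter strings) can be illustrated with a short sketch. It is written in Python rather than Stata and is not the listtex implementation; the function name, the parameter names, and the numbers in the toy resultsset are all invented for illustration.

```python
def to_resultsspreadsheet(rows, delimiter="&", begin="", end=""):
    """Render each row as: begin + cells joined by delimiter + end."""
    return [begin + delimiter.join(str(cell) for cell in row) + end
            for row in rows]


# Hypothetical resultsset: one row per parameter (all values are made up).
resultsset = [
    ["parm", "estimate", "min95", "max95"],
    ["weight", 1.75, 0.46, 3.03],
    ["foreign", 3673.06, 2308.91, 5037.21],
]

# TeX convention: "&" between columns, "\\" closing each row.
tex_rows = to_resultsspreadsheet(resultsset, delimiter="&", end=r"\\")

# HTML convention: the same rows with table-row and table-cell tags.
html_rows = to_resultsspreadsheet(
    resultsset, delimiter="</td><td>", begin="<tr><td>", end="</td></tr>"
)

for line in tex_rows + html_rows:
    print(line)
```

The same table yields either convention by changing three strings, which is the sense in which a resultsset can "make" a resultsspreadsheet.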
The gllamm procedure provides a framework in which to undertake many of the more difficult analyses required for trials and intervention studies.
Treatment effect estimation in the presence of noncompliance can be undertaken using instrumental variable (IV) methods. I illustrate how gllamm can be used for IV estimation for the full range of types of treatment and outcome measures and describe how missing data may be tackled on an assumption of latent ignorability. I also describe other approaches to account for clustering and the analysis of cluster-randomized studies.
Examples from studies of alcohol consumption of primary-care patients, cognitive behavior therapy of depression patients, and a school-based smoking intervention are discussed.
Within the framework of economic evaluation, health econometricians are interested in constructing a meaningful health index that is consistent with individual or societal preferences. One way to derive such an index is based on the EQ-5D description and valuation of health-related quality of life (HRQOL). The purpose of this study was to analyze how well the EQ-5D reflects one latent construct of HRQOL and how large the potential impact of measurement variance is with respect to six different countries. Data came from the European Study of the Epidemiology of Mental Disorders (ESEMeD), a cross-sectional survey of a representative random sample (N = 21,425) in Belgium, France, Germany, Italy, The Netherlands, and Spain. At least in psychology, much attention is paid to different forms of item response theory (IRT) models and particularly the Rasch model, since it is the only model featuring specific objectivity, which enables what is called a “fair comparison” with respect to the latent dimension to be measured. Therefore the dimensionality of the construct is evaluated by means of one-parameter and two-parameter IRT. Differential item functioning is tested with respect to the six countries and both the difficulty and discrimination parameters. Results show that a unidimensional one-parameter IRT model holds for all countries only if the item “anxiety/depression” is omitted. If both the physical and the mental components of HRQOL should be represented, the questionnaire should be extended to a two-dimensional construct. Consequently, more items to portray the mental component are needed. This presentation will focus on the possibilities and restrictions in estimating these models with gllamm. It will be shown how these models can be established and tested. Problems regarding the structure of the data and the assignment of incidental parameters to individual observations will be discussed.
We derive the sampling variances of generalized entropy and Atkinson indices when estimated from complex survey data, and we show how they can be calculated straightforwardly by using widely available software. We also show that, when the same approach is used to derive variance formulae for the i.i.d. case, it leads to estimators that are simpler than those proposed before. Both cases are illustrated with a comparison of income inequality in Britain and Germany.
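For orientation, the point estimates of the two index families can be sketched as follows. This is a minimal Python sketch using the textbook definitions with sampling weights; it does not reproduce the variance estimators that the abstract is about, and the function names are my own.

```python
import math


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def generalized_entropy(incomes, weights, alpha):
    """GE(alpha) for strictly positive incomes.

    alpha = 0 gives the mean log deviation, alpha = 1 the Theil index.
    """
    mu = weighted_mean(incomes, weights)
    ratios = [y / mu for y in incomes]
    if alpha == 0:
        return weighted_mean([-math.log(r) for r in ratios], weights)
    if alpha == 1:
        return weighted_mean([r * math.log(r) for r in ratios], weights)
    moment = weighted_mean([r ** alpha for r in ratios], weights)
    return (moment - 1.0) / (alpha * (alpha - 1.0))


def atkinson(incomes, weights, epsilon):
    """Atkinson index: 1 minus the ratio of equally distributed
    equivalent income to mean income (strictly positive incomes)."""
    mu = weighted_mean(incomes, weights)
    if epsilon == 1:
        ede = math.exp(weighted_mean([math.log(y) for y in incomes], weights))
    else:
        power = 1.0 - epsilon
        ede = weighted_mean([y ** power for y in incomes], weights) ** (1.0 / power)
    return 1.0 - ede / mu
```

One quick consistency check is the identity A(1) = 1 - exp(-GE(0)), which follows directly from the two definitions and holds for any weights.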
Stata version 9 includes the new command xtmixed, for fitting linear mixed models. Mixed models contain both fixed and random effects. The fixed effects are analogous to standard regression coefficients and are estimated directly. The random effects are not directly estimated but are summarized according to the unique elements of their respective variance–covariance matrices, known as variance components. xtmixed syntax is summarized and demonstrated with several examples. Also, xtmixed and its postestimation routines may be used to perform nonparametric smoothing by means of penalized splines.
The presentation illustrates the user-written program hds97, which implements the restricted least squares (RLS) procedure as described by Haisken-DeNew and Schmidt (1997). Log wages are regressed on a group of k-1 industry/region/job/etc. dummies. The kth dummy is the omitted reference dummy. Using RLS, all k dummy coefficients and standard errors are reported. The coefficients are interpreted as percentage-point deviations from the industry weighted average. An overall measure of dispersion is also reported.
This ado-file corrects problems with the Krueger and Summers (1988) Econometrica methodology of overstated differential standard errors and understated overall dispersion.
General comments: The coefficients of continuous variables are not affected by hds97. Also, all results calculated in hds97 are independent of the choice of the reference category. For all dummy-variable sets having only two outcomes, e.g., male/female, the t-values of the hds97 adjusted coefficients are always equal in magnitude but opposite in sign.
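The two-outcome remark can be checked numerically with a small sketch. It is written in Python and is not the hds97 code. It assumes that the adjusted coefficients are the ordinary dummy coefficients, with the reference category entered as zero, re-expressed as deviations from their share-weighted average, and that their variances follow from the usual rule for linear transformations. The function name and the numbers are invented.

```python
import math


def rls_deviations(coefs, cov, shares):
    """Re-express k dummy coefficients as deviations from their weighted mean.

    coefs  : k coefficients, with the omitted reference category entered as 0.0
    cov    : k x k covariance matrix (zero row and column for the reference)
    shares : k sample shares summing to 1
    Returns (deviations, standard_errors).
    """
    k = len(coefs)
    # Transformation matrix A = I - 1 w', so that deviations = A * coefs.
    a = [[(1.0 if i == j else 0.0) - shares[j] for j in range(k)]
         for i in range(k)]
    deviations = [sum(a[i][j] * coefs[j] for j in range(k)) for i in range(k)]
    # Var(A b) = A V A'; only the diagonal is needed for standard errors.
    variances = [
        sum(a[i][p] * cov[p][q] * a[i][q] for p in range(k) for q in range(k))
        for i in range(k)
    ]
    return deviations, [math.sqrt(v) for v in variances]


# Invented two-category example: the reference group has share 0.6 and the
# other group has coefficient 0.2 with standard error 0.05.
coefs = [0.0, 0.2]
cov = [[0.0, 0.0], [0.0, 0.05 ** 2]]
shares = [0.6, 0.4]

deviations, std_errors = rls_deviations(coefs, cov, shares)
t_values = [d / s for d, s in zip(deviations, std_errors)]
print(deviations, std_errors, t_values)
```

With only two categories, both deviations are multiples of the single estimated coefficient, so the share factors cancel in the t ratios and only the sign differs.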
Sequences are ordered lists of elements. A typical example is the sequence of bases in DNA. Other examples are sequences of employment stages during a lifetime or individual party preferences over time. Sequence analysis includes techniques to handle, describe, and, most importantly, compare sequences.
Sequences are most commonly used by geneticists but not as commonly by social scientists. This disparity is surprising, as sequence data are readily available for the social sciences. In fact, all data from panel studies can be regarded as sequence data. Nevertheless, social scientists relatively seldom use panel data for sequence analysis. The first aim of the presentation therefore is to illustrate a typical research topic that can be dealt with using sequence analysis. The second part will then describe a bundle of user-written Stata programs for sequence analysis, including a Mata algorithm for performing optimal matching with the so-called Needleman–Wunsch algorithm.
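The idea behind optimal matching can be sketched as a dynamic program in the style of Needleman–Wunsch, here in its cost-minimizing form. The sketch is in Python, not Mata, and is not the code of the programs presented; the default indel and substitution costs are assumptions chosen for illustration.

```python
def om_distance(seq_a, seq_b, indel=1.0, sub=2.0):
    """Minimum total cost of turning seq_a into seq_b using insertions and
    deletions (cost `indel` each) and substitutions (cost `sub` each)."""
    n, m = len(seq_a), len(seq_b)
    # prev[j] holds the cost of aligning the first i-1 elements of seq_a
    # with the first j elements of seq_b; row 0 is pure insertion.
    prev = [j * indel for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel] + [0.0] * m
        for j in range(1, m + 1):
            match_cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else sub
            curr[j] = min(
                prev[j - 1] + match_cost,  # match or substitute
                prev[j] + indel,           # delete from seq_a
                curr[j - 1] + indel,       # insert into seq_a
            )
        prev = curr
    return prev[m]


# Hypothetical employment sequences: E = employed, U = unemployed.
print(om_distance("EEUU", "EEUE"))
print(om_distance("EEU", "EEUU"))
```

Only two rows of the cost table are kept at a time, so memory grows with the length of one sequence rather than with the product of both lengths.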
Clustering methods are designed for finding groups in data, i.e., for grouping similar objects (variables or observations) into the same cluster and dissimilar objects into separate clusters. Although the main idea is rather simple, carrying out a cluster analysis remains a challenging task. The number of different clustering methods is huge and clustering includes many choices, such as the decision between basic approaches (e.g., hierarchical and partitioning methods), the choice of a dissimilarity or similarity measure, the selection of a particular linkage method when performing a hierarchical agglomerative cluster analysis, the choice of an initial partition when carrying out a partitioning cluster analysis, and the determination of the appropriate number of clusters. Each of these decisions can affect the classification results.
Apart from two commands for determining the number of clusters (cluster stop, cluster dendrogram), Stata has no built-in tools that allow examination of clustering results. We therefore developed some simple tools that provide further evaluation criteria:
- programs assisting in determining the number of clusters (Mojena’s stopping rules for hierarchical clustering techniques, PRE coefficient, F-Max statistic and Beale’s F values for a partitioning cluster analysis),
- a program for testing the stability of classifications produced by different cluster analyses (Rand index), and
- a program that computes ETA2 to assess how well the clustering variables separate the clusters.
The presentation will compare these programs with other cluster-analysis tools (agglomeration schedule, scree diagram).
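As an illustration of the stability criterion named in the list, the Rand index of two classifications of the same objects can be computed as sketched below. This is a plain Python sketch of the standard unadjusted definition, not the Stata program presented; the function name is my own.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of object pairs treated alike by both classifications, i.e.
    placed together in both or placed apart in both."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both classifications must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two objects are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Cluster labels are arbitrary: identical partitions with different
# label names still agree on every pair.
print(rand_index([1, 1, 2, 2], [5, 5, 9, 9]))
```

Because the index compares pairs rather than labels, it can be applied to solutions from different clustering methods without first matching cluster numbers.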
Recently, Bayesian methods such as Markov chain Monte Carlo (MCMC) techniques have found more use in the social sciences, with (Win)BUGS being one of the most widely applied programs for this kind of analysis. Unfortunately, because Stata offers neither MCMC techniques nor an interface to WinBUGS or BUGS, Stata users who apply MCMC techniques have to perform such painful tasks as reformatting data themselves. As a preliminary solution to this problem, one can call another statistical package, R, from within Stata and use it as an interface to (Win)BUGS. This presentation outlines this solution, providing a thorough analysis.
Stata makes it quite simple to install and update smaller ado-packages stored on user web pages. However, when the number of files in a package becomes large and the files need to be updated regularly, this task becomes cumbersome. Package updates can take a long time to complete. Here a method of storing packages as compressed archives on the host server is outlined, whereby the user sends a query to the update server to check for a new version. If a new version is available, the package archive is downloaded in its entirety and is then extracted and installed locally. This approach is far more efficient with respect to installation times (typically only 1/10 of the time needed) than downloading many text files individually. For large packages, the bottleneck is most often the download time. Currently this automated updating can be achieved with a Stata ado-file and the aid of additional binaries (such as tar, gzip, and zip). The usability of this technique would be enhanced dramatically if the functionality of an archiving format (such as tar, gzip, zip) were directly integrated into the Stata binary. Even encrypted files could be distributed in this manner. Ado-files inside the package archive can be configured to make an automatic call to the host server to check for available updates.
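The check-then-extract logic described above can be sketched as follows. The sketch is in Python rather than in an ado-file, and it leaves out the network steps (the query to the update server and the download), so the archive is built locally for the demonstration. The dotted version format, the function names, and the file names are all assumptions made for illustration.

```python
import os
import tarfile
import tempfile


def parse_version(text):
    """Turn a dotted version string such as '1.10.0' into (1, 10, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_update(local_version, remote_version):
    """True if the server advertises a newer version than the local one."""
    return parse_version(remote_version) > parse_version(local_version)


def install_archive(archive_path, target_dir):
    """Extract the regular files of a .tar.gz package into target_dir,
    refusing any member whose path would land outside that directory."""
    installed = []
    target_root = os.path.realpath(target_dir)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            destination = os.path.realpath(os.path.join(target_root, member.name))
            if os.path.commonpath([target_root, destination]) != target_root:
                raise ValueError("unsafe path in archive: " + member.name)
            os.makedirs(os.path.dirname(destination), exist_ok=True)
            with archive.extractfile(member) as source, open(destination, "wb") as out:
                out.write(source.read())
            installed.append(member.name)
    return installed


# Demonstration with a locally built archive standing in for a download.
with tempfile.TemporaryDirectory() as workdir:
    ado_source = os.path.join(workdir, "hello.ado")
    with open(ado_source, "w") as handle:
        handle.write("program define hello\nend\n")
    archive_path = os.path.join(workdir, "mypkg.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(ado_source, arcname="hello.ado")
    ado_dir = os.path.join(workdir, "ado")
    os.makedirs(ado_dir)
    installed_files = []
    if needs_update("1.2.0", "1.10.0"):
        installed_files = install_archive(archive_path, ado_dir)
    print(installed_files)
```

Comparing versions as integer tuples rather than as strings avoids ranking "1.10.0" below "1.2.0", which matters once a package has been updated more than nine times.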
Johannes Giesecke, University of Mannheim
Ulrich Kohler, WZB
Fred Ramb, Deutsche Bundesbank
The conference is sponsored and organized by Dittrich and Partner (http://www.dpc.de), the distributor of Stata in several countries, including Germany, Austria, and Hungary.