Last updated: 13 April 2007
2007 German Stata Users Group meeting
Monday, 2 April 2007
Why should you become a Stata programmer?
Boston College Economics
In this talk I describe three modes of Stata programming: authoring
do-files, ado-files, and Mata subroutines for ado-file programming. I
discuss the advantages of developing skills in Stata programming that will
help you become more efficient in your use of Stata and generate fully
reproducible research output.
Making regression tables simplified
estout, introduced by Jann (2005), is a useful tool for producing
regression tables from stored estimates. However, its syntax is relatively
complex, and commands can become lengthy even for simple tables.
Furthermore, having to store the estimates beforehand can be a bit
cumbersome. To facilitate the production of regression tables, I therefore
present two new commands called esto and esta. esto is
a wrapper for official Stata’s estimates store that simplifies
the storing of estimation results for tabulation. For example, esto
does not require the user to provide names for the stored estimation sets.
esta, on the other hand, is a wrapper for estout that
simplifies compiling nice-looking tables from the stored estimates without
much typing. Basic applications of the commands and usage of esta
with external software such as LaTeX, Word, or Excel will be illustrated by
a range of examples.
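
For orientation, the baseline workflow that the abstract describes as cumbersome might look like the sketch below. It uses only official Stata's estimates store and the user-written estout; the auto dataset and the two model specifications are assumptions chosen for illustration. The syntax of esto and esta is not given in the abstract and is therefore not shown.

```stata
* Requires the user-written estout package: ssc install estout
* The example data and models are illustrative assumptions, not from the talk
sysuse auto, clear

* Fit two models and store each result under an explicit name
regress price weight
estimates store m1
regress price weight mpg foreign
estimates store m2

* Tabulate coefficients (with significance stars) and standard errors
* (in parentheses), followed by R-squared and the number of observations
estout m1 m2, cells(b(star fmt(3)) se(par fmt(2))) stats(r2 N)
```

Even this two-model table needs explicit names for the stored results and a fairly long cells() specification, which is the overhead the two wrappers are meant to remove.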
Assessing the reasonableness of an imputation model
Maarten L. Buis
Vrije Universiteit Amsterdam
Multiple imputation is a popular way of dealing with missing values under
the missing at random (MAR) assumption. Imputation models can become quite
complicated, for instance, when the model of substantive interest contains
many interactions or when the data originate from a nested design. This
paper will discuss two methods to assess how plausible the results are. The
first method consists of comparing the point estimates obtained by multiple
imputation with point estimates obtained by another method for controlling
for bias due to missing data. Second, the changes in standard error between
the model that ignores the missing cases and the multiple imputation model
are decomposed into three components: changes due to changes in sample size,
changes due to uncertainty in the imputation model used in multiple
imputation, and changes due to changes in the estimates that underlie the
standard error. This decomposition helps in assessing the reasonableness of
the change in standard error. These two methods will be illustrated with two
new user-written Stata commands.
The influence of categorizing survival time on parameter estimates in a Cox model
University of Freiburg
University Medical Center Freiburg
MRC Clinical Trials Unit, London
With longer follow-up times, the proportional hazards assumption is
questionable in the Cox model. Cox suggested including an interaction
between a covariate and a function of time. To estimate such a function in
Stata, a substantial enlargement of the data is required. This may cause
severe computational problems. We will consider categorizing survival time,
which raises issues as to the number of cutpoints, their position, the
increased number of ties, and the loss of information, to handle this
problem. Sauerbrei et al. (2007) proposed a new selection procedure to model
potential time-varying effects. They investigate a large dataset (N = 2982)
with 20 years of follow-up, for which the Stata command stsplit creates
about 2.2 million records. Categorizing the data in 6-month intervals gives
35,747 records. We will systematically investigate the influence of the
length of categorization intervals and the four methods of handling ties in
Stata. The results of our categorization approach are promising, showing a
sensible way to handle time-varying effects even in simulation studies.
References: Sauerbrei, W., Royston, P. and Look, M. (2007). A new proposal
for multivariable modelling of time-varying effects in survival data based
on fractional polynomial time-transformation. Biometrical Journal.
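
The data enlargement discussed in this abstract can be illustrated with a minimal sketch. It uses a small example dataset shipped with Stata (drugtr), not the study data, and the variable names subject and cat_time are hypothetical. Rounding survival time up to the end of a 6-unit interval is one plausible reading of "categorizing survival time"; the authors' exact procedure is not described here.

```stata
* Illustrative data only: a small drug-trial dataset from the Stata manuals
webuse drugtr, clear
generate long subject = _n                 // hypothetical subject identifier
stset studytime, failure(died) id(subject)
count                                      // one record per subject

* Split every record at each observed failure time (the large expansion)
preserve
stsplit, at(failures)
count
restore

* Assumed categorization: round survival time up to the end of its
* 6-unit interval, which leaves fewer distinct failure times and more ties
generate cat_time = 6 * ceil(studytime / 6)
stset cat_time, failure(died) id(subject)
stsplit, at(failures)
count

* One of the four tie-handling options of stcox
* (breslow, efron, exactm, exactp)
stcox drug age, efron
```

Comparing the three counts shows how the number of records depends on the number of distinct failure times, which is what categorization reduces at the cost of more ties.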
Oaxaca/Blinder decompositions for nonlinear models
RWI Essen, University of Bochum
This paper describes the estimation of a general Blinder–Oaxaca
decomposition of the mean outcome differential of linear and nonlinear
regression models. Departing from this general model, we show how it can be
applied to different models with discrete and limited dependent variables.
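
As a point of reference, the decomposition can be written in its standard textbook form. The notation below is an assumption and not necessarily the authors' own: groups are labeled A and B, and group A's coefficients serve as the reference, which is only one of several possible choices.

```latex
% Linear model: explained (endowments) part plus unexplained (coefficients) part
\[
\bar{Y}_A - \bar{Y}_B
  = (\bar{X}_A - \bar{X}_B)'\hat{\beta}_A
  + \bar{X}_B'(\hat{\beta}_A - \hat{\beta}_B)
\]

% General (possibly nonlinear) model, with
% \Delta = E_{\beta_A}(Y_A \mid X_A) - E_{\beta_B}(Y_B \mid X_B)
\[
\Delta
  = \left[ E_{\beta_A}(Y_A \mid X_A) - E_{\beta_A}(Y_B \mid X_B) \right]
  + \left[ E_{\beta_A}(Y_B \mid X_B) - E_{\beta_B}(Y_B \mid X_B) \right]
\]
```

Here E_{\beta_g}(Y_h | X_h) denotes the mean outcome predicted for group h's covariates using group g's coefficients. In the linear case these conditional expectations reduce to \bar{X}_h'\hat{\beta}_g, so the second expression collapses to the first.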
Estimating double-hurdle models with dependent errors and
Julian A. Fennema
Heriot-Watt University, Edinburgh
This paper describes the estimation of the parameters of a double-hurdle
model in Stata. It is shown that the independent double-hurdle model can be
estimated using a combination of existing commands. Likelihood evaluators to
be used with Stata’s ml facilities are derived to illustrate
how to fit independent and dependent inverse hyperbolic sine double-hurdle
models with heteroskedasticity.
University of Cologne
In this paper, we describe richness, a Stata program for the
calculation of richness indices. Peichl, Schaefer, and Scheicher (2007)
propose a new class of richness measures to contribute to the debate on how to
deal with the financing problems that European welfare states face as a
result of global economic competition. In contrast to the often-used head
count, these new measures are sensitive to changes in rich persons’
income. This approach allows for a more sophisticated analysis of richness,
namely, the question of whether the gap between rich and poor is widening. We
propose to use our new measures in addition to the head count index for a
more comprehensive analysis of richness.
Robust income distribution analysis
Philippe Van Kerm
Extreme data are known to be highly influential when measuring income
inequality from microdata. Similarly, Lorenz curves and dominance criteria
are sensitive to data contamination in the tails of the distribution.
In this presentation, I intend to introduce a set of user-written packages
that implement robust statistical methods for income distribution analysis.
These methods are based on the estimation of parametric models (Pareto,
Singh–Maddala) with “optimal B-robust” estimators rather
than maximum likelihood. Empirical examples show how robust inequality
estimates and dominance checks can be derived from these models.
PanelWhiz: A Stata interface for large scale panel datasets
John P. Haisken-DeNew
This paper outlines a panel-data retrieval program written for Stata/SE or
better, which allows easier access to household panel datasets.
Using a dropdown menu system, the researcher selects variables from any and
all available years of the panel. The data are automatically retrieved and
merged to form a long file, which can be directly used by the
Stata panel estimators. The system implements modular data cleaning programs
called plugins. Yearly updates to the data retrievals can be
made automatically. Projects can be stored in libraries, allowing modular
administration and appending. PanelWhiz is available for SOEP,
IAB-Betriebspanel, HILDA, CPS-NBER, CPS-CEPR. Other popular datasets will
be supported soon.
PanelWhiz plugins: automatic vector-oriented data cleaning for
large scale panel datasets
RWI Essen and University of Bochum
PanelWhiz plugins are modular data-cleaning programs for specific items in
PanelWhiz. Each plugin is designed to recode, deflate, and change existing
variables being extracted in a panel-data retrieval. Furthermore, new
variables can be generated on the fly. The PanelWhiz plugin system is a
macro language that uses new-style dialog boxes and Stata’s
modularized class system, allowing a vector orientation for data cleaning.
The PanelWhiz plugins can even be generated using a PanelWhiz plugin
front-end, allowing users to create plugins without having to write Stata code
themselves. The system is set up to allow data cleaning of any
A model for transferring variables between different
datasets based on imputation of individual scores
University of Twente
It is an often-encountered problem that variables of interest are scattered
in different datasets. Given two methodologically similar surveys, a
question not asked in one survey can be seen as a special case of the
missing-data problem (Gelman et al., 1998). The paper presents a model for
transferring variables between different datasets, applying the procedures
for multiple imputation of missing values. The feasibility of this approach
was assessed using two Dutch surveys: Social and Cultural Developments in
The Netherlands (SOCON 2000) and the Dutch Election Study (NKO 2002). An
imputation model for the left–right ideological self-placement was
developed based on the SOCON survey. In the next step, left–right
scores were imputed to the respondents from the NKO study. The outcome of
the imputation was evaluated, first, by comparing the imputed variables with
the left–right scores collected in three waves of the NKO study.
Second, the imputed and the original NKO left–right variables were
compared in terms of their associations with a broad set of attitudinal
variables from the NKO dataset. The results show that one would reach
similar conclusions when using the original or imputed variable, albeit with
an increased risk of making Type II errors.
Two issues on remote data access
At the Research Data Centre of the BA at the IAB, researchers can send in
Stata programs to be processed there with the log files sent back to them
after a disclosure limitation review. This method of data access is called
remote data access; we use it to protect data confidentiality. Remote
data access has two nonstandard requirements: efficient use of the computer
resources and automation of parts of the disclosure limitation review. I
would like to talk about how we deal with these requirements and discuss
ways to improve them.
Johannes Giesecke, University of Mannheim
Ulrich Kohler, WZB
Fred Ramb, Deutsche Bundesbank
The conference is sponsored and organized by Dittrich and Partner,
the distributor of Stata in several countries, including
Germany, Austria, and Hungary.