## 2007 German Stata Users Group meeting: Abstracts

### Why should you become a Stata programmer?

Kit Baum
Boston College Economics
In this talk I describe three modes of Stata programming: authoring do-files, ado-files, and Mata subroutines for ado-file programming. I discuss the advantages of developing skills in Stata programming that will help you become more efficient in your use of Stata and generate fully reproducible research output.

StataProgDESUG.7323.pdf

### Making regression tables simplified

Ben Jann
ETH Zurich
estout, introduced by Jann (2005), is a useful tool for producing regression tables from stored estimates. However, its syntax is relatively complex, and commands can become lengthy even for simple tables. Furthermore, having to store the estimates beforehand can be cumbersome. To facilitate the production of regression tables, I therefore present two new commands called esto and esta. esto is a wrapper for official Stata’s estimates store and simplifies the storing of estimation results for tabulation. For example, esto does not require the user to provide names for the stored estimation sets. esta, on the other hand, is a wrapper for estout and simplifies compiling nice-looking tables from the stored estimates without much typing. Basic applications of the commands and the use of esta with external software such as LaTeX, Word, or Excel will be illustrated by a range of examples.

Essen07_jann.pdf
Essen07_jann.zip

### Assessing the reasonableness of an imputation model

Maarten L. Buis
Vrije Universiteit Amsterdam
Multiple imputation is a popular way of dealing with missing values under the missing at random (MAR) assumption. Imputation models can become quite complicated, for instance, when the model of substantive interest contains many interactions or when the data originate from a nested design. This paper will discuss two methods to assess how plausible the results are. The first method consists of comparing the point estimates obtained by multiple imputation with point estimates obtained by another method of controlling for bias due to missing data. The second method decomposes the change in standard error between the model that ignores the missing cases and the multiple imputation model into three components: changes due to changes in sample size, changes due to uncertainty in the imputation model used in multiple imputation, and changes due to changes in the estimates that underlie the standard error. This decomposition helps in assessing the reasonableness of the change in standard error. These two methods will be illustrated with two new user-written Stata commands.

BUIS_GsugBuis.pdf

### The influence of categorizing survival time on parameter estimates in a Cox model

Anika Buchholz
University of Freiburg
Willi Sauerbrei
University Medical Center Freiburg
Patrick Royston
MRC Clinical Trials Unit, London
With longer follow-up times, the proportional hazards assumption of the Cox model becomes questionable. Cox suggested including an interaction between a covariate and a function of time. Estimating such a function in Stata requires a substantial enlargement of the data, which may cause severe computational problems. To handle this problem, we consider categorizing survival time, which raises issues regarding the number of cutpoints, their position, the increased number of ties, and the loss of information. Sauerbrei et al. (2007) proposed a new selection procedure to model potential time-varying effects. They investigate a large dataset (N = 2982) with 20 years of follow-up, for which the Stata command stsplit creates about 2.2 million records. Categorizing the data in 6-month intervals gives 35,747 records. We will systematically investigate the influence of the length of categorization intervals and the four methods of handling ties in Stata. The results of our categorization approach are promising, showing a sensible way to handle time-varying effects even in simulation studies. References: Sauerbrei, W., Royston, P. and Look, M. (2007). A new proposal for multivariable modelling of time-varying effects in survival data based on fractional polynomial time-transformation. (Biometrical Journal, in press)

BUCHHOLZ_Vortrag.Essen.pdf
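
The effect of categorizing survival time on the size of the split dataset can be illustrated with a minimal Python sketch. It is not the authors' procedure: the follow-up times are hypothetical, every time is assumed to be an observed event (no censoring), and the helper names `categorize` and `split_records` are made up for this illustration. Splitting each subject at every distinct event time yields one record per subject per event time up to that subject's own time, so fewer distinct times mean fewer records, at the price of more ties.

```python
def categorize(times, width):
    """Round each time up to the end of its interval of length `width`."""
    return [-(-t // width) * width for t in times]


def split_records(event_times):
    """Count records after splitting every subject at each distinct event time.

    Assumes every time is an observed event (no censoring): a subject
    contributes one record per distinct event time up to its own time.
    """
    distinct = sorted(set(event_times))
    return sum(sum(1 for d in distinct if d <= t) for t in event_times)


# Hypothetical follow-up times in months (not data from the study).
months = [4, 5, 11, 14, 17, 31, 35, 37, 90, 238]
grouped = categorize(months, 6)  # 6-month intervals

records_exact = split_records(months)
records_grouped = split_records(grouped)
ties_grouped = len(grouped) - len(set(grouped))
```

With these ten made-up times, grouping reduces the number of distinct event times from ten to seven and the number of split records accordingly, while introducing tied times that did not exist before.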

### Oaxaca/Blinder decompositions for nonlinear models

Matthias Sinning
Markus Hahn
RWI Essen, University of Bochum
This paper describes the estimation of a general Blinder–Oaxaca decomposition of the mean outcome differential of linear and nonlinear regression models. Departing from this general model, we show how it can be applied to different models with discrete and limited dependent variables.

SINNING_stata_presentation.pdf
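
One common way to carry the decomposition over to a nonlinear model is to replace the linear prediction at the covariate means by the sample mean of the model's conditional expectations. The Python sketch below does this for a logit model, using group A's coefficients as the reference. It is an illustration under stated assumptions, not the authors' implementation: the covariates, the coefficients, and the function names `mean_prediction` and `decompose` are hypothetical.

```python
import math


def mean_prediction(rows, beta):
    """Sample mean of logistic predictions E(Y | x) = 1 / (1 + exp(-x'beta))."""
    preds = [1.0 / (1.0 + math.exp(-sum(x * b for x, b in zip(row, beta))))
             for row in rows]
    return sum(preds) / len(preds)


def decompose(rows_a, beta_a, rows_b, beta_b):
    """Two-fold decomposition with group A's coefficients as the reference."""
    mean_a = mean_prediction(rows_a, beta_a)
    mean_b = mean_prediction(rows_b, beta_b)
    # Counterfactual: group B's covariates evaluated at group A's coefficients.
    counterfactual = mean_prediction(rows_b, beta_a)
    explained = mean_a - counterfactual      # due to differences in covariates
    unexplained = counterfactual - mean_b    # due to differences in coefficients
    return mean_a - mean_b, explained, unexplained


# Hypothetical covariates (constant, x) and coefficients, for illustration only.
rows_a = [(1.0, 0.5), (1.0, 1.5), (1.0, 2.0)]
rows_b = [(1.0, 0.0), (1.0, 0.5), (1.0, 1.0)]
beta_a = (-0.5, 1.0)
beta_b = (-1.0, 0.8)

gap, explained, unexplained = decompose(rows_a, beta_a, rows_b, beta_b)
```

By construction the two components add up to the raw gap in mean predicted outcomes; swapping the reference group generally changes how the gap is split between them.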

### Estimating double-hurdle models with dependent errors and heteroskedasticity

Julian A. Fennema
Heriot-Watt University, Edinburgh
This paper describes the estimation of the parameters of a double-hurdle model in Stata. It is shown that the independent double-hurdle model can be estimated using a combination of existing commands. Likelihood evaluators to be used with Stata’s ml facilities are derived to illustrate how to fit independent and dependent inverse hyperbolic sine double-hurdle models with heteroskedasticity.

### Measuring richness

Andreas Peichl
University of Cologne
In this paper, we describe richness, a Stata program for the calculation of richness indices. Peichl, Schaefer, and Scheicher (2007) propose a new class of richness measures to contribute to the debate on how to deal with the financing problems that European welfare states face as a result of global economic competition. In contrast to the often-used head count, these new measures are sensitive to changes in rich persons’ income. This approach allows for a more sophisticated analysis of richness, namely, of the question of whether the gap between rich and poor is widening. We propose to use our new measures in addition to the head count index for a more comprehensive analysis of richness.

peichl_20070402_VortragStataUserGroup.pdf
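
The difference between the head count and an income-sensitive measure can be shown with a short Python sketch. The index used here is an illustrative choice and is not claimed to be one of the measures implemented in richness; the incomes, the richness line, and the function names are hypothetical.

```python
def head_count(incomes, line):
    """Share of people whose income exceeds the richness line."""
    return sum(1 for y in incomes if y > line) / len(incomes)


def income_sensitive_index(incomes, line, beta=1.0):
    """Illustrative index: mean of 1 - (line / y)**beta over the whole
    population, where people at or below the line contribute zero."""
    return sum(1 - (line / y) ** beta for y in incomes if y > line) / len(incomes)


# Hypothetical incomes; in the second list only the richest person gains.
line = 40
before = [10, 20, 30, 50, 100]
after = [10, 20, 30, 50, 200]

hc_before, hc_after = head_count(before, line), head_count(after, line)
idx_before = income_sensitive_index(before, line)
idx_after = income_sensitive_index(after, line)
```

The head count is identical in both scenarios because the same two people lie above the line, whereas the illustrative index rises when the richest person's income increases.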

### Robust income distribution analysis

Philippe Van Kerm
Extreme data are known to be highly influential when measuring income inequality from microdata. Similarly, Lorenz curves and dominance criteria are sensitive to data contamination in the tails of the distribution. In this presentation, I intend to introduce a set of user-written packages that implement robust statistical methods for income distribution analysis. These methods are based on the estimation of parametric models (Pareto, Singh–Maddala) with “optimal B-robust” estimators rather than maximum likelihood. Empirical examples show how robust inequality estimates and dominance checks can be derived from these models.

VANKERM_gsum_slides.pdf

### PanelWhiz: A Stata interface for large scale panel datasets

John P. Haisken-DeNew
RWI Essen
This paper outlines a panel-data retrieval program written for Stata/SE or better, which makes it easier to access household panel datasets. Using a dropdown menu system, the researcher selects variables from any and all available years of the panel. The data are automatically retrieved and merged to form a long file, which can be directly used by the Stata panel estimators. The system implements modular data cleaning programs called plugins. Yearly updates to the data retrievals can be made automatically. Projects can be stored in libraries, allowing modular administration and appending. PanelWhiz is available for SOEP, IAB-Betriebspanel, HILDA, CPS-NBER, and CPS-CEPR. Other popular datasets will be supported soon.

HAISKEN_panelwhiz_overview.ppt

### PanelWhiz plugins: automatic vector-oriented data cleaning for large scale panel datasets

Markus Hahn
RWI Essen and University of Bochum
PanelWhiz plugins are modular data-cleaning programs for specific items in PanelWhiz. Each plugin is designed to recode, deflate, and change existing variables being extracted in a panel-data retrieval. Furthermore, new variables can be generated on the fly. The PanelWhiz plugin system is a macro language that uses new-style dialog boxes and Stata’s modularized class system, allowing a vector orientation for data cleaning. The PanelWhiz plugins can even be generated using a PanelWhiz plugin front-end, allowing users to create plugins without having to write Stata code themselves. The system is set up to allow data cleaning of any PanelWhiz-supported dataset.

HAHN_german_stata2007.pdf

### A model for transferring variables between different datasets based on imputation of individual scores

Bojan Todosijevic
University of Twente
It is an often-encountered problem that variables of interest are scattered across different datasets. Given two methodologically similar surveys, a question not asked in one survey can be seen as a special case of the missing-data problem (Gelman et al., 1998). The paper presents a model for transferring variables between different datasets, applying the procedures for multiple imputation of missing values. The feasibility of this approach was assessed using two Dutch surveys: Social and Cultural Developments in The Netherlands (SOCON 2000) and the Dutch Election Study (NKO 2002). An imputation model for the left–right ideological self-placement was developed based on the SOCON survey. In the next step, left–right scores were imputed to the respondents from the NKO study. The outcome of the imputation was evaluated, first, by comparing the imputed variables with the left–right scores collected in three waves of the NKO study. Second, the imputed and the original NKO left–right variables were compared in terms of their associations with a broad set of attitudinal variables from the NKO dataset. The results show that one would reach similar conclusions when using the original or the imputed variable, albeit with an increased risk of making Type II errors.

TODOSIJEVIC_Presentation_light.pps

### Two issues on remote data access

Peter Jacobebbinghaus
IAB
At the Research Data Centre of the BA at the IAB, researchers can send in Stata programs to be processed there, with the log files sent back to them after a disclosure limitation review. This method of data access is called remote data access, and we use it to protect data confidentiality. Remote data access has two nonstandard requirements: efficient use of the computer resources and automation of parts of the disclosure limitation review. I would like to talk about how we deal with these requirements and discuss ways to improve them.