3rd Italian Stata Users Group meeting: Abstracts
Monday, October 9
Teaching data documentation with Stata
Svend Juul
Institute of Public Health, Aarhus C, Denmark
Abstract
I have seen many accidents such as the following:
- Not being able to reconstruct what modifications were made to a dataset
- Not being able to reproduce an analysis
- Not discovering errors due to lack of error checking
- Mistakes about what a numerical code represents
Such experiences led to a two-day course in data documentation, given first
in 1998 using SPSS and taught with Stata since 2002. The main target group is Ph.D.
students in the health sciences. Most of these students are not very
sophisticated about statistics and computing; they have other things on
their minds. However, most of their projects involve collecting their own data,
and safe handling of those data is crucial.
A key concept is the audit trail: as a bookkeeper, you must be able to go
back from the final balance sheet to the individual vouchers. This is
necessary for identifying and correcting your own errors, and it is required
for an audit. The same principle applies when working with research data.
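In Stata terms, the audit trail amounts to keeping every modification in a
do-file, so that the final dataset can always be regenerated from the raw
data. A minimal sketch (file and variable names are hypothetical):

    * generate.do: reproducible path from raw data to analysis file
    use rawdata, clear
    label define sexlbl 1 "male" 2 "female"
    label values sex sexlbl
    assert inlist(sex, 1, 2)              // error check: stop on illegal codes
    replace weight = . if weight == 999   // 999 was the missing-data code
    save cleandata, replace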
The main aim of the course is to help student researchers handle their
own data consistently and safely, thus preventing errors, mistakes, and
loss of data. It starts with advice on methods and safeguards when entering
data and ends with the student archiving data and documentation after
analysis of a moderately complex dataset. The quality of the archive is
assessed and feedback is given.
In the course I use the booklet Take Good Care of Your Data. You may
download it and the data used for the course from
http://www.folkesundhed.au.dk/uddannelse/software. The concepts also had
a strong influence on the contents and structure of
An Introduction to Stata
for Health Researchers (S. Juul, Stata Press, 2006).
The course gets good evaluations (mostly); frequent comments include the following: “I
should have taken this course a year ago” and “This course
should be compulsory.” About 80% pass the ‘driving test’
(the assessment of the quality of the students' archives); about 20% fail.
There are no sanctions against those who fail, but I tell them that they are
a hazard to their own data. Often they redo the exercise and ask for another
assessment—and most pass the second time.
The presentation will give more detail about the teaching experience and
will present useful Stata tools. My main reason for giving the presentation
is my feeling that these issues are important but frequently neglected.
A comparative analysis of dynamic panel-data estimators in the presence of
endogenous regressors
Giovanni S. F. Bruno
Università Bocconi
Abstract
Data used in applied econometrics are typically nonexperimental in nature,
making the assumption of exogeneity of regressors untenable and posing a
serious identification issue in the estimation of economic structural
relationships.
As long as the source of endogeneity is confined to unobserved heterogeneity
between groups (for example, time-invariant managerial ability in firm-level
labor demand equations), the availability of panel data can identify the
parameters of interest. If endogeneity is instead more pervasive, stemming
also from unobserved within-group variation (for example, a transitory
technology shock hitting both the labor demand of the firm and the wage
paid), then standard panel-data estimators are biased, and
instrumental-variables or generalized method-of-moments estimators provide
valid alternatives.
This paper extends the analysis in Bruno (2005, “Estimation and
inference in dynamic unbalanced panel-data models with a small number of
individuals”, Stata Journal 5: 473–500), focusing on dynamic panel-data
(DPD) models with endogenous regressors.
Various Monte Carlo experiments are carried out with my Stata command
xtarsim to assess the relative finite-sample performance of popular
DPD estimators: Arellano and Bond (xtabond, xtabond2); Blundell and Bond
(xtabond2); Anderson and Hsiao (ivreg, ivreg2, xtivreg, xtivreg2); and
LSDVC (xtlsdvc).
New versions of the commands xtarsim and xtlsdvc are also presented.
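To fix ideas, a hedged sketch of how three of the compared estimators might
be invoked (the variables n, w, and k, say employment, wage, and capital,
are hypothetical; xtabond2 is user-written):

    tsset firm year
    xtabond n w k, lags(1)             // Arellano and Bond difference GMM
    xtabond2 n L.n w k, gmmstyle(L.n) ivstyle(w k) robust   // Blundell and Bond system GMM
    ivreg D.n (LD.n = L2.n) D.w D.k    // Anderson and Hsiao IV in first differences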
A Stata procedure for the deduplication of individual records:
The INPS archive case
Orietta Dessy
Università Bocconi
Abstract
A common problem in collecting individual data is the duplication of
individuals. Often the code that identifies an individual is assigned on the
basis of variables such as name, surname, date and place of birth, and
address. This information is entered when the person is first registered
in the archive and is then checked and updated each time the same person is
contacted. The problem is that any mistake in reporting an individual’s
identifying information gives rise to a new identifying code, mistakenly
creating a new person in the archive.
Hence the need to deduplicate observations belonging to the same individual.
A program for solving this problem can be developed in stages. First, some
general coherence checks have to be implemented, using all the available
information in the archive (name, surname, fiscal or social security codes,
sex, date and place of birth, variables assessing the quality of the
collected information, etc.). Then, at a second stage, general criteria of
phonetic similarity can be used to deduplicate observations in an
appropriate probabilistic framework. Our program does not replace wrongly
entered data with the correct values; it simply generates individual
identifying codes that can considerably reduce the number of duplicated
individual records while at the same time anonymizing identities. This is
sufficient and useful for carrying out any kind of statistical, econometric,
and, in particular, panel-data analysis on data subject to privacy
restrictions. Further research should be devoted to recovering the correct
information, possibly using preconstructed universal vocabularies, so that
the program can be extended to cases where individuals’ details are needed
for the analysis. As an example of an application of our routine, we use the
Italian administrative archive of the National Institute of Social Security
(INPS).
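As an illustration of the phonetic stage, a minimal sketch using Stata’s
soundex() function (variable names are hypothetical; the actual routine is
more elaborate and probabilistic):

    gen sndx_surname = soundex(surname)
    gen sndx_name    = soundex(name)
    egen newid = group(sndx_surname sndx_name birthdate)  // one code per phonetic match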
Dynamic factor analysis with Stata
Alessandro Federici
Università di Roma “La Sapienza”
Abstract
For a detailed summary of this talk, please read
federici.pdf.
Abstract in Italian
federici_it.pdf
Estimating and modeling relative survival
Paul Dickman
Karolinska Institutet
Enzo Coviello
ASL BAT/1–Andria
Abstract
Relative survival is the method of choice for estimating patient survival
using data collected by population-based cancer registries. It is estimated
as the ratio of the observed survival (where all deaths are considered
events) to the expected survival of a comparable group from the general
population.
strs is a new Stata command that produces life-table estimates of relative
survival. Its options make it possible to estimate relative survival with
various methods and approaches, to standardize for age or other variables,
and to prepare files suitable for modeling the effect of covariates on
relative survival.
The basis of the algorithm is to split the data with stsplit to
obtain one observation for each individual and each life-table interval.
The data are then merged with a file containing expected survival
probabilities for a comparable general population and collapsed to
obtain one observation for each interval.
The syntax of the new command will be illustrated with three examples.
First, expected survival will be estimated by the Ederer I, Hakulinen, and
Ederer II methods. Second, relative survival will be estimated using the
cohort, period, and hybrid approaches. Finally, age-standardized relative
survival estimates will be obtained by traditional and alternative methods.
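A hedged sketch of a typical call (following the command’s documentation at
the site cited below; popmort.dta and the mergeby() variables are
assumptions):

    stset exit, origin(dx) failure(status==1) scale(365.24) id(id)
    strs using popmort, breaks(0(1)10) mergeby(_year sex _age) save(replace)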
The command can create two output datasets. The first, individ.dta,
contains one observation for each patient and each life-table interval; the
second, grouped.dta, contains one observation for each life-table
interval.
Using the grouped.dta dataset we will illustrate how simple it is to obtain
graphs of the basic functions and of the time trends in 5- and 10-year
relative survival.
Relative survival may be modeled using these output datasets. Relative
survival models can be fitted within the framework of generalized linear
models, using a nonstandard link function defined in rs.ado. We will show
how to obtain model estimates, so as to easily assess the effect of
covariates as relative excess risks and to adjust parameter estimates for
additional factors. In these models, the effect of follow-up time can be
fitted as a piecewise function, with one dummy variable for each time band,
or as a smooth function, e.g., by applying fractional polynomials with the
command mfp.
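A hedged sketch of this modeling step (the variable names d, d_star, and y,
for observed deaths, expected deaths, and person-time, as well as the
interval variable, are assumptions about the grouped output):

    use grouped, clear
    xi: glm d i.interval, family(poisson) link(rs d_star) lnoffset(y)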
The new command is available for download from
http://www.pauldickman.com/rsmodel/.
Abstract in Italian
coviello_it.pdf
Seminonparametric estimation of univariate and bivariate binary-choice models
Giuseppe de Luca
Università di Roma—Tor Vergata—ISFOL
Abstract
In this paper, we use the seminonparametric (SNP) approach proposed by
Gallant and Nychka (1987) to estimate univariate and bivariate binary-choice
models. After discussing issues related to identification and estimation, we
provide a set of new Stata commands for semiparametric estimation of three
types of models. The first is a semiparametric binary-choice model that
nests the probit model. The corresponding SNP estimator is a faster and
improved version of the SNP estimator for ordered-choice models implemented in
Stata by Stewart (2004). The second and third models are SNP
generalizations of the bivariate probit model with and without sample
selection, respectively. The proposed estimators are √n-consistent and
asymptotically normal for the model parameters of interest under weak
distributional assumptions. The use of these commands is illustrated
through empirical applications.
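For reference, the construction at the heart of the SNP approach (standard
in Gallant and Nychka 1987; the notation here is ours): the unknown error
density is approximated by a squared polynomial times the normal density,

    h(\varepsilon) = \frac{[P_K(\varepsilon)]^2 \, \phi(\varepsilon)}{\int [P_K(u)]^2 \, \phi(u) \, du},

where P_K is a polynomial of degree K and \phi is the standard normal
density; letting K grow suitably with the sample size yields the stated
consistency under weak distributional assumptions.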
Abstract in Italian
deluca_it.pdf
Sensitivity analysis of epidemiological results
Nicola Orsini
Karolinska Institutet
Rino Bellocco
Karolinska Institutet, Università di Milano—Bicocca
Sander Greenland
UCLA School of Public Health
Alicja Wolk
Karolinska Institutet
Abstract
In observational epidemiological studies, potential biases due to
uncontrolled confounding, nonrandom subject selection, and measurement
error are usually addressed only qualitatively, in the discussion of study
results. This practice can be problematic because misleading inferences may
result from inadequate accounting for the effects of these biases.
Quantitative assessment of bias can provide valuable insight into the
importance of various sources of bias and thus enhance the contributions of
epidemiology. Such assessments may demonstrate that certain sources of bias
cannot possibly explain a study result, or that a bias explanation cannot be
ruled out.
Although some basic and advanced methods are already known in the
epidemiological literature, to our knowledge they have seen only a small
number of published applications, and this will certainly remain the case
until major software packages incorporate them.
Therefore, we will present a new user-friendly Stata command for sensitivity
analysis of biases (uncontrolled confounding and classification errors) in
observational epidemiological studies (cohort and case–control), with
applications to the research area of lifestyle habits (diet, physical
activity) and health.
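As a flavor of the simplest such analysis, a hedged sketch of the classical
external adjustment for a single unmeasured binary confounder (the numbers
are hypothetical):

    scalar RRobs = 1.8    // observed rate ratio
    scalar RRcd  = 2.0    // confounder-disease rate ratio
    scalar p1    = 0.4    // confounder prevalence among the exposed
    scalar p0    = 0.2    // confounder prevalence among the unexposed
    scalar bias  = (RRcd*p1 + 1 - p1)/(RRcd*p0 + 1 - p0)
    display "externally adjusted RR = " %5.3f RRobs/bias

The command to be presented automates and extends calculations of this
kind.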
Abstract in Italian
orsini_it.pdf
Scheming your way to consistent graphs
Vince Wiggins
StataCorp, College Station, TX
Abstract
If you find yourself repeatedly specifying the same options on graph
commands, you should write a graphics scheme. A scheme is nothing more than
a file containing a set of rules specifying how you want your graphs to
look. From the size of fonts used in titles and the color of lines and
markers in plots to the placement of legends and the number of default ticks
on axes, almost everything about a graph is controlled by the rules in a
graphics scheme. We will look at how to create your own graphics schemes and
where to find out more about all the rules available in schemes. The first
scheme we create will be only a few lines long, yet will produce graphs
distinctly different from any existing scheme.
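As a taste, a minimal sketch of what such a file might contain (entry names
follow the patterns documented in help scheme files; this is an assumption
about what the talk’s example will look like). Save it as
scheme-myscheme.scheme somewhere along the ado-path:

    #include s2color
    color background white
    color p1 navy
    color p2 maroon

Any graph drawn with the scheme(myscheme) option, or after set scheme
myscheme, then picks up these rules.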
Using Stata to support the daily activities of a university curriculum:
- Elaboration, printing, and correction of written tests (open answer or multiple choice) integrating Stata and EpiData
- Use of administrative datasets and elaboration of students’ progression data
Giovanni Capelli
Bruno Federico
Università degli Studi Cassino
Abstract
1) Elaboration, printing, and correction of written tests (open answer or
multiple choice) integrating Stata and EpiData
For the final evaluation of university students, many courses use written
tests consisting of multiple-choice items or series of open-answer
questions to be completed by each student. To prevent students from
cheating or cooperating during the tests, teachers often have to prepare
many different versions of the written test, an activity that places a
significant burden on their work time. Moreover, existing software modules
are often scarcely customizable, if at all. To respond to these needs, we
wrote some Stata routines (some of which are still works in progress).
For open-answer tests, the routines randomly choose questions from the course
chapters and can prepare many different versions of the test in plain-text
format, ready to be printed. For multiple-choice tests, the routines randomly
choose the questions, their order, and the order of the answers; prepare
printable versions of the test in SMCL format; automatically prepare EpiData
files (.QES and .CHK files) for entering students’ answers; perform test
correction; compute student scores; and can be used to evaluate the overall
frequency of wrong answers to any specific question in the database.
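A hedged sketch of the random-selection step (dataset and variable names are
hypothetical; uniform() is the Stata 9 random-number function):

    use questions, clear          // one observation per question, with a chapter id
    set seed 20061009
    gen u = uniform()
    sort chapter u
    by chapter: keep if _n <= 3   // e.g., three random questions per chapter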
2) Use of administrative datasets and elaboration of students’
progression data
Personal and curriculum data of university students (high school, residence,
curriculum plan, date and outcome of the examination for each course) are
archived in administrative databases managed by the universities’ Offices
for Student Affairs, using software (e.g., G.I.S.S., S3) whose interfaces
are designed and maintained by external software houses. The new guidelines
introduced by the National Ministry of University, including a new system
that calculates the yearly national funding (F.F.O.) of universities from
the learning progression of their students, have made student administrative
databases a relevant source of data for evaluating and optimizing the
learning process. To meet these needs, in cooperation with the Office for
Student Affairs of the University of Cassino, we are working on Stata
routines for extracting administrative data through SQL queries (possibly
using the odbc command) and, in particular, for elaborating these data into
tools for managing the course curriculum that the administrative software
does not provide (lists of students who have yet to pass specific
examinations, analysis of the time taken to pass examinations,
identification of “bottlenecks” in students’ curriculum progression, etc.).
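A hedged sketch of the extraction step (the DSN, table, and column names are
assumptions):

    odbc load, exec("SELECT student_id, course, exam_date, grade FROM exams") dsn("students") clear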
Abstract in Italian
capelli_it.pdf
School-to-work transitions for university graduates in Mauritius: A duration model approach
V. C. Jaunky
A. J. Khadaroo
University of Mauritius
Abstract
The time students take to obtain a job after completing their studies is a
crucial indicator of the efficiency of the labor market as well as of the
overall state of an economy. This paper examines the school-to-work
transition (STWT) of University of Mauritius (UoM) graduates who completed
their degrees during the period 1995–2000. To date, STWT studies have
focused on developed countries, given the scarcity of relevant data in the
developing world. This study constitutes the first attempt to model the
duration of job search in Mauritius, based on data gathered by the Tertiary
Education Commission (TEC). We determine the direction of the duration
dependence and uncover the observable characteristics that affect the
job-search duration. We use a variety of survival frameworks and control for
unobserved heterogeneity. The gamma-frailty log-normal model is found to fit
the data best. An inverted U-shaped baseline hazard prevails in the graduate
labor market. A higher age at graduation and higher father’s education
increase graduates’ job-search time, whereas higher mother’s education
and postgraduate training shorten it. Management and engineering graduates
experience shorter job-search periods than science and social science
graduates. In addition, graduates from urban areas find jobs sooner than
their rural counterparts. Male and female graduates experience, on average,
the same job-search duration.
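A hedged sketch of the chosen specification (variable names are
hypothetical):

    stset duration, failure(employed)
    streg age_grad father_educ mother_educ postgrad urban, ///
        distribution(lognormal) frailty(gamma)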
L’applicazione delle interrupted time series ad un campione di dati
di tipo panel per la valutazione dell’efficacia di un training di
formazione medica
(The application of interrupted time series to a panel-data sample to
evaluate the effectiveness of a medical training program)
Maria Angela Mazzi
Michela Rimondini
Christa Zimmermann
Università degli Studi di Verona
Abstract
Biomedical research cannot always rely on the typical experimental design of
randomized clinical trials. It may be difficult to enroll an optimal number
of subjects (the sample size indicated by the power calculation) when the
phenomenon of interest is rare or particularly complex (the experimenter
cannot keep under control all the variables influencing it), or when the
high cost per observation unit cannot be covered. In such cases, it is
appropriate to proceed with studies based on quasi-experimental designs.
Objectives: To handle the data from a quasi-experimental design in a study
assessing the effectiveness of a training package on the use of
communication techniques in psychiatry (Verona Communication in Psychiatry
Training, VR-COPSYT), and to verify the applicability of an autoregressive
linear model for panel data.
Study design: For each of the 10 residents enrolled in the Medical
Psychology course of the Postgraduate School of Psychiatry of the
Università degli Studi di Verona, a series of 12 video recordings of first
visits was collected (with simulated patients, to contain the confounding
effect of gender and diagnosis), comprising 8 pre-training and 4
post-training observations, for a total of 120 observations.
Methods: To quantify the physician’s clinical skill and summarize each
interview in a single value, a ratio-type indicator was constructed (based
on the ratio of patient-centered to doctor-centered statements). The
literature proposes the interrupted time-series technique for testing
hypotheses in quasi-experimental settings (Campbell, 1965). The aim is to
detect changes in the individual trajectories at the time of the treatment
(attendance of the course) introduced by the experimenter. A first-order
autoregressive linear model with fixed effects makes it possible to estimate
the average profile of the class of residents while taking into account the
peculiar characteristics of each physician (his or her interviewing style).
The Stata command xtregar was used to fit this model, as sketched below.
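A hedged sketch of this specification (variable names are hypothetical; the
training falls after the eighth visit):

    tsset doctor visit
    gen post  = visit > 8
    gen trend = (visit - 8) * post      // post-training change in slope
    xtregar ratio visit post trend, fe  // AR(1) fixed-effects model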
Results: The effect of the training is quantified by the change between the
pre- and post-training phases in the intercept (a jump in the curve
reflecting a sudden change in the individual profile) and in the slope (a
progressive change in the trend of the physician’s profile following the
training) of the interrupted time series. The estimates obtained for the
observed sample indicate a change in the intercept only, interpreted as the
physicians’ attempt to apply the techniques suggested by the training.
Conclusions: The challenge of borrowing the technique of autoregressive
models for panel data from econometrics and applying it in a biomedical
context appears to have been met. How the specific limitations of biomedical
studies may affect the applicability of the technique remains to be clearly
delineated.
Tuesday, October 10
Survey data analysis in Stata
Abstract
The survey data analysis training will be given by Jeff Pitblado of
StataCorp. During the training seminar, Jeff will discuss Stata’s
features for analyzing survey and correlated data and will explain how and
when to use the three major variance estimators for survey and correlated
data: the linearization estimator, balanced repeated replications, and the
clustered jackknife (the last two added in Stata 9).
Jeff will also discuss sampling designs and stratification, including
Stata’s new features for estimation with data from multistage designs
and for applying poststratification. A theme of the seminar will be how
you can make inferences with correct coverage from data collected by
single-stage or multistage surveys or from data with inherent correlation,
such as data from longitudinal studies.
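By way of illustration, a hedged sketch of the kind of setup and estimation
the seminar covers (variable names are hypothetical; Stata 9 syntax):

    svyset psu [pweight = finalwgt], strata(stratum) vce(linearized)
    svy: mean income
    svy jackknife: mean income    // clustered-jackknife standard errors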
The survey data course will be given in English.
Analisi dei dati panel in Stata (Panel-data analysis in Stata; in Italian)
Abstract
The objective of this course is to give participants an introduction to the
theoretical and applied tools needed to carry out empirical analyses of
panel data on their own. The course covers the following topics; a sketch of
the corresponding core commands follows the list:
- Managing panel data in Stata
- The fixed-effects regression model
- The random-effects regression model
- Tests for correlation of the errors, over time and across individuals, in panel models
- Tests for heteroskedasticity
- Correcting the standard errors for correlation and heteroskedasticity of the errors
- Panel models with time effects
- Unbalanced data
- Models with endogenous explanatory variables: instrumental-variables panel estimators
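A hedged sketch of the core commands behind these topics (variable names are
hypothetical):

    tsset firm year
    xtreg y x1 x2, fe           // fixed-effects (within) estimator
    estimates store fe
    xtreg y x1 x2, re           // random-effects GLS estimator
    hausman fe .                // compare the two specifications
    xtivreg y x2 (x1 = z1), fe  // instrumental-variables estimator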