Last updated: 10 October 2006
Via Nazionale 22
Institute of Public Health, Aarhus C, Denmark
I have seen many accidents such as the following:
- Not being able to reconstruct what modifications were made to a dataset
- Not being able to reproduce an analysis
- Not discovering errors due to lack of error checking
- Mistakes about what a numerical code represents
Such experiences led to a 2-day course in data documentation, first given in 1998 using SPSS and taught with Stata since 2002. The main target group is Ph.D. students in the health sciences. Most of these students are not very sophisticated concerning statistics and computing; they have other things on their minds. However, most of their projects involve collection of their own data, safe handling of which is crucial.
A key concept is the audit trail: as a bookkeeper you must be able to go back from the final balance sheet to the individual vouchers. This is necessary to identify and correct your own errors, and it is a requirement for audit. The same principle applies when working with research data.
The main aim of the course is to help student researchers handle their own data consistently and safely, thus preventing errors, mistakes, and loss of data. It starts with advice on methods and safeguards when entering data and ends with the student archiving data and documentation after analysis of a moderately complex dataset. The quality of the archive is assessed and feedback is given.
In the course I use the booklet Take Good Care of Your Data. You may download it and the data used for the course from http://www.epidata.dk/downloads/takecare.pdf. The concepts also had a strong influence on the contents and structure of An Introduction to Stata for Health Researchers (S. Juul, Stata Press, 2006).
The course gets good evaluations (mostly); frequent comments include the following: “I should have taken this course a year ago” and “This course should be compulsory.” About 80% pass the ‘driving test’ (the assessment of the quality of the students' archives); about 20% fail. There are no sanctions against those who fail, but I tell them that they are a hazard to their own data. Often they redo the exercise and ask for another assessment—and most pass the second time.
The presentation will give more detail about the teaching experience. Also, useful Stata tools will be presented. My main reason for giving the presentation is the feeling that these are important, but frequently neglected, issues.
Giovanni S. F. Bruno
Data used in applied econometrics are typically nonexperimental in nature, making the assumption of exogeneity of regressors untenable and posing a serious identification issue in the estimation of economic structural relationships.
As long as the source of endogeneity is confined to unobserved heterogeneity between groups (for example, time-invariant managerial ability in firm-level labor demand equations), the availability of panel data can identify the parameters of interest. If endogeneity is instead more pervasive, stemming also from unobserved within-group variation (for example, a transitory technology shock that simultaneously hits the firm's labor demand and the wage paid), then standard panel-data estimators are biased, and instrumental-variable or generalized method-of-moments estimators provide valid alternatives.
This paper extends the analysis in Bruno, G. S. F. (2005), “Estimation and inference in dynamic unbalanced panel-data models with a small number of individuals”, Stata Journal 5: 473–500, by focusing on dynamic panel-data (DPD) models with endogenous regressors.
Various Monte Carlo experiments are carried out with my Stata command xtarsim to assess the relative finite-sample performance of some popular DPD estimators, such as Arellano and Bond (xtabond, xtabond2); Blundell and Bond (xtabond2); Anderson and Hsiao (ivreg, ivreg2, xtivreg, xtivreg2); and LSDVC (xtlsdvc).
New versions of the commands xtarsim and xtlsdvc are also presented.
A common problem in collecting individual data is the duplication of individuals. The code that identifies an individual often depends on variables such as name, surname, date and place of birth, and address. This information is entered when the person is first registered in the archive and is then checked and updated each time the same person is contacted. Any mistake in reporting an individual’s identifying information therefore gives rise to a new identifying code, mistakenly creating a new person in the archive, so observations for the same individual must be deduplicated.
A program for solving this problem can be built in stages. First, general coherence checks are implemented using all the information available in the archive (name, surname, fiscal or social security codes, sex, date and place of birth, variables assessing the quality of the collected information, etc.). Second, general criteria of phonetic assonance are used to deduplicate observations in an appropriate probabilistic setting.
Our program does not replace wrongly entered data with the correct values; it simply generates individual identifying codes that appreciably reduce the duplication of individual records while making identities anonymous. This is sufficient for carrying out any kind of statistical or econometric analysis, and in particular panel-data analysis, on data subject to privacy restrictions. Further research should address the possibility of recovering the correct information, possibly using preconstructed universal vocabularies, so that the program can be extended to cases where individuals’ details are needed for the analysis.
As an example of application of our routine, we use the Italian administrative archive of the National Institute of Social Security (INPS).
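The two-stage idea described above (coherence checks, then phonetic matching that yields anonymous codes) can be loosely illustrated in Python. This is only a sketch and not the authors' program: the Soundex-style key, the SHA-256 hash, and the field names `surname`, `name`, and `birth` are assumptions made for illustration, and the matching here is deterministic rather than probabilistic.

```python
import hashlib

# Soundex digit groups (a standard phonetic coding, assumed here for illustration).
_CODES = {ch: digit
          for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6")]
          for ch in letters}


def soundex(word):
    """Return a 4-character Soundex key, so that similar-sounding names collide."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":          # h and w do not separate repeated codes
            prev = code
    return (key + "000")[:4]


def anonymous_id(record):
    """Hash the phonetic keys and the birth date into an anonymous person code."""
    raw = "|".join([soundex(record["surname"]), soundex(record["name"]),
                    record["birth"]])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


# Hypothetical records: the second is a misspelt duplicate of the first.
records = [
    {"surname": "Rossi", "name": "Mario", "birth": "1970-01-01"},
    {"surname": "Rosi", "name": "Mario", "birth": "1970-01-01"},
    {"surname": "Rossi", "name": "Mario", "birth": "1971-01-01"},
]
ids = [anonymous_id(r) for r in records]
```

In this toy example the first two records receive the same code and the third a different one. A real application would have to weigh the risk of merging distinct people who share a phonetic key and a birth date, which is where a probabilistic framework comes in.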
Università di Roma “La Sapienza”
For a detailed summary of this talk, please read federici.pdf.
Relative survival is the method of choice for estimating patient survival using data collected by population-based cancer registries. It is estimated as the ratio of the observed survival (where all deaths are considered events) to the expected survival of a comparable group from the general population.
strs is a new Stata macro producing life table estimates of relative survival. Its options make it possible to apply various estimation methods or approaches, to standardize for age or other variables, and to prepare suitable files for modeling the effect of covariates on relative survival.
The basis of the algorithm is to split the data using stsplit to obtain one observation for each individual and for each life table interval. Thereafter data are merged with a file containing expected survival probabilities from a comparable general population and are collapsed to obtain one observation for each interval.
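The split, merge, and collapse steps can be mimicked in a few lines of Python. The sketch below is not the strs implementation: it assumes annual intervals, an actuarial estimate of observed survival, a hypothetical population table `pop_surv` indexed by attained age only, and expected survival averaged over the patients at risk in each interval (broadly the Ederer II idea).

```python
from collections import defaultdict


def relative_survival(patients, pop_surv, n_intervals):
    """Life-table relative survival with annual intervals.

    patients: list of (follow_up_years, died, age_at_diagnosis)
    pop_surv: dict mapping attained age -> one-year survival probability
              in the general population (hypothetical table)
    """
    # Step 1: "split" each patient into one record per interval entered.
    records = []
    for time, died, age in patients:
        for j in range(n_intervals):
            if time <= j:
                break
            exits = time <= j + 1
            records.append({
                "interval": j,
                "death": int(exits and died),
                "censored": int(exits and not died),
                # Step 2: "merge" the expected probability for the attained age.
                "expected": pop_surv[age + j],
            })

    # Step 3: "collapse" to one row per interval.
    grouped = defaultdict(lambda: {"n": 0, "d": 0, "w": 0, "exp_sum": 0.0})
    for r in records:
        g = grouped[r["interval"]]
        g["n"] += 1
        g["d"] += r["death"]
        g["w"] += r["censored"]
        g["exp_sum"] += r["expected"]

    rows, cumulative = [], 1.0
    for j in sorted(grouped):
        g = grouped[j]
        observed = 1 - g["d"] / (g["n"] - g["w"] / 2)   # actuarial estimate
        expected = g["exp_sum"] / g["n"]                # mean over those at risk
        cumulative *= observed / expected
        rows.append({"interval": j, "observed": observed,
                     "expected": expected,
                     "relative": observed / expected,
                     "cumulative_relative": cumulative})
    return rows


# Made-up data: four patients followed for up to two annual intervals.
patients = [(0.5, True, 60), (1.5, False, 60), (3.0, False, 70), (2.0, True, 70)]
pop_surv = {60: 0.98, 61: 0.98, 70: 0.95, 71: 0.95}
table = relative_survival(patients, pop_surv, n_intervals=2)
```

Each row of `table` corresponds to one life-table interval, which is the shape of the grouped output the abstract describes. A real analysis would match expected probabilities on sex and calendar year as well as age.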
The syntax of the new command will be illustrated using three examples. First, expected survival will be estimated by the Ederer I, Hakulinen, and Ederer II methods. Second, relative survival will be estimated using the cohort, period, and hybrid approaches. Finally, age-standardized relative survival estimates will be obtained by traditional and alternative methods.
The command can create two output datasets. The first, individ.dta, contains one observation for each patient for each life table interval; the second dataset, grouped.dta, contains one observation for each life-table interval.
Using the grouped.dta dataset we will illustrate how simple it is to obtain graphs of the basic functions and of the time trends in 5- and 10-year relative survival.
Relative survival may be modeled using these output datasets. Relative survival models can be fitted within the framework of generalized linear models by using a nonstandard link function defined in rs.ado. We will show how to obtain model estimates that express the effect of covariates as relative excess risk and how to adjust parameter estimates for additional factors. In these models, the effect of follow-up time can be fitted as a piecewise function using dummy variables, one for each time band, or as a smooth function, e.g., by applying fractional polynomials with the command mfp.
Giuseppe de Luca
Università di Roma, Tor Vergata, ISFOL
In this paper, we use the seminonparametric (SNP) approach proposed by Gallant and Nychka (1987) to estimate univariate and bivariate binary-choice models. After discussing issues related to identification and estimation, we provide a set of new Stata commands for semiparametric estimation of three types of models. The first is a semiparametric binary-choice model that nests the probit model. The corresponding SNP estimator is a faster and improved version of the SNP estimator for ordered-choice models implemented in Stata by Stewart (2004). The second and third models are instead SNP generalizations of the bivariate probit model with and without sample selection respectively. The proposed estimators are √n-consistent and asymptotically normal for the model parameters of interest under weak distributional assumptions. The use of these commands is illustrated through empirical applications.
Karolinska Institutet, Università di Milano—Bicocca
UCLA School of Public Health
In observational epidemiological studies, potential biases due to uncontrolled confounding, nonrandom subject selection, and measurement errors are usually addressed only qualitatively in the discussion of study results. This practice can be problematic because misleading inferences may result from inadequate accounting for the effects of these biases.
Quantitative assessment of bias can provide valuable insight into the importance of various sources of bias, and thus enhance the contributions of epidemiology. Such assessments may demonstrate that certain sources of bias cannot possibly explain a study result, or that a bias explanation cannot be ruled out.
Although some basic and advanced methods are already known in the epidemiological literature, to our knowledge they have seen only a small number of published applications, and this will certainly remain the case until major software packages incorporate them.
Therefore, we will present a new user-friendly Stata command for sensitivity analysis of biases (uncontrolled confounding and classification errors) in observational epidemiological studies (cohort and case–control), with applications to the research area of lifestyle habits (diet, physical activity) and health.
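To give a flavour of what a quantitative bias assessment involves, here is a Python sketch of a classic external adjustment of a risk ratio for an unmeasured binary confounder. It is not the command presented in the talk: the function name and arguments are hypothetical, and the formula assumes that the confounder-disease risk ratio is the same among exposed and unexposed subjects.

```python
def externally_adjusted_rr(rr_observed, rr_confounder, p_exposed, p_unexposed):
    """Adjust an observed risk ratio for an unmeasured binary confounder.

    rr_confounder: assumed risk ratio linking the confounder to the disease
    p_exposed:     assumed prevalence of the confounder among the exposed
    p_unexposed:   assumed prevalence of the confounder among the unexposed
    """
    # Bias factor: how much the confounder alone would inflate the risk ratio.
    bias = ((rr_confounder * p_exposed + 1 - p_exposed) /
            (rr_confounder * p_unexposed + 1 - p_unexposed))
    return rr_observed / bias


# Hypothetical scenario: observed RR of 2.0, confounder triples the risk and
# is twice as common among the exposed (50% versus 25%).
adjusted = externally_adjusted_rr(2.0, 3.0, 0.50, 0.25)

# Sensitivity analysis: repeat over a grid of assumed confounder strengths.
grid = {rr_cd: externally_adjusted_rr(2.0, rr_cd, 0.50, 0.25)
        for rr_cd in (1.0, 2.0, 3.0, 5.0)}
```

In this scenario the adjusted risk ratio is 1.5, so a confounder of the assumed strength could explain part, but not all, of the observed association. Varying the assumptions over a grid shows which bias scenarios could or could not account for a result.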
StataCorp, College Station, TX
If you find yourself repeatedly specifying the same options on graph commands, you should write a graphics scheme. A scheme is nothing more than a file containing a set of rules specifying how you want your graphs to look. From the size of fonts used in titles and the color of lines and markers in plots to the placement of legends and the number of default ticks on axes, almost everything about a graph is controlled by the rules in a graphics scheme. We will look at how to create your own graphics schemes and where to find out more about all the rules available in schemes. The first scheme we create will be only a few lines long, yet will produce graphs distinctly different from any existing scheme.
Università degli Studi di Cassino
1) Elaboration, printing and correction of written tests (open answer or multiple choice) integrating Stata and EPIDATA
For the final evaluation of university students, many courses use written tests consisting of multiple-choice questions or a series of open-answer questions to be completed by each student. To prevent students from cheating or cooperating during the test, teachers often have to prepare many different versions of it, which places a significant burden on their time. Moreover, existing software modules often offer little or no customization. To respond to these needs, we wrote some Stata routines (some of which are still works in progress). For open-answer tests, the routines randomly choose questions from the course chapters and can prepare many different versions of the test in plain-text format, ready to be printed. For multiple-choice tests, the routines randomly choose the questions, their order, and the order of the answers; prepare printable versions of the test in .SMCL format; automatically prepare EPIDATA files (.QES and .CHK files) for entering students’ answers; mark the tests; compute student scores; and can be used to evaluate the overall frequency of wrong answers to any specific question in the database.
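The randomization step for multiple-choice tests can be sketched in Python as follows. This only illustrates the idea and is not the authors' Stata routines: the question bank, the field names, and the use of a seed per version are all assumptions.

```python
import random


def make_version(bank, n_questions, seed):
    """Build one test version: random questions, random order, shuffled answers."""
    rng = random.Random(seed)                 # one seed per version, reproducible
    chosen = rng.sample(bank, n_questions)    # random subset in random order
    version = []
    for q in chosen:
        options = list(q["options"])
        rng.shuffle(options)
        version.append({
            "text": q["text"],
            "options": options,
            # Position of the right answer in this version, kept for marking.
            "key": options.index(q["correct"]),
        })
    return version


# Hypothetical question bank.
bank = [
    {"text": "Which measure is least affected by outliers?",
     "options": ["mean", "median", "variance"], "correct": "median"},
    {"text": "Which design assigns treatment at random?",
     "options": ["cohort study", "case-control study", "randomized trial"],
     "correct": "randomized trial"},
    {"text": "What does a p-value below 0.05 conventionally indicate?",
     "options": ["statistical significance", "clinical relevance", "causality"],
     "correct": "statistical significance"},
]
versions = [make_version(bank, n_questions=2, seed=s) for s in range(3)]
```

Storing the answer key with each version is what makes automatic marking possible later: a student's answer sheet only needs the version number and the chosen option positions.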
2) Use of administrative datasets and analysis of student progression data
Personal and curriculum data of university students (high school, residence, curriculum plan, date and outcome of the examination for each course) are archived in administrative databases managed by each university’s Office for Student Affairs, using software (e.g., G. I. S. S., S3) whose interface is designed and maintained by specialized software houses. New guidelines from the National Ministry of University, including a new system for calculating the Yearly National Funding (F. F. O.) for universities based on students’ learning progression, have made student administrative databases a potential source of relevant data for evaluating and optimizing the learning process. To meet these needs, in cooperation with the Office for Student Affairs of the University of Cassino, we are working on Stata routines that extract administrative data through SQL queries (possibly using the odbc command) and process these data to provide course-curriculum management tools not included in the administrative software (lists of students who have yet to pass specific examinations, analysis of the time taken to pass examinations, identification of “bottlenecks” in students’ curriculum progression, etc.).
V. C. Jaunky
A. J. Khadaroo
University of Mauritius
The time students take to obtain a job after completing their studies is a crucial indicator of the efficiency of the labor market as well as of the overall state of an economy. This paper examines the school-to-work transition (STWT) for University of Mauritius (UoM) graduates who completed their degree during the period 1995–2000. To date, STWT studies have focused on developed countries, given the scarcity of relevant data in the developing world. This study constitutes the first attempt to model the duration of job search in Mauritius based on data gathered by the Tertiary Education Commission (TEC). We determine the direction of the duration dependence and uncover the observable characteristics that affect the job search duration. We use a variety of survival frameworks and control for unobserved heterogeneity. The gamma frailty log-normal model is found to fit the data best. An inverted U-shaped baseline hazard prevails in the graduate labor market. A higher age at graduation and a higher level of father’s education increase the job search time for graduates, whereas a higher level of mother’s education and postgraduate training lead to a shorter job search time. Management and engineering graduates experience a shorter job search period than science and social science graduates. In addition, graduates from urban areas have a shorter job search time than their rural counterparts. Male and female graduates experience, on average, the same job search duration.
Maria Angela Mazzi
Università degli Studi di Verona
Biomedical research cannot always rely on the typical experimental design of randomized clinical trials. It can be difficult to enroll an optimal number of subjects (the sample size indicated by the power analysis) when the phenomenon of interest is rare or particularly complex (the investigator cannot control all the variables that influence it) or when the cost per unit of observation is too high for the available budget. In such cases, it is appropriate to conduct studies with quasi-experimental designs.
Objectives: To analyze data from a quasi-experimental design in a study evaluating the effectiveness of a training package on the use of communication techniques in psychiatry (Verona Communication in Psychiatry Training, VR-COPSYT), and to verify the applicability of a linear autoregressive model for panel data.
Study design: For each of the 10 residents enrolled in the Medical Psychology course of the School of Specialization in Psychiatry at the Università degli Studi di Verona, a series of 12 videotaped first consultations was collected (with simulated patients, to limit the confounding effect of gender and diagnosis), comprising 8 pre-training and 4 post-training observations, for a total of 120 observations.
Methods: To quantify the physician’s clinical skill and summarize each interview in a single value, we constructed a ratio indicator (the ratio of patient-centered to doctor-centered statements). The literature proposes the interrupted time series technique for testing hypotheses in quasi-experimental settings (Campbell, 1965). The aim is to check for changes in individual trajectories at the time of the treatment (attendance of the course) introduced by the investigator. A first-order autoregressive linear model with fixed effects makes it possible to estimate the average profile of the class of residents while accounting for the particular characteristics of each physician (the style of conducting the interview). The model was fitted with the Stata command xtregar.
Results: The effect of the training is quantified by the change between the pre- and post-training phases in the intercept (a jump in the curve following an abrupt change in the individual profile) and in the slope (a progressive change in the trend of the physician’s profile following the training) of the interrupted time series. The estimates for the observed sample indicate a change in the intercept only, interpreted as the physicians’ attempt to apply the techniques suggested by the training.
Conclusions: Borrowing autoregressive panel-data models from econometrics and applying them in a biomedical context appears to be effective. It remains to be clarified how the specific limitations of biomedical studies may affect the applicability of the technique.
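The interrupted time series design in this abstract (10 residents, 8 pre-training and 4 post-training sessions) is usually coded with three regressors: a time trend, a post-training indicator whose coefficient captures the change in intercept, and a time-since-training counter whose coefficient captures the change in slope. The Python sketch below builds such a design. The abstract does not give the exact specification, so the variable names `time`, `post`, and `time_after` and this particular coding are assumptions.

```python
def its_design(n_pre, n_post):
    """Regressors for one subject's interrupted time series."""
    rows = []
    for t in range(1, n_pre + n_post + 1):
        post = int(t > n_pre)                 # 1 after the training, else 0
        rows.append({
            "time": t,                        # underlying trend
            "post": post,                     # change in intercept (jump)
            "time_after": (t - n_pre) * post  # change in slope
        })
    return rows


def panel_design(n_subjects, n_pre, n_post):
    """Stack the same design for every subject, adding a subject identifier."""
    return [dict(subject=i, **row)
            for i in range(1, n_subjects + 1)
            for row in its_design(n_pre, n_post)]


# 10 residents x (8 pre-training + 4 post-training) sessions.
panel = panel_design(n_subjects=10, n_pre=8, n_post=4)
```

With the outcome added, a panel like this could be passed to a fixed-effects estimator with first-order autoregressive errors. A result such as the one reported, a change in intercept only, would correspond to a nonzero coefficient on `post` and a coefficient near zero on `time_after`.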
The survey data analysis training will be given by Jeff Pitblado of StataCorp. During the training seminar, Jeff will discuss Stata’s features for analyzing survey and correlated data and will explain how and when to use the three major variance estimators for survey and correlated data: the linearization estimator, balanced repeated replications, and the clustered jackknife (the last two added in Stata 9).
Jeff will also discuss sampling designs and stratification, including Stata’s new features for estimation with data from multistage designs and for applying poststratification. A theme of the seminar will be how you can make inferences with correct coverage from data collected by single-stage or multistage surveys or from data with inherent correlation, such as data from longitudinal studies.
The survey data course will be given in English.
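As a small illustration of one of the three variance estimators mentioned above, the following Python sketch computes a delete-one-cluster jackknife variance for a mean. It is a simplified sketch and not Stata's implementation: it assumes a single stratum, no sampling weights, no finite-population correction, and at least two clusters.

```python
def jackknife_mean(clusters):
    """Mean and delete-one-cluster jackknife variance.

    clusters: list of lists, one inner list of observations per cluster (PSU).
    """
    total = sum(sum(c) for c in clusters)
    count = sum(len(c) for c in clusters)
    estimate = total / count

    n = len(clusters)
    # Re-estimate the mean with each cluster left out in turn.
    replicates = [(total - sum(c)) / (count - len(c)) for c in clusters]
    center = sum(replicates) / n
    variance = (n - 1) / n * sum((r - center) ** 2 for r in replicates)
    return estimate, variance


# With one observation per cluster, the result equals the usual s^2 / n.
estimate, variance = jackknife_mean([[1.0], [2.0], [3.0], [4.0]])

# Correlated data: whole clusters are dropped, not single observations.
est_c, var_c = jackknife_mean([[1.0, 1.2], [2.9, 3.1], [5.0, 5.2]])
```

Dropping whole clusters rather than single observations is what lets the estimator reflect within-cluster correlation. Treating the six observations in the second example as independent would understate the variance.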
The aim of this course is to give participants an introduction to the theoretical and applied tools needed to carry out empirical analyses with panel data independently. The course covers the following topics:
- Managing panel data in Stata
- The fixed-effects regression model
- The random-effects regression model
- Tests for correlation of the errors, over time and across individuals, in panel models
- Tests for heteroskedasticity
- Correcting standard errors for correlation and heteroskedasticity in the errors
- Panel models with time effects
- Unbalanced data
- Models with endogenous explanatory variables: instrumental-variables panel estimators
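For readers new to the first model on this list, the fixed-effects estimator can be illustrated with the within transformation: each variable is expressed as a deviation from its unit-specific mean, which removes any time-invariant unit effect, and the slope is then obtained by ordinary least squares on the demeaned data. The Python sketch below handles a single regressor. It is an illustration with made-up data, not course material, and it works with unbalanced panels.

```python
from collections import defaultdict


def within_slope(rows):
    """Fixed-effects (within) slope for one regressor.

    rows: list of (unit, x, y) tuples; units may have different numbers of rows.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])   # per unit: sum x, sum y, count
    for unit, x, y in rows:
        s = sums[unit]
        s[0] += x
        s[1] += y
        s[2] += 1

    num = den = 0.0
    for unit, x, y in rows:
        sx, sy, n = sums[unit]
        dx, dy = x - sx / n, y - sy / n         # deviations from unit means
        num += dx * dy
        den += dx * dx
    return num / den


# Made-up unbalanced-in-x panel: y = a_i + 2 * x with a_a = 10 and a_b = 1.
rows = [("a", 1, 12), ("a", 2, 14), ("a", 3, 16),
        ("b", 1, 3), ("b", 2, 5), ("b", 4, 9)]
slope = within_slope(rows)
```

Here the within estimator recovers the common slope of 2 even though the two units have very different intercepts. A pooled regression that ignores the unit effects would generally not.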
Una-Louise Bell, TStat S.r.l.
Rino Bellocco, Karolinska Institutet
Giovanni Capelli, Università degli Studi di Cassino
Marcello Pagano, Harvard School of Public Health
TStat S.r.l., the official distributor of Stata in Italy.