Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Comparing two data set
From
Dirk Enzmann <[email protected]>
To
[email protected]
Subject
Re: st: Comparing two data set
Date
Sat, 05 Mar 2011 03:33:16 +0100
In reply to
http://www.stata.com/statalist/archive/2011-03/msg00261.html
-----------------------------------------------------------
I need -merge- because I want to find out exactly which values of which
cases have been entered differently (and list the problems using -foreach-).
If I use -merge- when having duplicates in one of the datasets I will
receive a warning (old version of -merge-) or an error message (new
version of -merge-) but I still don't know which cases are causing the
problem. Using -duplicates- first may help.
The use of -duplicates-, -merge-, and the variable by variable
comparison via -foreach- and subsequent listing of discrepancies will
most likely find all problems of double entered data - except problems
that occur because of wrong IDs.
Personally I don't think that entering data twice and comparing them
afterwards is a good strategy: It is much better to use data entry
software that allows to correct double entered data immediately by
comparing the data during data entry. EpiData Entry is a good example,
see: http://www.epidata.dk/
Now I modified my example by checking for possible duplicates of the
second data set (and removing them) in the same way as possible
duplicates of the first:
* ==================================================
* Example of comparing two datasets by ID. Run will
* stop if files exist already (remove "//" from
* of "//, replace" to avoid this). Take care that no
* lines of this example syntax are broken!
* --------------------------------------------------
* Modify dataset "bpwide" and save it as "bpwide2":
sysuse bpwide, clear
replace sex=abs(sex-1) if mod(patient,13)==0
replace agegrp=2 if agegrp != 2 & mod(patient,11)==0
replace bp_before=bp_after if patient==100
replace bp_after=145 if patient==100
input
121 1 1 120 119
end
replace patient=2 if _n==1
replace patient=1 if _n==2
save bpwide_2 //, replace
* --------------------------------------------------
* Create log-file:
cap log close
log using example_cf //, replace
* --------------------------------------------------
* Open dataset 1:
sysuse bpwide, clear
* Create a duplicate case:
set obs 121
replace patient=patient[60] if _n==121
replace sex=sex[60] if _n==121
replace agegrp=agegrp[60] if _n==121
replace bp_before=bp_before[60] if _n==121
replace bp_after=bp_after[60] if _n==121
* Describe data:
describe, short
* List duplicates of data 1:
duplicates tag patient sex-bp_after, generate(doubl)
list patient if doubl==1
* drop variable "doubl":
drop doubl
* drop duplicates of data 1:
duplicates drop patient sex-bp_after, force
* Add "_orig" to all varnames except ID (patient):
foreach var of varlist sex-bp_after{
rename `var' `var'_orig
}
* Sort according to (list of) ID variable(s):
gsort patient
* Save dataset for comparison:
save bpwide_orig //, replace
* --------------------------------------------------
* Open dataset 2 and describe:
use bpwide_2, clear
describe, short
* List duplicates of data 2:
duplicates tag patient sex-bp_after, generate(doubl)
list patient if doubl==1
* drop variable "doubl":
drop doubl
* drop duplicates of data 2:
duplicates drop patient sex-bp_after, force
* Sort according to (list of) ID variable(s):
gsort patient
* Merge datasets according to ID (patient):
qui merge patient using bpwide_orig
tab _merge
* List problems of non-identical IDs:
list patient _merge if _merge <3, abbreviate(15)
* List cases and values that differ using values:
foreach var of varlist sex-bp_after{
list patient `var'_orig `var' ///
if `var' != `var'_orig & _merge==3, ///
abbreviate(15) sep(20) noobs nol
}
* --------------------------------------------------
* List cases and values that differ using labels:
foreach var of varlist sex-bp_after{
list patient `var'_orig `var' ///
if `var' != `var'_orig & _merge==3, ///
abbreviate(15) sep(20) noobs
}
log close
* ==================================================
Dirk
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/