Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Comparing two data set


From   Dirk Enzmann <[email protected]>
To   [email protected]
Subject   Re: st: Comparing two data set
Date   Sat, 05 Mar 2011 03:33:16 +0100

In reply to

http://www.stata.com/statalist/archive/2011-03/msg00261.html

-----------------------------------------------------------

I need -merge- because I want to find out exactly which values of which cases have been entered differently (and list the problems using -foreach-).

If I use -merge- when having duplicates in one of the datasets I will receive a warning (old version of -merge-) or an error message (new version of -merge-) but I still don't know which cases are causing the problem. Using -duplicates- first may help.

The use of -duplicates-, -merge-, and the variable by variable comparison via -foreach- and subsequent listing of discrepancies will most likely find all problems of double entered data - except problems that occur because of wrong IDs.

Personally I don't think that entering data twice and comparing them afterwards is a good strategy: It is much better to use data entry software that allows to correct double entered data immediately by comparing the data during data entry. EpiData Entry is a good example, see: http://www.epidata.dk/

Now I modified my example by checking for possible duplicates of the second data set (and removing them) in the same way as possible duplicates of the first:

* ==================================================
* Example of comparing two datasets by ID. Run will
* stop if files exist already (remove "//" from
* of "//, replace" to avoid this). Take care that no
* lines of this example syntax are broken!

* --------------------------------------------------
* Modify dataset "bpwide" and save it as "bpwide2":

sysuse bpwide, clear

replace sex=abs(sex-1) if mod(patient,13)==0
replace agegrp=2 if agegrp != 2 & mod(patient,11)==0
replace bp_before=bp_after if patient==100
replace bp_after=145 if patient==100
input
121 1 1 120 119
end
replace patient=2 if _n==1
replace patient=1 if _n==2

save bpwide_2 //, replace

* --------------------------------------------------
* Create log-file:
cap log close
log using example_cf //, replace

* --------------------------------------------------
* Open dataset 1:
sysuse bpwide, clear

* Create a duplicate case:
set obs 121
replace patient=patient[60] if _n==121
replace sex=sex[60] if _n==121
replace agegrp=agegrp[60] if _n==121
replace bp_before=bp_before[60] if _n==121
replace bp_after=bp_after[60] if _n==121

* Describe data:
describe, short

* List duplicates of data 1:
duplicates tag patient sex-bp_after, generate(doubl)
list patient if doubl==1
* drop variable "doubl":
drop doubl
* drop duplicates of data 1:
duplicates drop patient sex-bp_after, force

* Add "_orig" to all varnames except ID (patient):
foreach var of varlist sex-bp_after{
  rename `var' `var'_orig
}

* Sort according to (list of) ID variable(s):
gsort patient

* Save dataset for comparison:
save bpwide_orig //, replace

* --------------------------------------------------
* Open dataset 2 and describe:
use bpwide_2, clear
describe, short

* List duplicates of data 2:
duplicates tag patient sex-bp_after, generate(doubl)
list patient if doubl==1
* drop variable "doubl":
drop doubl
* drop duplicates of data 2:
duplicates drop patient sex-bp_after, force

* Sort according to (list of) ID variable(s):
gsort patient

* Merge datasets according to ID (patient):
qui merge patient using bpwide_orig
tab _merge

* List problems of non-identical IDs:
list patient _merge if _merge <3, abbreviate(15)

* List cases and values that differ using values:
foreach var of varlist sex-bp_after{
  list patient `var'_orig `var'     ///
       if `var' != `var'_orig & _merge==3, ///
	   abbreviate(15) sep(20) noobs nol
}
* --------------------------------------------------
* List cases and values that differ using labels:
foreach var of varlist sex-bp_after{
  list patient `var'_orig `var'     ///
       if `var' != `var'_orig & _merge==3, ///
	   abbreviate(15) sep(20) noobs
}

log close
* ==================================================

Dirk
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index