Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Comparing two data set


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Comparing two data set
Date   Sat, 5 Mar 2011 11:18:11 +0000

I guess I did not ask my question sufficiently clearly or completely.

I make a distinction between (a) what is congenial or convenient and
(b) what is essential or necessary.

On (a) I am happy to believe that EpiData entry is excellent for its
purpose. This reminds me of a frequent comment on this list that it is
easier to do something in this list in MS Excel.

Using any alternative software is attractive if and only if

1. You have that software on your machine.

2. You are well disposed to that software.

3. You are familiar with that software or willing to become familiar
with it. Stata users can also often say that they prefer to learn of
Stata solutions rather than learn something completely new, however
excellent that might be once you get used to it.

In any case, we often get questions in which users already have two
versions of the same dataset already in Stata, and want to know how to
compare them in Stata. At most, the advice to them might be, it seems,
"Next time, consider using EpiData Entry".

For this kind of problem, there is more than one way to do it, yet we
can still trade experiences and logic on what is fastest, shortest,
without side-effects and so forth.

Dirk says "I need -merge- because I want to find out exactly which
values of which
cases have been entered differently (and list the problems using -foreach-)."

That seems incorrect to me. All you need to do before -append-ing is
to add variables with dataset number and observation numbers and you
can keep track of what came from where.

-merge- may be convenient but I don't think it is essential.

I imagine we all have biases for and against ways of doing this. For
example, I am very happy to work with -by:- and its associated
machinery. The more you use anything, the more you memorise most or
all of its syntax and get to have almost a visual picture of what a
command will do. Conversely, I rarely use -merge- and can only recall
the very simplest -merge- syntax.  I've seen people do very clever
things based on -merging- a dataset with a clone of itself, and so
forth, which sometimes I can do in place with -by:-. It's just like
the experience of, I imagine, most or all of the people on this list
that there are bits of mathematics that you know pretty much perfectly
because you use them all the time and bits that you have to relearn
every time you use them.

Nick

On Sat, Mar 5, 2011 at 2:33 AM, Dirk Enzmann
<[email protected]> wrote:
> In reply to
>
> http://www.stata.com/statalist/archive/2011-03/msg00261.html
>
> -----------------------------------------------------------
>
> I need -merge- because I want to find out exactly which values of which
> cases have been entered differently (and list the problems using -foreach-).
>
> If I use -merge- when having duplicates in one of the datasets I will
> receive a warning (old version of -merge-) or an error message (new version
> of -merge-) but I still don't know which cases are causing the problem.
> Using -duplicates- first may help.
>
> The use of -duplicates-, -merge-, and the variable by variable comparison
> via -foreach- and subsequent listing of discrepancies will most likely find
> all problems of double entered data - except problems that occur because of
> wrong IDs.
>
> Personally I don't think that entering data twice and comparing them
> afterwards is a good strategy: It is much better to use data entry software
> that allows to correct double entered data immediately by comparing the data
> during data entry. EpiData Entry is a good example, see:
> http://www.epidata.dk/
>
> Now I modified my example by checking for possible duplicates of the second
> data set (and removing them) in the same way as possible duplicates of the
> first:
>
> * ==================================================
> * Example of comparing two datasets by ID. Run will
> * stop if files exist already (remove "//" from
> * of "//, replace" to avoid this). Take care that no
> * lines of this example syntax are broken!
>
> * --------------------------------------------------
> * Modify dataset "bpwide" and save it as "bpwide2":
>
> sysuse bpwide, clear
>
> replace sex=abs(sex-1) if mod(patient,13)==0
> replace agegrp=2 if agegrp != 2 & mod(patient,11)==0
> replace bp_before=bp_after if patient==100
> replace bp_after=145 if patient==100
> input
> 121 1 1 120 119
> end
> replace patient=2 if _n==1
> replace patient=1 if _n==2
>
> save bpwide_2 //, replace
>
> * --------------------------------------------------
> * Create log-file:
> cap log close
> log using example_cf //, replace
>
> * --------------------------------------------------
> * Open dataset 1:
> sysuse bpwide, clear
>
> * Create a duplicate case:
> set obs 121
> replace patient=patient[60] if _n==121
> replace sex=sex[60] if _n==121
> replace agegrp=agegrp[60] if _n==121
> replace bp_before=bp_before[60] if _n==121
> replace bp_after=bp_after[60] if _n==121
>
> * Describe data:
> describe, short
>
> * List duplicates of data 1:
> duplicates tag patient sex-bp_after, generate(doubl)
> list patient if doubl==1
> * drop variable "doubl":
> drop doubl
> * drop duplicates of data 1:
> duplicates drop patient sex-bp_after, force
>
> * Add "_orig" to all varnames except ID (patient):
> foreach var of varlist sex-bp_after{
>  rename `var' `var'_orig
> }
>
> * Sort according to (list of) ID variable(s):
> gsort patient
>
> * Save dataset for comparison:
> save bpwide_orig //, replace
>
> * --------------------------------------------------
> * Open dataset 2 and describe:
> use bpwide_2, clear
> describe, short
>
> * List duplicates of data 2:
> duplicates tag patient sex-bp_after, generate(doubl)
> list patient if doubl==1
> * drop variable "doubl":
> drop doubl
> * drop duplicates of data 2:
> duplicates drop patient sex-bp_after, force
>
> * Sort according to (list of) ID variable(s):
> gsort patient
>
> * Merge datasets according to ID (patient):
> qui merge patient using bpwide_orig
> tab _merge
>
> * List problems of non-identical IDs:
> list patient _merge if _merge <3, abbreviate(15)
>
> * List cases and values that differ using values:
> foreach var of varlist sex-bp_after{
>  list patient `var'_orig `var'     ///
>       if `var' != `var'_orig & _merge==3, ///
>           abbreviate(15) sep(20) noobs nol
> }
> * --------------------------------------------------
> * List cases and values that differ using labels:
> foreach var of varlist sex-bp_after{
>  list patient `var'_orig `var'     ///
>       if `var' != `var'_orig & _merge==3, ///
>           abbreviate(15) sep(20) noobs
> }
>
> log close
> * ==================================================
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index