Statalist



st: RE: Is it necessary to sort data before using -cf-?


From   "Martin Weiss" <martin.weiss1@gmx.de>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Is it necessary to sort data before using -cf-?
Date   Sun, 29 Nov 2009 11:36:31 +0100


At the end of the day, -cf- compares the values of each variable observation
by observation, in the order in which the rows sit in the two datasets, so
the -sort- order does matter for it. The manual entry and help file do not
mention this fact, but I feel that it goes without saying. What else would
you compare but the values line by line?

Note how in the following code the datasets are both ordered by -rep78-.
Given that rep78 only features 5 distinct values, this -sort- order is not
unique, though. That is the reason for the existence of the -stable- option
to -sort-, btw...


*******
sysuse auto, clear
sort rep78
save new.dta, replace

use new.dta, clear
// disturb the row order first
sort foreign
// ends up being sorted by rep78 again, but ties may be ordered differently
sort rep78
cf _all using new.dta, verbose
*******

With only 5 distinct values to go by, -sort- has many ties to break. It
orders tied observations randomly, so only by chance will it produce the
same row order twice. These differences in row order are subsequently
picked up by -cf-.
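
One way to make the comparison independent of how ties are broken, sketched
here under the assumption that the data contain a variable that uniquely
identifies the observations (in auto.dta, -make- does), is to -sort- both
datasets on that variable before calling -cf-. -isid- stops with an error if
the chosen key is not unique:

*******
sysuse auto, clear
// confirm that make uniquely identifies the observations
isid make
sort make
save new.dta, replace

sysuse auto, clear
// any intermediate order, ties and all
sort rep78
// the unique key pins down the position of every row
sort make
cf _all using new.dta, verbose
*******

With both copies in the same well-defined order, -cf- should find no
mismatches.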

See also Phil's http://www.stata-journal.com/sjpdf.html?articlenum=dm0019
and http://www.stata.com/support/faqs/lang/sort.html


There is also a -compdta- package (see -findit compdta-), which is quite old
and runs under -version 4.0-. It does, however, feature a -sort- option.


HTH
Martin


-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of gjhxmu@sina.com
Sent: Sonntag, 29. November 2009 10:36
To: statalist
Subject: st: Is it necessary to sort data before using -cf-? 

Dear statalists,

Is it necessary to sort data before using -cf-? 
Without sorting, I found that two identical datasets are reported as
different. However, I found no mention of this in -help cf-.
If sorting is necessary, how do I choose the sort variable(s) when I compare
all the variables or only certain variables?
Must the sort variable have no duplicates?

For example,

. sysuse auto,clear
(1978 Automobile Data)

. sort turn

. save new,replace
file new.dta saved

. sysuse auto,clear
(1978 Automobile Data)

. sort rep78

. cf _all using new
            make:  74 mismatches
           price:  74 mismatches
             mpg:  69 mismatches
           rep78:  63 mismatches
        headroom:  64 mismatches
           trunk:  72 mismatches
          weight:  73 mismatches
          length:  73 mismatches
            turn:  71 mismatches
    displacement:  72 mismatches
      gear_ratio:  72 mismatches
         foreign:  42 mismatches
r(9);

. 
Could anyone help me? Thank you.


Best regards,
Rose


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



