Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Comparing two data set

From	Rajaram Subramanian Potty <[email protected]>
To	[email protected]
Subject	Re: st: Comparing two data set
Date	Wed, 2 Mar 2011 17:12:37 +0530

Thank you very much for the information. I installed -cf3- and am now able
to generate the error list by ID.

RAJARAM. S

On Wed, Mar 2, 2011 at 3:33 PM, Kevin Owuor <[email protected]> wrote:
> Maybe you can try out the cf3 package: type -findit cf3-. It lists errors by ID.
> ----------
> Kevin Owuor
> Kemri/ucsf
> Kenya
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Nick Cox
> Sent: Wednesday, March 02, 2011 12:44 PM
> To: [email protected]
> Subject: Re: st: Comparing two data set
>
> The answer is Yes, and follows from looking at the help for -duplicates-.
>
> Following the example in my previous post, let's introduce an oddity and
> then show how you find it.
>
> . replace mpg = 42 in 42
> (1 real change made)
>
> . duplicates report make-foreign
>
> Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement
>    gear_ratio foreign
>
> --------------------------------------
>    copies | observations       surplus
> ----------+---------------------------
>         1 |            2             0
>         2 |          146            73
> --------------------------------------
>
> -duplicates- reports two observations that are singletons, i.e. occur
> precisely once. We create a tag variable (which will be 0 for the
> singletons).
>
> . duplicates tag make-foreign, gen(tag)
>
> Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement
>    gear_ratio foreign
>
> . l if tag == 0
>
>      +------------------------------------------------------------------------------------------+
>  42. | make        | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
>      | Plym. Arrow | 4,647 |  42 |     3 |      2.0 |    11 |  3,260 |    170 |   37 |      156 |
>      |------------------------------------------------------------------------------------------|
>      |       gear_r~o       |       foreign        |         ds          |         tag          |
>      |           3.05       |      Domestic        |          2          |           0          |
>      +------------------------------------------------------------------------------------------+
>
>      +------------------------------------------------------------------------------------------+
> 116. | make        | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
>      | Plym. Arrow | 4,647 |  28 |     3 |      2.0 |    11 |  3,260 |    170 |   37 |      156 |
>      |------------------------------------------------------------------------------------------|
>      |       gear_r~o       |       foreign        |         ds          |         tag          |
>      |           3.05       |      Domestic        |          1          |           0          |
>      +------------------------------------------------------------------------------------------+
>
> So, you can home in on anomalies in any standard way.
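
A minimal sketch of the same approach with the listing keyed on an identifier rather than on the observation number. It assumes that -make- plays the role of the questionnaire ID (an illustrative assumption; the auto data have no ID variable as such), and the tempfile name `first' is likewise only illustrative.

```stata
* Sketch: show mismatched records grouped by an ID variable.
* Assumption: -make- stands in for the questionnaire ID.
sysuse auto, clear
gen ds = 1                         // dataset identifier for the first entry
tempfile first
save `first'

sysuse auto, clear
gen ds = 2                         // dataset identifier for the second entry
replace mpg = 42 in 42             // introduce one deliberate discrepancy
append using `first'

* tag == 0 marks observations that have no exact twin in the other file
duplicates tag make-foreign, gen(tag)

* sort by ID so that both versions of a mismatched record appear together
sort make ds
list make ds price-foreign if tag == 0, sepby(make)
```

Sorting on the ID and the dataset identifier puts the two versions of each mismatched record side by side, so the ID can be read straight off the listing.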
>
> Nick
>
> On Wed, Mar 2, 2011 at 9:25 AM, Rajaram Subramanian Potty
> <[email protected]> wrote:
>> Dear Nick,
>>
>> Thanks for the information. Two or three times I have used the -cf-
>> command to identify the errors in two data files. But I want the errors
>> to be displayed according to the ID variable. At present, the
>> -cf- command reports errors by observation number in the Stata data set
>> and not by the ID variable. If I am able to list the errors
>> according to the ID variable, it will be easy for us to trace the
>> questionnaire and find the error in the data entry. So, I just want to
>> know whether it is possible to get the errors listed by the ID variable.
>>
>> Thanks and regards,
>>
>> RAJARAM. S
>>
>> On Wed, Mar 2, 2011 at 2:44 PM, Nick Cox <[email protected]> wrote:
>>> One way is to check that the .dta or other data files are identical
>>> using your operating system.
>>>
>>> Also, check out -cf- and -dta_equal-.
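
A minimal sketch of what a -cf- comparison might look like, using the auto data and one deliberately altered value. The tempfile name `original' is an illustrative assumption. -cf- compares the data in memory with the file named after -using-, observation by observation, so both files must be in the same sort order.

```stata
* Sketch: compare the data in memory with a saved copy using -cf-.
sysuse auto, clear
tempfile original
save `original'

replace mpg = 42 in 42             // make the data in memory differ from the copy

* -cf- exits with a nonzero return code when it finds mismatches, so -capture-
* lets a do-file continue; -noisily- still displays the comparison.
* The verbose option lists each mismatch by observation number.
capture noisily cf _all using `original', verbose
display "return code from cf: " _rc
```

Because the mismatches are reported by observation number, tracing them back to a questionnaire requires looking up the ID in that observation.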
>>>
>>> Another way to approach this is to -append- the datasets and look for
>>> -duplicates-. However, -duplicates- just looks for duplicate
>>> observations. In principle, the variable names, variable labels, value
>>> labels, formats and characteristics must also be shown to be
>>> identical.
>>>
>>> To do this last, you will need to create a dataset identifier so that
>>> you can work out where any anomalies are.
>>>
>>> Here is an example where by construction the interesting part of the
>>> data is identical. So, -duplicates- confirms that everything occurs
>>> twice. Conversely, mismatches would imply singletons, triplicates,
>>> etc.
>>>
>>> . sysuse auto
>>> (1978 Automobile Data)
>>>
>>> . gen ds = 1
>>>
>>> . save auto1
>>> file auto1.dta saved
>>>
>>> . sysuse auto, clear
>>> (1978 Automobile Data)
>>>
>>> . gen ds = 2
>>>
>>> . append using auto1
>>> (label origin already defined)
>>>
>>>
>>> . tab ds
>>>
>>>          ds |      Freq.     Percent        Cum.
>>> ------------+-----------------------------------
>>>           1 |         74       50.00       50.00
>>>           2 |         74       50.00      100.00
>>> ------------+-----------------------------------
>>>       Total |        148      100.00
>>>
>>> . duplicates report make-foreign
>>>
>>> Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement
>>>    gear_ratio foreign
>>>
>>> --------------------------------------
>>>    copies | observations       surplus
>>> ----------+---------------------------
>>>         2 |          148            74
>>> --------------------------------------
>>>
>>> Nick
>>>
>>> On Wed, Mar 2, 2011 at 9:01 AM, Rajaram Subramanian Potty
>>> <[email protected]> wrote:
>>>
>>>> We carried out a survey, and the data from the survey were entered
>>>> twice. Now we want to compare these two data files for possible
>>>> data entry errors. Please advise how to compare the two data files
>>>> and identify the data entry errors using Stata.
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

