Statalist



RE: st: RE: testing -duplicates tag-


From   Michael McCulloch <[email protected]>
To   [email protected]
Subject   RE: st: RE: testing -duplicates tag-
Date   Thu, 4 Sep 2008 08:06:09 -0700

Dear Martin,
This works with perfect sensitivity and specificity. For those interested, the original question was:
For each record in one group (foreign), tag for deletion all records in another group (domestic) that are duplicates of it on a set of specified variables (headroom and trunk).

My goal in asking for help with this argument was to have a method for removing potential duplicates between two overlapping data sets, where individual identifiers are not available.

My sincere thanks to Martin, Nick, Eva and Emmanouil for their very kind, and helpful, input.
Michael


OK, so let's try that again. The tag should now reliably indicate that an
observation is duplicated more times overall than in the domestic subgroup,
implying that it must have at least one match in the foreign group...

*********

sysuse auto, clear
g id=_n

duplicates tag headroom trunk if foreign==0, generate(dupdom)
duplicates tag headroom trunk, generate(dupall)

* tag to indicate domestic obs with at least one match in foreign
generate byte tag = foreign==0 & dupall>dupdom

* let's see
list tag id foreign if foreign==0, noobs header(25)
*********

HTH
Martin
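
For anyone reusing this, the tag can be cross-checked without -duplicates-. What follows is only a minimal sketch under the thread's setup (auto data, matching on headroom and trunk), and it assumes that foreign, headroom and trunk have no missing values. The names anyforeign and tag2 are made up for illustration. It marks a domestic car whenever any foreign car shares its headroom-trunk pattern, and -assert- stops with an error if that ever disagrees with the dupall>dupdom rule.

```stata
sysuse auto, clear
generate id = _n

* rule from the message above: duplicated more often overall than within domestic
duplicates tag headroom trunk if foreign==0, generate(dupdom)
duplicates tag headroom trunk, generate(dupall)
generate byte tag = foreign==0 & dupall>dupdom

* cross-check (anyforeign and tag2 are illustrative names):
* does any foreign car share this headroom-trunk pattern?
bysort headroom trunk: egen byte anyforeign = max(foreign)
generate byte tag2 = foreign==0 & anyforeign==1

* stops with an error if the two rules ever disagree
assert tag == tag2

* inspect the tagged domestic records, then drop them
sort id
list id foreign headroom trunk if tag, noobs
drop if tag
```

On a real dataset it is probably safer to keep a copy of the data, or to -list- before -drop-, since the drop cannot be undone.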


-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Michael McCulloch
Sent: Thursday, September 04, 2008 6:30 AM
To: [email protected]
Subject: Re: st: RE: testing -duplicates tag-

The code suggested by Martin gets me closer, but the pattern is still
not exclusive. I'm trying to identify observations in DOMESTIC, which
are duplicates (in headroom & trunk) of observations in FOREIGN. Here
are two sets of those duplicates. Note how 20 is a duplicate of 57,
where the patterns of missing and 0 in dupfor and dupdom seem to form
a pattern; that pattern, however is contradicted in the next set,
where 53 71 and 72 are duplicates of 32.

Any ideas would be appreciated!

id   foreign    headroom   trunk   dupall   dupfor   dupdom
20   Domestic          2       8        1        .        0
57   Foreign           2       8        1        0        .
 *   *                 *       *        *        *        *
32   Domestic          3      15        3        .        0
53   Foreign           3      15        3        2        .
71   Foreign           3      15        3        2        .
72   Foreign           3      15        3        2        .




Try this:

sysuse auto, clear
duplicates tag headroom trunk if foreign==1, generate(dupfor)
*duplicates tag headroom trunk if foreign==0, generate(dupdom)
duplicates tag headroom trunk, generate(dupall)
l if dupfor==0 & dupall>0


HTH
Martin
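
Note that dupfor==0 only picks up foreign cars that are unique among the foreign cars. A hedged sketch of a more general check, mirroring the dupall>dupdom comparison used elsewhere in this thread: a foreign car has at least one domestic match exactly when it is duplicated more often overall than within the foreign group. The variable name hasdom is illustrative, and the sketch assumes no missing values in foreign.

```stata
sysuse auto, clear
generate id = _n

* duplicates among foreign cars only (dupfor is missing for domestic cars)
duplicates tag headroom trunk if foreign==1, generate(dupfor)

* duplicates among all cars
duplicates tag headroom trunk, generate(dupall)

* foreign cars whose headroom-trunk pattern also occurs among domestic cars
generate byte hasdom = foreign==1 & dupall>dupfor
list id foreign headroom trunk dupfor dupall if hasdom, noobs
```

Restricting to foreign==1 before comparing matters, because dupfor is missing for domestic cars and Stata treats missing as larger than any number.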


Quoting Michael McCulloch <[email protected]>:

One other question, if I may:
How would I modify the list command as re-written below, to identify
only those duplicates where:
	headroom and trunks are duplicated, but
	foreign is not,
so that I could find only those Foreign cars that have duplicates in the
set of Domestic cars (in this case observations #7 and #8)?

clear
sysuse auto
list foreign headroom trunk
duplicates tag headroom trunk, generate(dup)
sort headroom trunk
list foreign headroom trunk dup if dup>0 & trunk==8, clean noobs




Well, as -help duplicates- shows, a -varlist- is allowed with all of the five commands. If the variables were combined with *OR*, specifying several of them would be pointless. -duplicates tag- looks for observations that share the same combination of values on all the variables in your -varlist- and tags each one with the number of other observations sharing that combination.

sysuse auto, clear
duplicates tag head mpg, gen(dup)
duplicates report headroom mpg
ta dup

duplicates tag head mpg tru, gen(dup1)
duplicates report headroom mpg tru
ta dup1


HTH
Martin
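
One way to convince yourself of the *AND* behaviour is to count the observations in each headroom-mpg combination by hand and compare with the tag. This is a minimal sketch, resting on the description above that the tag is the number of other observations sharing the combination; the variable name others is made up for the illustration.

```stata
sysuse auto, clear
duplicates tag headroom mpg, generate(dup)

* number of other observations with the same headroom AND the same mpg
bysort headroom mpg: generate others = _N - 1

* no error here means the two counts agree for every observation
assert dup == others
```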

Quoting Michael McCulloch <[email protected]>:


Thanks Martin. Am I correct in understanding that, in this revised
example immediately below, the command:

. duplicates tag headroom trunk, generate(dup)

would tag as dup>0 all sets of observations for which there are duplicates of:
headroom *AND* trunk
and not just those for which there are duplicates of:
headroom *OR* trunk
?
It looks that way on visual inspection of this example's output, but I
want to make sure before applying it to my much larger dataset.

clear
sysuse auto
list foreign headroom trunk
duplicates tag headroom trunk, generate(dup)
sort headroom trunk
list foreign headroom trunk dup if dup>0, clean

Michael

Well, the question is not much clearer now, at least to me. I suspect you want something like

count if duptag > 0

after your commands. Just replace duptag with the name of the tag variable you generated, and be aware that every observation in a duplicated covariate pattern is counted, so a pair contributes two to the total (58 and 59 would both count under this rule). If that is not what you want, clarify!


HTH
Martin
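
If the aim is to count duplicated patterns rather than observations, a minimal sketch follows. It assumes the headroom-trunk example from this thread; pick is an illustrative variable name, and the -egen- function tag() marks exactly one observation per distinct combination.

```stata
sysuse auto, clear
duplicates tag headroom trunk, generate(dup)

* observations that share their headroom-trunk pattern with at least one other
count if dup > 0

* pick==1 for exactly one observation in each distinct headroom-trunk pattern
egen byte pick = tag(headroom trunk)

* distinct patterns that occur more than once
count if pick & dup > 0
```

The second count can never exceed the first, since each duplicated pattern contributes one picked observation but at least two tagged ones.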

Quoting Michael McCulloch <[email protected]>:


Apologies, I wasn't clear in my question. What I want to do is find
records for which *both* trunk and headroom are duplicates. So
following the command suggested by Martin and Nick, I get:


. list foreign headroom trunk if trunk==8, clean

       foreign   headroom   trunk
 20.   Domestic        2.0       8
 45.   Domestic        1.5       8
 57.    Foreign        2.0       8
 58.    Foreign        2.5       8
 59.    Foreign        2.5       8

Note that:
	observations 20 and 57 both have headroom==2.0, trunk==8
	observations 58 and 59 both have headroom==2.5, trunk==8

Since I'm developing this command for use in a large dataset, how would
I follow up -duplicates tag- to identify those unique sets of records,
where two variables are duplicates simultaneously, without having to
search manually?

I cannot see your point. Stata does tag these observations with tag 1.
Just -list- after -duplicates tag-.

**********
clear
sysuse auto
list foreign headroom trunk if trunk==8
duplicates tag headroom trunk, generate(dup_admission_id)
*Let's see...
list dup_* foreign headroom trunk if trunk==8
**********

HTH
Martin
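
For reference, -duplicates tag- only creates the tag variable; it does not list observations itself. Two other -duplicates- subcommands print results directly. A short sketch on the same example (the variable name dup is arbitrary):

```stata
sysuse auto, clear

* summary table: how many observations occur once, twice, and so on
duplicates report headroom trunk

* one example observation for each duplicated headroom-trunk pattern
duplicates examples headroom trunk

* tag, then list the tagged observations yourself
duplicates tag headroom trunk, generate(dup)
list foreign headroom trunk dup if dup>0 & trunk==8, clean
```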

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Michael McCulloch
Sent: Wednesday, September 03, 2008 6:29 PM
To: Statalist
Subject: st: testing -duplicates tag-

Hello,
I'm testing -duplicates tag-, and puzzled as to why it won't show the
two observations where headroom==2.0 and trunk==8.

clear
sysuse auto
list foreign headroom trunk if trunk==8
duplicates tag headroom trunk, generate(dup_admission_id)

--

Best wishes,
Michael McCulloch



Pine Street Foundation
124 Pine St., San Anselmo, CA 94960-2674
Tel: (415) 407-1357
Fax: (415) 485-1065
[email protected]
www.pinestreetfoundation.org
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/




