
Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.



Re: st: RE: Replacing duplicate values


From   "Pavlos C. Symeou" <p.symeou@lmu.de>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: RE: Replacing duplicate values
Date   Thu, 01 Apr 2010 17:20:35 +0200

Dear Nick and Abdel,

Thank you for your replies. I should clarify that I don't wish to drop any duplicate observations. Rather, I want to delete duplicate values across the four ipc variables and then move all the distinct values to the left. Reshaping to long format would be one option, but the complete dataset is too complex and I would prefer to avoid that for now.

Regards,

Pavlos

AbdelRahmen wrote:
"Type -help duplicates drop- in Stata and you will find what you are looking for."


On 01/04/2010 17:00, Nick Cox wrote:
It's a Stata two-step: reshape, drop duplicates, reshape back. Something like

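* warning: untested code
reshape long ipc_, i(id) j(pos)
drop if ipc_ == ""
* flag repeats of the same IPC within a patent, keeping the first occurrence
bysort id ipc_ (pos): gen superfluousandredundant = _n > 1
drop if superfluousandredundant
* renumber the surviving values in their original left-to-right order
bysort id (pos): gen j = _n
drop pos superfluousandredundant
reshape wide ipc_, i(id) j(j)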

Actually, the last -reshape- might not be a good idea. The long structure might be more useful.

Nick
n.j.cox@durham.ac.uk
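
Following the remark that the long structure might be more useful, here is a minimal sketch that stops after the first -reshape- and uses -duplicates drop-, as suggested earlier in the thread. It is untested and assumes the ipc variables are strings; the variable name pos is chosen here for illustration, and the two rows are ids 6 and 7 from the original question.

```stata
* sketch only, untested: ids 6 and 7 from the example data
clear
input id str4 ipc_1 str4 ipc_2 str4 ipc_3 str4 ipc_4
6 "A47B" "F16M" "F16M" "H05K"
7 "A47B" "A47B" "F16M" "A47B"
end

* one row per patent and slot; pos records the original slot number
reshape long ipc_, i(id) j(pos)

* discard empty slots, then keep one row per patent-IPC pair
drop if ipc_ == ""
duplicates drop id ipc_, force

sort id pos
list, sepby(id)
```

The -force- option is needed because the duplicates are defined on a subset of the variables (id and ipc_), not on whole observations.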

Pavlos C. Symeou wrote:

I have a dataset on patents. Every patent is assigned a number of
International Patent Classifications (IPCs). However, there are
mistakes in the database and certain IPCs appear more than once for
a single patent, which is meaningless. Examples are the patents with
id 6 and id 7 (ipc_1, ipc_2, etc. hold the IPCs assigned to a single
patent). For the patent with id 6, ipc_2 and ipc_3 are the same.
Id 7 illustrates a more general issue: duplicate values need not be
adjacent and may appear more than twice.

id    ipc_1    ipc_2    ipc_3    ipc_4
1     A44B    G09F    H04N
2     A47B    G06F    H05K    E05D
3     A47B    G06F
4     A47B    H04N    H05K
5     A47B
6     A47B    F16M    F16M    H05K
7     A47B    A47B    F16M    A47B

Can you suggest a way to delete the duplicate values, which can be more
than two, and move the remaining to the left? For example patents with
id 6 and id 7 would look like this:

id    ipc_1    ipc_2    ipc_3    ipc_4
6     A47B    F16M    H05K
7     A47B    F16M
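
For anyone who prefers to stay in the wide layout, as requested in this thread, one possible approach is sketched here. It is not code from any of the posters. It assumes exactly four string variables named ipc_1 to ipc_4 and that an empty string marks an unused slot; the loop bounds would need adjusting for a different number of variables.

```stata
* sketch only: ids 6 and 7 from the example data
clear
input id str4 ipc_1 str4 ipc_2 str4 ipc_3 str4 ipc_4
6 "A47B" "F16M" "F16M" "H05K"
7 "A47B" "A47B" "F16M" "A47B"
end

* Step 1: blank any value that repeats an earlier variable in the same row
forvalues k = 2/4 {
    local last = `k' - 1
    forvalues m = 1/`last' {
        replace ipc_`k' = "" if ipc_`k' == ipc_`m'
    }
}

* Step 2: shift values left into empty slots
* (three passes are enough for four variables)
forvalues pass = 1/3 {
    forvalues k = 1/3 {
        local next = `k' + 1
        * pull the right-hand neighbour into an empty slot ...
        replace ipc_`k' = ipc_`next' if ipc_`k' == ""
        * ... and clear the neighbour it was copied from
        replace ipc_`next' = "" if ipc_`k' == ipc_`next'
    }
}

list, noobs
```

Because Step 1 leaves only distinct non-empty values in each row, the equality test in Step 2 can only be true for a value that has just been copied (or for two empty slots), so no distinct value is lost. On the two example rows this should reproduce the layout shown above for ids 6 and 7.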


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
*


© Copyright 1996–2014 StataCorp LP