Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: RE: Replacing duplicate values

From	"Pavlos C. Symeou" <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: RE: Replacing duplicate values
Date	Thu, 01 Apr 2010 17:20:35 +0200

Dear Nick and Abdel,

thank you for your replies. I need to clarify that I don't wish to drop any duplicate observations. Rather, I want to delete duplicate values across the four ipc variables and then move all the distinct values to the left. Transforming the data into long format would be one option, but the complete dataset is too complex and I would prefer to avoid this for now.


Regards,

Pavlos

"AbdelRahmen Wrote"
"type  help duplicates drop under Stata and you will find what you are looking for"


On 01/04/2010 17:00, Nick Cox wrote:

It's a Stata two-step: reshape, drop duplicates, reshape back. Something like

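* warning: untested code
* go long, keeping the original column position in k
reshape long ipc_, i(id) j(k)
* empty slots are not IPCs
drop if missing(ipc_)
* within each id, flag every occurrence of an IPC after the first
bysort id ipc_ (k): gen superfluousandredundant = _n > 1
drop if superfluousandredundant
* renumber the survivors 1, 2, ... in their original order
bysort id (k): gen j = _n
drop k superfluousandredundant
reshape wide ipc_, i(id) j(j)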

Actually, the last -reshape- might not be a good idea. The long structure might be more useful.

Nick
[email protected]

Pavlos C. Symeou

I have a dataset of patents. Every patent is assigned a number of
International Patent Classifications (IPCs). However, there are mistakes
in the database, and certain IPCs appear more than once for a single
patent, which is meaningless. Examples are the patents with id 6 and
id 7 below (ipc_1, ipc_2, etc. hold the IPCs assigned to a single
patent). For the patent with id 6, ipc_2 and ipc_3 are the same. Id 7
illustrates a more general issue: duplicate values need not be adjacent
and may appear more than twice.

id    ipc_1    ipc_2    ipc_3    ipc_4
1     A44B     G09F     H04N
2     A47B     G06F     H05K     E05D
3     A47B     G06F
4     A47B     H04N     H05K
5     A47B
6     A47B     F16M     F16M     H05K
7     A47B     A47B     F16M     A47B

Can you suggest a way to delete the duplicate values, which can be more
than two, and move the remaining to the left? For example patents with
id 6 and id 7 would look like this:

id    ipc_1    ipc_2    ipc_3    ipc_4
6     A47B    F16M    H05K
7     A47B    F16M
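
For the wide-format route described in the reply at the top of this message (no -reshape-), something along the following lines might work. It is only a sketch: it assumes ipc_1-ipc_4 are string variables with empty strings for unused slots, and the names j, k, prev, next, pass and move are made up for illustration. The first loop blanks any value that already appears in an earlier ipc variable; the second loop then shifts the remaining values left, one position per step.

```stata
* example data from the question (empty string = no IPC in that slot)
clear
input id str4 ipc_1 str4 ipc_2 str4 ipc_3 str4 ipc_4
1 "A44B" "G09F" "H04N" ""
2 "A47B" "G06F" "H05K" "E05D"
3 "A47B" "G06F" ""     ""
4 "A47B" "H04N" "H05K" ""
5 "A47B" ""     ""     ""
6 "A47B" "F16M" "F16M" "H05K"
7 "A47B" "A47B" "F16M" "A47B"
end

* Step 1: blank out a value if it already appears in an earlier ipc variable
forvalues j = 2/4 {
    local prev = `j' - 1
    forvalues k = 1/`prev' {
        quietly replace ipc_`j' = "" if ipc_`j' == ipc_`k'
    }
}

* Step 2: move values left into empty slots; with four variables a value
* travels at most three positions, so three passes are enough
forvalues pass = 1/3 {
    forvalues j = 1/3 {
        local next = `j' + 1
        * move = 1 where slot j is empty and slot j+1 is not
        generate byte move = ipc_`j' == "" & ipc_`next' != ""
        quietly replace ipc_`j' = ipc_`next' if move
        quietly replace ipc_`next' = "" if move
        drop move
    }
}

list, noobs
```

The limits 4 and 3 are hard-coded for four ipc variables and would need adjusting for a different number. The first occurrence of each IPC is kept, so the original left-to-right order is preserved.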


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


