st: Replacing duplicate values

From	"Pavlos C. Symeou" <[email protected]>
To	"[email protected]" <[email protected]>
Subject	st: Replacing duplicate values
Date	Tue, 06 Apr 2010 15:21:29 +0200

Dear Martin and Nick,

thank you for your input on my previous inquiry, which has worked in many instances. However, I am experiencing problems with reshaping the data between long and wide formats. The datasets I am working on are big (more than 1 GB) and consist of about 600 string variables. Reshaping back and forth not only takes ages to complete (I am working on a powerful quad-core PC with 64-bit Windows Vista and Stata 11 MP) but also creates enormous files that I can't handle. I would like to ask whether you have any alternatives to the approaches below. Allow me first to explain the task again.

I have a dataset on patents. Every patent may cite multiple existing patents. The dataset is in wide format, with one row per patent holding the patent's id and its citations (citation_1, citation_2, etc.). However, there are mistakes in the dataset: certain citations appear more than once for a single patent, which is meaningless. Examples are the patent with id 1, where citation AAAA appears twice, and the patent with id 2, where NICK appears three times. The patent with id 3 has three citations, but they sit in positions 2, 3, and 4 (a similar issue occurs for the patent with id 4).


id    citation_1    citation_2    citation_3    citation_4    citation_5
1     AAAA          BBBB          CCCC          AAAA
2     NICK          NICK          MARTIN        NICK
3                   YYYY          NNNN          PAVLOS
4     ZZZZ          FFFF                                      TRDFF
5
.

The task is to delete duplicate values within each observation and move the remaining values to the left, towards citation_1. For example, the patents with id 2 and id 3 would look like this:

id    citation_1    citation_2    citation_3    citation_4    citation_5
2     NICK          MARTIN
3     YYYY          NNNN          PAVLOS
.

You suggested I use the following code, which simply blanks out the duplicates:
***********************************************************************
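* go from wide to long; with no j() option, reshape names the index _j
reshape long ipc_, i(id)
* flag second and later occurrences of the same value within a patent
bysort id ipc_: gen superfluousandredundant = _n > 1
replace ipc_ = "" if superfluousandredundant == 1
drop superfluousandredundant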
***********************************************************************

Further, I have used the following code to reallocate the values of each observation to the left:
******************************************************************************************************
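* running count of the non-empty values within each patent
g unit = 1
bysort id: generate runsum = sum(unit) if ipc_ != ""
rename runsum _runsum
* empty values have missing _runsum, sort last within id, and take the remaining positions
bysort id (_runsum): g n = _n
replace _runsum = n if _runsum == .
drop _j unit n
* the stub must match the long variable name, ipc_
reshape wide ipc_, i(id) j(_runsum)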
****************************************************************************************************

The problem is that both pieces of code use -reshape-, which only works when my dataset is very small (I have one dataset for each of a sample of 300 companies). Can you suggest another way to achieve the above task?

Best wishes,

Pavlos
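
One possible way to approach the task above without -reshape- is to loop over the wide variables directly. What follows is only a minimal sketch under stated assumptions, not a tested solution for the real data: it assumes the variables are strings named citation_1 to citation_5 as in the example table (the real variables appear to be named ipc_*), and the local macro K and the helper variable pos are hypothetical names introduced for illustration. The toy data are typed in only to make the sketch self-contained.

```stata
* Toy data mirroring the example table (illustrative only).
* Note: -clear- drops whatever data are currently in memory.
clear
input byte id str6 citation_1 str6 citation_2 str6 citation_3 str6 citation_4 str6 citation_5
1 "AAAA" "BBBB" "CCCC"   "AAAA"   ""
2 "NICK" "NICK" "MARTIN" "NICK"   ""
3 ""     "YYYY" "NNNN"   "PAVLOS" ""
4 "ZZZZ" "FFFF" ""       ""       "TRDFF"
end

* K is a hypothetical name for the number of citation_* variables
* (5 here; about 600 in the real data).
local K 5

* Step 1: blank a value if it already appears in an earlier column of the same row.
forvalues j = 2/`K' {
    local prev = `j' - 1
    forvalues i = 1/`prev' {
        quietly replace citation_`j' = "" if citation_`j' == citation_`i'
    }
}

* Step 2: shift the surviving values to the left.
* pos (hypothetical helper) counts the non-empty values seen so far in each row,
* so it is also the slot the current non-empty value should end up in.
generate int pos = (citation_1 != "")
forvalues j = 2/`K' {
    quietly replace pos = pos + 1 if citation_`j' != ""
    local prev = `j' - 1
    forvalues i = 1/`prev' {
        * copy the value into its target slot when that slot lies to the left
        quietly replace citation_`i' = citation_`j' if pos == `i' & citation_`j' != ""
    }
    * anything that was moved left (or was empty already) is blanked in column j
    quietly replace citation_`j' = "" if pos < `j'
}
drop pos

list, noobs
```

The dataset keeps its wide shape throughout, so no larger intermediate file is created. The number of -replace- passes grows with the square of the number of variables, however, so with about 600 variables this may still be slow.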

Follow-Ups:
- Re: st: Replacing duplicate values
  - From: Robert Picard <[email protected]>

References:
- st: Replacing duplicate values
  - From: "Pavlos C. Symeou" <[email protected]>
- st: RE: Replacing duplicate values
  - From: "Nick Cox" <[email protected]>
- st: AW: RE: Replacing duplicate values
  - From: "Martin Weiss" <[email protected]>
