



st: Replacing duplicate values


From   "Pavlos C. Symeou" <p.symeou@lmu.de>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   st: Replacing duplicate values
Date   Tue, 06 Apr 2010 15:21:29 +0200

Dear Martin and Nick,

thank you for your input on my previous inquiry, which has worked in many instances. However, I am having problems reshaping the data between long and wide formats. The datasets I am working with are big (more than 1 GB) and consist of about 600 string variables. Reshaping back and forth not only takes ages to complete (I am working on a powerful quad-core Windows Vista 64 PC with Stata 11 MP) but also creates enormous files that I can't handle. I would like to ask whether you have any alternatives to the approaches below. Allow me first to explain the task again.

I have a dataset on patents. Every patent may cite multiple existing patents. The dataset is in wide format, with each patent's id followed by its citations (citation_1, citation_2, etc.). However, there are mistakes in the dataset: certain citations appear more than once for a single patent, which is meaningless. Examples are the patents with id 1 and id 2, where citations AAAA and NICK, respectively, are repeated. The patent with id 3 has three citations, but they appear in positions 2, 3, and 4 (a similar issue affects the patent with id 4).

id    citation_1    citation_2    citation_3    citation_4    citation_5
1     AAAA          BBBB          CCCC          AAAA
2     NICK          NICK          MARTIN        NICK
3                   YYYY          NNNN          PAVLOS
4     ZZZZ          FFFF                                      TRDFF
5
.

The task is to delete duplicate values within each observation and move the remaining values to the left, towards citation_1. For example, the patents with id 2 and id 3 would look like this:

id    citation_1    citation_2    citation_3    citation_4    citation_5
2     NICK          MARTIN
3     YYYY          NNNN          PAVLOS
.

You suggested I use the following code, which simply blanks out the duplicates:
***********************************************************************
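* go long: one row per id and citation slot (slot index stored in _j)
reshape long ipc_, i(id)
* flag every repeat of a value within a patent
bysort id ipc_: gen superfluousandredundant = _n > 1
replace ipc_ = "" if superfluousandredundant == 1
drop superfluousandredundant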
***********************************************************************

Further, I have used the following code to shift the remaining values of each observation to the left:
******************************************************************************************************
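* running count of the non-empty citations within each patent
g unit=1
bysort id: generate runsum = sum(unit) if ipc_!=""
rename runsum _runsum
* empty slots (missing _runsum) sort last and take the remaining positions
sort id _runsum
bysort id: g n=_n
replace _runsum=n if _runsum==.
drop _j unit n
* back to wide, numbering the columns by the new position
reshape wide ipc, i(id) j(_runsum)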
****************************************************************************************************

The problem is that both pieces of code use -reshape-, which only works when my dataset is very small (I have one dataset for each of a sample of 300 companies). Can you suggest another way to achieve the above task?
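
For illustration, one reshape-free possibility is to loop over the wide variables directly. The sketch below is only a minimal example on the toy data above, not a tested solution for the full files. The variable names citation_1 to citation_5, the local k, and the helper variables nfilled and has are assumptions made for the example. The number of -replace- passes grows with the square of the number of citation variables, so with about 600 variables it may still be slow, although it never creates a long copy of the data.

```stata
* Sketch only: toy data mirroring the example table; names are illustrative.
clear
input byte id str6 citation_1 str6 citation_2 str6 citation_3 str6 citation_4 str6 citation_5
1 "AAAA" "BBBB" "CCCC"   "AAAA"   ""
2 "NICK" "NICK" "MARTIN" "NICK"   ""
3 ""     "YYYY" "NNNN"   "PAVLOS" ""
4 "ZZZZ" "FFFF" ""       ""       "TRDFF"
end

local k 5    // number of citation variables (assumed)

* Step 1: blank a citation if the same value already appears further left.
forvalues j = 2/`k' {
    local jm1 = `j' - 1
    forvalues i = 1/`jm1' {
        quietly replace citation_`j' = "" if citation_`j' == citation_`i'
    }
}

* Step 2: pack the surviving values to the left, keeping their order.
* nfilled counts how many leading slots are already occupied in each row.
generate int nfilled = (citation_1 != "")
forvalues j = 2/`k' {
    local jm1 = `j' - 1
    generate byte has = (citation_`j' != "")
    * copy the value into the first free slot to the left, if there is one
    forvalues t = 1/`jm1' {
        quietly replace citation_`t' = citation_`j' if has & nfilled == `t' - 1
    }
    * blank the source slot only where the value was actually moved
    quietly replace citation_`j' = "" if has & nfilled < `jm1'
    quietly replace nfilled = nfilled + has
    drop has
}
drop nfilled

list, noobs
```

On the toy data this leaves NICK and MARTIN in the first two columns for id 2, and YYYY, NNNN, and PAVLOS in the first three columns for id 3. Unlike the -bysort- approach, it keeps the first occurrence of each citation in its original left-to-right order.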

Best wishes,

Pavlos
