
st: RE: Using -collapse- extensively to find historical, irregular matches: Better way?


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Using -collapse- extensively to find historical, irregular matches: Better way?
Date   Tue, 30 Sep 2003 15:43:46 +0100

Chih-Mao Hsieh
>  
> I have a two-column file with variables "citing" and 
> "cited".  "Citing" refers to a patent, and "cited" refers 
> to a patent that is "cited" by the "citing" patent.  
> Therefore, if a patent cites and therefore "recombines" 3 
> patents prior to it, this history shows up as 3 rows (end 
> of message has examples).
>  
> I need a program to catch the number of times that the 
> exact same set of patents has been "recombined" in the past 
> (i.e. imagine trying to find all the papers that cite the 
> same set of references that you do in one of your papers!).
>  
> The basic solution I have come up with is the following:
>  
> collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)
> bysort mean sum sd: gen byte counter = _n
> replace counter=counter-1
>  
> It seems to work, and as the datafile has 16 million rows, 
> with 3 million unique "citing" numbers -- therefore with a 
> fair amount of variance -- I believe it may be good enough. 
>  My questions are: (1) Is there a more accurate way, if 
> less efficient, to do what I need? (2) Is there any reason 
> I should expect Stata to calculate means, sums, and sd's in 
> different ways from row to row (i.e. rounding) that would 
> render totally ineffective my specific use of -collapse-?  
> I attach an example below.
>  
> Thanks, --Chihmao
>  
> ------------------------------------------
>  
> citing      cited
> 100          30
> 100          32
> 100          33
> 101          34
> 101          35
> 105          30
> 105          32
> 105          33
> 106          29
> 106          30
> 108          30
> 108          32
> 108          33
>  
> Desired output:
>  
> citing      counter
> 100            0
> 101            0
> 105            1    (since #100 cited the exact same list of patents, no more, no less)
> 106            0
> 108            2    (since there are now 2 prior occurrences of same patent list: #100 and #105)

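You are aware that this is a bit of a fudge: two different sets of 
cited patents can share the same mean, sum and sd, so a match on 
those three statistics does not guarantee identical lists.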
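
A small made-up example of the kind of collision that makes the -collapse- approach a fudge: the sets {1, 5, 6} and {2, 3, 7} are different, yet both have sum 12, mean 4 and sd sqrt(7), about 2.6458. The data below are hypothetical and only meant as a sketch.

```stata
* Hypothetical data: two citing patents with different cited lists
clear
input citing cited
1 1
1 5
1 6
2 2
2 3
2 7
end

* Same summary statistics as in the original question
collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)

* Both rows show mean 4, sum 12 and sd of about 2.6458
list
```

Whether such collisions matter in 16 million rows is an empirical question, but they cannot be ruled out from the summaries alone.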

I'd restructure the data like this (-string()- is needed 
if -cited- is numeric, as your use of -collapse- implies): 

gen allcited = "" 
bysort citing (cited) : replace allcited = allcited[_n-1] + " " + string(cited) 
by citing : keep if _n == _N 
bysort allcited (citing) : gen counter = _n - 1 
sort citing 
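
As a check, here is a sketch of the restructuring run on the example data from the question. It assumes -cited- is numeric, hence the -string()- call; if -cited- is already a string variable, drop that call.

```stata
* Example data from the original question
clear
input citing cited
100 30
100 32
100 33
101 34
101 35
105 30
105 32
105 33
106 29
106 30
108 30
108 32
108 33
end

* Build one string per citing patent listing everything it cites, in sorted order
gen allcited = ""
bysort citing (cited) : replace allcited = allcited[_n-1] + " " + string(cited)

* The last observation in each group holds the complete list
by citing : keep if _n == _N

* Count earlier citing patents with an identical list
bysort allcited (citing) : gen counter = _n - 1
sort citing
list citing counter
```

On these data the counters come out as 0, 0, 1, 0, 2 for patents 100, 101, 105, 106 and 108, matching the desired output. Note that "earlier" here means a smaller -citing- number.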

Now this depends on your not overflowing the length 
limits of a string variable. 
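
One way to see how close you are to that limit is to look at the lengths of the strings actually built. The sketch below uses a small made-up subset of the example data; the limit itself depends on your version and flavor of Stata, so the threshold to compare against is left to you.

```stata
* Small illustrative data set (subset of the example in the question)
clear
input citing cited
100 30
100 32
100 33
101 34
101 35
end

* Build the list of cited patents per citing patent, assuming -cited- is numeric
gen allcited = ""
bysort citing (cited) : replace allcited = allcited[_n-1] + " " + string(cited)
by citing : keep if _n == _N

* Length of each list; inspect the maximum against your string limit
gen len = length(allcited)
summarize len
```

If the maximum reported by -summarize- sits at the limit for your Stata, some lists may have been cut short and the matches for those patents should not be trusted.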

You could save some space by 

egen cited2 = group(cited) 

and then using -cited2- in place of -cited- when building -allcited-. 
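
Put together, that variant might look like the sketch below, again on a made-up subset of the example data and again assuming -cited- is numeric. -egen, group()- maps the distinct values of -cited- to the integers 1, 2, 3, ..., which are often shorter when written out than the original patent numbers.

```stata
* Small illustrative data set (subset of the example in the question)
clear
input citing cited
100 30
100 32
100 33
101 34
101 35
105 30
105 32
105 33
end

* Replace patent numbers by compact integer codes 1, 2, 3, ...
egen cited2 = group(cited)

* Same restructuring as before, but concatenating the codes
gen allcited = ""
bysort citing (cited2) : replace allcited = allcited[_n-1] + " " + string(cited2)
by citing : keep if _n == _N
bysort allcited (citing) : gen counter = _n - 1
sort citing
list citing allcited counter
```

The codes only need to identify cited patents consistently, so the counters are the same as with the original numbers: here 0, 0, 1 for patents 100, 101 and 105.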

Nick 
n.j.cox@durham.ac.uk 

