The Stata listserver

st: RE: RE: RE: Using -collapse- extensively to find historical, irregular matches: Better way?


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: RE: RE: Using -collapse- extensively to find historical, irregular matches: Better way?
Date   Tue, 30 Sep 2003 16:43:54 +0100

I mentioned one simplification which eases
the problem, namely the use of -egen, group()- to map
the cited patent numbers to integers 1 up. 
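
A minimal sketch of that combination, using the example data from the original question: -egen, group()- shortens the codes, and the sorted codes are then concatenated into one string key per citing patent. It assumes that -cited- is numeric (hence the -string()- call, which is not in the code posted earlier) and that the longest key fits within the string length limit of the Stata version in use.

```stata
* Example data typed in from the original question
clear
input long citing long cited
100 30
100 32
100 33
101 34
101 35
105 30
105 32
105 33
106 29
106 30
108 30
108 32
108 33
end

* Map the cited patent numbers to the integers 1, 2, ... to shorten the keys
egen long cited2 = group(cited)

* Build one string per citing patent listing its codes in ascending order
gen allcited = ""
bysort citing (cited2): replace allcited = ///
    allcited[_n-1] + " " + string(cited2, "%12.0f")
by citing: keep if _n == _N

* Count the earlier citing patents that have exactly the same list
bysort allcited (citing): gen long counter = _n - 1
sort citing
list citing counter, noobs
```

On these data the counter is 0 for 100, 101 and 106, 1 for 105, and 2 for 108, which is the desired output given in the question.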

I was toying with an idea of mapping them to 
successive primes and computing the product, 
but Stata, not surprisingly, has no built-in 
-prime()- function to generate successive primes. 
In any case, that wouldn't be a solution 
either, as the largest such product would, I 
guess, be too big to handle. 
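
A rough illustration of the size problem, as a sketch only: the list of primes is typed in by hand (there being no -prime()- function), and the variable names are made up for the example. A double holds every integer exactly only up to 2^53, and the running product of even the smallest primes passes that after 13 factors, well short of the 35-40 cited patents mentioned in this thread.

```stata
* Hand-typed list of the first 15 primes (illustration only)
clear
input double p
2
3
5
7
11
13
17
19
23
29
31
37
41
43
47
end

* Running product of the primes, held in a double
gen double product = p in 1
replace product = product[_n-1] * p in 2/l

* Flag the products that are still within the exact-integer range of a double
gen byte exact = product <= 2^53
format product %20.0f
list p product exact, noobs
```

Real data would be worse, since millions of distinct cited patents would need primes far larger than these.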

Nick 
[email protected] 

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of 
> Chih-Mao Hsieh
> Sent: 30 September 2003 16:23
> To: [email protected]
> Subject: st: RE: RE: Using -collapse- extensively to find historical, irregular matches: Better way?
> 
> 
> Nick, thanks for your response.
>  
> I had been shying away from converting "cited" to strings 
> because the numbers are in the millions, i.e. strings would 
> be length 7.  Many of the "citing" patents have more than 
> 35-40 "cited" patents, and so the concatenation might 
> surpass the string's length limit.
>  
> Of course, the chances are not high that two patents would 
> match each other over the first 35 patents, so your way 
> does appear to be better.
>  
> Cheers, --Chihmao
> 
> 	-----Original Message----- 
> 	From: [email protected] on behalf of Nick Cox 
> 	Sent: Tue 9/30/2003 9:43 AM 
> 	To: [email protected] 
> 	Cc: 
> 	Subject: st: RE: Using -collapse- extensively to find historical, irregular matches: Better way?
> 	
> 	
> 
> 	Chih-Mao Hsieh
> 	> 
> 	> I have a two-column file with variables "citing" and
> 	> "cited".  "Citing" refers to a patent, and "cited" refers
> 	> to a patent that is "cited" by the "citing" patent. 
> 	> Therefore, if a patent cites and therefore "recombines" 3
> 	> patents prior to it, this history shows up as 3 rows (end
> 	> of message has examples).
> 	> 
> 	> I need a program to catch the number of times that the
> 	> exact same set of patents has been "recombined" in the past
> 	> (i.e. imagine trying to find all the papers that cite the
> 	> same set of references that you do in one of your papers!).
> 	> 
> 	> The basic solution I have come up with is the following:
> 	> 
> 	> collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)
> 	> bysort mean sum sd (citing): gen long counter = _n - 1
> 	> 
> 	> It seems to work, and as the datafile has 16 million rows,
> 	> with 3 million unique "citing" numbers -- therefore with a
> 	> fair amount of variance -- I believe it may be good enough.
> 	>  My questions are: (1) Is there a more accurate way, if
> 	> less efficient, to do what I need? (2) Is there any reason
> 	> I should expect Stata to calculate means, sums, and sd's in
> 	> different ways from row to row (i.e. rounding) that would
> 	> render totally ineffective my specific use of -collapse-? 
> 	> I attach an example below.
> 	> 
> 	> Thanks, --Chihmao
> 	> 
> 	> ------------------------------------------
> 	> 
> 	> citing      cited
> 	> 100          30
> 	> 100          32
> 	> 100          33
> 	> 101          34
> 	> 101          35
> 	> 105          30
> 	> 105          32
> 	> 105          33
> 	> 106          29
> 	> 106          30
> 	> 108          30
> 	> 108          32
> 	> 108          33
> 	> 
> 	> Desired output:
> 	> 
> 	> citing      counter
> 	> 100            0
> 	> 101            0
> 	> 105            1    (since #100 cited the exact same list of patents, no more, no less)
> 	> 106            0
> 	> 108            2    (since there are now 2 prior occurrences of same patent list: #100 and #105)
> 	
> 	You are aware that this is a bit of a fudge.
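
To see why, here is a small made-up example (the data are hypothetical, not from the patent file): two different sets of three cited numbers with the same mean, sum and sd, so that matching on those three statistics would treat them as the same list.

```stata
* Hypothetical data: citing 1 cites {1, 5, 6}, citing 2 cites {2, 3, 7}
clear
input long citing long cited
1 1
1 5
1 6
2 2
2 3
2 7
end

* Same summary statistics as in the question
collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)
list, noobs

* Both sets have sum 12 and mean 4, and the same sum of squared deviations
assert mean[1] == mean[2] & sum[1] == sum[2]
assert reldif(sd[1], sd[2]) < 1e-12
```

Any two sets with equal size, sum and sum of squares collide in this way, however the statistics are rounded.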
> 	
> 	I'd restructure the data like this:
> 	
> 	gen allcited = ""
> 	bysort citing (cited) : replace allcited = allcited[_n-1] + " " + string(cited, "%12.0f")
> 	by citing : keep if _n == _N
> 	bysort allcited (citing) : gen counter = _n - 1
> 	sort citing
> 	
> 	Now this depends on your not overflowing the length
> 	limits of a string variable.
> 	
> 	You could save some space by
> 	
> 	egen cited2 = group(cited)
> 	
> 	and then using -cited2-.
> 	
> 	Nick
> 	[email protected]
> 	
> 	
> 	*
> 	*   For searches and help try:
> 	*   http://www.stata.com/support/faqs/res/findit.html
> 	*   http://www.stata.com/support/statalist/faq
> 	*   http://www.ats.ucla.edu/stat/stata/
> 	

