The Stata listserver

st: RE: RE: RE: Using -collapse- extensively to find historical, irregular matches: Better way?


From   "Chih-Mao Hsieh" <Hsieh@olin.wustl.edu>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: RE: RE: Using -collapse- extensively to find historical, irregular matches: Better way?
Date   Tue, 30 Sep 2003 10:32:02 -0500

Nick, please disregard that last message.  Your message obviously already responded to it.
 
Best,
CM

	-----Original Message----- 
	From: owner-statalist@hsphsun2.harvard.edu on behalf of Chih-Mao Hsieh 
	Sent: Tue 9/30/2003 10:23 AM 
	To: statalist@hsphsun2.harvard.edu 
	Cc: 
	Subject: st: RE: RE: Using -collapse- extensively to find historical, irregular matches: Better way?
	
	

	Nick, thanks for your response.
	
	I had been shying away from converting "cited" to strings because the numbers are in the millions, so each would be a string of length 7.  Many of the "citing" patents have more than 35 to 40 "cited" patents, so the concatenation might exceed the length limit for string variables.
	
	Of course, the chances are not high that two patents would match each other over the first 35 patents, so your way does appear to be better.
	
	Cheers, --Chihmao
	
	        -----Original Message-----
	        From: owner-statalist@hsphsun2.harvard.edu on behalf of Nick Cox
	        Sent: Tue 9/30/2003 9:43 AM
	        To: statalist@hsphsun2.harvard.edu
	        Cc:
	        Subject: st: RE: Using -collapse- extensively to find historical, irregular matches: Better way?
	       
	       
	
	        Chih-Mao Hsieh
	        >
	        > I have a two-column file with variables "citing" and
	        > "cited".  "Citing" refers to a patent, and "cited" refers
	        > to a patent that is "cited" by the "citing" patent.
	        > Therefore, if a patent cites and therefore "recombines" 3
	        > patents prior to it, this history shows up as 3 rows (end
	        > of message has examples).
	        >
	        > I need a program to catch the number of times that the
	        > exact same set of patents has been "recombined" in the past
	        > (i.e. imagine trying to find all the papers that cite the
	        > same set of references that you do in one of your papers!).
	        >
	        > The basic solution I have come up with is the following:
	        >
	        > collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)
	        > bysort mean sum sd: gen byte counter = _n
	        > replace counter=counter-1
	        >
	        > It seems to work, and as the datafile has 16 million rows,
	        > with 3 million unique "citing" numbers -- therefore with a
	        > fair amount of variance -- I believe it may be good enough.
	        >  My questions are: (1) Is there a more accurate way, if
	        > less efficient, to do what I need? (2) Is there any reason
	        > I should expect Stata to calculate means, sums, and sd's in
	        > different ways from row to row (i.e. rounding) that would
	        > render totally ineffective my specific use of -collapse-?
	        > I attach an example below.
	        >
	        > Thanks, --Chihmao
	        >
	        > ------------------------------------------
	        >
	        > citing      cited
	        > 100          30
	        > 100          32
	        > 100          33
	        > 101          34
	        > 101          35
	        > 105          30
	        > 105          32
	        > 105          33
	        > 106          29
	        > 106          30
	        > 108          30
	        > 108          32
	        > 108          33
	        >
	        > Desired output:
	        >
	        > citing      counter
	        > 100            0
	        > 101            0
	        > 105            1    (since #100 cited the exact same list of patents, no more, no less)
	        > 106            0
	        > 108            2    (since there are now 2 prior occurrences of same patent list: #100 and #105)
	       
	        You are aware that this is a bit of a fudge.
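
One way to see why matching on mean, sum, and sd is only approximate is the small sketch below. The citing and cited numbers are made up for illustration and do not come from the patent data. The sets {1, 5, 6} and {2, 3, 7} are different, yet both have three members, sum 12, and sum of squares 62, so their mean, sum, and sd coincide and the -collapse- approach would treat them as the same list.

```stata
* Hypothetical data: two citing patents with different cited lists
clear
input long citing long cited
1 1
1 5
1 6
2 2
2 3
2 7
end

* Same summary statistics as in the original post
collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)

* Both rows should show the same mean, sum, and sd
* even though the underlying lists differ
list
```

How often such collisions occur in real citation data is a separate question; the sketch only shows that they are possible.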
	       
	        I'd restructure the data like this:
	       
	        gen allcited = ""
	        bysort citing (cited) : replace allcited = allcited[_n-1] + " " + string(cited, "%12.0f")
	        by citing : keep if _n == _N
	        bysort allcited (citing) : gen counter = _n - 1
	        sort citing
	       
	        Now this depends on your not overflowing the length
	        limits of a string variable.
	       
	        You could save some space by
	       
	        egen cited2 = group(cited)
	       
	        and then using -cited2-.
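
The steps above, combined with the -egen, group()- recoding, can be sketched on the example data from the original post as follows. This is a minimal sketch under some assumptions: -cited- is taken to be numeric, so it is converted with string() before concatenation, and the "%12.0f" format is a choice made here to avoid scientific notation for long numbers. The -input- block is only there to make the example self-contained.

```stata
* Example data from the original post
clear
input long citing long cited
100 30
100 32
100 33
101 34
101 35
105 30
105 32
105 33
106 29
106 30
108 30
108 32
108 33
end

* Recode cited to compact integers 1, 2, ... to shorten the concatenated string
egen long cited2 = group(cited)

* Build one string per citing patent listing its cited codes in sorted order
gen allcited = ""
bysort citing (cited2) : replace allcited = allcited[_n-1] + " " + string(cited2, "%12.0f")
by citing : keep if _n == _N

* Count earlier citing patents with an identical cited list
bysort allcited (citing) : gen long counter = _n - 1
sort citing
list citing counter
```

Because the lists are compared as exact strings, two citing patents match only when their full sets of cited patents agree, provided the concatenated string fits within the string length limit.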
	       
	        Nick
	        n.j.cox@durham.ac.uk
	       
	       
	        *
	        *   For searches and help try:
	        *   http://www.stata.com/support/faqs/res/findit.html
	        *   http://www.stata.com/support/statalist/faq
	        *   http://www.ats.ucla.edu/stat/stata/
	       
	
	
	




