
From: "Nick Cox" <[email protected]>
To: <[email protected]>
Subject: st: RE: Using -collapse- extensively to find historical, irregular matches: Better way?
Date: Tue, 30 Sep 2003 15:43:46 +0100

Chih-Mao Hsieh
>
> I have a two-column file with variables "citing" and
> "cited". "Citing" refers to a patent, and "cited" refers
> to a patent that is "cited" by the "citing" patent.
> Therefore, if a patent cites and therefore "recombines" 3
> patents prior to it, this history shows up as 3 rows (end
> of message has examples).
>
> I need a program to catch the number of times that the
> exact same set of patents has been "recombined" in the past
> (i.e. imagine trying to find all the papers that cite the
> same set of references that you do in one of your papers!).
>
> The basic solution I have come up with is the following:
>
> collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)
> bysort mean sum sd: gen byte counter = _n
> replace counter=counter-1
>
> It seems to work, and as the datafile has 16 million rows,
> with 3 million unique "citing" numbers -- therefore with a
> fair amount of variance -- I believe it may be good enough.
> My questions are: (1) Is there a more accurate way, if
> less efficient, to do what I need? (2) Is there any reason
> I should expect Stata to calculate means, sums, and sd's in
> different ways from row to row (i.e. rounding) that would
> render totally ineffective my specific use of -collapse-?
> I attach an example below.
>
> Thanks, --Chihmao
>
> ------------------------------------------
>
> citing cited
> 100 30
> 100 32
> 100 33
> 101 34
> 101 35
> 105 30
> 105 32
> 105 33
> 106 29
> 106 30
> 108 30
> 108 32
> 108 33
>
> Desired output:
>
> citing counter
> 100 0
> 101 0
> 105 1 (since #100 cited the exact same list
> of patents, no more, no less)
> 106 0
> 108 2 (since there are now 2 prior
> occurrences of same patent list: #100 and #105)

You are aware that this is a bit of a fudge.
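To see why matching on mean, sum, and sd is a fudge, here is a minimal sketch with made-up data (the patent numbers are hypothetical, chosen only for illustration). The two "citing" patents below cite entirely different lists, {1, 5, 6} and {2, 3, 7}, yet both lists have three members, sum to 12, and have squared values summing to 62, so the -collapse- statistics cannot tell them apart.

```stata
* Hypothetical data: two citing patents with different cited lists
clear
input long citing long cited
1 1
1 5
1 6
2 2
2 3
2 7
end

* Same summary statistics as in the original post
collapse (mean) mean=cited (sum) sum=cited (sd) sd=cited, by(citing)

* Both rows show mean 4, sum 12, and the same sd (the square root of 7)
list
```

With 3 million citing patents such coincidences may be rare, but the summary statistics alone give no way of ruling them out.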
I'd restructure the data like this:

    gen allcited = ""
    bysort citing (cited) : replace allcited = allcited[_n-1] + " " + string(cited)
    by citing : keep if _n == _N
    bysort allcited (citing) : gen counter = _n - 1
    sort citing

(-string()- is needed because -cited- is numeric.)

Now this depends on your not overflowing the length limits
of a string variable. You could save some space by

    egen cited2 = group(cited)

and then using -cited2-.

Nick
[email protected]

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
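The restructuring suggested in the reply can be checked against the example data from the original post. This is a sketch only: it assumes -cited- is stored as a numeric variable (hence -string()-) and that the concatenated list fits within the string length limit of the Stata version in use.

```stata
* Example data from the original post
clear
input long citing long cited
100 30
100 32
100 33
101 34
101 35
105 30
105 32
105 33
106 29
106 30
108 30
108 32
108 33
end

* Build one string per citing patent listing everything it cites, in sorted order
gen allcited = ""
bysort citing (cited) : replace allcited = allcited[_n-1] + " " + string(cited)

* The last observation in each group holds the complete list
by citing : keep if _n == _N

* Within identical lists, count earlier citing patents
bysort allcited (citing) : gen counter = _n - 1
sort citing
list citing counter
```

Because -cited- is sorted within each -citing- before concatenation, two patents receive the same -allcited- string only if they cite exactly the same list, and sorting on -citing- within -allcited- makes the counter reflect the number of earlier patents with that list (assuming lower patent numbers are earlier).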
