st: RE: RE: RE: Short program to "collapse (# unique elements)": Use of nested loops and a "weights not allowed" message

Tue, 30 Sep 2003 08:38:34 -0500

Nick Cox

Thank you! As these datasets have millions of observations, any time-saving strategy will be important.

Best,
CM

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu on behalf of Nick Cox
Sent: Tue 9/30/2003 7:12 AM
To: statalist@hsphsun2.harvard.edu
Cc:
Subject: st: RE: RE: Short program to "collapse (# unique elements)": Use of nested loops and a "weights not allowed" message

Chih-Mao Hsieh
> > I have a data file with three columns: citing, cited, nclass.
> > For every "citing", there are multiple "cited", and for each
> > "cited" there is a "nclass". The file is sorted by citing,
> > then nclass. I need a program to count the number of
> > unique "nclass" strings associated to each "citing".
> >
> > As a simple example, given the following data file "data.dta":
> >
> > citing   cited   nclass
> > 100      20      12
> > 100      22      15
> > 100      23      15
> > 101      32      14
> > 101      33      15
> > 101      34      15
> > 101      40      17
> >
> > I need the following output file:
> >
> > citing   numpatclass
> > 100      2    [12 and 15 are unique, 15 is repeated]
> > 101      3    [14, 15, 17 are unique, 15 is repeated]

> Phil Ryan gave excellent advice explaining how
> this can be done, without loops, by using -by:-.
>
> In addition, note the FAQ
> How do I compute the number of distinct observations?
> http://www.stata.com/support/faqs/data/distinct.html
> which explains approaches using -by:-, similar in
> spirit to Phil's solution, and also gives manual
> references and references to user-written software
> in this area.
>
> Thus, a canned solution here is
>
> bysort citing : egen numpatclass = nvals(nclass)
> by citing : keep if _n == 1

Another approach is a double -contract-:

    contract citing nclass
    contract citing, freq(numpatclass)

After the first -contract-, the number of observations for each value of -citing- is the number of distinct values of -nclass- observed for each; so the second -contract- immediately yields the desired count variable.
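For anyone who wants to try both routes on the example data from the original post, here is a minimal sketch as a do-file. It re-enters the seven example rows with -input- rather than reading "data.dta". The helper variable name `first` is my own choice, not something from the thread. The first approach tags the first observation of each (citing, nclass) pair and takes a running sum, so it uses only built-in commands; -egen, nvals()- is not official Stata and, as far as I know, needs the user-written egenmore package from SSC.

```stata
* Re-enter the example data from the original post
clear
input citing cited nclass
100 20 12
100 22 15
100 23 15
101 32 14
101 33 15
101 34 15
101 40 17
end

* Approach 1: -by:- with a tag variable (built-in commands only)
preserve
* first = 1 on the first observation of each (citing, nclass) pair
bysort citing nclass: gen byte first = (_n == 1)
* running sum of the tags within each citing
by citing: gen numpatclass = sum(first)
* the last observation in each group holds the full count
by citing: keep if _n == _N
keep citing numpatclass
list, noobs
restore

* Approach 2: double -contract-
* one observation per distinct (citing, nclass) pair
contract citing nclass
* count those observations within each citing
contract citing, freq(numpatclass)
list, noobs
```

Both listings should show 2 for citing 100 and 3 for citing 101, matching the desired output in the original post. With millions of observations it may be worth timing the two on a subsample before committing to one.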
That this solution using -contract- makes no use of -by:- or -_N- is pure illusion. Look inside -contract- at the Stata code -- -contract- is implemented as an .ado -- and you will see that it is based on exactly the same machinery.

Nick
n.j.cox@durham.ac.uk

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

