Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | daniel klein <klein.daniel.81@gmail.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | Re: st: the fastest way to check if unique values of a variable > 100 |
Date | Tue, 27 Aug 2013 11:45:41 +0200 |
As I mentioned, I did not get what you were trying to do. To me, your original post sounded like you wanted to check whether any value of a variable is larger than 100. Re-reading your post now, it is (hopefully) clear that you want to check whether a given variable has more than 100 unique values. Sorry for the misunderstanding.

How about this simple Mata approach?

m :
real scalar muniq(string scalar varn, real scalar brk)
{
    real colvector x    // a view onto one variable is N x 1
    real scalar u, i

    st_view(x, ., varn)
    u = 1    // assumes at least one observation, sorted on varn
    for (i = 2; i <= rows(x); ++i) {
        // in sorted data, a change from the previous row is a new value
        if (x[i, 1] != x[i - 1, 1]) ++u
        // stop early once brk unique values have been found
        if (u >= brk) break
    }
    return(u)
}
end

Here is a timed example compared to -tabulate-

// example
sysuse auto ,clear
expand 10000
sort price

timer clear

timer on 1
qui ta price
di r(r)
timer off 1

timer on 2
m : muniq("price", 74)
timer off 2

timer on 3
m : muniq("price", 37)
timer off 3

timer list
// end example

As you see, the code should be almost as fast as -tabulate- when it has to go through all possible unique values (74 in this case), and should be faster when you cap the number of unique values to be found (at 37 in the example).

Note that the Mata function currently requires the data to be -sort-ed on the respective variable to work. Sorting takes some extra time (and one would want to integrate it into the code if one planned to make it a serious function or program), but I guess that if you cap the number of unique values at 100 you should still be faster than with -tabulate-.

Best
Daniel
--
I think your original one-liner was -cap as foo > 100 ,f-. This would check for 100 unique values only if the values can only be positive integers. Otherwise it leads to false positives (e.g. if I have only two values, but one is 2321) or false negatives (if I have 2000 values of 0(0.01)19.99). That's what I meant.
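The note about integrating the sort into the code could be sketched roughly as follows. This is only an illustration, not a tested replacement: the name muniq_unsorted is hypothetical, and it assumes that copying the variable with st_data() and sorting the copy with Mata's sort() is acceptable in terms of memory and time.

```mata
mata:
// Hypothetical variant of muniq(): sorts a copy of the variable inside Mata,
// so the dataset itself does not need to be -sort-ed beforehand.
real scalar muniq_unsorted(string scalar varn, real scalar brk)
{
    real colvector x
    real scalar u, i

    x = sort(st_data(., varn), 1)   // copy the variable and sort the copy
    if (rows(x) == 0) return(0)     // no observations, no unique values
    u = 1
    for (i = 2; i <= rows(x); ++i) {
        if (x[i] != x[i - 1]) ++u   // a change in sorted data is a new value
        if (u >= brk) break         // stop once brk unique values are found
    }
    return(u)
}
end
```

It would be called the same way as the original, e.g. -mata: muniq_unsorted("price", 100)-. Because the sort runs over all observations before the loop can stop early, whether this still beats -tabulate- would have to be checked with -timer- on the data at hand.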