
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: st: the fastest way to check if unique values of a variable > 100


From   Joe Canner <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: st: the fastest way to check if unique values of a variable > 100
Date   Wed, 28 Aug 2013 16:40:46 +0000

The requirement that the data set be sorted before calling the Mata routine can be a significant hit with a large data set.  Based on my benchmarks with an 8 million observation data set, the -tabulate- method is faster unless the number of unique values is quite high (not sure where the break-even point is but it is at least in the several hundreds).

Also, this routine only works for numeric variables. For something that does both string and numeric variables you could use something like:

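prog unique
* accept exactly one variable, string or numeric
syntax varname [if] [in]
qui levelsof `varlist' `if' `in', local(lev)

* count the levels, stopping as soon as the count passes 100
local n=0
foreach l of local lev {
  local ++n
  if `n'>100 {
    continue, break
  }
}

di "`n'"
end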

Since -levelsof- uses -tabulate- for numeric variables, its performance for numeric variables is similar to that of -tabulate- (for reasonable numbers of unique values).  And since -levelsof- uses a different method for string variables (basically a -bysort-), it is faster than -tabulate- when you have string variables with a large number of unique values.
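
For a quick interactive check without defining a program, the same idea might be sketched as follows. This is only an illustration: it assumes the auto data set shipped with Stata, the macro names lev and n are arbitrary, and it uses the extended macro function -word count- in place of the loop, so it counts every level rather than stopping at 101.

```stata
* sketch only: auto.dta and the macro names lev and n are illustrative
sysuse auto, clear
quietly levelsof make, local(lev)

* each level is one (possibly quoted) token, so the token count is the number of levels
local n : word count `lev'
display "`n' unique values; more than 100? " (`n' > 100)
```

With a very large number of distinct values the local macro itself gets long, so the early exit in the loop version may still be preferable there.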

Regards,
Joe

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of daniel klein
Sent: Tuesday, August 27, 2013 5:46 AM
To: [email protected]
Subject: Re: st: the fastest way to check if unique values of a variable > 100

As I mentioned, I did not get what you were trying to do. To me, your original post sounded like you wanted to check whether any value of a variable is larger than 100. Re-reading your post now, it is (hopefully) clear that you want to check whether a given variable has more than 100 unique values. Sorry for the misunderstanding.

How about this simple Mata approach?

m :
real scalar muniq(string scalar varn, real scalar brk) {
    real colvector x    // st_view() puts observations in rows
    real scalar i, u

    st_view(x, . ,varn)

    u = 1
    for (i = 2; i <= rows(x); ++i) {
        if (x[i, 1] != x[i - 1, 1]) ++u
        if (u >= brk) break
    }

    return(u)
}
end

Here is a timed example compared to -tabulate-

// example
sysuse auto ,clear
expand 10000

sort price

timer clear
timer on 1
qui ta price
di r(r)
timer off 1

timer on 2
m : muniq("price", 74)
timer off  2

timer on 3
m : muniq("price", 37)
timer off 3

timer list
// end example

As you see, the code should be almost as fast as -tabulate- if you are going through the maximum of possible unique values (74 in this case), and should be faster if you constrain the number of unique values to be found (to 37 in the example).

Note that the Mata function currently requires the data to be -sort-ed on the respective variable to work. This needs some extra time (and one would want to integrate it into the code if one planned to make it a serious function or program), but I guess if you constrain the number of unique values to 100 you should still be faster than with -tabulate-.
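
One way the sorting step could be folded into the function is sketched below. This is an assumption-laden illustration, not a tested replacement: the name muniq_sorted is hypothetical, and it copies the variable with st_data() and sorts the copy with Mata's sort(), so the data set in memory is left in its original order at the cost of extra memory and the sort time itself.

```stata
mata:
// hypothetical variant of muniq() that sorts a copy of the variable itself
real scalar muniq_sorted(string scalar varn, real scalar brk)
{
    real colvector x
    real scalar    i, u

    // copy the numeric variable into Mata and sort the copy
    x = sort(st_data(., varn), 1)
    if (rows(x) == 0) return(0)

    // count changes between neighbours, stopping once brk is reached
    u = 1
    for (i = 2; i <= rows(x) & u < brk; i++) {
        if (x[i] != x[i - 1]) u++
    }
    return(u)
}
end

// usage, assuming the auto data set shipped with Stata
sysuse auto, clear
mata: muniq_sorted("price", 100)
```

Like the original, this handles numeric variables only, and missing values are counted as values in their own right. The sort still touches every observation, so the early exit only saves time in the counting loop.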

Best
Daniel
-- 

I think your original one-liner was -cap as foo > 100 ,f-. This would check for 100 unique values only if the values can only be positive integers. Otherwise it leads to false positives (e.g. if I have only two values, but one is 2321) or false negatives (if I have 2000 values of 0(0.01)19.99). That's what I meant.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


