From:    "William Gould, StataCorp LP" <wgould@stata.com>
To:      statalist@hsphsun2.harvard.edu
Subject: Re: st: inlist() equivalent for mata?
Date:    Fri, 21 Mar 2014 11:42:10 -0500
Andrew Maurer <Andrew.Maurer@qrm.com> wrote a response to my last
posting on his problem.  What follows is long, perhaps entertaining,
but not really useful for helping Andrew with his problem.  It is
perhaps informative, however, in learning how to think about these
kinds of problems.

In addition, I might be misinterpreting what Andrew means by the
"array approach", but that doesn't matter.  I'm going to assume that
"array approach" means "asarray() approach".  Perhaps Andrew means a
real array approach.  If so, that would change a word here and there
in what follows, but not in how I'm thinking about the problem.

Andrew writes,

> I tried benchmarking the method you described against the original
> method of using an array that I posted.  It seems like for small
> list size (rows(L)) the sum method is faster, but as soon as L
> length is greater than a few hundred, it is substantially slower
> (e.g., 26s vs 3.4s for L length of 2,000).  I suspect that
> sum(b1 :== L) is internally looping through the elements of L.  So
> with Bsize of 1 billion and Lsize of 100,000, it would effectively
> do a loop of 1 billion iterations over 100,000 elements.  See code
> comparing methods below.

Your speculation as to cause is spot on.  sum() loops over all the
elements rather than stopping immediately if the element is found.
Thus, on any given SUCCESSFUL search, one expects the sum() approach
to execute twice as many computer operations as a stop-on-success
method would execute.  Here's the reasoning (both approaches are
sketched in Mata below):

1.  Let N_a denote the number of values being searched over, which
    values I put in the vector L.

2.  The number of computer operations required by the sum() approach
    is thus proportional to N_a.  Let's call the proportionality
    factor m1, so the number of computer operations required is
    m1*N_a + f1, where f1 is the fixed time-cost of starting the
    search.

3.  Andrew implemented a stop-on-success approach using Mata.
    Assuming Andrew used asarray(), the execution time will be
    m2*g(N_a) + f2.  Describing g() is difficult, except that
    g(N_a) < N_a for all values of N_a and, for asarray(),
    dg()/dN_a < 1.  In addition, assuming Andrew is using asarray(),
    f2 > f1.

4.  I implemented my approach using Mata's built-in sum() function.
    Andrew wrote Mata code.  Thus we would expect m1 < m2.  Moreover,
    my approach is operation-wise simpler than the asarray() approach
    used by Andrew, which is yet another reason why m1 should be
    less than m2.

5.  Thus, m1*N_a + f1 < m2*g(N_a) + f2 for small N_a.  As N_a
    increases, these conditions lead to the certainty that the
    asarray() approach will eventually run faster than the sum()
    approach.

> I also tried a method where I create a mata object f3, f9, f23,
> etc. for every element in L.  I then loop through the elements of B
> and use Stata's -capture- to check whether the mata object exists.
> This method is the slowest of them all (although I'm not quite sure
> why.  I would have thought that checking to see if scalar f1
> exists, for example, would be fast.)

No, checking whether an object exists is slow.  It requires searching
tables of objects that do exist.  Stata does that kind of thing
constantly.  Mata, however, is a compiler: existing things have an
address, addresses are resolved when the routine is compiled, and at
execution time Mata goes right to the desired object.  When you ask
whether an object exists, you are forcing Mata to behave like Stata's
ado language.  Andrew has just discovered an important reason why
Mata is so much faster than Stata's ado.
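To make that concrete, a Mata rendition of Andrew's existence test
might look like the sketch below.  (This is my sketch, not Andrew's
code; the f<value> naming follows his description, and he actually
drove the test from Stata with -capture-.)

        mata:
        // Ask, by name, whether a global f<b> exists.  findexternal()
        // must search Mata's table of globals on every call -- that
        // table search is the slow part being described above.
        real scalar in_by_existence(real scalar b)
        {
            return(findexternal("f" + strofreal(b)) != NULL)
        }
        end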
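And here is a minimal sketch of the two membership tests compared in
points 1-5.  The function names are mine, chosen for illustration:

        mata:
        // sum() approach: b is compared against every element of L,
        // whether or not a match has already been found.
        real scalar in_by_sum(real scalar b, real colvector L)
        {
            return(sum(b :== L) > 0)
        }

        // asarray() approach: build the table once (the fixed cost
        // f2), then each lookup hashes b and stops once resolved.
        transmorphic build_table(real colvector L)
        {
            transmorphic A
            real scalar  i

            A = asarray_create("real")
            for (i = 1; i <= rows(L); i++) asarray(A, L[i], 1)
            return(A)
        }

        real scalar in_by_asarray(transmorphic A, real scalar b)
        {
            return(asarray_contains(A, b))
        }
        end

Once build_table(L) has been paid for, each in_by_asarray() call
costs roughly m2*g(N_a), with little dependence on where (or whether)
b appears in L; that is the crossover described in point 5.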
> I suspect that hash1'ing is fast, but the collision checking is
> slow.  The initial fixed cost of creating a perfect hash table
> would likely be far less than the cost of collision-resolving 1
> billion times, although I don't know enough about hash algorithms
> to know if it's possible to create a perfect hash from a given set
> of integers.

hash1'ing is fast.  There is no such thing, however, as a perfect
hash function in the sense that the probability of collision is ever
0.  A hash function is defined as follows:

    Let U be a set of POSSIBLE values, and let u represent a member
    of that set.  Hash function h = H(u, n) is defined over U such
    that

        1 <= h <= n,  h an integer,

    where n is a number you may choose.

The larger is n, the lower will be p, the probability that two
different u's will have the same H() value.  It is not required,
however, that p ever be zero.  That is because hash functions are
supposed to be general, and U is really the universe of possible
values over all possible problems.  There are other solutions for
small-universe problems.

While there is no perfect hash function, some functions are better
for some problems than others.  hash1() is known to be pretty good
over a wide range of problems.
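A bucketed table of the kind Andrew is imagining can be sketched with
hash1() and pointers; collisions are resolved by scanning the (short)
bucket.  This is a sketch under my own naming, not a recipe:

        mata:
        // Build an n-bucket table from list L.  Each bucket holds
        // the elements of L that hash1() to that bucket.
        pointer(real colvector) colvector buildbuckets(
                real colvector L, real scalar n)
        {
            pointer(real colvector) colvector T
            real scalar i, h

            T = J(n, 1, NULL)
            for (i = 1; i <= rows(L); i++) {
                h = hash1(L[i], n)
                // &(expr) yields a pointer to a temporary copy
                T[h] = (T[h] == NULL) ? &(L[i]) : &(*T[h] \ L[i])
            }
            return(T)
        }

        // Membership: one hash1() call, then a scan of one bucket.
        real scalar inbuckets(
                pointer(real colvector) colvector T, real scalar b)
        {
            real scalar h

            h = hash1(b, rows(T))
            return((T[h] == NULL) ? 0 : any(*T[h] :== b))
        }
        end

Choosing n a few times rows(L) keeps the buckets short, so the
collision-checking cost Andrew worries about stays close to the
fixed cost of the hash itself.

-- Bill
wgould@stata.com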