From:    "William Gould, StataCorp LP" <wgould@stata.com>
To:      statalist@hsphsun2.harvard.edu
Subject: Re: st: inlist() equivalent for mata?
Date:    Fri, 21 Mar 2014 11:42:10 -0500
Andrew Maurer <Andrew.Maurer@qrm.com> wrote a response to my last
posting on his problem.  What follows is long, perhaps entertaining,
but not really useful for helping Andrew with his problem.  It is
perhaps informative, however, in learning how to think about these
kinds of problems.

In addition, I might be misinterpreting what Andrew means by the
"array approach", but that doesn't matter.  I'm going to assume that
"array approach" means "asarray() approach".  Perhaps Andrew means a
real array approach.  If so, that would change a word here and there
in what follows, but not in how I'm thinking about the problem.

Andrew writes,

> I tried benchmarking the method you described against the original
> method of using an array that I posted.  It seems like for small
> list size (rows(L)) the sum method is faster, but as soon as L
> length is greater than a few hundred, it is substantially slower
> (e.g., 26s vs 3.4s for L length of 2,000).  I suspect that
> sum(b1 :== L) is internally looping through the elements of L.  So
> with Bsize of 1 billion and Lsize of 100,000, it would effectively
> do a loop of 1 billion iterations over 100,000 elements.  See code
> comparing methods below.

Your speculation as to cause is spot on.  sum() loops over all the
elements rather than stopping immediately if the element is found.
Thus, on any given SUCCESSFUL search, one expects the sum() approach
to execute twice as many computer operations as a stop-on-success
method would execute.  Here's the reasoning (both approaches are
sketched in Mata below):

1.  Let N_a denote the number of values being searched over, which
    values I put in the vector L.

2.  The number of computer operations required by the sum() approach
    is thus proportional to N_a.  Let's call the proportionality
    factor m1, so the number of computer operations required is
    m1*N_a + f1, where f1 is the fixed time-cost of starting the
    search.

3.  Andrew implemented a stop-on-success approach using Mata.
    Assuming Andrew used asarray(), the execution time will be
    m2*g(N_a) + f2.  Describing g() is difficult, except that
    g(N_a) < N_a for all values of N_a and, for asarray(),
    dg()/dN_a < 1.  In addition, assuming Andrew is using asarray(),
    f2 > f1.

4.  I implemented my approach using Mata's built-in sum() function.
    Andrew wrote Mata code.  Thus we would expect m1 < m2.  Moreover,
    my approach is operation-wise simpler than the asarray() approach
    used by Andrew, which is yet another reason why m1 should be
    less than m2.

5.  Thus, m1*N_a + f1 < m2*g(N_a) + f2 for small N_a.  As N_a
    increases, these conditions lead to the certainty that the
    asarray() approach will eventually run faster than the sum()
    approach.

> I also tried a method where I create a mata object f3, f9, f23,
> etc. for every element in L.  I then loop through the elements of B
> and use Stata's -capture- to check whether the mata object exists.
> This method is the slowest of them all (although I'm not quite sure
> why.  I would have thought that checking to see if scalar f1
> exists, for example, would be fast.)

No, checking whether an object exists is slow.  It requires searching
tables of objects that do exist.  Stata does that kind of thing
constantly.  Mata, however, is a compiler: existing things have an
address, addresses are resolved when the routine is compiled, and at
execution time Mata goes right to the desired object.  When you ask
whether an object exists, you are forcing Mata to behave like Stata's
ado language.  Andrew has just discovered an important reason why
Mata is so much faster than Stata's ado.
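To make that concrete, a Mata rendition of Andrew's existence test
might look like the sketch below.  (This is my sketch, not Andrew's
code; the f<value> naming follows his description, and he actually
drove the test from Stata with -capture-.)

        mata:
        // Ask, by name, whether a global f<b> exists.  findexternal()
        // must search Mata's table of globals on every call -- that
        // table search is the slow part being described above.
        real scalar in_by_existence(real scalar b)
        {
            return(findexternal("f" + strofreal(b)) != NULL)
        }
        end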
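And here is a minimal sketch of the two membership tests compared in
points 1-5.  The function names are mine, chosen for illustration:

        mata:
        // sum() approach: b is compared against every element of L,
        // whether or not a match has already been found.
        real scalar in_by_sum(real scalar b, real colvector L)
        {
            return(sum(b :== L) > 0)
        }

        // asarray() approach: build the table once (the fixed cost
        // f2), then each lookup hashes b and stops once resolved.
        transmorphic build_table(real colvector L)
        {
            transmorphic A
            real scalar  i

            A = asarray_create("real")
            for (i = 1; i <= rows(L); i++) asarray(A, L[i], 1)
            return(A)
        }

        real scalar in_by_asarray(transmorphic A, real scalar b)
        {
            return(asarray_contains(A, b))
        }
        end

Once build_table(L) has been paid for, each in_by_asarray() call
costs roughly m2*g(N_a), with little dependence on where (or whether)
b appears in L; that is the crossover described in point 5.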
> I suspect that hash1'ing is fast, but the collision checking is
> slow.  The initial fixed cost of creating a perfect hash table
> would likely be far less than the cost of collision-resolving 1
> billion times, although I don't know enough about hash algorithms
> to know if it's possible to create a perfect hash from a given set
> of integers.

hash1'ing is fast.  There is no such thing, however, as a perfect
hash function in the sense that the probability of collision is ever
0.  A hash function is defined as follows:

    Let U be a set of POSSIBLE values, and let u represent a member
    of that set.  Hash function h = H(u, n) is defined over U such
    that

        1 <= h <= n,  h an integer,

    where n is a number you may choose.

The larger is n, the lower will be p, the probability that two
different u's will have the same H() value.  It is not required,
however, that p ever be zero.  That is because hash functions are
supposed to be general, and U is really the universe of possible
values over all possible problems.  There are other solutions for
small-universe problems.

While there is no perfect hash function, some functions are better
for some problems than others.  hash1() is known to be pretty good
over a wide range of problems.
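A bucketed table of the kind Andrew is imagining can be sketched with
hash1() and pointers; collisions are resolved by scanning the (short)
bucket.  This is a sketch under my own naming, not a recipe:

        mata:
        // Build an n-bucket table from list L.  Each bucket holds
        // the elements of L that hash1() to that bucket.
        pointer(real colvector) colvector buildbuckets(
                real colvector L, real scalar n)
        {
            pointer(real colvector) colvector T
            real scalar i, h

            T = J(n, 1, NULL)
            for (i = 1; i <= rows(L); i++) {
                h = hash1(L[i], n)
                // &(expr) yields a pointer to a temporary copy
                T[h] = (T[h] == NULL) ? &(L[i]) : &(*T[h] \ L[i])
            }
            return(T)
        }

        // Membership: one hash1() call, then a scan of one bucket.
        real scalar inbuckets(
                pointer(real colvector) colvector T, real scalar b)
        {
            real scalar h

            h = hash1(b, rows(T))
            return((T[h] == NULL) ? 0 : any(*T[h] :== b))
        }
        end

Choosing n a few times rows(L) keeps the buckets short, so the
collision-checking cost Andrew worries about stays close to the
fixed cost of the hash itself.

-- Bill
wgould@stata.com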