
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



st: RE: indicator variables from -by-


From   Joe Canner <[email protected]>
To   "[email protected]" <[email protected]>
Subject   st: RE: indicator variables from -by-
Date   Mon, 26 Aug 2013 17:08:03 +0000

Laszlo,

My guess is that -bys- takes good advantage of the sorting. In fact, -by- refuses to run unless the data are already sorted by the by-variables (which is what -bysort-, or the -sort- option, takes care of), probably because unsorted data would ruin the optimization.

To illustrate, try the following:

gen obs=_n
sum AGE if inrange(obs,1000000,2000000)

and

sum AGE in 1000000/2000000

In my test (with a dataset of almost 8 million observations), the former (not counting the -gen-) took 20 times longer than the latter. Similarly, the -bys- code presumably accesses all observations in a given level of the by-variable more or less by observation number, rather than by testing an -if- condition. (I think Nick Cox alluded to this a while back.)
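
A minimal sketch of how the comparison above could be timed with -timer-. The dataset here is simulated, and its size, the seed, and the contents of AGE are assumptions chosen only for illustration; actual timings will depend on the machine and the data.

```stata
* Simulated stand-in data (assumption: 3 million observations, made-up AGE)
clear
set obs 3000000
set seed 12345
generate byte AGE = floor(100*runiform())
generate long obs = _n

timer clear

* Approach 1: -if- must evaluate the condition on every observation
timer on 1
summarize AGE if inrange(obs, 1000000, 2000000), meanonly
timer off 1
local n_if = r(N)

* Approach 2: -in- goes straight to the requested range of observations
timer on 2
summarize AGE in 1000000/2000000, meanonly
timer off 2
local n_in = r(N)

* Show elapsed seconds for timers 1 and 2
timer list
```

Both commands summarize the same observations, so r(N) and r(mean) should agree; only the elapsed time reported by -timer list- should differ.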

Regards,
Joe Canner
Johns Hopkins School of Medicine


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of László Sándor
Sent: Sunday, August 25, 2013 11:55 AM
To: [email protected]
Subject: st: indicator variables from -by-

Hi,
I have so many observations that even the byte tempvars of
-marksample- might make me run out of memory.

But -by- must be efficient in this respect: if you -bys- over many groups (e.g. households), you never run out of memory from a new touse tempvar being created for each group.

Thus I don't understand why this wrapper for -sum, meanonly- (written just to collect saved results that would otherwise be lost) runs out of copious amounts of memory (bying over 20 groups), while -bys: sum, meanonly- is still much, much faster than any tabbing or tabstating or statsbying or Mata alternative. What does -by- handle differently in the latter that it cannot do with the former?

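program define mymns, byable(recall, noheader)
 syntax [varlist] [if] [in]
 * touse marks the sample; under -by- it is restricted to the current group
 marksample touse
 summarize `varlist' if `touse', meanonly
 * append this group's mean as a new row of matrix A
 matrix A = nullmat(A) \ r(mean)
end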
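
One possible way to collect the group means without any touse variable is to sort once and then address each group by its range of observation numbers with -in-. The following is only a sketch under stated assumptions: the data are simulated, and the names grp and x, the group count, and the seed are hypothetical. It is not a description of what -by- does internally.

```stata
* Simulated stand-in data (assumption: 20 groups of equal size)
clear
set obs 200000
set seed 2013
generate byte grp = 1 + mod(_n, 20)
generate x = runiform()
sort grp

capture matrix drop A
quietly levelsof grp, local(levels)

* After sorting, each group occupies one contiguous block of observations,
* and -levelsof- returns the levels in the same ascending order
local start = 1
foreach g of local levels {
    * group size, obtained without creating a new variable
    quietly count if grp == `g'
    local end = `start' + r(N) - 1
    * summarize only this group's block of observations
    summarize x in `start'/`end', meanonly
    matrix A = nullmat(A) \ r(mean)
    local start = `end' + 1
}
matrix list A
```

The -count- step still scans the data once per group, so this trades some speed for not allocating any extra variable; whether that trade is worthwhile would need to be checked on the real data.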

Thanks,

Laszlo
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


