Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Joe Canner <jcanner1@jhmi.edu> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | RE: st: RE: indicator variables from -by- |
Date | Mon, 26 Aug 2013 20:03:45 +0000 |
I'm not real familiar with -byable-, but there is some interesting information on it in the PDF documentation (p.pdf, page 8). In particular, there are built-in functions _byn1() and _byn2() which return the first and last observation number of the current by-group. Thus, it is up to the -byable- program to make use of this information for efficiency purposes. Otherwise, if you use `touse' indicators you are stuck with using -if- to identify by-group members.

So, presumably your wrapper could look something like this:

prog mymns, byable(recall, noheader)
    syntax [varlist] [if] [in]
    * select the current by-group by observation range rather than by an -if- test
    sum `varlist' in `=_byn1()'/`=_byn2()', mean
    * stack this group's mean onto matrix A as a new row
    mat A=nullmat(A)\r(mean)
end

Keep in mind, however, that if the program is called with -if- or -in-, the program will still have to deal with that as well using -marksample-. So, if you want the wrapper program to be as efficient as possible, it may be better to prohibit using -if- and -in-, or else have the program deal with those calls separately.

Regards,
Joe

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of László Sándor
Sent: Monday, August 26, 2013 2:45 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: indicator variables from -by-

Yes, this is true, but bysort'ing my (Austin's) ado wrapper for the (built-in) summarize to save the result should do the same thing. Or you mean there are no `touse' indicators involved? If built-in commands do by differently, then perhaps yes. But the -byable- documentation suggests ado files do use `touse' indicators. Maybe not a new one for each category but one and then use clever in'ing? Probably.

All the more so, then: this cannot justify the order of magnitude slowdown and running out of 220 GB free memory…

On Mon, Aug 26, 2013 at 1:08 PM, Joe Canner <jcanner1@jhmi.edu> wrote:
> Laszlo,
>
> My guess is that -bys- takes good advantage of the sorting.
> In fact, you are not allowed to run -by- without -sort-, probably because doing so would ruin the optimization.
>
> To illustrate, try the following:
>
> gen obs=_n
> sum AGE if inrange(obs,1000000,2000000)
>
> and
>
> sum AGE in 1000000/2000000
>
> In my test (with a dataset of almost 8 million observations), the
> former (not including -gen-) took 20x longer than the latter.
> Similarly, the -bys- code presumably accesses all observations in a
> particular level of the by variable more-or-less by observation
> number, rather than by -if- testing. (I think Nick Cox alluded to this
> a while back.)
>
> Regards,
> Joe Canner
> Johns Hopkins School of Medicine
>
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of László Sándor
> Sent: Sunday, August 25, 2013 11:55 AM
> To: statalist@hsphsun2.harvard.edu
> Subject: st: indicator variables from -by-
>
> Hi,
> I have so many observations that even the byte tempvars of
> -marksample- might make me run out of memory.
>
> But -by- must be inefficient in this, as if you -bys- over many groups (e.g. households), you never run out of memory because a new touse tempvar was created for each group.
>
> Thus I don't understand why this wrapper for -sum, meanonly- (just to collect saved results lost otherwise) runs out of copious amounts of memory (bying over 20 groups) while the -bys: sum, meanonly- is still much, much faster than any tabbing or tabstating or statsbying or Mata alternative. What does -by- handle differently about the latter that it cannot do with the former?
>
> prog mymns, byable(recall, noheader)
> syntax [varlist] [if] [in]
> marksample touse
> sum `varlist' if `touse', mean
> mat A=nullmat(A)\r(mean)
> end
>
> Thanks,
>
> Laszlo
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/