Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: RE: indicator variables from -by-

From	Nick Cox <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: RE: indicator variables from -by-
Date	Tue, 27 Aug 2013 19:50:02 +0100

Yes and no. StataCorp over 20+ years have devoted enormous efforts to
speeding up code! But they add new functionality too. The question is:
Where to strike the balance? It's hardly the case that StataCorp are
indifferent to speed, but those pesky users keep asking for new
modelling commands.

Actually, I suspect Joe agrees.

I am reminded obliquely of a fraught Library Committee meeting several
years ago, in which a rather irritated librarian snapped at the
academics: "We could do a really good job of organising the library
but people keep coming in and borrowing books!" (Scary thing was that
she seemed to mean exactly what she said, that users were a nuisance.)
Nick
[email protected]


On 27 August 2013 19:42, Joe Canner <[email protected]> wrote:
> I presented a paper at the recent Stata Conference in New Orleans on optimizing Stata code for speed.  Based on the response from Stata, it appears that this hasn't been on their radar until recently.  Keep in mind that it is only relatively recently that memory was cheap enough that large Stata data sets could fit in memory and thus spawn issues of performance even for basic tasks.  Now that memory availability has caught up with these big datasets, Stata is becoming more concerned about the issue.  I suspect you will see more efforts in this direction in future versions.
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of László Sándor
> Sent: Tuesday, August 27, 2013 2:35 PM
> To: [email protected]
> Subject: Re: st: RE: indicator variables from -by-
>
> Thanks, Joe, this was very educational.
>
> I just wonder why StataCorp doesn't want to or cannot explain to third-party developers how to write byable ado files that exploit the speed of -in- versus -if-. The documentation clearly suggests using -marksample- in byable commands, which, as you seem to conclude, itself precludes some of the optimization otherwise available to the -bys- prefix.
>
> My original code was slow without any special if condition, only using the recommended marksample in my byable command.
>
> On Tue, Aug 27, 2013 at 9:43 AM, Joe Canner <[email protected]> wrote:
>> Laszlo,
>>
>> First, I don't think we can assume that the built-in -bysort- prefix uses -marksample- and `touse' macros.  No doubt they have achieved some efficiencies that user-written -byable- programs cannot.
>>
>> That said, it does appear that adding an -if- qualifier to a -bys:- command slows down performance.  Take the example I provided yesterday; if you add an equivalent -if- qualifier to each:
>>
>> . sum AGE if inrange(obs,1000000,2000000) & RACE==1
>> versus
>> . sum AGE in 1000000/2000000 if RACE==1
>>
>> The latter is still faster than the former but only by a factor of 6 or 7, rather than 20.   Consider also the following:
>>
>> . bys bins: sum AGE
>> versus
>> . bys bins: sum AGE if RACE==1
>>
>> For my 8 million record dataset the latter took 18 seconds and the former took 17 seconds, despite the fact that the latter involves a 65% subset of the population.
>>
>> So, it is clear that Stata has written -bys- to be optimized for the case where there are no qualifiers, presumably because it can take advantage of the sorting.
>>
>> Incidentally, I also tried the following comparison:
>> . bys bins RACE: sum AGE
>> versus
>> . bys bins: sum AGE if RACE==1
>>
>> The former took only 19.4 seconds, compared to 18.5 for the latter, despite producing six times as much output.  In other words, if you plan to run a command more than once with several different -if- qualifiers, looking at different levels of the same variable, you might as well put that variable in the -bys- varlist.
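
The last point can be tried on a small scale. This is only a sketch on the auto dataset shipped with Stata, with rep78 and foreign standing in for bins and RACE; it shows the two ways of writing the call and makes no claim about timings on large data.

```stata
* Sketch only: rep78 and foreign are stand-ins for bins and RACE.
sysuse auto, clear

* One -bys- call per level of foreign, each with its own -if- qualifier
bysort rep78: summarize price if foreign == 0
bysort rep78: summarize price if foreign == 1

* A single call with foreign added to the -bys- varlist instead
bysort rep78 foreign: summarize price
```

The single call reports every rep78 by foreign cell in one pass over the sorted data, which is the pattern suggested above for repeated -if- qualifiers on the same variable.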
>>
>> Regards,
>> Joe
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of László
>> Sándor
>> Sent: Monday, August 26, 2013 5:37 PM
>> To: [email protected]
>> Subject: Re: st: RE: indicator variables from -by-
>>
>> Thanks, Joe.
>>
>> I understand the concern, but it is hard to imagine that any byable command if's over all groups because the in-trick cannot be implemented. Again, this would be infeasible for by'ing over many groups.
>>
>> I suspect that something else might be the key because even the documentation mentions when introducing the new _by functions and macros that:
>>
>> So let's consider the problems one at a time, beginning with the second problem. Your program does not use marksample, and we will assume that your program has good reason for not doing so, because the easy fix would be to use marksample. Still, your program must somehow be determining which observations to use, and we will assume that you are creating a `touse' temporary variable containing 0 if the observation is to be omitted from the analysis and 1 if it is to be used. Somewhere, early in your program, you are setting the `touse' variable.
>>
>> But of course, if I have `touse', the whole dummy-generation problem
>> comes back, plus it is not easy to use _byn1() _byn2()…
>>
>> On Mon, Aug 26, 2013 at 4:03 PM, Joe Canner <[email protected]> wrote:
>>> I'm not real familiar with -byable-, but there is some interesting information on it in the PDF documentation (p.pdf, page 8).  In particular, there are built-in functions _byn1() and _byn2() which return the first and last observation number of the current by-group.  Thus, it is up to the -byable- program to make use of this information for efficiency purposes.  Otherwise, if you use `touse' indicators you are stuck with using -if- to identify by-group members.
>>>
>>> So, presumably your wrapper could look something like this:
>>>
>>> prog mymns, byable(recall, noheader)
>>>     syntax [varlist] [if] [in]
>>>     sum `varlist' in `=_byn1()'/`=_byn2()', mean
>>>     mat A=nullmat(A)\r(mean)
>>> end
>>>
>>> Keep in mind however, that if the program is called with -if- or -in-, the program will still have to deal with that as well using -marksample-.  So, if you want the wrapper program to be as efficient as possible, it may be better to prohibit using -if- and -in-, or else have the program deal with those calls separately.
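
One way to "deal with those calls separately" might look like the sketch below. The program name mymns2 is hypothetical, and the sketch assumes that an unqualified -by- call leaves `if' and `in' empty and that _byn1() and _byn2() bracket the current by-group. It has not been timed on large data.

```stata
* Sketch only: mymns2 is a hypothetical name, not an existing command.
capture program drop mymns2
program define mymns2, byable(recall, noheader)
    syntax varname [if] [in]
    if `"`if'`in'"' == "" {
        // No qualifiers: address the by-group by observation range
        summarize `varlist' in `=_byn1()'/`=_byn2()', meanonly
    }
    else {
        // Qualifiers present: fall back on the documented -marksample- route
        marksample touse
        summarize `varlist' if `touse', meanonly
    }
    // Stack each group's mean as a new row of matrix A
    matrix A = nullmat(A) \ r(mean)
end

* Usage on the shipped auto data
sysuse auto, clear
capture matrix drop A
bysort rep78: mymns2 price
matrix list A
```

Matrix A has to be dropped before each run, because the program appends to whatever A already holds.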
>>>
>>> Regards,
>>> Joe
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of László
>>> Sándor
>>> Sent: Monday, August 26, 2013 2:45 PM
>>> To: [email protected]
>>> Subject: Re: st: RE: indicator variables from -by-
>>>
>>> Yes, this is true, but bysort'ing my (Austin's) ado wrapper for the
>>> (built-in) summarize to save the result should do the same thing. Or you mean there are no `touse' indicators involved? If built-in commands do by differently, then perhaps yes. But the -byable- documentation suggests ado files do use `touse' indicators. Maybe not a new one for each category but one and then use clever in'ing?
>>> Probably.
>>>
>>> All the more so, then: this cannot justify the order of magnitude
>>> slowdown and running out of 220 GB free memory…
>>>
>>> On Mon, Aug 26, 2013 at 1:08 PM, Joe Canner <[email protected]> wrote:
>>>> Laszlo,
>>>>
>>>> My guess is that -bys- takes good advantage of the sorting.  In fact, you are not allowed to run -by- without -sort-, probably because doing so would ruin the optimization.
>>>>
>>>> To illustrate, try the following:
>>>>
>>>> gen obs=_n
>>>> sum AGE if inrange(obs,1000000,2000000)
>>>>
>>>> and
>>>>
>>>> sum AGE in 1000000/2000000
>>>>
>>>> In my test (with a dataset of almost 8 million observations), the
>>>> former (not including -gen-) took 20x longer than the latter.
>>>> Similarly, the -bys- code presumably accesses all observations in a
>>>> particular level of the by variable more-or-less by observation
>>>> number, rather than by -if- testing. (I think Nick Cox alluded to
>>>> this a while back.)
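
The comparison above can be reproduced roughly as follows. This is a sketch on simulated data: the values of AGE, the seed, and the dataset size are assumptions, and the ratio between the two timings will depend on the machine and the Stata version.

```stata
* Sketch only: simulated data standing in for the 8 million record dataset.
clear
set obs 8000000
set seed 12345
generate long obs = _n
generate byte AGE = floor(18 + 60*runiform())   // made-up ages

timer clear
timer on 1
quietly summarize AGE if inrange(obs, 1000000, 2000000)   // -if- tests every observation
timer off 1

timer on 2
quietly summarize AGE in 1000000/2000000                  // -in- goes straight to the range
timer off 2

timer list
```

Timer 1 holds the -if- version and timer 2 the -in- version. Both commands summarize the same observations, so only the elapsed times should differ.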
>>>>
>>>> Regards,
>>>> Joe Canner
>>>> Johns Hopkins School of Medicine
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:[email protected]] On Behalf Of László
>>>> Sándor
>>>> Sent: Sunday, August 25, 2013 11:55 AM
>>>> To: [email protected]
>>>> Subject: st: indicator variables from -by-
>>>>
>>>> Hi,
>>>> I have so many observations that even the byte tempvars of
>>>> -marksample- might make me run out of memory.
>>>>
>>>> But -by- must be efficient about this: if you -bys- over many groups (e.g. households), you never run out of memory from a new touse tempvar being created for each group.
>>>>
>>>> Thus I don't understand why this wrapper for -sum, meanonly- (written just to collect saved results that are otherwise lost) runs out of copious amounts of memory (bying over 20 groups), while -bys: sum, meanonly- is still much, much faster than any tabbing or tabstating or statsbying or Mata alternative. What does -by- handle differently about the latter that it cannot do with the former?
>>>>
>>>> prog mymns, byable(recall, noheader)
>>>>     syntax [varlist] [if] [in]
>>>>     marksample touse
>>>>     sum `varlist' if `touse', mean
>>>>     mat A=nullmat(A)\r(mean)
>>>> end
>>>>
>>>> Thanks,
>>>>
>>>> Laszlo
>>>> *
>>>> *   For searches and help try:
>>>> *   http://www.stata.com/help.cgi?search
>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> *   http://www.ats.ucla.edu/stat/stata/
