Re: st: xt: unit-specific trends


From   László Sándor <sandorl@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: xt: unit-specific trends
Date   Thu, 19 Apr 2012 18:35:14 -0400

Thank you, Bill.

Of course it's great to have the correct results from Stata!

I am just a bit surprised that the "if" checks slow down operations
this much. Esp. by-loops. And esp. because -by:- wants to start sorted
anyway, I thought you could be less permissive later on (e.g. maintain
sort order). I would have guessed that the extra cost of not allowing
re-sorting would have justified a dramatic speedup of the -by- which
is pretty commonly used.

But these are exactly the sorts of trade-offs that you are experts in.
I will not second-guess your judgment.

Perhaps the biggest lesson to me was how costly an if-check is in
large datasets. Pretty frightening.

But most basic operations (sorting or checks) must be hitting some
theoretical limits; that's what we can squeeze out of computers.

Thanks!

Laszlo

On Thu, Apr 19, 2012 at 2:36 PM, William Gould, StataCorp LP
<wgould@stata.com> wrote:
>
> Laszlo <sandorl@gmail.com> wrote,
>
> > I used "if `touse'" because that is the official way to make a program
> > byable (http://www.stata.com/help.cgi?byable). If there is any case
> > where the -if- condition need not be checked for the entire dataset, a
> > -by: - run is that, isn't it?
>
> Laszlo is wrong in assuming that the data are necessarily sorted, and
> thus -if `touse'- is the official way to program this case.
>
> The problem for -by- is that it is turning control over to a
> user-written program, and it is not uncommon for user-written programs
> to re-sort the data and then not put them back into the original
> order.  So -by- was written to accommodate that.
>
> If you as a programmer know that the data will still be sorted
> you can convert the -if `touse'- into an -in- range by coding,
>
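>        tempvar x
>        * _n where `touse' is true, missing elsewhere, so that
>        * -summarize- ignores the excluded observations
>        quietly gen long `x' = _n if `touse'
>        quietly sum `x', meanonly
>        local first = r(min)
>        local last  = r(max)
>        drop `x'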
>
> In the rest of your code you can then code -in `first'/`last'- instead
> of -if `touse'-.
>
> There may be a quicker way to convert an -if `touse'- into an -in- range.
> This is just the first way that occurred to me.
>
> I would still be hesitant to use -in- range instead of -if `touse'-
> because I would need to be certain that every command I used in my
> ado-file did not change the sort order.
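
One way to hedge against that risk is to verify that the marked observations really do form a single contiguous block before trusting an -in- range. The sketch below is only an illustration: -touse_range- is a hypothetical helper that does not appear in this thread, and the variable touse is a stand-in for the marker that -marksample- would create inside a byable program.

```stata
* Hypothetical helper (not from this thread): returns the first and last
* observation numbers flagged by a 0/1 marker variable, and stops with an
* error if the flagged observations are not one contiguous block.
capture program drop touse_range
program define touse_range, rclass
        syntax varname
        tempvar obs
        * observation number where the marker is true, missing elsewhere
        quietly gen long `obs' = _n if `varlist'
        quietly summarize `obs', meanonly
        if r(N) == 0 {
                display as error "no observations marked"
                exit 2000
        }
        local first = r(min)
        local last  = r(max)
        * contiguous only if the marked count fills the whole range
        if r(N) != `last' - `first' + 1 {
                display as error "marked observations are not contiguous"
                exit 459
        }
        return scalar first = `first'
        return scalar last  = `last'
end

sysuse auto, clear
sort rep78
* stand-in for the marker variable of a by-group
gen byte touse = rep78 == 2
touse_range touse
local first = r(first)
local last  = r(last)
summarize mpg in `first'/`last'
```

The check costs one more pass over the data, so it gives back part of the speed that the -in- range was meant to buy. It only guards against a range that has already gone wrong; it does not keep later commands from re-sorting.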
>
> Here's a demonstration of a by-able program that re-sorts the data
> and yet still produces the expected results because it is coded using
> -if `touse'-:
>
>        . program tryit, byable(recall)
>          1.         di "hi"
>          2.         syntax
>          3.         marksample touse
>          4.         list rep78 if `touse'
>          5.         sort mpg
>          6. end
>
>        . sysuse auto, clear
>        (1978 Automobile Data)
>
>        . sort rep78
>
>        . by rep78: tryit
>
>        --------------------------------------
>        -> rep78 = 1
>        hi
>
>             +-------+
>             | rep78 |
>             |-------|
>          1. |     1 |
>          2. |     1 |
>             +-------+
>
>        --------------------------------------
>        -> rep78 = 2
>        hi
>
>             +-------+
>             | rep78 |
>             |-------|
>          3. |     2 |
>         14. |     2 |
>         15. |     2 |
>         22. |     2 |
>         24. |     2 |
>             |-------|
>         45. |     2 |
>         52. |     2 |
>         53. |     2 |
>             +-------+
>
>        <remaining output omitted>
>
>        . _
>
> When -tryit- was called the first time to process rep78==1, the data
> were in order, and we see that, as expected, the observations for
> which rep78 is 1 are at the top of the dataset, namely in observations
> 1 and 2.  Now look at the -tryit- code.  -tryit-, just before exiting,
> re-sorts the data!
>
> So, the second time -tryit- is called, when -tryit- is called to
> process the rep78 = 2 data, the observations will not be in order.
> And we can see that in the listing.  The listing was produced by
> coding -list rep78 if `touse'- and, just as one would hope, all the
> observations for which `touse' contains 1 are rep78==2 observations.
> This time, however, the data are no longer in order.  The observations
> for which `touse' is 1 are observations 3, 14, 15, 22, 24, 45, 52, and
> 53.  It didn't matter, however, because we coded -if `touse'-.
>
> -by- plus -tryit- still produced correct results.
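
To see why an -in- range would have gone wrong here, one can count how many observations the range from the first to the last marked observation would sweep up once the data have been re-sorted. This is a rough sketch on the same auto data; the variables touse and obs are illustrative stand-ins, not part of the -tryit- program.

```stata
sysuse auto, clear
sort rep78
* stand-in for the marker -by- would set for the rep78 == 2 group
gen byte touse = rep78 == 2
* the re-sort that -tryit- performs just before exiting
sort mpg

* locate the first and last marked observations after the re-sort
gen long obs = _n if touse
quietly summarize obs, meanonly
local n     = r(N)
local first = r(min)
local last  = r(max)

* how many observations would an -in- range over that span include?
quietly count in `first'/`last'
local span = r(N)
display "marked: `n'   included by in `first'/`last': `span'"
```

Any command run with that -in- range would pick up observations from other rep78 groups, whereas -if touse- still selects exactly the marked ones.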
>
> Our thinking when we coded -by- and made the recommendation of using
> -if `touse'- was that sometimes it is better to produce correct
> results than to produce incorrect results more quickly.
>
> -- Bill
> wgould@stata.com
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
