

From: "William Gould, StataCorp LP" <wgould@stata.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: xt: unit-specific trends
Date: Fri, 20 Apr 2012 11:30:44 -0500

Laszlo <sandorl@gmail.com> wrote,

> I am just a bit surprised that the "if" checks slow down operations
> this much. Esp. by-loops. [...]
> But exactly these are the sorts of trade-offs that you are experts in.

I would like to show Laszlo and the many others who I suspect would
express the same sentiment that they should not be surprised.

Let's imagine that we want to perform operations on 20 observations of a
200,000-observation dataset, the 20 observations selected by -if-.
Let's analyze execution time.

As a first approximation, let's assume the time necessary to perform a
linear operation on a set of observations is

        T = t_f + t_o*N

By a linear operation, I mean an operation whose execution time is linear
in the number of observations. -generate- and -replace- are examples of
linear operations. -sort- is an example of a non-linear operation.

In the above formula, t_f is the time to parse the user's input and set
up the problem, which is to say, t_f is small. t_o is the time to perform
the operation on a single observation, which is to say, t_o is small,
too. Obviously, different operations require different amounts of time,
but this is an approximation, so let's just assume t_o is the same across
operations. We'll speculate later about the effects of the assumption on
our results.

We are going to compare the total time it takes to operate on 20
observations in a 20-observation dataset,

        T_0 = t_f + 20*t_o

and the time it takes to operate on 20 observations in a
200,000-observation dataset, such as a -generate- statement with an
additional -if-. Because the -if- expression must be evaluated on every
observation in the dataset, not just the 20 that satisfy it, the total
time for that would be

        T_1 = t_f + 20*t_o + 200,000*t_o

For small problems, it is approximately the case that t_f = t_o -- the
time to parse and set up the problem is about equal to the time to
process a single observation. In that case, the equations can be
rewritten as

        T_0 = (20+1)*t_o

        T_1 = (20+1)*t_o + 200,000*t_o

The ratio of T_1 to T_0 is then

        T_1     (20+1)*t_o + 200,000*t_o
        ---  =  ------------------------
        T_0           (20+1)*t_o

             =  1 + 200,000/(20+1)

             =  9,525 (approximately)

Many of you -- perhaps Laszlo among them -- think that we "experts" at
StataCorp can achieve results "mere" users cannot. Sometimes, however,
being an expert is about knowing when to give up. At StataCorp, we make
calculations like the above and then check run times, and that's one way
we determine which problems deserve more work.

In the above calculation, we assumed all operations take roughly the same
time. In particular, in

        . generate x = <exp1> if <exp2>

we assumed that <exp1> takes the same amount of time as <exp2>. Clearly,
an <exp2> such as -if `touse'- is a lightweight. The ratio above might be
better written by distinguishing between the execution times for <exp1>
and <exp2>:

        T_1     (20+1)*t_exp1 + 200,000*t_exp2
        ---  =  ------------------------------
        T_0            (20+1)*t_exp1

             =  1 + 200,000*t_exp2/(21*t_exp1)

Actually, the ratio t_exp2/t_exp1 is probably much closer to 1 than you
expect, at least in interpretive languages like ado. Nonetheless, if it
pleases you, substitute 1/2 for the ratio and get approximately
T_1/T_0 = 4,763.

By the way, t_exp1 might be approximately equal to t_exp2 in interpretive
languages, but in compiled languages like Mata, they can be whoppingly
different. Had we been analyzing run times in a compiled language and you
were bothered by the assumption that t_exp1 == t_exp2, you would have
been right.
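If you want to verify the arithmetic yourself, a minimal timing sketch
along the following lines will do it (the variable names and repetition
count are illustrative, not from the calculation above). It repeats a
-replace- restricted with -in 1/20-, which touches only the 20
observations, and the same -replace- restricted with -if-, which must
evaluate its expression on all 200,000 observations:

        clear
        set obs 200000
        generate double y = runiform()
        generate byte touse = _n <= 20    // marks the 20 observations

        timer clear
        timer on 1
        forvalues i = 1/1000 {            // -in- touches 20 obs per pass
                quietly replace y = y + 1 in 1/20
        }
        timer off 1
        timer on 2
        forvalues i = 1/1000 {            // -if- scans all 200,000 obs
                quietly replace y = y + 1 if touse
        }
        timer off 2
        timer list

The gap between timers 1 and 2 is essentially the 200,000*t_o term in
the formula above.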
Laszlo also wrote,

> I would have guessed that the extra cost of not allowing re-sorting
> would have justified a dramatic speedup of the -by- which is pretty
> commonly used.

The choice we made on this particular issue is something about which
reasonable people can disagree. Let me outline our thinking in general.

When we make such decisions, our view of ado-files is that ease of
programming and likelihood of correctness trump performance in most
cases. I am not saying that ado-files perform poorly or that it is pure
luck that they don't. We work to make them perform well, but when there
is a trade-off between speed of execution and ease of programming (which
includes likelihood of correctness), we usually make the decision in
favor of ease of programming.

Simultaneously, we provide a second programming language, Mata, in which
the trade-off is reversed.

That does not mean Mata is better than ado. We at StataCorp write lots of
ado code. We choose the language according to the problem. In some
problems, there is little speed difference between Mata and ado because
of the nature of the problem, so we choose ado. In other problems, there
is a difference, but the speed really doesn't matter; we choose ado. In
still other problems, there is a difference in speed that does matter,
and we choose Mata.

There's one more case in which we choose Mata, which is when the problem
is complex and the organizational aspects of Mata, such as structures and
classes, make it easy for us to write readable code, meaning the code
will require less debugging and will be more modifiable in the future.

-- Bill
wgould@stata.com
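As a concrete sketch of that trade-off, assume a dataset with variables
y and id (names illustrative). Computing group means of y takes one line
of ado:

        . egen double mean_y = mean(y), by(id)

The Mata version is longer and requires more care (here, that the data
are sorted by id before -panelsetup()- is called), but it is compiled
and can be extended to more complex per-group work:

        sort id
        mata:
        y    = st_data(., "y")
        id   = st_data(., "id")
        info = panelsetup(id, 1)          // one row per id group
        means = J(rows(info), 1, .)
        for (i = 1; i <= rows(info); i++) {
                means[i] = mean(panelsubmatrix(y, i, info))
        }
        end

The -egen- line is easier to write and harder to get wrong; the Mata
version earns its keep only when the speed difference matters.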
