Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at

Re: st: xt: unit-specific trends

From   László Sándor <>
Subject   Re: st: xt: unit-specific trends
Date   Tue, 24 Apr 2012 10:45:41 -0400

Getting back to this: I must thank Bill for his explanation, clear as always.

Yet I want to point out what I learnt from this: All of us (a)do-file
authors should be careful with by-loops. When we use this device to
loop over a few values, there is no problem. Yet if we use it for some
panel-like setting, it can be "treacherous." If there is no way out of
this but Mata, at least we should be aware that commands like
-egen- should be high on our priority list to rewrite in Mata.
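
To make the concern concrete, here is a rough sketch in Python (not Stata code) that only counts condition checks. It assumes, in line with Bill's explanation, that every command restricted by -if- evaluates the condition on all N observations, and that the loop issues one such command per group. The function names are made up for illustration.

```python
def if_checks_looping(n_obs: int, n_groups: int) -> int:
    """Condition checks when each group gets its own -if- restricted command.

    Assumption: each command evaluates the condition on every observation.
    """
    return n_groups * n_obs


def if_checks_single_pass(n_obs: int) -> int:
    """Condition checks when one pass over the data handles all groups."""
    return n_obs


# A few values to loop over: the cost is a small multiple of N.
few = if_checks_looping(n_obs=200_000, n_groups=5)

# Panel-like setting: 10,000 units of 20 observations each.
panel = if_checks_looping(n_obs=200_000, n_groups=10_000)

print(few, panel, if_checks_single_pass(200_000))
```

When the number of groups grows in proportion to N, as it does in a panel, the count grows with N squared under these assumptions.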

In my experience, people use -egen- to generate (many-many) variables
in a panel, or "worse", leave-out means and the like. There the loops are
definitely on the order of N, which might be a high price in large


On Fri, Apr 20, 2012 at 12:30 PM, William Gould, StataCorp LP
<> wrote:
> Laszlo wrote,
> > I am just a bit surprised that the "if" checks slow down operations
> > this much. Esp. by-loops.  [...]
> > But exactly these are the sorts of trade-offs that you are experts in.
> I would like to show Laszlo and the many others who I suspect would
> express the same sentiment that they should not be surprised.
> Let's imagine that we want to perform operations on 20 observations
> of a 200,000 observation dataset, the 20 observations selected by
> -if-.
> Let's analyze execution time.
> As a first approximation, let's assume the time necessary to perform
> a linear operation on a set of observations is
>          T = t_f + t_o*N
> By a linear operation, I mean an operation whose execution time is
> linear in the number of observations.  -generate- and -replace- are
> examples of linear operations.  -sort- is an example of a non-linear
> operation.
> In the above formula, t_f is the time to parse the user's input and
> set up the problem, which is to say, t_f is small.  t_o is the time to
> perform the operation on a single observation, which is to say, t_o is
> small, too.  Obviously different operations require different amounts
> of time, but this is an approximation, so let's just assume t_o is the
> same across operations.  We'll speculate later about the effects of
> the assumption on our results.
> We are going to compare the total time it takes to operate on 20
> observations in a 20-observation dataset,
>          T_0 = t_f + 20*t_o
> and the time it takes to operate on 20 observations in a
> 200,000-observation dataset, such as a -generate- statement with an
> additional -if-.  The total time for that would be
>          T_1 = t_f + 20*t_o + 200,000*t_o
> For small datasets, it is approximately the case that t_f = t_o*N --
> the time to parse and set up the problem is about equal to performing
> the work of the problem itself.  In that case, the equations can be
> rewritten as
>          T_0 = (20+1)*t_o
>          T_1 = (20+1)*t_o + 200,000*t_o
> The ratio of T_1 to T_0 is then
>          T_1      (20+1)*(t_o) + 200,000*t_o
>         -----  =  --------------------------
>          T_0           (20+1)*t_o
>                =   1 + 200,000/(20+1)
>                =   (approximately) 9,525
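
The arithmetic above can be checked with a short Python sketch. It follows the equations as written, with the fixed cost t_f counted as one t_o, and measures everything in units of t_o. The function name and its arguments are illustrative only.

```python
def slowdown_ratio(n_selected: int, n_total: int) -> float:
    """T_1 / T_0 under the linear cost model, in units of t_o.

    T_0 = (n_selected + 1)            small dataset, no -if-
    T_1 = (n_selected + 1) + n_total  large dataset, -if- checked on every row
    The "+ 1" is the fixed cost t_f, taken as one t_o as in the equations.
    """
    t0 = n_selected + 1
    t1 = t0 + n_total
    return t1 / t0


ratio = slowdown_ratio(n_selected=20, n_total=200_000)
print(round(ratio))  # 9525
```
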
> Many of you -- perhaps Laszlo among them -- think that we "experts" at
> StataCorp can achieve results "mere" users cannot.  Sometimes,
> however, being an expert is about knowing when to give up.  At
> StataCorp, we make calculations like the above and then check run
> times, and that's one way that we determine which problems deserve
> more work.
> In the above calculation, we assumed all operations take roughly the
> same time.  In particular, in
>        . generate x = <exp1>  if  <exp2>
> we assumed that <exp1> takes the same amount of time as <exp2>.
> Clearly an <exp2> such as -if `touse'- is lightweight.  The ratio
> above might be better written by distinguishing between the execution
> times for <exp1> and <exp2>:
>          T_1      (20+1)*(t_exp1) + 200,000*t_exp2
>         -----  =  --------------------------------
>          T_0                 (20+1)*t_exp1
>                = 1 + 200,000*(t_exp2)/(21*t_exp1)
> Actually, the ratio of t_exp2/t_exp1 is probably much closer to 1
> than you expect, at least in interpretive languages like ado.
> Nonetheless, if it pleases you, substitute 1/2 for the ratio and get
> approximately T_1/T_0 = 4763.
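
The sensitivity to the relative cost of the two expressions can be sketched in Python as well. The parameter cost_ratio stands for t_exp2/t_exp1; the value 0.1 in the loop is an arbitrary extra case added for illustration, not a figure from the discussion.

```python
def slowdown_ratio(n_selected: int, n_total: int, cost_ratio: float = 1.0) -> float:
    """T_1 / T_0 when <exp2> costs cost_ratio times as much as <exp1>.

    Implements 1 + n_total * (t_exp2 / t_exp1) / (n_selected + 1),
    with the fixed cost again counted as one evaluation of <exp1>.
    """
    return 1 + n_total * cost_ratio / (n_selected + 1)


# 1.0 and 0.5 are the cases from the text; 0.1 is a hypothetical extra case.
for r in (1.0, 0.5, 0.1):
    print(r, round(slowdown_ratio(20, 200_000, r)))
```

Even if the -if- expression were ten times cheaper than the main expression, the ratio would stay in the hundreds under this model.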
> By the way, t_exp1 might be approximately equal to t_exp2 in
> interpretive languages, but in compiled languages like Mata,
> they can be whoppingly different.   Had we been analyzing
> run times in compiled languages and you were bothered by the
> assumption that t_exp1 == t_exp2, you would have been right.
> Laszlo also wrote,
> > I would have guessed that the extra cost of not allowing re-sorting
> > would have justified a dramatic speedup of the -by- which is pretty
> > commonly used.
> The choice we made in this particular issue is something about which
> reasonable people can disagree.  Let me outline our thinking in general.
> When we make such decisions, our view of ado-files is that
> ease-of-programming and likelihood-of-correctness trump performance
> in most cases.  I am not saying that ado-files perform poorly or that
> it is pure luck that they don't.  We work to make them perform well,
> but when there is a tradeoff between speed of execution and ease of
> programming (which includes likelihood of correctness), we usually make
> the decision in favor of ease of programming.
> Simultaneously, we provide a second programming language, Mata,
> in which the trade-off is reversed.
> That does not mean Mata is better than ado.  We at StataCorp write
> lots of ado code.  We choose the language according to the problem.  In
> some problems, there is little speed difference between Mata and ado
> because of the nature of the problem, so we choose ado.  In other
> problems, there is a difference, but the speed really doesn't matter.
> We choose ado.  In still other problems, there is a difference in speed
> that does matter, and we choose Mata.  There's one more case in which
> we choose Mata, which is when the problem is complex and the
> organizational aspects of Mata such as structures and classes make it
> easy for us to write readable code, meaning the code will require
> less debugging, and meaning the code will be more modifiable in the
> future.
> -- Bill