Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: xt: unit-specific trends
"William Gould, StataCorp LP" <firstname.lastname@example.org>
Re: st: xt: unit-specific trends
Fri, 20 Apr 2012 11:30:44 -0500
Laszlo email@example.com wrote,
> I am just a bit surprised that the "if" checks slow down operations
> this much. Esp. by-loops. [...]
> But exactly these are the sorts of trade-offs that you are experts in.
I would like to show Lazlo and the many others who I suspect would
express the same sentiment that they should not be surprised.
Let's imagine that we want to perform operations on 20 observations
of a 200,000 obseration dataset, the 20 observations selected by
Let's analyze execution time.
As a first approximation, let's assume the time necessary to perform
a linear operation on a set of observations is
T = t_f + t_o*N
By a linear operation, I mean an operation whose execution time is
linear in the number of observations. -generate- and -replace- are
examples of linear operations. -sort- is an example of a non-linear
In the above formula, t_f is the time to parse the user's input and
set up the problem, which is to say, t_f is small. t_o is the time to
perform the operation on a single observation, which is to say, t_o is
small, too. Obviously different operations require different amounts
of time, but this is an approximaton, so let's just assume t_o is the
same across operations. We'll speculate later about the effects of of
the assumption on our results.
We are going to compare the total time it takes to operate on 20
observations in a 20-observation dataset,
T_0 = t_f + 20*t_o
and the time it takes to operate on 20 observations on a
200,000-obseration dataset, such as a -gemnerate- statement with an
additional -if-. The total time for tht would be
T_1 = t_f + 20*t_o + 200,000*t_o
For small datasets, it is approximately the case that t_f = t_o*N --
the time to parse and setup the problem is about equal to performing
the work of the problem itself. In that caes, the equations can be
T_0 = (20+1)*t_o
T_1 = (20+1)*t_o + 200,000*t_o
The ratio of T_1 to T_0 is then
T_1 (20+1)*(t_o) + 200,000*t_o
----- = --------------------------
= 1 + 200,000/(20+1)
= (approximately) 9,525
Many of you -- perhaps Lazlo among them -- think that we "experts" at
StataCorp can achieve results "mere" users cannot. Sometimes,
however, being an expert is about knowing when to give up. At
StataCorp, we make calculations like the agove and then check run
times, and that's one way that we determine which problems deserve
In the above calculaton, we assumed all operations take roughly the
same time. In particular, in
. generate x = <exp1> if <exp2>
we assumed that <exp1> takes the same amount of time as <exp2>.
Clearly an <exp2> such as -if `touse'- is a light-weight. The ratio
above might be better written by distinguishing between the execution
times for <exp1> and <exp2>:
T_1 (20+1)*(t_exp1) + 200,000*t_exp2
----- = --------------------------------
= 1 + 200,000*(t_exp2)/(21*t_exp1)
Actually, the ratio of t_exp2/t_exp1 is probably much closer than 1
than you expect, at least in interpretive languages like ado.
Nontheless, if it pleases you, substitute 1/2 for the ratio and get
approximately T_1/T_0 = 4763.
By the way, t_exp1 might be approximately equal to t_exp2 in
interpretive languages, but in compiled languages like Mata,
the can be whoppingly different. Had we been analyzing
run times in compiled languages and you were bothered by the
assumption tht t_exp1 == t_exp2, you would have been right.
Lazlo also wrote,
> I would have guessed that the extra cost of not allowing re-sorting
> would have justified a dramatic speedup of the -by- which is pretty
> commonly used.
Thi choice we made in this particular issue is something about which
reasonably people can disagree. Let me outline our thinking in general.
When we make such decisions, our view of ado-files is that
ease-of-programming and likelihood-of-correctness trumps performance
in most cases. I am not saying that ado-files perform poorly or that
it is pure luck that they don't. We work to make them perform well,
but when there is a tradeoff between speed of execution and ease of
programming (which includes likelhood of correctness), we usually make
the decision in favor of of ease of programming.
Simultaneously, we provide a second programming language, Mata,
in which the trade-off is reversed.
That does not mean Mata is better than ado. We at StataCorp write
lots of ado code. We choose the language according the problem. In
some problems, there is little speed difference between Mata and ado
because of the nature of the problem, so we choose ado. In other
problems, there is a difference, but the speed really doesn't matter.
We choose ado. In still other problems, the is a difference is speed,
that does matter, and we choose Mata. There's one more case in which
we choose Mata, which is when the problem is complex and the
organizational aspects of Mata such as structures and classes makes it
is easy for us to write readable code, meaning the code will require
less debugging, and meaning the code will be more modifiable in the
* For searches and help try: