Statalist



RE: st: RE: A query about sorting.


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: RE: A query about sorting.
Date   Wed, 27 Aug 2008 12:05:47 +0100

Reading this, I got a mental picture of Sergiy taking some time out to
watch some Olympic events and muttering from his seat, "That's not fast.
You could go a lot faster if you used C." 

He's right. 

I agree that 

1. The official command -reshape- is interpreted Stata code and can be a
lot slower than would be the equivalent code in C. Also, it can require
a lot of extra memory. 

2. -rowsort- was written by me for Stata 8 and -- as its help indicates
-- can be very slow at times because it loops over observations. It's a
pity that I was not able to use Mata before it was released but no
matter, Jeff Arnold did that with -sortrows- for Stata 9, which should
be much faster. -rowsort-, like many other things, was offered via
Statalist as a practical solution to someone's problem. It was not
offered as an exercise in some introductory computer science course in
which marks are deducted for poor performance or not using the fastest
possible language or algorithm, so it is no surprise that it scores
poorly on such criteria. Anyone able to identify a better solution
should go straight there. 

3. -rowsort- and -sortrows- are not intended for this problem in which
Sergiy has 

motherid child_id1 child_id2 child_id3 age1 age2 age3 gender1 gender2 gender3
.....
and might want to sort children by age keeping their ordering stable
and moving their (string) ids and (numeric) gender dummies in sync.

They are indeed useless at that problem. But see -rowranks- from SSC.
That's not intended, quite, for that problem either, but the problem is
perfectly soluble using basic Stata. 
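
A minimal sketch of one such basic-Stata route, assuming the wide layout
shown above with motherid uniquely identifying observations. The variable
names child and neworder are invented here for illustration, and this is
only one possible approach, not necessarily the fastest:

```stata
* Sketch only: assumes variables motherid, child_id1-child_id3,
* age1-age3, gender1-gender3 as in the layout above.
* child and neworder are hypothetical names.
reshape long child_id age gender, i(motherid) j(child)

* Sort children by age within mother. Ties keep their original order
* because child is the second sort key. Missing ages sort last.
bysort motherid (age child): gen neworder = _n

* Replace the old position index with the age-based one and go back to wide.
drop child
reshape wide child_id age gender, i(motherid) j(neworder)
```

The ids and gender dummies travel with each child's age because all three
are reshaped together. On a large dataset the two -reshape- calls are
likely to be the slow part, which is the cost noted in point 1.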

Sergiy's underlying implication appears to be that Stata solutions can
be much slower than customised solutions based on C programming. I
agree. Who wouldn't? But the message isn't of much practical use to
those who are not C programmers, do not intend to become C programmers,
or do not have access to C programmers to do the work for them. Even for
those inclined to program in C, there is still the question of how much
person time is needed to write those programs. If something takes 2
minutes' machine time, and in principle a faster solution is possible in
2 seconds, then forget it, unless I am going to need that program often
enough to make the writing time, which will be a lot longer than 1 minute
58 seconds, a good expenditure of a scarce resource. Of course, I made up
this example. And there are many problems at which Stata is
frustratingly slow and C-based solutions may be essential. 

More simply, if Sergiy or those like-minded can write something faster
than -reshape- (or -rowsort- or -sortrows- or whatever), that is as easy
or easier to use, then do show us. I'd use it very happily. 

I advised Ashim "My main thought is that you should never have to write
your own sort programs, bubble sort or other, in Stata" and I still think
that advice to be much nearer right than wrong. Note that -rowsort- and
-sortrows- are, contrary to appearances, not exceptions. Both are wrappers for
Stata's (or Mata's) own sorting routines. Writing your own sorting
programs from scratch, as Ashim was doing, might be fun but the
interpretive overhead is likely to be severe and the outcome very slow.
(His latest post appears to confirm this.) My guess is that you would
have to do it in C to outperform Mata, but issues would remain about how
fast whatever else you were doing was going to be. 

Nick 
n.j.cox@durham.ac.uk 

Sergiy Radyakin

I have probably missed the first part of this thread, but is this what
needs to be done?



