[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
[no subject]

What exactly causes the difference in ascending/descending sorting performance?

It seems that to sort the data from descending order to ascending
order 2 seconds is enough.

If sorting is implemented efficiently, than knowing that the data is
sorted in ascending order may not slow Stata down, and it should take
about 2 seconds to reverse the order again. But it takes 25!

To simply ensure that the data is in descending order (repeated gsort
-x) one again needs 25 seconds. Why??? It is MUCH MORE than it takes
to actually reverse the order and confirm that the data is sorted!!!!

Am I comparing apples and oranges here somewhere?

In any case the difference in performance seems to be quite stable and
not trivial. And I really like sorting of my data in 2 seconds more
than same sorting in 25 seconds.

[D] sort is silent about the sorting algorithms implemented by Stata,
but it is likely to be "quicksort" or "heapsort", though presence of
the option -stable- suggests that there is also another sorting
algorithm ("mergesort" perhaps?).

In case it is indeed "quicksort" does recursion pose any serious
problem in terms of memory requirements? Since even in case of a
minimalistic implementation it still requires THETA(log n)  space
during sorting under most favorable conditions up to THETA(n) under
the unfavorable conditions.



So if my observations are correct, then:

1. It is not Stata that is aware of the data order, but some of the
Stata commands. Some of the Stata commands pretend to be unaware of
data being sorted, even when this information is known.

2. Stata recognizes the data as sorted only if it is sorted in
ascending order. Descending order is not recognized. This is
documented. Inability of Stata to use this information, when it is
aware of it, is somewhat puzzling. Was this simply overlooked by the
developers, or is there a major obstacle that I don't see?

3. The difference between sorting in ascending or descending order is
not only the difference in that Stata will later recognize one dataset
as sorted (ascending) and the other one not (descending), but also a
huge difference in execution time.


Any comments are welcomed.


Thank you,
   Sergiy Radyakin







. set mem 900m

Current memory allocation
// ------ cut ---- standard memory report not important here ---
cut-----------------------


. set obs 10000000
obs was 0, now 10000000

. gen x=uniform()

. set rmsg on
r; t=0.00 20:57:12

. sum x

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |  10000000    .5001136    .2886926   1.28e-07   .9999999
r; t=1.02 20:57:15


// ---- INITIAL SORT - DATA WAS IN RANDOM ORDER
. sort x
r; t=38.87 20:58:09

// --- REPEATED SORT - DATA IS IN OPPOSITE ORDER
. gsort -x
r; t=25.01 20:58:48

// --- REPEATED SORT - DATA IS IN OPPOSITE ORDER
. sort x
r; t=2.05 20:58:58

// ---- TESTING INTELLIGENCE ----------------------------------------------

. gsort -x
r; t=24.98 22:49:02

. gsort -x
r; t=25.67 22:49:31

. gsort x
r; t=2.09 22:49:47

. gsort x
r; t=1.39 22:49:53
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Prev by Date: st: Sorting data puzzles
Next by Date: st: panel tobit model with a binary endogenous regressor
Previous by thread: st: Sorting data puzzles
Next by thread: st: panel tobit model with a binary endogenous regressor
Index(es):
- Date
- Thread