[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

What exactly causes the difference in ascending/descending sorting performance? It seems that to sort the data from descending order to ascending order 2 seconds is enough. If sorting is implemented efficiently, than knowing that the data is sorted in ascending order may not slow Stata down, and it should take about 2 seconds to reverse the order again. But it takes 25! To simply ensure that the data is in descending order (repeated gsort -x) one again needs 25 seconds. Why??? It is MUCH MORE than it takes to actually reverse the order and confirm that the data is sorted!!!! Am I comparing apples and oranges here somewhere? In any case the difference in performance seems to be quite stable and not trivial. And I really like sorting of my data in 2 seconds more than same sorting in 25 seconds. [D] sort is silent about the sorting algorithms implemented by Stata, but it is likely to be "quicksort" or "heapsort", though presence of the option -stable- suggests that there is also another sorting algorithm ("mergesort" perhaps?). In case it is indeed "quicksort" does recursion pose any serious problem in terms of memory requirements? Since even in case of a minimalistic implementation it still requires THETA(log n) space during sorting under most favorable conditions up to THETA(n) under the unfavorable conditions. So if my observations are correct, then: 1. It is not Stata that is aware of the data order, but some of the Stata commands. Some of the Stata commands pretend to be unaware of data being sorted, even when this information is known. 2. Stata recognizes the data as sorted only if it is sorted in ascending order. Descending order is not recognized. This is documented. Inability of Stata to use this information, when it is aware of it, is somewhat puzzling. Was this simply overlooked by the developers, or is there a major obstacle that I don't see? 3. The difference between sorting in ascending or descending order is not only the difference in that Stata will later recognize one dataset as sorted (ascending) and the other one not (descending), but also a huge difference in execution time. Any comments are welcomed. Thank you, Sergiy Radyakin . set mem 900m Current memory allocation // ------ cut ---- standard memory report not important here --- cut----------------------- . set obs 10000000 obs was 0, now 10000000 . gen x=uniform() . set rmsg on r; t=0.00 20:57:12 . sum x Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- x | 10000000 .5001136 .2886926 1.28e-07 .9999999 r; t=1.02 20:57:15 // ---- INITIAL SORT - DATA WAS IN RANDOM ORDER . sort x r; t=38.87 20:58:09 // --- REPEATED SORT - DATA IS IN OPPOSITE ORDER . gsort -x r; t=25.01 20:58:48 // --- REPEATED SORT - DATA IS IN OPPOSITE ORDER . sort x r; t=2.05 20:58:58 // ---- TESTING INTELLIGENCE ---------------------------------------------- . gsort -x r; t=24.98 22:49:02 . gsort -x r; t=25.67 22:49:31 . gsort x r; t=2.09 22:49:47 . gsort x r; t=1.39 22:49:53 * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**st: Sorting data puzzles** - Next by Date:
**st: panel tobit model with a binary endogenous regressor** - Previous by thread:
**st: Sorting data puzzles** - Next by thread:
**st: panel tobit model with a binary endogenous regressor** - Index(es):

© Copyright 1996–2024 StataCorp LLC | Terms of use | Privacy | Contact us | What's new | Site index |