Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: RE: A query about sorting.


From   "Ashim Kapoor" <ashimkapoor@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: RE: A query about sorting.
Date   Wed, 27 Aug 2008 11:33:55 +0530

Sergey,

Thank you for your reply. My OWN simple code takes 40 minutes to run
on historical date ( around 10 years ) which sortrows takes minutes.
Is'nt that amazing ?

Thank you,
Ashim.

On Wed, Aug 27, 2008 at 8:33 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
> Hello All,
>
> I have probably missed the first part of this thread, but is this what
> needs to be done?
>
> From this:
> x1   x2   x3
> 2   3   1
> 3   5   4
> 9   7   9
>
> Get this:
> x1   x2   x3
> 1   2   3
> 3   4   5
> 7   9   9
>
> For all I know, there is no linear complexity O(n) sorting
> algorithm(http://en.wikipedia.org/wiki/Sorting_algorithm)
>
> Here you can see how different algorithms actually perform:
> http://linux.wku.edu/~lamonml/algor/sort/sort.html
>
> For N observations and K variables, Nick suggested reshaping data to
> get NxK observations, sort them (perhaps on a complex key)  and
> reshape it back. I would doubt that this strategy improves
> performance, because of the above considerations of nonlinearity of
> sorting with respect to N:
>
> "sorting K values N times" is much simplier than "sorting NxK values".
> Just how much easier - it depends. When K is rather small, the gain
> can be quite big. Consider e.g. K=2 :) -- a simple conditional swap --
> linear! (number of swaps in the worst case = N, number of comparisons:
> also N).
>
> Reshape itself works quite slow. Like really slow. Because it is
> implemented as an .ado program. A commentary by the author of rowsort
> notes that "It may be faster to reshape, sort within blocks, and
> reshape again". So I take that as a sign of even worse performance.
>
> Another issue to consider is memory requirements. How much memory does
> reshape require? (we are all so concerned about fitting our data into
> 2GB/1GB/700MB/... limit, but often forget that once it is there,
> working with it requires memory too!) In my experience reshape
> requires at least two times as much:
>
> . memory
>                                                  bytes
> --------------------------------------------------------------------
> Details of set memory usage
>    overhead (pointers)                          14,800       14.45%
>    data                                         37,000       36.14%
>                                        ----------------------------
>    data + overhead                              51,800       50.59%
>    free                                         50,592       49.41%
>                                        ----------------------------
>    Total allocated                             102,392      100.00%
> --------------------------------------------------------------------
> Other memory usage
>    set maxvar usage                          2,041,738
>    set matsize usage                         1,315,200
>    programs, saved results, etc.                45,523
>                                        ---------------
>    Total                                     3,402,461
> -------------------------------------------------------
> Grand total                                   3,504,853
>
> . reshape long r p, i(id)
> (note: j = 1 2 3 4 5 6)
> no room to add more observations
> r(901)
>
> (sorry for mentioning this taboo topic on public, have you ever
> estimated how much memory does an XYZ command require? Even Stata
> people guesstimate here).
>
> So Ashim, don't get discouraged and write your own code. It might work
> better for your particular case.
>
> One last advice is that sometimes the problem can be resolved by a
> radically different solution. E.g. if your variables are, say age of
> children for a particular mother in a [e.g. DHS] survey, you might
> find a dataset already in the long-format and start working with it.
> The dataset description will let you know what is the ordering there.
> In particular, data may be ordered with extra (not available to you)
> information which may be lost during sorting.
>
> E.g.: the codebook may say that children are ordered by age: 6, 8, 8,
> 12. You might not know whether 2nd and 3rd children are twins, or were
> born in the same year, but the data providers might very well do!
> Changing their places (remember that Stata does not guarantee sort
> stability for observations, unless special option is used, and I am
> not sure how this works for sorting rows) will change their birth
> ranks - quite an important characteristic, if we believe a large body
> of literature on the topic.
>
> Hope this helps.
>
> Anyways, I don't immediately see how -sortrows- or -rowssort- will
> deal with this situatuion:
>
> motherid   child_id1 child_id2 child_id3 age1 age2 age3 gender1 gender2 gender3
> .....
> and I might want to sort children by age keeping their ordering stable
> and moving their (string) ids and (numeric) gender dummies in sync.
>
> Regards, Sergiy Radyakin
>
>
>
>
>
>
>
> On 8/26/08, Nick Cox <n.j.cox@durham.ac.uk> wrote:
>> This generated various suggestions.
>>
>> My main thought is that you should never have to write your own sort
>> programs, bubble sort or other, in Stata.
>>
>> Even sorting rows can make use of Stata's own sort functionality -- as
>> in a combination of -reshape- and -sort-, as mentioned by Maarten Buis
>> -- or as in -rowsort or -sortrows-, as mentioned in part by Martin
>> Weiss.
>>
>> -rowsort- is on SSC and was written for Stata 8.
>>
>> -sortrows- is also on SSC and was written for Stata 9. It makes use of
>> Mata and in large part supersedes -rowsort-, so long as you have Stata 9
>> at least.
>>
>> Nick
>> n.j.cox@durham.ac.uk
>>
>> Ashim Kapoor
>>
>> I have a simple code to write.
>>
>> I have 6 variables v1, v2 , v3, v4 ,v5 , v6.
>>
>> I want to sort "according to each observation". That is , in the end
>> for each i I want , v1[i] < = v2[i] <=... v6[i].
>>
>> So I wrote a simple bubble for each row and then loop from 1 to _N.
>>
>> Now the problem is that this works but it's VERY VERY slow. Any
>> comments anyone ?
>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index