[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

[no subject]

Get this:
x1   x2   x3
1   2   3
3   4   5
7   9   9

For all I know, there is no linear complexity O(n) sorting

Here you can see how different algorithms actually perform:

For N observations and K variables, Nick suggested reshaping data to
get NxK observations, sort them (perhaps on a complex key)  and
reshape it back. I would doubt that this strategy improves
performance, because of the above considerations of nonlinearity of
sorting with respect to N:

"sorting K values N times" is much simplier than "sorting NxK values".
Just how much easier - it depends. When K is rather small, the gain
can be quite big. Consider e.g. K=2 :) -- a simple conditional swap --
linear! (number of swaps in the worst case = N, number of comparisons:
also N).

Reshape itself works quite slow. Like really slow. Because it is
implemented as an .ado program. A commentary by the author of rowsort
notes that "It may be faster to reshape, sort within blocks, and
reshape again". So I take that as a sign of even worse performance.

Another issue to consider is memory requirements. How much memory does
reshape require? (we are all so concerned about fitting our data into
2GB/1GB/700MB/... limit, but often forget that once it is there,
working with it requires memory too!) In my experience reshape
requires at least two times as much:

. memory
Details of set memory usage
    overhead (pointers)                          14,800       14.45%
    data                                         37,000       36.14%
    data + overhead                              51,800       50.59%
    free                                         50,592       49.41%
    Total allocated                             102,392      100.00%
Other memory usage
    set maxvar usage                          2,041,738
    set matsize usage                         1,315,200
    programs, saved results, etc.                45,523
    Total                                     3,402,461
Grand total                                   3,504,853

. reshape long r p, i(id)
(note: j = 1 2 3 4 5 6)
no room to add more observations

(sorry for mentioning this taboo topic on public, have you ever
estimated how much memory does an XYZ command require? Even Stata
people guesstimate here).

So Ashim, don't get discouraged and write your own code. It might work
better for your particular case.

One last advice is that sometimes the problem can be resolved by a
radically different solution. E.g. if your variables are, say age of
children for a particular mother in a [e.g. DHS] survey, you might
find a dataset already in the long-format and start working with it.
The dataset description will let you know what is the ordering there.
In particular, data may be ordered with extra (not available to you)
information which may be lost during sorting.

E.g.: the codebook may say that children are ordered by age: 6, 8, 8,
12. You might not know whether 2nd and 3rd children are twins, or were
born in the same year, but the data providers might very well do!
Changing their places (remember that Stata does not guarantee sort
stability for observations, unless special option is used, and I am
not sure how this works for sorting rows) will change their birth
ranks - quite an important characteristic, if we believe a large body
of literature on the topic.

Hope this helps.

Anyways, I don't immediately see how -sortrows- or -rowssort- will
deal with this situatuion:

motherid   child_id1 child_id2 child_id3 age1 age2 age3 gender1 gender2 gender3
and I might want to sort children by age keeping their ordering stable
and moving their (string) ids and (numeric) gender dummies in sync.

Regards, Sergiy Radyakin

On 8/26/08, Nick Cox <[email protected]> wrote:
> This generated various suggestions.
> My main thought is that you should never have to write your own sort
> programs, bubble sort or other, in Stata.
> Even sorting rows can make use of Stata's own sort functionality -- as
> in a combination of -reshape- and -sort-, as mentioned by Maarten Buis
> -- or as in -rowsort or -sortrows-, as mentioned in part by Martin
> Weiss.
> -rowsort- is on SSC and was written for Stata 8.
> -sortrows- is also on SSC and was written for Stata 9. It makes use of
> Mata and in large part supersedes -rowsort-, so long as you have Stata 9
> at least.
> Nick
> [email protected]
> Ashim Kapoor
> I have a simple code to write.
> I have 6 variables v1, v2 , v3, v4 ,v5 , v6.
> I want to sort "according to each observation". That is , in the end
> for each i I want , v1[i] < = v2[i] <=... v6[i].
> So I wrote a simple bubble for each row and then loop from 1 to _N.
> Now the problem is that this works but it's VERY VERY slow. Any
> comments anyone ?
> *
> *   For searches and help try:
> *
> *
> *
*   For searches and help try:

© Copyright 1996–2024 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index