
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: st: RE: Stata treatment of sort order


From   Andrew Maurer <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: st: RE: Stata treatment of sort order
Date   Thu, 6 Mar 2014 21:00:40 +0000

Rich,
Thanks for this reference. This is interesting, since I don't see how Stata could sort a dataset "instantly" when the "`: sortedby'" flag is not set. Wouldn't a sort on an already-sorted dataset take at least O(n)? (i.e., doesn't the program need to loop once and verify that x[i] <= x[i+1] for i from 1 to _N-1?)
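
The single pass described in that parenthesis can be sketched in Stata itself. This only illustrates the O(n) check; it is not a description of how Stata's sort is implemented, and the toy data and the variable name x are assumptions made for the example.

```stata
* Illustrative only: toy data, not Stata's internal sort logic
clear
set obs 10
* x is nondecreasing by construction: 1 1 1 2 2 2 3 3 3 4
gen x = ceil(_n/3)

* One pass over adjacent pairs: count observations where x[i] > x[i+1]
count if x > x[_n+1] & _n < _N
display cond(r(N) == 0, "already sorted", "not sorted")
```

The condition `_n < _N` skips the last observation, which has no successor to compare against.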

Sarah,
Thanks for the response. However, here's an example of an unsorted list, with repeated values of the sort variable, where the final sort order is always the same after --sort x--. This seems to contradict the documentation's assertion that "the ordering of observations with equal values of varlist is randomized". Perhaps "sometimes randomized" would be more appropriate.

****** Begin code ******
clear all
set obs 10
gen id = _n
gen x = 1 in 1/9
replace x = 0 in 10
sort x
****** End code ********

Output is always:
id	x
10	0
9	1
8	1
7	1
6	1
5	1
4	1
3	1
2	1
1	1
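
For comparison, here is a sketch of the same example with the stable option, which the help file documents as keeping observations with equal values of varlist in their previous relative order. Under that documented behavior, id 10 should come first, followed by ids 1 through 9 in ascending order, rather than the descending order shown above. The two assert lines are an addition for illustration; they stop with an error if the order differs.

```stata
clear all
set obs 10
gen id = _n
gen x = 1 in 1/9
replace x = 0 in 10
* stable keeps tied observations in their previous relative order
sort x, stable
list id x, noobs clean
* check the documented result: id 10 first, then ids 1 to 9 ascending
assert id == 10 in 1
assert id == _n - 1 in 2/10
```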


Andrew Maurer 

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Richard Goldstein
Sent: Thursday, March 06, 2014 2:20 PM
To: [email protected]
Subject: Re: st: RE: Stata treatment of sort order

Actually, the manual specifically deals with this: "Stata may be
dumb, but it is also fast. It sorts already-sorted datasets instantly,
so Stata's ignorance costs us little." p. 603

Rich

On 3/6/14, 3:14 PM, Sarah Edgington wrote:
> Andrew,
> In the example in your second question you're asking Stata to sort the data
> on a variable on which it is already sorted.  In that case I would not
> expect Stata to change the ordering of the data at all, with or without the
> stable option.  Even though you're pasting in new data (so Stata has no
> knowledge of the existing sort order) I would expect that the sorting
> algorithm would do some checking of whether the data was already in the
> order you requested.  Since it is already sorted in that order, I wouldn't
> expect the data to be changed.  Admittedly that's just a guess since I don't
> have any information on how Stata implements sorting, but it would explain
> the behavior.  
> 
> However, you can see that if the data is NOT already sorted on the variable
> of interest, the sort order does change over multiple sorts.  For
> example, using the auto data, try to -sort price- then -sort foreign-.  If
> you do this multiple times you'll note that the ordering is different after
> -sort foreign-.
> 
> -Sarah
> 
> 
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Andrew Maurer
> Sent: Thursday, March 06, 2014 11:36 AM
> To: [email protected]
> Subject: st: Stata treatment of sort order
> 
> Hi Statalist,
> 
> I'm wondering if anyone can help explain some details about Stata and
> sorting.
> 
> First, where does Stata hold information about current sort order? That is, the
> extended macro function --`: sortedby'-- returns the current sort order.
> However, looking at --char dir-- and --macro dir-- I don't see the
> information there. In particular, I want to overwrite the value, so that
> --`: sortedby'-- will return the value that I insert. One use might be if I
> -infile-, and I already know the sort order of the data, but don't want to
> have to run sort just to populate `: sortedby'. (In --help dta--, I see
> where it's stored in a physical dta file [<sortlist>sortlist</sortlist>],
> but it doesn't explain where it is put in memory.)
> 
> Second, the help file for sort seems somewhat misleading. --help sort--
> explains, "Without the stable option, the ordering of observations with
> equal values of varlist is randomized." What does "randomized" here mean? I
> interpret it to mean that each residual observation has an equal probability
> of being in any of the slots specified by the sort list (e.g., that --sort
> var1-- is equivalent to --gen rand = runiform()--, --sort var1 rand--, --drop
> rand--). However, residual sort order doesn't always appear random. For
> example, if I --sysuse auto--, --sort foreign--, then copy the data to
> clipboard, --clear--, then use data editor to paste the data back, and
> finally --sort foreign--, the ordering is always the same as the original
> ordering (i.e., the ordering of observations with equal values of varlist was
> /not/ randomized).
> 
> Is anyone able to explain these observations?
> 
> Thank you,
> 
> Andrew Maurer 
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/




