Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: RE: Stata treatment of sort order

From	Joe Canner <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: st: RE: Stata treatment of sort order
Date	Fri, 7 Mar 2014 15:07:24 +0000

Andrew,

I was intrigued by your response to Rich, so I did a little benchmark of my own using a data set with 8M observations and 155 variables (4.2 GB total memory).

Based on my results, I think "instantly" is a slight exaggeration.  It takes about 1.5 seconds to sort this data set when it is already sorted, compared to 15-20 seconds to sort it when it is not already sorted.  This seems to correspond to O(n) for already sorted and O(nlogn) for unsorted.  Of course, for small or medium sized data sets O(n) probably appears to be "instantly".

What was even more interesting, though, is that it took about 1.5 seconds to sort an already sorted data set, regardless of the -sortedby- setting.  I did two trials, one in which I sorted twice in a row on the same variable, and one in which I modified some values of the sort variable (in a way that did not disturb the sort order) between the two sorts (as per the example in the Stata manual).  In the second trial the -sortedby- macro gets set to missing (i.e., Stata assumes it is unsorted), but it takes the same amount of time to sort as when the -sortedby- macro is correct.  In other words, it appears that Stata makes a single pass through the data set to make sure it is sorted regardless of whether the -sortedby- macro is set or not.  This would suggest that the -sortedby- macro is just for informational purposes and not for Stata's internal use.

I did not exhaustively test this however, so I could be mistaken...

Regards,
Joe Canner
Johns Hopkins University School of Medicine

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Andrew Maurer
Sent: Thursday, March 06, 2014 4:01 PM
To: [email protected]
Subject: RE: st: RE: Stata treatment of sort order

Rich,
Thanks for this reference. This is interesting, since I don't know how Stata could sort datasets without the "`: sortedby'" flag "instantly". Wouldn't the sort on an already sorted set take at least O(n)? (ie: doesn't the program need to loop once and verify that x[i] <= x[i+1] for i from 1 to _N-1?)

Sarah,
Thanks for the response. However, here's an example of an unsorted list, with repeated values of the sort variable, where the final sort order is always the same after --sort x--. This seems like it contradicts the documentation's assertion that, "the ordering of observations with equal values of varlist is randomized". Perhaps "sometimes randomized" would be more appropriate.

****** Begin code ******
clear all
set obs 10
gen id = _n
gen x = 1 in 1/9
replace x = 0 in 10
sort x
****** End code ********

Output is always:
id	x
10	0
9	1
8	1
7	1
6	1
5	1
4	1
3	1
2	1
1	1

Andrew Maurer 

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Richard Goldstein
Sent: Thursday, March 06, 2014 2:20 PM
To: [email protected]
Subject: Re: st: RE: Stata treatment of sort order

actually, the manual specifically deals with this: "Stata may be
dumb, but it is also fast. It sorts already-sorted datasets instantly,
so Stata's ignorance costs us little." p. 603

Rich

On 3/6/14, 3:14 PM, Sarah Edgington wrote:
> Andrew,
> In the example in your second question you're asking Stata to sort the data
> on a variable on which it is already sorted.  In that case I would not
> expect Stata to change the ordering of the data at all, with or without the
> stable option.  Even though you're pasting in new data (so Stata has no
> knowledge of the existing sort order) I would expect that the sorting
> algorithm would do some checking of whether the data was already in the
> order you requested.  Since it is already sorted in that order, I wouldn't
> expect the data to be changed.  Admittedly that's just a guess since I don't
> have any information on how Stata implements sorting, but it would explain
> the behavior.  
> 
> However, you can see that if the data is NOT already sorted on the variable
> of interest that the sort order does change over multiple sorts.  For
> example, using the auto data, try to -sort price- then -sort foreign-.  If
> you do this multiple times you'll note that the ordering is different after
> -sort foreign-.
> 
> -Sarah
> 
> 
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Andrew Maurer
> Sent: Thursday, March 06, 2014 11:36 AM
> To: [email protected]
> Subject: st: Stata treatment of sort order
> 
> Hi Statalist,
> 
> I'm wondering if anyone can help explain some details about Stata and
> sorting
> 
> First, where does Stata hold information about current sort order? Ie, the
> extended macro function --`: sortedby'-- returns the current sort order.
> However, looking at --char dir-- and --macro dir-- I don't see the
> information there. In particular, I want to overwrite the value, so that
> --`: sortedby'-- will return the value that I insert. One use might be if I
> -infile-, and I already know the sort order of the data, but don't want to
> have to run sort just to populate `: sortedby'. (In --help dta--, I see
> where it's stored in a physical dta file [<sortlist>sortlist</sortlist>],
> but it doesn't explain where it is put in memory.
> 
> Second, the help file for sort seems somewhat misleading. --help sort--
> explains, "Without the stable option, the ordering of observations with
> equal values of varlist is randomized." What does "randomized" here mean? I
> interpret it to mean that each residual observation has an equal probability
> of being in any of the slots specified by the sort list (eg that --sort
> var1-- is equivalent to --gen rand = runiform()-- --sort var1 rand-- --drop
> rand-- However, residual sort order doesn't always appear random. For
> example, if I --sysuse auto--, --sort foreign--, then copy the data to
> clipboard, --clear--, then use data editor to paste the data back, and
> finally --sort foreign--, the ordering is always the same as the original
> ordering (ie: the ordering of observations with equal values of varlist was
> /not/ randomized.
> 
> Is anyone able to explain these observations?
> 
> Thank you,
> 
> Andrew Maurer 
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: Stata treatment of sort order
  - From: Andrew Maurer <[email protected]>
- st: RE: Stata treatment of sort order
  - From: "Sarah Edgington" <[email protected]>
- Re: st: RE: Stata treatment of sort order
  - From: Richard Goldstein <[email protected]>
- RE: st: RE: Stata treatment of sort order
  - From: Andrew Maurer <[email protected]>

Prev by Date: Re: st: cluster analysis: differences between clusters
Next by Date: st: solution for paste function not working
Previous by thread: RE: st: RE: Stata treatment of sort order
Next by thread: st: ANOVA
Index(es):
- Date
- Thread