Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


st: bug in Stata's sorted-by flag


From   Sergiy Radyakin <[email protected]>
To   "[email protected]" <[email protected]>
Subject   st: bug in Stata's sorted-by flag
Date   Wed, 14 Aug 2013 20:50:31 -0400

Dear All,

it seems that under some conditions Stata 9.2-12.1 (Windows)
incorrectly reports that the dataset is sorted while in fact it is
not.

The following program demonstrates this:
do http://radyakin.org/statalist/2013081402/sortbug.do

The problem seems to be that Stata's built-in -set obs N- command does
not clear the sorted flag when it changes the data.
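
A minimal sketch of how this might be reproduced. It is not the
content of sortbug.do: the auto dataset and the string variable make
are assumptions chosen for illustration. New observations get an empty
string in make, and an empty string sorts before any non-empty one, so
a new row appended at the end cannot be in sorted position.
********************************************************************
* Sketch only: -sysuse auto- and the choice of make are assumptions
sysuse auto, clear
sort make
display `"`: sortedby'"'
* add one observation; its make is "" and it sits in the last row
set obs `=_N + 1'
* if the flag is not cleared, this still reports make
display `"`: sortedby'"'
* a correctly sorted dataset would have the empty string in row 1
list make in -3/l
********************************************************************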


Here are some thoughts:

This does have important implications. In particular the sorted state
is saved into a data file, and other (external) programs might rely on
it being correct. Stata itself might get confused in some cases, when
it inspects the sorted state, though I can't readily demonstrate it.

An example of such an inconsistent datafile produced by Stata is here
(in v12 format):
http://radyakin.org/statalist/2013081402/sortbug.dta
or here (in v9 format):
http://radyakin.org/statalist/2013081402/sortbug9.dta

A technical note in the following document:
http://www.stata.com/manuals13/dsort.pdf
explains that Stata is conservative: it treats any change to the
variables involved in the sort order as destroying the sort order.
This means that sometimes one has to forgo a bit of performance to
verify the sort order when it is not needed. And this is OK.
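
A small sketch of that conservative behavior, assuming the auto
dataset shipped with Stata. Adding a constant to price cannot change
the ordering, so if the technical note applies here, the second
-display- would be expected to show an empty list.
********************************************************************
* Sketch only: the dataset and variable are chosen for illustration
sysuse auto, clear
sort price
display `"`: sortedby'"'
* adding a constant cannot change the ordering of price
replace price = price + 1
display `"`: sortedby'"'
********************************************************************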

The converse is not good. Reporting a dataset as sorted when it is
not has serious implications, since (at least some) user-written
commands might rely on the reported sort order being credible.
Stata's own commands would probably also get confused. I expect (but
have not checked) the -merge- command to behave erratically in this
case, since I expect it relies on the saved sort order of the 'using'
datasets (secondary datasets).

The list of variables by which a dataset is sorted is returned by the
extended macro function sortedby, as in:
display `"`: sortedby'"'
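
A minimal sketch of how a program could cross-check the reported sort
order against the data instead of trusting the flag. It tests only the
first variable in the list, which is a necessary but not a sufficient
condition for the whole list. The local macro names are illustrative.
********************************************************************
* Sketch: checks only the first variable in the reported sort list
local sortvars : sortedby
display `"`sortvars'"'
local first : word 1 of `sortvars'
if "`first'" != "" {
    * every value must be >= the one in the preceding observation
    capture assert `first' >= `first'[_n-1] if _n > 1
    if _rc {
        display as error "reported as sorted by `sortvars', but `first' is out of order"
    }
}
********************************************************************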

I found this problem as a partial explanation of what is happening
with the sortpreserve option in my code, a discussion that started in
this thread:
http://www.stata.com/statalist/archive/2013-08/msg00563.html
and in which I am still interested. Even older discussions of
-sort-'s performance can be found in my "sorting data puzzles"
postings here:
http://www.stata.com/statalist/archive/2008-01/index.html#00810

Interestingly, you would think that Stata itself would then refuse to
re-sort a dataset it believes is already sorted. But no, it does
re-sort it, as can be seen here:
********************************************************************
use http://radyakin.org/statalist/2013081402/sortbug.dta
list
describe
sort price, stable
list
describe
display c(changed)
********************************************************************

Given the problem, I am surprised to see that -collapse- continues to
produce correct results; it seems to work even though the dataset is
not sorted.

Best, Sergiy Radyakin
