The Stata listserver

RE: st: RE: rank error?


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: RE: rank error?
Date   Thu, 29 May 2003 11:15:52 +0100

Jeph Herrin wrote:

> > -egen-
> > ======
> >
> > See [R] egen p.325:
> >
> > "The order of the groups is that of the sort order of varlist."
> > An example follows.
> >
>
> Point conceded, though reluctantly - sticking that into an Example
> (not preceding an example as you suggest) is a bit subtle; perhaps
> I'm naive to expect functionality to be fully specified in the
> Description?

Your first point hinges on the difference between an "example" and
a subsection title "Example".

More importantly, there is a very good case for including a sentence
on sort order in the description of -egen, group()-. I join
you in recommending this to Stata Corp.

> Yes, I know about the Stata principle, and that -egen- and
> -gen- respect it. But recall that my coding question/uncertainty
> was: is
>
> 	 by `varlist' [,sort] : gen var1 = exp
>
> "a command which is designed to -sort- the data"? Intuitively the
> answer is no, but I'm not embarrassed about being unsure whether a
> command which *requires* either pre-sorting or a [sort] option
> changes the sort order. For instance, why does -by- insist on the
> user supplying the sort? "Surely", one can't but think, because you
> can't do a -by- without changing the sort order; otherwise, -by-
> would just do a -sortpreserve-, do its job, and restore the
> original sort order. It's always seemed a bit of a Stata anomaly.

This is changing the original question, but this version raises
several interesting points. In case anyone is still following, and
to recap, the code commented on was

		sort `touse' `varlist'
		quietly by `touse' `varlist': /*
			*/ gen `type' `g'=1 if _n==1 & `touse'
		replace `g'=sum(`g')

This is -by:- done old-style and it remains a perfectly valid way of
doing things. It has always been the case that you cannot do things
-by:- without getting data into the appropriate -sort- order.  The
only issue is how you do this syntactically.
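
To make that requirement concrete, here is a minimal sketch. It
assumes the auto dataset shipped with Stata; the variable name
-first- is purely illustrative and is not part of the -egen- code.

```stata
sysuse auto, clear

* auto.dta ships sorted by foreign, so move away from that order first
sort price

* -by- refuses to run because the data are not sorted by foreign;
* -capture noisily- shows the error message but lets the do-file continue
capture noisily by foreign: generate byte first = (_n == 1)
display _rc     // a nonzero return code signals the failure

* old style: sort explicitly, then use -by:-
sort foreign
by foreign: generate byte first = (_n == 1)
```

The failed first attempt creates nothing, so the second -generate-
is free to use the same variable name.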

In Stata at present there are two short-cut ways of indicating the
same thing. If Stata Corp were writing -egen, group()- from scratch
today they could write either (1)

		bysort `touse' `varlist': ///
			gen `type' `g'=1 if _n==1 & `touse'
		replace `g'=sum(`g')

or (2)

		by `touse' `varlist', sort: ///
			gen `type' `g'=1 if _n==1 & `touse'
		replace `g'=sum(`g')

(a further nuance being that you can abbreviate -bysort- down to
-bys-).  In a tutorial, "How to move step by: step", Stata Journal
2(1): 86-102 (2002), I described the difference as a matter of
taste. In that tutorial and elsewhere, however, I always use
method (1), -bysort-: partly because it is shorter, but much more
because it makes it obvious that a -sort- is being done, and that
the sort is crucial, not incidental or accidental. I also always
write -bysort-, not -bys-. Again, this is a matter of taste, but
the little reminders that I am sorting the data and must think
about them in that light have served me well.

So, in short, my answer is:

1. -by- is not designed to -sort- the data as such.

2. But you need to -sort- the data to be able to use -by:-.

3. Because that need is so common, there is a convenient syntactical
short-cut that lets you sort and use -by:- in one step. It is not
compulsory. If anyone finds it too compressed or confusing, the
old-style method remains available (and is occasionally preferable
on other grounds anyway).
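
As a small check that the three spellings do the same job, here is a
sketch using the auto dataset shipped with Stata. The variable names
n1, n2 and n3 are made up for the illustration.

```stata
sysuse auto, clear

* (1) -bysort- prefix
sort price
bysort foreign: generate long n1 = _N

* (2) -by ..., sort:- prefix
sort price
by foreign, sort: generate long n2 = _N

* old style: explicit -sort-, then plain -by:-
sort price
sort foreign
by foreign: generate long n3 = _N

* all three hold the same within-group counts
assert n1 == n2 & n2 == n3
```

In each case the data end up sorted by foreign; only the syntax that
requests the sort differs.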

Historically, despite excellent algorithms, sorting could often be
expensive in time, which I imagine was one reason why Stata wanted
all -sort-s to be requested explicitly.

In addition, if users read in data in an informative order which was
implicit rather than explicit (i.e. _n after data entry was in
effect an identifier), then the first -sort- away from that order
would destroy information. Although it would be foolish to do that,
Stata wanted to protect users from that kind of error. If you type
-sort-, you should be aware of the consequences. Despite that, some
official Stata programs and several user-written Stata programs
(including some of my own) would change the sort order of the data
without making that obvious to the user.  The point is arguable, but
this is now widely regarded as bad style.  This is why Stata
introduced -sortpreserve- for Stata programs, to make it very easy
for users not to commit this style error.
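
For anyone who has not used -sortpreserve-, a minimal sketch follows.
The program name -countby- and its generate() option are hypothetical,
invented for this illustration; only the -sortpreserve- option of
-program define- is the point.

```stata
capture program drop countby
program define countby, sortpreserve
    * hypothetical helper: number of observations in each group of varlist
    syntax varlist, GENerate(name)
    bysort `varlist': generate long `generate' = _N
end

sysuse auto, clear
sort price
countby foreign, generate(nforeign)

* the program sorted by foreign internally, but on exit the data
* should be back in the order they had on entry (here, by price)
local sortvars : sortedby
display "`sortvars'"
```

Without -sortpreserve- on the -program define- line, the same program
would leave the data sorted by foreign, which is the style error
described above.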

If Stata Corp were introducing -by:- today, would they give it
"sortpreserve" behaviour? That is, would a statement

		by `touse' `varlist': ///
			gen `type' `g'=1 if _n==1 & `touse'

_automatically_ include a prior sort and a posterior return to the
previous sort order? That's an intriguing thought. In one way, that
would be closer to present thinking about how commands should
behave. But I doubt it. One main reason is that, in practice, it is
so common to follow such commands with other commands that depend on
the same sort order, and continually sorting back and forth would be
very inefficient. We need look no further than the present example

		bysort `touse' `varlist': ///
			gen `type' `g'=1 if _n==1 & `touse'
		replace `g'=sum(`g')

in which the cumulative -sum()- following depends crucially on the
sort order produced by the previous command.
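
A worked version of the same idea with ordinary variables may help.
This is a sketch, not the -egen- source: it assumes the auto dataset,
uses -touse-, -g- and -g2- as illustrative names, and adds a final
step that blanks out the excluded observations, which is not part of
the snippet quoted above.

```stata
sysuse auto, clear

* mark the observations to use (rep78 has some missing values)
generate byte touse = !missing(rep78)

* flag the first observation of each group among those marked
bysort touse rep78: generate long g = 1 if _n == 1 & touse

* the running sum turns the flags into group numbers 1, 2, 3, ...
* this works only because the data are still in the -bysort- order
replace g = sum(g)

* added here: excluded observations carry a sum of 0, so set them missing
replace g = . if !touse

* compare with the official function
egen g2 = group(rep78)
assert g == g2
```

Sorting on anything else between the -bysort- line and the -replace-
with -sum()- would scramble the numbering, which is exactly the
dependence on sort order described above.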

I should stress that this is just thinking aloud about how -by:-
would be implemented if it were introduced today. -by:- is there and
established.

Nick
[email protected]



