# Re: st: A note on -sort- order, especially panel data

 From: "Nisha Malhotra" | To: | Subject: Re: st: A note on -sort- order, especially panel data | Date: Thu, 12 Sep 2002 18:19:01 -0400

```
Thank you for the comments and all the help. Yes, I wanted the jump to come after the first action was taken.

Thanks once more
>>> n.j.cox@durham.ac.uk 09/12/02 11:11 AM >>>
Nisha Malhotra posted a panel data problem which
attracted a flurry of overlapping answers,
from which an acceptable solution should emerge,
once Nisha has sorted out whether the jump
should take place at or after the first
action and what is appropriate for the very
first value in a panel (for which previous
conditions are unknown, at least to Stata).

I want to expand on a point arising which is much
more general and can bite you (and you won't
always notice). Let's abstract to a structure of
panel identifier

id

and time variable

time

The problem is with code like this:

. sort id
. by id : gen <whatever>

which Stata 7 users can happily telescope to

. bysort id : gen <whatever>

The way this arises is that

(1) you want to do something separately
for each panel

and

(2) you know that Stata requires a prior
-sort- for that, so you oblige. (More
than courtesy here: it's the law.)

What's tricky is that the code often
should be

. sort id time
. by id : gen <whatever>

or the equivalent

. bysort id (time) : gen <whatever>

-- whenever, that is, you also want
observations within each panel to be in
time order. Even when correct within-panel order
is irrelevant to what you want, as when say you are
computing means, it rarely does any harm.

What underlies all this is the literal-mindedness
of Stata, which does what you say, not what
you mean. Given the instruction

. sort id

Stata will be satisfied with _any_ ordering
of observations for which -id- is sorted,
and there are usually lots of possibilities,
as some combinatorial calculations will confirm.
Stata does not care about any other point.
Indeed, having done what you want, it sits there
smirking.

Now it is often the case in practice that panel data
will come in order of -id- and then -time-,
or will be left that way after a previous command.
And, increasingly, it is a standard
that Stata commands should not
change the -sort- order of your
data unless you explicitly specify
that or it is among the purposes
of a command. So no harm may ensue.

But -- as said, and this is the crunch --
Stata makes absolutely no promises about
order of observations within each block defined
by -id- (or within any other varlist
given as argument to -sort-). So there
is a possibility that operations dependent
on within-panel order will give incorrect results.
In the problem here, operations based on
the -sum()- function are a case in point.

With panel data there is another and
in many ways a better approach. -tsset-
your data and use time-series operators.
Then given some initial

. tsset id time

any later

. tsset

will automatically return panel data
to the correct sort order, so that

. by id: ...

is then guaranteed to work on the
correct within-panel order. In
addition, Stata will refuse to
do calculations based on operators
such as L. unless data are in the
correct sort order, providing for
you a safety catch. Conversely, for
operators like L. you don't
need to specify separate
calculations within panels:
that is done automatically
given a -tsset- to panel data.

-sum()-, however, has nothing
to do with time series as such. It
long predates specific time series
syntax in Stata and indeed stands outside
that framework.

Nick
n.j.cox@durham.ac.uk
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

```
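
To make Nick's point concrete, here is a minimal sketch in Stata. The toy data set and the variable names `action`, `cum_risky`, `cum_ok` and `lag_action` are hypothetical and do not come from the original thread; only `id` and `time` do. It contrasts a running sum taken after sorting on `id` alone with one taken after sorting on `time` within `id`, and then shows the `tsset` route. No output is shown, because the within-panel order after a plain sort on `id` is exactly what Stata does not promise.

```stata
* Hypothetical toy panel: 2 panels, 3 periods each, entered out of time order.
clear
input id time action
1 3 0
2 1 1
1 1 0
2 3 0
1 2 1
2 2 0
end

* Risky: only -id- is guaranteed sorted; order within each panel is arbitrary,
* so this running sum may or may not accumulate in time order.
bysort id : gen cum_risky = sum(action)

* Safer: also sort on -time- within -id- before accumulating.
bysort id (time) : gen cum_ok = sum(action)

* Alternative: declare the panel structure, then use time-series operators,
* which work within panels without an explicit -by-.
tsset id time
gen lag_action = L.action

list id time action cum_risky cum_ok lag_action, sepby(id)
```

If `cum_risky` and `cum_ok` happen to agree on a given run, that is luck and not a guarantee; the `bysort id (time)` form is the one that states what is meant.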