
st: A note on -sort- order, especially panel data

From   "Nick Cox" <>
To   <>
Subject   st: A note on -sort- order, especially panel data
Date   Thu, 12 Sep 2002 12:15:08 +0100

Nisha Malhotra posted a panel data problem which 
attracted a flurry of overlapping answers, 
from which an acceptable solution should emerge, 
once Nisha has sorted out whether the jump 
should take place at or after the first 
action and what is appropriate for the very 
first value in a panel (for which previous 
conditions are unknown, at least to Stata). 

I want to expand on a point arising which is much 
more general and can bite you (and you won't 
always notice). Let's abstract to a structure of 
panel identifier 

id 

and time variable 

time 
The problem is with code like this: 

. sort id 
. by id : gen <whatever> 

which Stata 7 users can happily telescope to 

. bysort id : gen <whatever> 

The way this arises is that 

(1) you want to do something separately 
for each panel 


(2) you know that Stata requires a prior 
-sort- for that, so you oblige. (More 
than courtesy here: it's the law.) 

What's tricky is that the code often 
should be 

. sort id time 
. by id : gen <whatever> 

or the equivalent 

. bysort id (time) : gen <whatever> 

-- whenever, that is, you also want 
observations within each panel to be in 
time order. Even when correct within-panel order 
is irrelevant to what you want, as when say you are 
computing means, it rarely does any harm. 
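
As a minimal sketch of the point just made, here is a toy panel 
entered deliberately out of time order. The variable -x-, the 
new variable -cumx- and all the values are invented for 
illustration; only -id- and -time- come from the discussion above. 

```stata
* Toy panel, typed in out of time order (made-up values)
clear
input id time x
1 2 20
1 1 10
2 1 5
1 3 30
2 2 15
end

* Sort on id, and on time within id, then take a running sum
* separately within each panel
bysort id (time): gen cumx = sum(x)

list id time x cumx, sepby(id)
```

Because -time- is named in parentheses, it determines the order 
within each panel but does not define the groups, so the running 
sum restarts only when -id- changes. 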

What underlies all this is the literal-mindedness 
of Stata, which does what you say, not what 
you mean. Given the instruction 

. sort id 

Stata will be satisfied with _any_ ordering 
of observations for which -id- is sorted, 
and there are usually lots of possibilities, 
as some combinatorial calculations will confirm. 
Stata does not care about any other point. 
Indeed, having done what you want, it sits there

Now it is often the case in practice that panel data 
will come in order of -id- and then -time-, 
or will be left that way after a previous command. 
And, increasingly, it is a standard 
that Stata commands should not 
change the -sort- order of your 
data unless you explicitly specify
that or it is among the purposes 
of a command. So no harm may ensue. 

But -- as said, and this is the crunch -- 
Stata makes absolutely no promises about 
order of observations within each block defined 
by -id- (or within any other varlist 
given as argument to -sort-). So there 
is a possibility that operations dependent 
on within-panel order will give incorrect results. 
In the problem here, operations based on 
the -sum()- function are a case in point. 
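
One way to guard against this, sketched here with the same kind 
of invented toy data (-x- and -cumx- are hypothetical names), is 
to use the parenthesised form of -by- without asking it to sort. 
As I understand it, -by id (time):- then merely checks that the 
data are already sorted by -id- and -time- and stops with a 
"not sorted" error if they are not. 

```stata
* Toy panel, typed in out of time order (made-up values)
clear
input id time x
1 2 20
1 1 10
2 1 5
1 3 30
2 2 15
end

* Sorting on id alone says nothing about order within panels
sort id

* No sort option here, so by only verifies the order; capture
* traps the expected "not sorted" error so a do-file can continue
capture noisily by id (time): gen cumx = sum(x)
display "return code: " _rc

* After an explicit sort on both variables the same line runs
sort id time
by id (time): gen cumx = sum(x)
```

The first attempt fails before -cumx- is created, which is why 
the same -gen- can be issued again after the full sort. 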

With panel data there is another and 
in many ways a better approach. -tsset- 
your data and use time-series operators. 
Then given some initial 

. tsset id time 

any later 

. tsset 

will automatically return panel data 
to the correct sort order, so that 

. by id: ... 

is then guaranteed to work on the 
correct within-panel order. In 
addition, Stata refuses to 
do calculations based on operators 
such as L. unless data are in the 
correct sort order, providing for 
you a safety catch. Conversely, for 
operators like L. you don't 
need to specify separate 
calculations within panels:  
that is done automatically 
given a -tsset- to panel data. 
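
A short sketch of that approach, again on invented toy data 
(-x-, -change- and -cumx- are hypothetical names): 

```stata
* Toy panel, typed in out of time order (made-up values)
clear
input id time x
1 2 20
1 1 10
2 1 5
1 3 30
2 2 15
end

* Declare the panel structure; this also sorts on id and time
tsset id time

* L.x is the previous value within the same panel, so the first
* observation of each panel gets a missing value
gen change = x - L.x

* by id: can now rely on time order within each panel
by id: gen cumx = sum(x)

list id time x change cumx, sepby(id)
```

Note that no -by id:- is needed for the line using L., whereas 
the running sum still needs it. 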

-sum()-, however, has nothing 
to do with time series as such. It 
long predates specific time series 
syntax in Stata and indeed stands outside
that framework. 
