The Stata listserver

deleting observations [was: RE: st: about deleting variables]

From   "Nick Cox" <>
To   <>
Subject   deleting observations [was: RE: st: about deleting variables]
Date   Wed, 11 Dec 2002 10:11:30 -0000

Hailing Zang

> > I have a question about deleting variables that have
> > a certain character. I know if I want to drop, then
> > simply use
> > 
> > .drop if <character>
> > 
> > But if I want to keep the first observation that has
> > that character, what should I do?

Ricardo Ovaldia 

> Remember that the observations are indexed by _n. That
> is, the first observation is _n=1, the second _n=2 and
> so on. 
> When you sort observations and use the -by- statement
> the first observation in each of the possible values
> of the "by-variable" is indexed by _n=1 and so on.
> So if you are able to sort your data by the variable,
> say x,  you can keep the first occurrence of each of
> the x values with the statement:
> . by x: keep if _n==1

This is a step in the right direction, 
but within Ricardo's solution lurks a bug. 

If you ask Stata to 

. sort x 
. by x : keep if _n == 1 

or (more concisely) to 

. bysort x : keep if _n == 1 

it is not _guaranteed_ that the first observation 
within each category after -sort- will be that which 
was first to occur in that category, in 
the pre-existing order of observations. That is, 
you asked Stata to -sort- on -x- and nothing else, 
so observations tied on -x- may end up in any order; 
often there will be many equivalent solutions, 
as your first lesson in combinatorics will have 
emphasised. To give a concrete example, suppose 
with the auto data bundled with Stata you asked 

. sort rep78 

then any order of observations in which 
all the observations for which -rep78- is 
1 precede all the observations for which 
-rep78- is 2, etc., is to Stata a solution 
to this problem; yet given 2 values of 1, 
8 of 2, 30 of 3, 18 of 4, 11 of 5 and 5 missing 
values, there are 2! 8! 30! 18! 11! 5! 
such solutions and a little mental arithmetic 
shows that this is a big number. 
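
If mental arithmetic fails, Stata can do the sum itself. 
A minimal sketch, assuming the counts above and working 
on the log scale with -lnfactorial()-; the scalar name 
-lnways- is just my choice for illustration: 

. scalar lnways = lnfactorial(2) + lnfactorial(8) + lnfactorial(30) 
. scalar lnways = lnways + lnfactorial(18) + lnfactorial(11) + lnfactorial(5) 
. display %9.3e exp(lnways) 

The answer should be of the order of 10^62. 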

The way to guarantee that the observation kept is 
the one that originally came first is more like 

. gen long order = _n 
. bysort x (order) : keep if _n == 1 

Then Stata will ensure that in the categories 
of -x- the first observation to occur will have 
the lowest value of -order-, and so on. 
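
Applied to the original question, the same idea might 
look something like the sketch below. The names here are 
assumptions for illustration only: -s- stands for a string 
variable, "a" for the character of interest, and -has- for 
an indicator of its presence. (-strpos()- is the name in 
recent Statas; older versions call it -index()-.) 

. gen long order = _n 
. gen byte has = strpos(s, "a") > 0 
. bysort has (order) : drop if has & _n > 1 
. sort order 

This drops every observation containing the character 
except the first to occur, leaves all other observations 
alone, and then puts the data back in their original order. 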

It may be that no new variable -order- 
is needed, as you may have some variable 
such as a time variable which plays the
same role. 

To see that this is needed, try 

. sysuse auto, clear 
. gen long order = _n 
. sort rep78 
. list rep78 order 
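
If you would rather count than eyeball the listing, 
a rough check is sketched below, continuing from the 
commands just given; -moved- is merely a name I have 
made up. It flags each observation whose -order- is lower 
than that of the observation before it in the same 
category of -rep78-: 

. by rep78 : gen byte moved = order < order[_n-1] if _n > 1 
. count if moved == 1 

Any count above zero shows that ties were not left in 
their original order; a count of zero on one run is no 
guarantee for the next. More recent versions of -sort- 
also take a -stable- option, which keeps tied observations 
in their existing relative order. So, on data still in 
their original order, this should also work: 

. sort rep78, stable 
. by rep78 : keep if _n == 1 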

There is more discussion of -by:- in 

Cox, N. J. 2002. How to move step by: step. Stata 
Journal 2(1): 86-102. 

