# deleting observations [was: RE: st: about deleting variables]

From: "Nick Cox"
Subject: deleting observations [was: RE: st: about deleting variables]
Date: Wed, 11 Dec 2002 10:11:30 -0000

```
Hailing Zang

> > I have a question about deleting variables that have
> > a certain character. I know if I want to drop, then
> > simply use
> >
> > .drop if <character>
> >
> > But if I want to keep the first observation that has
> > that character, what should I do?

Ricardo Ovaldia

> Remember that the observations are indexed by _n. That
> is, the first observation is _n=1, the second _n=2 and
> so on.
>
> When you sort observations and use the -by- statement
> the first observation in each of the possible values
> of the "by-variable" is indexed by _n=1 and so on.
>
> So if you are able to sort your data by the variable,
> say x,  you can keep the first occurrence of each of
> the x values with the statement:
>
> . by x: keep if _n==1
>

This is a step in the right direction,
but within Ricardo's solution lurks a bug.
Even if you expand it to

. sort x
. by x : keep if _n == 1

or (more concisely) to

. bysort x : keep if _n == 1

it is not _guaranteed_ that the first observation
within each category after -sort- will be that which
was first to occur in that category, in
the pre-existing order of observations. That is,
you asked Stata to -sort- on -x-; often
there will be many equivalent solutions,
as your first lesson in combinatorics will have
emphasised. To give a concrete example, suppose
with the auto data bundled with Stata you asked

. sort rep78

then any order of observations in which
all the observations for which -rep78- is
1 precede all the observations for which
-rep78- is 2, etc., is to Stata a solution
to this problem; yet given 2 values of 1
and 8 values of 2, ..., 5 values of missing
there are 2! 8! 30! 18! 11! 5!
such solutions and a little mental arithmetic
shows that this is a big number.
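
As a rough check on "big" (a sketch, not part of the
argument; it assumes your Stata has the -lnfactorial()-
function), the product can be evaluated on the log scale,
which also stays safe for much larger groups:

. di %9.3e exp(lnfactorial(2) + lnfactorial(8) + lnfactorial(30) + lnfactorial(18) + lnfactorial(11) + lnfactorial(5))

The answer is of the order of 10^62.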

The way to guarantee that the observation kept
is the first to occur is more like

. gen long order = _n
. bysort x (order) : keep if _n == 1

Then Stata will ensure that in the categories
of -x- the first observation to occur will have
the lowest value of -order-, and so on.

It may be that no new variable -order-
is needed, as you may have some variable
such as a time variable which plays the
same role.
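
For instance (a sketch with hypothetical variable names:
-id- for the category and -time- for a variable recording
the order in which observations arose), something like

. bysort id (time) : keep if _n == 1

keeps the earliest observation for each -id-. Note the
assumption that -time- has no ties within -id-; with ties,
which of the tied observations comes first is again
not guaranteed.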

To see that this is needed, try

. use auto
. gen order = _n
. sort rep78
. l rep78 order
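
To check rather than eyeball the listing, one possibility
(a sketch; -outoforder- is just an illustrative name) is to
flag observations whose -order- is lower than that of the
previous observation in the same category, directly after
the -sort- above:

. by rep78 : gen byte outoforder = order < order[_n-1] if _n > 1
. count if outoforder == 1

A nonzero count means that -sort- did not preserve the
original order within categories; whether that happens
in any particular run is not guaranteed either way.
If your Stata's -sort- supports the -stable- option,

. sort rep78, stable

is an alternative that keeps observations with equal
values of -rep78- in their previous relative order.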

There is more discussion of -by:- in

Cox, N. J. 2002. Speaking Stata: How to move step by: step.
Stata Journal 2(1): 86-102.

Nick
n.j.cox@durham.ac.uk
```

