deleting observations [was: RE: st: about deleting variables]
> > I have a question about deleting variables that have
> > a certain character. I know if I want to drop, then
> > simply use
> > .drop if <character>
> > But if I want to keep the first observation that has
> > that character, what should I do?
> Remember that the observations are indexed by _n. That
> is, the first observation is _n=1, the second _n=2 and
> so on.
> When you sort observations and use the -by- statement
> the first observation in each of the possible values
> of the "by-variable" is indexed by _n=1 and so on.
> So if you are able to sort your data by the variable,
> say x, you can keep the first occurrence of each of
> the x values with the statement:
> . by x: keep if _n==1
This is a step in the right direction,
but within Ricardo's solution lurks a bug.
If you ask Stata to
. sort x
. by x : keep if _n == 1
or (more concisely) to
. bysort x : keep if _n == 1
it is not _guaranteed_ that the first observation
within each category after -sort- will be that which
was first to occur in that category, in
the pre-existing order of observations. That is,
you asked Stata to -sort- on -x-; often
there will be many equivalent solutions,
as your first lesson in combinatorics will have
emphasised. To give a concrete example, suppose
with the auto data bundled with Stata you asked
. sort rep78
then any order of observations in which
all the observations for which -rep78- is
1 precede all the observations for which
-rep78- is 2, etc., is to Stata a solution
to this problem; yet given 2 values of 1,
8 values of 2, ..., and 5 missing values,
there are 2! 8! 30! 18! 11! 5!
such solutions and a little mental arithmetic
shows that this is a big number.
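For anyone who prefers not to do that arithmetic
mentally, a rough check is possible in Stata itself.
This is only a sketch, assuming a recent Stata in
which the function lnfactorial() is available; the
scalar name -lnways- is arbitrary. Working on the
log scale keeps the intermediate values manageable.
. scalar lnways = lnfactorial(2) + lnfactorial(8) + lnfactorial(30)
. scalar lnways = lnways + lnfactorial(18) + lnfactorial(11) + lnfactorial(5)
. display exp(lnways)
The result should be roughly 6.6e+62.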
The way to guarantee that the pre-existing
order is respected is more like
. gen long order = _n
. bysort x (order) : keep if _n == 1
Then Stata will ensure that in the categories
of -x- the first observation to occur will have
the lowest value of -order-, and so on.
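Applied to the original question, the same idea
might look like the sketch below. The names are
hypothetical: -s- stands for a string variable and
"a" for the character of interest, and -flag- is
an arbitrary name. It assumes a recent Stata in
which strpos() is the string-search function.
. gen long order = _n
. gen byte flag = strpos(s, "a") > 0
. bysort flag (order) : drop if flag & _n > 1
. sort order
This keeps every observation without the character,
keeps only the first observation with it, and then
restores the original order of observations.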
It may be that no new variable -order-
is needed, as you may have some variable
such as a time variable which plays the
same role.
To see that this is needed, try
. use auto
. gen order = _n
. sort rep78
. l rep78 order
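As a follow-up check, here is a sketch of the
safer approach applied to the same data. It assumes
a recent Stata in which -sysuse- loads the bundled
auto data; the variable name -first- is arbitrary.
. sysuse auto, clear
. gen long order = _n
. bysort rep78 (order) : gen byte first = _n == 1
. list rep78 order if first
Within each category of -rep78- the flagged
observation is the one with the lowest value of
-order-. If the only aim is to preserve the existing
order within ties, -sort rep78, stable- is another
possibility.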
There is more discussion of -by:- in
2002. How to move step by: step. Stata
Journal 2(1): 86-102.