"Nick Cox" <n.j.cox@durham.ac.uk>

<statalist@hsphsun2.harvard.edu>

deleting observations [was: RE: st: about deleting variables]

Wed, 11 Dec 2002 10:11:30 -0000

Hailing Zang
> > I have a question about deleting variables that have
> > a certain character. I know if I want to drop, then
> > simply use
> >
> > . drop if <character>
> >
> > But if I want to keep the first observation that has
> > that character, what should I do?

Ricardo Ovaldia
> Remember that the observations are indexed by _n. That
> is, the first observation is _n=1, the second _n=2 and
> so on.
>
> When you sort observations and use the -by- statement
> the first observation in each of the possible values
> of the "by-variable" is indexed by _n=1 and so on.
>
> So if you are able to sort your data by the variable,
> say x, you can keep the first occurrence of each of
> the x values with the statement:
>
> . by x: keep if _n==1

This is a step in the right direction, but within
Ricardo's solution lurks a bug.

If you ask Stata to

. sort x
. by x : keep if _n == 1

or (more concisely) to

. bysort x : keep if _n == 1

it is not _guaranteed_ that the first observation within
each category after -sort- will be that which was first to
occur in that category, in the pre-existing order of
observations.

That is, you asked Stata to -sort- on -x-; often there
will be many equivalent solutions, as your first lesson
in combinatorics will have emphasised. To give a concrete
example, suppose with the auto data bundled with Stata
you asked

. sort rep78

then any order of observations in which all the
observations for which -rep78- is 1 precede all the
observations for which -rep78- is 2, etc., is to Stata
a solution to this problem; yet given 2 values of 1 and
8 values of 2, ..., 5 values of missing there are

2! 8! 30! 18! 11! 5!

such solutions and a little mental arithmetic shows that
this is a big number.

The way to guarantee that the observation kept is the one
that occurred first is more like

. gen long order = _n
. bysort x (order) : keep if _n == 1

Then Stata will sort on -order- within the categories of
-x-, so that the first observation in each category is the
one with the lowest value of -order-, namely the one that
occurred first, and so on.

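As a minimal sketch of how the same device might be applied to the original question (drop the observations with some characteristic, but keep the first of them), consider the following. The variable names -order- and -flag-, the condition -rep78 == 3-, and the use of -sysuse- (which assumes Stata 8 or later) are assumptions made for illustration; they are not from the posts above.

```stata
* Sketch only: -order-, -flag- and the condition rep78 == 3 are hypothetical.
sysuse auto, clear

* Record the pre-existing order before any sorting is done.
gen long order = _n

* Mark the observations that have the characteristic of interest.
gen byte flag = (rep78 == 3)

* Remember which flagged observation occurred first.
summarize order if flag == 1, meanonly
local first = r(min)

* Within flag == 1, sorted on -order-, drop all but the first observation.
bysort flag (order) : drop if flag == 1 & _n > 1

* Restore the original order and check that the right observation survived.
sort order
count if flag == 1
assert r(N) == 1
assert order == `first' if flag == 1
```

The final -sort order- puts the remaining observations back in their original sequence, and the two -assert- statements stop with an error if anything other than the first-occurring flagged observation was kept.
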
It may be that no new variable -order- is needed, as you
may have some variable such as a time variable which plays
the same role.

To see that some such precaution is needed, try

. use auto
. gen order = _n
. sort rep78
. l rep78 order

and look at whether -order- is still ascending within
each category of -rep78-; nothing guarantees that it
will be.

There is more discussion of -by:- in

2002. How to move step by: step. Stata Journal 2(1): 86-102.

Nick
n.j.cox@durham.ac.uk

