L'esprit de l'escalier:
When observations come in blocks of 2 another
approach is possible that removes the need for
sorting on each variable, although one prior
sort id
is needed.
foreach var of varlist var1-var3 {
by id: replace `var' = `var'[3 - _n] if missing(`var')
}
When the missing values are in the first observation, _n
is 1 and 3 - _n is 2. When the second, _n is 2 and 3 - _n
is 1. Hence 3 - _n indexes the other observation in blocks
of 2. This works regardless of variable type.
Nick
[email protected]
Nick Cox
> You're correct. I misread the question. Sorry.
Gary Longton
> > Daphna Bassok wrote:
> >
> > > I have several duplicate observations in my data set.
> > However, they are
> > > not perfect duplicates. Only the id # is the same. So
> > there might be
> > > two observations with id#16 for instance, the first will
> > have values for
> > > some variables, and missing values for others. The second
> > also have some
> > > values filled and some missing. There are no cases in
> > which both have
> > > values- that is... either the first in the pair has the
> > value OR the
> > > second has a value (or neither).
> > >
> > > For example: suppose I have two observations with id# 16...
> > The first
> > > has values for var1 and 2 and not 3. The second ONLY has
> > values for
> > > var 3. What i would like to do is simply collapse these
> > into a single
> > > observation with all the relevant info. meaning, 1
> observation with
> > > id#16 that has values for all three variables.
> > >
> > > I am trying to do this with the collapse command with no success.
> > >
> > > My code is:
> > >
> > > collapse (min) var1-var3, by(id)
> > >
> > > I thought this would create a new observation that has all
> > the data in it.
> > >
> > > I am getting a "type mismatch" error.
> > >
> > > Is this because some of my variables are string variables?
> >
> > Nick Cox suggested:
> >
> > > What you can do is -- if your description is correct --
> > >
> > > egen nmiss = rowmiss(<insert variable names>)
> > > bysort id (nmiss) : keep if _n == 1
> > >
> > > as the sort will sort the observation with
> > > more missings to second place.
> >
> > and Austin Nichols suggested:
> >
> > > foreach v of varlist put all the relevant varnames here {
> > > bys id (`v'): qui replace `v'=`v'[_n-1] if mi(`v')
> > > }
> > > bys id: drop if _n>1
> >
> > It is a rare day when one can make a correction to a
> > typically accurate and
> > elegant Nick Cox solution, so I make this one fearing that
> > I've probably missed
> > somthing obvious.
> >
> > If I understand the problem correctly, I think this solution
> > will discard
> > non-missing data for some variables.
> >
> > Eg. a simplified dataset like this seems consistent with
> > Daphna's description:
> >
> > obs id var1 var2 var3 nmiss
> >
> > 1 16 a . 3 1
> > 2 16 . 2 . 2
> > 3 17 . . 7 2
> > 4 17 c 3 . 1
> >
> > sorting on nmiss will discard observations 2 & 3, throwing
> > away non-missing data
> > for var2
> >
> > As Austin's solution suggests, one needs to sort separately
> > for each variable in
> > the list and carry out the replace for that variable.
> > However missings sort
> > differently for string and numeric variables, taking first
> > place for strings and
> > last for numerics, so need to be handled differently in a
> > sort solution.
> > Austin's solution won't sort correctly for string variables.
> >
> > There is probably a shorter approach, but I think this will do it:
> >
> > ********
> > foreach var of varlist var1-var3 {
> > if substr("`:type `var''",1,3) == "str" {
> > bysort id (`var') replace `var' = `var'[_N]
> > }
> > else {
> > bysort id (`var') replace `var' = `var'[1]
> > }
> > }
> > bysort id: drop if _n>1
> > ************
>
>
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/