You're correct. I misread the question. Sorry.
Nick
[email protected]
Gary Longton
> Daphna Bassok wrote:
>
> > I have several duplicate observations in my data set.
> However, they are
> > not perfect duplicates. Only the id # is the same. So
> there might be
> > two observations with id#16 for instance, the first will
> have values for
> > some variables, and missing values for others. The second
> also have some
> > values filled and some missing. There are no cases in
> which both have
> > values- that is... either the first in the pair has the
> value OR the
> > second has a value (or neither).
> >
> > For example: suppose I have two observations with id# 16...
> The first
> > has values for var1 and 2 and not 3. The second ONLY has
> values for
> > var 3. What i would like to do is simply collapse these
> into a single
> > observation with all the relevant info. meaning, 1 observation with
> > id#16 that has values for all three variables.
> >
> > I am trying to do this with the collapse command with no success.
> >
> > My code is:
> >
> > collapse (min) var1-var3, by(id)
> >
> > I thought this would create a new observation that has all
> the data in it.
> >
> > I am getting a "type mismatch" error.
> >
> > Is this because some of my variables are string variables?
>
> Nick Cox suggested:
>
> > What you can do is -- if your description is correct --
> >
> > egen nmiss = rowmiss(<insert variable names>)
> > bysort id (nmiss) : keep if _n == 1
> >
> > as the sort will sort the observation with
> > more missings to second place.
>
> and Austin Nichols suggested:
>
> > foreach v of varlist put all the relevant varnames here {
> > bys id (`v'): qui replace `v'=`v'[_n-1] if mi(`v')
> > }
> > bys id: drop if _n>1
>
> It is a rare day when one can make a correction to a
> typically accurate and
> elegant Nick Cox solution, so I make this one fearing that
> I've probably missed
> somthing obvious.
>
> If I understand the problem correctly, I think this solution
> will discard
> non-missing data for some variables.
>
> Eg. a simplified dataset like this seems consistent with
> Daphna's description:
>
> obs id var1 var2 var3 nmiss
>
> 1 16 a . 3 1
> 2 16 . 2 . 2
> 3 17 . . 7 2
> 4 17 c 3 . 1
>
> sorting on nmiss will discard observations 2 & 3, throwing
> away non-missing data
> for var2
>
> As Austin's solution suggests, one needs to sort separately
> for each variable in
> the list and carry out the replace for that variable.
> However missings sort
> differently for string and numeric variables, taking first
> place for strings and
> last for numerics, so need to be handled differently in a
> sort solution.
> Austin's solution won't sort correctly for string variables.
>
> There is probably a shorter approach, but I think this will do it:
>
> ********
> foreach var of varlist var1-var3 {
> if substr("`:type `var''",1,3) == "str" {
> bysort id (`var') replace `var' = `var'[_N]
> }
> else {
> bysort id (`var') replace `var' = `var'[1]
> }
> }
> bysort id: drop if _n>1
> ************
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/