Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Collapsing with strings


From   Gary Longton <glongton@fhcrc.org>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Collapsing with strings
Date   Fri, 13 Jan 2006 11:57:53 -0800

Daphna Bassok wrote:

I have several duplicate observations in my data set. However, they are not perfect duplicates. Only the id # is the same. So there might be two observations with id#16 for instance, the first will have values for some variables, and missing values for others. The second also have some values filled and some missing. There are no cases in which both have values- that is... either the first in the pair has the value OR the second has a value (or neither).

For example: suppose I have two observations with id# 16... The first has values for var1 and 2 and not 3. The second ONLY has values for var 3. What i would like to do is simply collapse these into a single observation with all the relevant info. meaning, 1 observation with id#16 that has values for all three variables.

I am trying to do this with the collapse command with no success.

My code is:

collapse (min) var1-var3, by(id)

I thought this would create a new observation that has all the data in it.

I am getting a "type mismatch" error.

Is this because some of my variables are string variables?
Nick Cox suggested:

What you can do is -- if your description is correct --
egen nmiss = rowmiss(<insert variable names>) bysort id (nmiss) : keep if _n == 1

as the sort will sort the observation with more missings to second place.
and Austin Nichols suggested:

foreach v of varlist put all the relevant varnames here {
 bys id (`v'): qui replace `v'=`v'[_n-1] if mi(`v')
}
bys id: drop if _n>1
It is a rare day when one can make a correction to a typically accurate and elegant Nick Cox solution, so I make this one fearing that I've probably missed somthing obvious.

If I understand the problem correctly, I think this solution will discard non-missing data for some variables.

Eg. a simplified dataset like this seems consistent with Daphna's description:

obs id var1 var2 var3 nmiss

1 16 a . 3 1
2 16 . 2 . 2
3 17 . . 7 2
4 17 c 3 . 1

sorting on nmiss will discard observations 2 & 3, throwing away non-missing data for var2

As Austin's solution suggests, one needs to sort separately for each variable in the list and carry out the replace for that variable. However missings sort differently for string and numeric variables, taking first place for strings and last for numerics, so need to be handled differently in a sort solution. Austin's solution won't sort correctly for string variables.

There is probably a shorter approach, but I think this will do it:

********
foreach var of varlist var1-var3 {
if substr("`:type `var''",1,3) == "str" {
bysort id (`var') replace `var' = `var'[_N]
}
else {
bysort id (`var') replace `var' = `var'[1]
}
}
bysort id: drop if _n>1
************

- Gary

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/




© Copyright 1996–2021 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index