Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

RE: st: Collapsing with strings


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Collapsing with strings
Date   Mon, 16 Jan 2006 14:17:11 -0000

When observations come in blocks of 1 
then missings are replaced by missings
with this approach. So, a little 
unnecessary, but no harm done. Some
might prefer to use the code below 
and add an extra condition 

... & _N == 2 

Nick 
n.j.cox@durham.ac.uk 

Nick Cox 
 
> L'esprit de l'escalier:
> 
> When observations come in blocks of 2 another
> approach is possible that removes the need for 
> sorting on each variable, although one prior
> 
> sort id 
> 
> is needed. 
> 
> foreach var of varlist var1-var3 {
> 	by id: replace `var' = `var'[3 - _n] if missing(`var') 
> }
> 
> When the missing values are in the first observation, _n 
> is 1 and 3 - _n is 2. When the second, _n is 2 and 3 - _n
> is 1. Hence 3 - _n indexes the other observation in blocks 
> of 2. This works regardless of variable type. 
 
Nick Cox 
  
> > You're correct. I misread the question. Sorry. 
>  
> Gary Longton
>   
> > > Daphna Bassok wrote:
> > > 
> > > > I have several duplicate observations in my data set.  
> > > However, they are 
> > > > not perfect duplicates.  Only the id # is the same. So 
> > > there might be 
> > > > two observations with id#16 for instance, the first will 
> > > have values for 
> > > > some variables, and missing values for others. The second 
> > > also have some 
> > > > values filled and some missing.  There are no cases in 
> > > which both have 
> > > > values- that is... either the first in the pair has the 
> > > value OR the 
> > > > second has a value (or neither).
> > > > 
> > > > For example: suppose I have two observations with id# 16... 
> > >  The first 
> > > > has values for var1 and  2 and not 3.   The second ONLY has 
> > > values for 
> > > > var 3.   What i would like to do is simply collapse these 
> > > into a single 
> > > > observation with all the relevant info. meaning, 1 
> > observation with 
> > > > id#16 that has values for all three variables.
> > > > 
> > > > I am trying to do this with the collapse command with 
> no success.
> > > > 
> > > > My code is:
> > > > 
> > > > collapse (min) var1-var3, by(id)
> > > > 
> > > > I thought this would create a new observation that has all 
> > > the data in it.
> > > > 
> > > > I am getting a "type mismatch" error.
> > > > 
> > > > Is this because some of my variables are string variables?
> > > 
> > > Nick Cox suggested:
> > > 
> > > > What you can do is -- if your description is correct -- 
> > > > 
> > > > egen nmiss = rowmiss(<insert variable names>) 
> > > > bysort id (nmiss) : keep if _n == 1
> > > > 
> > > > as the sort will sort the observation with 
> > > > more missings to second place. 
> > > 
> > > and Austin Nichols suggested:
> > > 
> > > > foreach v of varlist put all the relevant varnames here {
> > > >  bys id (`v'): qui replace `v'=`v'[_n-1] if mi(`v')
> > > > }
> > > > bys id: drop if _n>1
> > > 
> > > It is a rare day when one can make a correction to a 
> > > typically accurate and 
> > > elegant Nick Cox solution, so I make this one fearing that 
> > > I've probably missed 
> > > somthing obvious.
> > > 
> > > If I understand the problem correctly, I think this solution 
> > > will discard 
> > > non-missing data for some variables.
> > > 
> > > Eg. a simplified dataset like this seems consistent with 
> > > Daphna's description:
> > > 
> > > obs   id   var1   var2  var3  nmiss
> > > 
> > > 1     16    a      .      3     1
> > > 2     16    .      2      .     2
> > > 3     17    .      .      7     2
> > > 4     17    c      3      .     1
> > > 
> > > sorting on nmiss will discard observations 2 & 3, throwing 
> > > away non-missing data 
> > > for var2
> > > 
> > > As Austin's solution suggests, one needs to sort separately 
> > > for each variable in 
> > > the list and carry out the replace for that variable.  
> > > However missings sort 
> > > differently for string and numeric variables, taking first 
> > > place for strings and 
> > > last for numerics, so need to be handled differently in a 
> > > sort solution. 
> > > Austin's solution won't sort correctly for string variables.
> > > 
> > > There is probably a shorter approach, but I think this will do it:
> > > 
> > > ********
> > > foreach var of varlist var1-var3 {
> > >       if substr("`:type `var''",1,3) == "str" {
> > > 		bysort id (`var') replace `var' = `var'[_N]
> > > 	}
> > > 	else {
> > > 		bysort id (`var') replace `var' = `var'[1]
> > > 	}
> > > }
> > > bysort id: drop if _n>1
> > > ************
> >  
> > 
> 

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2021 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index