Statalist



st: AW: RE: dropping observation


From   "Martin Weiss" <martin.weiss1@gmx.de>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: AW: RE: dropping observation
Date   Thu, 11 Jun 2009 17:56:58 +0200


"Experienced users would want me to underline that any missing values on -employerID- would need consideration."


Difficult indeed; everything depends on what Stefano wants to assume about the missing cases. In the code below, I have included several analysts with various degrees of "missingness"...


*************
clear*

input forecast /* 
 */ analystID employerID
1	1	1
2	1	1
3	1	1
1	2	1
2	2	1
3	2	2
4	2	2
1	3	3
2	3	4
1 4 .
2 4 5
3 4 .
4 4 5
1 5 6
2 5 .
3 5 7
4 5 .
1 6 .
2 6 .
end

compress
list, noobs /* 
 */ sepby(analy) 

 
// get the last nonmissing employer, trick from
// http://www.stata.com/support/faqs/data/dropmiss.html
// (egen allows expressions for some of its functions)
bys anal (employ): egen lastnonmiempl = ///
    max(cond(!missing(employ), employ, .))


// flag analysts with at least one missing on the employer var
bys anal: egen miss = total(mi(employ))

replace miss = miss != 0

list, noobs /* 
 */ sepby(analy) 
 
// drop those who never changed employer; additionally: only
// those w/o missings on the employer var
bysort analystID (employerID): drop if ///
    employerID[1] == lastnonmiempl[1] & miss == 0
 
list, noobs /* 
 */ sepby(analy) 
 
 /*
 Now it really depends on
 whether you want to drop
 those who did not change jobs
 during the "visible" part
 of their career. If so,
 uncomment this:
 
 bysort analystID (employerID): drop if employerID[1] == lastnonmiempl[1]
 
 */ 
 
 
 /* OR you could give them the
 benefit of the doubt, assuming
 that the missing indicates
 a job change. Leave everything
 as it is, then.
 You still have to decide
 how to go about this business
 regarding analyst # 6
 who has all missings... 
 */ 
 
 list, noobs /* 
 */ sepby(analy) 
*************



HTH
Martin


-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
Sent: Thursday, 11 June 2009 10:26
To: statalist@hsphsun2.harvard.edu
Subject: st: RE: dropping observation

The solutions suggested all work with this kind of data and all have a clear logic. 

Note that only Tirthankar's and Kieran's would apply as well to a string identifier. 

They all involve a constructed extra variable. That can be avoided in this way: 

bysort analystID (employerID) : drop if employerID[1] == employerID[_N] 

The logic here is that if all values are the same in a group, then the first will equal the last, except that we must sort too. 
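As a minimal sketch of that logic with a string identifier: strings sort too, so the same one-liner carries over. The variable name employer and its values "A" to "D" are made up for illustration and are not from Stefano's data.

```stata
clear
input analystID str1 employer
1 "A"
1 "A"
1 "A"
2 "A"
2 "A"
2 "B"
2 "B"
3 "C"
3 "D"
end

* within each analyst, sort on employer; if first equals last, all are equal
bysort analystID (employer) : drop if employer[1] == employer[_N]

list, noobs sepby(analystID)
```

Only analyst 1 is dropped here; analysts 2 and 3 each have two distinct employers.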

See also the FAQ 

How do I list observations in a group that differ on a variable?
http://www.stata.com/support/faqs/data/diff.html

This may not sound like the same problem, but change != to == and -list- to -drop- and the logic carries over. 

Experienced users would want me to underline that any missing values on -employerID- would need consideration. 
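One way to see why, sketched on made-up data: numeric missings sort after all nonmissing values in Stata, so an analyst observed with employer 5 and a missing looks like a changer to the one-liner above. Whether that is right depends on what the missings mean. The names changed_raw, lo, hi and changed_known are hypothetical.

```stata
clear
input analystID employerID
1 5
1 .
1 5
2 6
2 7
2 .
3 .
3 .
end

* first vs. last after sorting: missing sorts last, so it counts as a distinct value
bysort analystID (employerID) : gen byte changed_raw = employerID[1] != employerID[_N]

* -egen-'s min() and max() ignore missings, so this compares known employers only
egen lo = min(employerID), by(analystID)
egen hi = max(employerID), by(analystID)
gen byte changed_known = lo != hi

list, noobs sepby(analystID)
```

Analyst 1 is flagged by the first indicator but not by the second; analyst 3, with all values missing, is flagged by neither and needs a separate decision.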

Nick 
n.j.cox@durham.ac.uk

Eric A. Booth
==============

bysort analystID: egen max = max(employerID) 
bysort analystID: egen min = min(employerID) 
drop if max==min

Tirthankar Chakravarty
======================

Using Nick Cox's -egenmore- package (SSC):

/* Spells */
clear
// ssc install egenmore, replace
input forecast_no analystID employerID
1                 1            1
2                 1            1
3                 1            1
1                 2            1
2                 2            1
3                 2            2
4                 2            2
1                 3            3
2                 3            4
end
egen nvalsID = nvals(employerID), by(analystID) 
drop if nvalsID==1 
list, clean

Howie Lempel
============

Create a variable with the mean absolute deviation from the mean of employer ID for each analyst.  This will be 0 if the employer ID never changes.

bysort analystID: egen Demp = mdev(employerID)

Drop observations where the employer ID never changed.

drop if Demp==0

Kieran McCaul
=============

sort analystID employerID
by analystID employerID: gen N1=_N
by analystID: gen N2=_N
drop if N2==N1


Stefano Bonini
==============

I have a huge panel dataset containing analyst forecasts. Each analyst is associated with an employer. Sometimes analysts change employer. I want to restrict my dataset, dropping the observations of analysts who never change employer. The dataset may look like this:

forecast#     analystID   employer ID
1                 1            1
2                 1            1
3                 1            1

1                 2            1
2                 2            1
3                 2            2
4                 2            2

1                 3            3
2                 3            4

In this case I'd need to drop all observations by analyst 1 because he never changes employer, while keeping those of analysts 2 and 3.

I really cannot figure out the way to do it as visual inspection is just impossible with over 1.2m obs.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/




