
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: RE: Sorting by and testing within subsets
Date: Tue, 11 Nov 2003 11:45:17 -0000

Steinar Fossedal, apart from MIME/HTML,

> I am currently working with a dataset which I suspect contains
> numerous errors - faulty settings of classification variables and
> such. Thus I want to run different logical tests to sort out which
> observations I need to have a closer look at.
>
> The dataset is set up like this:
>
> PersonID   PersonInfo
>    1           A
>    2           A
>    2           A
>    2           B
>    3           A
>    3           A
>    4           B
>    5           C
>    6           C
>    .           .
>    .           .
>
> I am interested in checking whether the information registered on a
> person is consistent. In the example above (sorry I can't give you
> the real deal, but it's sensitive information) we can see that
> Person 2 is registered twice as A and once as B. Person three is
> registered twice, both times as A.
>
> What I would like is a list which shows the persons who have
> conflicting PersonInfo, one line for each person (Only person 2 in
> the example above). I figure I have to sort the PersonID into
> groups somehow and then do a check within each group if the
> registered information is consistent. However I'm having a hard
> time getting Stata to do so. My best suggestion so far would be
>
> bysort PersonID: gen dummy=1 if( <not all values of PersonInfo
> within the group are equal> )
>
> but I'm not able to specify what goes in the if-statement since I
> cannot seem to find any function which counts number of distinct
> values. Also listing the troublesome observations only one line
> per person seems to be out of my grasp. Any comments or
> suggestions you might have would be greatly appreciated.

You're moving in exactly the right direction. You just need
to climb one hill and then the destination is in sight.

There is an -egen- function to count number of distinct values
as part of -egenmore- on SSC, but I don't think you need it.
Official Stata is more than adequate for this problem.

Anyway, on # of distinct values, see the FAQ

How do I compute the number of distinct observations?
http://www.stata.com/support/faqs/data/distinct.html

On your main problem, see the FAQ

How do I list observations in a group that differ on a variable?
http://www.stata.com/support/faqs/data/diff.html

In your case, something like

    bysort PersonID (PersonInfo) : gen diff = PersonInfo[1] != PersonInfo[_N]
    egen tag = tag(PersonID)
    list PersonID if tag & diff

gets a listing of problematic IDs listed once only.

The Stata logic here, leaning heavily on -by <varlist>:-,
is explained at the URL just cited. A tutorial on
-by:- is available within _Stata Journal_ 2(1):86-102 (2002).

Nick
n.j.cox@durham.ac.uk

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
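For anyone who wants to try the suggestion above, here is a minimal sketch that types in the toy data from the question and runs the three commands on it. It assumes PersonInfo is a string variable (the sensitive original data were not shown), and the last two commands are only one possible way to count distinct values per person with official Stata; the variable names firstval and ndistinct are made up for illustration and come from neither the post nor the FAQs.

```stata
* Toy data copied from the example in the question
* (PersonInfo stored as str1 is an assumption)
clear
input PersonID str1 PersonInfo
1 A
2 A
2 A
2 B
3 A
3 A
4 B
5 C
6 C
end

* Within each ID, sort on PersonInfo; if the first and last sorted
* values differ, the ID has conflicting information
bysort PersonID (PersonInfo) : gen diff = PersonInfo[1] != PersonInfo[_N]

* Tag exactly one observation per ID so each problem ID is listed once
egen tag = tag(PersonID)
list PersonID if tag & diff

* Optional: number of distinct PersonInfo values per ID
* (firstval and ndistinct are illustrative names)
bysort PersonID PersonInfo : gen byte firstval = _n == 1
by PersonID : egen ndistinct = total(firstval)
```

On these nine observations the listing should show PersonID 2 only, once, which matches what the question asked for. The first-versus-last comparison works because sorting within each ID puts equal values together, so any disagreement must show up at the two ends of the group.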

**References**: **st: Sorting by and testing within subsets** *From:* steinar.fossedal@skandiabanken.no

