
The following material is based on a question and answer that appeared on Statalist.

How do I list observations in a group that differ on a variable?

Title:  Listing observations in a group that differ on a variable
Author: Nicholas J. Cox, Durham University, UK
Date:   November 2001

The problem

I have data on various individuals with genotypes ascertained from samples taken at different times. I want to list only those samples with differing genotypes for each individual.

The data are

 eid     egenotype  
 0       vv         
 0       vv         
 1       vv         
 1       ww         
 2       ww         
 2       vv         
 2       ww

The solution

The question does not specify whether egenotype is a string variable or a numeric variable with labels. The solution here applies to both and also to numeric variables without labels. First, we sort the data on eid and then on egenotype:

        . sort eid egenotype 

If all the values of egenotype are the same within an eid, then after sorting, the first value within that eid will equal the last. If there is any variation within an eid, that will not be true. This works irrespective of the number of observations for each eid, the number of distinct egenotypes, and the type of variable used. Thus, for eid 0, the first value vv equals the last, but for eid 1 and 2, the first and last values differ. The example of eid 2 also shows why sorting is essential: in the original order, the first and third values are both ww, but the middle value is vv, so comparing first and last without sorting would miss the difference.
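The first-equals-last reasoning can be checked directly on the example data. The following is a minimal sketch, assuming egenotype is a string variable; the variables first and last are hypothetical names introduced only for illustration.

```stata
* Enter the example data (egenotype assumed to be a string variable)
clear
input byte eid str2 egenotype
0 "vv"
0 "vv"
1 "vv"
1 "ww"
2 "ww"
2 "vv"
2 "ww"
end

* Sort within individual, then copy the first and last sorted value
* of each group into every observation of that group
sort eid egenotype
by eid: gen first = egenotype[1]
by eid: gen last  = egenotype[_N]

* first and last agree only for individuals with a single genotype
list eid egenotype first last, sepby(eid)
```

In the listing, first and last should agree for eid 0 and disagree for eid 1 and 2.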

Accordingly, we work out which groups have different values and then list those groups only:

        . by eid (egenotype), sort: gen diff = egenotype[1] != egenotype[_N] 
        . list eid egenotype if diff 

The by ..., sort prefix combines sort eid egenotype with an ensuing by eid: generate statement. Under the protection of by:, subscripts apply to observations within each group: [1] denotes the first observation, and [_N] denotes the last observation within each group. If the corresponding values differ, diff will be 1, and if they are the same, diff will be 0. (For more information on this, see http://www.stata.com/support/faqs/data/trueorfalse.html.) The list is then restricted to the observations in groups whose values differ.
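The same two commands can be tried on a numeric variable with value labels. This is a sketch only: the codes 1 and 2 and the label name geno are assumptions made for illustration and are not part of the original question.

```stata
* Example data with egenotype held as a labeled numeric variable
* (codes 1 = vv and 2 = ww are assumed for illustration)
clear
input byte eid byte egenotype
0 1
0 1
1 1
1 2
2 2
2 1
2 2
end
label define geno 1 "vv" 2 "ww"
label values egenotype geno

* Within each eid, sorted on egenotype, compare first and last values;
* the comparison yields 1 if they differ and 0 if they are equal
by eid (egenotype), sort: gen diff = egenotype[1] != egenotype[_N]

* List only the individuals whose genotypes differ
list eid egenotype if diff, sepby(eid)
```

The comparison is made on the underlying numeric codes, so the value labels affect only how the listing is displayed.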

How would this be extended to identifying groups that differ on at least one of two or more variables? One way would be to use egen. For example, egen, group() could be used to group values according to one or more variables, and then the same method could be used on the resulting variable.
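As a sketch of that extension, suppose there were a second variable recorded for each sample. The variable ephenotype, its values, and the new variable name combo are all hypothetical and are used only to illustrate the idea.

```stata
* Hypothetical data with two variables recorded per sample
clear
input byte eid str2 egenotype str1 ephenotype
0 "vv" "a"
0 "vv" "a"
1 "vv" "a"
1 "vv" "b"
2 "ww" "a"
2 "ww" "a"
end

* One integer code per distinct combination of the two variables
egen combo = group(egenotype ephenotype)

* Same first-versus-last comparison, applied to the combined code
by eid (combo), sort: gen diff = combo[1] != combo[_N]

* List individuals that differ on at least one of the two variables
list eid egenotype ephenotype if diff, sepby(eid)
```

Here only eid 1 is listed, because its two samples share a genotype but differ on the second variable. Note that egen's group() function assigns missing to observations with a missing value on any of the variables unless its missing option is specified, so data with missing values may need separate handling.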

© Copyright 1996–2008 StataCorp LP