[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
"Nick Cox" <n.j.cox@durham.ac.uk> |

To |
<statalist@hsphsun2.harvard.edu> |

Subject |
st: RE: Tables showing ordered lists [was: New question] |

Date |
Fri, 9 Aug 2002 14:16:29 +0100 |

Rodrigo Briceņo > I have another question that maybe is related with the same > egen function: > > I have a hospital discharges database. I need to obtain > the first 10 diagnoses for each service of the hospital. > How can I do that? My variables are: > > Servicio (cirugia, obstetricia, etc.) > Diagnostico This looks exactly like Rodrigo's previous question of 12 July. Two approaches were given in my posts of 14 and 15 July. Once more, I'll answer using the auto data so that this is of maximum benefit to others. There is a translation at the end to what I conjecture is an answer to his problem. Let's generalise: 1. I have a criterion: it might be just magnitude, or group frequency, or group mean, or group maximum, or anything else computable. 2. I want to see a table in which entries are ordered by that criterion. (Possibly, I want not the whole list of entries, but the medallists, or the top ten, or the top k.) 3. I may want to do that separately for groups defined by one or more variables. This does not seem a particularly exotic class of problems, but official Stata is short of solutions, to the best of my knowledge. Specific members of this class have been tackled by user-written programs, which is great whenever the right one exists for your problem, but useless otherwise. Another approach is to go back to first principles. 1. I need to have the criterion in memory. Either it already exists, or I will need to use -generate-, -egen-, .... 2. I need to feed that in the right order to a table command. * Ordering means -sort-. * The more low level that table command is, the more you can exercise control. Thus, I commend -tabdisp-. Don't be put off by the fact that it is billed, sometimes, as a programmer's command. 3. Sooner or later, I will want to do this separately for different groups. This means -by varlist:-, and a lot of users find that tricky. (One tutorial is in Stata Journal 2(1), 86-102 (2002).) Fortunately, once you know the basic idea, the extension from one-way classifications to two-way to multi-way is immediate. Example 1 ========= I want to see a table of the top ten -makes- on -mpg-. Clearly, highest -mpg- is "top". 1. The values of -mpg- already exist. 2. A trick: -sort- sorts lowest values first, so I need to negate -mpg- to get the right -sort- order. . gen nmpg = -mpg . sort nmpg . gen order = _n . tabdisp order if order <= 10, c(make mpg) ------------------------------------------ order | Make and Model Mileage (mpg) ----------+------------------------------- 1 | VW Diesel 41 2 | Datsun 210 35 3 | Subaru 35 4 | Plym. Champ 34 5 | Toyota Corolla 31 6 | Dodge Colt 30 7 | Mazda GLC 30 8 | Chev. Chevette 29 9 | Honda Civic 28 10 | Ford Fiesta 28 ------------------------------------------ Comments: * You won't need to bother with negation if you want lowest first. * If you know -gsort-, you could have done this in one fewer line. It turns out that this doesn't help us much for more complicated problems. * -tabdisp <rowvar>- shows rows ordered by the values of <rowvar>. If any value of <rowvar> occurs more than once, it uses cell values for the first observation in memory with that value. That's fine for us, as each value of -order- is unique. * The ability to use -if- gives you the functionality to select the top k. * We haven't done anything special about tied values. Example 2 ========= I want to see a table of the top ten -make-s on -mpg-. This time, I want separate listings for values of -foreign-. 1. The values of -mpg- already exist. 2. Once again, I negate -mpg- to get the right -sort- order. . gen nmpg = -mpg Now the most Stataish bit: . bysort foreign (nmpg) : gen order = _n Unpacking that, -sort- by foreign; and -sort- within -foreign- by -nmpg-; and separately for each group defined by -foreign, calculate -order-. Under -by:-, _n is interpreted within group. . by foreign: tabdisp order if order <= 10, c(make mpg) ______________________________________________________________________ _________ -> foreign = Domestic ------------------------------------------ order | Make and Model Mileage (mpg) ----------+------------------------------- 1 | Plym. Champ 34 2 | Dodge Colt 30 3 | Chev. Chevette 29 4 | Ford Fiesta 28 5 | Plym. Arrow 28 6 | Buick Opel 26 7 | Plym. Sapporo 26 8 | Plym. Horizon 25 9 | Olds Starfire 24 10 | Chev. Monza 24 ------------------------------------------ ______________________________________________________________________ _________ -> foreign = Foreign ------------------------------------------ order | Make and Model Mileage (mpg) ----------+------------------------------- 1 | VW Diesel 41 2 | Subaru 35 3 | Datsun 210 35 4 | Toyota Corolla 31 5 | Mazda GLC 30 6 | Honda Civic 28 7 | Renault Le Car 26 8 | VW Scirocco 25 9 | BMW 320i 25 10 | VW Rabbit 25 ------------------------------------------ Comments: * We do need, for this, a variable with negative mpg. . bysort foreign (- mpg) : gen order = _n would be neat if it were legal syntax, but it isn't. * Each value of -order- is no longer guaranteed unique, but -by:- keeps them apart. * We still haven't done anything special about tied values. Example 3 ========= I want to see a table of the top ten -make-s on -mpg-. This time, I want separate listings for classes defined by -foreign- and -rep78-. . gen nmpg = -mpg . bysort foreign rep78 (nmpg) : gen order = _n . by foreign rep78: tabdisp order if order <= 10, c(make mpg) (It does work.) The principle extends not only to this two-way problem, but from there to multi-way problems. Example 4 ========= I want to see a table of the most frequent manufacturers, defined as the first word of -make-. We have to get this ourselves, but that's just to set up an example: . egen manuf = ends(make), head 1. Frequency must be calculated. And we'll need to negate it again as we'll want highest frequency first. . bysort manuf : gen freq = -_N Under -by:-, _N is interpreted within group. Another way to do this is with -egen-: . bysort manuf: egen freq = sum(-1) 2. Now each observation with the same value of -manuf- has been assigned the same value of -freq-. Each also will need to be assigned the same value of -order-: . bysort freq manuf : gen order = _n == 1 . replace order = sum(order) Unpacking this again, we want the -manuf- with the highest frequency to be assigned 1, etc., so we need to -sort- on -freq-. But it is possible that different manufacturers could have the same frequency, so we would need to split any ties. Hence the first part, . bysort freq manuf: Now we tag the first value within each group by 1 (_n == 1 is true (numerically 1) whenever _n is 1), and get the cumulative sum. This gives us blocks of 1s, of 2s, etc. We have to reverse the negation . replace freq = - freq and then we are home and dry . tabdisp order, c(manuf freq) Comments: * Each value is no longer unique, but we are quite safe, as -manuf- and -freq- do not vary within -order-. * Ties! Example 5 (and last) ==================== I want to see a table of the most frequent manufacturers, defined as the first word of -make-, but separately by -foreign-. This is the analogue of Rodrigo's problem. 1. Frequency must be calculated. And we'll need to negate it again as we'll want highest frequency first. . bysort foreign manuf : gen freq = -_N Under -by:-, _N is interpreted within group. 2. Now each observation within groups of -foreign manuf- has been assigned the same value of -freq-. Each also will need to be assigned the same value of -order-: . bysort foreign freq manuf : gen order = _n == 1 . by foreign: replace order = sum(order) Unpacking this again, we want our highest level to be -foreign-, as we want separate tables for separate values of -foreign-. Within that level, we want the -manuf- with the highest frequency to be assigned 1, etc., so we need to -sort- on -freq-. But it is possible that different manufacturers could have the same frequency, so we would need to split any ties on -manuf-. Then we tag the first value within each group by 1 (_n == 1 is true (numerically 1) whenever _n is 1), and get the cumulative sum. This gives us blocks of 1s, of 2s, etc. We have to reverse the negation once more . replace freq = - freq and then we are home and dry . bysort foreign: tabdisp order, c(manuf freq) Comments: * See comments for Example 4. * Ties! * Extension to multi-way is immediate. * Lots of problems still remain, including handling weights and -if-/-in- restrictions. P.S. for Rodrigo. modulo some translation into Spanish (Frequencia???): . bysort Servicio Diagnostico : gen freq = _N . bysort Servicio freq Diagnostico : gen order = _n == 1 . by Servicio : replace order = sum(order) . replace freq = -freq . by Servicio : tabdisp order, c(Diagnostico freq) Nick n.j.cox@durham.ac.uk

`<<attachment: winmail.dat>>`

**References**:**st: New question***From:*"Rodrigo Briceņo" <rbriceno@sanigest.com>

- Prev by Date:
**st: drop exact name only** - Next by Date:
**st: RE: How SHOULD cut behave** - Previous by thread:
**st: New question** - Next by thread:
**st: update to update to cut** - Index(es):

© Copyright 1996–2016 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |