
st: -egenmore- updated on SSC


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: -egenmore- updated on SSC
Date   Thu, 11 Jul 2002 10:59:53 +0100

Thanks to Kit Baum, the -egenmore- package on SSC has been 
updated. This consists of (you've guessed it) more -egen- 
functions. Most require no more than Stata 6, but some 
require Stata 7, as is flagged in the package description 
and the collective help -egenmore-. (Other user-written 
-egen- functions can be located with -findit-.) 

To get a listing of function names, type 

. ssc desc egenmore 

To get more details, type 

. ssc type egenmore.hlp 

To install, use 

. ssc inst egenmore 

or 

. ssc inst egenmore, replace

as appropriate. 

If your Stata is not up-to-date enough to include 
either -findit- or -ssc-, please see the first URL under 
my signature for advice. 

The update consists of a single new function 
-egroup()-. Its nonce name -egroup()- is intended 
merely to flag a small _e_xtension to the official 
Stata egen function -group()-. The extension is 
that the -label- option may specify a list of 
variables to use in the value labels of the new 
variable. The use of this is best shown by an 
example. Suppose as a small variation on examples 
with the auto data, we strip off the first word 
of -make- 

. egen manuf = head(make) 

and ask for a simple table showing frequencies: 

. tab manuf

      manuf |      Freq.     Percent        Cum.
------------+-----------------------------------
        AMC |          3        4.05        4.05
       Audi |          2        2.70        6.76
        BMW |          1        1.35        8.11
      Buick |          7        9.46       17.57
       Cad. |          3        4.05       21.62
      Chev. |          6        8.11       29.73
     Datsun |          4        5.41       35.14
      Dodge |          4        5.41       40.54
       Fiat |          1        1.35       41.89
       Ford |          2        2.70       44.59
      Honda |          2        2.70       47.30
      Linc. |          3        4.05       51.35
      Mazda |          1        1.35       52.70
      Merc. |          6        8.11       60.81
       Olds |          7        9.46       70.27
    Peugeot |          1        1.35       71.62
      Plym. |          5        6.76       78.38
      Pont. |          6        8.11       86.49
    Renault |          1        1.35       87.84
     Subaru |          1        1.35       89.19
     Toyota |          3        4.05       93.24
         VW |          4        5.41       98.65
      Volvo |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00

This shows a familiar feature: with string variables
(and also with numeric variables with value labels 
-encode-d alphabetically), we get alphabetic (strictly, 
alphanumeric) order, which is great for look-up, but 
often lousy for identifying patterns or interesting 
features. A more useful table would be ordered on 
frequency, and highest first, or so I suggest. 

As it happens, there is a kludged solution to this 
particular problem with -tabulate-, a program 
called -tabsort-, but it is of more interest to identify 
a general approach to a solution, because the same 
irritation can arise with other tabular and graphical output. 

We can get most of the way there in two lines of 
official Stata. Calculate the frequencies ourselves, 

. bysort manuf  : gen freq = -_N 

(remembering to negate values to get the desired 
sort order), and use -egen, group() label- to 
get an equivalent categorical variable. 

. egen Manuf = group(freq manuf) , label 

. tab Manuf 

 group(freq |
     manuf) |      Freq.     Percent        Cum.
------------+-----------------------------------
   -7 Buick |          7        9.46        9.46
    -7 Olds |          7        9.46       18.92
   -6 Chev. |          6        8.11       27.03
   -6 Merc. |          6        8.11       35.14
   -6 Pont. |          6        8.11       43.24
   -5 Plym. |          5        6.76       50.00
  -4 Datsun |          4        5.41       55.41
   -4 Dodge |          4        5.41       60.81
      -4 VW |          4        5.41       66.22
     -3 AMC |          3        4.05       70.27
    -3 Cad. |          3        4.05       74.32
   -3 Linc. |          3        4.05       78.38
  -3 Toyota |          3        4.05       82.43
    -2 Audi |          2        2.70       85.14
    -2 Ford |          2        2.70       87.84
   -2 Honda |          2        2.70       90.54
     -1 BMW |          1        1.35       91.89
    -1 Fiat |          1        1.35       93.24
   -1 Mazda |          1        1.35       94.59
 -1 Peugeot |          1        1.35       95.95
 -1 Renault |          1        1.35       97.30
  -1 Subaru |          1        1.35       98.65
   -1 Volvo |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00
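
A quick check, not part of the recipe itself: -group()- numbers 
its groups in the sort order of the variables supplied, so the 
integer codes behind these value labels should start at 1 for the 
most frequent manufacturer. A minimal sketch, assuming Manuf 
exists as created above: 

```stata
* show the integer codes stored in Manuf rather than its value labels
tabulate Manuf, nolabel
```

It is those codes, not the label text, that determine the order 
of the rows in the table. 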

The nuisance remaining is that we have the 
negated frequencies cluttering up the value labels. 
(Ask for a value label, and -egen, group()- uses
all the variables mentioned.) Hence the need 
for a new option, which is the only thing added 
in -egroup()-: 

. egen Manuf2 = egroup(freq manuf) , label(manuf) 

. tab Manuf2  

group(manuf |
          ) |      Freq.     Percent        Cum.
------------+-----------------------------------
      Buick |          7        9.46        9.46
       Olds |          7        9.46       18.92
      Chev. |          6        8.11       27.03
      Merc. |          6        8.11       35.14

< rows omitted: it's OK > 

    Peugeot |          1        1.35       95.95
    Renault |          1        1.35       97.30
     Subaru |          1        1.35       98.65
      Volvo |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00

This approach can be extended to other requests, 
standard or bizarre. Suppose we want a table 
ordered on maximum mpg:  

. bysort manuf : egen maxmpg = min(-mpg) 

(the minimum of (-mpg) is minus the maximum of mpg, so sorting 
on it puts the highest maximum first) 

. egen Manuf3 = egroup(maxmpg manuf) , label(manuf) 

. tabstat mpg , by(Manuf3) s(max) 

Summary for variables: mpg
     by categories of: Manuf3 (group(manuf))

 Manuf3 |       max
--------+----------
     VW |        41
 Datsun |        35
 Subaru |        35
  Plym. |        34

< rows omitted: it's OK too > 

   Fiat |        21
  Volvo |        17
  Linc. |        14
Peugeot |        14
--------+----------
  Total |        41
-------------------

(Why not just type 

. egen Manuf3 = egroup(maxmpg), label(manuf) 

instead? Because we need to break ties on maxmpg: manufacturers 
sharing the same maximum would otherwise be lumped into a 
single group.) 
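
One way to see whether such ties are present is to count how many 
manufacturers share each value of maxmpg. This is only a sketch: it 
assumes manuf and maxmpg exist as created above and that your -egen- 
includes the -tag()- function; the variable name first is made up 
for the illustration. 

```stata
* flag exactly one observation per manufacturer
egen first = tag(manuf)

* one row per distinct maxmpg; a frequency above 1 means a tie
tabulate maxmpg if first
```

Any value of maxmpg with a frequency above 1 is shared by two or 
more manufacturers, which is why manuf is needed as a second 
argument to separate them. 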

A lot of detail explaining one little option, but it may
be useful. 

Nick 
n.j.cox@durham.ac.uk 
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


