st: -groups- available on SSC

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	st: -groups- available on SSC
Date	Mon, 28 Jul 2003 14:25:50 +0100
Thanks to Kit Baum, a utility -groups- for listing groups and
their frequencies is now downloadable from SSC.  Stata 8 is
required.

. ssc desc groups

will tell you less than what is said below.

. ssc inst groups

will install.

"Surely",  I can imagine you saying, "Stata is already well
supplied -- if not over-supplied -- with such commands". Yes
indeed; for a very good start, we have -tabulate-, -table-,
-tabstat-, not to mention special-purpose tabulation commands in
official Stata, nor even to continue with various user-written
commands.  But there are some twists in -groups- which prompt me
to draw it to your attention. Some of the twists reflect, directly
or indirectly, various questions posted on Statalist.

The main twist is this: everyone knows that even with two-way
tables there can be too many columns for comfort, and the problem
of space is usually compounded with three-way and higher tables.
Even if there is enough space, the sparsity (lots of zeroes) of
some tables makes other kinds of tabulation attractive in at least
some circumstances.

What is -groups-? It is a kind of cross-breed of -tabulate- and
-list-. Like -list-, it offers a way of seeing a lot of data --
in this case, a lot of results -- given the constraint of distinctly
limited width. As it happens, -groups- is just a wrapper for -list-
and can make use of (almost all) the new features introduced in
-list- in Stata 8.

After

. sysuse auto

the results of

. groups foreign

look very much like the results of -tabulate foreign-, and indeed
-groups- is designed to be that way:

  +-------------------------------------+
  |  foreign   Freq.   Percent     Cum. |
  |-------------------------------------|
  | Domestic      52     70.27    70.27 |
  |  Foreign      22     29.73   100.00 |
  +-------------------------------------+

A two-way table, on the other hand, is pulled down so that it is a
listing, a "long" structure rather than a "wide" one in -reshape-
jargon. (The same applies to three-way and higher tables.)

. groups foreign rep78

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+
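For comparison, a similar "long" listing can be sketched with official commands only. This is not how -groups- itself works (it wraps -list-, as said above); it is just a minimal illustration, assuming the official -contract- command with its -freq()-, -percent()- and -nomiss- options. The variable names freq and pct are arbitrary.

```stata
* Sketch only: official -contract- stands in for -groups- here.
sysuse auto, clear
preserve

* One observation per (foreign, rep78) combination present in the data.
* -nomiss- drops observations with missing rep78 before counting.
contract foreign rep78, freq(freq) percent(pct) nomiss

* List the contracted data, with a separator line between categories of foreign.
list foreign rep78 freq pct, sepby(foreign) noobs

restore
```

The -preserve- and -restore- pair matters here because -contract- replaces the data in memory, whereas -groups- leaves the dataset alone.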

A -fillin- option is available for Sartrean existentialists who
like to contemplate nothingness:

. groups foreign rep78, fillin

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       1       0      0.00 |
  |  Foreign       2       0      0.00 |
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

-groups- can be issued -by <varlist>:-, in which case percents are
calculated within each by-group rather than over the whole dataset.
At the same time, let me illustrate -order(h)-, which puts the
highest frequencies first, and -N-, which is an option of -list-:

. bysort foreign: groups rep78, ord(h) N

_________________________________________________
-> foreign = Domestic

  +----------------------------------+
  | rep78   Freq.   Percent     Cum. |
  |----------------------------------|
  |     3      27     56.25    56.25 |
  |     4       9     18.75    75.00 |
  |     2       8     16.67    91.67 |
  |     1       2      4.17    95.83 |
  |     5       2      4.17   100.00 |
  |----------------------------------|
  |     N       5         5        5 |
  +----------------------------------+

_________________________________________________
-> foreign = Foreign

  +----------------------------------+
  | rep78   Freq.   Percent     Cum. |
  |----------------------------------|
  |     4       9     42.86    42.86 |
  |     5       9     42.86    85.71 |
  |     3       3     14.29   100.00 |
  |----------------------------------|
  |     N       3         3        3 |
  +----------------------------------+
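The arithmetic behind those percents is worth spelling out: for Domestic cars the denominator is the 48 cars with a known rep78, so 27/48 gives 56.25. A rough sketch of the same calculation with official commands follows, again assuming -contract-; the variable names freq, grouptotal and pct are made up for the illustration.

```stata
* Sketch only: percents within categories of foreign, highest frequency first.
sysuse auto, clear
preserve

contract foreign rep78, freq(freq) nomiss

* The denominator is the total within each category of foreign,
* not the total over the whole dataset.
bysort foreign: egen grouptotal = total(freq)
generate pct = 100 * freq / grouptotal

* Highest frequencies first within each category; ties broken by rep78.
gsort foreign -freq rep78
list foreign rep78 freq pct, sepby(foreign) noobs

restore
```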

The statistics shown by default are

	frequencies           (one or more variables in varlist)
	percents              (ditto)
	cumulative percents   (one variable in varlist)

-- the surmise being that _cumulatives_ are rather more arbitrary
with two or more variables, being necessarily dependent on the
order of variables. That is only a default, however: a -show()-
option lets you choose none, one, two or all three of those, and
cumulative frequencies are also available on request:

. groups mpg, show(f F)

  +--------------------+
  | mpg   Freq.   Cum. |
  |--------------------|
  |  12       2      2 |
  |  14       6      8 |
  |  15       2     10 |
  |  16       4     14 |
  |  17       4     18 |
  |--------------------|
  |  18       9     27 |
  |  19       8     35 |
  |  20       3     38 |
  |  21       5     43 |
  |  22       5     48 |
  |--------------------|
  |  23       3     51 |
  |  24       4     55 |
  |  25       5     60 |
  |  26       3     63 |
  |  28       3     66 |
  |--------------------|
  |  29       1     67 |
  |  30       2     69 |
  |  31       1     70 |
  |  34       1     71 |
  |  35       2     73 |
  |--------------------|
  |  41       1     74 |
  +--------------------+

Here -f- stands for -freq-uency and -F- stands for _cumulative_
frequency (the capital F is supposed to be reminiscent of a common
notation). In addition, reverse cumulatives (# or % > value rather
than # or % <= value) are also available. There is also
a -show(none)-.
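To make the two kinds of cumulative concrete, here is a small sketch with official commands. It assumes -contract- with its -cfreq()- option; the names freq, cumfreq and revcum are invented for the example, and revcum follows the "# > value" definition just given.

```stata
* Sketch only: cumulative and reverse cumulative frequencies for mpg.
sysuse auto, clear
preserve

contract mpg, freq(freq) cfreq(cumfreq)
sort mpg

* cumfreq counts observations with mpg <= this value, so its last value
* is the total number of observations; subtracting gives # > this value.
generate revcum = cumfreq[_N] - cumfreq

list mpg freq cumfreq revcum, noobs

restore
```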

A further option -select()- lets you select which groups are to be
listed, for example by a condition on the -f-requencies.
-select(f == 1)- selects those groups that occur precisely once,
in which case there is no need to see a frequency column of 1s,
and the percents and cumulative percents are possibly of no use or
interest:

. groups mpg, sel(f == 1) show(none)

  +-----+
  | mpg |
  |-----|
  |  29 |
  |  31 |
  |  34 |
  |  41 |
  +-----+
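Selecting on frequency amounts to an -if- condition on the counts. A minimal sketch of the same idea with official commands, assuming -contract- and an arbitrary variable name freq:

```stata
* Sketch only: values of mpg that occur exactly once.
sysuse auto, clear
preserve

contract mpg, freq(freq)
list mpg if freq == 1, noobs

restore
```

This should pick out the same four values of mpg as the listing above.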

The -select()- option can be used in another way. -select(5)-
says: list just the first five of the groups which would otherwise
have been listed. By default, with just one variable specified,
that is just the five lowest groups of values of the variable.
Each group, naturally, could occur more than once:

. groups mpg, sel(5)

  +-------------------------------+
  | mpg   Freq.   Percent    Cum. |
  |-------------------------------|
  |  12       2      2.70    2.70 |
  |  14       6      8.11   10.81 |
  |  15       2      2.70   13.51 |
  |  16       4      5.41   18.92 |
  |  17       4      5.41   24.32 |
  +-------------------------------+

You can guess that -select(-5)- starts at the other end:

. groups mpg, sel(-5)

  +--------------------------------+
  | mpg   Freq.   Percent     Cum. |
  |--------------------------------|
  |  30       2      2.70    93.24 |
  |  31       1      1.35    94.59 |
  |  34       1      1.35    95.95 |
  |  35       2      2.70    98.65 |
  |  41       1      1.35   100.00 |
  +--------------------------------+

So these commands give you pictures of the tails of a
distribution. (For single variables, -extremes- on SSC is another
way to do it.)

You can -order(high)- or -order(low)-, that is, specify a listing in
order of the frequencies, not the values of the variables in each
group. With -order(high)-, -select(5)- gives you the 5 groups which
are most frequent.

. groups mpg, sel(5) ord(h)

  +-------------------------------+
  | mpg   Freq.   Percent    Cum. |
  |-------------------------------|
  |  18       9     12.16   12.16 |
  |  19       8     10.81   22.97 |
  |  14       6      8.11   31.08 |
  |  21       5      6.76   37.84 |
  |  22       5      6.76   44.59 |
  +-------------------------------+
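Note that the cumulative percents in that listing accumulate in the order shown, not in order of mpg. A sketch of the same steps with official commands, assuming -contract- with -percent()-; the names freq, pct and cumpct are arbitrary, and breaking ties by mpg is my assumption about what matches the listing above.

```stata
* Sketch only: the five most frequent values of mpg.
sysuse auto, clear
preserve

contract mpg, freq(freq) percent(pct)

* Highest frequencies first; ties broken by value of mpg.
gsort -freq mpg

* Running sum of percents in the sorted order.
generate cumpct = sum(pct)

list mpg freq pct cumpct in 1/5, noobs

restore
```

Sorting on mpg instead and listing -in 1/5- or -in -5/l- would give the two tails of the distribution.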

If you specify -fillin- with two or more variables, zeros are
shown explicitly. These are the cells that would be shown by 0s in
-tabulate- or by blanks in -table-. -select()-ing zeroes gives you
a listing of the cells _not_ present in your dataset.  That's not
often wanted, but when it is, it can be tricky to automate, unless
you know about -fillin-, the command after which the option is
named.

. groups foreign rep78, fill sel(f == 0) show(none)

  +-----------------+
  | foreign   rep78 |
  |-----------------|
  | Foreign       1 |
  | Foreign       2 |
  +-----------------+
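For anyone who wants to see the underlying trick, the official -fillin- command can be sketched in the same role. This assumes -contract- to get one observation per cell first; -fillin- then adds the cross-combinations that are absent and flags them in a new variable _fillin. The name freq is arbitrary.

```stata
* Sketch only: list the (foreign, rep78) cells not present in the data.
sysuse auto, clear
preserve

contract foreign rep78, freq(freq) nomiss

* -fillin- adds the absent combinations and marks them with _fillin == 1;
* their frequency is missing until set to zero.
fillin foreign rep78
replace freq = 0 if _fillin

list foreign rep78 if freq == 0, noobs

restore
```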

-groups- is just a hack sitting on the shoulders of the giant
-list-, so there are several ways to tweak appearances. Here is
one:

. groups foreign rep78, sepby(foreign)

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

We did get the same appearance earlier, but that was just
fortuitous, as the default of separating lines every 5 happened to
give a sensible answer.

Nick
[email protected]

P.S. I stole the English word -groups- to name this program. All
proper English words are supposedly reserved by statute for
official Stata Corp commands. However, there are a lot of programs
called -tab*-, and I have run out of inspiration in that
direction, short of something fairly obscure. In any case, Stata
Corp can steal that name back if and when they want it, in which
case my -groups- could always be renamed.

