
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: -groups- available on SSC
Date: Mon, 28 Jul 2003 14:25:50 +0100

Thanks to Kit Baum, a utility -groups- for listing groups and their
frequencies is now downloadable from SSC. Stata 8 is required.

. ssc desc groups

will tell you less than what is said below.

. ssc inst groups

will install.

"Surely", I can imagine you saying, "Stata is already well supplied --
if not over-supplied -- with such commands". Yes indeed; for a very good
start, we have -tabulate-, -table- and -tabstat-, not to mention
special-purpose tabulation commands in official Stata, nor the various
user-written commands.

But there are some twists in -groups- which prompt me to draw it to your
attention. Some of the twists reflect, directly or indirectly, various
questions posted on Statalist.

The main twist is this: everyone knows that even with two-way tables
there can be too many columns for comfort, and the problem of space is
usually compounded with three-way and higher tables. Even if there is
enough space, the sparsity (lots of zeros) of some tables makes other
kinds of tabulation attractive in at least some circumstances.

What is -groups-? It is a kind of cross-breed of -tabulate- and -list-.
Like -list-, it offers a way of seeing a lot of data -- in this case, a
lot of results -- given the constraint of distinctly limited width. As it
happens, -groups- is just a wrapper for -list- and can make use of
(almost all) the new features introduced in -list- in Stata 8.

After

. sysuse auto

the results of

. groups foreign

look very much like the results of -tabulate foreign-, and indeed
-groups- is designed to be that way:

  +-------------------------------------+
  |  foreign   Freq.   Percent     Cum. |
  |-------------------------------------|
  | Domestic      52     70.27    70.27 |
  |  Foreign      22     29.73   100.00 |
  +-------------------------------------+

A two-way table, on the other hand, is pulled down so that it is a
listing, a "long" structure rather than a "wide" one in -reshape-
jargon. (The same applies to three-way and higher tables.)

. groups foreign rep78

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

A -fillin- option is available for Sartrean existentialists who like to
contemplate nothingness:

. groups foreign rep78, fillin

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       1       0      0.00 |
  |  Foreign       2       0      0.00 |
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

-groups- can be issued -by <varlist>:-. That is the key to how percents
are calculated: under -by:- they are computed within each by-group, as
the output below shows. At the same time, let me illustrate -order(h)-,
which puts the highest frequencies first, and -N-, which is an option of
-list-:

. bysort foreign: groups rep78, ord(h) N

_________________________________________________
-> foreign = Domestic

  +----------------------------------+
  | rep78   Freq.   Percent     Cum. |
  |----------------------------------|
  |     3      27     56.25    56.25 |
  |     4       9     18.75    75.00 |
  |     2       8     16.67    91.67 |
  |     1       2      4.17    95.83 |
  |     5       2      4.17   100.00 |
  |----------------------------------|
  |     N       5         5        5 |
  +----------------------------------+

_________________________________________________
-> foreign = Foreign

  +----------------------------------+
  | rep78   Freq.   Percent     Cum. |
  |----------------------------------|
  |     4       9     42.86    42.86 |
  |     5       9     42.86    85.71 |
  |     3       3     14.29   100.00 |
  |----------------------------------|
  |     N       3         3        3 |
  +----------------------------------+

The results shown by default are

    frequencies          (one or more variables in varlist)
    percents             (ditto)
    cumulative percents  (one variable in varlist)

-- the surmise being that _cumulatives_ are rather more arbitrary with
two or more variables, being necessarily dependent on the order of
variables. That is not the law, however: a -show()- option allows you to
have none, one, two or three of those, and cumulative frequencies are
also available on request:

. groups mpg, show(f F)

  +--------------------+
  | mpg   Freq.   Cum. |
  |--------------------|
  |  12       2      2 |
  |  14       6      8 |
  |  15       2     10 |
  |  16       4     14 |
  |  17       4     18 |
  |--------------------|
  |  18       9     27 |
  |  19       8     35 |
  |  20       3     38 |
  |  21       5     43 |
  |  22       5     48 |
  |--------------------|
  |  23       3     51 |
  |  24       4     55 |
  |  25       5     60 |
  |  26       3     63 |
  |  28       3     66 |
  |--------------------|
  |  29       1     67 |
  |  30       2     69 |
  |  31       1     70 |
  |  34       1     71 |
  |  35       2     73 |
  |--------------------|
  |  41       1     74 |
  +--------------------+

Here -f- stands for -freq-uency and -F- stands for _cumulative_
frequency (the capital F is supposed to be reminiscent of a common
notation). In addition, reverse cumulatives (# or % > value rather than
# or % <= value) are available. There is also a -show(none)-.

A further option -select()- lets you select which groups are to be
listed, for example by a condition on the -f-requencies.
-select(f == 1)- selects those groups that occur precisely once, in
which case there is no need to see a frequency column of 1s, and the
percents and cumulative percents are possibly of no use or interest:

. groups mpg, sel(f == 1) show(none)

  +-----+
  | mpg |
  |-----|
  |  29 |
  |  31 |
  |  34 |
  |  41 |
  +-----+

The -select()- option can be used in another way. -select(5)- says: list
just the first five of the groups which would otherwise have been listed.
By default, with just one variable specified, that is just the five
lowest groups of values of the variable. Each group, naturally, could
occur more than once:

. groups mpg, sel(5)

  +-------------------------------+
  | mpg   Freq.   Percent    Cum. |
  |-------------------------------|
  |  12       2      2.70    2.70 |
  |  14       6      8.11   10.81 |
  |  15       2      2.70   13.51 |
  |  16       4      5.41   18.92 |
  |  17       4      5.41   24.32 |
  +-------------------------------+

You can guess that -select(-5)- starts at the other end:

. groups mpg, sel(-5)

  +--------------------------------+
  | mpg   Freq.   Percent     Cum. |
  |--------------------------------|
  |  30       2      2.70    93.24 |
  |  31       1      1.35    94.59 |
  |  34       1      1.35    95.95 |
  |  35       2      2.70    98.65 |
  |  41       1      1.35   100.00 |
  +--------------------------------+

So these commands give you pictures of the tails of a distribution. (For
single variables, -extremes- on SSC is another way to do it.)

You can specify -order(high)- or -order(low)- to get a listing in order
of the frequencies, not of the values of the variables in each group. In
the first case, -select(5)- gives you the 5 groups which are most
frequent.

. groups mpg, sel(5) ord(h)

  +-------------------------------+
  | mpg   Freq.   Percent    Cum. |
  |-------------------------------|
  |  18       9     12.16   12.16 |
  |  19       8     10.81   22.97 |
  |  14       6      8.11   31.08 |
  |  21       5      6.76   37.84 |
  |  22       5      6.76   44.59 |
  +-------------------------------+

If you specify -fillin- with two or more variables, zeros are shown
explicitly. These are the cells that would be shown by 0s in -tabulate-
or by blanks in -table-. -select()-ing zeros gives you a listing of the
cells _not_ present in your dataset. That's not often wanted, but when
it is, it can be tricky to automate, unless you know about -fillin-, the
command after which the option is named.

. groups foreign rep78, fill sel(f == 0) show(none)

  +-----------------+
  | foreign   rep78 |
  |-----------------|
  | Foreign       1 |
  | Foreign       2 |
  +-----------------+

-groups- is just a hack sitting on the shoulders of the giant -list-, so
there are several ways to tweak appearances. Here is one:

. groups foreign rep78, sepby(foreign)

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

We did get the same appearance earlier, but that was just fortuitous, as
the default of a separating line every 5 rows happened to give a
sensible answer.

Nick
n.j.cox@durham.ac.uk

P.S. I stole the English word -groups- to name this program. All proper
English words are supposedly reserved by statute for official Stata Corp
commands. However, there are a lot of programs called -tab*-, and I have
run out of inspiration in that direction, short of something fairly
obscure. In any case, Stata Corp can steal that name back if and when
they want it, in which case my -groups- could always be renamed.

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

