
Re: st: Tabling: an agenda

From	[email protected]
To	[email protected]
Subject	Re: st: Tabling: an agenda
Date	Wed, 08 Oct 2003 09:50:50 -0500

The venerable PROC TABULATE in SAS is a good model (at least in terms of
functionality).  It allows complete specification of the table layout
including lines and text, independent format specifications for each stat
in the table, and the ability to specify the exact denominator for each
percent, including various row and column subtotals.  But it's a bear to
learn.

--On Wednesday, October 08, 2003 12:33 PM +0100 Nick Cox
<[email protected]> wrote:

Phil Ryan mused generally in the light of a question
from Daniel Sabath:

As I think Nick Cox has pointed out recently, Stata's tabulation
facilities are somewhat scattered and it can be difficult to find
exactly what you want among the myriad of official and unofficial
commands.  My own opinion is that, usually, user-written add-ons are
a *very* good thing and add immeasurably to Stata's functionality.
But tabulation is such a basic and important tool that a more
unified system is needed. Many of us have written front-ends to
_tabdisp for particular functions that -table- does not support, but
(i) _tabdisp itself is limited and (ii) there is no unified
<command/subcommand/option> construct to allow a reasonable choice
of presentations of tabular material. (One has in mind the v8
graphics subsystem - complex, admittedly, but now allows a deal of
control over the end-product).  In Dan's example below, what we have
is essentially a collection of  Rx2 subtables appended, that is, we
have a sex X smoker table then an age group X smoker table and then
perhaps other subtables.  This is often the format given as "Table
1" of a published paper wherein the baseline characteristics of two
or more groups are displayed.  Stata can produce the subtables, but
(I think) not the end-product, because Stata's tables are all about
complete cross-classifications, whereas the display we want here has
cross-classifications within a subtable but not between subtables.

In summary I can imagine a tabulation subsystem in Stata that
supports a user-defined output - contents and layout - for
presentation.  Imagination is, of course, cheap.

Imagination is where ideas come from!

I agree, as would be expected, with the general diagnosis here.
I also agree that at least for certain tabulation tasks the
needs go beyond what amateurs can do with Stata's own language,
so that we need a major input from Stata Corp.

However, in the spirit of Phil's later comments, let's talk
specifics. Here is a first PARtial list of a miserable seven Problems,
what can be done with Available material and what seems
Required. Join in with your own additions (or subtractions).

Problem 1: awareness
====================

I think one of the major problems users face is just to be aware of
what is possible, given the multiplicity of commands.

Available solutions: At some point, there is no substitute
for reading the manual and playing with the existing
commands, e.g. so that you know the strengths and weaknesses
of -tabulate-, -table-, -tabstat-, -tabdisp- etc. (and
-list- etc.). Some articles in the Stata Journal aim
to provide comparative material.

Required solutions: More documentation of various kinds!
More FAQs please. Anyone who was willing to write a book
on Stata tabulation tasks and tricks would not make the conceptual
breakthrough which Deans and Chairs expect, but they would
be able to start financing their retirement home.

Problem 2: combining tables
===========================

As Phil has clearly highlighted, one common need is to put
together what are in effect sub-tables into combined tables.
It could be argued that Stata should not interfere between
you and your word and text processor; anyway, at first sight
it offers next to no tools for doing this.

Available solutions: ... except that, in a sense, there
is a bunch of commands for joining tables so long as they
are (expressible as) Stata matrices. This line of attack
is probably under-appreciated; at the same time, it
falls short of what I guess people often need here.
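
A minimal sketch of that matrix route, assuming the auto dataset
shipped with Stata and a made-up indicator variable heavy (neither
comes from this thread): save the cell frequencies of two subtables
with -tabulate, matcell()- and stack them with the row-join operator.

```stata
sysuse auto, clear
* hypothetical second row variable: cars over 3,000 lb
generate byte heavy = weight > 3000

* both subtables share the same column variable (foreign)
quietly tabulate rep78 foreign, matcell(A)
quietly tabulate heavy foreign, matcell(B)

* row-join the 5 x 2 and 2 x 2 frequency matrices
matrix C = A \ B
matrix rownames C = rep1 rep2 rep3 rep4 rep5 light heavy
matrix colnames C = Domestic Foreign
matrix list C
```

This yields counts only; percents, labels and layout still have to
be added by hand, which is where the approach falls short.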

Required solutions: a whole mini-language for combining
tables. In effect tables could be seen as objects
and there would be a set of operations for combining
them, with tunable control of output form: e.g.
join along rows; join along columns; layer. Each
combination would take care of alignment, and so be more
than anybody could do as a cut/copy/paste
exercise. I guess that this would be a substantial
project for Stata Corp. -graph combine- is a partial
analogue.

(But there's more, such as elementwise addition,
subtraction, multiplication, division of tables...)

Problem 3: multiple variables
=============================

Stata does not offer much support for tabulating
frequency / proportion / percent results from
several variables simultaneously. Suppose (e.g.) I have
variables on trips to theatre, cinema, opera house,
funfair, etc. and I want a single table for all
variables so I can compare frequency distributions.

Available solutions: Some user efforts. Much can
be done once you see that a different data structure
is often the key (-stack-, -reshape- etc.), but
most users understandably prefer getting results on the fly
to mapping to a different data structure. (Even seeing
that you need a different structure can depend on
a lot of experience. Doing the restructuring can be
tricky too.)
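
As a rough illustration of the restructuring idea, here is a sketch
with invented data: the variable names theatre, cinema and opera and
all the values are made up for the example. -stack- puts the three
variables into one, and the _stack variable it creates records which
original variable each value came from.

```stata
clear
* made-up data: trips per month to each venue
input theatre cinema opera
0 2 0
1 3 0
0 1 1
2 2 0
1 0 0
end

* stack the three variables into one; _stack identifies the source
stack theatre cinema opera, into(trips) clear
label define venue 1 "theatre" 2 "cinema" 3 "opera"
label values _stack venue

* one table comparing the frequency distributions, with column percents
tabulate trips _stack, column
```

Note that -stack, clear- replaces the data in memory, so save or
-preserve- first if you need the original layout back.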

Required solutions: Stata Corp to take this seriously!

Problem 4: sorting
==================

Sorting on the margins is often of limited analytical use.
To see patterns, rather than to provide easy look-up
(what is the population of Texas? Look under "Texas"...),
you often need to sort tables on their contents (i.e.
cell entries).

Available solutions: -tabulate, sort-. Some user
efforts. In general, this is not provided very
widely.
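
For the record, a short sketch using the auto dataset (my choice of
example, not from the thread). -tabulate, sort- handles the one-way
case; for anything else, one possible workaround is to reduce the
data to one observation per cell and sort that.

```stata
sysuse auto, clear
* one-way table, rows in descending order of frequency
tabulate rep78, sort

* other layouts: one observation per cell, then sort on the cell entry
preserve
contract rep78 foreign, nomiss freq(n)
gsort -n
list rep78 foreign n
restore
```

The second half gives a listing rather than a true two-way table,
which is part of why this counts as not widely provided.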

Required solutions: Stata Corp to take this seriously!

Problem 5: cell composites
==========================

What I call cell composites are cells containing
values from two or more variables, whether variables
in your dataset or temporary variables constructed by
the command running. In Daniel Sabath's
example which started this thread, he wanted cells
with concatenated strings

<cell freq> (<row percent>)

This is quite distinct cosmetically from what
might be called cell stacks

        <cell freq>
        <row percent>

In general, Stata directly supports cell stacks, but
not output like the first form. Cell stacks can
be more space-consuming and difficult to read in
some circumstances, although it is also easy to
run out of space with the first form.

Available solutions: Much is possible once
you see that setting tabulation up as a display
of string variables is the key. However, this
requires some prior manipulations and indeed
moderate fluency with some Stata basics. Canned
solutions, whether official commands or
user-written programs, appear lacking.
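
A minimal sketch of the string-variable route, assuming the auto
dataset and variable names (n, rowtotal, cell) invented for the
example: reduce the data to one observation per cell, build the
composite as a string, and hand it to -tabdisp-.

```stata
sysuse auto, clear
preserve

* one observation per rep78 x foreign cell, with its frequency in n
contract rep78 foreign, nomiss freq(n)

* row totals: running sum within rep78, then copy the last value to all
bysort rep78: generate rowtotal = sum(n)
by rep78: replace rowtotal = rowtotal[_N]

* composite cell: <cell freq> (<row percent>)
generate str12 cell = string(n) + " (" + string(100 * n / rowtotal, "%3.1f") + ")"

tabdisp rep78 foreign, cellvar(cell)
restore
```

Combinations that never occur are simply left blank here; adding
the zero option to -contract- is one way to show them as 0 (0.0).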

Required solutions: Support for output specifications,
i.e. if I want a table to show

<cell freq> (<row percent>)

something like

"#1 (#2)"

would specify "the first number followed
by a space followed by a parenthesis followed
by the second number followed by a parenthesis".
(Naturally there is a danger of reinventing e.g.
TeX's tabulation syntax.)

Problem 6: cell text
====================

Think of the number of ways in which you
might specify substantive missings as one
example. Depending on the boss's whims, the
house rules, the journal's prescribed
style, your own tastes, you could want

NA

or

--

or

(no data)

etc., etc. This is one example of how, even in a numeric
table, you often want extra text. Or think of cell entries
which are footnoted.

Available solutions: As with Problem 5,
much is possible once you see that setting tabulation
up as a display of string variables is the key. However, this
requires some prior manipulations and indeed
moderate fluency with some Stata basics. Canned
solutions, whether official commands or
user-written programs, appear lacking.
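
A sketch of the same trick for missing cells, again assuming the
auto dataset and an invented variable name (cell): compute the cell
statistic, fill in the combinations that do not occur, and show "NA"
wherever the result is missing.

```stata
sysuse auto, clear
drop if missing(rep78)

* mean mpg for each rep78 x foreign cell that occurs in the data
collapse (mean) mpg, by(rep78 foreign)

* add the combinations that do not occur; mpg is missing there
fillin rep78 foreign

* text for the cell: formatted mean, or "NA" where there is no data
generate str8 cell = cond(missing(mpg), "NA", string(mpg, "%4.1f"))

tabdisp rep78 foreign, cellvar(cell)
```

Swapping "NA" for "--" or "(no data)" is then a one-word change
(with a wider str type for the longer text).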

Required solutions: Stata Corp to take this seriously!

Problem 7: table design
=======================

In fact, we can easily extend this. This last problem
is really a rag-bag of all sorts of small and large
design issues, such as

	 support for different fonts and bold, italic, etc.
	 different kinds of divider and separator
	 control of titles, subtitles, notes, etc.
	 control of margin layout
	 multiple formats

A very simple example of the last is with -tabstat-.
If I go

. tabstat mpg, by(rep78) s(n mean sd)

Summary for variables: mpg
     by categories of: rep78 (Repair Record 1978)

   rep78 |         N      mean        sd
---------+------------------------------
       1 |         2        21  4.242641
       2 |         8    19.125  3.758324
       3 |        30  19.43333  4.141325
       4 |        18  21.66667   4.93487
       5 |        11  27.36364  8.732385
---------+------------------------------
   Total |        69  21.28986  5.866408
----------------------------------------

then it's clear that the number of decimal
places is silly for mean and sd. Specifying
one d.p. is easy

. tabstat mpg, by(rep78) s(n mean sd) format(%2.1f)

Summary for variables: mpg
     by categories of: rep78 (Repair Record 1978)

   rep78 |         N      mean        sd
---------+------------------------------
       1 |       2.0      21.0       4.2
       2 |       8.0      19.1       3.8
       3 |      30.0      19.4       4.1
       4 |      18.0      21.7       4.9
       5 |      11.0      27.4       8.7
---------+------------------------------
   Total |      69.0      21.3       5.9
----------------------------------------

but now the format of N is ill-chosen. And it is common to want
yet other formats for other cells:

. tabstat mpg, by(rep78) s(n mean sd skew kurt) format(%2.1f)

Summary for variables: mpg
     by categories of: rep78 (Repair Record 1978)

   rep78 |         N      mean        sd  skewness  kurtosis
---------+--------------------------------------------------
       1 |       2.0      21.0       4.2       0.0       1.0
       2 |       8.0      19.1       3.8       0.2       1.6
       3 |      30.0      19.4       4.1       0.4       3.1
       4 |      18.0      21.7       4.9      -0.1       2.0
       5 |      11.0      27.4       8.7      -0.0       1.6
---------+--------------------------------------------------
   Total |      69.0      21.3       5.9       1.0       4.0
------------------------------------------------------------

Here one might want 2 d.p. for skew and kurt, at least
cosmetically.

Available solutions: There is a territorial issue here,
as with Problem 2, on how far Stata should get into terrain
which normally you would negotiate with (or in some cases
without) the assistance of your word or text processing software.
A lot can be done with SMCL, but, whether for one-off or for
repetitive tasks, that often requires Stata programming or at
least considerable Stata expertise. Multiple formats are
fairly easy to implement; one example can be seen in -makematrix-
from SSC.
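
To show that multiple formats are within reach of user code, here
is a bare-bones sketch. It is not how -makematrix- works; it just
loops over the five values of rep78 in the auto dataset and gives
each statistic its own -display- format.

```stata
sysuse auto, clear

* header widths match the formats used below
display "   rep78 |    N    mean     sd   skew   kurt"

forvalues r = 1/5 {
    quietly summarize mpg if rep78 == `r', detail
    * integer N, 1 d.p. for mean and sd, 2 d.p. for skewness and kurtosis
    display %8.0f `r' " | " %4.0f r(N) %8.1f r(mean) %7.1f r(sd) %7.2f r(skewness) %7.2f r(kurtosis)
}
```

The hard-wired 1/5 is an assumption about rep78; a general program
would need to find the distinct values for itself.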

Required solutions: Mostly, the finger points at Stata Corp,
again. But user-programmers can do more here than is
sometimes appreciated.

Nick
[email protected]

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/




=========================================================================
Paul A. Jargowsky, Ph.D., Assoc. Prof. of Political Economy
Director, The Bruton Center, School of Social Sciences (GR 31)
University of Texas at Dallas, 2601 North Floyd Road, Richardson TX 75080
=========================================================================
email: [email protected] or [email protected]
Home page: http://www.utdallas.edu/~jargo
Voice: 972-883-2992; FAX: 972-883-2735
=========================================================================
