[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: RE: tabulate question

From	"Mostafa Baladi" <[email protected]>
To	<[email protected]>
Subject	st: RE: tabulate question
Date	Fri, 4 Jan 2008 12:52:17 -0600

Hi Jacob,
Another solution is to keep the last observation for each Name and drop
the rest as follows:  

by Name, sort: drop if _n != _N

Then you can summarize or count.


Mostafa

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of jwegelin
Sent: Friday, January 04, 2008 12:00 PM
To: [email protected]
Subject: st: tabulate question

There is probably an obvious answer to this.

Suppose you have a dataset in long form with repeated (say, yearly) 
measures on a large number of individuals. The data are *not* balanced: 
Some individuals have 10 or more observations (rows); some have only one

observation. You'd like to find out how many individuals are in your 
dataset.

In the current example, mytmp.dta contains NAME, the id variable 
(labeled numeric), as well as Sex01 (labeled numeric), Age, and Year.

Below are two solutions, neither particularly elegant. Either way, a 
couple steps are required to find that there are 3969 individuals in the

dataset. Both ways replace the long dataset with a dataset which is only

useful for counting the number of individuals. Is there a more elegant 
solution?

One way *not* to do it is

tabulate NAME

since this causes an inconveniently long table to scroll past which has 
a row for each NAME.

One solution is to pick any variable other than NAME (provided the 
variable has no missing values) and collapse as follows:

. use mytmp, clear

. summarize NAME

     Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         NAME |      8282    1995.673    1154.142          1       3969

. collapse (count) nYears=Sex01, by(NAME)

. summarize NAME

     Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         NAME |      3969        1985    1145.896          1       3969


Another way is to use reshape:

. use mytmp, clear

. keep NAME Age Year Sex01

. reshape wide Age, i(NAME) j(Year)
(note: j = 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007)

Data                               long   ->   wide
------------------------------------------------------------------------
-----
Number of obs.                     8282   ->    3969
Number of variables                   4   ->      24
j variable (22 values)             Year   ->   (dropped)
xij variables:
                                     Age   ->   Age1986 Age1987 ...
Age2007
------------------------------------------------------------------------
-----

. summarize NAME

     Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         NAME |      3969        1985    1145.896          1       3969


Thanks for any suggestions.

Jacob A. Wegelin
Assistant Professor
Department of Biostatistics
Virginia Commonwealth University
730 East Broad Street Room 3006
P. O. Box 980032
Richmond VA 23298-0032
U.S.A.
http://www.people.vcu.edu/~jwegelin

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: tabulate question
  - From: jwegelin <[email protected]>

Prev by Date: st: Re:groups of pairs of variables
Next by Date: Re: st: clinical trials
Previous by thread: st: tabulate question
Index(es):
- Date
- Thread