Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: output svymean


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: output svymean
Date   Wed, 23 Jul 2003 11:30:10 -0500

Rita Luk <Rita_Luk@camh.net> is a SAS user needing to use Stata.  She wants to
create a dataset of tabulations, such as 

    OBS  NAME1  VALUE1  NAME2  VALUE2  COUNT  PERCENT1  PERCENT2 
      1    sex  female      .       .     50      0.25         .
      2    sex  male        .       .    150      0.75         .
      3    sex  female    age     old     25      0.50      0.33 
      4    sex  female    age   young     25      0.50      0.20
      5    sex  male      age     old     50      0.33      0.67
      6    sex  male      age   young    100      0.67      0.80

about which she writes, 

> [...] this dataset would result from running two different tabulations. The
> first two observations [...] come from a single variable tabulation of sex.
> The subsequent 4 observations (3-6) [... result ...]  from a cross
> tabulation of the dichotomous variables age and sex.

She adds,

> In stata by using [...an inventive but convoluted procedure...] I can get
> the percents, and with a little data manipulation the counts and names, but
> I cannot get the values [...]  There must be a quicker way.

Problem is, Rita is still thinking SAS while using Stata.


Introduction, one-way tabulations
---------------------------------

To obtain a dataset containing the results of a one-way tabulation, here 
is one recipe:

        . sort <varname>
        . by <varname>: gen count = _N
        . by <varname>: keep if _n==_N
        . keep <varname> count

Let's try it with the auto data:

        . sysuse auto, clear
        (1978 Automobile Data)

        . sort rep78

        . by rep78: gen count=_N

        . by rep78: keep if _n==_N
        (68 observations deleted)

        . keep rep78 count

        . list

               rep78   count  
          1.       1       2  
          2.       2       8  
          3.       3      30  
          4.       4      18  
          5.       5      11  
          6.       .       5  

Note that we got the missing value.  If we did not want that, we could 
now type 

        . drop if missing(rep78)

We also do not have the percent.  That's easy enough to get:

        . gen sum = sum(count) 
        . gen percent = count/sum[_N]

Doing that with the dataset above (and leaving in the missings), we would 
get:

        . list

               rep78   count   sum    percent  
          1.       1       2     2    .027027  
          2.       2       8    10   .1081081  
          3.       3      30    40   .4054054  
          4.       4      18    58   .2432432  
          5.       5      11    69   .1486486  
          6.       .       5    74   .0675676  


If we did not want variable sum, we could drop it  by typing 

        . drop sum

So the entire solution is
       
        . sort <varname>
        . by <varname>: gen count = _N
        . by <varname>: keep if _n==_N
        . keep <varname> count
        . gen sum = sum(count) 
        . generate percent = count/sum[_N]
        . drop sum


Introduction, two-way tabulations
---------------------------------

Obtaining two-way tabulations is no more difficult that one-way tabulations:

        . sort <varname1> <varname2>
        . by <varname1> <varname2>: gen count = _N
        . by <varname1> <varname2>: keep if _n==_N
        . keep <varname1> <varname2> count

The solution is the same, we just -sort- and -by- on more variables.  Getting
the two percentages is a little more work then in the one-way case:

        . by <varname1>:  gen sum = sum(count)
        . by <varname1>:  gen percent1 = count/sum[_N]
        . drop sum

        . sort <varname2>
        . by <varname2>:  gen sum = sum(count)
        . by <varname2>:  gen percent2 = count/sum[_N]
        . drop sum

I follow the above procedure using the auto data, substituting rep78 for
<varname1> and foreign for <varname2>.  Before listing the data, I 
-sort rep78 foreign-:

        . list, clean

               rep78    foreign   count   percent1   percent2  
          1.       1   Domestic       2          1   .0384615  
          2.       2   Domestic       8          1   .1538462  
          3.       3   Domestic      27         .9   .5192308  
          4.       3    Foreign       3         .1   .1363636  
          5.       4   Domestic       9         .5   .1730769  
          6.       4    Foreign       9         .5   .4090909  
          7.       5   Domestic       2   .1818182   .0384615  
          8.       5    Foreign       9   .8181818   .4090909  
          9.       .   Domestic       4         .8   .0769231  
         10.       .    Foreign       1         .2   .0454545  


Combining results
-----------------

We now need to rename and append to get the results in the form Rita wants.
Let's pretend I saved the results of each of the above steps as result1.dta 
and result2.dta:

        . use result1
        . gen str name1 = "rep78"
        . rename rep78 value1
        . rename percent percent1
        . save result1, replace

        . use result2
        . gen str name1="rep78"
        . gen str name2="foreign"
        . rename rep78 value1
        . rename foreign value2
        . save result2, replace 

        . use result1
        . append using result2

Oh yes, and Rita wanted the variables in the order name1, value1, name2,
value2, count, percent1, percent2:

	. order name1 value1 name2 value2 count percent1 percent2

I did the above using my example and obtained:

    . list, clean 

           name1   value1     name2     value2   count   percent1   percent2  
      1.   rep78        1                    .       2    .027027          .  
      2.   rep78        2                    .       8   .1081081          .  
      3.   rep78        3                    .      30   .4054054          .  
      4.   rep78        4                    .      18   .2432432          .  
      5.   rep78        5                    .      11   .1486486          .  
      6.   rep78        .                    .       5   .0675676          .  
      7.   rep78        1   foreign   Domestic       2          1   .0384615  
      8.   rep78        2   foreign   Domestic       8          1   .1538462  
      9.   rep78        3   foreign   Domestic      27         .9   .5192308  
     10.   rep78        3   foreign    Foreign       3         .1   .1363636  
     11.   rep78        4   foreign   Domestic       9         .5   .1730769  
     12.   rep78        4   foreign    Foreign       9         .5   .4090909  
     13.   rep78        5   foreign   Domestic       2   .1818182   .0384615  
     14.   rep78        5   foreign    Foreign       9   .8181818   .4090909  
     15.   rep78        .   foreign   Domestic       4         .8   .0769231  
     16.   rep78        .   foreign    Foreign       1         .2   .0454545  

That was the desired result.


Automation
----------

Rita wrote, 

> I plan on doing A LOT of single tabulations and cross tabulations and want
> to produce an output dataset that might look like:

so now let's automate it:

        program forrita
                syntax varlist(min=2 max=2)
                local v1 : word 1 of `varlist'
                local v2 : word 2 of `varlist'
        
                preserve
        
                sort `v1'
                by `v1': gen count = _N
                by `v1': keep if _n==_N
                keep `v1' count
                gen sum = sum(count)
                gen percent1 = count/sum[_N]
                drop sum
                save result1, replace
        
                restore, preserve
                sort `v1' `v2'
                by `v1' `v2': gen count = _N
                by `v1' `v2': keep if _n==_N
                keep `v1' `v2' count
                by `v1': gen sum = sum(count)
                by `v1': gen percent1 = count/sum[_N]
                drop sum
                sort `v2'
                by `v2': gen sum = sum(count)
                by `v2': gen percent2 = count/sum[_N]
                drop sum
                sort `v1' `v2'
                save result2, replace
        
                use result1
                gen str name1 = "`v1'"
                rename `v1' value1
                save result1, replace
        
                use result2
                gen str name1 = "`v1'"
                gen str name2 = "`v2'"
                rename `v1' value1
                rename `v2' value2
                save result2, replace 
        
        	use result1
        	append using result2
        	order name1 value1 name2 value2 count percent1 percent1
        	restore, not

        	erase result1.dta
        	erase result2.dta
	end

With the above program, I can type 

        . sysuse auto, clear 
        . forrita rep78 foreign

and obtain the desired result.

-- Bill
wgould@stat.com
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index