



Re: st: creating a new variable


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: creating a new variable
Date   Wed, 18 Jul 2012 13:17:41 +0100

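That is not surprising. You are not asking exactly the same question.
The -egen- command ignores missings on -bw- when calculating each group
mean, but it still assigns that mean to every observation in the group,
including those with -bw- missing. With by(), observations with
-gestwk- missing form a group of their own: that is presumably the
extra row with mean 3309.638 and 9,763 observations, as 9,763 is
exactly 2991456 - 2981693. -tabstat- ignores the missings on -bw- out
of hand and, unless you specify its -missing- option, omits
observations with -gestwk- missing too. Evidently you have
2991456 - 2972666 = 18790 observations with missing values on -bw-
or -gestwk-.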

This is the sort of discrepancy that you can investigate yourself, if
only with a smaller dataset.

To ensure identical results, always exclude the missings, e.g. by
-drop-ping them first.

. sysuse auto

. tabstat rep78, by(foreign) s(n mean)

Summary for variables: rep78
     by categories of: foreign (Car type)

 foreign |         N      mean
---------+--------------------
Domestic |        48  3.020833
 Foreign |        21  4.285714
---------+--------------------
   Total |        69  3.405797
------------------------------

. egen mean_rep78 = mean(rep78), by(foreign)

. tab mean_rep78

 mean_rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
   3.020833 |         52       70.27       70.27
   4.285714 |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00
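
One way to make the two sets of counts agree, sketched here with the auto data, is to restrict -egen- with an -if- qualifier rather than dropping observations. The variable name mean_rep78_nm is made up for illustration, and the commented-out line for -bw- and -gestwk- is an untested guess at what the analogous command for your data would look like.

```stata
* Sketch only: mean_rep78_nm is an illustrative name
sysuse auto, clear

* How many observations are missing on rep78?
count if missing(rep78)

* Restrict -egen- to observations with rep78 nonmissing;
* all other observations get missing on the new variable
egen mean_rep78_nm = mean(rep78) if !missing(rep78), by(foreign)
tab mean_rep78_nm

* The frequencies above should now match the Ns reported here
tabstat rep78, by(foreign) s(n mean)

* Analogous line for the data in the question (untested assumption):
* egen mean_bw = mean(bw) if !missing(bw, gestwk), by(gestwk)
```

Since 74 - 69 = 5 observations are missing on -rep78-, -count- should report 5, and the -tab- frequencies should match the -tabstat- Ns of 48 and 21.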


On Wed, Jul 18, 2012 at 1:02 PM, Amal Khanolkar <Amal.Khanolkar@ki.se> wrote:
> Thank you Nick, Maarten & Steve for your suggestions.
>
> The tabstat command is the perfect way to get a descriptive take on what I wanted.
>
> I tried the following and found a discrepancy in the number of subjects:
>
> . egen mean_bw = mean(bw),  by(gestwk)
>
> . tab mean_bw
>
>     mean_bw |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>    559.5574 |        134        0.00        0.00
>    616.5096 |        387        0.01        0.02
>    699.3734 |        738        0.02        0.04
>    790.9377 |      1,235        0.04        0.08
>    902.7249 |      1,688        0.06        0.14
>    1014.961 |      2,125        0.07        0.21
>    1138.658 |      2,723        0.09        0.30
>    1295.815 |      3,415        0.11        0.42
>    1461.302 |      4,481        0.15        0.57
>    1655.637 |      5,876        0.20        0.76
>    1858.227 |      8,533        0.29        1.05
>    2092.705 |     12,958        0.43        1.48
>    2325.826 |     21,420        0.72        2.20
>    2592.584 |     36,710        1.23        3.42
>    2837.138 |     70,297        2.35        5.77
>    3081.272 |    151,310        5.06       10.83
>    3309.638 |      9,763        0.33       11.16
>    3313.268 |    373,660       12.49       23.65
>    3488.345 |    660,536       22.08       45.73
>    3627.659 |      1,648        0.06       45.78
>    3637.902 |    822,376       27.49       73.28
>    3698.833 |      5,470        0.18       73.46
>    3755.764 |    542,442       18.13       91.59
>    3791.726 |     31,928        1.07       92.66
>    3826.705 |    219,603        7.34      100.00
> ------------+-----------------------------------
>       Total |  2,991,456      100.00
>
>  . tabstat bw, by(gestwk) stat (mean n sd)
>
> Summary for variables: bw
>      by categories of: gestwk
>
>   gestwk |      mean         N        sd
> ---------+------------------------------
>       22 |  559.5574       122  209.6139
>       23 |  616.5096       365  134.5845
>       24 |  699.3734       691  135.2207
>       25 |  790.9377      1171   147.066
>       26 |  902.7248      1610  189.5523
>       27 |  1014.961      2024   201.809
>       28 |  1138.658      2613   238.724
>       29 |  1295.815      3316  278.1803
>       30 |  1461.302      4367  299.6202
>       31 |  1655.637      5732  345.8412
>       32 |  1858.227      8369  359.1699
>       33 |  2092.704     12771   402.861
>       34 |  2325.826     21149  416.8742
>       35 |  2592.584     36451  458.3818
>       36 |  2837.138     69940  464.2042
>       37 |  3081.272    150767  465.5551
>       38 |  3313.268    372601   453.221
>       39 |  3488.345    658969  445.2462
>       40 |  3637.902    820460  453.1178
>       41 |  3755.764    541160  467.3571
>       42 |  3826.705    219074  485.0738
>       43 |  3791.726     31859  507.7569
>       44 |  3698.833      5454  512.7899
>       45 |  3627.659      1631  531.2405
> ---------+------------------------------
>    Total |  3502.912   2972666  575.2709
> ----------------------------------------
>
>
> As one can see above, the N for each gestational week isn't the same in the two tables. I get the same problem when using:
>
> bys gestwk : egen mean1 = mean(bw)
>
> The Ns are almost the same for most values of gestwk, thus giving the same mean BW. But in some cases the Ns differ quite a bit, giving larger differences in mean BW.
>
>
> Thanks,
> /Amal
>
> ________________________________________
> From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [njcoxstata@gmail.com]
> Sent: 18 July 2012 13:40
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: creating a new variable
>
> Here are five solutions for a similar problem.
>
> . sysuse auto
>
> . tab rep78, su(mpg)
>
>      Repair |      Summary of Mileage (mpg)
> Record 1978 |        Mean   Std. Dev.       Freq.
> ------------+------------------------------------
>           1 |          21 4.2426407           2
>           2 |      19.125   3.7583241           8
>           3 |   19.433333   4.1413252          30
>           4 |   21.666667   4.9348699          18
>           5 |   27.363636   8.7323849          11
> ------------+------------------------------------
>       Total |   21.289855   5.8664085          69
>
> . tabstat mpg , by(rep78)
>
> Summary for variables: mpg
>      by categories of: rep78 (Repair Record 1978)
>
>    rep78 |      mean
> ---------+----------
>        1 |        21
>        2 |    19.125
>        3 |  19.43333
>        4 |  21.66667
>        5 |  27.36364
> ---------+----------
>    Total |  21.28986
> --------------------
>
> . graph dot (mean) mpg, over(rep78) vertical
>
> . egen mean_mpg = mean(mpg),  by(rep78)
>
> . scatter mean_mpg rep78
>
> . dotplot mpg, over(rep78) bar
>
>
> On Wed, Jul 18, 2012 at 11:34 AM, Amal Khanolkar <Amal.Khanolkar@ki.se> wrote:
>
>> I have a very simple problem that I'm unable to find a simple solution for:
>>
>> Below is the data concerned:
>>
>> Gestational age in weeks:
>>
>> . tab gestwk
>>
>>      gestwk |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>          22 |        134        0.00        0.00
>>          23 |        387        0.01        0.02
>>          24 |        738        0.02        0.04
>>          25 |      1,235        0.04        0.08
>>          26 |      1,688        0.06        0.14
>>          27 |      2,125        0.07        0.21
>>          28 |      2,723        0.09        0.30
>>          29 |      3,415        0.11        0.42
>>          30 |      4,481        0.15        0.57
>>          31 |      5,876        0.20        0.76
>>          32 |      8,533        0.29        1.05
>>          33 |     12,958        0.43        1.49
>>          34 |     21,420        0.72        2.20
>>          35 |     36,710        1.23        3.44
>>          36 |     70,297        2.36        5.79
>>          37 |    151,310        5.07       10.87
>>          38 |    373,660       12.53       23.40
>>          39 |    660,536       22.15       45.55
>>          40 |    822,376       27.58       73.13
>>          41 |    542,442       18.19       91.33
>>          42 |    219,603        7.37       98.69
>>          43 |     31,928        1.07       99.76
>>          44 |      5,470        0.18       99.94
>>          45 |      1,648        0.06      100.00
>> ------------+-----------------------------------
>>       Total |  2,981,693      100.00
>>
>>
>> Mean birth weight of my study sample:
>>
>> . sum bw
>>
>>     Variable |       Obs        Mean    Std. Dev.       Min        Max
>> -------------+--------------------------------------------------------
>>           bw |   2980093    3502.431    575.7603        300       6780
>>
>> . sum bw if gestwk==26
>>
>>     Variable |       Obs        Mean    Std. Dev.       Min        Max
>> -------------+--------------------------------------------------------
>>           bw |      1610    902.7248    189.5523        350       1970
>>
>>
>> Below is how I look at the mean birth weight for a particular gestational week:
>>
>> . sum bw if gestwk==27
>>
>>     Variable |       Obs        Mean    Std. Dev.       Min        Max
>> -------------+--------------------------------------------------------
>>           bw |      2024    1014.961     201.809        380       1920
>>
>> . sum bw if gestwk==28
>>
>>     Variable |       Obs        Mean    Std. Dev.       Min        Max
>> -------------+--------------------------------------------------------
>>           bw |      2613    1138.658     238.724        370       2000
>>
>> . sum bw if gestwk==29
>>
>>     Variable |       Obs        Mean    Std. Dev.       Min        Max
>> -------------+--------------------------------------------------------
>>           bw |      3316    1295.815    278.1803        370       2480
>>
>>
>> What I would like to do is create a single continuous variable that gives me the mean birth weight for each gestational week, so that I don't have to look at each week individually as above. Ideally, I would like to be able to use this variable in scatter plots.
>>
>> If I plot as follows:
>>
>> twoway scatter bw gestwk
>>
>> I of course don't get a single estimate for each gestational week, but instead the entire range of birth weight for a particular week is plotted.
>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

