Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: jpitblado@stata.com (Jeff Pitblado, StataCorp LP)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: correct confidence intervals of -mean- ?
Date: Mon, 08 Mar 2010 11:10:40 -0600

Dirk Enzmann <dirk.enzmann@uni-hamburg.de> is comparing the confidence intervals from -mean, over()- with those produced by hand using -collapse-:

> Very carefully I want to ask: Are the confidence intervals given by
> -mean- really correct?
>
> Below I compare the results of -mean- with the results of a different
> procedure:
>
> * --------------------------------------
> sysuse auto, clear
> mean price, over(rep78)
>
> local df = e(df_r)
> display "degrees of freedom = n-1 = `df'"
> * Note that here df is the number of the total sample - 1!
>
> * alternative route:
> collapse (mean) mprice=price (sd) sdprice=price (count) nprice=price if
>   (rep78 < .), by(rep78)
> * calculate the confidence intervals the -mean- way:
> gen ci95la = mprice - invttail(`df',.025)*sdprice/sqrt(nprice)
> gen ci95ua = mprice + invttail(`df',.025)*sdprice/sqrt(nprice)
> * calculate the confidence intervals a different way:
> gen ci95lb = mprice - invttail(nprice-1,.025)*sdprice/sqrt(nprice)
> gen ci95ub = mprice + invttail(nprice-1,.025)*sdprice/sqrt(nprice)
> * compare both sets of confidence intervals (..a vs ..b):
> list
> * --------------------------------------
>
> My question: Which procedure is correct?

Tirthankar Chakravarty <tirthankar.chakravarty@gmail.com> replied:

> I assume the behaviour you are looking for is given by:
> mean price if rep78==1
> etc.
>
> The reason for the CIs being the way they are in the output of -mean-
> is because the degrees of freedom takes into account the calculation
> of a covariance matrix with observations in other subpopulations
> (albeit with covariance restricted to zero). The manual makes this
> clear for the estimation of means, but not for the covariance
> estimates. Just a guess though.

-mean, over()- computes multiple subpopulation means and estimates their standard errors simultaneously, posting the means in -e(b)- and the variance-covariance matrix of the estimated means in -e(V)-.
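You can inspect the posted results directly to see where the shared degrees of freedom come from (a quick sketch, using the auto data as above):

. sysuse auto, clear
. quietly mean price, over(rep78)
. display e(df_r)
* 68: the full estimation sample (69 nonmissing on rep78) minus 1
. matrix list e(V)
* diagonal entries are the squared standard errors; off-diagonals are zero here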
The degrees of freedom here are those associated with -e(V)-, which was computed using the entire estimation sample. With the auto dataset, this number is 68 = 69 - 1, since there are 5 missing values in -rep78- (69 = 74 - 5).

Tirthankar points out that the covariance is restricted to zero. Actually, the covariances are zero by construction in the above example. In this situation, non-zero covariances are possible only with clustered data, provided the subpopulations are not nested within the clusters or vice versa. With clustered data, the degrees of freedom would become C-1, where C is the number of clusters.

Kit Baum <baum@bc.edu> also replied:

> These are the same standard errors of mean reported by
>
> tabstat price, by(rep78) stat(mean sd n semean)
>
> He wonders whether the DF used in calculating s.e.(mean) should be that of
> the full sample. I think that -mean- and -tabstat- are both using the
> notion that you have a model y = mu + \epsilon, where var(\epsilon) is a
> population parameter. Thus the variance of \epsilon is a constant for all
> subsamples, and when you calculate s.e.(mean), you use the sqrt of that
> common variance and divide by the sqrt(sample size) of the subpopulation.
> You can see that is being done by -tabstat- by comparing the sd, n and
> semean columns.
>
> What does surprise me is that the CIs generated by these methods differ so
> widely from those computed by
>
> reg price i.rep78
> margins rep78
>
> The differences are not just a small-sample/large-sample adjustment of the
> Root MSE. If you take apart the VCE of a regression of price on all five
> dummies, no constant term, you find a diagonal matrix containing the
> inverses of the respective sample sizes, so the difference has to lie in the
> computation of \hat{sigma2} which multiplies inv(X'X).

Each subpopulation has its own variance, so multiple -regress- fits are required.
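The clustered case can be illustrated the same way (a sketch only; -turn- is used here purely as a stand-in cluster variable, with no substantive meaning):

. sysuse auto, clear
. mean price, over(rep78) vce(cluster turn)
. display e(df_r)
* now C - 1, where C is the number of clusters, rather than N - 1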
The only way to get -regress- to reproduce the results from -mean, over()- is to use -suest-, which requires us to use -[pw=1]- in order to trick -mean- into producing the robust version of -e(V)-.

***** BEGIN:

. sysuse auto
(1978 Automobile Data)

. svyset _n

      pweight: <none>
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

. gen byte touse = !missing(rep78)

. forval i = 1/5 {
  2.         quietly regress price if rep78 == `i'
  3.         est store r`i'
  4. }

. suest r1 r2 r3 r4 r5

Simultaneous results for r1, r2, r3, r4, r5

                                                  Number of obs   =         69

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
r1_mean      |
       _cons |     4564.5   263.1901    17.34   0.000     4048.657    5080.343
-------------+----------------------------------------------------------------
r1_lnvar     |
       _cons |   12.51745   .3561436    35.15   0.000     11.81942    13.21548
-------------+----------------------------------------------------------------
r2_mean      |
       _cons |   5967.625   1192.433     5.00   0.000     3630.499    8304.751
-------------+----------------------------------------------------------------
r2_lnvar     |
       _cons |   16.36588   .6505966    25.16   0.000     15.09073    17.64102
-------------+----------------------------------------------------------------
r3_mean      |
       _cons |   6429.233   637.4178    10.09   0.000     5179.917    7678.549
-------------+----------------------------------------------------------------
r3_lnvar     |
       _cons |   16.33535    .284271    57.46   0.000     15.77819    16.89251
-------------+----------------------------------------------------------------
r4_mean      |
       _cons |     6071.5   394.4743    15.39   0.000     5298.345    6844.655
-------------+----------------------------------------------------------------
r4_lnvar     |
       _cons |   14.88804   .2665945    55.85   0.000     14.36552    15.41055
-------------+----------------------------------------------------------------
r5_mean      |
       _cons |       5913    757.488     7.81   0.000     4428.351    7397.649
-------------+----------------------------------------------------------------
r5_lnvar     |
       _cons |   15.73862   .4663947    33.75   0.000     14.82451    16.65274
------------------------------------------------------------------------------

. est store suest

. mean price [pw=1], over(rep78)

Mean estimation                     Number of obs    =      69

            1: rep78 = 1
            2: rep78 = 2
            3: rep78 = 3
            4: rep78 = 4
            5: rep78 = 5

--------------------------------------------------------------
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
price        |
           1 |     4564.5   263.1901      4039.312    5089.688
           2 |   5967.625   1192.433      3588.161    8347.089
           3 |   6429.233   637.4178      5157.286    7701.181
           4 |     6071.5   394.4743      5284.339    6858.661
           5 |       5913    757.488      4401.456    7424.544
--------------------------------------------------------------

***** END:

When -suest- combines the results from multiple model fits, it resets the scores for out-of-sample observations to zero, effectively treating the estimation sample from each model fit as its own subpopulation. This is exactly what is required in order to reproduce what -mean, over()- is doing.

--Jeff
jpitblado@stata.com

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
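Note that the point estimates and standard errors in the two tables above agree exactly; the confidence limits differ only because -suest- reports normal-based intervals while -mean- here uses t with 68 degrees of freedom. You can verify the arithmetic for the first subpopulation from the displayed numbers (a sketch, reusing the mean 4564.5 and standard error 263.1901 from the output):

. display 4564.5 - invttail(68,.025)*263.1901
* approximately 4039.31, the -mean- lower limit
. display 4564.5 - invnormal(.975)*263.1901
* approximately 4048.66, the -suest- lower limit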
