
# Re: st: correct confidence intervals of -mean- ?

From: jpitblado@stata.com (Jeff Pitblado, StataCorp LP)  
To: statalist@hsphsun2.harvard.edu  
Subject: Re: st: correct confidence intervals of -mean- ?  
Date: Mon, 08 Mar 2010 11:10:40 -0600

Dirk Enzmann <dirk.enzmann@uni-hamburg.de> is comparing the confidence
intervals from -mean, over()- with those produced by hand using -collapse-:

> Very carefully I want to ask: Are the confidence intervals given by
> -mean- really correct?
>
> Below I compare the results of -mean- with the results of a different
> procedure:
>
> * --------------------------------------
> sysuse auto, clear
> mean price, over(rep78)
>
> local df = e(df_r)
> display "degrees of freedom = n-1 = `df'"
> * Note that here df is the number of the total sample - 1!
>
> * alternative route:
> collapse (mean) mprice=price (sd) sdprice=price (count) nprice=price if
> (rep78 < .), by(rep78)
> * calculate the confidence intervals the -mean- way:
> gen ci95la = mprice - invttail(`df',.025)*sdprice/sqrt(nprice)
> gen ci95ua = mprice + invttail(`df',.025)*sdprice/sqrt(nprice)
> * calculate the confidence intervals a different way:
> gen ci95lb = mprice - invttail(nprice-1,.025)*sdprice/sqrt(nprice)
> gen ci95ub = mprice + invttail(nprice-1,.025)*sdprice/sqrt(nprice)
> * compare both sets of confidence intervals (..a vs ..b):
> list
> * --------------------------------------
>
> My question: Which procedure is correct?

Tirthankar Chakravarty <tirthankar.chakravarty@gmail.com> replied:

> I assume the behaviour you are looking for is given by:
> mean price if rep78==1
> etc.
>
> The reason for the CIs being the way they are in the output of -mean-
> is because the degrees of freedom takes into account the calculation
> of a covariance matrix with observations in other subpopulations
> (albeit with covariance restricted to zero). The manual makes this
> clear for the estimation of means, but not for the covariance
> estimates. Just a guess though.

-mean, over()- computes multiple subpopulation means and estimates their
standard errors simultaneously, posting the means in -e(b)- and the
variance-covariance matrix of the estimated means in -e(V)-.

The degrees of freedom here are those associated with -e(V)-, which was
computed using the entire estimation sample.  With the auto dataset, this
number is 68 = 69-1: there are 5 missing values in -rep78-, leaving
69 = 74-5 observations in the estimation sample.
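To see this directly, one can inspect -e(N)- and -e(df_r)- after -mean- (a
quick check on the same auto data):

```
. sysuse auto, clear
. quietly mean price, over(rep78)
. display e(N)
69
. display e(df_r)
68
```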

Tirthankar points out that the covariances are restricted to zero.  Actually,
the covariances are zero by construction in the above example.  Non-zero
covariances are possible in this situation only with clustered data, provided
the subpopulations are not nested within the clusters or vice versa.  With
clustered data, the degrees of freedom would become C-1, where C is the number
of clusters.
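As an illustration of the C-1 rule only (the clustering itself is hypothetical
and not part of the original example), one could cluster on -foreign-, which
has C = 2 distinct values:

```
. sysuse auto, clear
. mean price, over(rep78) vce(cluster foreign)
. display e(df_r)
```

With 2 clusters, -e(df_r)- should be C - 1 = 1.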

Kit Baum <baum@bc.edu> also replied:

> These are the same standard errors of mean reported by
>
>  tabstat price,by(rep78) stat(mean sd n semean)
>
> He wonders whether the DF used in calculating s.e.(mean) should be that of
> the full sample. I think that -mean- and -tabstat- are both using the
> notion that you have a model y = mu + \epsilon, where var(\epsilon) is a
> population parameter. Thus the variance of \epsilon is a constant for all
> subsamples, and when you calculate s.e. mean, you use the sqrt of that
> common variance and divide by the sqrt(sample size) of the subpopulation.
> You can see that is being done by -tabstat- by comparing the sd, n and
> semean columns.
>
> What does surprise me is that the CIs generated by these methods differ so
> widely from those computed by
>
> reg price i.rep78
> margins rep78
>
> The differences are not just a small-sample/large-sample adjustment of the
> Root MSE. If you take apart the VCE of a regression of price on all five
> dummies, no constant term, you find a diagonal matrix containing the
> inverses of the respective sample sizes, so the difference has to lie in the
> computation of \hat{\sigma}^2 which multiplies inv(X'X).

Each subpopulation has its own variance, so multiple -regress- fits are
required.

The only way to get -regress- to reproduce the results from -mean, over()- is
to use -suest-, which requires us to use -[pw=1]- in order to trick -mean-
into producing the robust version of -e(V)-.

***** BEGIN:
. sysuse auto
(1978 Automobile Data)

. svyset _n

pweight: <none>
VCE: linearized
Single unit: missing
Strata 1: <one>
SU 1: <observations>
FPC 1: <zero>

. gen byte touse = !missing(rep78)

. forval i = 1/5 {
2.         quietly regress price if rep78 == `i'
3.         est store r`i'
4. }

. suest r1 r2 r3 r4 r5

Simultaneous results for r1, r2, r3, r4, r5

Number of obs   =         69

------------------------------------------------------------------------------
|               Robust
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
r1_mean      |
_cons |     4564.5   263.1901    17.34   0.000     4048.657    5080.343
-------------+----------------------------------------------------------------
r1_lnvar     |
_cons |   12.51745   .3561436    35.15   0.000     11.81942    13.21548
-------------+----------------------------------------------------------------
r2_mean      |
_cons |   5967.625   1192.433     5.00   0.000     3630.499    8304.751
-------------+----------------------------------------------------------------
r2_lnvar     |
_cons |   16.36588   .6505966    25.16   0.000     15.09073    17.64102
-------------+----------------------------------------------------------------
r3_mean      |
_cons |   6429.233   637.4178    10.09   0.000     5179.917    7678.549
-------------+----------------------------------------------------------------
r3_lnvar     |
_cons |   16.33535    .284271    57.46   0.000     15.77819    16.89251
-------------+----------------------------------------------------------------
r4_mean      |
_cons |     6071.5   394.4743    15.39   0.000     5298.345    6844.655
-------------+----------------------------------------------------------------
r4_lnvar     |
_cons |   14.88804   .2665945    55.85   0.000     14.36552    15.41055
-------------+----------------------------------------------------------------
r5_mean      |
_cons |       5913    757.488     7.81   0.000     4428.351    7397.649
-------------+----------------------------------------------------------------
r5_lnvar     |
_cons |   15.73862   .4663947    33.75   0.000     14.82451    16.65274
------------------------------------------------------------------------------

. est store suest

. mean price [pw=1], over(rep78)

Mean estimation                     Number of obs    =      69

1: rep78 = 1
2: rep78 = 2
3: rep78 = 3
4: rep78 = 4
5: rep78 = 5

--------------------------------------------------------------
Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
price        |
1 |     4564.5   263.1901      4039.312    5089.688
2 |   5967.625   1192.433      3588.161    8347.089
3 |   6429.233   637.4178      5157.286    7701.181
4 |     6071.5   394.4743      5284.339    6858.661
5 |       5913    757.488      4401.456    7424.544
--------------------------------------------------------------
***** END:

When -suest- combines the results from multiple model fits, it resets the
scores for out-of-sample observations to zero, effectively treating the
estimation sample from each model fit as its own subpopulation.  This is
exactly what is required in order to reproduce what -mean, over()- is doing.
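Note that the robust standard errors match column for column; only the
interval widths differ, because -suest- reports z-based intervals while
-mean- uses t with 68 degrees of freedom.  A quick hand computation for the
first subpopulation, using the numbers shown in the output above:

```
. display 4564.5 - invnormal(.975)*263.1901
4048.657
. display 4564.5 - invttail(68,.025)*263.1901
4039.312
```

The first matches the -suest- lower bound, the second the -mean [pw=1]-
lower bound.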

--Jeff