The distribution of the standard errors will depend on both the
distribution of the error terms and the distribution of the
explanatory variables (the design measure, to wit). But as long as you
work with just the first two moments (means and variances), nothing
says that the errors must be Gaussian, or that the explanatory
variables must be uniform, for the estimates to be unbiased and for
s^2 (X'X)^{-1} to be an unbiased estimator of their variance. In your
simulation example, if you looked at each side of the two-sided
coverage separately (and used a sample size of 100), you would
probably see that rejections outside the nominal 90% CI are lopsided:
3% on one side and 12% on the other.
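
The one-sided rates can be checked with a minimal sketch that adapts
the quoted simulation to n = 100 and records misses on each side of
the nominal 90% CI separately. The program name sim100 and the
returned scalars lo and hi are illustrative names, not part of the
original code.

```stata
* Sketch only: one-sided miss rates of a nominal 90% CI with n = 100.
capture program drop sim100
program sim100, rclass
    drop _all
    set obs 100
    gen x = invnormal(uniform())
    * centered log-normal errors, as in the quoted simulation
    gen y = 1 + x + exp(invnormal(uniform())) - exp(.5)
    regress y x
    tempname t crit
    scalar `t' = (_b[x] - 1)/_se[x]
    * two-sided 90% CI uses the upper 5% point of the t distribution
    scalar `crit' = invttail(e(df_r), .05)
    * lo = 1 when the whole CI lies below the true slope of 1
    return scalar lo = (`t' < -`crit')
    * hi = 1 when the whole CI lies above the true slope of 1
    return scalar hi = (`t' > `crit')
end

simulate lo=r(lo) hi=r(hi), reps(10000) : sim100
summarize lo hi
```

The means of lo and hi reported by -summarize- are the two one-sided
miss rates; with Gaussian errors each would be close to 5%.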

The distribution of the residuals is closer to normality than that of
the errors. Each residual is a weighted sum of all the errors (through
the e = (I-H) errors formula), with unequal weights. For points of
low leverage, where no single weight dominates, some sort of CLT
argument shows that the residuals will be approximately normal. So to
see notable non-normality in the residuals, you need quite big
departures from normality in the errors, and/or points of high
leverage (those would most likely produce small residuals for the
high-leverage points themselves, but would also skew the distribution
of all the other residuals a little bit).
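
A rough way to see this averaging effect is to compare the skewness
of the true errors with that of the residuals in one simulated data
set. The sketch below assumes a deliberately small sample with many
regressors (n = 40 and 20 regressors, both arbitrary choices), so that
H is far from zero and each residual mixes many errors; the variable
names are illustrative.

```stata
* Sketch only: errors versus OLS residuals when H is far from zero.
clear
set obs 40
* 20 irrelevant-or-not regressors, created in order x1, ..., x20
forvalues j = 1/20 {
    gen x`j' = invnormal(uniform())
}
* centered log-normal errors, kept so they can be compared later
gen err = exp(invnormal(uniform())) - exp(.5)
gen y = 1 + x1 + err
regress y x1-x20
predict res, residuals
* compare the skewness and kurtosis reported for err and res
summarize err res, detail
```

With a large n and a single regressor, H is close to zero and the
residuals will look almost exactly like the errors, so the difference
only shows up when the model uses a sizable share of the observations.
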
On Tue, Aug 12, 2008 at 10:17 AM, Maarten buis <[email protected]> wrote:
> --- Gaulé Patrick <[email protected]> wrote:
>> >You should be careful however that
>> >the assumption behind -regress- is not that BMI is normally
>> >distributed, but that the residuals are normally distributed.
>>
>> My understanding is that the desirable properties of ordinary least
>> squares hold without the normality assumption. Moreover, the
>> assumption would be that the error term, not the residuals, is
>> normally distributed.
>
> -regress- will always give you the line/(hyper)plane that minimizes the
> sum of squared errors, regardless of the distribution of the error term.
> In that sense you are correct. I have always learned that the standard
> errors depend on the distribution of the error term. However, when I
> simulated this with a skewed error term (log-normal with mean zero),
> the p values seem ok: approximately uniformly distributed and
> approximately 500 rejections of the true null hypothesis out of 10,000
> draws. Regarding your second comment: The distribution of the residuals
> gives you an estimate of the distribution of the error term.
>
> -- Maarten
>
> *-------------------- begin simulation -------------------------
> capture program drop sim
> program sim, rclass
> drop _all
> set obs 1000
> gen x = invnormal(uniform())
> gen y = 1 + x + exp(invnormal(uniform())) - exp(.5)
> reg y x
> tempname t
> scalar `t' = (_b[x]-1)/_se[x]
> return scalar p = 2*ttail(`e(df_r)', abs(`t'))
> end
>
> simulate p=r(p), reps(10000) : sim
> hist p
> count if p < .05
> *----------------------- end simulation ------------------------
>
>
> -----------------------------------------
> Maarten L. Buis
> Department of Social Research Methodology
> Vrije Universiteit Amsterdam
> Boelelaan 1081
> 1081 HV Amsterdam
> The Netherlands
>
> visiting address:
> Buitenveldertselaan 3 (Metropolitan), room Z434
>
> +31 20 5986715
>
> http://home.fsw.vu.nl/m.buis/
> -----------------------------------------
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>
--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.