
Re: st: Constructing a variable from standard deviations


From   Stas Kolenikov <[email protected]>
To   [email protected]
Subject   Re: st: Constructing a variable from standard deviations
Date   Mon, 22 Nov 2010 09:11:56 -0600

On Mon, Nov 22, 2010 at 6:29 AM, Maarten buis <[email protected]> wrote:
> --- On Mon, 22/11/10, M.P.J. van Zaal wrote:
>> You state that the residual variance is assumed to be
>> constant. This is actually not the case. I have 106
>> different residual standard deviations. I achieved this by
>> using "predict "nameocc" if "dummyoccupation"==1, resid"
>> to predict the residuals. Now I have 106 different
>> residuals, and when I check with -tabstat- their standard
>> deviations are quite different (varying from 0.18-0.8).
>
> If you use -regress- then you assume that the residual
> variance is constant. The fact that you find differences
> in the residual variance across groups just means that
> you estimated a misspecified model. Normally I would be
> pretty relaxed about this heteroskedasticity, but not so
> in your case, because now this residual variance is a
> key parameter of substantive interest. If you estimate the
> model I proposed you solve that problem.

I disagree. Mathijs can run any regression he likes, can't he? It is
just a matter of doing the inference right, if he needs to. If he
needed to do inference with this regression, then of course without
the -robust- or -cluster(occupation)- option the results may be
meaningless. Maarten is right: the basic assumption of OLS is that
error variances are constant (and Mathijs cannot argue with that; he
can report the finding that in his actual data this assumption does
not hold, but this does not change the underlying assumption of the
model). But if all Mathijs needs out of this regression is a
reasonable line to take deviations from, then OLS is pretty much as
good as a line by any other sophisticated method.
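
If inference from this regression were needed after all, a minimal sketch of the two options mentioned above could look like the following. The names depvar, whatever and occupation are placeholders carried over from this thread, not real variables, and which of the two variance estimators is appropriate depends on the data.

```stata
* Placeholders: depvar (outcome), whatever (regressors), occupation (group id)

* Heteroskedasticity-robust standard errors
regress depvar whatever, vce(robust)

* Standard errors that also allow correlation within occupation
regress depvar whatever, vce(cluster occupation)
```

Either way the point estimates, and hence the residuals, are the same as from plain -regress-; only the standard errors change.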

Maarten's solution will give asymptotically efficient estimates in the
presence of heteroskedasticity, i.e., will be slightly more accurate
in large samples when heteroskedasticity is indeed present. I
personally don't believe you can gain much from modeling
heteroskedasticity unless the differences in variances are huge, like
a factor of 20 or so, although I cannot ground my belief in anything
outside the common statistical sense. In small samples, however,
excessive modeling of difficult-to-identify phenomena (like
heteroskedasticity here) usually leads to notable small sample biases,
so in the end the estimates from the solution that Maarten suggested
may not be of much greater accuracy unless the sample sizes are well
into the thousands (Mathijs did not give his original sample size for
us to make a judgement).

So I would stick with a simple solution:

regress depvar whatever
predict res, residuals
egen sd_by_occup = sd(res), by(occupation)
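
For anyone who wants to try these steps on data they can actually load, here is a sketch using the auto dataset that ships with Stata. The choice of price, weight and rep78 is purely illustrative and stands in for depvar, whatever and occupation. Note that -predict- returns fitted values unless the -residuals- option is given.

```stata
* Illustrative stand-ins: price ~ depvar, weight ~ whatever, rep78 ~ occupation
sysuse auto, clear

* One pooled regression for all groups
regress price weight

* Residuals (without the residuals option, predict would return xb)
predict double res, residuals

* Group-specific standard deviation of the residuals, stored on every row
egen double sd_by_rep = sd(res), by(rep78)

* Cross-check: the sd column here should match sd_by_rep within each group
tabstat res, by(rep78) statistics(sd n)
```

Groups with a single observation get a missing standard deviation, so it is worth looking at the n column before using the new variable.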

-- 
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
