Re: st: Variance estimation with clusters

From   Steven Joel Hirsch Samuels <>
Subject   Re: st: Variance estimation with clusters
Date   Thu, 8 Nov 2007 10:43:06 -0500



I would only add to Austin's good advice:

1. If you are doing regressions and hypothesis tests, do not use the fpc terms. Imagine you had studied 100% of the establishments and workers in a population; with the fpc's, all standard errors would be zero.

2. Stata's -xt- commands for panel data and multi-level models do not respond to -svyset-. For panel data analysis, the options that accommodate the survey design vary by command.

3. You should probably use the survey weights from year 1; but the study documentation may have other advice. Obviously these weights will not sum to the population size in either year 1 or year 2. If the survey deliberately over-sampled a class of workers which is the subject of your analysis (e.g. you wish to compare a minority to a majority group, and the survey over-sampled the minority group), you should probably ignore the survey weights altogether.
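
Point 1 can be checked on simulated data. The sketch below is only an illustration: the variable names (estab, Npop, pw) and the population count of 80 establishments are assumptions, not taken from the survey under discussion. It fits the same regression with and without a first-stage fpc. With a single-stage design the linearized standard errors are scaled by sqrt(1 - n/N), so they shrink toward zero as the sample approaches the whole population.

```stata
* Simulated sample: 40 establishments, 5 workers each (all values hypothetical)
clear
set seed 2007
set obs 40
generate estab = _n
* establishment-level error component shared by all workers in an establishment
generate u = rnormal()
expand 5
generate x = rnormal()
generate y = 1 + 0.5*x + u + rnormal()
* assumed number of establishments in the population, and matching weight
generate Npop = 80
generate pw = Npop/40

* Design without fpc
svyset estab [pweight=pw]
svy: regress y x
scalar se_nofpc = _se[x]

* Same design with a first-stage fpc
svyset estab [pweight=pw], fpc(Npop)
svy: regress y x
scalar se_fpc = _se[x]

* Ratio of the two standard errors; compare with sqrt(1 - 40/80)
display se_fpc/se_nofpc
display sqrt(1 - 40/80)
```

Setting Npop closer to 40 pushes the ratio toward zero, which is the situation described in point 1.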


On Nov 8, 2007, at 10:16 AM, Austin Nichols wrote:

Maury Gittleman <>:
Just clustering on establishment is probably sufficient.

You can also specify two levels of clustering with -svyset- e.g.

webuse stage5a
svyset su1 [pweight=pw], fpc(fpc1) || su2

where su1 is your establishment id, fpc1 the number of distinct
employees in both years, and su2 is a person id.

Usually the second level of clustering is largely irrelevant. But not always...

svyset su1 [pweight=pw], fpc(fpc1) strat(strat)
svy: reg yreg x?
est sto c1lev
svyset su1 [pw=pw], fpc(fpc1) str(str) || su2, fpc(fpc2)
svy: reg yreg x?
est sto c2lev
esttab *, mti
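
As a rough sketch of the "just cluster on establishment" advice for two years of data, the simulated example below builds workers nested in establishments, each observed twice, and compares standard errors clustered on establishment with those clustered on person. All names (estab, person, year, a, b) and sizes are made up for illustration. The last model shows a panel command taking the cluster from its own vce() option rather than from -svyset-.

```stata
* Simulated panel: 30 establishments, 6 workers each, 2 years per worker
clear
set seed 11082007
set obs 30
generate estab = _n
* establishment-level effect
generate a = rnormal()
expand 6
bysort estab: generate person = (estab - 1)*6 + _n
* person-level effect, shared by both years of the same worker
generate b = rnormal()
expand 2
bysort person: generate year = _n
generate x = rnormal()
generate y = 1 + 0.5*x + a + b + rnormal()

* Cluster on establishment: allows arbitrary correlation within an
* establishment, including the two observations on the same worker
regress y x, vce(cluster estab)
estimates store by_estab

* Cluster on person only: ignores correlation across co-workers
regress y x, vce(cluster person)
estimates store by_person

* Random-effects panel model with establishment-level clustering
xtset person year
xtreg y x, re vce(cluster estab)
estimates store re_estab

estimates table by_estab by_person re_estab, se
```

Because each person belongs to one establishment, the establishment clusters contain the person clusters, which is why the coarser level is usually the safer choice.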

On 11/8/07, Gittleman, Maury - BLS <> wrote:


I have a question concerning Stata's approach to estimating standard
errors in the presence of clustered survey data. The survey I'm using
collects information on individual wages, by first selecting
establishments at random, and then collecting information on multiple
workers within each establishment. So, it is clear that, when I'm
running regressions, I need to cluster on establishment.

My question arises when I use two years of data from the same survey.
For about 4/5 of the individuals, there will be data for two years, and
I would expect that the correlation between the errors for any given
individual will be higher than the correlation between the errors for
two different individuals at the same establishment. My thinking is
that I still want to define clusters by establishments, as the variance
estimation is said to be robust to any arbitrary intra-cluster
correlation.

Is this the right way to go or is there an alternative approach that
might be superior?

Thanks very much.

Steven  Samuels
18 Cantine's Island
Saugerties, NY 12477
Phone: 845-246-0774
EFax: 208-498-7441