[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Stas Kolenikov <skolenik@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: Re: st: Clustering of secondary units in sampling design |

Date |
Fri, 9 Oct 2009 09:36:25 -0500 |

On Fri, Oct 9, 2009 at 4:04 AM, <angelrlaso@gmail.com> wrote: > "total variance = V1[E2(statistic|stage1)] + E1[V2(statistic|stage1)]" > > I don't know what V1 V2 E1 E2 stand for. Oops, should've defined these better. E1/V1 are the expected value and variance with respect to the first stage sampling, and E2/V2 are the expected value and variance with respect to the second stage sampling, conditional on the PSUs selected in the first stage. What book on sampling do you have at hand? I will refer you to the specific formula. > "What you do in your > design work is to use the theoretical formula with the population > variances (known or accurately estimated from census or CPS data, for > instance). While they may not show up in the estimator of variance, > you will see the differences in different designs if you run > simulations." > > If one doesn't see the difference in the estimator of variance, where does > one see the difference when running simulations? > > I imagine the design with one census tract with 100 individuals will have > less variance than the one with 10 census tracts with 10 individuals each. > The first option should be corrected more strongly in the analysis because > it is farther from a simple random sampling. This is exactly what I would expect to see in the simulations. Simulations with finite populations are somewhat tricky. 1. Create a large finite population, roughly of the size and structure of the one you will be working with. Compute your parameters of interest (true-theta: means, ratios, correlations, regression parameters) -- those would be the targets of inference. 2. For r=1, ... R = #replications, repeat: 2a. Take a finite sample from your population according to your design(s). That would be some sort of -sample- command in Stata, although with several stages of selection, that's quite a bit more tricky. I had to use a sequence of -sample- and -merge- when producing my multiple stage samples. 2b. svyset your data according to the design (you may have to compute probabilities of selection; 1/Prob[selection] = pweight). 2c. Estimate your parameters of interest; for each parameter, store both its estimate theta-hat_r and its estimated standard error se-theta-hat_r. 3. Once you are done with your sampling, compute the summaries. 3a. Bias[ theta-hat ] = average theta-hat - true-theta Rationale: we know that even the means can be biased in designs with random number of observations. 3b. Var[ theta-hat ] = 1/R times sum over r ( theta-hat_r - average theta-hat )^2 3c. MSE[ theta-hat ] = 1/R times sum over r ( theta-hat_r - true-theta )^2 Rationale: we want to compare the designs in terms of efficiency. Those are Monte Carlo approximations to the true design variance or MSE. 3d. Bias of the variance estimator[ se-theta-hat ] = average (se-theta-hat_r^2) - Var[ theta-hat ] Rationale: we know that the variance estimators are almost always biased. Even the direction of bias is not always known. 3e. Stability of the variance estimator [se-theta-hat] = 1/R times sum over r ( se-theta-hat_r -sqrt{ MSE[ theta-hat ] } )^2 / MSE[ theta-hat ] Rationale: linearization and the jackknife estimators tend to be more stable than BRR or the bootstrap estimators. For the same amount of bias, less stable estimators usually produce confidence intervals that will lack coverage. 3f. Coverage = 1/R times sum over r of indicator { theta-hat_r + 1.96 se-theta-hat_r < true-theta} This is coverage of the upper limit; define the coverage at the lower limit and two-sided coverage accordingly. Indicator is the function that takes value of 1 if the event in { } is true and 0 otherwise; in Stata, you can simply write gen byte coverage_UL = ( theta_hat + 1.96*se_theta_hat > true_theta ) (provided the quantities on the RHS are defined, of course). Then sum coverage_UL will give you the percentage of misses by the upper limit. Of course 1.96 here stands for the 95% normal CI; if you prefer to use t-distribution, code something different. Rationale: in the end, we want to use the standard errors to decide if something is "significant" by conducting a formal test. We want to know whether what we think to be a 10% test (and 90% confidence interval) really produces 10% of the samples with type I error. In practice, one would often see coverages like 8% on the upper limit and 3% on the lower limit that roughly balance to the two-sided 10%. Those would be the (minimal set of) summaries I would keep track of in my simulations. Performance of your estimators (both the point estimates and standard errors) is a complicated function of both the design and the particular estimation procedure. For instance, if you are estimating a total of Y, you can have a direct estimator, the sum of [Y*sampling weight], or you can compute a ratio estimator if you have additional information and a reliable covariate (known total of X / sample total of X * sample total of Y). They will have different properties in terms of bias and variance, and different estimators may work better for some designs than for others. Likewise, to estimate variances, you can use linearization or the jackknife or the bootstrap, etc., depending on what your design allows you to do. Now, the differences in the designs that you asked about should be reflected in Var[ theta-hat ] and MSE[ theta-hat ]. As I said, you can have the same approximation used in the variance estimator for different designs, but it does not mean that the underlying variances being estimated are the same; they most likely aren't. I would too think that the design with SRS at SSU level would produce smaller variance than a design with an additional sampling stage. But what I think will happen is that the difference in those variances will be of the same order of magnitude as the amount of error you make by using the particular variance esimator: abs( MSE[ theta-hat; design 1] - MSE[ theta-hat; design 2] ) approximately equal or less than abs( Bias[ se-theta-hat ; design 1 ] ) approximately equal abs( Bias[ se-theta-hat; design 2] ) where I explicitly add dependence on the design to my simulation summaries. Hope this clarifies things a little bit. (Should I write a SJ article on this? :)) -- Stas Kolenikov, also found at http://stas.kolenikov.name Small print: I use this email account for mailing lists only. * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**Re: st: Clustering of secondary units in sampling design***From:*Stas Kolenikov <skolenik@gmail.com>

- Prev by Date:
**Re: st: AW: egen(mean or suchlike) for a string variable** - Next by Date:
**st: Issues of Normality for Multiple Imputation with ICE** - Previous by thread:
**Re: st: Clustering of secondary units in sampling design** - Next by thread:
**st: a person-period multi-episode event history file** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |