
From: Steven Samuels <sjhsamuels@earthlink.net>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Definition of strata and PSUs when svysetting
Date: Wed, 2 Apr 2008 17:29:06 -0400

Angel, for sampling with replacement, the probability of selection is zi = Mi/M0, where Mi is the measure of size for PSU i and M0 is the total of the Mi's over all PSUs. The hallmark of probabilities is that they add to 1 over the population, and this is true of the zi's. You need to multiply zi by K (the number of PSUs selected in a stratum) only in the formula for estimating totals. See W. G. Cochran, Sampling Techniques, 3rd ed., Wiley, 1977, p. 252. For estimating means, proportions, correlations, and regression coefficients, only relative weights are needed, so K is not needed.
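A small numerical sketch of this point, written in Python rather than Stata and using made-up numbers: the sizes, the sampled PSUs, the PSU totals, and the element counts are all hypothetical. It checks that the zi sum to 1, forms the with-replacement estimate of a total (the PPS-with-replacement estimator, often called the Hansen-Hurwitz estimator), and shows that a ratio-type mean comes out the same whether or not the weights include K.

```python
import math

# Hypothetical measures of size (e.g. dwelling counts) for the 5 PSUs of one stratum.
sizes = [200, 150, 300, 250, 100]
M0 = sum(sizes)
z = [m / M0 for m in sizes]          # single-draw selection probabilities zi = Mi/M0

# Hypothetical sample: K = 2 PSUs drawn with replacement (PSUs 0 and 2),
# with the outcome total and the number of elements in each.
sample = [0, 2]
K = len(sample)
y = {0: 90.0, 2: 120.0}              # PSU totals of the outcome (made up)
x = {0: 200.0, 2: 300.0}             # PSU element counts (made up)

# Estimated total: this is the only place where K is needed.
total_hat = sum(y[i] / z[i] for i in sample) / K

def ratio_mean(weights):
    """Weighted ratio estimate of the mean per element."""
    num = sum(weights[i] * y[i] for i in sample)
    den = sum(weights[i] * x[i] for i in sample)
    return num / den

w_relative = {i: 1 / z[i] for i in sample}       # relative weights, no K
w_full = {i: 1 / (K * z[i]) for i in sample}     # weights including K

print(sum(z), total_hat, ratio_mean(w_relative), ratio_mean(w_full))
```

Because K multiplies numerator and denominator of the ratio alike, it cancels in the mean; it survives only in the estimated total.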

-Steven

On Apr 2, 2008, at 4:25 AM, Angel Rodriguez Laso wrote:

Steven,

In the formula you give for the current sample weight for an interviewed person, shouldn't the number of PSUs chosen in the sample design be included in the denominator?

I say so because the selection probability of PSU i is:

(#PSUs in the sample design) x (#dwellings in PSUi) / (#dwellings in all PSUs)

Since the weight is the inverse of the selection probability, #PSUs would go to the denominator.

PS: The list of dwellings per census tract was very up to date, and only very minor changes in the actual measure of size were expected.

Ángel

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Steven Samuels
Sent: Tuesday, April 1, 2008 23:47
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Definition of strata and PSUs when svysetting

Angel,

As only one person was taken per household, you were quite right to exclude the dwelling stage in your -svyset- command.

I am not sure that your weights are correct. You state that your weighting computation simplifies because the number of dwelling units in a census tract cancels out in numerator and denominator. Yet the advance measure of size for a PSU rarely matches the actual measure of size (L. Kish, Survey Sampling, Wiley, 1965, p. 239).

Let Z be your advance count of the number of dwellings in all census tracts. If you anticipated 200 dwellings in a sampled census tract, you selected the tract with probability equal to 200/Z. Suppose that when you got to the census tract, you discovered that the actual number of dwellings was 210. Your target number of dwellings is 12. If you maintain the intended probability of 12/200 (so that the 200 cancels in the weight computation), the attained sample size will be random (n = 12 or 13) (Kish, p. 239). If you select exactly 12 dwellings, with probability 12/210, your current sampling weight for an interviewed person, Z x (# hh members)/12, should be multiplied by 210/200.
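The arithmetic of this example can be checked with exact fractions. The sketch below is in Python, not Stata; the values of Z and of the household size are hypothetical, while 200, 210, and 12 come from the example above.

```python
from fractions import Fraction

Z = 50_000                   # hypothetical advance count of dwellings in all tracts
advance, actual = 200, 210   # anticipated and actual dwellings in the tract
target = 12                  # target number of dwellings per tract
hh_members = 3               # hypothetical number of people in the dwelling

p_tract = Fraction(advance, Z)           # tract selected with probability 200/Z

# Option A: keep the intended rate 12/200, so the 200 cancels in the weight.
rate_a = Fraction(target, advance)
expected_n = rate_a * actual             # expected number of dwellings sampled
weight_a = (1 / p_tract) * (1 / rate_a) * hh_members   # equals Z x hh_members / 12

# Option B: select exactly 12 of the 210 dwellings actually found.
rate_b = Fraction(target, actual)
weight_b = (1 / p_tract) * (1 / rate_b) * hh_members

print(float(expected_n), weight_a, weight_b, weight_b / weight_a)
```

Under option A the expected sample size is 12.6 dwellings, which is why the attained n is 12 or 13; under option B the weight is the option A weight times 210/200.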

This assumes that you obtained interviews in all 12 selected dwellings. If you reached the quota of 7 younger and 3 older people after interviewing in n = 10 or 11 dwellings, I suggest that you change '12' in the weight computation to the value of n.

-Steven

On Apr 1, 2008, at 3:57 AM, Angel Rodriguez Laso wrote:

Steven,

1. Because only one person was interviewed in each dwelling, I don't see the need to include a third stage in the design (there is no clustering of individuals by dwelling, only by census tract).

2. I agree with dropping the age stratum.

3. I appreciate your advice on oversampling of the elderly. When listing and selecting younger and elderly people separately in each dwelling, I do see the need to include the dwelling variable, because then you can have two participants living in the same dwelling.

4. and 5. Census tracts were randomly selected with probabilities proportional to the number of dwellings in them:

(#PSUs x #dwellings in PSUi) / (#dwellings in all PSUs).

As the probability of selection of each dwelling is:

12 / (#dwellings in PSUi),

#dwellings in PSUi cancels out, and the product of these two components of the weight is constant for all individuals in the stratum and can be dropped.
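The cancellation described in points 4 and 5 can be illustrated with a short Python sketch. All the numbers are hypothetical (the number of tracts K, the dwelling counts, and the stratum total M0), and the result depends on the same dwelling count being used both to select the tract and to draw the 12 dwellings.

```python
from fractions import Fraction

K = 4                                    # hypothetical number of tracts selected in the stratum
tract_dwellings = [150, 200, 320, 410]   # hypothetical dwelling counts of the sampled tracts
M0 = 20_000                              # hypothetical total dwellings in the stratum
per_tract = 12                           # dwellings drawn in each selected tract

def overall_probability(m_i):
    """Tract probability (K x Mi / M0) times dwelling probability (12 / Mi)."""
    p_tract = Fraction(K * m_i, M0)
    p_dwelling = Fraction(per_tract, m_i)
    return p_tract * p_dwelling

probs = [overall_probability(m) for m in tract_dwellings]
print(probs)   # the same value, 12 x K / M0, for every tract
```

If the count found in the field differs from the count used for selection, the two Mi terms are no longer the same number and the product is no longer constant.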

The only weights used were then: a) #people in the dwelling; b) post-stratification weights to make age proportions match those of the census.

Many thanks for your help.

Ángel Rodríguez Laso
Institute of Public Health of the Region of Madrid

Institute of Public Health of the Region of Madrid

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Steven Samuels
Sent: Monday, March 31, 2008 19:30
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Definition of strata and PSUs when svysetting

Angel,

"Gender" in point 2 should have been "age"; it is fixed below. I apologize for the confusion.

-Steven

On Mar 31, 2008, at 9:32 AM, Steven Samuels wrote:

Angel, you had a three-stage design, not a two-stage one.

1. The proper -svyset- should include the stage of selecting dwellings:

    svyset censustract [pweight=???], strata(area) || dwelling || _n

For the proper pweight, see point 4 below.

2. You did not really stratify on AGE, so drop all reference to an AGE stratum.

3. Your design, selecting one person at random and hoping to get enough elderly people, is not one I recommend. There are standard approaches for oversampling sub-populations in household surveys. At the least, one can list older and younger people in each dwelling and select separately from each list.

4. The design makes it very difficult to calculate the sampling weights. You appear to be saying that you stopped interviewing when you had enough elderly and younger people (or when you ran out of dwellings). This is a version of 'sequential sampling' (Sharon Lohr, Sampling: Design and Analysis, Duxbury, p. 403).

Here are my best guesses at sample weights.

4a. person weight = 1/(prob sel tract) x (no. dwellings in tract)/(no. of dwellings where you obtained interviews) x (no. of people in the person's dwelling)

4b. If you listed the ages of all people in the 12 selected dwellings, not just those where you obtained interviews, you can do more:

weight for younger person = 1/(prob sel tract) x (no. dwellings in tract)/12 x (no. younger people in the 12 sampled dwellings)/(no. of younger people interviewed)

weight for older person = 1/(prob sel tract) x (no. dwellings in tract)/12 x (no. older people in the 12 sampled dwellings)/(no. of older people interviewed)

4c. If you have ages of all people in the sampled dwellings, substitute 'no. of dwellings where you obtained interviews' for '12 sampled dwellings' in the formulas in 4b. These weights may slightly over-estimate the proportion of elderly people.

5. If there are census figures available for your target population, apply a post-stratification weighting to make the ratio of 'elderly' and 'younger' people match that in the census. See Lohr, Chapter 8.
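The weight formulas in 4a and 4b and the post-stratification step in point 5 can be sketched as follows. This is Python, not Stata, and only an illustration: the function names are invented, every number is hypothetical, and the post-stratification is applied to one tract's ten respondents purely to keep the example small (in practice it would be applied over the whole sample).

```python
from fractions import Fraction

def weight_4a(p_tract, tract_dwellings, interviewed_dwellings, people_in_dwelling):
    """4a: 1/p(tract) x dwellings in tract / dwellings interviewed x people in dwelling."""
    return (1 / p_tract) * Fraction(tract_dwellings, interviewed_dwellings) * people_in_dwelling

def weight_4b(p_tract, tract_dwellings, listed, interviewed, sampled_dwellings=12):
    """4b: 1/p(tract) x dwellings in tract / 12 x listed in age group / interviewed in age group."""
    return (1 / p_tract) * Fraction(tract_dwellings, sampled_dwellings) * Fraction(listed, interviewed)

def poststratify(weights, groups, census_share):
    """Rescale weights within each group so weighted shares match the census shares."""
    total = sum(weights)
    group_total = {g: sum(w for w, k in zip(weights, groups) if k == g) for g in census_share}
    factor = {g: census_share[g] * total / group_total[g] for g in census_share}
    return [w * factor[k] for w, k in zip(weights, groups)]

p_tract = Fraction(200, 50_000)          # hypothetical tract selection probability
w_a = weight_4a(p_tract, 210, 10, 3)     # 10 dwellings interviewed, 3 people in the dwelling
w_young = weight_4b(p_tract, 210, listed=20, interviewed=7)
w_old = weight_4b(p_tract, 210, listed=6, interviewed=3)

groups = ["younger"] * 7 + ["older"] * 3
weights = [w_young] * 7 + [w_old] * 3
census_share = {"younger": Fraction(17, 20), "older": Fraction(3, 20)}   # hypothetical census shares
adjusted = poststratify(weights, groups, census_share)
older_share = sum(w for w, k in zip(adjusted, groups) if k == "older") / sum(adjusted)

print(w_a, w_young, w_old, older_share)
```

The post-stratified weights keep the same total as the unadjusted ones; only the split between age groups changes.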

-Steven

On Mar 31, 2008, at 6:27 AM, Angel Rodriguez Laso wrote:

Thank you, Steven, for your interest.

In answer to your questions: I didn't go into more detail on the sampling procedure because I didn't think it was needed for the definition of strata and PSUs. There was intermediate sampling of dwellings. There was a list of all dwellings in census tracts, and from this list 12 dwellings in each selected census tract were chosen at random. From each dwelling one person was taken at random (and his/her weight calculated from the number of people living in the dwelling). People were interviewed until a sample of 7 below 65 and 3 over 65 was obtained in each census tract. The reason why 12 dwellings were selected initially is that it was expected that taking only 10 would not yield the final 7/3 proportion desired. Nevertheless, 7 and 3 individuals could not be selected in all census tracts, and that is the reason (more than the existence of missing items) why there are census tracts with only one individual over 65.

I'm trying to check whether following your advice (merging strata in census tracts with a single PSU per stratum) or just dropping the second-stage specification would give very different results, but when I run a svy: prop under the first specification:

    svyset censustract [pweight=pondef], strata(area) fpc(#censustractsinarea) || identificationvariable, strata(agegroupscorrected)

I get the message: 'Missing standard error due to stratum with single sampling unit; see help svydes.', but when I run

    svydes variable, single stage(2)

no single PSUs are displayed. Do you know why?

Ángel Rodríguez Laso
Institute of Public Health of the Region of Madrid

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Steven Samuels
Sent: Friday, March 28, 2008 22:25
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Definition of strata and PSUs when svysetting

Angel,

I'm sorry that I missed your initial post; I was on vacation and had canceled my Statalist subscription. I agree with Stas's suggestion for the first specification.

I have some questions.

1. Your description implies that you created a list of ALL people in each selected tract, stratified by age, and then selected by simple random sampling: 7 from the below-65 list and 3 from the over-65 list. Is that a correct description? Or was there intermediate sampling of dwellings?

2. Your PSUs are census tracts, not people. ("Primary" refers only to the first stage.) You are saying that in some of the census tracts, you had only one person either under or over 65. Is that correct?

For those tracts, I suggest that you go with option 1 but ignore the stratification, while keeping the sampling probabilities. That is, create a single stratum for those tracts by recoding.

You may still analyze your outcomes by age. The analysis age groups need not match the stratum age groups.

-Steven

On Mar 28, 2008, at 10:40 AM, Angel Rodriguez Laso wrote:

Thank you for your answer, Stas.

I've tried both specifications, and the first surprise was that Stata 9 ignores further stages when stage 1 is sampled with replacement. It was good to come across this warning, because in our survey sampling was without replacement and the sampling fraction of the census tracts was quite high (more than one third in some strata), which precludes assuming that selection was with replacement.

The problem with using age groups as second-stage strata is that, since only 3 people over 65 were selected per census tract, whenever there are missing values in the variables some strata become single-PSU (person) strata, which prevents Stata from calculating standard errors. So the two specifications I've tried are:

    svyset censustract [pweight=pondef], strata(area) fpc(#censustractsinarea)

    svyset censustract [pweight=pondef], strata(area-by-age) fpc(#censustractsinarea)

Not surprisingly, standard errors under the two specifications differ only in the hundredths. I believe this is mainly due to the fact that in both cases degrees of freedom are very large. This is something I want to check with you: from my reading of Korn and Graubard, "Analysis of Health Surveys", I've understood that in complex surveys degrees of freedom are calculated as #PSUs - #strata (624 for the first specification and 1244 for the second, because Stata doubles the number of census tracts, since each of them belongs to two different strata). I do not follow you very well when you recommend doing a small simulation with census or simulated data to ascertain degrees of freedom, or when you state that Taylor series expansion standard errors might be badly off with small samples. It's usual practice to work with such low numbers of individuals per PSU (10 in my case), and I've never heard that a small sample size was a problem then.
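The #PSUs - #strata rule can be run backwards on the figures quoted above. The short Python sketch below takes the reported degrees of freedom and the strata counts given in this thread (7 areas, 14 area-by-age groups) at face value; the PSU counts it prints are implied by that arithmetic and are not stated anywhere in the thread.

```python
def design_df(n_psus, n_strata):
    """Design degrees of freedom under the #PSUs - #strata rule."""
    return n_psus - n_strata

strata_first, strata_second = 7, 14   # areas; area-by-age groups
df_first, df_second = 624, 1244       # degrees of freedom reported above

# PSU counts implied by the rule (an inference, not a reported figure).
psus_first = df_first + strata_first
psus_second = df_second + strata_second

print(psus_first, psus_second, 2 * psus_first - psus_second)
```

On this reading the second specification counts 1258 PSUs, slightly fewer than twice the 631 tracts, which would be consistent with a few tracts contributing respondents to only one age group.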

Unfortunately, I don't have enough knowledge to go for option 3.

To conclude, although both specifications yield similar results, I agree with you that the second one implies linked selection of PSUs, while the first one is conceptually sounder.

Ángel Rodríguez Laso
Institute of Public Health of the Region of Madrid

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Stas Kolenikov
Sent: Thursday, March 27, 2008 20:06
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Definition of strata and PSUs when svysetting

I would say your first specification makes better sense, even though the design it produces is quite weird, and the degrees of freedom in that design are strange (and 7 initial strata won't get you very far, anyway). In Stata 10, that's doable with

    svyset tract, strata(area) || person, strata(age_group)

if I am getting your design right.

In the second specification, with region-by-age strata, you have some sort of coupled sampling, in which selecting a PSU in one stratum implies selecting a certain PSU in another stratum linked by geography. You could still analyze that, but you would need accurate pairwise probabilities of selection to compute the Horvitz-Thompson estimator and the Yates-Grundy-Sen estimator of its variance (which I don't think is implemented anywhere commercially, as those higher-order probabilities of selection are rarely known; Jeff P, that might produce a cutting-edge addition to Stata's set of -svy- tools, although I've no idea how to input and parse those :)). Any reasonably high-level book would have it (Kish's, Cochran's, and Mary Thompson's books spring to mind). For special cases, I think that can be programmed in Mata. Let's call that option 3. Note that the naive implementation as

    svyset tract, strata(area X age) || person

produces wrong probabilities of selection, and the variances are likely to be understated, as there is more variability in this specification than in your actual design.
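To make "option 3" concrete, here is a minimal Python sketch (not Mata, and not an existing Stata feature) of the Horvitz-Thompson total and the Yates-Grundy-Sen variance estimate for a fixed-size design, given first-order and pairwise inclusion probabilities. The function names and the data are made up; the check uses simple random sampling without replacement, where the inclusion probabilities are known and the result can be compared with the textbook variance formula.

```python
from fractions import Fraction
from itertools import combinations

def horvitz_thompson_total(y, pi):
    """Sum of sampled values divided by their inclusion probabilities."""
    return sum(y_i / p_i for y_i, p_i in zip(y, pi))

def ygs_variance(y, pi, pi_pair):
    """Yates-Grundy-Sen variance estimate for a fixed-size design.

    pi_pair[(i, j)], with i < j, is the joint inclusion probability
    of sampled units i and j.
    """
    v = 0
    for i, j in combinations(range(len(y)), 2):
        p_ij = pi_pair[(i, j)]
        v += (pi[i] * pi[j] - p_ij) / p_ij * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return v

# Check case: simple random sample of n = 2 from N = 4 (hypothetical data).
N, n = 4, 2
y = [Fraction(3), Fraction(5)]
pi = [Fraction(n, N)] * n                                # each unit: n/N
pi_pair = {(0, 1): Fraction(n * (n - 1), N * (N - 1))}   # each pair: n(n-1)/(N(N-1))

total_hat = horvitz_thompson_total(y, pi)
var_hat = ygs_variance(y, pi, pi_pair)

# Textbook SRS variance of the estimated total: N^2 (1 - n/N) s^2 / n.
mean = sum(y) / n
s2 = sum((v - mean) ** 2 for v in y) / (n - 1)
srs_var = N ** 2 * (1 - Fraction(n, N)) * s2 / n

print(total_hat, var_hat, srs_var)
```

The hard part in a real survey is not this computation but obtaining the pairwise probabilities, which is the obstacle described above.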

If I were in your shoes, I would try both specifications you described and see whether they produce comparable substantive results. Keep in mind that either way you are getting asymptotic Taylor series expansion standard errors, and they might be badly off with small samples like those you have. And I think you need to worry about your degrees of freedom, not your number of PSUs; I would do a small simulation to determine the approximate d.f. for your main variables, from census data if you have it, or from simulated data resembling the actual population. If I had infinite time to work on that project (meaning a week or two of devoted programming), I would implement option 3 as the most proper.

On 3/25/08, Angel Rodriguez Laso <angel.rodriguez@salud.madrid.org> wrote:

Greetings to all members of the list,

I have the following questions on svysetting for an analysis of a complex survey:

We have carried out a regional health population survey. We initially defined strata as geographic areas in the region (n=7) and allocated to each of them a sample proportional to its population. But because we wanted to over-represent the elderly, we required that the number of people over 65 years sampled in all areas reach a minimum number. We didn't change the sample size of people below 65 obtained through the proportional allocation. Therefore the sampling fractions (and consequently the weights) are different for each area-by-age-group (below/over 65) category. Then we selected census tracts in each geographic area with probabilities proportional to their total population, and randomly sampled 10 individuals in those selected, always keeping the proportion 7 below 65 years/3 over 65 years, which was the regional overall age distribution after the oversampling explained above.

My first question is whether strata should be defined as geographic areas alone or as geographic area by age group (below/over 65 years) (n=14) when svysetting. The first possibility looks more reasonable, because census tracts were selected within geographic areas, not within geographic-area-by-age groups. If this is correct, then probably the way to svyset would be declaring geographic areas as first-stage strata, census tracts as first-stage PSUs, and age groups as second-stage strata.

Alternatively, if the answer is that strata should be defined as region by two age-group categories, then the same census tract can belong to two different strata (for example, area A below 65/area A over 65), depending on the age of the individual considered. If I svyset with strata (region-by-age-group categories) and PSU = census tracts, Stata interprets that there are twice as many PSUs as there are real census tracts. Is that correct?

Many thanks.

Ángel Rodríguez Laso
Institute of Public Health of the Region of Madrid

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: Please do not reply to my Gmail address as I don't check it regularly.

*

* For searches and help try:

* http://www.stata.com/support/faqs/res/findit.html

* http://www.stata.com/support/statalist/faq

* http://www.ats.ucla.edu/stat/stata/

__________________________________________________________________
Message scanned and protected by Telefonica Empresas


Steven Samuels 845-246-0774 18 Cantine's Island Saugerties, NY 12477 EFax: 208-498-7441

