RE: st: Definition of strata and PSUs when svysetting

From   "Angel Rodriguez Laso" <>
To   <>
Subject   RE: st: Definition of strata and PSUs when svysetting
Date   Mon, 31 Mar 2008 12:27:07 +0200

Thank you, Steven, for your interest.

To answer your questions: I didn't go into more detail on the sampling
procedure because I didn't think it was needed for the definition of
strata and PSUs. There was intermediate sampling of dwellings. There was a
list of all dwellings in the census tracts, and from this list 12 dwellings
in each selected census tract were chosen at random. From each dwelling one
person was taken at random (and his/her weight calculated from the number of
people living in the dwelling). People were interviewed until a sample of 7
below 65 and 3 over 65 was obtained in each census tract. Twelve dwellings
were selected initially because taking only 10 was not expected to yield
the desired final 7/3 split. Nevertheless, 7 and 3 individuals could not be
selected in every census tract, and that is the reason (more than the
existence of missing items) why there are census tracts with only one
individual over 65.

I'm trying to check whether following your advice (merging the strata in
census tracts that have a single PSU per stratum) or just dropping the
second-stage specification would give very different results, but when I
run a svy: prop under the first specification:

svyset censustract [pweight=pondef], strata(area) fpc(#censustractsinarea) ||
identificationvariable, strata(agegroupscorrected)

I get the message: 'Missing standard error due to stratum with single
sampling unit; see help svydes.', but when I run

svydes variable, single stage(2)

no single PSUs are displayed. Do you know why?
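
A manual cross-check that does not rely on svydes could look like the
sketch below. It is only a sketch under stated assumptions: "outcome" is a
placeholder for the variable passed to svy: prop, the other variable names
are taken from the svyset line above, and the grouping variables are
assumed to be numeric or string identifiers as stored in the data. It
counts persons per tract-by-age-group cell among the observations with a
non-missing outcome and weight, and lists the cells left with one person.

```stata
* Sketch only: "outcome" is a hypothetical name for the analysis variable.
* Work on a temporary copy of the data so nothing is changed permanently.
preserve

* Keep the observations that have both an outcome value and a weight.
keep if !missing(outcome, pondef)

* Number of persons in each area / census tract / age group cell.
bysort area censustract agegroupscorrected: gen n_units = _N

* Flag one observation per cell so that each cell is listed once.
egen byte cell_tag = tag(area censustract agegroupscorrected)

* Cells in which a single person remains.
list area censustract agegroupscorrected if cell_tag & n_units == 1

restore
```

If this lists cells that svydes does not report, comparing the two sets of
observations may help to narrow down where the difference arises.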

Ángel Rodríguez Laso
Institute of Public Health of the Region of Madrid 

-----Original Message-----
On behalf of Steven Samuels
Sent: Friday, 28 March 2008 22:25
Subject: Re: st: Definition of strata and PSUs when svysetting

I'm sorry that I missed your initial post; I was on vacation and  
canceled my Statalist subscription.  I agree with Stas's suggestion  
for the first specification.

I have some questions

1. Your description implies that you created a list of ALL people in  
each selected tract,  stratified by age. Then selected by simple  
random sampling: 7 from the below 65 list; 3 from the over 65 list.   
Is that a correct description?  Or, was there intermediate sampling  
of dwellings?

2. Your PSU's are census tracts, not people. ("Primary" refers only  
to the first stage.) You are saying that in some of the census  
tracts, you had only one person either under or 'over' 65. Is that  

For those tracts, I suggest that you go with option 1 but ignore  
the stratification while keeping the sampling probabilities. That is,  
create a single stratum for those tracts by recoding.

You may still analyze your outcomes by age.  The analysis age groups  
need not match the stratum age-groups.
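
A minimal sketch of such a recoding follows, under several assumptions
that are not in the thread: ntracts_area is a hypothetical name for the
FPC variable (number of census tracts in each area), agegroupscorrected is
numeric, and 0 is not already one of its codes. The other variable names
are those used in the svyset command quoted in this thread. The weights
are left unchanged.

```stata
* Sketch only: ntracts_area is a hypothetical FPC variable name, and 0 is
* assumed not to be an existing code of agegroupscorrected.

* Number of persons in each area / census tract / age group cell.
bysort area censustract agegroupscorrected: gen n_units = _N

* Smallest cell size within each census tract.
bysort area censustract: egen min_units = min(n_units)

* Second-stage stratum: the age group, except in tracts where some age
* group has a single person; those tracts get one stratum for everyone.
gen age_stratum = agegroupscorrected
replace age_stratum = 0 if min_units == 1

* Same first stage as before; second-stage strata use the recoded variable.
svyset censustract [pweight=pondef], strata(area) fpc(ntracts_area) ///
    || identificationvariable, strata(age_stratum)
```

Because missing values can create further single-person cells for a given
outcome, the counts may need to be recomputed on the observations that
actually enter each analysis.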


On Mar 28, 2008, at 10:40 AM, Angel Rodriguez Laso wrote:

> Thank you for your answer, Stas.
> I've tried both specifications, and the first surprise was that Stata 9
> ignores further stages when stage 1 is sampled with replacement. It was
> good to come across this warning, because in our survey sampling was
> without replacement and the sampling fraction of the census tracts was
> quite high (more than one third in some strata), which precludes assuming
> that selection was with replacement.
> The problem with using age groups as second-stage strata is that, with
> only 3 people over 65 selected per census tract, whenever there are
> missing values in the variables some strata become single-PSU (person)
> strata, which prevents Stata from calculating standard errors. So the two
> specifications I've tried are:
> svyset censustract [pweight=pondef], strata(area) fpc(#censustractsinarea)
> svyset censustract [pweight=pondef], strata(area-by-age) fpc(#censustractsinarea)
> Not surprisingly, standard errors under the two specifications differ
> only by a few hundredths. I believe this is mainly because in both cases
> the degrees of freedom are very large. This is something I want to check
> with you: from reading Korn and Graubard's "Analysis of Health Surveys"
> I've understood that in complex surveys degrees of freedom are calculated
> as #PSUs - #strata (624 for the first specification and 1244 for the
> second, because Stata doubles the number of census tracts, since each of
> them belongs to two different strata). I do not follow you very well when
> you recommend doing a small simulation with census or simulated data to
> ascertain degrees of freedom, or when you state that Taylor series
> expansion standard errors might be badly off with small samples. It's
> usual practice to work with such low numbers of individuals per PSU (10 in
> my case), and I've never heard that this counted as a small-sample problem.
> Unfortunately, I don't have enough knowledge to go for option 3.
> To conclude, although both specifications yield similar results, I agree
> with you that the second one implies linked selection of PSUs, while the
> first one is conceptually sounder.
> Ángel Rodríguez Laso
> Institute of Public Health of the Region of Madrid
> -----Original Message-----
> On behalf of Stas Kolenikov
> Sent: Thursday, 27 March 2008 20:06
> Subject: Re: st: Definition of strata and PSUs when svysetting
> I would say your first specification makes better sense, even though
> the design it produces is quite weird, and the degrees of freedom in
> that design are strange (and 7 initial strata won't get you very far,
> anyway). In Stata 10, that's doable with
> svyset tract, strata(area) || person, strata(age_group)
> if I am getting your design right.
> In the second specification with region-by-age strata, you have some
> sort of coupled sampling, where selecting a PSU in one stratum implies
> selecting a certain PSU in another stratum linked by geography.
> You could still analyze that, but you would need to get accurate
> pairwise probabilities of selection to compute the Horvitz-Thompson
> estimator, and the Grundy-Yates-Sen estimator of its variance (which I
> don't think is implemented anywhere commercially as those higher order
> probabilities of selection are rarely known; Jeff P, that might
> produce a cutting edge addition to Stata's set of -svy- tools,
> although I've no idea how to input and parse those :)). Any reasonably
> high level book would have it (Kish, Cochran, Mary Thompson's books
> spring to mind). For special cases, I think that can be programmed in
> Mata. Let's call that option 3. Note that the naive implementation as
> svyset tract, strata(area X age) || person
> produces wrong probabilities of selection, and the variances are
> likely to be understated, as there is more variability in this
> specification than in your actual design.
> If I were in your shoes, I would try both specifications you described
> and see whether they are producing comparable substantive results.
> Keep in mind that either way you are getting asymptotic Taylor series
> expansion standard errors, and they might be badly
> off with small samples like those you have. And I think you need to
> worry about your degrees of freedom, not your number of PSUs; I would
> do a small simulation to determine the approximate d.f.s for your main
> variables -- from census data if you have it, or from simulated data
> resembling the actual population. If I had infinite time to work on
> that project (meaning, a week or two of devoted programming), I would
> implement option 3 as the most proper.
> On 3/25/08, Angel Rodriguez Laso wrote:
>> Greetings to all members of the list,
>> I have the following questions on svysetting for an analysis of a complex
>> survey:
>> We have carried out a regional population health survey. We initially
>> defined strata as geographic areas in the region (n=7) and allocated to
>> each of them a sample proportional to its population. But because we
>> wanted to over-represent the elderly, we required that the number of
>> people over 65 years sampled in all areas reach a minimum number. We
>> didn't change the sample size of people below 65 obtained through the
>> proportional allocation. Therefore the sampling fractions (and
>> consequently the weights) are different for each area by age group
>> (below/over 65) category.
>> Then we selected census tracts in each geographic area with probabilities
>> proportional to their total population, and randomly sampled 10
>> individuals in those selected, always keeping the proportion 7 below 65
>> years/3 over 65 years, which was the regional overall age distribution
>> after the oversampling explained above. My first question is whether
>> strata should be defined as geographic regions alone or as geographic
>> area by age group (below/over 65 years) (n=14) when svysetting. The first
>> possibility looks more reasonable, because census tracts were selected
>> within geographic areas, not within geographic area by age group cells.
>> If this is correct, then probably the way to svyset would be to declare
>> geographic areas as first-stage strata, census tracts as first-stage PSUs
>> and age groups as second-stage strata.
>> Alternatively, if the answer is that strata should be defined as region
>> by two age-group categories, then the same census tract can belong to two
>> different strata (for example area A below 65/area A over 65) depending
>> on the age of the individual considered. If I svyset with strata (region
>> by age group categories) and PSU = census tracts, Stata interprets that
>> there are twice as many PSUs as there are real census tracts. Is that
>> correct?
>> Many thanks.
>> Ángel Rodríguez Laso
>> Institute of Public Health of the Region of Madrid
> -- 
> Stas Kolenikov, also found at
> Small print: Please do not reply to my Gmail address as I don't check
> it regularly.