
# Re: st: How to svyset when strata are used in some groups and not others

From: Steve Samuels
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: How to svyset when strata are used in some groups and not others
Date: Mon, 5 Jul 2010 12:29:22 -0400

```
Louise-

I agree with Stas. (I say that often!)

As you sampled in a possibly different time interval for each
hospital, you do not have a well-defined population (such as: all
patients admitted between x and y). Therefore you do not have a true
"probability sample", in the sense that every admission has a known
probability of selection. Unfortunately this criticism applies to the
design for the sampled strata as well. A better design would have been
to define the same time interval for all units. In the US at least,
admission patterns differ by day of week, so we would design the
sample or post-stratify the results to reflect this.

If you sampled patients, rather than admissions, you also have a
potential problem of "length biased sampling", in which the
probability of selection increases with length of stay.

To solve your original problem: the original -svyset- applied to
estimates for the entire "population". For inference in certainty
PSUs, one would ordinarily create a separate -svyset- command confined
to these, with the PSUs now being the units sampled at the second stage.

A final remark: a hypothesis of the kind you tested would never be
true in a finite population: why should subgroup means of any variable
be _exactly_ the same? A better approach is to form confidence intervals
for the differences.

Steve

On Mon, Jul 5, 2010 at 11:54 AM, Stas Kolenikov <skolenik@gmail.com> wrote:
> The biggest problem you have is that of the units and populations of
> sampling and analysis.
>
> What you told Stata was: "I have a population of patients admitted to
> these hospitals in the given period of time".
> You really did sample close to 100% of the population, which is the
> patients in the hospitals in the first quarter of 2010, say. When you
> "The average age in the [most common type of] hospitals is 45.3 with
> the standard error due to sampling equal to 1.7, while the average age
> in the [other three types of hospitals] is 48.7, 51.3 and 50.6, with
> the standard error due to sampling equal to zero." When you talk about
> finite populations, it is kinda naive to expect that their means will
> be exactly the same, so H_0: means are the same is not a very sensible
> one.
>
> But what you, most likely, want to do is to generalize your findings
> somehow to all potential patients in these hospitals, or to a longer
> period of time. You can do this in two ways.
>
> First, you can try to stay within the formal sampling paradigm, and
> say, "I sampled d days out of one year" (which is what you indicate in
> the end of your email). This will be the second stage of your sampling
> procedure, with the first stage being sampling the hospitals (and
> obtaining fpc = 0 in that stage). If you list all the patients in the
> hospital on these days, without sampling them within the hospital
> but did not describe if that was random sampling, or what), then your
> weight should be 365.2422/d. FPCs are only applicable to simple random
> sampling, where they reflect how joint probabilities of selection
> necessary to compute unbiased variance estimates simplify. Your
> design, however, is rather of systematic sampling (which is a special
> case of cluster sampling): you have d consecutive days, which can
> hardly be thought of as a simple random sample of days in the year.
> estimable, since, from survey statistics perspective, you have a
> sample of size 1 cluster (i.e., only 1 d.f.).
>
> This leaves you, I believe, with one option remaining, and that is to
> talk about a hypothetical super-population of all patients who could
> have been admitted to these hospitals in an unspecified period of
> time. You have not observed that sort of population, so you have to
> make a leap of faith and say, "I believe that these d days are
> representative of what's going on in these hospitals, in general".
> That's a more natural mode to be in when you talk about say logistic
> regression, and that's the reason Steve S asked about what you want to
> do with the data. So... you are now saying that there is a data
> generating process that (i) created patients, (ii) allocated them to
> hospitals. I would still tend to think that you want to treat
> hospitals as fixed, though. Then when you conduct your inference with
> respect to that process (rather than to a specific finite population),
> you have to say that your population is potentially infinite, and you
> took a sample from this population. This makes fpc's equal to 1: fpc =
> 1 - n/N = 1 - n/infinity = 1. (Why did you have the square root in
> your fpc formula above?) The weights still need to be modified, as it
> is still more likely to find a patient in the hospital if you've
> observed this hospital for a longer period of time. So 365/d is still
> a good weight to use... or 1000/d can be used if your projected period
> to which you want to generalize is 1000 days. It does not matter in
> the end when you analyze the means; the scale of weights is only
> important when you estimate the total (i.e., the total number of
> patients admitted to each type of the hospitals, or the total costs of
> care), in which case your weights should be linked to the specific
> period of time over which you calculate your admissions or costs.
>
> On Mon, Jul 5, 2010 at 10:17 AM, Louise Linsell
> <Louise.Linsell@npeu.ox.ac.uk> wrote:
>> This is the complete design for the partially stratified dataset:
>>
>> There are 4 types of hospital and (for example) we are testing the hypothesis that mean age is equal across hospital type.
>>
>> For the first type of hospital, we divided the 180 national units into 6 strata (North/South x large/medium/small size) and selected 37 units (with the probability of selection proportional to size within strata).
>>
>> For the other 3 types of hospital we selected all national units.
>>
>> We then sampled patients for d consecutive days, where d varied by unit.
>>
>>
>> The commands we have used so far are:
>>
>> svyset hospid [pweight=weight], strata(strata) fpc(strata_frac)
>> svy: mean age, over(hosptype)
>>
>> Where:
>>
>> hospid = hospital identifier 1...435
>> weight = probability sampling weight (number of days recruited in unit/number of days recruited in units of same hospital type)
>> strata = strata number 1...9 (1-6 for strata within 1st hospital type,7 for 2nd hospital type, 8 for 3rd hospital type and 9 for 4th hospital type)
>> strata_frac = n/N - number of units selected in stratum/total number of units in stratum (=1 for last 3 types of hospital)
>> age = patient age in years
>> hosptype = type of hospital 1...4
>>
>> When this model is fitted we get zero estimates for the standard errors in the last 3 types of hospital.
>> I think this is because strata_frac=1 for these hospitals, so the model thinks we have sampled the whole population,
>> when in fact we have just sampled a number of consecutive days. I was thinking about specifying a second level of
>> sampling - number of days sampled out of one whole year and setting fpc's for the secondary sampling units (days).
>>
>> LL
>>
>>
>>
>>
>>
>>>>> Steve Samuels <sjsamuels@gmail.com> 05/07/2010 12:42 >>>
>>
>> 1. the complete design, including subsequent stages of sampling
>> 2.  the purposes of the analyses--descriptive?  estimating regression
>> coefficients?  testing hypotheses?
>>
>> What -svyset- commands have you tried to issue so far?
>>
>> Steve
>>
>>
>>
>> On Mon, Jul 5, 2010 at 5:06 AM, Louise Linsell
>> <Louise.Linsell@npeu.ox.ac.uk> wrote:
>>> Thank you for suggestions. We have already tried defining 9 strata; 6 for the common type of hospital, for which we used stratified random sampling with 6 strata,  and 1 stratum each for the other 3 types of hospital, for which we took all units.
>>>
>>> However, in the model we had to specify a finite population correction (FPC=sqrt(1-n/N)) as we sampled 28 out of 87 units for the most common type of hospital.
>>>
>>> Because we sampled ALL the units from the other 3 types of hospital we had to set the FPC to zero since n=N (which is specified as 1 in Stata as it requires you to specify n/N). This means that there are no variance estimates when we summarise any outcomes in the 3 less common types of hospital, because it thinks we have sampled the whole population within these hospitals (when in fact we took a consecutive number of patients over a period of 3 months).
>>>
>>> LL
>>>
>>>>>> Stas Kolenikov <skolenik@gmail.com> 02/07/2010 20:36 >>>
>>> If Louise sampled other 3 types lumping them together, then Steve's
>>> recommendation is appropriate. If sampling was performed within each
>>> of those remaining types, then the strata variable will have 6 (strata
>>> in the most common type of hospitals) + 3 (other types of hospitals) =
>>> 9 levels.
>>>
>>> On Fri, Jul 2, 2010 at 11:18 AM, Steve Samuels <sjsamuels@gmail.com> wrote:
>>>> Louise-- create a stratum variable with 7 values: 1-6 for the
>>>> hospitals of the first type, and 7 for the other three types, and use
>>>> that in the strata() option of -svyset-
>>>>
>>>> Steve
>>>>
>>>> On Fri, Jul 2, 2010 at 12:00 PM, Louise Linsell
>>>> <Louise.Linsell@npeu.ox.ac.uk> wrote:
>>>>> I have a dataset with 4 different types of hospital, and would like to compare binary outcomes between them using logistic regression. However, for the first type (the most common), hospitals were divided into 6 strata (based on size and SES) and a random sample was taken from each stratum. For the other 3 types of hospital we sampled all hospitals. My question is, how to use the svyset command when a different sampling strategy was used in one group?
>>>>>
>>>>> LL
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *
>>>>> *   For searches and help try:
>>>>> *   http://www.stata.com/help.cgi?search
>>>>> *   http://www.stata.com/support/statalist/faq
>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Steven Samuels
>>>> sjsamuels@gmail.com
>>>> 18 Cantine's Island
>>>> Saugerties NY 12477
>>>> USA
>>>> Voice: 845-246-0774
>>>> Fax:    206-202-4783
>>>>
>>>
>>>
>>>
>>> --
>>> Stas Kolenikov, also found at http://stas.kolenikov.name
>>> Small print: I use this email account for mailing lists only.
>>>
>>
>>
>>
>>
>
>
>
>

--
Steven Samuels
sjsamuels@gmail.com
18 Cantine's Island
Saugerties NY 12477
USA
Voice: 845-246-0774
Fax: 206-202-4783

```
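
Steve's suggestion of a separate -svyset- confined to the certainty hospitals, combined with the 365/d weight Stas describes, might be sketched roughly as follows. This is an illustration under stated assumptions, not the posters' own code: `ndays` (holding d, the number of consecutive days observed in the unit) and `wt_days` are hypothetical variable names, hospital types 2 to 4 are assumed to be the ones in which every unit was selected, and patients are assumed to be the units sampled at the second stage. No fpc() is supplied, in line with the super-population argument.

```stata
* Sketch only. Hypothetical names not in the thread: ndays, wt_days.
* Names taken from the thread: hospid, hosptype, age.

* Weight proportional to 1/d, scaled to a one-year reference period.
generate double wt_days = 365/ndays

* Work only with the three hospital types in which all units were selected
* (assumed to be coded 2, 3 and 4).
preserve
keep if inlist(hosptype, 2, 3, 4)

* Each certainty hospital is its own stratum; observations (patients)
* are treated as the sampling units. No fpc() is specified.
svyset _n [pweight = wt_days], strata(hospid)

* Means by hospital type; standard errors now reflect within-hospital variation.
svy: mean age, over(hosptype)

* Differences from the base hospital type, reported with confidence intervals.
svy: regress age i.hosptype

restore
```

Rescaling `wt_days` (for example 1000/ndays instead of 365/ndays) should leave the estimated means and differences unchanged; as noted in the thread, the scale matters only when totals are estimated.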