
Re: st: Question about svyset command

From   "Michael I. Lichter"
Subject   Re: st: Question about svyset command
Date   Thu, 19 Feb 2009 12:38:19 -0500


You are dealing with two misconceptions. The first is one that Steven didn't mention, and the second is one that Steven mentioned but did not relate as directly to your situation as he could have.

1. A multi-stage sampling design is a design in which sampling takes place multiple times. E.g., if you sample 20% of the districts in a state and then sample 10% of the schools in each selected district, that's a two-stage sample. Let's say we conduct interviews of all of the students in all of the classrooms within each selected school. Children are nested within classrooms within schools within districts--that's four levels of nesting. But the design is still two-stage, despite the four levels of nesting, because sampling took place only at the district and school levels.

*You* have a *single-stage* stratified cluster design. The fact that cases are nested within the clusters you select is what makes it a cluster design. Your telling Stata that the cases were selected at the "second stage" with 100% probability is the same as telling Stata that there is no second stage. That's why the estimates look the same whether you tell Stata about the "second stage" or not.
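
One way to see this is to declare the design both ways and compare the results. What follows is only a sketch, not a tested analysis: sitecode, bwgt0, strata, fpc1, su2, and fpc2 are the variables from the original post, while motion, x1, and x2 are hypothetical stand-ins for the outcome and covariates. With a second-stage sampling rate of 1, the second stage should add nothing to the variance, so the two tables should agree.

```stata
* Two-stage declaration from the original post (fpc2 = 1 for every trial)
svyset sitecode [pweight=bwgt0], strata(strata) fpc(fpc1) || su2, fpc(fpc2)
svy: logistic motion x1 x2          // motion, x1, x2 are hypothetical names
estimates store twostage

* Single-stage declaration: counties are the only sampled units
svyset sitecode [pweight=bwgt0], strata(strata) fpc(fpc1)
svy: logistic motion x1 x2
estimates store onestage

* Side-by-side coefficients and standard errors
estimates table twostage onestage, se
```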

2. Also, Steven is definitely correct if your goal is not to generalize to the 10,500 trials you mentioned, but rather to this type of trial in general, wherever and whenever it takes place. In fact, what he says is more obviously applicable to your study than it is to most sample surveys of people. If the 10,500 trials are taken to be representative of some larger population of trials, then you are dealing with superpopulation parameters, as he said. From my limited reading, however, I think that the consensus on this topic among statisticians is less widespread than he suggests, and the consensus about what to do about it is almost nonexistent (except for what he said about the FPC). Korn and Graubard (1999) say, for example, that aside from ignoring the FPC there is no agreement on what to do in order to reduce the bias in estimates of superpopulation parameters from complex sample designs (p. 228).

On the other hand, if what you really want to know about is those 10,500 trials because there is something special about this specific population of trials, then Steven is wrong and you should use the FPC. With regard to the value of FPC1, it should either be the size of the stratum (the number of counties it contains) -- and you didn't say anything about what your strata were -- or the number of counties selected from the stratum divided by the size of the stratum. I suggest the former because you're less likely to make an error in its calculation.
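
A sketch of the two ways to supply the first-stage FPC, assuming a hypothetical variable N_h that holds the total number of counties in each stratum (the original post does not provide one). As I read the -svyset- documentation, fpc() values of 1 or less are treated as sampling rates and larger values as population counts, so either form should work:

```stata
* N_h is a hypothetical variable: total counties in the stratum (population)

* Option A: pass the stratum size directly
svyset sitecode [pweight=bwgt0], strata(strata) fpc(N_h)

* Option B: pass the sampling rate n_h / N_h instead
egen byte onecounty = tag(strata sitecode)   // flags one record per sampled county
bysort strata: egen n_h = total(onecounty)   // sampled counties in the stratum
generate double rate_h = n_h / N_h
svyset sitecode [pweight=bwgt0], strata(strata) fpc(rate_h)
```

Option A involves no arithmetic, which is why it leaves less room for error.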


Steven Samuels wrote:

1. The finite population corrections should affect only standard errors and confidence intervals, not point estimates such as means or proportions.

2. fpc's should be employed only for descriptive analyses (proportions, means). These analyses describe the specific finite population that you sampled: tort, contract, and real property trials in the 75 counties.

If the purpose of your model is analytic (to develop predictions, estimate odds ratios, compare proportions, or otherwise test hypotheses), you should *omit* the finite population corrections. The reasoning is interesting (Cochran, 1977, p. 39): It is seldom of scientific interest to ask if a null hypothesis (e.g. that two proportions are equal) is exactly true in a finite population. Except by a very rare chance, a null hypothesis will never be true. You would discover this by enumerating the entire population. This leads to the adoption of a "superpopulation" viewpoint, which is taken by almost all statisticians these days. See also Deming (1966), pp. 247-261, "Distinction between enumerative and analytic studies"; Korn and Graubard (1999), p. 227.

In other words, you should use one -svyset- for describing the target population and another for the logistic regression.
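
A minimal sketch of the two declarations, using the design variables from the original post; motion, x1, and x2 are hypothetical names for the outcome and covariates:

```stata
* Descriptive analysis of the finite population: keep the FPC
svyset sitecode [pweight=bwgt0], strata(strata) fpc(fpc1)
svy: proportion motion               // motion is a hypothetical 0/1 outcome

* Analytic (superpopulation) analysis: omit the FPC
svyset sitecode [pweight=bwgt0], strata(strata)
svy: logistic motion x1 x2           // x1, x2 are hypothetical covariates
```

Each -svyset- stays in effect until it is replaced, so the declaration has to be reissued before switching between the two kinds of analysis.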

Two questions came to mind:
1. If a trial had >1 plaintiff or >1 defendant, would that not increase the probability of a post trial motion? How are you going to account for that?
2. For descriptive analyses, counties selected with certainty need special treatment. Look up the "singleunit" option for -svyset-.
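
As a rough illustration of the second point, assuming the certainty counties show up as strata with a single sampled county (the original post does not say whether they do):

```stata
* List strata and the number of sampled counties to spot single-county strata
svyset sitecode [pweight=bwgt0], strata(strata) fpc(fpc1)
svydescribe

* One possible treatment: single-PSU strata contribute nothing to the variance
svyset sitecode [pweight=bwgt0], strata(strata) fpc(fpc1) singleunit(certainty)
```

The other singleunit() choices (missing, scaled, centered) handle such strata differently, so which one is appropriate depends on how the certainty counties were actually coded.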

Good luck!


Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley.
Deming, W. E. (1966). Some theory of sampling. New York: Dover Publications.
Korn, E. L., & Graubard, B. I. (1999). Analysis of health surveys (Wiley series in probability and statistics). New York: Wiley.

On Feb 19, 2009, at 12:04 AM, wrote:

I'm a beginner Stata user and have a question about the svyset command in Stata that I hope someone can help me with.

For some background, I'm engaged in a logistic regression model that examines the likelihood of either a plaintiff or defendant filing a post trial motion. The database I'm working with is the Civil Justice Survey of State Courts (CJSSC). The CJSSC provides case level data for all trials concluded in a sample of 46 of the nation's 75 most populous counties in 2005. Data are collected on about 8,000 trials in these 46 counties, which are weighted to represent about 10,500 trials concluded in the nation's 75 most populous counties. I understand that one of the nice features of Stata is that it allows you to take into account the sampling structure of a dataset when doing logistic regression modeling. Here is the Stata code that I used to take into account the sampling structure of these civil trial data:

svyset sitecode [pweight=bwgt0], strata(strata) fpc(fpc1) || su2, fpc(fpc2)

sitecode = County where the civil trial took place
bwgt0 = Weights to weight the data from 46 to the 75 most populous counties
strata = Strata where the counties are located. The dataset has 5 strata
fpc1 = The probability of a county appearing in the sample. For example, a county with a weight of 2 would have a 50% probability of appearing in the sample
su2 = Unique identifier that identifies the trials that occurred in each of the 46 counties
fpc2 = 1 for all 8,000 trials disposed in the 46 counties. I gave fpc2 a value of 1 because I wanted to tell Stata that the trials had a 100% probability of showing up in these 46 counties.

I think that I got the part of this programming that deals with the first level of the sample design correct. It's the second level that I'm having some problems with. At the second level of the sample design, I'm trying to correct for the fact that I have data for every civil trial concluded in the 46 counties. Basically, I want to tell Stata that part of this sample is actually a census of all trials concluded in the 46 counties in 2005. I understand Stata has a finite population correction command that takes into account the census-like format of these data. The logistic regression results were the same irrespective of whether I used the 1st or 2nd stages in the sample design. I think this is telling me that Stata is not correcting for the census-like aspect of this sample.

Can anyone give me some guidance as to whether I'm correctly taking into account the sampling structure of these data? In particular, I would like to know whether I'm using the fpc2 factor correctly. Any assistance you could give on this matter would be very much appreciated.

Thomas Cohen

Michael I. Lichter, Ph.D.
Research Assistant Professor & NRSA Fellow
UB Department of Family Medicine / Primary Care Research Institute
UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
Office: CC 125 / Phone: 716-898-4751
