Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org** is already up and running.


From: "G. Dai" <dgecon@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: Sample Wegihts

Date: Tue, 9 Mar 2010 14:45:44 -0800

Hi Jason,

The weighting strategy you proposed might not be appropriate. The purpose of a weight is to make the sample represent the population and to adjust for the sampling method. Given your dataset, I tentatively assume the sampling is random, that is, every individual in the population enters the dataset with equal probability. Even with this assumption, the simple inverse-weighting strategy still cannot represent the population, since the sample size of each city corresponds only to the population of EACH CITY, NOT the population of Canada. For example, assume Canada's population is 100, while Montreal's is 20 and Vancouver's is 50: inverse weights would scale each city sample up to its own city's population, but nothing guarantees that the cities and the rest of the country end up in their correct proportions. Overall, without any knowledge of the sampling strategy behind the dataset and other population information, it seems impossible to impute any reasonable weight. Thus, you might just use these data without weights and keep in mind that your data overweight individuals from these cities.

HTH,
Guang

On Tue, Mar 9, 2010 at 1:33 PM, Jason Dean, Mr <jason.dean@mail.mcgill.ca> wrote:
> Hi Michael and Guang, thank you very much for providing help.
>
> Here is some more information about my samples:
>
> - 5% random sample of the entire country (census data, 1901 Canada).
>
> Then:
>
> - 8% sample of the urban areas of Montreal
> - 6% sample of the urban areas of Toronto
> - 24% sample of the urban areas of Vancouver
> - 17% sample of the urban areas of Winnipeg
>
> Now these last 4 city samples are clustered by census subdistrict (sorry I forgot to mention this in my post) - essentially they were sampled by selecting every 5th subdistrict, and every household in those subdistricts was sampled. Many of the socio-economic characteristics of these samples match the census population figures.
>
> - As well, I have an additional 10% random sample of Vancouver.
> I have taken out duplicated observations by using the page and line number of the census, and left them in the 5% random sample (originally, for what I was doing, that was fine). Sorry, I did not mean between cities in my last post; I meant I dropped duplicates between the Toronto 5% sample and the Toronto extra sample (and likewise for the other three cities).
>
> So if I ignore the clustering and duplicates issues, I would have a weight of 20 for the 5 percent random sample, and then weights of 12.5, 16.7, 4.1, and 5.8 respectively for the city oversamples. Plus, the extra Vancouver sample would have a weight of 10. I would apply these weights using -svyset- and have a stratum ID as per your previous response. Is this correct?
>
> Then, to deal with clustering, could I use the -cluster(id)- option after -regress-, where id is a variable that identifies the subdistrict in all samples?
>
> For the duplicates, could I adjust the weight for the city in which I removed the duplicated observations?
>
> Please let me know if you need any more info. Again, I really appreciate the help.
>
> Thanks,
>
> Jason
>
> ________________________________________
> From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] On Behalf Of Michael I. Lichter [mlichter@buffalo.edu]
> Sent: Tuesday, March 09, 2010 3:07 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Sample Wegihts
>
> Jason,
>
> In general, probability weights are equal to 1/(probability of inclusion in the sample), so your 5% sample gets a weight of 20, and if you sampled the 4 urban areas at a 10% rate, the weight for those cases should be 10. This is a stratified design and should ideally be analyzed as such using -svyset [pw=your-pweight], strata(your-stratum-id)-, where your-pweight is the weight you construct and your-stratum-id is a variable with a category for each stratum.
> If the sampling rate differs between the cities, e.g., if you sampled 1000 people regardless of city size, you would need a different weight for each city and a different stratum ID.
>
> Now, I wonder what you mean about having dropped "duplicate observations". Do you mean that you dropped the observations for Toronto from your first sample and are substituting those from the second, or do you mean that you combined the two samples and literally dropped only those observations that appeared in both? (And I wonder what kind of data you have that you would know they were duplicates.) If the former, what I said above applies; if the latter ... you probably shouldn't.
>
> The other alternative is simply to combine the samples without dropping observations. In that case, you would need to decide how much relative weight to give to the "regular" sample vs. the "oversample"; if you want each to be weighted equally, you just divide their "natural" weights by two; that is, your-pweight = 10 instead of 20 for the 5% sample, and your-pweight = 5 instead of 10 for the oversample. Somebody who knows more than me can comment on the advisability of this course; it means that a sampling-without-replacement design (which is what I assume you have in each of the two datasets) becomes sampling with (limited) replacement.
>
> I agree with Guang Dai (I saw his message after writing this) that how your samples are designed is important; you haven't given us a lot of information to go on.
>
> Michael
>
> Jason Dean, Mr wrote:
> > I have a quick question. I currently have a 5% random sample of Canada. I also have 4 extra random samples of only the four largest urban cities (I have dropped duplicate observations between samples).
> >
> > What is the best strategy to include these extra samples and keep the sample representative of the country? I intend to condition on these cities with dummy variables in my regression.
> > However, I would prefer to use sample weights, but I am not sure of the best way to go about creating them. Any suggestions would be greatly appreciated.
> >
> > Jason
>
> --
> Michael I. Lichter, Ph.D. <mlichter@buffalo.edu>
> Research Assistant Professor & NRSA Fellow
> UB Department of Family Medicine / Primary Care Research Institute
> UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
> Office: CC 126 / Phone: 716-898-4751 / FAX: 716-898-3536

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
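To make the recipe discussed in the thread concrete, here is a minimal Stata sketch. It assumes the appended samples carry a variable `stratum` (one category per sample, as Michael suggests) and a variable `subdistrict` identifying the census subdistrict; these variable names, the outcome, and the regressors are illustrative, and whether subdistricts should enter as PSUs in -svyset- or via -vce(cluster)- after -regress- is exactly the open question Jason raises.

```stata
* Sketch only: variable names (stratum, subdistrict, pw, outcome, x1, x2)
* are illustrative, not from the thread's actual dataset.

* Probability weight = 1 / (probability of inclusion in the sample)
generate double pw = .
replace pw = 1/0.05 if stratum == 1   // 5% national sample   -> weight 20
replace pw = 1/0.08 if stratum == 2   // 8% Montreal sample   -> weight 12.5
replace pw = 1/0.06 if stratum == 3   // 6% Toronto sample    -> weight ~16.7
replace pw = 1/0.24 if stratum == 4   // 24% Vancouver sample -> weight ~4.2
replace pw = 1/0.17 if stratum == 5   // 17% Winnipeg sample  -> weight ~5.9
replace pw = 1/0.10 if stratum == 6   // extra 10% Vancouver  -> weight 10

* Declare the stratified design; here the subdistrict clustering is
* handled by treating subdistricts as PSUs within strata.
svyset subdistrict [pweight=pw], strata(stratum)

* Estimation then uses the survey prefix rather than bare -regress-:
svy: regress outcome x1 x2
```

This follows Michael's 1/(inclusion probability) rule and stratified setup; note that the duplicate-dropping issue he flags is not addressed by any weight adjustment in this sketch.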

**Follow-Ups**:
- **Re: st: Sample Wegihts**, *From:* Stas Kolenikov <skolenik@gmail.com>

**References**:
- **st: Sample Wegihts**, *From:* "Jason Dean, Mr" <jason.dean@mail.mcgill.ca>
- **Re: st: Sample Wegihts**, *From:* "Michael I. Lichter" <mlichter@buffalo.edu>
- **RE: st: Sample Wegihts**, *From:* "Jason Dean, Mr" <jason.dean@mail.mcgill.ca>
