# Re: st: Sample Wegihts

 From "Michael I. Lichter" To statalist@hsphsun2.harvard.edu Subject Re: st: Sample Wegihts Date Tue, 09 Mar 2010 15:07:41 -0500

```Jason,

```
In general, probability weights are equal to 1/(probability of inclusion in the sample), so your 5% sample gets a weight of 20, and if you sampled the 4 urban areas at a 10% rate, the weight for those cases should be 10. This is a stratified design and should ideally be analyzed as such using -svyset [pw=your-pweight], strata(your-stratum-id)-, where your-pweight is the weight you construct and your-stratum-id is a variable with a category for each stratum. If the sampling rate differs between the cities (e.g., if you sampled 1000 people regardless of city size), you would need a different weight for each city and a different stratum ID.
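To make the weight construction concrete, here is a minimal sketch in Stata. It assumes the urban cases from the 5% sample have been replaced by the 10% oversample, so each case belongs to exactly one stratum. The toy data and the variable names (urban, samp_rate, pw, stratum, y) are made up for illustration and are not from Jason's data.

```stata
* Hypothetical toy data: urban == 1 marks cases from the 4 oversampled cities
clear
input id urban y
1 0 10
2 0 12
3 0 11
4 1 15
5 1 14
6 1 16
end

* Assumed sampling rates: 10% in the urban areas, 5% everywhere else
generate double samp_rate = cond(urban == 1, 0.10, 0.05)

* Probability weight = 1 / (probability of inclusion in the sample)
generate double pw = 1 / samp_rate

* One stratum per sampling rate (use one per city if the rates differ by city)
generate byte stratum = urban + 1

* Declare the stratified design, then use the svy: prefix for estimation
svyset [pweight = pw], strata(stratum)
svy: mean y
```

If each city was sampled at its own rate, the same pattern should apply with one samp_rate value and one stratum code per city.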
```
```
Now, I wonder what you mean about having dropped "duplicate observations". Do you mean that you dropped the Toronto observations from your first sample and are substituting those from the second, or do you mean that you combined the two samples and literally dropped only those observations that appeared in both? (And I wonder what kind of data you have that you would know they were duplicates.) If the former, what I said above applies; if the latter ... you probably shouldn't.
```
```
The other alternative is simply to combine the samples without dropping observations. In that case, you would need to decide how much relative weight to give to the "regular" sample vs. the "oversample"; if you want each to be weighted equally, you just divide their "natural" weights by two; that is, your-pweight = 10 instead of 20 for the 5% sample, and your-pweight = 5 instead of 10 for the oversample. Somebody who knows more than I do can comment on the advisability of this course; it means that a sampling-without-replacement design (which is what I assume you have in each of the two datasets) becomes sampling with (limited) replacement.
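A rough sketch of the halved weights, again in Stata with made-up data and variable names (urban, oversample, pw). One assumption here goes beyond what is stated above: the halving is applied only to cases in the urban areas, since those are the only cases covered by both samples, and cases outside those areas keep their weight of 20.

```stata
* Hypothetical toy data: oversample == 1 marks cases from the extra urban samples
clear
input id urban oversample
1 0 0
2 0 0
3 1 0
4 1 0
5 1 1
6 1 1
end

* "Natural" weights: 20 for the 5% sample, 10 for the 10% oversample
generate double pw = cond(oversample == 1, 10, 20)

* Assumption: halve the weights only where the two samples overlap (urban areas)
replace pw = pw / 2 if urban == 1

list id urban oversample pw
```

Under that assumption the urban weights still add up to the urban population: the 5% sample contributes 0.05 x 10 and the oversample 0.10 x 5, each half of the total.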
```
```
I agree with Guang Dai (I saw his message after writing this) that how your samples are designed is important; you haven't given us a lot of information to go on.
```
Michael

Jason Dean, Mr wrote:
```
```I have a quick question. I currently have a 5% random sample of Canada. I also have 4 extra random samples of only the four largest urban cities (I have dropped duplicate observations between samples).

What is the best strategy to include these extra samples and keep the sample representative of the country? I intend to condition on these cities with dummy variables in my regression. However, I would prefer to use sample weights, but I am not sure of the best way to go about creating them. Any suggestions would be greatly appreciated.

Jason

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```
```
--
Michael I. Lichter, Ph.D. <mlichter@buffalo.edu>
Research Assistant Professor & NRSA Fellow
UB Department of Family Medicine / Primary Care Research Institute
UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
Office: CC 126 / Phone: 716-898-4751 / FAX: 716-898-3536

```