Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


Re: st: semi-random sampling (how to impose properties of one population onto a subsample of a different population)


From   Steven Samuels <sjsamuels@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: semi-random sampling (how to impose properties of one population onto a subsample of a different population)
Date   Mon, 15 Aug 2011 17:13:44 -0400

You are very welcome, Ekaterina, but note Austin's follow-up.  If you wish to make these two groups comparable for analysis, there are better, more comprehensive, and more defensible approaches. Austin mentioned propensity score weighting, but -margins- might be superior if you are not interested in causal effects of income.

Steve


On Aug 15, 2011, at 4:12 PM, Ekaterina Hertog wrote:

Thank you very much! Sorry for posing the original question imprecisely!
ekaterina

On 07/08/2011 18:32, Steven Samuels wrote:
> 
> Sorry, I misunderstood. Here's code that you can adapt. Note that you set the desired sample size in the first line.
> 
> *************CODE BEGINS*************
> scalar sampsize = 500
> set seed 842655
> 
> clear
> /* Input Frequencies for External Population
> You can get these from -contract-
> in the original external data set:
> "contract agegp region, freq(freq1)"
> */
> input agegp region freq1
> 1 1 501
> 1 2 415
> 2 1 1809
> 2 2 3003
> 3 1 1288
> 3 2 1400
> end
> egen tot1 = total(freq1)
> gen ssize = round(sampsize*freq1/tot1)
> /* Check Frequencies */
> tab agegp region [fw=freq1], cell
> tab agegp region [fw=ssize], cell
> 
> sort agegp region
> tempfile t1
> save `t1'
> /*  Create Data set to be sampled from the auto data */
> 
> sysuse auto, clear
> expand 100
> rename rep78 agegp
> rename foreign region
> 
> recode agegp 2=1 5=1 .=1 3=2 4=3  // values 1,2,3
> replace region = region +1        // values 1,2
> 
> 
> /* Merge with external counts */
> sort agegp region
> merge m:1 agegp region using `t1'
> tab _merge
> drop _merge
> 
> egen stratum = group(agegp region)
> levelsof stratum, local(levels)
> tempfile t2
> save `t2'
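> foreach x of local levels {
>     use `t2', clear
>     keep if stratum==`x'
>     gen double u = runiform()   // runiform() is the current name of uniform(); double makes ties unlikely
>     sort u
>     keep if _n<=ssize           // first ssize observations in random order
>     tempfile td`x'
>     save `td`x''
> }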
> 
> clear
> tempfile t0 //empty data set to append to
> gen dummy=1
> save `t0'
> foreach x of local levels {
>     append using `td`x''
> }
> drop dummy
> /* Check frequencies again */
> tab agegp region , cell missing
> save sample1, replace
> **************CODE ENDS**************
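The per-stratum loop above can also be sketched without temporary files per stratum. The following is only an illustrative variant, not the code posted in the thread: it assumes the same made-up target frequencies and the same auto-data stand-in, and the names `targets` and `u` are introduced here for illustration. It sorts each cell into random order and keeps the first `ssize` observations.

```stata
* Sketch only: loop-free variant of the stratified draw (assumed names: targets, u)
clear
set seed 842655
scalar sampsize = 500

* Target cell counts (same illustrative numbers as in the posted code)
input agegp region freq1
1 1 501
1 2 415
2 1 1809
2 2 3003
3 1 1288
3 2 1400
end
egen tot1 = total(freq1)
gen ssize = round(sampsize*freq1/tot1)
tempfile targets
save `targets'

* Stand-in for the data set to be sampled from
sysuse auto, clear
expand 100
rename rep78 agegp
rename foreign region
recode agegp 2=1 5=1 .=1 3=2 4=3   // values 1,2,3
replace region = region + 1        // values 1,2

* Attach the target sample size to every observation in its cell
merge m:1 agegp region using `targets', keep(match) nogenerate

* Random order within each cell, then keep the first ssize observations
gen double u = runiform()
bysort agegp region (u): keep if _n <= ssize
drop u

* Check frequencies
tab agegp region, cell
```

As in the posted code, a cell that holds fewer observations than its `ssize` simply contributes all of them, and because of rounding the cell sizes need not sum exactly to `sampsize`.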
> 
> On Aug 7, 2011, at 5:05 AM, Ekaterina Hertog wrote:
> 
> Dear Steven,
> Thank you for your help; however, it does not fully solve my problem. Your proposed solution lets me roughly preserve the population percentages of the whole sample in a subsample. What I need, however, is to impose the population percentages found in a different dataset on the subsample I am creating. Essentially, I have two datasets: one of high-income women and one of middle-income women. High-income women tend to be older and are more likely to live in the capital. I need to create a subsample of the middle-income women dataset that matches the high-income women dataset on age and location characteristics.
> Does anyone know how to do this in Stata 11?
> Ekaterina
> 
> On 07/08/2011 09:08, Steven Samuels wrote:
>> The following code shows how to take a 10% sample within categories formed by two variables. The sample and whole population percentages will be approximately the same, with the agreement better for larger within-cell sample sizes.
>> 
>> Steve
>> 
>> *************CODE BEGINS*************
>> sysuse auto, clear
>> expand 6
>> set seed 842655
>> recode rep78 1/2=5 .=5
>> tab rep78 foreign, cell
>> sample 10, by(foreign rep78)
>> tab rep78 foreign, cell
>> **************CODE ENDS**************
>> 
>> 
>> 
>> On Aug 6, 2011, at 4:23 PM, Ekaterina Hertog wrote:
>> 
>> Dear all,
>> I need to take a subsample of observations from a big dataset, making sure that the people in the subsample have a given geographic and age profile. I need to make sure that, say, 50% of people in the subsample come from the capital and 50% from other towns. Within each of these 2 locations I want to preserve a certain age structure: say, in a city: 3 people aged 23, 4 people aged 24 …
>> Within those geographic and age profiles I want to select the observations randomly. Is it possible to do that in Stata 11? Any thoughts on how I would go about it?
>> 
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
© Copyright 1996–2014 StataCorp LP