
Re: st: Creating a smaller dataset from a larger one.


From   Richard Williams <richardwilliams.ndu@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Creating a smaller dataset from a larger one.
Date   Mon, 13 Aug 2012 16:04:00 -0500

At 10:47 AM 8/13/2012, Le Wang wrote:
Dear Amal,

Stata has a built-in program called -sample- to draw a random sample.
See the link below for a detailed tutorial for this command.

http://www.ats.ucla.edu/stat/stata/faq/sample.htm

Hope that helps.

Le
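A minimal sketch of how -sample- might be used for this, run on made-up data so that it is self-contained. The seed, the number of observations, and the three-level stand-in for motherland are all assumptions for illustration; with the real data you would -use- your own file instead of generating one.

```stata
* Made-up stand-in for the large dataset (hypothetical values throughout)
clear
set seed 20120813                        // arbitrary seed so the draw is reproducible
set obs 3000
gen motherland = 1 + floor(3*runiform()) // hypothetical country codes 1, 2, 3

* Option 1: keep exactly 100 observations chosen at random, without replacement
preserve
sample 100, count
count
restore

* Option 2: keep a random 10% of the observations within each country
sample 10, by(motherland)
tab motherland
```

Without the count option the number is read as a percentage, so -sample 10- keeps 10% of the data and -sample 100, count- keeps 100 observations. Since -sample- drops observations from memory, -save- the result under a new filename rather than overwriting the original.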

I'll add a caution here -- if the data are -svyset-, I don't think you are supposed to create extracts. Stata needs all the cases in order to get the standard errors right. I've never fully understood why, but Statalist has had various threads explaining why you should use -subpop- rather than -if- for selecting cases (and presumably the same logic applies to extracts).
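As a rough illustration of the -subpop- approach, here is a sketch on made-up survey data. The design variables (stratum, psu, pw), the indicator female, and the outcome y are all hypothetical; substitute your own -svyset- specification.

```stata
* Made-up survey data (all names and values are hypothetical)
clear
set seed 1
set obs 1000
gen stratum = 1 + floor(4*runiform())   // 4 strata
gen psu     = 1 + floor(10*runiform())  // 10 PSUs within each stratum
gen pw      = 1 + 9*runiform()          // sampling weights
gen female  = runiform() < 0.5          // subpopulation indicator (0/1)
gen y       = rnormal()

svyset psu [pweight=pw], strata(stratum)

* All 1,000 cases stay in the estimation sample, so the full design
* is available for the variance calculation; only female==1
* contributes to the estimate itself
svy, subpop(female): mean y
```

The output reports both the total number of observations and the subpopulation count, which is a quick way to check that no cases were dropped before estimation.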


On Mon, Aug 13, 2012 at 10:31 AM, Amal Khanolkar <Amal.Khanolkar@ki.se> wrote:
> Hello all,
>
> I have a very large dataset with almost 3 million subjects. It is great to work with, but a bit difficult to transport or carry with me. I would prefer to create a smaller sub-dataset with, say, 100,000 subjects chosen at random. As I'm interested in studying ethnic differences, I use the variable 'Motherland', which denotes country of birth, in the code below to help create my sub-dataset. However, with the code I'm currently using, I get (I think) the first 100,000 subjects, which is not a random selection. How may I change the code below to choose 100,000 subjects (or any number I wish) at random?
>
> I use the following code to create a subset of my original dataset:
>
> *Creating a subsample of the dataset with say 100,000 subjects*
>
> // create random variable
> gen x = runiform()
>
> // sort by country and x
> sort motherland x
>
> // create a variable within country identifying the first 10% (change this proportion as you wish)
>
> by motherland: gen subsamp = _n <= (_N+0.5)*0.10
>
> tab motherland subsamp, col
>
> // the -if- qualifier must come before the comma that starts the options
> tab motherland kon if magecat!=. & education!=. & famsit_new!=. & smoke1!=. & parity!=. & zscore_gest!=. & MBMI2!=. & mlangd!=. & multibirth==2 & subsamp==1, col
>
>
> Thanks for any help,
>
> /Amal.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/



--

~~~~~~~~~~~~~~~~~~~~~~~~
Le Wang, Ph.D
Assistant Professor
Department of Economics
University of New Hampshire


-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
HOME:   (574)289-5227
EMAIL:  Richard.A.Williams.5@ND.Edu
WWW:    http://www.nd.edu/~rwilliam
