Stata The Stata listserver

st: RE: dividing a data set into estimation and validation sets


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: dividing a data set into estimation and validation sets
Date   Sun, 3 Apr 2005 15:54:28 +0100

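I don't think you need any section
of the manual as support here, but 
FWIW Stata's -sample- doesn't do this: 
it keeps the sampled observations and 
drops the rest, rather than marking 
two subsets within one dataset. 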

The unofficial -swor- (-search swor-)
will do it. 

But best of all is to think from first 
principles. Suppose we decide on 
a validation sample of 500: then we 
should be explicit about a random 
number seed for reproducibility. 
Your seed choice may naturally differ, 
but here's one: 

set seed 280352 

Then we pick some random numbers 
and shuffle: 

gen random = uniform()
sort random 

After the sort, the first 500 
observations are the validation sample: 

gen byte validation = _n <= 500 

Your validation sample has 
-validation- 1 and the other sample has 
-validation- 0. Subsequent analyses
can be restricted to either sample: 

... if validation
... if !validation 
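
Put together, the steps above might look like the sketch below. 
It is only an illustration: the 1500 synthetic observations and 
the variables y, x and yhat are made up for the example and are 
not from the original question, and -regress- stands in for 
whatever model is actually of interest. 

```stata
* Synthetic stand-in for the real dataset (assumption: n = 1500)
clear
set obs 1500
set seed 280352

* Hypothetical outcome and predictor, for illustration only
gen y = uniform()
gen x = uniform()

* Shuffle, then flag the first 500 observations as the validation sample
gen random = uniform()
sort random
gen byte validation = _n <= 500

* Check the split: expect 500 with validation 1, 1000 with validation 0
tab validation

* Fit on the estimation sample, predict for the validation sample
regress y x if !validation
predict yhat if validation
```

Keeping both samples in one dataset, marked by -validation-, 
means nothing is dropped and the same split can be reproduced 
by rerunning with the same seed. In newer releases the function 
is usually written runiform(). 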

Having written that down, I now 
remember that this is already an FAQ: 

How can I take random samples from an existing dataset? 
http://www.stata.com/support/faqs/stat/sampling.html

Nick 
n.j.cox@durham.ac.uk 

Richard Hiscock wrote:

> I would be grateful for some direction to the area in the 
> stata manual 
> that explains how to do the following
> I am trying to split a dataset (n ~1500) into an estimation 
> sample and a 
> validation sample  by random sampling (n = 400-500) from the dataset
> 
> Later I wish to compare results with that using bstrap techniques
