The Stata listserver

Re: st: selecting obs while reading in huge data set


From   Daniel Feenberg <feenberg@nber.org>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: selecting obs while reading in huge data set
Date   Wed, 18 Aug 2004 11:14:55 -0400 (EDT)

On Wed, 18 Aug 2004, Sascha O. Becker wrote:

> Dear stata users,
> 
> I have a huge data set A (2 GB in ASCII) with 40 mio. observations 
> (workers) but only 10 variables. I have another data set B containing 
> information on (a sub-set of) employers and want to select only workers 
> from data set A that are employed in firms in data set B (firm IDs are 
> one variable in data set A).
> 

Perhaps you can read only the employee and firm IDs?

   . insheet empid firmid using mydata

That is only one-fifth of the variables, so it might fit in your computer's
memory. Then merge the result with the firm dataset, keeping only matched
records, and merge again with the employee dataset, again keeping only
matched records.
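
The first merge might look something like the sketch below. It is only an
illustration under stated assumptions: the file names ids.dta (the
two-variable extract), firms.dta (data set B), and matched_ids.dta are
hypothetical, firms.dta is assumed to hold one observation per firmid, and
the merge m:1 syntax assumes Stata 11 or later.

```stata
* Sketch only. Assumptions (not from the original question):
*   ids.dta   - the two-variable extract (empid firmid), already saved
*   firms.dta - data set B, with one observation per firmid
*   merge m:1 requires Stata 11 or later
use ids, clear

* Keep only workers whose firmid also appears in firms.dta
merge m:1 firmid using firms, keep(match) nogenerate

* Drop any firm-level variables that came along; keep just the IDs
keep empid firmid
save matched_ids, replace
```

In older releases the same step would be done by sorting both files on
firmid, merging, and keeping the observations with _merge==3.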

Alternatively, if you used "infile" instead of "insheet", you could use the
"if exp" clause to read in only employees at eligible firms. I suppose,
though, that some limit on the complexity of "exp" would cap the number of
firms you could list. [insheet might also support an if clause, but its
help file does not mention one, as the help file for infile does.]

   . infile varlist using mydata if firmid==123 | firmid==456 ...
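
If your release has the inlist() function, the condition could perhaps be
written more compactly, as in the sketch below. The variable names v3 to v10
and the two firm IDs are placeholders, and the file is assumed to be
free-format ASCII with ten numeric columns per line. inlist() accepts only a
limited number of arguments, so it does not remove the cap on how many firms
you can list.

```stata
* Sketch only: variable names and firm IDs are hypothetical.
* Assumes free-format ASCII with ten numeric columns per observation.
* Observations that fail the if condition are not kept in memory.
infile empid firmid v3 v4 v5 v6 v7 v8 v9 v10 using mydata ///
    if inlist(firmid, 123, 456), clear
```

The /// continuation works in a do-file; typed interactively, the command
would go on one line.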






