Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: selecting obs while reading in huge data set

From   Stas Kolenikov <>
Subject   Re: st: selecting obs while reading in huge data set
Date   Thu, 19 Aug 2004 14:33:22 -0400 (EDT)

--- In statalist, "Sascha O. Becker" <sascha.becker@g...> wrote:
> I have a huge data set A (2 GB in ASCII) with 40 mio. observations
> (workers) but only 10 variables. I have another data set B containing
> information on (a sub-set of) employers and want to select only workers
> from data set A that are employed in firms in data set B (firm IDs are
> one variable in data set A).
> Is there also a way to select observations while reading in data set A,
> i.e. something like
> -insheet datasetA using mydata if firmid in datasetB- ?
> and if yes, is it likely to be faster?

You can try this: Assuming your data set B is already in Stata format, and
the ID in that data set is -tid-; assuming data set A has the employer ID
variable called -firmid-; I want to make this distinction just in case
those are different variable names in two data sets; let's see...

set memory ...m
* load the firm data set into memory
use B
keep firmid
* make sure there are unique observations for each tid
bysort tid : assert _N == 1
local iflist
local N = _N
forvalues n=1/`N' {
   local thistid=tid[`n']
   local iflist `iflist'|firmid==`thistid'

* local macro `iflist' now has the full list of IDs
* provided we did not hit Stata limits on the length
* of a macro (67784 in Intercooled, 1081511 in SE)
* it starts with "|" so we might want to strip it out

gettoken slash iflist : iflist , parse("|")

* we might have done this with the string functions, too,
* but this is probably faster and more reliable
* in terms of Stata limits on the length of strings
* alternatively, we might have used something like
* ... if 0`iflist'
* which would evaluate in the appropriate way

* now, try to get our data in
set memory ...m
infile <list of variables> using A if `iflist'
* note that -insheet- does not support -if- qualifiers,
* so we would have to use -infile-


I don't know how much faster this is going to be, but it seemed workable
to me. Unless Nick J. Cox finds a better solution -- say through his
-listutil- suit of commands :))

And yes, you need to be sure you don't hit Stata limits on the length of
the macro before your file B ends. If you have say 10 digit IDs, then you
are limited to about 65k/20 = 3000 firms. How many have you got?

 ---                                    Stas Kolenikov
 --       Ph.D. student in Statistics at UNC-Chapel Hill
 -  --

* This e-mail and all attachments to it are not intended to provide any
* reasonable point of view and was transmitted to you in error. It
* should be immediately deleted by all recipients unless they really
* enjoy communicating with the author :). Other restrictions apply.

*   For searches and help try:

© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index