[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
Stas Kolenikov <skolenik@email.unc.edu> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: selecting obs while reading in huge data set |

Date |
Thu, 19 Aug 2004 14:33:22 -0400 (EDT) |

--- In statalist, "Sascha O. Becker" <sascha.becker@g...> wrote: > I have a huge data set A (2 GB in ASCII) with 40 mio. observations > (workers) but only 10 variables. I have another data set B containing > information on (a sub-set of) employers and want to select only workers > from data set A that are employed in firms in data set B (firm IDs are > one variable in data set A). > > Is there also a way to select observations while reading in data set A, > i.e. something like > > -insheet datasetA using mydata if firmid in datasetB- ? > > and if yes, is it likely to be faster? You can try this: Assuming your data set B is already in Stata format, and the ID in that data set is -tid-; assuming data set A has the employer ID variable called -firmid-; I want to make this distinction just in case those are different variable names in two data sets; let's see... clear set memory ...m * load the firm data set into memory use B keep firmid * make sure there are unique observations for each tid bysort tid : assert _N == 1 local iflist local N = _N forvalues n=1/`N' { local thistid=tid[`n'] local iflist `iflist'|firmid==`thistid' } * local macro `iflist' now has the full list of IDs * provided we did not hit Stata limits on the length * of a macro (67784 in Intercooled, 1081511 in SE) * it starts with "|" so we might want to strip it out gettoken slash iflist : iflist , parse("|") * we might have done this with the string functions, too, * but this is probably faster and more reliable * in terms of Stata limits on the length of strings * alternatively, we might have used something like * ... if 0`iflist' * which would evaluate in the appropriate way * now, try to get our data in clear set memory ...m infile <list of variables> using A if `iflist' * note that -insheet- does not support -if- qualifiers, * so we would have to use -infile- count exit I don't know how much faster this is going to be, but it seemed workable to me. Unless Nick J. Cox finds a better solution -- say through his -listutil- suit of commands :)) And yes, you need to be sure you don't hit Stata limits on the length of the macro before your file B ends. If you have say 10 digit IDs, then you are limited to about 65k/20 = 3000 firms. How many have you got? --- Stas Kolenikov -- Ph.D. student in Statistics at UNC-Chapel Hill - http://www.komkon.org/~tacik/ -- Stas.Kolenikov@unc.edu * This e-mail and all attachments to it are not intended to provide any * reasonable point of view and was transmitted to you in error. It * should be immediately deleted by all recipients unless they really * enjoy communicating with the author :). Other restrictions apply. * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**Re: st: how to get standard errors from VC** - Next by Date:
**Re: st: question on GLLAMM** - Previous by thread:
**RE: st: selecting obs while reading in huge data set** - Next by thread:
**Re: st: selecting obs while reading in huge data set** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |