[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Working with large data set like a database

From	Eric Booth <[email protected]>
To	[email protected]
Subject	Re: st: Working with large data set like a database
Date	Fri, 1 Jan 2010 11:26:50 -0600

<>

On Jan 1, 2010, at 10:37 AM, David Souther wrote:
> I have a question about working with  very large data sets (combined
> sizes ~ 40 gig) to run analysis when only 6 gig of memory is
> available.

.What format are the data stored in?  If they are already in Stata format (.dta), then you can use the -use- command and specify the variables and observations you want to use, e.g.

use price mpg for in 1/25 if for<1 using auto.dta, clear

This would "query" (as you put it) out only those 3 variables in rows 1/25 where foreign<1.

However, if you were starting from a very large comma/tab-delimited file (and you weren't able to convert the file to Stata format via Stat Transfer & the file was too large to simply -insheet- ), you could use -chunky- (from SSC) to bring in the vars/rows you needed.  The help file for -chunky- has a loop for this process already spelled out.

> I cannot get ""merge"" in stata to accept these kinds
> of date ranges.

What commands did you attempt?  
You can use -nearmrg- (from SSC) to do some date range matches.  Here's an example of how to get it to work with your example data; however, please note a few things:  
(1) I am assuming you want to "update" (see the -merge- help file about this option) extravar1 since its value changes in the "big" dataset; 
(2) my example matches the "sub" dataset record that is closest to & greater than the nearest match in the "big" dataset, then I remove records where the "sub" date is > 90 days from the "big" date

**************start
clear
**"sub" dataset**
input str10 date1 var1 extravar extravar1
"10/22/2008" 3 44 44
"02/01/2001" 5 44 44 
"05/24/2005" 9 44 44 
"12/12/2012" 99 44 44
"12/29/2012" 100 44 44
end
gen date2  = date(date1, "MDY")
sort  date2
drop date1 extravar  
compress
save "using.dta", replace

********
**"big" dataset**
clear
input str10 date1 var2 extravar extravar1
"10/20/2008" 500 44 44
"02/07/2001" 500 44 44
"05/20/2005" 900 44 44
"12/12/2015" 990 44 44
"01/01/1999" 1000 44 44
"01/01/1970" 2000 44 44
"01/01/1970" 2222 44 44
"12/01/2012" 7777 44 55
end
gen date2  = date(date1, "MDY")
sort  date2
drop date1 
compress
save "master.dta", replace

********
**merge**
cap ssc install nearmrg
**using only what's needed**
clear
use date2  extravar1 using "master.dta"
nearmrg  using "using.dta", upper nearvar(date2) genmatch(match)  _merge(merge) update nokeep
compress
rename date2 date1
format date1 %td
format match %td
li date1 match var* extravar1
**keep only within certain date range of "x"=90 days**
g diff = match-date1
li date1 diff
keep if diff<=90     //-matching on 90 days from "big" date
drop diff merge
li *
save "combined.dta", replace
**your analysis here**
reg var1 ext

**************end

Note:   From your post it sounds like you do not want to load all of the 40G of data into memory at one time for analysis (that is, you only want some of the variables loaded in), but depending on the number of observations, your dataset may still be quite large, so you may want to consider the OS, physical memory, and virtual memory on your machine.  See these links:  
http://www.stata.com/statalist/archive/2007-05/msg00326.html
http://www.stata.com/statalist/archive/2009-12/msg00980.html

Good luck, 

Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754

On Jan 1, 2010, at 10:37 AM, David Souther wrote:

> I have a question about working with  very large data sets (combined
> sizes ~ 40 gig) to run analysis when only 6 gig of memory is
> available.
> A second complicating factor is that I need to join some of these data
> sets together based on a date range or similar join rule.  In Oracle,
> I could query out only the columns I need and then join them to other
> files using a rule- such as the dates being within "x" number of days
> of each other.  I cannot get ""merge"" in stata to accept these kinds
> of date ranges.  Here are an example of two datasets to join
> 
> ***subdataset***
> date1 var1 extravar extravar1
> 10/22/2008 3 44 44
> 02/01/2001 5 44 44
> 05/24/2005 9 44 44
> 12/12/2012 99 44 44
> 12/29/2012 100 44 44
> 
> ***big dataset***
> date1 var2 extravar extravar1
> 10/20/2008 500 44 44
> 02/07/2001 500 44 44
> 05/20/2005 900 44 44
> 12/12/2015 990 44 44
> 01/01/1999 1000 44 44
> 01/01/1970 2000 44 44
> 01/01/1970 2222 44 44
> 12/01/2012 7777 44 55
> 
> 
> I need to join by ""date1"" and load up a data set for analysis with
> ONLY ""date1"", ""var1"", and ""extravar1"".  Thanks for helping.
> 
> DFS
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- st: Chow Test for IV estimators in STATA
  - From: Sungbok Lee <[email protected]>

References:
- st: Working with large data set like a database
  - From: David Souther <[email protected]>

Prev by Date: st: GB2 question
Next by Date: st: Esttab question
Previous by thread: st: Working with large data set like a database
Next by thread: st: Chow Test for IV estimators in STATA
Index(es):
- Date
- Thread