[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Eric Booth <ebooth@ppri.tamu.edu> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Working with large data set like a database |

Date |
Fri, 1 Jan 2010 11:26:50 -0600 |

<> On Jan 1, 2010, at 10:37 AM, David Souther wrote: > I have a question about working with very large data sets (combined > sizes ~ 40 gig) to run analysis when only 6 gig of memory is > available. .What format are the data stored in? If they are already in Stata format (.dta), then you can use the -use- command and specify the variables and observations you want to use, e.g. use price mpg for in 1/25 if for<1 using auto.dta, clear This would "query" (as you put it) out only those 3 variables in rows 1/25 where foreign<1. However, if you were starting from a very large comma/tab-delimited file (and you weren't able to convert the file to Stata format via Stat Transfer & the file was too large to simply -insheet- ), you could use -chunky- (from SSC) to bring in the vars/rows you needed. The help file for -chunky- has a loop for this process already spelled out. > I cannot get ""merge"" in stata to accept these kinds > of date ranges. What commands did you attempt? You can use -nearmrg- (from SSC) to do some date range matches. Here's an example of how to get it to work with your example data; however, please note a few things: (1) I am assuming you want to "update" (see the -merge- help file about this option) extravar1 since its value changes in the "big" dataset; (2) my example matches the "sub" dataset record that is closest to & greater than the nearest match in the "big" dataset, then I remove records where the "sub" date is > 90 days from the "big" date **************start clear **"sub" dataset** input str10 date1 var1 extravar extravar1 "10/22/2008" 3 44 44 "02/01/2001" 5 44 44 "05/24/2005" 9 44 44 "12/12/2012" 99 44 44 "12/29/2012" 100 44 44 end gen date2 = date(date1, "MDY") sort date2 drop date1 extravar compress save "using.dta", replace ******** **"big" dataset** clear input str10 date1 var2 extravar extravar1 "10/20/2008" 500 44 44 "02/07/2001" 500 44 44 "05/20/2005" 900 44 44 "12/12/2015" 990 44 44 "01/01/1999" 1000 44 44 "01/01/1970" 2000 44 44 "01/01/1970" 2222 44 44 "12/01/2012" 7777 44 55 end gen date2 = date(date1, "MDY") sort date2 drop date1 compress save "master.dta", replace ******** **merge** cap ssc install nearmrg **using only what's needed** clear use date2 extravar1 using "master.dta" nearmrg using "using.dta", upper nearvar(date2) genmatch(match) _merge(merge) update nokeep compress rename date2 date1 format date1 %td format match %td li date1 match var* extravar1 **keep only within certain date range of "x"=90 days** g diff = match-date1 li date1 diff keep if diff<=90 //-matching on 90 days from "big" date drop diff merge li * save "combined.dta", replace **your analysis here** reg var1 ext **************end Note: From your post it sounds like you do not want to load all of the 40G of data into memory at one time for analysis (that is, you only want some of the variables loaded in), but depending on the number of observations, your dataset may still be quite large, so you may want to consider the OS, physical memory, and virtual memory on your machine. See these links: http://www.stata.com/statalist/archive/2007-05/msg00326.html http://www.stata.com/statalist/archive/2009-12/msg00980.html Good luck, Eric __ Eric A. Booth Public Policy Research Institute Texas A&M University ebooth@ppri.tamu.edu Office: +979.845.6754 On Jan 1, 2010, at 10:37 AM, David Souther wrote: > I have a question about working with very large data sets (combined > sizes ~ 40 gig) to run analysis when only 6 gig of memory is > available. > A second complicating factor is that I need to join some of these data > sets together based on a date range or similar join rule. In Oracle, > I could query out only the columns I need and then join them to other > files using a rule- such as the dates being within "x" number of days > of each other. I cannot get ""merge"" in stata to accept these kinds > of date ranges. Here are an example of two datasets to join > > ***subdataset*** > date1 var1 extravar extravar1 > 10/22/2008 3 44 44 > 02/01/2001 5 44 44 > 05/24/2005 9 44 44 > 12/12/2012 99 44 44 > 12/29/2012 100 44 44 > > ***big dataset*** > date1 var2 extravar extravar1 > 10/20/2008 500 44 44 > 02/07/2001 500 44 44 > 05/20/2005 900 44 44 > 12/12/2015 990 44 44 > 01/01/1999 1000 44 44 > 01/01/1970 2000 44 44 > 01/01/1970 2222 44 44 > 12/01/2012 7777 44 55 > > > I need to join by ""date1"" and load up a data set for analysis with > ONLY ""date1"", ""var1"", and ""extravar1"". Thanks for helping. > > DFS > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**st: Chow Test for IV estimators in STATA***From:*Sungbok Lee <javanfish@tamu.edu>

**References**:**st: Working with large data set like a database***From:*David Souther <davidsoutheremail@gmail.com>

- Prev by Date:
**st: GB2 question** - Next by Date:
**st: Esttab question** - Previous by thread:
**st: Working with large data set like a database** - Next by thread:
**st: Chow Test for IV estimators in STATA** - Index(es):

© Copyright 1996–2015 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |