Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: RE: RE: joinby command and memory issues


From   Eric Booth <ebooth@ppri.tamu.edu>
To   "<statalist@hsphsun2.harvard.edu>" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: RE: RE: joinby command and memory issues
Date   Mon, 11 Oct 2010 14:03:05 +0000

<>

Yes, I'm still convinced that your issue is the lack of memory on your computer.

As mentioned in my last post, -joinby- needs more memory to operate than -merge-.   While you might be able to do a -merge- with 1100m and your data, you will not be able to do a join (see this thread for more on memory and -joinby-:http://www.stata.com/statalist/archive/2003-08/msg00539.html).  
Also, I suggested in my previous post that you should try to break up your dataset, join the data, and then append them together (and Austin Nichols echoed this suggestion).  

I think the central point here is that you are still not convinced that the issue is the memory limit of your system   (by the way, you never mention what system configuration are you using  e.g., what version and flavor of Stata?, what OS (Windows, Mac?), how much physical RAM on your machine?, 32-bit or 64-bit Stata?).   
In the example below, I extend upon my previous post's example in order to find the minimum amounts of memory you will need to perform the suggested solutions with your datasets.  First, I try to replicate the properties of your dataset as I understand them from your posts (including the same number & size of variables, the same number of observations, the same number of unique study ids for the join, etc).  
Next, I  find  (1) the minimum amount of memory needed to perform a -joinby- with this data,  (2) the minimum amount of memory to perform a m:m merge (though this doesn't produce the result you want), and  (3) the minimum amount of memory needed to break up the data into pieces, join them and then append them.   

Spoiler:  I find that you need at least 3900m to use -joinby- to combine your data, you can perform a m:m merge with the 1050m that you indicate is available for your system, and the process in (3) requires slightly more memory than you already have (about 1280m).  


(For the example below, I am using Stata 11.1 MP for Mac OSX)

******************!

//THIS 1st PART IS FROM LAST TIME//
clear
**create master (hgb0209)**
inp study_id str11(ord_date) result
1  "01/02/2009" 1 
2  "01/02/2009" 0
2  "01/04/2009" 0 
3  "01/05/2009" 2 
3  "01/06/2009" 1 
3  "01/07/2009" 1 
3  "01/08/2009" 0 
4  "01/02/2009" 1 
5  "01/01/2009" 1 
6  "01/07/2009" 0 
7  "01/07/2009" 0 
end
g ord_date2 = date(ord_date, "MDY")
format ord_date2 %td
drop ord_date
unique study_id
sa hgb.dta, replace

**create using (epo0209)**
clear
inp study_id str11(rec_date)
1  "01/02/2009" 
1  "01/04/2009" 
2  "01/05/2009" 
2  "01/06/2009" 
3  "02/24/2009" 
3  "01/25/2009" 
4  "01/12/2009" 
5  "01/05/2009" 
5  "01/10/2009" 
98  "01/20/2009" 
99  "01/20/2009" 
100  "01/20/2009" 
end
unique study_id
g rec_date2 = date(rec_date, "MDY")
format rec_date2 %td
drop rec_date
compress

sa epo.dta, replace




*******************
//NEW CODE -> 

clear
set virtual off

//EXPAND DATA TO MATCH DESCRIPTION//
set mem 1050m
u "hgb.dta", clear
memory
expand 676246
replace study_id =  1+int((26000-1+1)*runiform())
unique study_id
memory
desc, sh
sa "hgb_expanded.dta", replace


u "epo.dta", clear
memory
desc, sh
expand 33867
replace study_id =  1+int((36000-1+1)*runiform())
unique study_id
memory
desc, sh
sa "epo_expanded.dta", replace


*******************(1)
// (1)   JOINBY //
clear
	**none of these worked**
	*set mem 2000m
	*set mem 2500m
	*set mem 3000m
	*set mem 3500m
	*set mem 3700m
	*set mem 3800m


set mem 3900m
u "hgb_expanded.dta", clear
memory
joinby  study_id using "epo_expanded.dta", unmatched(none)
**this works**
*******************(1)




*******************(2)
//  (2) Merge  //
clear
set mem 1050m
u "hgb_expanded.dta", clear
memory
merge m:m  study_id using "epo_expanded.dta"
**this works**
*******************(2)



*******************(3)
// (3) BREAK UP DATA & APPEND AFTER JOINBY //

**create new empty MASTER**
clear*
	**these do not work**
	*set mem 1050m
	*set mem 1080m
	*set mem 1100m
	*set mem 1150m
	*set mem 1200m
	*set mem 1250m
	*set mem 1270m

set mem 1280m


sa "MASTER.dta", replace emptyok
**this is your rangelist of study_ids
**new rangelist using -numlist-**
u "epo_expanded.dta", clear
qui su study_id
local step = `r(max)'/20
numlist "`r(min)'(`step')`r(max)'"
   display "`r(numlist)'"
local rangelist `r(numlist)'

token "`rangelist'"
while "`1'" != "" {
	di "Range:  `1' -- `2'"
	foreach d in hgb epo {
	**create subset data**
	u "`d'_expanded.dta", clear
	keep if inrange(study_id, `1', `2')
	*ta study_id
	sa "`d'_`1'_`2'.dta", replace emptyok
	}
	**joinby**
	u hgb_`1'_`2'.dta, clear
	joinby study_id using "epo_`1'_`2'.dta", unmatched(none) _merge(_merge)
	ta _merge
	sort stu
	drop _merge
	sa "joined_`1'_`2'.dta", replace
	**append to MASTER**
	u "MASTER.dta", clear
	append using "joined_`1'_`2'.dta"
	sa "MASTER.dta", replace
	erase "joined_`1'_`2'.dta"
**i added -erase- to remove the pieces**
	mac shift 2
	}

u MASTER.dta, clear
memory
desc, sh
******************!
This entire code takes a while to run (about 25 minutes).  
Also, be sure to watch for wrapping issues.


- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
Office: +979.845.6754



On Oct 8, 2010, at 4:32 PM, Weichle, Thomas wrote:

> Does this demonstrate that using this method is limited by my system?
> 
> The max memory appears to be right around 1050m.  I read in the original
> datasets, drop unnecessary variables, compress the data, and then save
> them.  After that, I perform the joinby and still see the error code.
> 
<snip>
> 
> Tom Weichle
> Math Statistician
> Center for Management of Complex Chronic Care (CMC3)
> Hines VA Hospital, Bldg 1, C202
> 708-202-8387 ext. 24261
> Thomas.Weichle@va.gov 
> 
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/




*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index