Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: RE: RE: joinby command and memory issues

From   Eric Booth <[email protected]>
To   "<[email protected]>" <[email protected]>
Subject   Re: st: RE: RE: joinby command and memory issues
Date   Mon, 11 Oct 2010 14:03:05 +0000


Yes, I'm still convinced that your issue is the lack of memory on your computer.

As mentioned in my last post, -joinby- needs more memory to operate than -merge-.   While you might be able to do a -merge- with 1100m and your data, you will not be able to do a join (see this thread for more on memory and -joinby-:  
Also, I suggested in my previous post that you should try to break up your dataset, join the data, and then append them together (and Austin Nichols echoed this suggestion).  

I think the central point here is that you are still not convinced that the issue is the memory limit of your system   (by the way, you never mention what system configuration are you using  e.g., what version and flavor of Stata?, what OS (Windows, Mac?), how much physical RAM on your machine?, 32-bit or 64-bit Stata?).   
In the example below, I extend upon my previous post's example in order to find the minimum amounts of memory you will need to perform the suggested solutions with your datasets.  First, I try to replicate the properties of your dataset as I understand them from your posts (including the same number & size of variables, the same number of observations, the same number of unique study ids for the join, etc).  
Next, I  find  (1) the minimum amount of memory needed to perform a -joinby- with this data,  (2) the minimum amount of memory to perform a m:m merge (though this doesn't produce the result you want), and  (3) the minimum amount of memory needed to break up the data into pieces, join them and then append them.   

Spoiler:  I find that you need at least 3900m to use -joinby- to combine your data, you can perform a m:m merge with the 1050m that you indicate is available for your system, and the process in (3) requires slightly more memory than you already have (about 1280m).  

(For the example below, I am using Stata 11.1 MP for Mac OSX)


**create master (hgb0209)**
inp study_id str11(ord_date) result
1  "01/02/2009" 1 
2  "01/02/2009" 0
2  "01/04/2009" 0 
3  "01/05/2009" 2 
3  "01/06/2009" 1 
3  "01/07/2009" 1 
3  "01/08/2009" 0 
4  "01/02/2009" 1 
5  "01/01/2009" 1 
6  "01/07/2009" 0 
7  "01/07/2009" 0 
g ord_date2 = date(ord_date, "MDY")
format ord_date2 %td
drop ord_date
unique study_id
sa hgb.dta, replace

**create using (epo0209)**
inp study_id str11(rec_date)
1  "01/02/2009" 
1  "01/04/2009" 
2  "01/05/2009" 
2  "01/06/2009" 
3  "02/24/2009" 
3  "01/25/2009" 
4  "01/12/2009" 
5  "01/05/2009" 
5  "01/10/2009" 
98  "01/20/2009" 
99  "01/20/2009" 
100  "01/20/2009" 
unique study_id
g rec_date2 = date(rec_date, "MDY")
format rec_date2 %td
drop rec_date

sa epo.dta, replace

//NEW CODE -> 

set virtual off

set mem 1050m
u "hgb.dta", clear
expand 676246
replace study_id =  1+int((26000-1+1)*runiform())
unique study_id
desc, sh
sa "hgb_expanded.dta", replace

u "epo.dta", clear
desc, sh
expand 33867
replace study_id =  1+int((36000-1+1)*runiform())
unique study_id
desc, sh
sa "epo_expanded.dta", replace

// (1)   JOINBY //
	**none of these worked**
	*set mem 2000m
	*set mem 2500m
	*set mem 3000m
	*set mem 3500m
	*set mem 3700m
	*set mem 3800m

set mem 3900m
u "hgb_expanded.dta", clear
joinby  study_id using "epo_expanded.dta", unmatched(none)
**this works**

//  (2) Merge  //
set mem 1050m
u "hgb_expanded.dta", clear
merge m:m  study_id using "epo_expanded.dta"
**this works**


**create new empty MASTER**
	**these do not work**
	*set mem 1050m
	*set mem 1080m
	*set mem 1100m
	*set mem 1150m
	*set mem 1200m
	*set mem 1250m
	*set mem 1270m

set mem 1280m

sa "MASTER.dta", replace emptyok
**this is your rangelist of study_ids
**new rangelist using -numlist-**
u "epo_expanded.dta", clear
qui su study_id
local step = `r(max)'/20
numlist "`r(min)'(`step')`r(max)'"
   display "`r(numlist)'"
local rangelist `r(numlist)'

token "`rangelist'"
while "`1'" != "" {
	di "Range:  `1' -- `2'"
	foreach d in hgb epo {
	**create subset data**
	u "`d'_expanded.dta", clear
	keep if inrange(study_id, `1', `2')
	*ta study_id
	sa "`d'_`1'_`2'.dta", replace emptyok
	u hgb_`1'_`2'.dta, clear
	joinby study_id using "epo_`1'_`2'.dta", unmatched(none) _merge(_merge)
	ta _merge
	sort stu
	drop _merge
	sa "joined_`1'_`2'.dta", replace
	**append to MASTER**
	u "MASTER.dta", clear
	append using "joined_`1'_`2'.dta"
	sa "MASTER.dta", replace
	erase "joined_`1'_`2'.dta"
**i added -erase- to remove the pieces**
	mac shift 2

u MASTER.dta, clear
desc, sh
This entire code takes a while to run (about 25 minutes).  
Also, be sure to watch for wrapping issues.

- Eric

Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754

On Oct 8, 2010, at 4:32 PM, Weichle, Thomas wrote:

> Does this demonstrate that using this method is limited by my system?
> The max memory appears to be right around 1050m.  I read in the original
> datasets, drop unnecessary variables, compress the data, and then save
> them.  After that, I perform the joinby and still see the error code.
> Tom Weichle
> Math Statistician
> Center for Management of Complex Chronic Care (CMC3)
> Hines VA Hospital, Bldg 1, C202
> 708-202-8387 ext. 24261
> [email protected] 
> *
> *   For searches and help try:
> *
> *
> *

*   For searches and help try:

© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index