
Re: st: RE: RE: joinby command and memory issues

From   Eric Booth <[email protected]>
To   "<[email protected]>" <[email protected]>
Subject   Re: st: RE: RE: joinby command and memory issues
Date   Fri, 8 Oct 2010 20:33:14 +0000


On Oct 8, 2010, Weichle, Thomas wrote:
>   Maybe this is beyond the capabilities of
> Stata.  

No, Stata can easily handle a join/merge of this size; it is limited only by your system's memory.

If you've dropped all the variables you can drop, -compress-ed your variables, and set your memory allocation to its maximum (a previous Statalist thread has tips on maxing out your memory), then the most likely solution is to get a system with more memory.
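In the (pre-Stata 12) syntax used in this thread, that checklist looks something like the sketch below; the 2g figure and the -keep- list are placeholders for whatever your system and your analysis actually allow:

```
**sketch of the pre-join cleanup; adjust the memory size and varlist**
clear
set mem 2g                        // -set mem- requires no data in memory
u hgb.dta
keep study_id ord_date2 result    // keep only the variables you need
compress                          // downcast storage types where possible
sa hgb.dta, replace
```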

> Any other suggestions would be welcome.  would doing a
> many-to-many merge result in the same dataset (all pairwise
> combinations)?

No, here's an example of how -joinby- and -merge- would merge your datasets (as you've described them):


**create master (hgb0209)**
clear
inp study_id str11(ord_date) result
1  "01/02/2009" 1 
2  "01/02/2009" 0
2  "01/04/2009" 0 
3  "01/05/2009" 2 
3  "01/06/2009" 1 
3  "01/07/2009" 1 
3  "01/08/2009" 0 
4  "01/02/2009" 1 
5  "01/01/2009" 1 
6  "01/07/2009" 0 
7  "01/07/2009" 0 
end
 g ord_date2 = date(ord_date, "MDY")
 format ord_date2 %td
 drop ord_date
 unique study_id   // -unique- is user-written (ssc install unique)
sa hgb.dta, replace

**create using (epo0209)**
clear
inp study_id str11(rec_date)
1  "01/02/2009" 
1  "01/04/2009" 
2  "01/05/2009" 
2  "01/06/2009" 
3  "02/24/2009" 
3  "01/25/2009" 
4  "01/12/2009" 
5  "01/05/2009" 
5  "01/10/2009" 
98  "01/20/2009" 
99  "01/20/2009" 
100  "01/20/2009" 
end
 unique study_id
 g rec_date2 = date(rec_date, "MDY")
 format rec_date2 %td
 drop rec_date
sa epo.dta, replace

**joinby v. merge m:m**

u hgb.dta, clear
	joinby study_id using "epo.dta", unmatched(none) _merge(_merge)
	ta _merge
	g datediff = rec_date2 - ord_date2
	sort study_id

u hgb.dta, clear
	merge m:m study_id using "epo.dta"
	ta _m
	g datediff = rec_date2 - ord_date2
	sort study_id

Notice how study_id's 2 and 3 are treated by -joinby- vs. -merge m:m-.  My guess is that you want to use -joinby-.
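To make that concrete: study_id 2 has 2 rows in hgb and 2 rows in epo, so -joinby- produces all 2 x 2 = 4 pairwise rows, while -merge m:m- pairs rows by their sort order within study_id and produces only 2.  You can check this with a quick -count- after each command above:

```
**after the -joinby- run:
count if study_id == 2    // expect 4 (all pairwise combinations)

**after the -merge m:m- run:
count if study_id == 2    // expect 2 (rows paired by order, not crossed)
```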

Finally, judging by my system's performance monitor (e.g. Task Manager in Windows or Activity Monitor in OS X) during a -joinby-, Stata seems to take on a lot of overhead (nearly double) during this process.  I think it loads the "using" dataset into memory in addition to the already-open "master" dataset, and that's when you get the r(910) error -- others can correct me if I'm wrong about what Stata is doing during -joinby-.
So, if you've got enough memory to open a joined hgb & epo dataset, but not enough memory to open hgb and then join epo to it, then a work-around *might* be to:
(1) break hgb and epo up into sub-sections based on study_id,
(2) join these smaller sections by study_id, and then
(3) append the joined datasets.



*** continued from example data above ***

**create new empty MASTER**
clear
sa "MASTER.dta", replace emptyok

**this is your rangelist of study_ids
**so your first range is 0 to 2
**next is 3 to 4, and then 5 to 100

local rangelist 0 2 3 4 5 100

tokenize "`rangelist'"
while "`1'" != "" {
	di "Range:  `1' -- `2'"
	foreach d in hgb epo {
		**create subset data**
		u "`d'.dta", clear
		keep if inrange(study_id, `1', `2')
		ta study_id
		sa "`d'_`1'_`2'.dta", replace emptyok
	}
	**join the subsets**
	u "hgb_`1'_`2'.dta", clear
	joinby study_id using "epo_`1'_`2'.dta", unmatched(none) _merge(_merge)
	ta _merge
	sort study_id
	drop _merge
	sa "joined_`1'_`2'.dta", replace
	**append to MASTER**
	u "MASTER.dta", clear
	append using "joined_`1'_`2'.dta"
	sa "MASTER.dta", replace
	mac shift 2
}

u MASTER.dta, clear
g datediff = rec_date2 - ord_date2
**this matches the -joinby- above**
You'll want to think about how to select subsets of study_id's from your datasets...if there are lots of study_id's, it's probably better to create your ranges with -numlist- or a -forvalues- loop.  Again, this will only work if your system has enough memory to hold a completely joined version of your dataset (plus overhead).  
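For instance, a -forvalues- version of the range loop might look like the sketch below; the assumption that study_id's run from 1 to 100 and the chunk width of 25 are both placeholders you'd adjust to your data:

```
**sketch: build ranges with -forvalues- instead of a hand-typed rangelist**
**assumes study_id's run 1-100; the chunk width of 25 is arbitrary**
forvalues lo = 1(25)100 {
	local hi = `lo' + 24
	di "Range:  `lo' -- `hi'"
	**...same subset/join/append steps as above, using `lo' and `hi'...**
}
```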

- Eric

Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754
Fax: +979.845.0249

On Oct 8, 2010, at 12:02 PM, Weichle, Thomas wrote:
> Sorry for the confusion.  I was successful in dropping the other
> variables in epo.dta resulting in 2 variables that remain (study_id,
> receipt date).  Then, I tried to join epo.dta with hgb.dta (which is in
> memory) containing 3 variables (study_id, test date, test result).
> After trying the joinby command is where I still received the error
> saying that there is no room to add more observations.
> --- From Nick Cox ---
> This is unclear to me: 
> 1. What does "try to drop" mean? Did -drop- work? If not, what error
> messages did you get? Or why say "try"? 
> 2. This description doesn't seem to match your earlier one just a few
> minutes ago. Here epo.dta has 2 variables. A few minutes ago it included
> other variables you didn't want and wanted to know how to ignore. 
> I don't think people can easily give good advice if the precise problem
> is not clear. 
> As in an earlier thread today, the limits of Stata are given by -help
> limits-. Almost always, the memory available on your machine is what
> really bites. 
> Nick 
> [email protected] 
> Tom Weichle
> Math Statistician
> Center for Management of Complex Chronic Care (CMC3)
> Hines VA Hospital, Bldg 1, C202
> 708-202-8387 ext. 24261
> [email protected] 