
Re: st: RE: RE: joinby command and memory issues


From   Eric Booth <ebooth@ppri.tamu.edu>
To   "<statalist@hsphsun2.harvard.edu>" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: RE: RE: joinby command and memory issues
Date   Mon, 11 Oct 2010 14:30:08 +0000

<>

One more idea.  In your first post, you mention that the rationale for the pairwise join was to make sure that receipts were within 7 days of the test.  You can perform a -joinby- within your 1050m limit if you first aggregate some of the information in epo.dta before the -joinby- (assuming this still gives you what you need for your analysis).
There are two ways I can think of to do this:
(1) if you are only interested in the latest/max receipt date, you can take the max by study_id and then drop the other cases, or
(2) if you need to keep all the receipt dates, you can still reduce the size and memory needed for the join by -reshape-ing your data wide, which puts all of a study_id's receipt dates in one row.  You can then compare receipt and test dates across the row.
Either solution greatly reduces the size of epo.dta and allows for a -joinby- with 1050m.

Again, building on the previous example, here are the two solutions using only 1050m:

*********************!
******************!

//THIS 1st PART IS FROM LAST TIME//
clear
**create master (hgb0209)**
inp study_id str11(ord_date) result
1  "01/02/2009" 1 
2  "01/02/2009" 0
2  "01/04/2009" 0 
3  "01/05/2009" 2 
3  "01/06/2009" 1 
3  "01/07/2009" 1 
3  "01/08/2009" 0 
4  "01/02/2009" 1 
5  "01/01/2009" 1 
6  "01/07/2009" 0 
7  "01/07/2009" 0 
end
g ord_date2 = date(ord_date, "MDY")
format ord_date2 %td
drop ord_date
unique study_id   //-unique- is user-written; install via: ssc install unique
sa hgb.dta, replace

**create using (epo0209)**
clear
inp study_id str11(rec_date)
1  "01/02/2009" 
1  "01/04/2009" 
2  "01/05/2009" 
2  "01/06/2009" 
3  "02/24/2009" 
3  "01/25/2009" 
4  "01/12/2009" 
5  "01/05/2009" 
5  "01/10/2009" 
98  "01/20/2009" 
99  "01/20/2009" 
100  "01/20/2009" 
end
unique study_id
g rec_date2 = date(rec_date, "MDY")
format rec_date2 %td
drop rec_date
compress

sa epo.dta, replace

clear
set virtual off

//EXPAND DATA TO MATCH DESCRIPTION//
set mem 1050m
u "hgb.dta", clear
memory
expand 676246
replace study_id =  1+int((26000-1+1)*runiform())
unique study_id
memory
desc, sh
sa "hgb_expanded.dta", replace




*******************
//CHANGED  CODE -> 



*******************(1)
// (1) KEEPING THE MAX RECORD ONLY //
u "epo.dta", clear
memory
desc, sh
expand 33867
replace study_id =  1+int((36000-1+1)*runiform())
unique study_id
memory
desc, sh
rename rec_date2 rec_date
bys study_id: g number_rec = _N
bys study_id: egen max_rec_date = max(rec_date)
format max_rec_date  %td
bys study_id: keep if _n==1   //keep one row per study_id (max & count already computed)
drop rec_date
memory 
desc, sh
sa "epo_expanded.dta", replace

//  JOINBY //
clear
set mem 1050m
u "hgb_expanded.dta", clear
memory
joinby  study_id using "epo_expanded.dta", unmatched(none)
**THIS WORKS**
*******************(1)
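(To finish solution (1), the 7-day check from your first post can then be applied directly to the joined data.  This is just a sketch; it assumes "within 7 days" means within 7 days on either side of the test date, so adjust the condition if you only want receipts before or after the test:)

```stata
* keep only test/receipt pairs where the latest receipt is within 7 days of the test
keep if abs(ord_date2 - max_rec_date) <= 7
```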




*******************(2)
// (2) RESHAPE //
u "epo.dta", clear
memory
desc, sh
expand 33867
replace study_id =  1+int((36000-1+1)*runiform())
unique study_id
memory
desc, sh
rename rec_date2 rec_date
bys study_id: g i = _n
bys study_id: g number_rec = _N
reshape wide rec_date, i(study_id) j(i)
memory 
desc, sh
sa "epo_expanded.dta", replace

//  JOINBY //
clear
set mem 1050m
u "hgb_expanded.dta", clear
memory
joinby  study_id using "epo_expanded.dta", unmatched(none)
*this works*
*******************(2)
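(For solution (2), the across-the-row comparison mentioned above might look something like the sketch below.  It assumes the wide receipt dates are named rec_date1, rec_date2, ..., as produced by the -reshape- above, and again treats "within 7 days" as either side of the test date:)

```stata
* after the joinby, flag tests that have any receipt within 7 days
g byte any_rec_7d = 0
foreach v of varlist rec_date* {
    replace any_rec_7d = 1 if abs(ord_date2 - `v') <= 7 & !missing(`v')
}
```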

*********************!

- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
Office: +979.845.6754
Fax: +979.845.0249
http://ppri.tamu.edu




On Oct 11, 2010, at 9:03 AM, Eric Booth wrote:

> <>
> 
> Yes, I'm still convinced that your issue is the lack of memory on your computer.
> 
> As mentioned in my last post, -joinby- needs more memory to operate than -merge-.   While you might be able to do a -merge- with 1100m and your data, you will not be able to do a join (see this thread for more on memory and -joinby-: http://www.stata.com/statalist/archive/2003-08/msg00539.html).  
> Also, I suggested in my previous post that you should try to break up your dataset, join the data, and then append them together (and Austin Nichols echoed this suggestion).  
> 
> I think the central point here is that you are still not convinced that the issue is the memory limit of your system.  (By the way, you never mention what system configuration you are using, e.g., what version and flavor of Stata, what OS (Windows, Mac?), how much physical RAM on your machine, and 32-bit or 64-bit Stata?)
> In the example below, I extend my previous post's example to find the minimum amounts of memory you will need to perform the suggested solutions with your datasets.  First, I try to replicate the properties of your dataset as I understand them from your posts (including the same number & size of variables, the same number of observations, the same number of unique study ids for the join, etc.).  
> Next, I find (1) the minimum amount of memory needed to perform a -joinby- with this data, (2) the minimum amount of memory to perform a m:m merge (though this doesn't produce the result you want), and (3) the minimum amount of memory needed to break up the data into pieces, join them, and then append them.   
> 
> Spoiler:  I find that you need at least 3900m to use -joinby- to combine your data, that you can perform a m:m merge with the 1050m you indicate is available on your system, and that the process in (3) requires slightly more memory than you already have (about 1280m).  
> 
> 
> (For the example below, I am using Stata 11.1 MP for Mac OSX)

> <snip>

> - Eric
> 
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> ebooth@ppri.tamu.edu
> Office: +979.845.6754
> 
> 
> 
> On Oct 8, 2010, at 4:32 PM, Weichle, Thomas wrote:
> 
>> Does this demonstrate that using this method is limited by my system?
>> 
>> The max memory appears to be right around 1050m.  I read in the original
>> datasets, drop unnecessary variables, compress the data, and then save
>> them.  After that, I perform the joinby and still see the error code.
>> 
> <snip>
>> 
>> Tom Weichle
>> Math Statistician
>> Center for Management of Complex Chronic Care (CMC3)
>> Hines VA Hospital, Bldg 1, C202
>> 708-202-8387 ext. 24261
>> Thomas.Weichle@va.gov 
>> 
>> 



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
