Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

st: RE: m:1 merge with string function, data set too large?

From	Joe Canner <[email protected]>
To	"[email protected]" <[email protected]>
Subject	st: RE: m:1 merge with string function, data set too large?
Date	Fri, 23 Aug 2013 17:05:34 +0000

David,

I know you didn't actually ask for help, but you got my curiosity up.  I am very skeptical that Stata had a problem with this merge because you had too much data or because you were using a string variable.

What do you mean by "deleted all the data parameters from the master file"?

Also, how is the variable "round" defined in the -using- dataset?  If the -using- dataset does not have an observation for each household (uni)-round combination, you could get strange results like the ones you posted.  However, that would be an odd structure (i.e., a household-village linkage file duplicated for each round), so I suspect what you really want is:

. merge m:1 uni using "filename"

I'm not sure how what you did solved the problem, but I suspect you may have similar problems in the future if you are not adequately accounting for the structure of your files when you do a merge.
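
As a minimal sketch of that kind of check (the toy data, the variable "income", and the tempfile name "villages" are all hypothetical and not taken from David's files), one could confirm the key structure of each file with -isid- before merging on uni alone:

```stata
* Hypothetical using file: exactly one row per household
clear
input str4 uni str8 village
"H001" "Vname1"
"H002" "Vname1"
"H003" "Vname2"
end
* isid exits with an error if uni does not uniquely identify the rows
isid uni
tempfile villages
save `villages'

* Hypothetical master file: one row per household per round
clear
input str4 uni round income
"H001" 1 100
"H001" 2 110
"H002" 1  90
"H002" 2  95
"H003" 1 120
"H003" 2 130
end
* The master should be unique on household and round together
isid uni round

* Many master rows per household, one using row per household
merge m:1 uni using `villages'
tabulate round _merge
```

With uni as the only key, every round of a household picks up the same village. If round were added to the key, only the household-round combinations that actually exist in the -using- file would match.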

Regards,
Joe Canner
Johns Hopkins University School of Medicine

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of David Fredericks
Sent: Friday, August 23, 2013 2:27 AM
To: [email protected]
Subject: st: m:1 merge with string function, data set too large?

Dear all

I just spent a frustrating morning trying to undertake an m:1 merge on a string variable (uni = unique household identifier) using Stata 11.2m on an Asus laptop with an i7 processor, 4 GB of RAM, and a 64-bit operating system running Windows 7.

I have a large household data set with 7 rounds of data for each household
(master) and wished to merge this with another file that linked the unique identifier for each household to a village name (using file).

I used the command:
merge m:1 uni round using "filename"

And that produced some funny results. 


    Result                           # of obs.
    -----------------------------------------
    not matched                           146
        from master                       126  (_merge==1)
        from using                         20  (_merge==2)

    matched                             2,145  (_merge==3)
    -----------------------------------------

I should have a village name for 2,145 households.  However, I only got 331 village names matched for one round of data.

    Village |      Freq.     Percent        Cum.
------------+-----------------------------------
     Vname1 |         12        3.70        3.70
     Vname2 |         23        7.10       10.80
     Vname3 |         22        6.79       17.59
     Vname4 |         22        6.79       24.38
     Vname5 |         16        4.94       29.32
     Vname6 |         22        6.79       36.11
     Vname7 |         16        4.94       41.05
     Vname8 |         18        5.56       46.60
     Vname9 |         40       12.35       58.95
    Vname10 |         16        4.94       63.89
    Vname11 |         22        6.79       70.68
    Vname12 |         30        9.26       79.94
    Vname13 |         53       16.36       96.30
    Vname14 |         12        3.70      100.00
------------+-----------------------------------
      Total |        324      100.00

It was not a problem with leading or trailing spaces.

It seems to have been a problem with the size of the master data set and the use of an m:1 merge (and possibly the fact that it was a string merge, and memory allocation).
When I deleted all the data parameters from the master file I was able to successfully merge the two data sets using an m:1 merge.
After that I was able to merge the original large data set with the file containing the round, the unique household identifier and the village name using a 1:1 merge.

The data set was large (for me)

obs:         2,271                          
vars:           301                          23 Aug 2013 11:12
size:     3,987,876 (99.2% of memory free)

I could not find any other reference to this problem on the net, so I have posted the problem/solution (for me) here.  Of course I would be much better off without the string identifier for the household.

I'm sorry I can't post/share the data files to replicate this problem but this may help someone at some stage.

df

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
