Statalist The Stata Listserver



st: Re: label / macro problem


From   evan roberts <[email protected]>
To   [email protected]
Subject   st: Re: label / macro problem
Date   Sat, 12 Aug 2006 09:23:53 -0400

Jeph Herrin asked how to generate a consistently encoded string variable across a large number of files, where he knew in advance that the number of unique strings was substantially smaller than the total number of cases. One solution to this kind of problem is to:
(1) Collapse or contract each file to a smaller file that contains only the unique strings.
(2) Append these smaller files into one file of the unique strings and encode.
(3) Reduce that file so it contains two variables, string and code, with one observation per unique string. Some people call this a "data dictionary".
(4) Merge the dictionary with each of the original files in turn to apply the codes.

Here is an example with the auto data.

* Generate some test data
forvalues i=1(1)3 {
    sysuse auto, clear
    expand 3
    save bigauto`i', replace
}
(1978 Automobile Data)
(148 observations created)
file bigauto1.dta saved
(1978 Automobile Data)
(148 observations created)
file bigauto2.dta saved
(1978 Automobile Data)
(148 observations created)
file bigauto3.dta saved

* (1) Contract to the unique strings in each file

forvalues i=1(1)3 {
    use bigauto`i', clear
    sort make
    contract make
    keep make
    save littleauto`i', replace
}


* (2-3) Append into one file to create the dictionary
* After each append we contract to the unique strings (make in this case) so that the saved dictionary never has any duplicates.
forvalues i=2(1)3 {
    use littleauto1, clear
    append using littleauto`i'
    contract make
    keep make
    save littleauto1, replace
    * erase needs the file extension
    erase littleauto`i'.dta
}

encode make, gen(make_code) label(makelbl)
* Detach the value label so that list shows the numeric codes
lab values make_code

list, abbreviate(12)
     | make                make_code |
     |-------------------------------|
  1. | AMC Concord                 1 |
  2. | AMC Pacer                   2 |
  3. | AMC Spirit                  3 |
  4. | Audi 5000                   4 |
  5. | Audi Fox                    5 |
     |-------------------------------|
  6. | BMW 320i                    6 |
  7. | Buick Century               7 |

sort make
save auto_dict
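
A quick sanity check on the dictionary before using it, as a minimal sketch that assumes auto_dict.dta was saved as above. -isid- exits with an error if the named variable does not uniquely identify the observations, so if both commands pass, each string has exactly one code and each code exactly one string.

```stata
* Assumes auto_dict.dta exists in the current directory (saved above)
use auto_dict, clear
* Stops with an error if any make appears more than once
isid make
* Stops with an error if any code appears more than once
isid make_code
```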

* (4) Re-merge with the original data to encode consistently across the files
forvalues i=1(1)3 {
    use bigauto`i', clear
    sort make
    merge make using auto_dict, nokeep
    * Check that every case found its code before overwriting the file
    assert _merge==3
    save bigauto`i', replace
}

* Note that the nokeep option is required on the merge statement so that strings that are in the dictionary but not in this file (because they occur only in one of the other 83 files) do not get added to the original data.
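
The merge syntax above is the one available in Stata 9. For readers on Stata 11 or later, where -merge- takes an explicit match type, the same step might look like the sketch below. It is meant to be run in place of the loop above, not after it, and it assumes the bigauto and auto_dict files created earlier. Here keep(match master) plays the role of nokeep, and no prior sort is needed.

```stata
* Sketch only: assumes Stata 11 or later, and that bigauto1-bigauto3
* and auto_dict are the files created earlier in this example
forvalues i = 1/3 {
    use bigauto`i', clear
    * Many cases per make in the data, one row per make in the dictionary
    merge m:1 make using auto_dict, keep(match master)
    * Every case should have found its code in the dictionary
    assert _merge == 3
    drop _merge
    save bigauto`i', replace
}
```

Dropping _merge before saving means the files can be merged again later without a name clash.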

Hope that helps
Evan Roberts


Date: Fri, 11 Aug 2006 19:32:23 -0400
From: Jeph Herrin <[email protected]>
Subject: st: label / macro problem

I'm using 9.2, latest update.

My programming problem is to combine a large number of large files;
approximately 84 files of 500k obs each. I only need three variables
from these files, but one of them, -mystring- is str64, which means that
as is, I can't combine these files via appending because my RAM (4GB)
runs out.

However, -mystring- only takes about 5500k different values. So
the solution I am using is to open each file, encode(mystring), save
the label, and then append all prior opened files. The values of
-mystring- are not constant over all the files - new values are added
over time, so I have to update the value labels each time I add a file.
My code looks like this :

u file1, clear
encode mystring, gen(myint)
local myintlab : value label myint
save temp, replace
foreach F of numlist 2/84 {
	u file`F', clear
	keep ID mystring
	encode mystring, gen(myint) label("`myintlab'")
	local myintlab : value label myint
	append using temp
	save temp, replace
}

This seems to work fine up to a point. But after about 30 files,
*something* runs out of space, and the value label ceases to be
updated with new values; -myint- simply holds integers with no
corresponding labels. Now, I understand that 64k value label
values should be allowed, so I don't see a problem there. And
-myintlab- is just a macro holding the name of the set of value
labels. So what else could be going wrong? Or, is there another
way to do this?

NB: The close reader will note that I mention 3 variables in the preamble
but only have two in my code fragment. In fact, I *also* encode a
second string variable; it takes many fewer values, however, and
turns out fine in the end.

In particular, I would appreciate any tips on how to debug what
is happening.

cheers,
Jeph


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

------------------------------

End of statalist-digest V4 #2426
********************************

--
***************
Evan Roberts
Minnesota Population Center and Department of History
University of Minnesota
[email protected]
http://www.pop.umn.edu/~eroberts