
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: RE: Encoding and matching string values


From   Eric Booth <[email protected]>
To   "<[email protected]>" <[email protected]>
Subject   Re: st: RE: Encoding and matching string values
Date   Fri, 24 Sep 2010 23:05:13 +0000



To reduce the size of the final, appended dataset, Florian wants to -encode- "citation" in each of the un-appended datasets first, drop the long string variable so that only the encoded numeric variable remains in its place, and then append the files to create the large, final dataset.

The problems are (1) you need to drop the original string version of "citation" before appending, or -encode- does not save you any space (as Martin mentions), and (2) if the datasets to be appended share some of the same "citations", -encode- may assign a citation one value in one dataset and a different value in another (I think this is what Martin was asking about in his response).  -encode- numbers the strings it finds in alphabetical order within each dataset, so the codes depend on which citations that file happens to contain.
It's easier to -encode- "citation" in the final, appended dataset so that the encoding is consistent, but in Florian's case this is undesirable because of space limitations.

One solution is to create a lookup table containing the string variable "citation" and an assigned code/number for each distinct value of citation (citation_number).  Then you can merge citation_number into each individual, un-appended file and drop the string "citation" (in the un-appended files, before appending them) to save space.
After appending all these files, you can apply the "citations" as value labels to "citation_number" in the large, appended dataset.

You'll need -labmask- (part of the labutil package on SSC) and -fre- (also from SSC) to use the example below:

************************!

//fake "using" dataset//
clear
inp id patent_number str5(citation)
1 12 "one"
2 13 "two"
3 99 "three"
4 98 "four"
end
sa using.dta, replace

encode citation, g(citation2)
cap which fre
if _rc ssc install fre, replace
fre citation2
sa using_encoded.dta, replace


//fake "master" dataset//
clear
inp id patent_number str5(citation)
5 19 "four"
6 17 "five"
7 89 "six"
8 88 "seven"
end
sa master.dta, replace

encode citation, g(citation2)
fre citation2

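/*
this is what Florian is running into: it doesn't work because -encode- assigned
different values to the same strings across datasets ("four" is 1 in the using
file but 2 in the master), and -append- keeps the master's value label
*/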
	append using "using_encoded.dta"


**solution**

//1.  make lookup table of values//
clear
save "lookuptable.dta", emptyok replace
foreach file in using master   /* put all your files here */  {
	append using "`file'.dta", keep(citation)
	duplicates drop           // drop repeats as you go so the stack stays small
	}
g citation_number = _n    // the code shared by all files
l
save "lookuptable.dta", replace
cap which labmask
if _rc ssc install labutil, replace
// store the citation strings as value label "cit" and save it as a do-file
labmask citation_number, values(citation) lblname(cit)
la save cit using "citationlabels.do", replace
	
	
	
//2. make final table with citation_number, not citation//
clear
save "final.dta", emptyok replace

foreach file in using master  /* put all your files here */  {
	u "`file'.dta", clear
	// m:1 because the same citation can appear many times in a real file
	merge m:1 citation using "lookuptable.dta"
	drop if _m!=3
	drop _m
	drop citation
	sa "`file'_encoded.dta", replace
	append using "final.dta"
	sa "final.dta", replace
	}

//3.  apply labels to citation_number//
l
fre citation_number
do "citationlabels.do"
lab val citation_number cit
fre citation_number

************************!
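
A possible variation on step 2, offered as a sketch rather than something tested on Florian's files: once the shared value label has been saved to a do-file, -encode- can be told to reuse it through its label() option, which replaces the merge. With noextend, -encode- refuses to encode a file that contains a string missing from the label, so a gap in the lookup table shows up as an error rather than as a silently added code. The file names (toyA, toyB, toylabels.do, toyfinal.dta) are made up for this illustration, and -labmask- and -fre- are assumed to be installed as above.

```stata
// toy files standing in for the company files (hypothetical names)
clear
input id str5 citation
1 "one"
2 "four"
end
save "toyA.dta", replace

clear
input id str5 citation
3 "four"
4 "five"
end
save "toyB.dta", replace

// shared value label, built once from all files as in step 1
clear
foreach file in toyA toyB {
	append using "`file'.dta", keep(citation)
	duplicates drop
}
generate citation_number = _n
labmask citation_number, values(citation) lblname(cit)
label save cit using "toylabels.do", replace

// encode each file against the shared label instead of merging
clear
save "toyfinal.dta", emptyok replace
foreach file in toyA toyB {
	use "`file'.dta", clear
	do "toylabels.do"        // defines value label cit, same codes for every file
	encode citation, generate(citation_number) label(cit) noextend
	drop citation            // drop the long string before appending
	append using "toyfinal.dta"
	save "toyfinal.dta", replace
}
fre citation_number
```

Because every file is encoded against the same label, "four" gets the same code in toyA and toyB, and the appended file already carries the label, so step 3 is not needed in this variant.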



- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754



On Sep 24, 2010, at 4:53 PM, Martin Weiss wrote:

> 
> 
> I am not sure the description here is clear enough: -encode- forces you to -generate()- the new numeric variable, so that both the string and its -encode-d counterpart coexist afterwards. So it is hard to see how a) your dataset is supposed to decrease in size via -encode- b) how the "original string values" are no longer there...
> 
> 
> How does Stata (_not STATA_) "...mess up the numerical values after appending the dataset"?
> 
> HTH
> Martin
> 
> 
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Florian Seliger
> Sent: Friday, 24 September 2010 20:57
> To: [email protected]
> Subject: st: Encoding and matching string values
> 
> Hi,
> we have about 300 individual company files, each file with up to 100,000 patents. To each patent, up to 500 patent_numbers and citations (string values) are assigned.  In the next step, we would like to put all files together and match the values to each other.
> 
> First, we  want to decrease the enormous sizes of the datasets by using the encode command on the strings.
> 
> However, after encoding each individual file’s variables and using the append command, the numerical values cannot be decoded correctly at all so that the string values become wrong.
> 
> The reason is that STATA messes up the numerical values after appending the dataset.
> Therefore, we are looking for a way to use the encode command but still keep the original string values after appending the datasets, so that matching is possible.
> 
> Thank you in advance,
> Florian
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 
> 





