Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Encoding and matching string values


From   "Pavlos C. Symeou" <p.symeou@lmu.de>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Encoding and matching string values
Date   Mon, 04 Oct 2010 12:47:31 +0200

thank you for your input.

@ Martin, we thought about using the -encode- command and then replace the original variable with the encoded one to reduce the data size. We managed to reduce some files from nearly 1Gb down to 10 Mb. If you consider that we have 250 such files with some of a size of more than 10Gb each (due to string variables and thousands of observations) you can understand the degree of difficulty we are facing. Of course we used - compress - as well as clearing the string values as much as we could.

@ Eric. I ran the code below after having done some minor adjustments and works just fine. This was an excellent exercise to see how our data behave. The final task we want to carry out though is even more complex in terms of programming and as we are lacking that competence we would like to ask for your input once more.

The problem has as follows. We have about 250 company files with their patents (see below a sample). For example, company "Acer" which operates in industrial sector 3456 (mother_SIC) has 100,000 patents published between 1960-2009 (year). Certain years may have multiple patents. Every patent is assigned multiple patent numbers (patent_number) which uniquely identify it. Each of that patent can be used in at least one industrial sector (patent_Sic). Last, every patent may cite multiple patents (citation).

The data below tell that, ACER in year 1994 published two patents which were assigned the names TW231391-A, TW231391-B,....., TW231391-C and DR231342-A, DRE231342-B,......,TA231342-C, respectively. Each patent can be applied in a respective number of industries (patent_Sic). In each of the two patents, ACER is citing a number of other patents, which may belong to ACER or other companies, which themselves are described by similar attributes (patent_number, patent_Sic etc).

Company mother_sic Year patent_Sic_1 patent_Sic_2 patent_Sic_3 patent_number_1 patent_number_2 patent_number_40 citation_1 citation_2 citation_500 ACER 3456 1994 3661 TW231391-A TW231391-B TW231391-C US231391-A GB231391-A CY231391-A ACER 3456 1994 3417 5472 5571 DR231342-A DRE231342-B TA231342-C FR231342-A GG231342-D CY2634542-B ACER 3456 1995 3577 3572 3571 BR231342-B LTE231342-A PAT231342-A GR231342-A TW231342-A SE231342-A
..........
..........
.........

We want to estimate a new variable ("convergence") for ACER which will measure how much its patents' SIC sectors and these patents' citations' SIC sectors deviate from ACER's industrial sector (mother_SIC) based on the following formula. Take for example the two patents in 1994 above.

The value of "convergence" for the year 1994 should be: {[*0.90 * (a1 +b1 +c1) + 0.10 * (d1 + e1 + f1)*] ** + [*0.90 * (a2 +b2 +c2) + 0.10 * (d2 + e2 + f2)*] *} / n* *
*
where 1,2,...,n is the number of patents that ACER published in 1994 and a,b,c,d,e,f are:

For every ACER patent published in 1994:
a) take the proportion of all patent_SICs whose 1st digit is different than the 1st digit of mother_SIC and multiply it by 3; b) take the proportion of all patent_SICs whose 1st digit is the same as the 1st digit of mother_SIC but the 2nd digit is different than the 2nd digit of mother_SIC and multiply it by 2; c) take the proportion of all patent_SICs whose first 2 digits are the same as the first 2 digits of mother_SIC but they have a different 3rd digit and multiply it by 1; d) take the proportion of all cited patents' patent_SICs whose 1st digit is different than the 1st digit of ACER's mother_SIC and multiply it by 3; e) take the proportion of all cited patents' patent_SICs whose 1st digit is the same as the 1st digit of ACER's mother_SIC but the 2nd digit is different than the 2nd digit of mother_SIC and multiply it by 2; f) take the proportion of all cited patents' patent_SICs whose first 2 digits are the same as the first 2 digits of ACER's mother_SIC but they have a different 3rd digit and multiply it by 1;

For a, b, c the search will be done inside ACER's file. For d, e, f the search will be done inside all available companies' files, including ACER and only for the years 1994 and earlier. This is because cited patents are published before the focal citing patent (of course).

The output should look like this:

Company    Year        convergence
ACER            1994      2.3
ACER            1995      2.1
ACER            1996      2.5
.........................................
........................................

In retrospect, the use of -encode- aimed to reduce the sizes of consisting company datasets so that the files can be joined together (if this is going to help anyway) and the "search and match" time is reduced.

I know that this is not a trivial task, but unfortunately my programming knowledge is minimal. Any help will be very appreciated.

Best,

Pavlos




**solution**

//1.  mk lookup table of values//
clear
set more off
cd "MY DIRECTORY WITH ALL MY DTA FILE"
save "lookuptable.dta", emptyok replace
foreach file in att ameritech aol {
    append using "`file'.dta", keep(citation)
    }
    duplicates drop
    g citation_number = _n
    l
    save "lookuptable.dta", emptyok replace
    cap which labmask
    if _rc ssc install labutil, replace
    labmask citation_number, value(citation) lblname(cit)
    la save cit  using "citationlabels.do" , replace


//2. mk final tbl w. citation_number, not citation//
clear
save "final.dta", emptyok replace

foreach file in att ameritech aol   {
    u "`file'.dta", clear
merge m:1 citation using "lookuptable.dta" //Here I changed the 1:1 to m:1 because I have multiple patents which cite the same patent
    drop if _m!=3
    drop _m
    drop citation
    sa "`file'_encoded.dta", replace
    append using "final.dta"
    sa "final.dta", replace
    }

//3.  apply labels to citation_number//
l
fre citation
do "citationlabels.do"
lab val citation_number cit
fre citation

************************!



*Von: *Eric Booth <ebooth@ppri.tamu.edu <mailto:ebooth@ppri.tamu.edu>>
*Datum: *24. September 2010 16:05:13 GMT-07:00
*An: *"<statalist@hsphsun2.harvard.edu <mailto:statalist@hsphsun2.harvard.edu>>" <statalist@hsphsun2.harvard.edu <mailto:statalist@hsphsun2.harvard.edu>>
*Betreff: **Re: st: RE: Encoding and matching string values*
*Antwort an: *statalist@hsphsun2.harvard.edu <mailto:statalist@hsphsun2.harvard.edu>

<>


In an attempt to reduce the size of the final/appended dataset, Florian wants to encode "citation" in each of the un-appended datasets first, remove the long string variable and have an encoded numeric variable in it's place, and then append the files to create the large, final dataset.

The problems are (1) you will need to remove the original string version of "citation" before appending or the -encode- didn't save you any space (as Martin mentions) and (2) if the appended datasets have the same "citations", then -encode- may have assigned it one value in one dataset and a different value in a different dataset (I think this is what Martin was asking about in his response). It's easier to -encode- "citation" in the final, appended dataset so that the encoding is consistent, but in Florian's case this is undesirable because of space limitations.

One solution is to create a look up table containing the string variable "citation" and an assigned code/number for each value in citation (citation_number). Then you can merge this citation_number to each individual, un-appended file & drop the string "citation" (in the un-appended files before appending them ) to save space. After appending all these files , you can apply the "citations" as value labels to the "citation_number" in the large/appended dataset.

You'll need -labmask- (from findit labutil on SSC) and -fre- (from SSC) to use the example below:

************************!

//fake "using" dataset//
clear
inp id patent_number str5(citation)
1 12 "one"
2 13 "two"
3 99 "three"
4 98 "four"
end
sa using.dta, replace

encode citation, g(citation2)
cap which fre
if _rc ssc install fre, replace
fre citation2
sa using_encoded.dta, replace


//fake "master" dataset//
clear
inp id patent_number str5(citation)
5 19 "four"
6 17 "five"
7 89 "six"
8 88 "seven"
end
sa master.dta, replace

encode citation, g(citation2)
fre citation2

/*
this is what Florian is running into, it doesnt work because encode assigned
different values to the same labels across datasets
*/
append using "using_encoded.dta"


**solution**

//1.  mk lookup table of values//
clear
save "lookuptable.dta", emptyok replace
foreach file in using master   /* put all your files here */  {
append using "`file'.dta", keep(citation)
}
duplicates drop
g citation_number = _n
l
save "lookuptable.dta", emptyok replace
cap which labmask
if _rc ssc install labutil, replace
labmask citation_number, value(citation) lblname(cit)
la save cit  using "citationlabels.do" , replace



//2. mk final tbl w. citation_number, not citation//
clear
save "final.dta", emptyok replace

foreach file in using master  /* put all your files here */  {
u "`file'.dta", clear
merge 1:1 citation using "lookuptable.dta"
drop if _m!=3
drop _m
drop citation
sa "`file'_encoded.dta", replace
append using "final.dta"
sa "final.dta", replace
}

//3.  apply labels to citation_number//
l
fre citation
do "citationlabels.do"
lab val citation_number cit
fre citation

************************!



- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu <mailto:ebooth@ppri.tamu.edu>
Office: +979.845.6754



On Sep 24, 2010, at 4:53 PM, Martin Weiss wrote:


<>

I am not sure the description here is clear enough: -encode- forces you to -generate()- the new numeric variable, so that both the string and its -encode-d counterpart coexist afterwards. So it is hard to see how a) your dataset is supposed to decrease in size via -encode- b) how the "original string values" are no longer there...


How does Stata (_not STATA_) "...mess up the the numerical values after appending the dataset"?

HTH
Martin




*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index