Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Pavlos C. Symeou" <p.symeou@lmu.de> |
To | statalist@hsphsun2.harvard.edu |
Subject | st: Estimating new variable from multiple datasets |
Date | Thu, 28 Oct 2010 12:07:00 +0200 |
Dear Statalisters,Some time ago I wrote to Statalist about the problem below but I have been unsuccessful in receiving any suggestions. I am trying herein to simplify my problem in the hope that you can help. I have 250 company files with their patents of total size of 12Gbytes (see below a sample). I want to use information from each company's patents and their citations and create a new dataset which will consist of all companies in a panel format with a new variable whose estimation I describe below.
Let me give you an example. Company "Acer" which operates in industrial sector 3456 (mother_SIC) has 100,000 patents published between 1960-2009 (year). Certain years may have multiple patents. Every patent is assigned multiple patent numbers (patent_number) which uniquely identify it. Each of that patent can be used in at least one industrial sector (patent_Sic). Every patent may cite multiple patents (citation).
The data below tell that, ACER in year 1994 published two patents which were assigned 20 numbers. Each patent is used in 20 industries (patent_Sic). In each of the two patents, ACER is citing 20 other patents, which may belong to ACER or other companies, which themselves appear in a similar fashion as observations (in ACER's or) another company's file.
name mother_sic Year patent_Sic_1 patent_Sic_20 patent_number_1 patent_number_20 citation_1 citation_20 ACER 3456 1994 3661 TW231391-A TW231391-B US231391-A CY231391-A ACER 3456 1994 3417 5472 DR231342-A TA231342-C FR231342-A CY2634542-B ACER 3456 1995 3577 3572 BR231342-B PAT231342-A TW231342-A SE231342-A
I want to estimate a new numerical variable ("convergence") for ACER which will measure how much its patents' SIC sectors and the related cited patents' SIC sectors deviate from ACER's industrial sector (mother_SIC) based on the following formula. Take for example the two patents in 1994 above. The value of "convergence" for the year 1994 should be:
{[0.90 * (a1 +b1 +c1) + 0.10 * (d1 + e1 + f1)] + [0.90 * (a2 +b2 +c2) + 0.10 * (d2 + e2 + f2)] } / n
where 1,2,...,n is the number of patents that ACER published in 1994 and a, b, c, d, e, f are:
For every ACER patent published in 1994:a) the proportion of all patent_SICs whose 1st digit is different than the 1st digit of mother_SIC, multiplied by 3; b) the proportion of all patent_SICs whose 1st digit is the same as the 1st digit of mother_SIC but the 2nd digit is different than the 2nd digit of mother_SIC, multiplied by 2; c) the proportion of all patent_SICs whose first 2 digits are the same as the first two digits of mother_SIC but they have a different 3rd digit, multiplied it 1; d) the proportion of all cited patents' patent_SICs whose 1st digit is different than the 1st digit of ACER's mother_SIC, multiplied by 3; e) the proportion of all cited patents' patent_SICs whose 1st digit is the same as the 1st digit of ACER's mother_SIC but the 2nd digit is different than the 2nd digit of mother_SIC, multiplied by 2; f) the proportion of all cited patents' patent_SICs whose first 2 digits are the same as the first 2 digits of ACER's mother_SIC but they have a different 3rd digit, multiplied by 1;
Notes:A) For a, b, c the search will be done inside ACER's file. For d, e, f the search will be done inside all available companies' files, including ACER and only for the years 1994 and earlier. This is because cited patents are already published. B) A citation number in ACER's file will appear as a patent number in another company's file (or in ACER's file if the company is citing another patent it owns). C) Since a patent can be assigned multiple numbers, the search intended to match the citation with the patent number must go through all columns that have patent numbers. D) It is possible that a cited patent does not belong to our sample companies. This should not terminate the loop but go on with the next citation.
The output should look like this: Company Year convergence ACER 1994 2.3 ACER 1995 2.1 ACER 1996 2.5 ...... ...... Any help will be very appreciated. Best, Pavlos * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/