Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Estimating new variable from multiple datasets


From   "Pavlos C. Symeou" <p.symeou@lmu.de>
To   statalist@hsphsun2.harvard.edu
Subject   st: Estimating new variable from multiple datasets
Date   Thu, 28 Oct 2010 12:07:00 +0200


Dear Statalisters,

Some time ago I wrote to Statalist about the problem below but I have been unsuccessful in receiving any suggestions. I am trying herein to simplify my problem in the hope that you can help. I have 250 company files with their patents of total size of 12Gbytes (see below a sample). I want to use information from each company's patents and their citations and create a new dataset which will consist of all companies in a panel format with a new variable whose estimation I describe below.

Let me give you an example. Company "Acer" which operates in industrial sector 3456 (mother_SIC) has 100,000 patents published between 1960-2009 (year). Certain years may have multiple patents. Every patent is assigned multiple patent numbers (patent_number) which uniquely identify it. Each of that patent can be used in at least one industrial sector (patent_Sic). Every patent may cite multiple patents (citation).

The data below tell that, ACER in year 1994 published two patents which were assigned 20 numbers. Each patent is used in 20 industries (patent_Sic). In each of the two patents, ACER is citing 20 other patents, which may belong to ACER or other companies, which themselves appear in a similar fashion as observations (in ACER's or) another company's file.

name mother_sic Year patent_Sic_1 patent_Sic_20 patent_number_1 patent_number_20 citation_1 citation_20 ACER 3456 1994 3661 TW231391-A TW231391-B US231391-A CY231391-A ACER 3456 1994 3417 5472 DR231342-A TA231342-C FR231342-A CY2634542-B ACER 3456 1995 3577 3572 BR231342-B PAT231342-A TW231342-A SE231342-A

I want to estimate a new numerical variable ("convergence") for ACER which will measure how much its patents' SIC sectors and the related cited patents' SIC sectors deviate from ACER's industrial sector (mother_SIC) based on the following formula. Take for example the two patents in 1994 above. The value of "convergence" for the year 1994 should be:

{[0.90 * (a1 +b1 +c1) + 0.10 * (d1 + e1 + f1)] + [0.90 * (a2 +b2 +c2) + 0.10 * (d2 + e2 + f2)] } / n

where 1,2,...,n is the number of patents that ACER published in 1994 and a, b, c, d, e, f are:

For every ACER patent published in 1994:
a) the proportion of all patent_SICs whose 1st digit is different than the 1st digit of mother_SIC, multiplied by 3; b) the proportion of all patent_SICs whose 1st digit is the same as the 1st digit of mother_SIC but the 2nd digit is different than the 2nd digit of mother_SIC, multiplied by 2; c) the proportion of all patent_SICs whose first 2 digits are the same as the first two digits of mother_SIC but they have a different 3rd digit, multiplied it 1; d) the proportion of all cited patents' patent_SICs whose 1st digit is different than the 1st digit of ACER's mother_SIC, multiplied by 3; e) the proportion of all cited patents' patent_SICs whose 1st digit is the same as the 1st digit of ACER's mother_SIC but the 2nd digit is different than the 2nd digit of mother_SIC, multiplied by 2; f) the proportion of all cited patents' patent_SICs whose first 2 digits are the same as the first 2 digits of ACER's mother_SIC but they have a different 3rd digit, multiplied by 1;

Notes:
A) For a, b, c the search will be done inside ACER's file. For d, e, f the search will be done inside all available companies' files, including ACER and only for the years 1994 and earlier. This is because cited patents are already published. B) A citation number in ACER's file will appear as a patent number in another company's file (or in ACER's file if the company is citing another patent it owns). C) Since a patent can be assigned multiple numbers, the search intended to match the citation with the patent number must go through all columns that have patent numbers. D) It is possible that a cited patent does not belong to our sample companies. This should not terminate the loop but go on with the next citation.

The output should look like this:

Company    Year        convergence
ACER            1994      2.3
ACER            1995      2.1
ACER            1996      2.5
......
......

Any help will be very appreciated.

Best,

Pavlos
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index