st: Estimating new variable from multiple datasets

From	"Pavlos C. Symeou" <[email protected]>
To	[email protected]
Subject	st: Estimating new variable from multiple datasets
Date	Thu, 28 Oct 2010 12:07:00 +0200


Dear Statalisters,

Some time ago I wrote to Statalist about the problem below, but I have not received any suggestions. I am trying here to simplify my problem in the hope that you can help. I have 250 company files containing their patents, with a total size of 12 GB (see the sample below). I want to use information from each company's patents and their citations to create a new dataset that will contain all companies in panel format, with a new variable whose estimation I describe below.

Let me give you an example. The company "Acer", which operates in industrial sector 3456 (mother_SIC), has 100,000 patents published between 1960 and 2009 (year). Certain years may have multiple patents. Every patent is assigned multiple patent numbers (patent_number), which uniquely identify it. Each patent can be used in at least one industrial sector (patent_Sic). Every patent may cite multiple patents (citation).

The data below show that in 1994 ACER published two patents, which were assigned 20 numbers. Each patent is used in 20 industries (patent_Sic). In each of the two patents, ACER cites 20 other patents, which may belong to ACER or to other companies and which themselves appear in a similar fashion as observations in ACER's or another company's file.

name  mother_sic  Year  patent_Sic_1  patent_Sic_20  patent_number_1  patent_number_20  citation_1  citation_20
ACER  3456        1994  3661                         TW231391-A       TW231391-B        US231391-A  CY231391-A
ACER  3456        1994  3417          5472           DR231342-A       TA231342-C        FR231342-A  CY2634542-B
ACER  3456        1995  3577          3572           BR231342-B       PAT231342-A       TW231342-A  SE231342-A

I want to estimate a new numerical variable ("convergence") for ACER that measures how much the SIC sectors of its patents, and of the patents they cite, deviate from ACER's own industrial sector (mother_SIC), based on the following formula. Take, for example, the two patents in 1994 above. The value of "convergence" for the year 1994 should be:

{[0.90 * (a1 + b1 + c1) + 0.10 * (d1 + e1 + f1)] + [0.90 * (a2 + b2 + c2) + 0.10 * (d2 + e2 + f2)]} / n

where n is the number of patents that ACER published in 1994 (here n = 2), the subscripts 1, 2, ..., n index those patents, and a, b, c, d, e, f are:


For every ACER patent published in 1994:

a) the proportion of all patent_SICs whose 1st digit differs from the 1st digit of mother_SIC, multiplied by 3;
b) the proportion of all patent_SICs whose 1st digit is the same as the 1st digit of mother_SIC but whose 2nd digit differs from the 2nd digit of mother_SIC, multiplied by 2;
c) the proportion of all patent_SICs whose first 2 digits are the same as the first 2 digits of mother_SIC but whose 3rd digit differs, multiplied by 1;
d) the proportion of all cited patents' patent_SICs whose 1st digit differs from the 1st digit of ACER's mother_SIC, multiplied by 3;
e) the proportion of all cited patents' patent_SICs whose 1st digit is the same as the 1st digit of ACER's mother_SIC but whose 2nd digit differs from the 2nd digit of mother_SIC, multiplied by 2;
f) the proportion of all cited patents' patent_SICs whose first 2 digits are the same as the first 2 digits of ACER's mother_SIC but whose 3rd digit differs, multiplied by 1.
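
The rules a) to f) can be pinned down with a small sketch. It is written in Python purely to make the arithmetic explicit, not as a Stata solution. The function names are hypothetical. It assumes that SIC codes have four digits (shorter codes are left-padded with zeros), that a code sharing its first three digits with mother_SIC contributes 0, and that an empty list of codes scores 0.

```python
def deviation_weight(sic, mother_sic):
    """Return 3, 2, 1 or 0 according to the first digit where sic departs from mother_sic."""
    s, m = str(sic).zfill(4), str(mother_sic).zfill(4)  # assumption: 4-digit SIC codes
    if s[0] != m[0]:
        return 3  # rules a) and d)
    if s[1] != m[1]:
        return 2  # rules b) and e)
    if s[2] != m[2]:
        return 1  # rules c) and f)
    return 0      # assumption: same first three digits contributes nothing


def weighted_deviation(sics, mother_sic):
    """Return a + b + c (own SICs) or d + e + f (cited patents' SICs) for one patent."""
    if not sics:
        return 0.0  # assumption: no codes means no measurable deviation
    n = len(sics)
    weights = [deviation_weight(s, mother_sic) for s in sics]
    a = 3 * weights.count(3) / n  # proportion differing at digit 1, times 3
    b = 2 * weights.count(2) / n  # proportion differing first at digit 2, times 2
    c = 1 * weights.count(1) / n  # proportion differing first at digit 3, times 1
    return a + b + c


def patent_score(own_sics, cited_sics, mother_sic):
    """Return 0.90 * (a + b + c) + 0.10 * (d + e + f) for one patent."""
    return (0.90 * weighted_deviation(own_sics, mother_sic)
            + 0.10 * weighted_deviation(cited_sics, mother_sic))
```

For a patent with patent_SICs 3417 and 5472 and mother_SIC 3456, this gives a = 3 * 1/2, b = 0 and c = 1 * 1/2, so a + b + c = 2.0.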


Notes:

A) For a, b, c the search will be done inside ACER's file. For d, e, f the search will be done inside all available companies' files, including ACER's, and only for the years 1994 and earlier. This is because cited patents have already been published.
B) A citation number in ACER's file will appear as a patent number in another company's file (or in ACER's file if the company is citing another patent it owns).
C) Since a patent can be assigned multiple numbers, the search intended to match the citation with the patent number must go through all columns that have patent numbers.
D) It is possible that a cited patent does not belong to our sample companies. This should not terminate the loop; it should simply move on to the next citation.
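
Notes A) to D) amount to a lookup from any patent number to the patent it identifies. A minimal sketch of that lookup follows, again in Python only to make the logic concrete. The record layout (dictionaries with the keys 'year', 'patent_sics' and 'patent_numbers') and the function names are hypothetical, and the sketch assumes the pooled records fit in memory, which may not hold for 12 GB of files.

```python
def build_patent_index(records):
    """Map every patent number to the (year, SIC codes) of the patent it identifies.

    records: iterable of dicts pooled over all company files (hypothetical layout).
    """
    index = {}
    for rec in records:
        entry = (rec["year"], rec["patent_sics"])
        for number in rec["patent_numbers"]:  # note C: check every patent-number column
            index[number] = entry
    return index


def cited_sics(citations, index, focal_year):
    """Collect the SIC codes of cited patents published in focal_year or earlier."""
    found = []
    for number in citations:
        entry = index.get(number)
        if entry is None:        # note D: cited patent is outside the sample, skip it
            continue
        year, sics = entry
        if year <= focal_year:   # note A: only the focal year and earlier
            found.extend(sics)
    return found
```

The list returned by cited_sics is what rules d), e) and f) would be applied to for one citing patent.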


The output should look like this:

Company    Year    convergence
ACER       1994    2.3
ACER       1995    2.1
ACER       1996    2.5
......
......
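
The panel itself is then the per-patent scores averaged within each company and year, which is the division by n in the formula. A minimal sketch, assuming each patent has already been reduced to a (company, year, score) tuple; the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def build_panel(scored_patents):
    """Average per-patent scores within each (company, year) cell.

    scored_patents: iterable of (company, year, score) tuples (hypothetical layout).
    Returns a list of (company, year, convergence) rows sorted by company and year.
    """
    groups = defaultdict(list)
    for company, year, score in scored_patents:
        groups[(company, year)].append(score)
    return [(company, year, sum(scores) / len(scores))  # divide by n patents in the cell
            for (company, year), scores in sorted(groups.items())]
```

Each returned row corresponds to one line of the output above.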

Any help would be much appreciated.

Best,

Pavlos

