Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Estimating new variable from multiple datasets

From	Mitch Abdon <[email protected]>
To	[email protected]
Subject	Re: st: Estimating new variable from multiple datasets
Date	Thu, 28 Oct 2010 18:29:11 +0800

This is a lot to tackle at once. Maybe start with the -substr()- string
function to compare the SIC digits; see -help string_functions-.

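As a minimal sketch of that first step, -substr()- can pick out the digits needed for a, b, and c. This assumes the data have already been reshaped to one row per patent-SIC pair. The names patent_id, m, p, w, and abc are made up for illustration, and the toy values are only loosely based on the sample rows, not taken from the actual files.

```stata
* Toy data: one row per (patent, SIC) pair; names and values are illustrative
clear
input str4 name mother_sic year patent_id patent_sic
"ACER" 3456 1994 1 3661
"ACER" 3456 1994 2 3417
"ACER" 3456 1994 2 5472
"ACER" 3456 1995 3 3577
"ACER" 3456 1995 3 3572
end

* Convert the codes to 4-character strings so -substr()- can pick out digits
gen str4 m = string(mother_sic, "%04.0f")
gen str4 p = string(patent_sic, "%04.0f")

* w = 3 if the 1st digit differs,
*     2 if the 1st matches but the 2nd differs,
*     1 if the first two match but the 3rd differs,
*     0 otherwise
gen byte w = 0
replace w = 3 if substr(p, 1, 1) != substr(m, 1, 1)
replace w = 2 if w == 0 & substr(p, 2, 1) != substr(m, 2, 1)
replace w = 1 if w == 0 & substr(p, 3, 1) != substr(m, 3, 1)

* The mean of w within a patent equals a + b + c for that patent
collapse (mean) abc = w, by(name year patent_id)
list, sepby(year)
```

This covers only a, b, and c. For d, e, and f the cited patents' SIC codes would have to be merged in first. Averaging abc by name and year would then give the first part of the formula, before the 0.90 weight is applied.
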
On Thu, Oct 28, 2010 at 6:07 PM, Pavlos C. Symeou <[email protected]> wrote:
>
> Dear Statalisters,
>
> Some time ago I wrote to Statalist about the problem below but did not
> receive any suggestions. I am trying here to simplify my problem in the
> hope that you can help. I have 250 company files containing each company's
> patents, 12 GB in total (a sample appears below). I want to use
> information from each company's patents and their citations to create a new
> dataset consisting of all companies in panel format, with a new
> variable whose calculation I describe below.
>
> Let me give you an example. Company "Acer" which operates in industrial
> sector 3456 (mother_SIC) has 100,000 patents published between 1960-2009
> (year). Certain years may have multiple patents. Every patent is assigned
> multiple patent numbers (patent_number) which uniquely identify it. Each
> patent can be used in at least one industrial sector (patent_Sic).
> Every patent may cite multiple patents (citation).
>
> The data below show that in 1994 ACER published two patents, each of which
> can be assigned up to 20 numbers and used in up to 20 industries
> (patent_Sic). In each of the two patents, ACER cites up to 20 other
> patents, which may belong to ACER or to other companies and which
> themselves appear in a similar fashion as observations in ACER's or
> another company's file.
>
> name  mother_sic  Year  patent_Sic_1  patent_Sic_20  patent_number_1  patent_number_20  citation_1  citation_20
> ACER  3456        1994  3661                         TW231391-A       TW231391-B        US231391-A  CY231391-A
> ACER  3456        1994  3417          5472           DR231342-A       TA231342-C        FR231342-A  CY2634542-B
> ACER  3456        1995  3577          3572           BR231342-B       PAT231342-A       TW231342-A  SE231342-A
>
> I want to estimate a new numerical variable ("convergence") for ACER which
> will measure how much its patents' SIC sectors and the related cited
> patents' SIC sectors deviate from ACER's industrial sector (mother_SIC)
> based on the following formula. Take for example the two patents in 1994
> above. The value of "convergence" for the year 1994 should be:
>
> {[0.90 * (a1 + b1 + c1) + 0.10 * (d1 + e1 + f1)]
>   + [0.90 * (a2 + b2 + c2) + 0.10 * (d2 + e2 + f2)]} / n
>
> where n is the number of patents that ACER published in 1994 (here n = 2),
> the subscripts 1, 2, ..., n index those patents, and a, b, c, d, e, f are:
>
> For every ACER patent published in 1994:
> a) the proportion of all patent_SICs whose 1st digit is different than the
> 1st digit of mother_SIC, multiplied by 3;
> b) the proportion of all patent_SICs whose 1st digit is the same as the 1st
> digit of mother_SIC but the 2nd digit is different than the 2nd digit of
> mother_SIC, multiplied by 2;
> c) the proportion of all patent_SICs whose first 2 digits are the same as
> the first 2 digits of mother_SIC but whose 3rd digit is different,
> multiplied by 1;
> d) the proportion of all cited patents' patent_SICs whose 1st digit is
> different than the 1st digit of ACER's mother_SIC, multiplied by 3;
> e) the proportion of all cited patents' patent_SICs whose 1st digit is the
> same as the 1st digit of ACER's mother_SIC but the 2nd digit is different
> than the 2nd digit of mother_SIC, multiplied by 2;
> f) the proportion of all cited patents' patent_SICs whose first 2 digits are
> the same as the first 2 digits of ACER's mother_SIC but they have a
> different 3rd digit, multiplied by 1;
>
> Notes:
> A) For a, b, c the search will be done inside ACER's file. For d, e, f the
> search will be done inside all available companies' files, including ACER's,
> and only for the years 1994 and earlier, because a cited patent must
> already have been published.
> B) A citation number in ACER's file will appear as a patent number in
> another company's file (or in ACER's file if the company is citing another
> patent it owns).
> C) Since a patent can be assigned multiple numbers, the search intended to
> match the citation with the patent number must go through all columns that
> have patent numbers.
> D) It is possible that a cited patent does not belong to our sample
> companies. In that case the loop should not terminate but should go on to
> the next citation.
>
> The output should look like this:
>
> Company    Year        convergence
> ACER            1994      2.3
> ACER            1995      2.1
> ACER            1996      2.5
> ......
> ......
>
> Any help would be much appreciated.
>
> Best,
>
> Pavlos
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>



-- 
Best,
Mitch


Arnelyn Abdon
Mobile: +639178034402
http://statadaily.wordpress.com
