Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Mitch Abdon <mitchaabdon@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Estimating new variable from multiple datasets |

Date |
Thu, 28 Oct 2010 18:29:11 +0800 |

this is a lot... maybe start with the -substr- string function... "help string_functions" On Thu, Oct 28, 2010 at 6:07 PM, Pavlos C. Symeou <p.symeou@lmu.de> wrote: > > Dear Statalisters, > > Some time ago I wrote to Statalist about the problem below but I have been > unsuccessful in receiving any suggestions. I am trying herein to simplify my > problem in the hope that you can help. I have 250 company files with their > patents of total size of 12Gbytes (see below a sample). I want to use > information from each company's patents and their citations and create a new > dataset which will consist of all companies in a panel format with a new > variable whose estimation I describe below. > > Let me give you an example. Company "Acer" which operates in industrial > sector 3456 (mother_SIC) has 100,000 patents published between 1960-2009 > (year). Certain years may have multiple patents. Every patent is assigned > multiple patent numbers (patent_number) which uniquely identify it. Each of > that patent can be used in at least one industrial sector (patent_Sic). > Every patent may cite multiple patents (citation). > > The data below tell that, ACER in year 1994 published two patents which were > assigned 20 numbers. Each patent is used in 20 industries (patent_Sic). In > each of the two patents, ACER is citing 20 other patents, which may belong > to ACER or other companies, which themselves appear in a similar fashion as > observations (in ACER's or) another company's file. > > name mother_sic Year patent_Sic_1 patent_Sic_20 patent_number_1 > patent_number_20 citation_1 citation_20 > ACER 3456 1994 3661 > TW231391-A TW231391-B US231391-A CY231391-A > ACER 3456 1994 3417 5472 DR231342-A > TA231342-C FR231342-A CY2634542-B > ACER 3456 1995 3577 3572 BR231342-B > PAT231342-A TW231342-A SE231342-A > > I want to estimate a new numerical variable ("convergence") for ACER which > will measure how much its patents' SIC sectors and the related cited > patents' SIC sectors deviate from ACER's industrial sector (mother_SIC) > based on the following formula. Take for example the two patents in 1994 > above. The value of "convergence" for the year 1994 should be: > > {[0.90 * (a1 +b1 +c1) + 0.10 * (d1 + e1 + f1)] + [0.90 * (a2 +b2 +c2) + 0.10 > * (d2 + e2 + f2)] } / n > > where 1,2,...,n is the number of patents that ACER published in 1994 and a, > b, c, d, e, f are: > > For every ACER patent published in 1994: > a) the proportion of all patent_SICs whose 1st digit is different than the > 1st digit of mother_SIC, multiplied by 3; > b) the proportion of all patent_SICs whose 1st digit is the same as the 1st > digit of mother_SIC but the 2nd digit is different than the 2nd digit of > mother_SIC, multiplied by 2; > c) the proportion of all patent_SICs whose first 2 digits are the same as > the first two digits of mother_SIC but they have a different 3rd digit, > multiplied it 1; > d) the proportion of all cited patents' patent_SICs whose 1st digit is > different than the 1st digit of ACER's mother_SIC, multiplied by 3; > e) the proportion of all cited patents' patent_SICs whose 1st digit is the > same as the 1st digit of ACER's mother_SIC but the 2nd digit is different > than the 2nd digit of mother_SIC, multiplied by 2; > f) the proportion of all cited patents' patent_SICs whose first 2 digits are > the same as the first 2 digits of ACER's mother_SIC but they have a > different 3rd digit, multiplied by 1; > > Notes: > A) For a, b, c the search will be done inside ACER's file. For d, e, f the > search will be done inside all available companies' files, including ACER > and only for the years 1994 and earlier. This is because cited patents are > already published. > B) A citation number in ACER's file will appear as a patent number in > another company's file (or in ACER's file if the company is citing another > patent it owns). > C) Since a patent can be assigned multiple numbers, the search intended to > match the citation with the patent number must go through all columns that > have patent numbers. > D) It is possible that a cited patent does not belong to our sample > companies. This should not terminate the loop but go on with the next > citation. > > The output should look like this: > > Company Year convergence > ACER 1994 2.3 > ACER 1995 2.1 > ACER 1996 2.5 > ...... > ...... > > Any help will be very appreciated. > > Best, > > Pavlos > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > -- Best, Mitch Arnelyn Abdon Mobile: +639178034402 http://statadaily.wordpress.com * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**st: Estimating new variable from multiple datasets***From:*"Pavlos C. Symeou" <p.symeou@lmu.de>

- Prev by Date:
**st: Estimating new variable from multiple datasets** - Next by Date:
**st: Re: Making Cohorts** - Previous by thread:
**st: Estimating new variable from multiple datasets** - Index(es):