
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: RE: Extracting parts of string variable


From   Robert Picard <[email protected]>
To   "Pavlos C. Symeou" <[email protected]>
Subject   Re: st: RE: Extracting parts of string variable
Date   Thu, 8 Apr 2010 18:04:33 -0400

Pavlos,

I've adjusted the code to take care of the additional examples you
provided. I added some comments to explain what I have done. In this
case, it seemed easier to first remove additional patent codes at the
beginning of the line. I made the pattern that searches for patent
codes more general to take the issuing country into account.

*---------------- example start ----------------------
version 11
clear
input id str244( cit_1 company_1)
1 "US6449348-B1 3COM CORP _THRE-Non-standard_" "3COM CORP"
2 "US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_" "CETACEA NETWORKS CORP"
3 "US5566180-A HEWLETT-PACKARD CO _HEWP_" "HEWLETT-PACKARD CO"
4 "US6215865-B1 E-TALK CORP _ETAL-Non-standard_" "E-TALK CORP"
6 "US5600312-A MOTOROLA INC _MOTI_" "MOTOROLA INC"
7 "CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S" "CONRED ELECTRONICS LTD"
8 "TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B" "TEMIC TELEFUNKEN MICROELECTRONIC GMBH"
9 "US3476883-A" ""
10 "US5136671-A AT & T BELL LAB _AMTT_" "AT & T BELL LAB"
11 "US5195132-A AMERICAN TELEPHONE & TELEGRAPH CO _AMTT_" "AMERICAN TELEPHONE & TELEGRAPH CO"
12 "US5605491-A CHURCH & DWIGHT CO INC _CHUR-Non-standard_" "CHURCH & DWIGHT CO INC"
13 "US6028656-A CAMBRIDGE RES & INSTR INC _CAMB-Non-standard_" "CAMBRIDGE RES & INSTR INC"
14 "US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B" "DAEWOO ELECTRONICS CO LTD"
15 "US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F" "INT BUSINESS MACHINES CORP"
16 "US6947529-B2 -- US761995" ""
17 "US7599365-B1 -- US2004249974-A1 ALKHATIB H S _ALKH-Individual_  -- US2004249974-A1" "ALKHATIB H S"
18 "US7535880-B1 -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD _SMSU_  -- EP1379030-A2" "SAMSUNG ELECTRONICS CO LTD"
19 "WO2008027657-A2 -- US5773966-A GENERAL ELECTRIC CO _GENE_ WO2008027657-A2 -- US5773966-A" "GENERAL ELECTRIC CO"
end

compress

* Remove an extra patent code at the start of the line. regexr() replaces
* only the first match, so repeat this step if more than one extra code
* can occur. A patent code starts at the beginning of the line, has two
* capital letters followed by two or more digits, then a run of non-space
* characters; the negated class [^ ] stops the match at the first space.
gen s = regexr(cit_1,"^[A-Z][A-Z][0-9][0-9]+[^ ]* -- ","")

* The company name follows an optional patent code and consists of one or
* more characters other than "_"; the negated class [^_] stops the match
* at the first "_".
gen co2 = trim(regexs(2)) if regexm(s,"^([A-Z][A-Z][0-9][0-9]+[^ ]*)*([^_]+)_")

assert company_1 == co2
list cit_1 co2, noobs

*---------------- example end ----------------------
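
If a line can start with more than one extra patent code, the removal step has to be repeated. The following is only a rough sketch of one way to do that, not part of the tested example above: the names s2 and pat are made up for illustration, and the second data line (two leading codes) is an invented test case rather than one of Pavlos's examples.

```stata
* Sketch only: s2 and pat are hypothetical names; the second input
* line is an invented case with two leading patent codes.
clear
input str100 cit_1
"US7599365-B1 -- US2004249974-A1 ALKHATIB H S _ALKH-Individual_"
"WO2008027657-A2 -- EP1379030-A2 -- US5773966-A GENERAL ELECTRIC CO _GENE_"
"US5600312-A MOTOROLA INC _MOTI_"
end

* Same pattern as in the example: a leading patent code followed by " -- "
local pat "^[A-Z][A-Z][0-9][0-9]+[^ ]* -- "
gen s2 = cit_1

* regexr() replaces one match per call, so keep going until no
* observation still starts with a patent code followed by " -- "
count if regexm(s2, "`pat'")
while r(N) > 0 {
    replace s2 = regexr(s2, "`pat'", "")
    count if regexm(s2, "`pat'")
}

list s2, noobs
```

Each pass strips at most one leading code per observation, and the loop stops as soon as count finds no remaining matches, so observations without an extra code are left unchanged.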

Hope this helps,

Robert

On Thu, Apr 8, 2010 at 4:50 PM, Pavlos C. Symeou <[email protected]> wrote:
> Dear Robert, Ulrich and Nick,
>
> Thank you for your responses. I have run Robert's suggested code on a
> larger sample and noticed that it does not capture patent codes that start
> with text other than "US", and that it does not allow for a second patent
> code in the string. I give examples below.
>
> cit_1:     US7599365-B1 -- US2004249974-A1 ALKHATIB H S _ALKH-Individual_  -- US2004249974-A1
> company_1: ALKHATIB H S
>
> cit_1:     US7535880-B1 -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD _SMSU_  -- EP1379030-A2
> company_1: SAMSUNG ELECTRONICS CO LTD
>
> cit_1:     WO2008027657-A2 -- US5773966-A GENERAL ELECTRIC CO _GENE_ WO2008027657-A2 -- US5773966-A
> company_1: GENERAL ELECTRIC CO
>
>
> Regards,
>
> Pavlos




