



Re: st: RE: Extracting parts of string variable


From   "Pavlos C. Symeou" <p.symeou@lmu.de>
To   Robert Picard <picard@netbox.com>
Subject   Re: st: RE: Extracting parts of string variable
Date   Fri, 09 Apr 2010 10:32:20 +0200

Dear Robert,

this code creates perfect matches.

Thanks,

Pavlos

On 09/04/2010 00:04, Robert Picard wrote:
Pavlos,

I've adjusted the code to take care of the additional examples you
provided. I added some comments to explain what I have done. In this
case, it seemed easier to first remove additional patent codes at the
beginning of the line. I made the pattern that searches for patent
codes more general to take the issuing country into account.

*---------------- example start ----------------------
version 11
clear
input id str244( cit_1 company_1)
1 "US6449348-B1 3COM CORP _THRE-Non-standard_" "3COM CORP"
2 "US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_" "CETACEA NETWORKS CORP"
3 "US5566180-A HEWLETT-PACKARD CO _HEWP_" "HEWLETT-PACKARD CO"
4 "US6215865-B1 E-TALK CORP _ETAL-Non-standard_" "E-TALK CORP"
6 "US5600312-A MOTOROLA INC _MOTI_" "MOTOROLA INC"
7 "CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S" "CONRED ELECTRONICS LTD"
8 "TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B" "TEMIC TELEFUNKEN MICROELECTRONIC GMBH"
9 "US3476883-A" ""
10 "US5136671-A AT&  T BELL LAB _AMTT_" "AT&  T BELL LAB"
11 "US5195132-A AMERICAN TELEPHONE&  TELEGRAPH CO _AMTT_" "AMERICAN TELEPHONE&  TELEGRAPH CO"
12 "US5605491-A CHURCH&  DWIGHT CO INC _CHUR-Non-standard_" "CHURCH&  DWIGHT CO INC"
13 "US6028656-A CAMBRIDGE RES&  INSTR INC _CAMB-Non-standard_" "CAMBRIDGE RES&  INSTR INC"
14 "US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B" "DAEWOO ELECTRONICS CO LTD"
15 "US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F" "INT BUSINESS MACHINES CORP"
16 "US6947529-B2 -- US761995" ""
17 "US7599365-B1 -- US2004249974-A1 ALKHATIB H S _ALKH-Individual_  -- US2004249974-A1" "ALKHATIB H S"
18 "US7535880-B1 -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD _SMSU_  -- EP1379030-A2" "SAMSUNG ELECTRONICS CO LTD"
19 "WO2008027657-A2 -- US5773966-A GENERAL ELECTRIC CO _GENE_ WO2008027657-A2 -- US5773966-A" "GENERAL ELECTRIC CO"
end
end

compress

* Remove an extra leading patent code that is followed by " -- ". regexr()
* replaces only the first match, so repeat as necessary.
* A patent code starts at the beginning of the line, has two capital letters
* (the issuing country), followed by two or more digits, followed by any
* non-space characters. [^ ]* cannot match a space, so it stops at the first one.
gen s = regexr(cit_1,"^[A-Z][A-Z][0-9][0-9]+[^ ]* -- ","")

* The company name follows an optional patent code and consists of one or
* more characters other than "_". [^_]+ cannot match "_", so it stops at the
* first one.
gen co2 = trim(regexs(2)) if regexm(s,"^([A-Z][A-Z][0-9][0-9]+[^ ]*)*([^_]+)_")

assert company_1 == co2
list cit_1 co2, noobs

*---------------- example end ----------------------
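
One possible reading of "repeat as necessary" in the comments above is a loop that keeps stripping leading patent codes until no observation changes. The following is only a minimal sketch under that assumption, not part of the code that was run on the real data. The variable names cit, s and s_new are hypothetical, and observation 2 is a made-up row with three codes, included only to force more than one pass.

```stata
* Hypothetical sketch: strip every leading "CODE -- " prefix.
* Observation 2 is invented for illustration; it is not from the real data.
clear
input id str100 cit
1 "US7535880-B1 -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD _SMSU_"
2 "US1111111-A -- EP2222222-A2 -- WO3333333-A1 EXAMPLE CO _EXAM_"
3 "US5600312-A MOTOROLA INC _MOTI_"
end

gen s = cit
local changed = 1
while `changed' {
    * regexr() replaces only the first match, so apply it once per pass.
    gen s_new = regexr(s, "^[A-Z][A-Z][0-9][0-9]+[^ ]* -- ", "")
    * Count the observations that this pass actually changed.
    count if s_new != s
    local changed = r(N)
    replace s = s_new
    drop s_new
}
list id s, noobs
```

The loop ends on the first pass in which count reports no changed observations, so strings with a single patent code cost only one extra pass.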

Hope this helps,

Robert

On Thu, Apr 8, 2010 at 4:50 PM, Pavlos C. Symeou<p.symeou@lmu.de>  wrote:
Dear Robert, Ulrich and Nick,

thank you for your responses. I have now run Robert's suggested code on a
larger sample, only to notice that it does not capture patent codes that
start with letters other than "US", and that it does not allow for a second
patent code in the string. I give examples below.

cit_1:     US7599365-B1 -- US2004249974-A1 ALKHATIB H S _ALKH-Individual_  -- US2004249974-A1
company_1: ALKHATIB H S

cit_1:     US7535880-B1 -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD _SMSU_  -- EP1379030-A2
company_1: SAMSUNG ELECTRONICS CO LTD

cit_1:     WO2008027657-A2 -- US5773966-A GENERAL ELECTRIC CO _GENE_ WO2008027657-A2 -- US5773966-A
company_1: GENERAL ELECTRIC CO


Regards,

Pavlos

