Re: st: RE: Extracting parts of string variable


From   "Pavlos C. Symeou" <p.symeou@lmu.de>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: RE: Extracting parts of string variable
Date   Thu, 08 Apr 2010 22:50:25 +0200

Dear Robert, Ulrich and Nick,

Thank you for your responses. For now I have run Robert's suggested code on a larger sample, only to find that it does not capture patent codes that start with text other than "US", and that it does not allow for a second patent code in the string. I give examples below.

cit_1    company_1
US7599365-B1 -- US2004249974-A1 ALKHATIB H S _ALKH-Individual_    -- US2004249974-A1 ALKHATIB H S
US7535880-B1 -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD _SMSU_    -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD
WO2008027657-A2 -- US5773966-A GENERAL ELECTRIC CO _GENE_    WO2008027657-A2 -- US5773966-A GENERAL ELECTRIC CO
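
One possible way to cover these cases, offered only as an untested sketch: widen the prefix in Robert's pattern from "US" to any two capital letters, and let each patent code be followed by optional blanks and hyphens so that a second code after "--" is also skipped. The variable name co3 is hypothetical, and the pattern assumes that every patent code is two capital letters followed by digits and that no company name starts that way.

```stata
* Sketch only: assumes patent codes look like two capital letters,
* then digits, then an optional suffix such as -B1 or -A2
version 11
clear
input str80 cit_1
"US7599365-B1 -- US2004249974-A1 ALKHATIB H S _ALKH-Individual_"
"US7535880-B1 -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD _SMSU_"
"WO2008027657-A2 -- US5773966-A GENERAL ELECTRIC CO _GENE_"
end

* Group 1 is repeated once per leading patent code, including any
* blanks and hyphens that follow it; group 2 is the text up to the
* first underscore
gen co3 = trim(regexs(2)) if regexm(cit_1, "^([A-Z][A-Z][0-9]+[^ ]* *-* *)*([^_]+)_")
list, noobs
```

If the assumptions hold, co3 should contain only the company names for these three rows. A company name that itself begins with two capital letters and a digit would be swallowed as if it were a patent code, so the result is worth checking against the full sample.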


Regards,

Pavlos


On 08/04/2010 19:07, Robert Picard wrote:
Pavlos,

Here's my attempt:

*-------------------- example -------------------------
version 11
clear
input id str244(cit_1 company_1)
1 "US6449348-B1 3COM CORP _THRE-Non-standard_" "3COM CORP"
2 "US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_" "CETACEA NETWORKS CORP"
3 "US5566180-A HEWLETT-PACKARD CO _HEWP_" "HEWLETT-PACKARD CO"
4 "US6215865-B1 E-TALK CORP _ETAL-Non-standard_" "E-TALK CORP"
6 "US5600312-A MOTOROLA INC _MOTI_" "MOTOROLA INC"
7 "CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S" "CONRED ELECTRONICS LTD"
8 "TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B" "TEMIC TELEFUNKEN MICROELECTRONIC GMBH"
9 "US3476883-A" ""
10 "US5136671-A AT&  T BELL LAB _AMTT_" "AT&  T BELL LAB"
11 "US5195132-A AMERICAN TELEPHONE&  TELEGRAPH CO _AMTT_" "AMERICAN TELEPHONE&  TELEGRAPH CO"
12 "US5605491-A CHURCH&  DWIGHT CO INC _CHUR-Non-standard_" "CHURCH&  DWIGHT CO INC"
13 "US6028656-A CAMBRIDGE RES&  INSTR INC _CAMB-Non-standard_" "CAMBRIDGE RES&  INSTR INC"
14 "US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B" "DAEWOO ELECTRONICS CO LTD"
15 "US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F" "INT BUSINESS MACHINES CORP"
16 "US6947529-B2 -- US761995" ""
end

compress

gen co2 = trim(regexs(2)) if regexm(cit_1,"^(US[0-9]+[^ ]*)*([^_]+)_")
assert company_1 == co2
list, noobs
*-------------------- example -------------------------


Robert


On Thu, Apr 8, 2010 at 12:56 PM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
I'd take a look at -split-. Even then, the recipe doesn't look simple, given that your company names may contain blanks.
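
As a rough sketch of what a -split- based approach might look like (an illustration only, not Nick's own code): split cit_1 at each underscore, so that the first piece holds everything before the abbreviation, then strip a leading patent code from that piece. The names part and co_split are hypothetical, and the sketch assumes a single leading code that starts with "US".

```stata
* Sketch only: assumes at most one leading patent code, starting with US
version 11
clear
input str60 cit_1
"US6449348-B1 3COM CORP _THRE-Non-standard_"
"CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S"
end

* Split at underscores; part1 is the text before the first underscore
split cit_1, parse("_") generate(part)

* Drop a leading patent code, if any, and trim the remaining blanks
gen co_split = trim(regexr(part1, "^US[0-9]+[^ ]*", ""))
list cit_1 co_split, noobs
```

Because -split- is applied to underscores rather than blanks, company names that contain blanks stay in one piece. Strings with no underscore at all end up entirely in part1, so they would need separate handling.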

Nick
n.j.cox@durham.ac.uk

Pavlos C. Symeou

I am having problems with a command that extracts part of a string
variable to create another string variable. The existing string
variable is cit_1, which may contain one or more instances of any of
the following: a patent number (e.g. "US6449348-B1"), a company name
(e.g. "3COM CORP"), a company abbreviation enclosed by "_" (e.g.
"_THRE-Non-standard_"), and other text after the "_" (e.g. see id 8).
My aim is to extract the company name, which always appears before its
abbreviation, and use it to create a new string variable company_1. I
used the following command, which fails to account for the different
forms of the cit_1 values and produces incorrect company names.

gen company_1 = regexs(2) if (regexm(cit_1, "([A-Z0-9]*[\-][A-Z0-9]*[ \-]*) *([A-Z0-9 ]*)( *)([\_])(.*)([\_])"))

I provide below the various forms that cit_1 takes and how company_1
should look.


id      cit_1    company_1
1       US6449348-B1 3COM CORP _THRE-Non-standard_    3COM CORP
2       US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_    CETACEA NETWORKS CORP
3       US5566180-A HEWLETT-PACKARD CO _HEWP_    HEWLETT-PACKARD CO
4       US6215865-B1 E-TALK CORP _ETAL-Non-standard_    E-TALK CORP
        US4528422-A -- US452232-A1 INTELEPLEX CORP _INTE-Non-standard_    INTELEPLEX CORP
6       US5600312-A MOTOROLA INC _MOTI_    MOTOROLA INC
7       CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S    CONRED ELECTRONICS LTD
8       TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B    TEMIC TELEFUNKEN MICROELECTRONIC GMBH
9       US3476883-A
10      US5136671-A AT&  T BELL LAB _AMTT_    AT&  T BELL LAB
11      US5195132-A AMERICAN TELEPHONE&  TELEGRAPH CO _AMTT_    AMERICAN TELEPHONE&  TELEGRAPH CO
12      US5605491-A CHURCH&  DWIGHT CO INC _CHUR-Non-standard_    CHURCH&  DWIGHT CO INC
13      US6028656-A CAMBRIDGE RES&  INSTR INC _CAMB-Non-standard_    CAMBRIDGE RES&  INSTR INC
14      US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B    DAEWOO ELECTRONICS CO LTD
15      US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F    INT BUSINESS MACHINES CORP
16      US6947529-B2 -- US761995

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


