Re: st: RE: Extracting parts of string variable


From: Robert Picard <picard@netbox.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: Extracting parts of string variable
Date: Thu, 8 Apr 2010 13:07:24 -0400

Pavlos,

Here's my attempt:

*-------------------- example -------------------------
version 11
clear
input id str244( cit_1 company_1)
1 "US6449348-B1 3COM CORP _THRE-Non-standard_" "3COM CORP"
2 "US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_" "CETACEA NETWORKS CORP"
3 "US5566180-A HEWLETT-PACKARD CO _HEWP_" "HEWLETT-PACKARD CO"
4 "US6215865-B1 E-TALK CORP _ETAL-Non-standard_" "E-TALK CORP"
6 "US5600312-A MOTOROLA INC _MOTI_" "MOTOROLA INC"
7 "CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S" "CONRED ELECTRONICS LTD"
8 "TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B" "TEMIC TELEFUNKEN MICROELECTRONIC GMBH"
9 "US3476883-A" ""
10 "US5136671-A AT & T BELL LAB _AMTT_" "AT & T BELL LAB"
11 "US5195132-A AMERICAN TELEPHONE & TELEGRAPH CO _AMTT_" "AMERICAN TELEPHONE & TELEGRAPH CO"
12 "US5605491-A CHURCH & DWIGHT CO INC _CHUR-Non-standard_" "CHURCH & DWIGHT CO INC"
13 "US6028656-A CAMBRIDGE RES & INSTR INC _CAMB-Non-standard_" "CAMBRIDGE RES & INSTR INC"
14 "US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B" "DAEWOO ELECTRONICS CO LTD"
15 "US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F" "INT BUSINESS MACHINES CORP"
16 "US6947529-B2 -- US761995" ""
end

compress

gen co2 = trim(regexs(2)) if regexm(cit_1,"^(US[0-9]+[^ ]*)*([^_]+)_")
assert company_1 == co2
list, noobs
*-------------------- example -------------------------
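
For anyone reading the pattern for the first time, here is a small annotated sketch of how it splits a value of cit_1. It reuses the same regular expression as above; the local macro names (pattern, s1, s2) are only for illustration, and the two test strings are copied from ids 1 and 7 in the data.

```stata
* Pattern from the example above, annotated piece by piece:
*   ^                 start of the string
*   (US[0-9]+[^ ]*)*  optional patent number: "US", digits, then any
*                     non-blank suffix such as "-B1"
*   ([^_]+)           company name: everything before the first "_"
*   _                 the underscore that opens the abbreviation
local pattern "^(US[0-9]+[^ ]*)*([^_]+)_"

* Two values of cit_1 copied from the data (id 1 and id 7)
local s1 "US6449348-B1 3COM CORP _THRE-Non-standard_"
local s2 "CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S"

* Show the trimmed second subexpression in brackets for each string
forvalues i = 1/2 {
    if regexm("`s`i''", "`pattern'") {
        display "[" trim(regexs(2)) "]"
    }
}
```

The brackets make stray blanks visible, and the displayed names can be compared by eye with company_1 for those two ids. Values with no underscore at all (ids 9 and 16) do not match, so co2 is left empty for them.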


Robert


On Thu, Apr 8, 2010 at 12:56 PM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> I'd take a look at -split-. The recipe doesn't look simple even then given that your company names may contain blanks.
>
> Nick
> n.j.cox@durham.ac.uk
>
> Pavlos C. Symeou
>
> I am having trouble with a command that extracts part of a string
> variable to create another string variable. The existing string
> variable is cit_1 and may contain (one or multiple instances of any
> of) a patent number (e.g. "US6449348-B1"), a company name (e.g.
> "3COM CORP"), a company abbreviation enclosed by "_" (e.g.
> "_THRE-Non-standard_"), and other text after the "_" (e.g. see id 8).
> My aim is to extract the company name, which always appears before
> its abbreviation, and use it to create a new string variable
> company_1. I used the following command, which fails to handle the
> different forms of the cit_1 values and produces incorrect company
> names.
>
> gen company_1 = regexs(2) if (regexm(cit_1, "([A-Z0-9]*[\-][A-Z0-9]*[ \-]*) *([A-Z0-9 ]*)( *)([\_])(.*)([\_])"))
>
> I provide below the various forms that cit_1 takes and how company_1
> should look.
>
>
> id   cit_1   company_1
> 1    US6449348-B1 3COM CORP _THRE-Non-standard_   3COM CORP
> 2    US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_   CETACEA NETWORKS CORP
> 3    US5566180-A HEWLETT-PACKARD CO _HEWP_   HEWLETT-PACKARD CO
> 4    US6215865-B1 E-TALK CORP _ETAL-Non-standard_   E-TALK CORP
>      US4528422-A -- US452232-A1 INTELEPLEX CORP _INTE-Non-standard_   INTELEPLEX CORP
> 6    US5600312-A MOTOROLA INC _MOTI_   MOTOROLA INC
> 7    CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S   CONRED ELECTRONICS LTD
> 8    TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B   TEMIC TELEFUNKEN MICROELECTRONIC GMBH
> 9    US3476883-A
> 10   US5136671-A AT & T BELL LAB _AMTT_   AT & T BELL LAB
> 11   US5195132-A AMERICAN TELEPHONE & TELEGRAPH CO _AMTT_   AMERICAN TELEPHONE & TELEGRAPH CO
> 12   US5605491-A CHURCH & DWIGHT CO INC _CHUR-Non-standard_   CHURCH & DWIGHT CO INC
> 13   US6028656-A CAMBRIDGE RES & INSTR INC _CAMB-Non-standard_   CAMBRIDGE RES & INSTR INC
> 14   US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B   DAEWOO ELECTRONICS CO LTD
> 15   US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F   INT BUSINESS MACHINES CORP
> 16   US6947529-B2 -- US761995
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>


