Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Robert Picard <picard@netbox.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: RE: Extracting parts of string variable |
Date | Thu, 8 Apr 2010 13:07:24 -0400 |
Pavlos,

Here's my attempt:

*-------------------- example -------------------------
version 11
clear
input id str244( cit_1 company_1)
1 "US6449348-B1 3COM CORP _THRE-Non-standard_" "3COM CORP"
2 "US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_" "CETACEA NETWORKS CORP"
3 "US5566180-A HEWLETT-PACKARD CO _HEWP_" "HEWLETT-PACKARD CO"
4 "US6215865-B1 E-TALK CORP _ETAL-Non-standard_" "E-TALK CORP"
6 "US5600312-A MOTOROLA INC _MOTI_" "MOTOROLA INC"
7 "CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S" "CONRED ELECTRONICS LTD"
8 "TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B" "TEMIC TELEFUNKEN MICROELECTRONIC GMBH"
9 "US3476883-A" ""
10 "US5136671-A AT & T BELL LAB _AMTT_" "AT & T BELL LAB"
11 "US5195132-A AMERICAN TELEPHONE & TELEGRAPH CO _AMTT_" "AMERICAN TELEPHONE & TELEGRAPH CO"
12 "US5605491-A CHURCH & DWIGHT CO INC _CHUR-Non-standard_" "CHURCH & DWIGHT CO INC"
13 "US6028656-A CAMBRIDGE RES & INSTR INC _CAMB-Non-standard_" "CAMBRIDGE RES & INSTR INC"
14 "US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B" "DAEWOO ELECTRONICS CO LTD"
15 "US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F" "INT BUSINESS MACHINES CORP"
16 "US6947529-B2 -- US761995" ""
end
compress

gen co2 = trim(regexs(2)) if regexm(cit_1,"^(US[0-9]+[^ ]*)*([^_]+)_")
assert company_1 == co2
list, noobs
*-------------------- example -------------------------

Robert

On Thu, Apr 8, 2010 at 12:56 PM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> I'd take a look at -split-. The recipe doesn't look simple even then
> given that your company names may contain blanks.
>
> Nick
> n.j.cox@durham.ac.uk
>
> Pavlos C. Symeou
>
> I am experiencing some problems with a command I use to extract a part
> of a string variable which I use to create another string variable. The
> existing string variable is cit_1 and may contain (one or multiple
> instances of any of) a patent number (e.g. "US6449348-B1"), a company
> name (e.g. "3COM CORP"), a company abbreviation enclosed by "_" (e.g.
> "_THRE-Non-standard_"), other text after the "_" (e.g. see id 8). My aim
> is to extract the company name, which appears always before its
> abbreviation and use it to create a new string variable company_1. I
> used the following command, which however fails to account for different
> forms of the cit_1 values and produces incorrect company names.
>
> gen company_1 = regexs(2) if (regexm(cit_1, "([A-Z0-9]*[\-][A-Z0-9]*[ \-]*) *([A-Z0-9 ]*)( *)([\_])(.*)([\_])"))
>
> I provide below the various forms that cit_1 takes and how company_1
> should look.
>
> id   cit_1   company_1
> 1    US6449348-B1 3COM CORP _THRE-Non-standard_   3COM CORP
> 2    US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_   CETACEA NETWORKS CORP
> 3    US5566180-A HEWLETT-PACKARD CO _HEWP_   HEWLETT-PACKARD CO
> 4    US6215865-B1 E-TALK CORP _ETAL-Non-standard_   E-TALK CORP
>      US4528422-A -- US452232-A1 INTELEPLEX CORP _INTE-Non-standard_   INTELEPLEX CORP
> 6    US5600312-A MOTOROLA INC _MOTI_   MOTOROLA INC
> 7    CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S   CONRED ELECTRONICS LTD
> 8    TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B   TEMIC TELEFUNKEN MICROELECTRONIC GMBH
> 9    US3476883-A
> 10   US5136671-A AT & T BELL LAB _AMTT_   AT & T BELL LAB
> 11   US5195132-A AMERICAN TELEPHONE & TELEGRAPH CO _AMTT_   AMERICAN TELEPHONE & TELEGRAPH CO
> 12   US5605491-A CHURCH & DWIGHT CO INC _CHUR-Non-standard_   CHURCH & DWIGHT CO INC
> 13   US6028656-A CAMBRIDGE RES & INSTR INC _CAMB-Non-standard_   CAMBRIDGE RES & INSTR INC
> 14   US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B   DAEWOO ELECTRONICS CO LTD
> 15   US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F   INT BUSINESS MACHINES CORP
> 16   US6947529-B2 -- US761995
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
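
For anyone reading the thread later, the pattern in Robert's example can be read piece by piece: `^(US[0-9]+[^ ]*)*` optionally consumes a leading patent number, `([^_]+)` captures everything up to the first underscore, and `trim()` removes the surrounding blanks. The sketch below is only an illustration and is not part of the original post; the three-row dataset and the variable names `matched` and `company` are made up for the demonstration, with the cit_1 values copied from the examples in the thread.

```stata
* Minimal sketch (illustrative names, not from the original post)
version 11
clear
input str80 cit_1
"US6449348-B1 3COM CORP _THRE-Non-standard_"
"CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S"
"US3476883-A"
end

* regexm() returns 1 when the pattern matches and 0 otherwise
gen byte matched = regexm(cit_1, "^(US[0-9]+[^ ]*)*([^_]+)_")

* regexs(2) returns the second parenthesized subexpression of the match,
* here the text between the optional patent number and the first "_"
gen company = trim(regexs(2)) if regexm(cit_1, "^(US[0-9]+[^ ]*)*([^_]+)_")

list, noobs
```

A value without an underscore, such as the third row, does not match, so `company` is left empty for it. One caveat, offered as a reading of the pattern rather than a tested result: the row in Pavlos's listing with two patent numbers separated by " -- " is not among Robert's input lines, and the pattern would appear to keep the second patent number in the captured text, because the optional first group cannot skip the blank that follows the first patent number.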