Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Extracting parts of string variable
From 
 
Ulrich Kohler <[email protected]> 
To 
 
[email protected] 
Subject 
 
Re: st: Extracting parts of string variable 
Date 
 
Thu, 08 Apr 2010 19:18:26 +0200 
Am Donnerstag, den 08.04.2010, 18:35 +0200 schrieb Pavlos C. Symeou:
> Dear Statalisters,
> 
> I am experiencing some problems with a command I use to extract a part 
> of a string variable which I use to create another string variable. The 
> existing string variable is cit_1 and may contain (one or multiple 
> instances of any of) a patent number (e.g. "US6449348-B1"), a company 
> name (e.g. "3COM CORP"), a company abbreviation enclosed by "_" (e.g. 
> "_THRE-Non-standard_"), other text after the "_" (e.g. see id 8). My aim 
> is to extract the company name, which appears always before its 
> abbreviation and use it to create a new string variable company_1. I 
> used the following command, which however fails to account for different 
> forms of the cit_1 values and produces incorrect company names.
> 
> gen company_1 = regexs(2) if (regexm(cit_1, "([A-Z0-9]*[\-][A-Z0-9]*[ 
> \-]*) *([A-Z0-9 ]*)( *)([\_])(.*)([\_])"))
> 
> I provide below the various forms that cit_1 takes and how company_1 
> should look.
> 
> 
> id 	cit_1 	company_1
> 1 	US6449348-B1 3COM CORP _THRE-Non-standard_ 	3COM CORP
> 2 	US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_ 	CETACEA 
> NETWORKS CORP
> 3 	US5566180-A HEWLETT-PACKARD CO _HEWP_ 	HEWLETT-PACKARD CO
> 4 	US6215865-B1 E-TALK CORP _ETAL-Non-standard_ 	E-TALK CORP
> 
> 	US4528422-A -- US452232-A1 INTELEPLEX CORP _INTE-Non-standard_ 
> INTELEPLEX CORP
> 6 	US5600312-A MOTOROLA INC _MOTI_ 	MOTOROLA INC
> 7 	CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S 	CONRED 
> ELECTRONICS LTD
> 8 	TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B 
> TEMIC TELEFUNKEN MICROELECTRONIC GMBH
> 9 	US3476883-A 	
> 10 	US5136671-A AT & T BELL LAB _AMTT_ 	AT & T BELL LAB
> 11 	US5195132-A AMERICAN TELEPHONE & TELEGRAPH CO _AMTT_ 	AMERICAN 
> TELEPHONE & TELEGRAPH CO
> 12 	US5605491-A CHURCH & DWIGHT CO INC _CHUR-Non-standard_ 	CHURCH & 
> DWIGHT CO INC
> 13 	US6028656-A CAMBRIDGE RES & INSTR INC _CAMB-Non-standard_ 	CAMBRIDGE 
> RES & INSTR INC
> 14 	US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B 
> DAEWOO ELECTRONICS CO LTD
> 15 	US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F 	INT 
> BUSINESS MACHINES CORP
> 16 	US6947529-B2 -- US761995 	
> 
Here are some thoughts: 
In most lines you can extract the text between the first blank and the
first occurance of _. So you don't need regular expressions here: 
. gen pos1 = strpos(cit_1, " ")
. gen pos2 = strpos(cit_1, "_")
. gen x = substr(cit_1,pos1,pos2-pos1)
Obviously, this does not work when the patent number is missing. But you
can deal with that as follows: 
. gen pos1 = strpos(cit_1, " ") if strpos(cit_1, "US")==1
. replace pos1 = 1 if strpos(cit_1, "US")!=1
. gen pos2 = strpos(cit_1, "_")
. gen x = substr(cit_1,pos1,pos2-pos1)
This still leaves the problem unsolved, when there are two patent
numbers. In your example there is always " -- " between the patent
numbers. Erasing that substring at the beginning solves the problem as
well: 
. replace cit_1 = subinstr(cit_1," -- ","",.)
. gen pos1 = strpos(cit_1, " ") if strpos(cit_1, "US")==1
. replace pos1 = 1 if strpos(cit_1, "US")!=1
. gen pos2 = strpos(cit_1, "_")
. gen x = substr(cit_1,pos1,pos2-pos1)
This solves your example data set. It seems very likely that other
problems remain in the entire data, however.
Uli
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/