Re: st: Extracting different portions of string values

From	Eric Booth <[email protected]>
To	"<[email protected]>" <[email protected]>
Subject	Re: st: Extracting different portions of string values
Date	Fri, 1 Oct 2010 16:10:15 +0000

Echoing Nick's comment, there needs to be a clearer explanation of what is being matched here.  As I read it, you want the first word of cit_1 (everything before the first space), unless the second word is "--", in which case you want the third word (the one that follows the second space).  If these examples are illustrative of the structure of all your cit_* variables, then you can use -split- to pick out the first or third word depending on these conditions.

However, if there is variation in cit_1 that is not in your example data, this approach will not work (e.g., if you've got other words before the first citation, then "2A2 EP562128-A -- DE1684639" would break it). If such variations exist, you will need to change my code below or switch to a different approach, using either a series of -strpos()- and -substr()- functions (as Nick mentions) or regular expression matching.

If your example is consistent with the rest of your data, this _should_ work:

***************!
clear

input id  str60(cit_1 cit_2)
1   "EP696218-A -- WO9215370-A   SUND _SUND-Individual_" "EP578126-A -- CH180906-A "
2   "WO9425112-A -- GB298635-A"    "EP696218-A -- WO9215370-A   SUND "
3   "EP578126-A -- CH180906-A    AGE_OK" "US4994899-A   SEC OF"
4   "EP562128-A -- DE1684639-A" "EP588128-A -- DE1684639-A asdfasdf"
5   "WO9318277-A -- DK137935-B"  "WO9999997-A - A"
6   "US4434855-A   SEC OF NAVY _USNA_" "" 
end

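* Change the range below to match the number of cit_* variables you have:
forval n = 1/2 {
	* split on single spaces into cit_`n'1, cit_`n'2, ...
	split cit_`n', p(" ")
	* save the new variable names now, before anything can overwrite r()
	local pieces `r(varlist)'
	* no "--" in second position: keep the first word
	replace cit_`n' = cit_`n'1 if cit_`n'2!="--" & !mi(cit_`n'1)
	* "--" in second position: keep the third word
	replace cit_`n' = cit_`n'3 if cit_`n'2=="--" & !mi(cit_`n'3)
	drop `pieces'
}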
***************!
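
For comparison, here is a rough sketch of the regular expression route mentioned above, using -regexm()- and -regexs()-. It is only an illustration under the same assumption as the -split- code: a citation number is a run of non-space characters, and a second citation is signalled by " -- " right after the first one. The variable name cit_1_rx and the three test rows are made up for this example; it writes to a new variable so the original strings stay intact for checking.

```stata
* Sketch only: cit_1_rx is a hypothetical name, and the rows are taken
* from the example data above.
clear
input id str60 cit_1
1 "EP696218-A -- WO9215370-A   SUND _SUND-Individual_"
2 "US4434855-A   SEC OF NAVY _USNA_"
3 "WO9999997-A - A"
end

* change the range to the number of cit_* variables you have
forval n = 1/1 {
	gen str60 cit_`n'_rx = ""
	* pattern "word, space, --, space, word": keep the second word
	replace cit_`n'_rx = regexs(1) if regexm(cit_`n', "^[^ ]+ -- ([^ ]+)")
	* anything not matched above: keep the first word
	replace cit_`n'_rx = regexs(1) if cit_`n'_rx == "" & regexm(cit_`n', "^([^ ]+)")
}

list id cit_1 cit_1_rx
```

Rows that are empty match neither pattern and stay empty. If your real data have leading blanks or other text before the first citation, the patterns would need to be adjusted.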

- Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754

On Oct 1, 2010, at 3:41 AM, Florian Seliger wrote:

> Hi,
> 
> we are searching for commands in order to extract different portions of string  values.
> 
> Our data with patent citations looks like this:
> 
> id  cit_1
> 1   EP696218-A -- WO9215370-A   SUND _SUND-Individual_
> 2   WO9425112-A -- GB298635-A
> 3   EP578126-A -- CH180906-A    AGE_OK
> 4   EP562128-A -- DE1684639-A
> 5   WO9318277-A -- DK137935-B
> 6   US4434855-A   SEC OF NAVY _USNA_
> .
> .
> .
> .
> 
> with 100,000 IDs and about 500 affected variables (cit_1, cit_2, cit_3...).
> In this example, we only want to keep the second portion for the IDs 1-5, but the first portion for ID 6. We want to extract the first portion whenever there is only one citation number.
> 
> The data should thus look like this:
> 
> id  cit_1
> 1   WO9215370-A
> 2   GB298635-A
> 3   CH180906-A
> 4   DE1684639-A
> 5   DK137935-B
> 6   US4434855-A
> .
> .
> .
> 
> 
> 
