From: Eric Booth <ebooth@ppri.tamu.edu>
To: "<statalist@hsphsun2.harvard.edu>" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: Extracting different portions of string values
Date: Fri, 1 Oct 2010 16:10:15 +0000
<>

Echoing Nick's comment, there needs to be a clearer explanation of what is being matched here. As I read it, you want the first word of cit_1 (everything before the first space), unless the second word is "--", in which case you want the third word (the one that follows the second space).

If these examples are illustrative of the structure of all your cit_* variables, then you can use -split- to pick out the first or the third word depending on these conditions. However, if there is some kind of variation in cit_1 that is not in your example data, this approach will not work (e.g., if you've got other words before the first citation, "2A2 EP562128-A -- DE1684639" would break it). If such variations exist, you will need to change my code below or change your approach completely, using either a series of -strpos()- and -substr()- functions (as Nick mentions) or regular expression matching.

If your example is consistent with the rest of your data, this _should_ work:

***************!
clear
inp id str40(cit_1 cit_2)
1 "EP696218-A -- WO9215370-A SUND _SUND-Individual_" "EP578126-A -- CH180906-A "
2 "WO9425112-A -- GB298635-A" "EP696218-A -- WO9215370-A SUND "
3 "EP578126-A -- CH180906-A AGE_OK" "US4994899-A SEC OF"
4 "EP562128-A -- DE1684639-A" "EP588128-A -- DE1684639-A asdfasdf"
5 "WO9318277-A -- DK137935-B" "WO9999997-A - A"
6 "US4434855-A SEC OF NAVY _USNA_" ""
end

**change the range below to the number of cit_* variables you have:
forval n = 1/2 {
    **gen(part) names the pieces part1, part2, ... so they cannot collide
    **with existing variables (a bare -split cit_1- would try to create
    **cit_11, which already exists once you have more than 10 cit_* variables)
    split cit_`n', p(" ") gen(part)
    **store the list of new variables right away so it can be dropped below
    local parts `r(varlist)'
    replace cit_`n' = part1 if part2 != "--" & !mi(part1)
    replace cit_`n' = part3 if part2 == "--" & !mi(part3)
    drop `parts'
}
***************!

- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
Office: +979.845.6754

On Oct 1, 2010, at 3:41 AM, Florian Seliger wrote:

> Hi,
>
> we are searching for commands in order to extract different portions of string values.
>
> Our data with patent citations looks like this:
>
> id cit_1
> 1 EP696218-A -- WO9215370-A SUND _SUND-Individual_
> 2 WO9425112-A -- GB298635-A
> 3 EP578126-A -- CH180906-A AGE_OK
> 4 EP562128-A -- DE1684639-A
> 5 WO9318277-A -- DK137935-B
> 6 US4434855-A SEC OF NAVY _USNA_
> .
> .
> .
> .
>
> with 100,000 IDs and about 500 affected variables (cit_1, cit_2, cit_3...).
> In this example, we only want to keep the second portion for the IDs 1-5, but the first portion for ID 6. We want to extract the first portion whenever there is only one citation number.
>
> The data should thus look like this:
>
> id cit_1
> 1 WO9215370-A
> 2 GB298635-A
> 3 CH180906-A
> 4 DE1684639-A
> 5 DK137935-B
> 6 US4434855-A
> .
> .
> .
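
For comparison, the regular-expression route mentioned in the reply could look roughly like the sketch below. It is only an illustration under the same assumption as the -split- code: each value is either a single citation followed by optional text, or "first -- second" followed by optional text. The variable name -wanted- and the pattern itself are my own choices, not anything from the thread, and the example data are the six rows from the original question.

```stata
clear
input id str60 cit_1
1 "EP696218-A -- WO9215370-A SUND _SUND-Individual_"
2 "WO9425112-A -- GB298635-A"
3 "EP578126-A -- CH180906-A AGE_OK"
4 "EP562128-A -- DE1684639-A"
5 "WO9318277-A -- DK137935-B"
6 "US4434855-A SEC OF NAVY _USNA_"
end

* -wanted- is a hypothetical name; start with the first word of cit_1
gen str60 wanted = word(cit_1, 1)

* where the value looks like "<first> -- <second> ...", keep <second>:
* regexm() tests the pattern and regexs(1) returns the parenthesised part
replace wanted = regexs(1) if regexm(cit_1, "^[^ ]+ -- ([^ ]+)")

list id cit_1 wanted, noobs
```

On these six rows, -wanted- should hold the second citation for IDs 1-5 and the first citation for ID 6. To cover many cit_* variables, the -gen- and -replace- lines could be wrapped in the same kind of -forval- loop as above, writing back to cit_`n' instead of creating a new variable. Values that do not start with a citation would still need a different pattern.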