The Stata listserver

RE: st: Re: replace a string variable


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: Re: replace a string variable
Date   Mon, 9 May 2005 16:18:09 +0100

I agree with Eric's general stance here. In essence
you want Stata to be smart and treat "similar" names
as "identical", but Stata has no idea what is "similar"
unless you spell that out. 

More positively, you can start building up 
a script like this: 

gen newvend = "" 
replace newvend = "ZIMMER" if substr(vend,1,6) == "ZIMMER" 
replace newvend = "SULZER" if substr(vend,1,6) == "SULZER" 
replace newvend = "STRYKER" if substr(vend,1,7) == "STRYKER" 
replace newvend = "STRYKER" if substr(vend,1,6) == "STYKER" 

tab vend if mi(newvend) 

will show you what remains unclassified. 
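As the list of names grows, the same idea can be written as a loop. What follows is only a minimal sketch: the toy data are copied from the first-word values in Eric's table, the local macro names (name, len) are my own, and -count- stands in for -tab- because -tab- stops with an error once nothing is left unclassified.

```stata
* Toy data: first-word values taken from the table in this thread
clear
input str20 vend
"STRYKER"
"STRYKERITALIA"
"STYKER"
"SULZER"
"SULZERMEDICA"
"ZIMMER"
end

gen newvend = ""

* One pass per canonical name; the prefix length is computed, not typed by hand
foreach name in ZIMMER SULZER STRYKER {
    local len = length("`name'")
    replace newvend = "`name'" if substr(vend, 1, `len') == "`name'"
}

* Known typos still need a line of their own
replace newvend = "STRYKER" if substr(vend, 1, 6) == "STYKER"

* How many values remain unclassified?
count if mi(newvend)
```

Adding a new company then means adding one word to the -foreach- list, while each typo still has to be spelled out explicitly.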

Stata 9 has new functions: 

regexm(s,re) performs a match of a regular expression and evaluates to 1 if regular expression re is
satisfied by the string s, otherwise returns 0.  Regular expression syntax is based on Henry
Spencer's NFA algorithm and as such, is nearly identical to the POSIX.2 standard.

regexr(s1,re,s2) replaces the first substring of s1 that matches the regular expression re with s2
and returns the result.  If re is not satisfied, the original s1 is returned.

regexs(n) returns subexpression n from a previous regexm() match, where 0 <= n < 10.  Subexpression 0 is
reserved for the entire string that satisfied the regular expression.

I haven't tried these out yet. 
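For what it is worth, here is a sketch of how these functions might be applied to this problem, assuming they behave as the descriptions above say. The toy data are a few of the names from the original post; the patterns and the variable name fixed are illustrative assumptions, not a tested recipe.

```stata
* Toy data: a few of the full names from the original post
clear
input str25 vend
"STRYKER ITALIA SRL"
"STRYKERITALIA"
"STYKER ITALIA SRL"
"SULZER MEDICA"
"SULZERMEDICA"
"ZIMMER - NEX GEN"
end

gen newvend = ""

* ^ anchors the match at the start of the string;
* regexs(1) returns whichever alternative inside the parentheses matched
replace newvend = regexs(1) if regexm(vend, "^(ZIMMER|SULZER)")

* R? makes the R optional, so STRYKER and the typo STYKER both match
replace newvend = "STRYKER" if regexm(vend, "^STR?YKER")

* regexr() can instead repair the typo in place and keep the rest of the name
gen fixed = regexr(vend, "^STYKER", "STRYKER")

* How many values remain unclassified?
count if mi(newvend)
```

Compared with -substr()-, the pattern carries its own length, and one optional character can absorb a typo, but every variant still has to be anticipated by whoever writes the pattern.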

Nick 
[email protected] 

Eric G. Wruck

> I think this will necessarily involve some manual work & a 
> familiarity with Stata's string functions.  Using the 
> <word()> function will get you part of the way there:
> 
> 
> . gen vend = word(vendor,1)
> 
> . table vend
> 
> --------------------------
>          vend |      Freq.
> --------------+-----------
>       STRYKER |          4
> STRYKERITALIA |          1
>        STYKER |          1
>        SULZER |         10
>  SULZERMEDICA |          1
>        ZIMMER |          6
> --------------------------
> 
> .
> But as you can see, we also got STYKER, STRYKERITALIA, and 
> SULZERMEDICA.  The Stykers of the world (i.e., typos) are 
> going to cause you the most trouble.  I view this as a 
> necessary part of data analysis.

Paolo Grillo, MD 

> >I have a dataset with a string variable named VEND. It contains
> >many company names, written in varied ways, although often they
> >refer to the same company.
> >For example, for three different firms:
> >
> >STRYKER ITALIA SRL
> >STRYKER ITALIA SRL -
> >STRYKER ITALIA SRL S
> >         STRYKER SRL
> >       STRYKERITALIA
> >   STYKER ITALIA SRL
> >              SULZER
> >       SULZER MEDICA
> >SULZER OR ITALIA SPA
> >  SULZER ORTHOPEDICS
> >SULZER ORTHOPEDICS I
> >   SULZER ORTHPEDICS
> >    SULZER ORTOPEDIC
> >SULZER ORTOPEDICA IT
> >SULZER ORTOPEDICS IT
> >       SULZER PROTEK
> >        SULZERMEDICA
> >              ZIMMER
> >    ZIMMER - NEX GEN
> >          ZIMMER ARL
> >       ZIMMER S.R.L.
> >ZIMMER S.R.L.     (C
> >          ZIMMER SRL
> >
> >
> >
> >where the names are clearly
> >STRYKER
> >SULZER
> >ZIMMER
> >
> >How can I replace these strings with the same cluster name?
> >Do you know if there is a command similar to
> >. replace vend if  vend=="zimmer***"
> >or do I have to build a do-file with a lot of -substr- and
> >-index- commands?



