Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: Removing Repeated Phrases in String Variable
Date: Sat, 14 Dec 2013 22:24:01 +0000

In addition to Eric's helpful and detailed suggestions, check out

http://www.stata.com/support/faqs/data-management/counting-distinct-strings/index.html

Nick
njcoxstata@gmail.com

On 14 December 2013 21:56, Eric Booth <eric.a.booth@gmail.com> wrote:
> <>
>
> Some examples:
>
> ******************!
>
> ****EXAMPLE 1:
> clear
> inp str1000 test
> "BMW North America; Honda; Toyota; Nissan; BMW North America; Mercedes Benz North America; Nissan; Subaru; Nissan; Ford"
> "item1; item number2; item3; item number2"
> end
>
> replace test = `"""'+test+`"""'
> replace test = subinstr(test, "; ", `"" ""', .) //tokenize
>
> **
> list test , notrim noobs
>
> forval n = 1/`=_N' {
>     loc t `"`=test[`n']'"'
>     loc t2 : list uniq t
>     replace test = `"`: list uniq t'"' in `n'
> }
>
> list test , notrim noobs //duplicates gone
>
> ***************
> ****EXAMPLE 2:
>
> clear
> inp str1000 test
> "BMW North America; Honda; Toyota; Nissan; BMW North America; Mercedes Benz North America; Nissan; Subaru; Nissan; Ford"
> "item1; item number2; item3; item number2"
> end
> replace test = subinstr(test, "; ", `"" ""', .) //tokenize
>
> split test, parse(`"" ""')
> di `"`r(nvars)'"'
> drop test
>
> g i = _n
> reshape long test@, i(i) j(j)
> duplicates drop i test, force
>
> ****
> **put back together
> reshape wide test@, i(i) j(j)
> drop i
> g test = ""
> order test
> foreach x of varlist test* {
>     replace test = test+ `"""' + `x' + `"" "'
> }
> replace test = subinstr(test, `""" "', "", .)
> *****************!
>
> -lstrfun- and -moss- from SSC could be of use as well.
>
> - Eric
>
> On Dec 14, 2013, at 3:26 PM, Becker Stein <becker.stein@aol.com> wrote:
>
>> Hi,
>>
>> I was wondering if someone could help me remove repeated words/phrases in a string variable. My data has a lot of repeats and I only want to keep the first instance of an item. Below is an example.
>>
>> BMW North America; Honda; Toyota; Nissan; BMW North America; Mercedes Benz North America; Nissan; Subaru; Nissan; Ford
>>
>> In the above example, I'd like to get rid of the extra instances of BMW North America and Nissan. Is there a way to do this? Thanks in advance for your help.
>>
>> Becker
>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/
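
For readers who want the result back in the original "; "-separated form, here is a minimal sketch of a variant of the split-based idea in Example 2. It is not code from the thread. The names part and clean are my own illustrative choices, and it assumes the separator is always a semicolon followed by a space and that each string fits in str200.

```stata
clear
input str200 test
"BMW North America; Honda; Toyota; Nissan; BMW North America; Mercedes Benz North America; Nissan; Subaru; Nissan; Ford"
"item1; item number2; item3; item number2"
end

* Split on "; " into part1, part2, ... (stub name "part" is illustrative)
split test, parse("; ") generate(part)
local k = r(nvars)

* Blank out any item that already appeared earlier in the same observation
forvalues j = 1/`k' {
    replace part`j' = trim(part`j')
    forvalues m = 1/`=`j'-1' {
        replace part`j' = "" if part`j' == part`m'
    }
}

* Rebuild the string, keeping first instances in their original order
generate str200 clean = ""
forvalues j = 1/`k' {
    replace clean = clean + cond(clean == "", "", "; ") + part`j' if part`j' != ""
}

drop part*
list clean, notrim noobs
```

On the two example observations this should leave one instance each of BMW North America and Nissan in the first, and one instance of item number2 in the second. Unlike the reshape route, it keeps one observation per original row throughout, at the cost of a double loop over the split variables.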