# Re: st: string management questions

From: n j cox
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: string management questions
Date: Thu, 10 May 2007 12:48:08 +0100

David's advice is to avoid looping over observations if possible. I agree.

Let us think which assumptions are reasonable and
which assumptions might be too restrictive.

I take it that

1. Commas are separators. In other problems there
might be different separators, but if so just vary the
recipe accordingly.

2. Leading and trailing spaces are cosmetic, and
not informative. Thus in "a, a", there is no
difference between "a" and " a".

On the other hand, Wanli's real problem might
be more complicated than the example given. As it turns
out, we can work out a solution with moderate
generality:

3. Distinct substrings may or may not be single letters.

4. Strings could contain spaces. Thus "New York"
and "New Haven" are acceptable.

Other than #1 and #2, what I suggest takes strings literally.

Incidentally, the macro list approach is going
to struggle with strings like

"New York, New Haven, New York"

which is (obviously to our eyes) a string with three
substrings, two distinct, but taking out the commas is
evidently the wrong way to go in such cases.
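
To see the difficulty concretely, here is a minimal sketch of the comma-stripping approach applied to that string. The macro names -s-, -nocommas- and -u- are illustrative only.

```stata
* Sketch only: the macro names are illustrative assumptions.
local s "New York, New Haven, New York"
* Remove every comma, as the macro list approach requires
local nocommas : subinstr local s "," "", all
* -list uniq- treats each space-separated word as an element
local u : list uniq nocommas
display "`u'"
* shows: New York Haven
```

The repeated word "New" is dropped along with the repeated "York", so the two city names can no longer be recovered.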

We needn't worry about that because a more direct solution
is possible.

Let's set up a toy example:

set obs 5
gen myvar = "C" in 1
quietly {
    replace myvar = "Ch, C" in 2
    replace myvar = "Chi, Ch, C" in 3
    replace myvar = "Chin, Chi, Ch, C" in 4
    replace myvar = "C, Ch, Chi, Chin, China, Ch" in 5
}

In essence we need to split the string variable into substrings
and then re-combine, throwing out substrings we have seen before.

split myvar, p(,)

-split- given -myvar- and -p(,)- will parse (split) on commas
and throw those commas away. It will create -myvar1-, -myvar2-,
etc. We could still have leading and trailing spaces.
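
A quick way to see what -split- leaves behind is to try it on a single string. This is a standalone sketch: the variable name -s- is illustrative, and -clear- drops any data in memory, so run it in a fresh session.

```stata
* Standalone sketch: -s- is an illustrative variable name.
clear
set obs 1
gen s = "New York, New Haven, New York"
split s, p(,)
* Names of the new variables
display "`r(varlist)'"
* Brackets make the leading space in the second piece visible
display "[" s2[1] "]"
* The commas are gone, but the space after each comma survives
assert s1 == "New York"
assert s2 == " New Haven"
assert trim(s2) == "New Haven"
```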

It is safe to initialise with the first substring met,
trimming it first.

replace myvar1 = trim(myvar1)
gen distinct = myvar1

We will want to keep track of whether substrings are the
same (or different). An indicator variable is a good way
to do that. Initialise that.

gen byte same = 0

-split- leaves the names of the variables it creates
in r(varlist). Count how many new variables we have.

local nvars : word count `r(varlist)'

Now we want to loop over the other substrings.
First, we trim leading and trailing spaces. Then
we have to check whether any substring is one we
have met before. My logic for doing this is that
I first assume each substring is different (-same- is 0)
but finding even one that is the same is enough
to change my mind (-same- is 1). Then I add
a new substring only if it is not the same as
any so far met. A subtle detail (meaning, I missed this
in my first pass) is that an empty substring will differ
from any non-empty substring so far met, but we don't want to
add empty substrings to the list.

qui forval i = 2/`nvars' {
    replace myvar`i' = trim(myvar`i')
    replace same = 0
    local prev = `i' - 1
    forval j = 1/`prev' {
        replace same = 1 if myvar`i' == myvar`j'
    }
    replace distinct = distinct + "," + myvar`i' ///
        if !same & myvar`i' != ""
}

Wanli's kind of example is easier. Here is another toy:

gen easier = "a" in 1
quietly {
    replace easier = "a, b" in 2
    replace easier = "a, b, c" in 3
    replace easier = "a, b, c, a" in 4
    replace easier = "a, b, c, d, d, d" in 5
}

The algorithm can be based on testing whether
each substring is already included in the list so
far. -index(<stringvar>, <substringvar>)- will be
positive if the contents of <substringvar> are
contained within <stringvar> and 0 otherwise.
Negating that to get -!index()- returns 1 if something
is not included and 0 if something is included.
In Stata 9 -index()- is called -strpos()-.
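
A few direct evaluations may make the test concrete. The sketch uses -strpos()-, so it assumes Stata 9 or later; in earlier versions substitute -index()-. The last line shows why this shortcut suits only examples like Wanli's, in which no substring is contained within another.

```stata
* -strpos()- returns the position of the first match, or 0 if there is none
display strpos("a,b,c", "b")
* shows 3: "b" is already in the list
display !strpos("a,b,c", "d")
* shows 1: "d" is not yet in the list
display !strpos("C,Chi", "Ch")
* shows 0: "Ch" is found inside "Chi", so it would wrongly count as already listed
```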

split easier, p(,)
local nvars : word count `r(varlist)'
gen edistinct = trim(easier1)

qui forval i = 2/`nvars' {
    replace easier`i' = trim(easier`i')
    replace edistinct = ///
        edistinct + "," + easier`i' ///
        if !index(edistinct, easier`i') & easier`i' != ""
}

Here is the more general code in one piece for copiers and pasters.
To apply it to a different example:

1. Change -myvar- to whatever is appropriate.
2. Check that variable names -distinct- and -same- are not in use.
3. Change the separator from , if needed.

-----------------------------------------------------------
split myvar, p(,)
replace myvar1 = trim(myvar1)
gen distinct = myvar1
gen byte same = 0

local nvars : word count `r(varlist)'
qui forval i = 2/`nvars' {
    replace myvar`i' = trim(myvar`i')
    replace same = 0
    local prev = `i' - 1
    forval j = 1/`prev' {
        replace same = 1 if myvar`i' == myvar`j'
    }
    replace distinct = distinct + "," + myvar`i' ///
        if !same & myvar`i' != ""
}
-----------------------------------------------------------
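
As an illustration of the three changes listed above, here is a sketch of the same recipe applied to a hypothetical variable -city- with semicolons as separators. The data are invented for the purpose, and -clear- drops any data in memory, so try it in a fresh session.

```stata
* Standalone sketch: -city- and its contents are invented for illustration.
clear
set obs 2
gen city = "New York; New Haven; New York" in 1
replace city = "Durham;  ; Durham; York" in 2

* Same recipe, with -city- in place of -myvar- and ; in place of ,
split city, p(;)
local nvars : word count `r(varlist)'
replace city1 = trim(city1)
gen distinct = city1
gen byte same = 0

quietly forvalues i = 2/`nvars' {
    replace city`i' = trim(city`i')
    replace same = 0
    local prev = `i' - 1
    forvalues j = 1/`prev' {
        replace same = 1 if city`i' == city`j'
    }
    replace distinct = distinct + ";" + city`i' if !same & city`i' != ""
}

list city distinct
* Embedded spaces are kept; repeats and empty pieces are dropped
assert distinct == "New York;New Haven" in 1
assert distinct == "Durham;York" in 2
```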

Nick
n.j.cox@durham.ac.uk

David Kantor replied to Wanli Zhao:

> I have a string variable. One observation is like
> "a, b, f, g, b, a, a, f, g, g".
> How do I create another variable which shows no repeated values, i.e.,
> "a, b, f, g". The sequence does not matter.

I don't know how to do this in a variable. But, again, if you can
do it in macros, it is easy. But you need to get rid of the commas --
which you can do with subinstr. Once that is accomplished, you can
use the macro list facilities, in particular -local macro1: list uniq
macro0-. See -help macrolists-. If these values really must reside
in variables, you can loop through the observations, bring the value
into macros, do the operation using the macrolist facilities as shown
above, then store the value back into the variable. Note that
looping through observations is usually not a good idea and usually
not necessary. This might be an exception. But I'm not sure if
there are other better options; maybe I've overlooked something.
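
A minimal sketch of the loop over observations described here might look as follows. The toy data and the variable name -nodup- are assumptions for illustration; the macro names follow -macro0- and -macro1- above. It relies on the substrings themselves containing no spaces, and -clear- drops any data in memory.

```stata
* Standalone sketch: the data and -nodup- are illustrative assumptions.
clear
set obs 2
gen myvar = "a, b, f, g, b, a, a, f, g, g" in 1
replace myvar = "b, b, a" in 2
gen nodup = ""

forvalues i = 1/`=_N' {
    * Bring the value for this observation into a macro
    local macro0 = myvar[`i']
    * Get rid of the commas
    local macro0 : subinstr local macro0 "," "", all
    * Keep the first occurrence of each word
    local macro1 : list uniq macro0
    * Store the result back into the variable
    quietly replace nodup = "`macro1'" in `i'
}

list
assert nodup == "a b f g" in 1
assert nodup == "b a" in 2
```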
