# Re: st: string management questions

From: wgould@stata.com (William Gould, Stata)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: string management questions
Date: Thu, 10 May 2007 08:37:39 -0500

```
Wanli Zhao <zhaowl@temple.edu> has two questions on string functions:

>   1. Is there some way to increase the limit of string length of 244?
>      When I create string from other string variables, I found some
>      weird things happen.  Now I realize it's due to the length limit of
>      string variable.  I am using SE 9.2.
>
>
>   2. I asked this before. I have a string variable. One observation is
>      like "a, b, f, g, b, a, a, f, g, g". How do I create another
>      variable which shows no repeated values, i.e., "a, b, f, g". The
>      sequence does not matter.

The answer to the first is no, with a proviso:  you can use Mata to work
around the 244 limit so long as, Statawise, the inputs and outputs are
no longer than 244 characters.

I assume there are a number of answers to the second question.  The answer
I'm going to show uses Mata, mainly to show how one might go into and out
of Mata to solve a string problem.

So let's assume we have Stata variable -replies- containing strings
like "a, b, f, g, b, a, a, f, g, g" (order not significant).

I just made the following example dataset:

. list

     +-------------------------------+
     |                       replies |
     |-------------------------------|
  1. | a , b, f, g, b, a, a, f, g, g |
  2. |                               |
  3. |                       b, a, b |
     +-------------------------------+

The first thing I want to do is get rid of the commas.  This can be done
in Stata, or in the midst of our Mata code.  I'm going to do it in Stata:

. replace replies = subinstr(replies, ",", "", .)

. list

     +----------------------+
     |              replies |
     |----------------------|
  1. | a  b f g b a a f g g |
  2. |                      |
  3. |                b a b |
     +----------------------+

Here is my solution.  First, I created a do-file to contain my Mata code:

--------------------------------------------------- mymatacode.do ---
mata:

mata clear

void fixvar(string scalar varname)
{
        string colvector    data

        st_sview(data, ., varname)

        for (i=1; i<=rows(data); i++) {
                data[i] = myinvtokens( uniqrows(tokens(data[i])') )
        }
}

string scalar myinvtokens(string vector s)
{
        string scalar   result
        real scalar     i

        if (length(s)) {
                result = s[1]
                for (i=2; i<=length(s); i++) {
                        result = result + " " + s[i]
                }
        }
        return(result)
}
end
--------------------------------------------------- mymatacode.do ---

With that do-file written, I typed,

. do mymatacode
<output omitted>

. mata: fixvar("replies")

. list
     +---------+
     | replies |
     |---------|
  1. | a b f g |
  2. |         |
  3. |     a b |
     +---------+

I apologize for all the code above.  Routine myinvtokens() would be
unnecessary if you have Ben Jann's MF_INVTOKENS installed and, really,
Mata should have had an -invtokens()- function all along.

The routine that's important above is

void fixvar(string scalar varname)
{
        string colvector    data

        st_sview(data, ., varname)

        for (i=1; i<=rows(data); i++) {
                data[i] = myinvtokens( uniqrows(tokens(data[i])') )
        }
}

and, as always, I emphasize you could have omitted the declarations:

void fixvar(varname)
{
        st_sview(data, ., varname)

        for (i=1; i<=rows(data); i++) {
                data[i] = myinvtokens( uniqrows(tokens(data[i])') )
        }
}

I include the declarations because I'm hoping that will help you understand
the program.  Maybe it would have been better had I been a bit more verbose
in my code,

void fixvar(varname)
{
        st_sview(data, ., varname)

        for (i=1; i<=rows(data); i++) {
                orig      = data[i]
                origasvec = tokens(orig)
                uniqorig  = uniqrows(origasvec')
                data[i]   = myinvtokens(uniqorig)
        }
}

Anyway, data[] is a view onto varname, which will be "replies".

data[i] is thus the i-th observation of replies.

tokens(data[i]) changes "a b a" into row vector ("a", "b", "a").

Next I use function uniqrows().  There is no -uniqcols()- function, so I
transpose the argument tokens(data[i]):  uniqrows(tokens(data[i])').

Now I have ("a", "b").  I put that back into a scalar as "a b", and replace
data[i].

In the above, I didn't really need to make -fixvar()- a program.  I could
have done it interactively, something like

--------------------------------------------------- mymatacode.do ---
mata:

mata clear

function myinvtokens(s)
{
        // without declarations, result must be set explicitly so that
        // an empty s (e.g., an empty observation) returns ""
        result = ""
        if (length(s)) {
                result = s[1]
                for (i=2; i<=length(s); i++) {
                        result = result + " " + s[i]
                }
        }
        return(result)
}

st_sview(data=., ., "replies")

for (i=1; i<=rows(data); i++) {
        data[i] = myinvtokens( uniqrows(tokens(data[i])') )
}
end
--------------------------------------------------- mymatacode.do ---

Now, if I had Ben Jann's -invtokens()- function, I could have used that.
I assume Ben's -invtokens()- requires a row vector as an argument, and I
have a column vector, so I add a transpose to my code:

--------------------------------------------------- mymatacode.do ---
mata:

st_sview(data=., ., "replies")

for (i=1; i<=rows(data); i++) {
        data[i] = invtokens( uniqrows(tokens(data[i])')' )
}
end
--------------------------------------------------- mymatacode.do ---

That, really, is the gist of the solution.

-- Bill
wgould@stata.com
```
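The tokenize, de-duplicate, and rejoin steps described in the post can also be collected into a single function that works on one string at a time. The following is only a sketch under stated assumptions: the name `uniqtokens()` is hypothetical and does not appear in the original post, and the sketch turns commas into spaces inside Mata rather than stripping them in Stata first, so that input such as "a,b" (no space after the comma) would still split into two tokens. It uses only `subinstr()`, `tokens()`, and `uniqrows()`, and does its own rejoining, so it does not depend on any `invtokens()` being installed.

```mata
mata:

// Hypothetical helper (not part of the original post): returns the
// distinct blank-separated tokens of s, joined by single spaces.
string scalar uniqtokens(string scalar s)
{
        string colvector    u
        string scalar       result
        real scalar         i

        // commas become blanks, tokens() splits on blanks, and the
        // transpose gives uniqrows() the column vector it expects
        u = uniqrows(tokens(subinstr(s, ",", " "))')

        // rejoin; an empty input leaves result as ""
        result = ""
        for (i=1; i<=rows(u); i++) {
                result = result + (i>1 ? " " : "") + u[i]
        }
        return(result)
}

// interactive check on the first example observation
uniqtokens("a, b, f, g, b, a, a, f, g, g")

end
```

If this behaves as intended, the check should display the same "a b f g" shown for the first observation in the post. To apply the helper to a whole variable, it could be called inside the same `st_sview()` loop used above, as `data[i] = uniqtokens(data[i])`.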