Statalist The Stata Listserver



Re: st: string management questions


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: string management questions
Date   Thu, 10 May 2007 08:37:39 -0500

Wanli Zhao <zhaowl@temple.edu> has two questions on string functions:

>   1. Is there some way to increase the limit of string length of 244? 
>      When I create string from other string variables, I found some
>      weird things happen.  Now I realize it's due to the length limit of
>      string variable.  I am using SE 9.2.
> 
>
>   2. I asked this before. I have a string variable. One observation is
>      like "a, b, f, g, b, a, a, f, g, g". How do I create another 
>      variable which shows no repeated values, i.e., "a, b, f, g". The
>      sequence does not matter.

The answer to the first is no, with a proviso:  you can use Mata to work 
around the 244-character limit so long as, Statawise, the inputs and 
outputs are no longer than 244 characters.
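
As a minimal sketch of that proviso (not part of the original exchange):
inside Mata, a string built from two pieces that each fit in a str244 can
itself be longer than 244 characters.  The helper name fits244() and the
test strings are made up for illustration; only values that pass the check
could be stored back into a Stata 9 string variable.

```mata
mata:
// fits244() is a hypothetical helper: it returns 1 for each element of s
// that is short enough for a Stata 9 string variable, and 0 otherwise.
real colvector fits244(string colvector s)
{
        return(strlen(s) :<= 244)
}

a      = 200*"x"            // 200 characters: fits on its own
b      = 200*"y"            // 200 characters: fits on its own
joined = a + ", " + b       // 402 characters: fine as a Mata intermediate
strlen(joined)
fits244((a \ b \ joined))   // only the first two could be stored in Stata
end
```

The long value can be used freely as an intermediate result, provided it
is reduced to 244 characters or fewer before it goes back into a variable.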

I assume there are a number of answers to the second question.  The answer
I'm going to show uses Mata, mainly to show how one might go into and out 
of Mata to solve a string problem.

So let's assume we have Stata variable -replies- containing strings 
like "a, b, f, g, b, a, a, f, g, g" (order not significant).

I just made the following example dataset:

        . list

             +-------------------------------+
             |                       replies |
             |-------------------------------|
          1. | a , b, f, g, b, a, a, f, g, g |
          2. |                               |
          3. |                       b, a, b |
             +-------------------------------+

The first thing I want to do is get rid of the commas.  This can be done 
in Stata, or in the midst of our Mata code.  I'm going to do it in Stata:

        . replace replies = subinstr(replies, ",", "", .)

        . list

             +----------------------+
             |              replies |
             |----------------------|
          1. | a  b f g b a a f g g |
          2. |                      |
          3. |                b a b |
             +----------------------+
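
For comparison, here is a hedged sketch of doing the same step in the
midst of the Mata code instead.  The function name splitreplies() is
hypothetical.  It swaps each comma for a blank rather than deleting it,
an assumption made here so that input such as "a,b" (no space after the
comma) still splits into two words.

```mata
mata:
// splitreplies() is a hypothetical helper: turn "a , b,f, g" into the
// row vector ("a", "b", "f", "g").
string rowvector splitreplies(string scalar s)
{
        // replace every comma with a blank, then split on white space
        return(tokens(subinstr(s, ",", " ", .)))
}

splitreplies("a , b,f, g")
end
```

With such a helper, the -replace- step in Stata could be skipped and
splitreplies(data[i]) used wherever tokens(data[i]) appears.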

Here is my solution.  First, I created a do-file to contain my Mata code:

        --------------------------------------------------- mymatacode.do ---
        mata:

        mata clear

        void fixvar(string scalar varname)
        {
                string colvector    data
                real scalar         i

                st_sview(data, ., varname)

                for (i=1; i<=rows(data); i++) {
                        data[i] = myinvtokens( uniqrows(tokens(data[i])') )
                }
        }

        string scalar myinvtokens(string vector s)
        {
                string scalar   result
                real scalar     i

                if (length(s)) {
                        result = s[1]
                        for (i=2; i<=length(s); i++) {
                                result = result + " " + s[i]
                        }
                }
                return(result)
        }
        end
        --------------------------------------------------- mymatacode.do ---

With that do-file written, I typed, 

        . do mymatacode
          <output omitted>

        . mata: fixvar("replies")

        . list

             +---------+
             | replies |
             |---------|
          1. | a b f g |
          2. |         |
          3. |     a b |
             +---------+


I apologize for all the code above.  Routine myinvtokens() would be
unnecessary if you had Ben Jann's MF_INVTOKENS installed and, really, 
Mata should have had an -invtokens()- function all along.

The routine that's important above is 

        void fixvar(string scalar varname)
        {
                string colvector    data
                real scalar         i

                st_sview(data, ., varname)

                for (i=1; i<=rows(data); i++) {
                        data[i] = myinvtokens( uniqrows(tokens(data[i])') )
                }
        }

and, as always, I emphasize you could have omitted the declarations:

        void fixvar(varname)
        {
                st_sview(data, ., varname)

                for (i=1; i<=rows(data); i++) {
                        data[i] = myinvtokens( uniqrows(tokens(data[i])') )
                }
        }

I include the declarations because I'm hoping that will help you understand
the program.  Maybe it would have been better had I been a bit more verbose
in my code, 

        void fixvar(varname)
        {
                st_sview(data, ., varname)

                for (i=1; i<=rows(data); i++) {
                        orig      = data[i]
                        origasvec = tokens(orig)
                        uniqorig  = uniqrows(origasvec')
                        data[i]   = myinvtokens(uniqorig)
                }
        }

Anyway, data[] is a view onto varname, which will be "replies".

data[i] is thus the i-th observation of replies.

tokens(data[i]) changes "a b a" into row vector ("a", "b", "a").

Next I use function uniqrows().  There is no -uniqcols()- function, so I 
transpose the argument tokens(data[i]):  uniqrows(tokens(data[i])').

Now I have ("a", "b").  I put that back into a scalar as "a b", and replace 
data[i].
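
The same steps can be traced interactively on a single string.  This is
only an illustrative sketch: the variable names follow the verbose
version of fixvar(), and the joining loop is written inline so that the
fragment stands on its own.

```mata
mata:
orig      = "a b a"
origasvec = tokens(orig)            // 1 x 3 row vector ("a", "b", "a")
uniqorig  = uniqrows(origasvec')    // 2 x 1 column vector ("a" \ "b")

// put the unique words back into one string, separated by blanks
joined = ""
for (i=1; i<=rows(uniqorig); i++) {
        joined = joined + (i>1 ? " " : "") + uniqorig[i]
}
joined                              // "a b"
end
```

Note that uniqrows() returns its result sorted, which is why observation 3,
"b a b", came back as "a b".  That is harmless here because the order of
the replies does not matter.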

In the above, I didn't really need to make -fixvar()- a program.  I could 
have done it interactively, something like


        --------------------------------------------------- mymatacode.do ---
        mata:

        mata clear

        function myinvtokens(s)
        {
                result = ""     // ensure a string is returned even when s is empty
                if (length(s)) {
                        result = s[1]
                        for (i=2; i<=length(s); i++) {
                                result = result + " " + s[i]
                        }
                }
                return(result)
        }


        st_sview(data=., ., "replies")

        for (i=1; i<=rows(data); i++) {
                data[i] = myinvtokens( uniqrows(tokens(data[i])') )
        }
        end
        --------------------------------------------------- mymatacode.do ---

Now, if I had Ben Jann's -invtokens()- function, I could have used that. 
I assume Ben's -invtokens()- requires a row vector as an argument, and I 
have a column vector, so I add a transpose to my code:

        --------------------------------------------------- mymatacode.do ---
        mata:

        st_sview(data=., ., "replies")

        for (i=1; i<=rows(data); i++) {
                data[i] = invtokens( uniqrows(tokens(data[i])')' )
        }
        end
        --------------------------------------------------- mymatacode.do ---

That, really, is the gist of the solution.

-- Bill
wgould@stata.com


