Statalist The Stata Listserver


Re: st: Behaviour of -tokenize- shouldn't it drop the parsing character?


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Behaviour of -tokenize- shouldn't it drop the parsing character?
Date   Thu, 05 Oct 2006 09:28:19 -0500

David Elliott <dcelliott@gmail.com> writes, 

> I want to tokenize groups of numbers separated by the "|" character:
> e.g.: 1 2 3 | 4 5 6| 7 8 | 9 so that I have each group in a positional 
> macro _1 = 1 2 3, _2 = 4 5 6 ...  However, I have found that tokenize
> does not behave as I expected. 
> [...]

This is a perfect problem for Mata.  We can write a subroutine so that, 
in our ado-file, we can code 

        program ...
                ...
                mata: mytokenize("input")
                ...
        end

and, if local macro input contains "1 2 3 | 4 5 6 | 7 8 | 9", after 
-mata: mytokenize("input")- runs, local macros _1, _2, _3, _4, and _5 
will be defined to be 

        _1 = "1 2 3"
        _2 = "4 5 6"
        _3 = "7 8"
        _4 = "9"
        _5 = ""

In the above, note that I am passing the NAME of the local macro to 
-mytokenize()-.  I could just as easily write -mytokenize()- to accept 
the contents of the local macro, so that, rather than coding 

                mata: mytokenize("input")

I would code 
                
                mata: mytokenize("`input'")

Actually, writing -mytokenize()- to accept input the second way would be 
easier, but I'm opting for the first because, if input contains a long, 
long string, our ado-file will run a little faster.
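
For comparison, a minimal sketch of the second form follows.  The name 
-mytokenize_s()- is made up for this illustration; the body is the same 
loop as in the solution below, except that s starts from the argument 
itself rather than from -st_local()-.

        mata:
        void mytokenize_s(string scalar contents)
        {
                string scalar        s
                real scalar          i, l

                // start from the string itself; no st_local() lookup
                s = strtrim(contents)

                i = 1
                while (l = strpos(s, "|")) {
                        if (l>1) { 
                                st_local(strofreal(i++), 
                                        strtrim(substr(s, 1, l-1)))
                                s = strtrim(substr(s, l+1, .))
                        }
                        else    s = strtrim(substr(s, 2, .))
                }
                if (s != "") st_local(strofreal(i++), s)
                st_local(strofreal(i), "")
        }
        end

Stata expands `input' before Mata ever sees the line, so the contents are 
copied into the command itself.  That copy is the extra cost for a long 
string, and contents that include a double quote would break the call.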

Anyway, here's the full solution:

        ------------------------------------ myfile.ado --- BEGIN ---
        *! version ...
        program myfile 
                ...
                mata: mytokenize("input")
                ...
        end

        mata:
        void mytokenize(string scalar macname)
        {
                string scalar        s
                real scalar          i, l

                s = strtrim(st_local(macname))

                i = 1
                while (l = strpos(s, "|")) {
                        if (l>1) { 
                                st_local(strofreal(i++), 
                                        strtrim(substr(s, 1, l-1)))
                                s = strtrim(substr(s, l+1, .))
                        }
                        else    s = strtrim(substr(s, 2, .))
                }
                if (s != "") st_local(strofreal(i++), s)
                st_local(strofreal(i), "")
        }
        end
        ------------------------------------ myfile.ado ----- END ---

That is the full solution and I wanted to show that just to make clear 
mechanically where everything goes in the final ado-file, but what I 
actually did to write -mytokenize()- was create a do-file where I could 
easily test it, planning later to change it to the final ado-file:


        ------------------------------------- testit.do --- BEGIN ---
        clear 

        mata:
        void mytokenize(string scalar macname)
        {
                string scalar        s
                real scalar          i, l

                s = strtrim(st_local(macname))

                i = 1
                while (l = strpos(s, "|")) {
                        if (l>1) { 
                                st_local(strofreal(i++), 
                                        strtrim(substr(s, 1, l-1)))
                                s = strtrim(substr(s, l+1, .))
                        }
                        else    s = strtrim(substr(s, 2, .))
                }
                if (s != "") st_local(strofreal(i++), s)
                st_local(strofreal(i), "")
        }
        end

        local test "1 2 3 | 4 5 6 | 7 8"
        mata: mytokenize("test")
        mac list

        local test "1 2 3 | 4 5 6 |"
        mata: mytokenize("test")
        mac list
        ------------------------------------- testit.do ----- END ---
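
A possible variation on the test file: instead of reading the -mac list- 
output by eye, the checks can be written with -assert-, so that the do-file 
stops at the first wrong result.  The input below is an assumption, chosen 
to exercise a doubled "|" and a trailing "|"; the lines presume that 
-mytokenize()- has already been defined as above.

        local test "1 2 3 | 4 5 6 || 7 8 |"
        mata: mytokenize("test")

        * each group, trimmed, in its own positional macro
        assert "`1'" == "1 2 3"
        assert "`2'" == "4 5 6"
        assert "`3'" == "7 8"

        * the macro after the last group is cleared
        assert "`4'" == ""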


Concerning -mytokenize()-, 

    1.  The declarations at the top are all optional.  That is, rather than
        code 

                void mytokenize(string scalar macname)
                {
                        string scalar        s
                        real scalar          i, l

                        s = strtrim(st_local(macname))
                        ...

        I could just code, 

                void mytokenize(macname)
                {
                        s = strtrim(st_local(macname))
                        ...

        I include the declarations because (a) that's my style (it helps me 
        to avoid mistakes), and because (b) they make it a little easier for
        others to understand my code (because I have told the reader how I
        intend to use s, i, and l).


    2.  The guts of the program is the while loop:

                        while (l = strpos(s, "|")) {
                                ...
                        }

         The use of the single equal sign is tricky.  The -while- statement
         does *NOT* say, "while l is equal to strpos(...)".  If I wanted
         that, I would have coded -while (l==strpos(...))-.

         The -while- statement says, "assign to l the value of strpos(...);
         while l is not equal to zero".

         -strpos(...)- tells me the position of the next "|" in s.
         I save that in l.  -strpos(...)- returns 0 if there is no "|" in 
         s.  I continue doing the loop as long as there is another "|".


    3.  Inside the loop, I have separate code for l==1 and l>1.  I assume 
        l==1 should never happen, but I wanted to cover all the contingencies.
        If l==1, then the input was something like:

                      1 2 3 | 4 5 6 || 7 8 

                      1 2 3 | 4 5 6 | | 7 8 

        A trailing "|", as in

                      1 2 3 | 4 5 6 | 7 8 |

        never produces l==1; it leaves s empty after the last group, and 
        the -if (s != "")- test after the loop skips it.  Either way, I 
        treat them all as if the input were 

                      1 2 3 | 4 5 6 | 7 8

        Perhaps David wants to do something different in those cases.


    4.  Note that when I'm all done (last line of program), I code 

                       st_local(strofreal(i), "")

        I set the final `#' macro to the empty string just in case it 
        already exists.
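
The empty terminating macro is what lets the calling ado-file walk the 
results without knowing in advance how many groups there are.  A minimal 
sketch of such a loop, assuming -mytokenize()- is defined as above (the 
local name i and the -display- text are illustrative only):

        local input "1 2 3 | 4 5 6 | 7 8"
        mata: mytokenize("input")

        * step through `1', `2', ... until the first empty macro
        local i 1
        while "``i''" != "" {
                display "group `i': ``i''"
                local ++i
        }

Without the final -st_local(strofreal(i), "")-, a `4' left over from an 
earlier call would make this loop run one step too far.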

I have a postscript:  Mata is a great string-processing language.  When you do
not find what you need in Stata, think about writing your own Mata function to
provide exactly what you need.  I find that easier than working around, in the
ado-language, the limitations of whatever built-in command happens to be 
closest.

Mata string functions have a second advantage:  they work with long, long
strings.  Strings as long as a macro, or longer.
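
In that spirit, here is a sketch of one way -mytokenize()- could be 
tailored, should an empty group be better treated as an error than skipped.  
The name -mytokenize_strict()-, the message text, and the choice of return 
code 198 are all assumptions made for this illustration.

        mata:
        void mytokenize_strict(string scalar macname)
        {
                string scalar        s
                real scalar          i, l

                s = strtrim(st_local(macname))

                i = 1
                while (l = strpos(s, "|")) {
                        if (l==1) {
                                // "|" with nothing before it: empty group
                                errprintf("empty group in macro %s\n", macname)
                                exit(198)
                        }
                        st_local(strofreal(i++), strtrim(substr(s, 1, l-1)))
                        s = strtrim(substr(s, l+1, .))
                }
                if (s != "") st_local(strofreal(i++), s)
                st_local(strofreal(i), "")
        }
        end

A leading "|" or a doubled "||" stops the program; a single trailing "|" 
is still accepted silently, because it leaves s empty rather than putting 
a "|" in position 1.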


-- Bill
wgould@stata.com