Stata 11 help for mf_tokenget

help mata tokenget()

Title

[M-5] tokenget() -- Advanced parsing

Syntax

t = tokeninit([wchars [, pchars [, qchars [, allownum [, allowhex]]]]])

t = tokeninitstata()

void tokenset(t, string scalar s)

string rowvector tokengetall(t)

string scalar tokenget(t)

string scalar tokenpeek(t)

string scalar tokenrest(t)

real scalar tokenoffset(t)

void tokenoffset(t, real scalar offset)

string scalar tokenwchars(t)

void tokenwchars(t, string scalar wchars)

string rowvector tokenpchars(t)

void tokenpchars(t, string rowvector pchars)

string rowvector tokenqchars(t)

void tokenqchars(t, string rowvector qchars)

real scalar tokenallownum(t)

void tokenallownum(t, real scalar allownum)

real scalar tokenallowhex(t)

void tokenallowhex(t, real scalar allowhex)

where

t is transmorphic and contains the parsing environment information. You obtain a t from tokeninit() or tokeninitstata() and then pass t to the other functions.

wchars is a string scalar containing the characters to be treated as white space, such as " ", (" "+char(9)), or "".

pchars is a string rowvector containing the strings to be treated as parsing characters, such as "" and (">", "<", ">=", "<="). "" and J(1,0,"") are given the same interpretation: there are no parsing characters.

qchars is a string rowvector containing the character pairs to be treated as quote characters. "" (i.e., empty string) is given the same interpretation as J(1,0,""); there are no quote characters. qchars=(`""""') (i.e., the two-character string "") indicates that " is to be treated as open quote and " is to be treated as close quote. qchars=(`""""', `"`""'"') indicates that, in addition, `" is to be treated as open quote and "' as close quote. In a syntax that did not use < and > as parsing characters, qchars=("<>") would indicate that < is to be treated as open quote and > as close quote.

allownum is a real scalar containing 0 or 1. allownum=1 indicates that numbers such as 12.23 and 1.52e+02 are to be returned as single tokens even in violation of other parsing rules.

allowhex is a real scalar containing 0 or 1. allowhex=1 indicates that numbers such as 1.921fb54442d18X+001 and 1.0x+a are to be returned as single tokens even in violation of other parsing rules.

Description

These functions provide advanced parsing. If you simply wish to convert strings into row vectors by separating on blanks, converting "mpg weight displ" into ("mpg", "weight", "displ"), see [M-5] tokens().

Remarks

Remarks are presented under the following headings:

Concepts
    White-space characters
    Parsing characters
    Quote characters
    Overrides
    Setting the environment to parse on blanks with quote binding
    Setting the environment to parse full Stata syntax
    Setting the environment to parse tab-delimited files
Function overview
    tokeninit() and tokeninitstata()
    tokenset()
    tokengetall()
    tokenget(), tokenpeek(), and tokenrest()
    tokenoffset()
    tokenwchars(), tokenpchars(), and tokenqchars()
    tokenallownum() and tokenallowhex()

Concepts

Parsing refers to splitting a string into pieces, which we will call tokens. Parsing as implemented by the token*() functions is defined by (1) the white-space characters wchars, (2) the parsing characters pchars, and (3) the quote characters qchars.

White-space characters

Consider the string "this that what". If there are no white-space characters, no parsing characters, and no quote characters, i.e., if wchars=pchars=qchars="", then the result of parsing "this that what" would be one token that would be the string just as it is: "this that what".

If wchars were instead " ", then parsing "this that what" results in ("this", "that", "what"). Parsing "this   that      what" (note the multiple blanks) would result in the same thing. White-space characters separate one token from the next but are not otherwise significant.
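A minimal sketch of the white-space rule just described, assuming an interactive Mata session. The variable names t, s, and tokens are illustrative only; the string is stored in a variable before it is passed to tokenset().

```mata
mata:
// Blank is the only white-space character; no parsing or quote characters.
t = tokeninit(" ", "", "")

// Runs of blanks separate tokens but are not returned themselves.
s = "this   that      what"
tokenset(t, s)
tokens = tokengetall(t)
tokens
end
```

Per the text above, tokens should hold the same three elements that a single-blank version of the string would produce.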

Parsing characters

If we instead left wchars="" and set pchars=" ", "this that what" parses into ("this", " ", "that", " ", "what") and parsing "this that   what" (note the three blanks before "what") results in ("this", " ", "that", " ", " ", " ", "what").

pchars are like wchars except that they are themselves significant.

pchars do not usually contain space. A more reasonable definition of pchars is ("+", "-"). Then parsing "x+y" results in ("x", "+", "y"). Also, the parsing characters can be character combinations. If pchars = ("+", "-", "++", "--"), then parsing "x+y++" results in ("x", "+", "y", "++") and parsing "x+++y" results in ("x", "++", "+", "y"). Longer pchars are matched before shorter ones regardless of the order in which they appear in the pchars vector.
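The longest-match rule can be checked directly. The following is a small sketch, assuming an interactive Mata session; the names t, s1, s2, r1, and r2 are illustrative.

```mata
mata:
// No white space, four parsing strings, no quote characters.
t = tokeninit("", ("+", "-", "++", "--"), "")

s1 = "x+y++"
tokenset(t, s1)
r1 = tokengetall(t)

s2 = "x+++y"
tokenset(t, s2)
r2 = tokengetall(t)

r1
r2
end
```

Reordering the elements of pchars should leave r1 and r2 unchanged, because longer parsing strings are matched first.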

Quote characters

qchars specifies the quote characters. Pieces of the string being parsed that are surrounded by quotes are returned as one token, ignoring the separation that would usually occur because of the wchars and pchars definitions. Consider the string

mystr = "x = y"

Let wchars = " " and pchars include "=". That by itself would result in the above string parsing into the five tokens

+-------------------------+
| mystr | = | "x | = | y" |
+-------------------------+

Now let qchars = (`""""'); i.e., qchars is the two-character string "". Parsing then results in the three tokens

+---------------------+
| mystr | = | "x = y" |
+---------------------+

Each element of qchars contains a character pair: the open character followed by the close character. We defined those two characters as " and " above, i.e., as being the same. The two characters can differ. We might define the first as ` and the second as '. When the characters are different, quotations can nest. The quotation "he said "hello"" makes no sense because that parses into ("he said ", hello, ""). The quotation `he said `hello'', however, makes perfect sense and results in the single token `he said `hello''.

The quote characters can themselves be multiple characters. You can define open quote as `" and close as "': qchars=(`"`""'"'). Or you can define multiple sets of quotation characters, such as qchars=(`""""', `"`""'"').

The quote characters do not even have to be quotes at all. In some context you might find it convenient to specify them as ("()"). With that definition, "(2*(3+2))" would parse into one token. Specifying them like this can be useful, but in general we recommend against it. It is usually better to write your code so that quote characters really are quote characters and to push the work of handling other kinds of nested expressions back onto the caller.
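To see quote binding in action, the mystr example can be parsed with and without quote characters. This is a sketch under the assumption of an interactive Mata session; q, s, t0, t1, loose, and bound are illustrative names, and char(34) is used to build the double-quote character so that no compound quotes are needed.

```mata
mata:
q = char(34)                          // the double-quote character
s = "mystr = " + q + "x = y" + q      // the string from the example above

// No quote characters: the double quotes are ordinary text.
t0 = tokeninit(" ", "=", "")
tokenset(t0, s)
loose = tokengetall(t0)

// Double quote as both the open and the close quote character.
t1 = tokeninit(" ", "=", q + q)
tokenset(t1, s)
bound = tokengetall(t1)

cols(loose), cols(bound)
bound[3]
end
```

The quoted token is returned with its quote characters still attached.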

Overrides

The token*() functions provide two overrides: allownum and allowhex. These have to do with parsing numbers. First, consider life without overrides. You have set wchars=" " and pchars=("=", "+", "-", "*", "/"). You attempt to parse

y = x + 1e+13

The result is

+-----------------------------+
| y | = | x | + | 1e | + | 13 |
+-----------------------------+

when what you wanted was

+-----------------------+
| y | = | x | + | 1e+13 |
+-----------------------+

Setting allownum=1 will achieve the desired result. allownum specifies that, when a token could be interpreted as a number, the number interpretation is to be taken even in violation of the other parsing rules.

Setting allownum=1 will not find numbers buried in the middle of strings, such as the 1e+3 in "xis1e+3", but if the number occurs at the beginning of the token according to the parsing rules set by wchars and pchars, allownum=1 will continue the token in violation of those rules if that results in a valid number.

The override allowhex is similar and Stata specific. Stata (and Mata) provide a unique and useful way of writing hexadecimal floating-point numbers in a printable, short, and precise way: pi can be written 1.921fb54442d18X+001. Setting allowhex=1 allows such numbers.
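The effect of allownum can be sketched by parsing the same string twice, once with the override off and once with it on. The names pchars, s, t0, t1, r0, and r1 are illustrative, and an interactive Mata session is assumed.

```mata
mata:
pchars = ("=", "+", "-", "*", "/")
s = "y = x + 1e+13"

// allownum = 0: the + inside 1e+13 is treated as a parsing character.
t0 = tokeninit(" ", pchars, "", 0, 0)
tokenset(t0, s)
r0 = tokengetall(t0)

// allownum = 1: the number interpretation wins.
t1 = tokeninit(" ", pchars, "", 1, 0)
tokenset(t1, s)
r1 = tokengetall(t1)

r0
r1
end
```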

Setting the environment to parse on blanks with quote binding

Stata's default rule for parsing do-file arguments is "parse on blanks and bind on quotes". The settings for duplicating that behavior are

wchars = " "

pchars = ( "" )

qchars = ( `""""', `"`""'"')

allownum = 0

allowhex = 0

This behavior can be obtained by coding

t = tokeninit(" ", "", (`""""', `"`""'"'), 0, 0)

or by coding

t = tokeninit()

because in tokeninit() the arguments are optional and "parse on blank with quote binding" is the default.

With those settings, parsing `"first second "third fourth" fifth"' results in ("first", "second", `""third fourth""', "fifth").

This result is a little different from that of Stata because the third token includes the quote binding characters. Assume that the parsed string was obtained by coding

res = tokengetall(t)

The following code will remove the open and close quotes, should that be desirable.

for (i=1; i<=cols(res); i++) {
    if (substr(res[i], 1, 1)==`"""') {
        res[i] = substr(res[i], 2, strlen(res[i])-2)
    }
    else if (substr(res[i], 1, 2)=="`" + `"""') {
        res[i] = substr(res[i], 3, strlen(res[i])-4)
    }
}
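As a self-contained sketch, the quote-stripping step can be wrapped in a helper and applied to the example string. The function name unquote() and the variables dq, out, q, s, t, and res are hypothetical; char(34) and char(96) stand for the double quote and the left single quote.

```mata
mata:
// Hypothetical helper: return a copy of res with binding quotes removed.
string rowvector unquote(string rowvector res)
{
    real scalar       i
    string scalar     dq
    string rowvector  out

    dq  = char(34)
    out = res
    for (i=1; i<=cols(out); i++) {
        if (substr(out[i], 1, 1)==dq) {
            // simple double quotes: drop one character at each end
            out[i] = substr(out[i], 2, strlen(out[i])-2)
        }
        else if (substr(out[i], 1, 2)==char(96)+dq) {
            // compound double quotes: drop two characters at each end
            out[i] = substr(out[i], 3, strlen(out[i])-4)
        }
    }
    return(out)
}

q = char(34)
s = "first second " + q + "third fourth" + q + " fifth"
t = tokeninit()
tokenset(t, s)
res = unquote(tokengetall(t))
res
end
```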

Setting the environment to parse full Stata syntax

To parse full Stata syntax, the settings are

wchars = " "

pchars = ( "\", "~", "!", "=", ":", ";", ",", "?", "!", "@", "#", "==", "!=", ">=", "<=", "<", ">", "&", "|", "&&", "||", "+", "-", "++", "--", "*", "/", "^", "(", ")", "[", "]", "{", "}" )

qchars = ( `""""', `"`""'"', char(96)+char(39) )

allownum = 1

allowhex = 1

The above is a slight oversimplification. Stata is an interpretive language and Stata does not require users to type filenames in quotes, although Stata does allow it. Thus "\" is sometimes a parsing character and sometimes not, and the same is true of "/". As Stata parses a line from left to right, it will change pchars between two tokenget() calls when the next token could be or is known to be a filename. Sometimes Stata peeks ahead to decide which way to parse. You can do the same by using the tokenpchars() and tokenpeek() functions.

To obtain the above environment, code

t = tokeninitstata()

Setting the environment to parse tab-delimited files

The token*() functions can be used to parse lines from tab-delimited files. A tab-delimited file contains lines of the form

<field1><tab><field2><tab><field3>

The parsing environment variables are

wchars = ""

pchars = ( char(9) ) (i.e., tab)

qchars = ( "" )

allownum = 0

allowhex = 0

To set this environment, code

t = tokeninit("", char(9), "", 0, 0)

Say that you then parse the line

Farber, William<tab> 2201.00<tab>12

The results will be

("Farber, William", char(9), " 2201.00", char(9), "12")

If the line were

Farber, William<tab><tab>12

the result would be

("Farber, William", char(9), char(9), "12")
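The two lines above can be reproduced with a short sketch, assuming an interactive Mata session. The names tab, line1, line2, r1, and r2 are illustrative.

```mata
mata:
tab = char(9)
t = tokeninit("", tab, "", 0, 0)

// Three fields present; the leading blank of the second field is kept
// because blank is not a white-space character in this environment.
line1 = "Farber, William" + tab + " 2201.00" + tab + "12"
tokenset(t, line1)
r1 = tokengetall(t)

// Middle field missing: two tab tokens appear side by side.
line2 = "Farber, William" + tab + tab + "12"
tokenset(t, line2)
r2 = tokengetall(t)

cols(r1), cols(r2)
end
```

Because the tab characters are returned as tokens, the caller can count them to work out which field each remaining token belongs to.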

The tab-delimited format is not well defined when the missing fields occur at the end of the line. A line with the last field missing might be recorded

Farber, William<tab> 2201.00<tab>

or

Farber, William<tab> 2201.00

A line with the last two fields missing might be recorded

Farber, William<tab><tab>

or

Farber, William<tab>

or

Farber, William

The following program would correctly parse lines with missing fields regardless of how they are recorded:

string rowvector readtabbed(transmorphic t, real scalar n)
{
    real scalar       i
    string rowvector  res
    string scalar     token

    res = J(1, n, "")
    i = 1
    while ((token = tokenget(t))!="") {
        if (token==char(9)) i++
        else res[i] = token
    }
    return(res)
}

Function overview

The basic way to proceed is to initialize the parsing environment and store it in a variable,

t = tokeninit(...)

and then set the string s to be parsed,

tokenset(t, s)

and finally use tokenget() to obtain the tokens one at a time (tokenget() returns "" when the end of the line is reached), or obtain all the tokens at once using tokengetall(t). That is, either

while((token = tokenget(t)) != "") {
    ... process token ...
}

or

tokens = tokengetall(t)
for (i=1; i<=cols(tokens); i++) {
    ... process tokens[i] ...
}

After that, set the next string to be parsed,

tokenset(t, nextstring)

and repeat.
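The initialize, set, get, and repeat cycle might look as follows when several strings are parsed with one environment. This is a sketch only; counttokens() and its variable names are hypothetical, and the default tokeninit() environment is assumed.

```mata
mata:
// Hypothetical helper: number of tokens found on each line.
real colvector counttokens(string colvector lines)
{
    transmorphic    t
    real scalar     i
    real colvector  counts
    string scalar   s, token

    t = tokeninit()                    // initialize once
    counts = J(rows(lines), 1, 0)
    for (i=1; i<=rows(lines); i++) {
        s = lines[i]
        tokenset(t, s)                 // set the next string to parse
        while ((token = tokenget(t)) != "") {
            counts[i] = counts[i] + 1  // process one token
        }
    }
    return(counts)
}

lines = ("mpg weight displ" \ "a b" \ "")
counts = counttokens(lines)
counts
end
```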

tokeninit() and tokeninitstata()

tokeninit() and tokeninitstata() are alternatives. tokeninitstata() is generally unnecessary unless you are writing a fairly complicated function.

Whichever function you use, code

t = tokeninit(...)

or

t = tokeninitstata()

If you declare t, declare it transmorphic. t is in fact a structure containing all the details of your parsing environment, but that is purposely hidden from you so that you cannot accidentally modify the environment.

tokeninit() allows up to five arguments:

t = tokeninit(wchars, pchars, qchars, allownum, allowhex)

You may omit arguments from the end. If omitted, the default values of the arguments are

allowhex = 0

allownum = 0

qchars = ( `""""', `"`""'"')

pchars = ( "" )

wchars = " "

Notes

1. Concerning wchars:

a. wchars is a string scalar. The white-space characters appear one after the other in the string. The order in which the characters appear is irrelevant.

b. Specify wchars as " " to treat blank as white space.

c. Specify wchars as " "+char(9) to treat blank and tab as white space. Including tab is necessary only when strings to be parsed are obtained from a file; strings obtained from Stata already have the tab characters removed.

d. Any character can be treated as a white-space character, including letters.

e. Specify wchars as "" to specify that there are no white-space characters.

2. Concerning pchars:

a. pchars is a string rowvector. Each element of the vector is a separate parse character. The order in which the parse characters are specified is irrelevant.

b. Specify pchars as ("+", "-") to make + and - parse characters.

c. Parse characters may be character combinations such as ++ or >=. Character combinations may be up to four characters long.

d. Specify pchars as "" or J(1,0,"") to specify that there are no parse characters. It makes no difference which you specify, but you will realize that J(1,0,"") is more logically consistent if you think about it.

3. Concerning qchars:

a. qchars is a string rowvector. Each element of the vector contains the open followed by the close characters. The order in which sets of quote characters are specified is irrelevant.

b. Specify qchars as (`""""') to make " an open and close character.

c. Specify qchars as (`""""', `"`""'"') to make "" and `""' quote characters.

d. Individual quote characters can be up to two characters long.

e. Specify qchars as "" or J(1,0,"") to specify that there are no quote characters.

tokenset()

After tokeninit() or tokeninitstata(), you are not yet through with initialization. You must call tokenset(t, s) to specify the string scalar you wish to parse. You tokenset() one line, parse it, and if you have more lines, you tokenset() again and repeat the process. Often you will need to parse only one line. Perhaps you wish to write a program to parse the argument of a complicated option in a Stata ado-file. The structure is

program ...
    ...
    syntax ... [, ... MYoption(string) ...]
    mata: parseoption(`"`myoption'"')
    ...
end

mata:
void parseoption(string scalar option)
{
    transmorphic  t

    t = tokeninit(...)
    tokenset(t, option)
    ...
}
end

Notes

1. When you call tokenset(t, s), the contents of s are not stored. Instead, a pointer to s is stored. This approach saves memory and time, but it means that if you change s after setting it, you will change the subsequent behavior of the token*() functions.

2. Simply changing s is not sufficient to restart parsing. If you change s, you must call tokenset(t, s) again.
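To make the option-parsing structure concrete, here is a sketch of a Mata routine that could be called from an ado-file in the same way. The option format (name=value pairs separated by blanks), the function name parsepairs(), and the return code 198 for a malformed argument are all assumptions made for illustration.

```mata
mata:
// Hypothetical: split an argument of the form  name=value name=value ...
string matrix parsepairs(string scalar option)
{
    transmorphic   t
    string scalar  name, eq, value
    string matrix  pairs

    t = tokeninit(" ", "=")            // blank is white space, = is a parsing character
    tokenset(t, option)
    pairs = J(0, 2, "")
    while ((name = tokenget(t)) != "") {
        eq    = tokenget(t)
        value = tokenget(t)
        if (eq != "=" | value == "") {
            errprintf("option invalid near %s\n", name)
            exit(198)
        }
        pairs = pairs \ (name, value)
    }
    return(pairs)
}

arg = "level=95 method=exact"
pairs = parsepairs(arg)
pairs
end
```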

tokengetall()

You have two alternatives in how to process the tokens. You can parse the entire line into a row vector containing all the individual tokens by using tokengetall(),

tokens = tokengetall(t)

or you can use tokenget() to process the tokens one at a time, which is discussed in the next section.

Using tokengetall(), tokens[1] will be the first token, tokens[2] the second, and so on. There are, in total, cols(tokens) tokens. If the line was empty or contained only white-space characters, cols(tokens) will be 0.
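Because an empty line yields a 1 x 0 vector, code that indexes tokens[1] should check cols() first. A small sketch, in which firsttoken() and the variable names are hypothetical:

```mata
mata:
// Hypothetical helper: first token of a line, or an empty string if none.
string scalar firsttoken(string scalar line)
{
    transmorphic      t
    string rowvector  tokens

    t = tokeninit()
    tokenset(t, line)
    tokens = tokengetall(t)
    if (cols(tokens) == 0) return("")  // nothing but white space
    return(tokens[1])
}

blank = "   "
words = "mpg weight displ"
f0 = firsttoken(blank)
f1 = firsttoken(words)
end
```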

tokenget(), tokenpeek(), and tokenrest()

tokenget() returns the tokens one at a time and returns "" when the end of the line is reached. The basic loop for processing all the tokens in a line is

while ( (token = tokenget(t)) != "") { ... }

tokenpeek() allows you to peek ahead at the next token without actually getting it, so whatever is returned will be returned again by the next call to tokenget(). tokenpeek() is suitable only for obtaining the next token after tokenget(). Calling tokenpeek() twice in a row will not return the next two tokens; it will return the next token twice. To obtain the next two tokens, code

...
current = tokenget(t)        // get the current token
...
t2 = t                       // copy parse environment
next_1 = tokenget(t2)        // peek at next token
next_2 = tokenget(t2)        // peek at token after that
...
current = tokenget(t)        // get next token

If you declare t2, declare it transmorphic.
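The difference between tokenpeek() and the copy-the-environment approach can be sketched as follows, assuming an interactive Mata session. All variable names are illustrative.

```mata
mata:
t = tokeninit()
s = "alpha beta gamma"
tokenset(t, s)

current = tokenget(t)      // consume the first token

peek1 = tokenpeek(t)       // look at the next token
peek2 = tokenpeek(t)       // same token again; nothing was consumed

t2 = t                     // copy the parse environment
next_1 = tokenget(t2)      // token after current
next_2 = tokenget(t2)      // token after that

resumed = tokenget(t)      // t itself has not moved past current
end
```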

tokenrest() returns the unparsed portion of the tokenset() string. Assume that you have just gotten the first token by using tokenget(). tokenrest() would return the rest of the original string, following the first token, unparsed. tokenrest(t) returns substr(original_string, tokenoffset(t), .).

tokenoffset()

tokenoffset() is useful only when you are using the tokenget() rather than tokengetall() style of programming. Let the original string you tokenset() be "this is an example". Right after you have tokenset() this string, tokenoffset() is 1:

this is an example
|
tokenoffset() = 1

After getting the first token (say it is "this"), tokenoffset() is 5:

this is an example
    |
    tokenoffset() = 5

tokenoffset() is always located on the first character following the last character parsed.

The syntax of tokenoffset() is

tokenoffset(t)

and

tokenoffset(t, newoffset)

The first returns the current offset value. The second resets the parser's location within the string.
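A short sketch tying tokenoffset() to tokenrest(), assuming an interactive Mata session and the default environment. The variable names are illustrative.

```mata
mata:
t = tokeninit()
s = "this is an example"
tokenset(t, s)

start = tokenoffset(t)     // offset right after tokenset()
first = tokenget(t)
off   = tokenoffset(t)     // first character following the token just parsed
rest  = tokenrest(t)       // unparsed remainder of s

tokenoffset(t, 1)          // move the parser back to the beginning
again = tokenget(t)

start, off
rest
end
```

Resetting the offset is one way to reparse part of a string after changing the parsing characters.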

tokenwchars(), tokenpchars(), and tokenqchars()

tokenwchars(), tokenpchars(), and tokenqchars() allow resetting the current wchars, pchars, and qchars. As with tokenoffset(), they come in two syntaxes.

With one argument, t, they return the current value of the setting. With two arguments, t and newvalue, they reset the value.

Resetting in the midst of parsing is an advanced issue. The most useful of these functions is tokenpchars(), since for interactive grammars, it is sometimes necessary to switch on and off a certain parsing character such as /, which in one context means division and in another is a file separator.
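The following sketch switches the / parsing character off while a filename is read and then restores it. The command line, the single-element pchars, and all variable names are hypothetical; a real grammar would carry a longer pchars vector.

```mata
mata:
t = tokeninit(" ", "/")
s = "use dir/file"
tokenset(t, s)

cmd   = tokenget(t)               // the command word
saved = tokenpchars(t)            // remember the current parsing characters
tokenpchars(t, J(1, 0, ""))       // no parsing characters while reading a filename
fname = tokenget(t)
tokenpchars(t, saved)             // restore for the rest of the grammar

// With / restored, it separates tokens again.
s2 = "a/b"
tokenset(t, s2)
expr = tokengetall(t)
end
```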

tokenallownum() and tokenallowhex()

These two functions allow obtaining the current values of allownum and allowhex and resetting them.
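A minimal sketch of the getter and setter forms, assuming an interactive Mata session. The environment, the string, and the variable names are illustrative.

```mata
mata:
t = tokeninit(" ", "+")

num0 = tokenallownum(t)    // default set by tokeninit()
hex0 = tokenallowhex(t)

s = "1e+5"
tokenset(t, s)
r0 = tokengetall(t)        // + splits the string while allownum is 0

tokenallownum(t, 1)        // turn the numeric override on
tokenallowhex(t, 1)        // and the hexadecimal override
num1 = tokenallownum(t)
hex1 = tokenallowhex(t)

tokenset(t, s)             // parse the same string again
r1 = tokengetall(t)
end
```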

Conformability

tokeninit(wchars, pchars, qchars, allownum, allowhex):
       wchars:  1 x 1    (optional)
       pchars:  1 x c_p  (optional)
       qchars:  1 x c_q  (optional)
     allownum:  1 x 1    (optional)
     allowhex:  1 x 1    (optional)
       result:  transmorphic

tokeninitstata():
       result:  transmorphic

tokenset(t, s):
            t:  transmorphic
            s:  1 x 1
       result:  void

tokengetall(t):
            t:  transmorphic
       result:  1 x k

tokenget(t), tokenpeek(t), tokenrest(t):
            t:  transmorphic
       result:  1 x 1

tokenoffset(t), tokenwchars(t), tokenallownum(t), tokenallowhex(t):
            t:  transmorphic
       result:  1 x 1

tokenoffset(t, newvalue), tokenwchars(t, newvalue), tokenallownum(t, newvalue), tokenallowhex(t, newvalue):
            t:  transmorphic
     newvalue:  1 x 1
       result:  void

tokenpchars(t), tokenqchars(t):
            t:  transmorphic
       result:  1 x c

tokenpchars(t, newvalue), tokenqchars(t, newvalue):
            t:  transmorphic
     newvalue:  1 x c
       result:  void

Diagnostics

None.

Source code

tokenget.mata for all functions.

Also see

Manual: [M-5] tokenget()

Help: [M-5] tokens(); [M-4] programming, [M-4] string


© Copyright 1996–2009 StataCorp LP