The Stata listserver

RE: st: string function


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: string function
Date   Fri, 11 Oct 2002 16:57:22 +0100

Sidney Atwood
>
> /*  a snippet that will count the number of
>     occurrences of characters in strings
>
>     including the generation of lower case letters
>     for demonstration
>     SSA 10/11/02
>
> */
>
>
> /* we're just going to look for the first five
>    letters
> */
>
> #delimit ;
>
> set obs 100;
>
> gen str10 avar = "";
>
> for num 1/10:replace avar = avar + char(int(uniform() * 6 + 97))
>     if X > 0 ;
>
> /* so now we have strings of the first 5 letters
>    the line below counts them
>
>    note if the strings are exactly the same length
>    it is unnecessary to test length
> */
> /* for clarity output vars are initialized in a
>    separate line
> */
> for LETTER in any a b c d e:gen int LETTERcount = 0 ;
>
>
> for LETTER in any a b c d e:
>   for COLUMN in num 1/10:
>     replace LETTERcount = LETTERcount + 1
>       if substr(avar,COLUMN,1) == "LETTER" &
>          COLUMN <= length(avar)
> ;
>
> /* all done, look at the results */
>
> for var *count:tab X;
>
> /* This kind of thing can be reversed as well if you
>    need to concatenate variables
> */
>

Sidney's very clear code shows how to use two nested -for- loops to
count the occurrences of several letters in a string variable.
In fact his code would keep the same form if
the counts were not of letters but of arbitrary substrings.
Note that initialisation of all count variables to 0 is required.

A relevant comparison is with Nick Winter's code posted yesterday.
Here I use not -for-, but -foreach-:

foreach l in a b c d e {
	egen `l'count = noccur(avar), s(`l')
}

Of course, almost all the hard work is off stage, as -noccur()- is
a tool invented for this purpose, whereas Sidney's code was
developed from first principles, so this is not a strictly
like-for-like comparison.
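
For anyone without -noccur()- to hand, a count of this kind can also
be sketched from first principles with built-in string functions.
This is only an illustration, and I do not claim it is how -noccur()-
is implemented. The idea is that removing every copy of a substring
with -subinstr()- shortens the string by (number of occurrences) times
(length of the substring), so the counts are of non-overlapping
occurrences. The three-observation dataset is made up purely so that
the example runs on its own; the names avar and *count follow
Sidney's code.

```stata
* sketch only: a tiny made-up dataset so that the example is self-contained
clear
set obs 3
gen str10 avar = "abcabcaaee" in 1
replace avar = "eeeee" in 2
replace avar = "fff" in 3

* deleting every copy of a letter shortens the string by
* (number of occurrences) x (length of the letter)
foreach l in a b c d e {
	gen int `l'count = (length(avar) - length(subinstr(avar, "`l'", "", .))) / length("`l'")
}

list
```

Dividing by the length of the substring is redundant for single
letters, but it means the same line would serve for longer substrings.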

Three further, and more general, comments:

1. For interactive use in Stata versions up to 6,
the -for- approach has attractions. However, Stata 7
introduced the -foreach- structure which is on the whole a much
better tool, and also -forvalues-.

2. -for- is implemented as an ado-file: there are various implications
of this, of which one is that running -for- imposes considerable
interpretive overhead, as Stata has to interpret hundreds
of lines of its own code. This holds a fortiori for nested -for-s.
Although this is clearly _not_ suggested by Sidney, wrapping this
up in a program to be used repeatedly would be very inefficient
compared with Nick Winter's code.

3. Nesting -for-s is not guaranteed to work. Sometimes it does
and sometimes it doesn't, so be warned. Also, -for- does not
extend gracefully to very complicated problems.
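
As a sketch of what comments 1 and 3 imply, Sidney's nested -for-
could be rewritten with -foreach- and -forvalues- (Stata 7 or later).
The tiny dataset is made up purely so that the example runs on its
own; the variable names follow Sidney's code. No test on length(avar)
is needed here, because -substr()- returns an empty string once the
position is beyond the end of the string.

```stata
* sketch only: a tiny made-up dataset so that the example is self-contained
clear
set obs 3
gen str10 avar = "abcabcaaee" in 1
replace avar = "eeeee" in 2
replace avar = "fff" in 3

* outer loop over letters, inner loop over character positions
foreach l in a b c d e {
	* each count variable must start at 0
	gen int `l'count = 0
	forvalues j = 1/10 {
		quietly replace `l'count = `l'count + 1 if substr(avar, `j', 1) == "`l'"
	}
}

list
```

Initialising each count inside the outer loop keeps the two steps
together, and nesting -foreach- and -forvalues- in this way does not
depend on -for- at all.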

Further material on these matters can be located by

. findit foreach

Nick Winter's -noccur()- will be added to -egenmore-
on SSC very shortly.

Nick
[email protected]



