Re: st: string function

From	Nick Cox <[email protected]>
To	[email protected]
Subject	Re: st: string function
Date	Wed, 24 Aug 2011 12:19:29 +0100

As stressed in that SJ Tip, for substrings longer than one character
you need to divide by the length of the substring:

(length("abcdaf") - length(subinstr("abcdaf", "abc", "", .))) / length("abc")
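A minimal sketch of the same calculation applied to a variable, using only official string functions. The one-observation dataset and the names s, n_a and n_abc are made up for illustration.

```stata
* Hypothetical one-observation dataset, for illustration only
clear
set obs 1
generate str6 s = "abcdaf"

* One-character substring: the difference in lengths is already the count
generate n_a = length(s) - length(subinstr(s, "a", "", .))

* Longer substring: divide the difference by the length of the substring
generate n_abc = (length(s) - length(subinstr(s, "abc", "", .))) / length("abc")

list s n_a n_abc
```

Without the division, the second expression would report the number of characters removed rather than the number of occurrences.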

See also -moss- (SSC)

Title

    moss -- Find multiple occurrences of substrings


Syntax

        moss strvar [if] [in] , match(["]pattern["]) [ regex prefix(prefix)
                 suffix(suffix) maximum(#) compact ]


Description

    moss finds occurrences of substrings matching a pattern in a given string
    variable. Depending on what is sought and what is found, variables are
    created giving the count of occurrences (always); the positions of
    occurrences (whenever any are found); and the exact substrings found (when
    a regular expression defines a subexpression to be returned). The default
    names are respectively _count, _pos1 up, and _match1 up.


Remarks

    By default, moss finds repeated occurrences of the string specified in
    match() using Stata's strpos() string function (in older versions of
    Stata, strpos() was named index()). A _count variable is created to
    indicate the number of occurrences per observation. The position, per
    observation, of the first instance will be recorded in _pos1, the second
    in _pos2, and so on.

    With the regex option, moss can be used to repeatedly find more complex
    patterns of text. The specification of the search pattern must follow
    regexm() syntax and include one and only one subexpression to be matched.
    When using regular expressions, subexpressions are identified using
    parentheses.  For example, match("AMC ([A-Za-z]+)") will match
    "AMC Concord", "AMC Pacer", and "AMC Spirit" but moss will put in _match1
    the matched subexpressions "Concord", "Pacer", and "Spirit".

    moss follows the principle that occurrences must be disjoint and may not
    overlap. That is, it finds just one occurrence of "ana" in "banana", not
    two.
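
An aside, not part of the help file: the official-functions count appears to follow the same non-overlapping convention, because subinstr() removes occurrences from left to right. A quick check, offered as a sketch:

```stata
* subinstr() strips the first "ana" from "banana", leaving "bna";
* the remaining characters no longer contain "ana", so the count is (6 - 3) / 3
display (length("banana") - length(subinstr("banana", "ana", "", .))) / length("ana")
```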


Options

    match() is required and the pattern can be either literal text or a
        regular expression.

    regex specifies that the pattern is to be interpreted as a regular
        expression. Such a pattern must contain precisely one subexpression
        to be extracted. See Examples.

    prefix() specifies an alternative prefix for new variable names to be
        created by moss. Such a prefix must start either with a letter or
        with an underscore.

    suffix() specifies a suffix for new variable names to be created.

    prefix() and suffix() may not be combined.

    maximum() specifies an upper limit to the number of position and match
        variables to be created. That is, specify max(3) if you want to see
        details of at most the first 3 occurrences of your pattern.

    compact specifies that the most compact storage types possible be used
        during calculations. Specifying this option may slow moss down.


Examples

    . moss make, match(",")

    . moss make, match("([0-9]+)") regex

    . moss history, match("(X+)") regex

    . moss s, match("([^ ]+)") prefix(s_) regex
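
An aside, not part of the help file: a sketch of the regular-expression example from the Remarks, run on the auto dataset that ships with Stata. It assumes an internet connection for the one-time install from SSC and relies on the default variable names documented above.

```stata
* moss is user-written; install it once from SSC
ssc install moss

* auto.dta ships with Stata and has a string variable named make
sysuse auto, clear

* Exactly one subexpression in parentheses, as the regex option requires
moss make, match("AMC ([A-Za-z]+)") regex

* Default names from the help: _count, _pos1 up, and _match1 up
list make _count _pos1 _match1 if _count > 0
```

The if qualifier on list restricts the listing to observations where the pattern was found at least once.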



Authors

    Robert Picard
    [email protected]

    Nicholas J. Cox, Durham University
    [email protected]


Acknowledgments

    A question on Statalist from Rebecca A. Pope was the stimulus for writing
    this program.



On Wed, Aug 24, 2011 at 11:59 AM, Nick Cox <[email protected]> wrote:
> Solutions to all these could be written as -egen- functions or Mata functions.
>
> Here I focus on "official Stata only" solutions.
>
> First question is discussed in
>
> Nicholas J. Cox
> Stata tip 98: Counting substrings within strings
> The Stata Journal 11(2): 318-320
>
> length("abcdaf") - length(subinstr("abcdaf", "a", "", .))
>
> Last two questions
>
> any of "a", "b", "c"
>
> max(strpos("abcdaf","a"), strpos("abcdaf", "b"), strpos("abcdaf", "c")) > 0
>
> all of "a", "b", "c"
>
> min(strpos("abcdaf","a"), strpos("abcdaf", "b"), strpos("abcdaf", "c")) > 0
>
> If you had a long list of candidates, I would do something like this:
>
> gen found = 0
>
> qui foreach letter in s o m e t h i n g {
>       replace found = max(found, strpos(strvar, "`letter'") > 0)
> }
>
> where for "all" substitute "min" for "max" and initialize found to 1
> rather than 0, because min() can never rise above its starting value.
>
> The mapping max <-> any, min <-> all is discussed in
> http://www.stata.com/support/faqs/data/anyall.html
>
> Nick
>
> 2011/8/24 Grace Jessie <[email protected]>:
>
>> How to count how many times a substring appears in a string?
>> For example,
>> function("abcdaf","a")=2
>>
>> And, how to check if a string variable has certain substrings?
>> With regard to this, I want to ask about two functions.
>> For example,
>> function("abcdaf","a","b","c")
>> One should return 1 if a or b or c is included in "abcdaf";
>> the other should return 1 if a, b and c are all included in "abcdaf".
>> Could anyone tell me the correct functions for those above?
>
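
A sketch that runs the "any" and "all" versions of the quoted loop side by side. The four-observation dataset and the names any_abc and all_abc are hypothetical; the point to note is that the two indicators need different starting values.

```stata
* Hypothetical data, for illustration only
clear
input str10 strvar
"abcdaf"
"xyz"
"cab"
"ab"
end

* any of "a", "b", "c": start at 0 and take the running maximum
generate byte any_abc = 0
* all of "a", "b", "c": start at 1 and take the running minimum
generate byte all_abc = 1

quietly foreach letter in a b c {
    replace any_abc = max(any_abc, strpos(strvar, "`letter'") > 0)
    replace all_abc = min(all_abc, strpos(strvar, "`letter'") > 0)
}

list
```

Here "ab" is flagged by any_abc but not by all_abc, since it lacks "c", while "xyz" is flagged by neither.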
