Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: string function
Date: Wed, 24 Aug 2011 12:19:29 +0100

As stressed in that SJ Tip: for substrings longer than one character,
you need to divide

    (length("abcdaf") - length(subinstr("abcdaf", "abc", "", .))) / length("abc")

See also -moss- (SSC)

Title

    moss -- Find multiple occurrences of substrings

Syntax

    moss strvar [if] [in], match(["]pattern["]) [ regex prefix(prefix) suffix(suffix) maximum(#) compact ]

Description

    moss finds occurrences of substrings matching a pattern in a given
    string variable. Depending on what is sought and what is found,
    variables are created giving the count of occurrences (always); the
    positions of occurrences (whenever any are found); and the exact
    substrings found (when a regular expression defines a subexpression
    to be returned). The default names are respectively _count, _pos1 up,
    and _match1 up.

Remarks

    By default, moss finds repeated occurrences of the string specified
    in match() using Stata's strpos() string function (in older versions
    of Stata, strpos() was named index()). A _count variable is created
    to indicate the number of occurrences per observation. The position,
    per observation, of the first instance will be recorded in _pos1, the
    second in _pos2, and so on.

    With the regex option, moss can be used to repeatedly find more
    complex patterns of text. The specification of the search pattern
    must follow regexm() syntax and include one and only one
    subexpression to be matched. When using regular expressions,
    subexpressions are identified using parentheses. For example,
    match("AMC ([A-Za-z]+)") will match "AMC Concord", "AMC Pacer", and
    "AMC AMC Spirit" but moss will put in _match1 the matched
    subexpressions "Concord", "Pacer", and "AMC Spirit".

    moss follows the principle that occurrences must be disjoint and may
    not overlap. That is, it finds just one occurrence of "ana" in
    "banana", not two.

Options

    match() is required and the pattern can be either literal text or a
    regular expression.

    regex specifies that the pattern is to be interpreted as a regular
    expression. Such a pattern must contain precisely one subexpression
    to be extracted. See Examples.

    prefix() specifies an alternative prefix for new variable names to be
    created by moss. Such a prefix must start either with a letter or
    with an underscore.

    suffix() specifies a suffix for new variable names to be created.
    prefix() and suffix() may not be combined.

    maximum() specifies an upper limit to the number of position and
    match variables to be created. That is, specify max(3) if you want to
    see details of at most the first 3 occurrences of your pattern.

    compact specifies that the most compact storage types possible be
    used during calculations. Specifying this option may slow moss down.

Examples

    . moss make, match(",")
    . moss make, match("([0-9]+)") regex
    . moss history, match("(X+)") regex
    . moss s, match("([^ ]+)") prefix(s_) regex

Authors

    Robert Picard
    picard@netbox.com

    Nicholas J. Cox, Durham University
    n.j.cox@durham.ac.uk

Acknowledgments

    A question on Statalist from Rebecca A. Pope was the stimulus for
    writing this program.

On Wed, Aug 24, 2011 at 11:59 AM, Nick Cox <njcoxstata@gmail.com> wrote:
> Solutions to all these could be written as -egen- functions or Mata functions.
>
> Here I focus on "official Stata only" solutions.
>
> First question is discussed in
>
> Nicholas J. Cox
> Stata tip 98: Counting substrings within strings
> The Stata Journal 11(2): 318-320
>
> length("abcdaf") - length(subinstr("abcdaf", "a", "", .))
>
> Last two questions
>
> any of "a", "b", "c"
>
> max(strpos("abcdaf","a"), strpos("abcdaf", "b"), strpos("abcdaf", "c")) > 0
>
> all of "a", "b", "c"
>
> min(strpos("abcdaf","a"), strpos("abcdaf", "b"), strpos("abcdaf", "c")) > 0
>
> If you had a long list of candidates, I would do something like this:
>
> gen found = 0
>
> qui foreach letter in s o m e t h i n g {
>     replace found = max(found, strpos(strvar, "`letter'") > 0)
> }
>
> where for "max" substitute "min" as needed (and in that case start
> from found = 1 rather than 0, as min(0, ...) can never exceed 0).
>
> The mapping max <-> any, min <-> all is discussed in
> http://www.stata.com/support/faqs/data/anyall.html
>
> Nick
>
> 2011/8/24 Grace Jessie <gracejessie@hotmail.com>:
>
>> How to count how many times a substring appears in a string?
>> For example,
>> function("abcdaf","a")=2
>>
>> And, how to check if a string variable has certain substrings?
>> With regard to this, I want to ask about two functions.
>> For example,
>> function("abcdaf","a","b","c")
>> One of what I want to do is to return 1 if a or b or c is included in "abcdaf";
>> the other is to return 1 if a, b and c are included in "abcdaf".
>> Could anyone tell me the correct functions for those above?
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
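
The expressions in this thread can be tried together on a small toy dataset. What follows is only a sketch: the three strings and the variable names n_a, n_abc, any_abc, all_abc and found_all are made up for illustration and appear nowhere in the original posts. It uses official Stata functions only (length(), subinstr(), strpos(), max(), min()).

```stata
* Toy data: hypothetical strings chosen for illustration
clear
input str6 strvar
"abcdaf"
"banana"
"abcabc"
end

* Count of a one-character substring: length lost when it is removed
gen n_a = length(strvar) - length(subinstr(strvar, "a", "", .))

* Count of a longer substring: divide the lost length by the substring length
gen n_abc = (length(strvar) - length(subinstr(strvar, "abc", "", .))) / length("abc")

* 1 if any of "a", "b", "c" occurs; 1 if all of them occur
gen byte any_abc = max(strpos(strvar, "a"), strpos(strvar, "b"), strpos(strvar, "c")) > 0
gen byte all_abc = min(strpos(strvar, "a"), strpos(strvar, "b"), strpos(strvar, "c")) > 0

* Loop form of the "all" test: start from 1, since min(0, ...) stays 0
gen byte found_all = 1
quietly foreach letter in a b c {
    replace found_all = min(found_all, strpos(strvar, "`letter'") > 0)
}

* The loop and the one-line expression should agree
assert found_all == all_abc

list
```

For "banana" this should give n_a = 3, n_abc = 0, any_abc = 1 and all_abc = 0, because "c" never occurs; for "abcabc" it should give n_abc = 2. Note that subinstr() removes non-overlapping occurrences, so the division formula counts "ana" once in "banana", in line with the disjoint-occurrences principle described for moss.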

References:

    st: string function
        From: Grace Jessie <gracejessie@hotmail.com>

    Re: st: string function
        From: Nick Cox <njcoxstata@gmail.com>
