Statalist



Re: st: counting the number of times a string appears in a string variable?


From   "Mingfeng Lin" <[email protected]>
To   [email protected]
Subject   Re: st: counting the number of times a string appears in a string variable?
Date   Wed, 5 Nov 2008 08:34:37 -0500

Dear Friedrich, Phil and Nick:

Thank you all very much for your help!

Mingfeng

On Wed, Nov 5, 2008 at 7:56 AM, Nick Cox <[email protected]> wrote:
> I think Phil is correct so far as official Stata is concerned.
>
> But there are -egen- functions -noccur()- and -nss()- in -egenmore- from
> SSC.
>
> The help explains:
>
> ===================
> noccur(strvar) , string(substr) creates a variable containing the number
> of occurrences of the string substr in string variable strvar.  Note
> that occurrences must be disjoint (non-overlapping): thus there are two
> occurrences of "aa" within "aaaaa". (Stata 7 required.)
>
> nss(strvar) , find(substr) [ insensitive ] returns the number of
> occurrences of substr within the string variable strvar.  insensitive
> makes counting case-insensitive. (Stata 6 required.)
>
> The inclusion of noccur() and nss(), two almost identical functions, was
> an act of sheer inadvertence by the maintainer.
> =================
>
> These functions both predate regular expression syntax in Stata, but I
> don't think the latter helps much, if at all, with this particular
> problem. It's certainly not essential, as Phil's solution also
> indicates.
>
> Use -ssc inst egenmore- to install, and then -help egenmore-.
>
> Nick
> [email protected]
>
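As a rough illustration of the two -egenmore- functions described in the help text quoted above, the sketch below follows the syntax given there. It assumes -egenmore- has already been installed from SSC; the toy data and the variable names n_occ, n_ss and n_ssi are made up for the example.

```stata
* Assumes egenmore is installed: ssc install egenmore
clear
input str30 X
"johnabc johncd"
"John and john"
"no match here"
end

* noccur(): number of disjoint occurrences of "john" (case-sensitive)
egen n_occ = noccur(X), string(john)

* nss(): the same count, with an option to ignore case
egen n_ss  = nss(X), find(john)
egen n_ssi = nss(X), find(john) insensitive

list X n_occ n_ss n_ssi
```

With the insensitive option, "John" and "john" should both be counted in the second observation.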
> Phil Schumm wrote:
>
> No, I don't believe so.  There are two ways to approach this: (1)
> compute the number of occurrences for each observation and then loop
> over observations, or (2) proceed one occurrence at a time, handling
> all observations at once.  The first approach would in general be more
> efficient if the variance in the number of occurrences were large;
> note that it would need to be done in Mata for it to scale well in the
> number of observations.  However, the fact that string variables can
> only be 244 characters long imposes an upper bound on the maximum
> number of occurrences (and therefore on the variance), and, in many
> situations, the effective upper bound may be pretty small (i.e., at
> most only a couple of occurrences per observation).  In such cases,
> the second approach would be adequate, e.g.,
>
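> tempvar t1 t2
> * `t1' is a working copy; `t2' holds its state before the latest removal
> gen `t1' = X
> gen `t2' = X
> gen Y = 0
> qui while 1 {
>     * strip the first remaining "john" from every observation
>     replace `t1' = subinstr(`t1', "john", "", 1)
>     * the assert fails (_rc != 0) if any observation changed
>     capture assert `t1'==`t2'
>     if _rc {
>         * add 1 to Y where an occurrence was just removed
>         replace Y = Y + (`t1'!=`t2')
>         replace `t2' = `t1'
>     }
>     else continue, break
> }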
>
> where -regexr()- can be substituted for -subinstr()- if additional
> flexibility in matching is required.
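A possible one-pass alternative, which is not part of the messages in this thread: because -subinstr()- with a missing value as its last argument removes every non-overlapping occurrence, the drop in string length divided by the length of the search string gives the count. This is only a sketch with made-up data; the local macro s is a name chosen for the example.

```stata
clear
input str30 X
"johnabc johncd"
"john"
"no match here"
end

local s "john"

* Removing all occurrences shortens X by (count * length of substring),
* so the length difference divided by that length is the count
gen Y = (length(X) - length(subinstr(X, "`s'", "", .))) / length("`s'")

list X Y
```

Like noccur(), this counts disjoint occurrences only, and it is case-sensitive unless both X and the search string are passed through -lower()- first.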
>
> On Nov 4, 2008, at 8:42 PM, Mingfeng Lin wrote:
>
>> I looked through the list of string functions but couldn't find one
>> that fits the bill.  Suppose I have a string variable X, and I would
>> like to generate a new numeric variable Y containing the number of
>> times a certain string appeared in X.  For instance
>>
>> X = "johnabc johncd"
>>
>> If I'd like to find the number of times "john" shows up in X, I hope
>> to obtain Y = 2
>>
>> Is there a function in Stata to do this?
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>


