Statalist



RE: st: counting the number of times a string appears in a string variable?


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: counting the number of times a string appears in a string variable?
Date   Wed, 5 Nov 2008 12:56:33 -0000

I think Phil is correct so far as official Stata is concerned. 

But there are -egen- functions -noccur()- and -nss()- in -egenmore- from
SSC. 

The help explains: 

===================
noccur(strvar) , string(substr) creates a variable containing the number
of occurrences of the string substr in string variable strvar.  Note
that occurrences must be disjoint (non-overlapping): thus there are two
occurrences of "aa" within "aaaaa". (Stata 7 required.)

nss(strvar) , find(substr) [ insensitive ] returns the number of
occurrences of substr within the string variable strvar.  insensitive
makes counting case-insensitive. (Stata 6 required.)

The inclusion of noccur() and nss(), two almost identical functions, was
an act of sheer inadvertence by the maintainer.
=================

These functions both predate regular expression syntax in Stata, but I
don't think the latter helps much, if at all, with this particular
problem. It's certainly not essential, as Phil's solution also
indicates. 

Use -ssc inst egenmore- to install, and then -help egenmore-. 
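
As a minimal sketch of what that looks like in practice, using the option names from the help text quoted above: the toy dataset and the new variable names (n_john, n_john2, n_any) are made up for illustration, and the install step assumes internet access to SSC.

```stata
* One-time install from SSC
ssc install egenmore

* Toy data, including the example string from the original question
clear
input str20 X
"johnabc johncd"
"JOHN and john"
"aaaaa"
end

* Count non-overlapping occurrences of "john" in X
egen n_john  = noccur(X), string("john")
egen n_john2 = nss(X), find("john")

* Same count, ignoring case
egen n_any   = nss(X), find("john") insensitive

list
```

For the first observation both functions should give 2, matching the Y = 2 asked for in the original question.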

Nick 
n.j.cox@durham.ac.uk 

Phil Schumm wrote:

No, I don't believe so.  There are two ways to approach this: (1)  
compute the number of occurrences for each observation and then loop  
over observations, or (2) proceed one occurrence at a time, handling  
all observations at once.  The first approach would in general be more  
efficient if the variance in the number of occurrences were large;  
note that it would need to be done in Mata for it to scale well in the  
number of observations.  However, because string variables can be at
most 244 characters long, the number of occurrences (and therefore its
variance) is bounded above, and, in many
situations, the effective upper bound may be pretty small (i.e., at  
most only a couple of occurrences per observation).  In such cases,  
the second approach would be adequate, e.g.,

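* Working copies of X: t1 is stripped progressively, t2 holds the previous pass
tempvar t1 t2
gen `t1' = X
gen `t2' = X
gen Y = 0
quietly while 1 {
     * Remove at most one occurrence of "john" from each observation
     replace `t1' = subinstr(`t1', "john", "", 1)
     * assert fails (_rc != 0) if any observation changed on this pass
     capture assert `t1'==`t2'
     if _rc {
         replace Y = Y + (`t1'!=`t2')
         replace `t2' = `t1'
     }
     else continue, break
}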

where -regexr()- can be substituted for -subinstr()- if additional  
flexibility in matching is required.
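
For a fixed literal substring, a loop-free alternative is to compare string lengths before and after deleting every occurrence. This is a different technique from the loop above, offered only as a sketch; the toy data, the local macro s, and the variables Y and Z are illustrative. It relies on -subinstr()- replacing all occurrences when its last argument is missing (.).

```stata
* Toy data, including the example string from the original question
clear
input str20 X
"johnabc johncd"
"no match here"
"aaaaa"
end

* Characters removed, divided by the length of the substring,
* gives the number of non-overlapping occurrences
local s "john"
gen Y = (length(X) - length(subinstr(X, "`s'", "", .))) / length("`s'")

* Same idea for "aa": "aaaaa" contains two disjoint occurrences
gen Z = (length(X) - length(subinstr(X, "aa", "", .))) / length("aa")

list
```

Here Y is 2 for the first observation and 0 for the others, and Z is 2 for "aaaaa". One difference worth noting: the loop above removes one occurrence per pass, so a string such as "jojohnhn" would be counted twice, because deleting the inner "john" forms a new one. The length-difference version should count it once, assuming -subinstr()- does not rescan its own output.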

On Nov 4, 2008, at 8:42 PM, Mingfeng Lin wrote:

> I looked through the list of string functions but couldn't find one  
> that fits the bill.  Suppose I have a string variable X, and I would  
> like to generate a new numeric variable Y containing the number of  
> times a certain string appeared in X.  For instance
>
> X = "johnabc johncd"
>
> If I'd like to find the number of times "john" shows up in X, I hope  
> to obtain Y = 2
>
> Is there a function in Stata to do this?

