Dear Friedrich, Phil and Nick:

Thank you all very much for your help!

Mingfeng

On Wed, Nov 5, 2008 at 7:56 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> I think Phil is correct so far as official Stata is concerned.
>
> But there are -egen- functions -noccur()- and -nss()- in -egenmore- from
> SSC.
>
> The help explains:
>
> ===================
> noccur(strvar) , string(substr) creates a variable containing the number
> of occurrences of the string substr in string variable strvar. Note
> that occurrences must be disjoint (non-overlapping): thus there are two
> occurrences of "aa" within "aaaaa". (Stata 7 required.)
>
> nss(strvar) , find(substr) [ insensitive ] returns the number of
> occurrences of substr within the string variable strvar. insensitive
> makes counting case-insensitive. (Stata 6 required.)
>
> The inclusion of noccur() and nss(), two almost identical functions, was
> an act of sheer inadvertence by the maintainer.
> =================
>
> These functions both predate regular expression syntax in Stata, but I
> don't think that latter helps much, if at all, with this particular
> problem. It's certainly not essential, as Phil's solution also
> indicates.
>
> Use -ssc inst egenmore- to install, and then -help egenmore-.
>
> Nick
> n.j.cox@durham.ac.uk
>
> Phil Schumm
>
> No, I don't believe so. There are two ways to approach this: (1)
> compute the number of occurrences for each observation and then loop
> over observations, or (2) proceed one occurrence at a time, handling
> all observations at once. The first approach would in general be more
> efficient if the variance in the number of occurrences were large;
> note that it would need to be done in Mata for it to scale well in the
> number of observations.
> However, the fact that string variables can
> only be 244 characters long imposes an upper bound on the maximum
> number of occurrences (and therefore on the variance), and, in many
> situations, the effective upper bound may be pretty small (i.e., at
> most only a couple of occurrences per observation). In such cases,
> the second approach would be adequate, e.g.,
>
>     tempvar t1 t2
>     gen `t1' = X
>     gen `t2' = X
>     gen Y = 0
>     qui while 1 {
>         replace `t1' = subinstr(`t1', "john", "", 1)
>         cap ass `t1'==`t2'
>         if _rc {
>             replace Y = Y + (`t1'!=`t2')
>             replace `t2' = `t1'
>         }
>         else continue, br
>     }
>
> where -regexr()- can be substituted for -subinstr()- if additional
> flexibility in matching is required.
>
> On Nov 4, 2008, at 8:42 PM, Mingfeng Lin wrote:
>
>> I looked through the list of string functions but couldn't find one
>> that fits the bill. Suppose I have a string variable X, and I would
>> like to generate a new numeric variable Y containing the number of
>> times a certain string appeared in X. For instance
>>
>> X = "johnabc johncd"
>>
>> If I'd like to find the number of times "john" shows up in X, I hope
>> to obtain Y = 2
>>
>> Is there a function in Stata to do this?
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
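For comparison, here is a minimal sketch of a loop-free way to count fixed (non-regex) substrings in official Stata. It is not the method proposed by Phil or Nick above. It relies on -subinstr()- removing every occurrence when its last argument is missing (.), so the number of characters removed, divided by the length of the search string, gives the count. The example data are hypothetical; X and Y follow the names used in the original question. Occurrences are counted without overlap, in line with the -noccur()- help quoted above.

```stata
* Hypothetical example data; X and Y follow the names used in the thread
clear
input str30 X
"johnabc johncd"
"no match here"
"johnjohnjohn"
end

* Remove every occurrence of "john" (the "." means "all occurrences"),
* then divide the number of characters removed by the length of "john"
gen Y = (length(X) - length(subinstr(X, "john", "", .))) / length("john")

list X Y
```

For the first observation this gives Y = 2, as requested in the original question. The approach assumes an exact, case-sensitive match; for pattern-based matching, the -regexr()- loop above would still be needed.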

