
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: counting the number of times a string appears in a string variable?
Date: Wed, 5 Nov 2008 12:56:33 -0000

I think Phil is correct so far as official Stata is concerned. But there are -egen- functions -noccur()- and -nss()- in -egenmore- from SSC. The help explains:

===================
noccur(strvar) , string(substr) creates a variable containing the number of occurrences of the string substr in string variable strvar. Note that occurrences must be disjoint (non-overlapping): thus there are two occurrences of "aa" within "aaaaa". (Stata 7 required.)

nss(strvar) , find(substr) [ insensitive ] returns the number of occurrences of substr within the string variable strvar. insensitive makes counting case-insensitive. (Stata 6 required.)

The inclusion of noccur() and nss(), two almost identical functions, was an act of sheer inadvertence by the maintainer.
=================

These functions both predate regular expression syntax in Stata, but I don't think the latter helps much, if at all, with this particular problem. It's certainly not essential, as Phil's solution also indicates.

Use -ssc inst egenmore- to install, and then -help egenmore-.

Nick
n.j.cox@durham.ac.uk

Phil Schumm wrote:

No, I don't believe so. There are two ways to approach this: (1) compute the number of occurrences for each observation and then loop over observations, or (2) proceed one occurrence at a time, handling all observations at once.

The first approach would in general be more efficient if the variance in the number of occurrences were large; note that it would need to be done in Mata for it to scale well in the number of observations. However, the fact that string variables can only be 244 characters long imposes an upper bound on the maximum number of occurrences (and therefore on the variance), and, in many situations, the effective upper bound may be pretty small (i.e., at most only a couple of occurrences per observation). In such cases, the second approach would be adequate, e.g.,

    tempvar t1 t2
    gen `t1' = X
    gen `t2' = X
    gen Y = 0
    qui while 1 {
        * remove the first remaining occurrence in every observation
        replace `t1' = subinstr(`t1', "john", "", 1)
        cap ass `t1'==`t2'
        if _rc {
            * add 1 to Y wherever an occurrence was removed on this pass
            replace Y = Y + (`t1'!=`t2')
            replace `t2' = `t1'
        }
        else continue, br
    }

where -regexr()- can be substituted for -subinstr()- if additional flexibility in matching is required.

On Nov 4, 2008, at 8:42 PM, Mingfeng Lin wrote:

> I looked through the list of string functions but couldn't find one
> that fits the bill. Suppose I have a string variable X, and I would
> like to generate a new numeric variable Y containing the number of
> times a certain string appeared in X. For instance
>
> X = "johnabc johncd"
>
> If I'd like to find the number of times "john" shows up in X, I hope
> to obtain Y = 2
>
> Is there a function in Stata to do this?
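
For the example in the question, the calls discussed above might look roughly like the sketch below. It is an illustration under stated assumptions, not something taken from the thread: the extra observations and the variable names Y2 and Y3 are hypothetical, and the first -generate- line uses a length-difference idiom (delete every occurrence with -subinstr()-, then divide the loss in length by the length of the substring) that nobody in the thread proposed. The -egen- lines follow the syntax given in the quoted -egenmore- help and assume the package has been installed from SSC.

```stata
* Toy data: the first observation is the example from the question;
* the other two are made up for illustration
clear
input str20 X
"johnabc johncd"
"no match here"
"john"
end

* Length-difference idiom (an assumption, not from the thread):
* a final argument of . tells subinstr() to replace all occurrences,
* so the drop in length divided by length("john") is the count
gen Y = (length(X) - length(subinstr(X, "john", "", .))) / length("john")

* The same count with the egenmore functions (needs: ssc inst egenmore)
egen Y2 = noccur(X), string(john)
egen Y3 = nss(X), find(john) insensitive

list X Y Y2 Y3
```

For the first observation, X has 14 characters and 6 remain after both copies of "john" are removed, so Y = (14 - 6)/4 = 2, as the question asks. Like -noccur()-, this counts non-overlapping occurrences only.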


