
Re: st: counting the number of times a string appears in a string variable?

From   Phil Schumm <>
Subject   Re: st: counting the number of times a string appears in a string variable?
Date   Wed, 5 Nov 2008 02:05:31 -0600

On Nov 4, 2008, at 8:42 PM, Mingfeng Lin wrote:
I looked through the list of string functions but couldn't find one that fits the bill. Suppose I have a string variable X, and I would like to generate a new numeric variable Y containing the number of times a certain string appeared in X. For instance

X = "johnabc johncd"

If I'd like to find the number of times "john" shows up in X, I hope to obtain Y = 2

Is there a function in Stata to do this?

No, I don't believe so. There are two ways to approach this: (1) loop over observations, counting all of the occurrences within each one, or (2) proceed one occurrence at a time, handling all observations at once.

The first approach would in general be more efficient if the variance in the number of occurrences were large; note that it would need to be done in Mata for it to scale well in the number of observations. However, the fact that string variables can only be 244 characters long imposes an upper bound on the number of occurrences (and therefore on the variance), and in many situations the effective upper bound may be pretty small (i.e., at most a couple of occurrences per observation). In such cases, the second approach would be adequate, e.g.,

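* t1 is stripped of one match per pass; t2 holds the previous pass
tempvar t1 t2
gen `t1' = X
gen `t2' = X
gen Y = 0
qui while 1 {
    * remove the first remaining "john" in every observation
    replace `t1' = subinstr(`t1', "john", "", 1)
    * _rc is nonzero if any observation changed on this pass
    cap ass `t1'==`t2'
    if _rc {
        * add 1 wherever a match was removed, then resync t2
        replace Y = Y + (`t1'!=`t2')
        replace `t2' = `t1'
    }
    * nothing changed anywhere, so every match has been counted
    else continue, br
}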

where -regexr()- can be substituted for -subinstr()- if additional flexibility in matching is required.
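
When the target is a fixed literal string rather than a pattern, the same count can probably be obtained without a loop by comparing string lengths before and after removing every occurrence. The lines below are only a sketch under that assumption; the local macro name target and the variable name Y2 are hypothetical and chosen so as not to clash with Y above.

```stata
* Sketch: count non-overlapping occurrences of a literal substring.
* Assumes X is a string variable; target and Y2 are illustrative names.
local target "john"

* subinstr(..., .) removes every occurrence, so the drop in length
* divided by the length of the target is the number of occurrences
gen Y2 = (length(X) - length(subinstr(X, "`target'", "", .))) / length("`target'")
```

For X = "johnabc johncd" the length falls from 14 to 6 and the target has length 4, so Y2 = 2. This shortcut does not carry over to -regexr()- matching, where matches can differ in length, so the loop remains the approach to use there.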

-- Phil
