Search
   >> Home >> Resources & support >> FAQs >> Counting distinct strings across a set of variables
The following material is based on postings on Statalist.

How do I count the number of distinct strings across a set of variables?

Title   Counting distinct strings across a set of variables
Author Nicholas J. Cox, Durham University, UK
Date July 2004; minor revisions November 2006; minor revision February 2014

1. The problem

Each observation in my data represents a respondent. Besides the first variable id, which gives an identifier, the other variables (call them A to Z) contain either interesting strings or missing values indicated by ".". I need to create a variable nvals that counts the number of unique strings found for any given respondent in A to Z.

2. Has someone else solved the problem?

The first step is clearly to find out if a command or a function or an egen function has been written to do precisely this. search and findit are the tools, but using keywords such as “unique strings” or “distinct strings” yielded nothing. Those familiar with Stata’s built-in functions would probably guess, as I did, that this is a little too esoteric to be matched by a built-in function. It is more likely an egen function was created to solve this problem; you may be familiar with egen functions like rowtotal() and rowmax(), which operate across variables and give a result in each observation, but again no such function is evident.

3. A double reshape solution

The problem of counting distinct values within variables has often been addressed; see, in particular, the FAQ: How do I compute the number of distinct observations?

This FAQ suggests that we can attack this problem through the sequence

        reshape 
        calculate number of distinct observations  
        reshape 

otherwise known as the Stata two-step. For more information, see [D] reshape and also another FAQ: I am having problems with the reshape command. Can you give further guidance?

In what follows, I am going to assume you know about foreach and forvalues and how to use references to local macros with such structures. Both the manual entries for those structures and the help are a little terse, so some long-winded explanations in Cox (2002) and Cox (2003) may be useful.

First, we rename variables so that they have a common prefix

 . foreach v of var A-Z { 
 . 	rename `v' S_`v' 
 . } 

Then we reshape to long:

 . reshape long S_ , i(id) string 

Now our count of distinct strings could be

 . by id S_, sort: gen nvals = _n == 1 
 . by id: replace nvals = sum(nvals) 
 . by id: replace nvals = nvals[_N] 

If missing values indicated by "." are of no interest, we should exclude them. (In another problem, what is deemed missing by the user might be empty strings, or strings that are only spaces. Stata’s definition of missing with strings is the empty string "".) One way of excluding "." from the count is to take special action with those deemed missing. Instead of using the previous block of code, we type

 . by id S_, sort: gen nvals = (_n == 1) * (S_ != ".")  

The expression S_ != "." evaluates to 1 whenever the string variable S_ is not "." and to 0 when it is ".". So we do not add any occurrence of "." to the count. More precisely, we add 0 whenever S_ == ".", but that has the same consequence. The rest of the block is the same:

 . by id: replace nvals = sum(nvals) 
 . by id: replace nvals = nvals[_N] 

Now we reshape back

 . reshape wide S_, i(id) string 

and then nvals is an extra variable in the dataset. The prefix we added at the outset needs removing:

 . renpfix S_ 

4. A looping over observations solution

In a way, this solution is quite drastic: we restructure the whole dataset twice for the sake of one new variable. This raises the question of whether it can be done in place, and indeed it can. If you were doing this by hand for a small dataset, how would you do it? You would loop over observations:

 for each observation { 
         compile a list of all the values of the string variables 
         count the number of distinct values (not ".") in the list
 }

Experienced Stata users have often seen this advice: Do not loop over observations, as it is usually not the best way to do something. Here, however, is an example running contrary to that advice. The simple algorithm above is a good one; our only problem is translating it to Stata code. How do we loop over observations? You know that _N is Stata’s special built-in for the number of observations, so you could type

 . local N = _N 
 . forval i = 1/`N' { 
 . 	...
 . } 

but Stata 7 introduced a way of doing this on the fly, which was documented beginning with Stata 8 (see macro or [U] 18.3.7 Macro increment and decrement functions):

 . forval i = 1/`=_N' { 
 . 	...
 . }

That is, given the ` ', Stata looks inside, sees

        =_N 

and evaluates the expression, which here is just _N.

Now we need a way to build up a list of all the values in A to Z in an observation. The structure is going to look like this:

 . forval i = 1/`=_N' { 
 . 	foreach v of var A-Z { 
 . 		...
 . 	} 	
 . }

To compile a list of all the string values for a given observation, we could loop over

 . local all `"`all' `"`=`v'[`i']'"'"' 

If you nod with recognition when you see this, you really shouldn’t need this commentary because you know enough Stata to write it. It deserves a lot of explanation. First, understand, or recall, that compound double quotes start with `" and end with "'. They are Stata’s solution to a nasty little problem: as the beginning double quote " is the same symbol as the end double quote " how does Stata tell the difference between nested quoted strings

        "a"b"c"

in which the quoted string "b" is embedded within a string that starts with "a and ends with c", and separate quoted strings

        "a"b"c"

in which the quoted strings "a" and "c" are separate and b is an extra character between them? You can usually tell the difference because you (should) know what you mean, but Stata can’t tell the difference. The answer lies instead in using `" and "' as delimiters so that Stata looks for the matching delimiter to see what the building blocks are, just as with, say, something with the structure (( ) ( )), which is entirely familiar from elementary mathematics.

So we are going to look inside each value of each string variable. Observations are referred to within the outer forval loop by the local macro i and variables within the inner foreach group by the local macro v, so we want generically to add `v'[`i'] to the list. We do this on the fly using the same technique as before:

        `=`v'[`i']'

That is, the outermost ` ' contain within them

        =`v'[`i'] 

and the = indicates that there is an expression to be evaluated, namely, the current value of the current string variable. We must bundle the result in compound double quotes (at worst, it also might contain yet another quotation mark):

        `"`=`v'[`i']'"' 	

The basic idea is that every time we pick up another string value we add it to our list, which is where we came in a few paragraphs back:

 . local all `"`all' `"`=`v'[`i']'"'"' 

Think of this the first time through. The local macro all is at that point undefined, so `all' evaluates to nothing or the empty string "". So all is born as

        `" `"`=`v'[`i']'"'"' 

Concretely, if the first value of the first variable A[1] is "frog", all is born as

        `" `"frog"'"' 

where again the outermost delimiters `" "' are needed so that everything remains straight, given the ambiguity if we were to nest " ". Note an important nuance here. We do not start

 . local all = ...

because forcing Stata to evaluate the expression beyond = could run into a problem where the results cannot be longer than the maximum number of characters in a string expression that can be evaluated. That can be as low as 80, depending on your version of Stata; see limits. Be warned: this could bite you. It is much safer just to copy, not to evaluate.

The code so far is

 . forval i = 1/`=_N' { 
 . 	foreach v of var A-Z { 
 . 		local all `"`all' `"`=`v'[`i']'"'"' 
 . 	} 
 . }

We need some way of getting the number of unique (meaning distinct) strings in the local macro all. That is much harder than anything yet, but some customized machinery exists for the job. The tools at help macrolists or [P] macro lists are invaluable here. We first remove any duplicates,

 . local all : list uniq all 

and then count the number of strings in that resulting list

 . local nall : list sizeof all 

However, we have to put this number into a variable, somehow. The standard device here is to generate a variable outside all the loops and then to replace it, observation by observation, with our result inside the loops.

 . gen nvals = 0 
 . quietly forval i = 1/`=_N' { 
 . 	foreach v of var A-Z { 
 . 		local all `"`all' `"`=`v'[`i']'"'"' 
 . 	} 
 . 	local all : list uniq all 
 . 	replace nvals = `: list sizeof all' in `i' 
 . 	local all 
 . }

Some flourishes, variously cosmetic and crucial, have been added here.

This still leaves one problem untouched: how to avoid counting "." as another value. One way is just to avoid it when seen:

 . gen nvals = 0 
 . quietly forval i = 1/`=_N' { 
 . 	foreach v of var A-Z { 
 . 		if `v'[`i'] != "." { 
 . 			local all `"`all' `"`=`v'[`i']'"'"'
 . 		} 	
 . 	} 
 . 	local all : list uniq all 
 . 	replace nvals = `: list sizeof all' in `i' 
 . 	local all 
 . }

In another problem, we might prefer not to count empty strings or strings containing only spaces. We should just add that constraint to the if condition:

        if `v'[`i'] != "." & trim(`v'[`i']) != ""

References

Cox, N. J. 2002.
Speaking Stata: How to face lists with fortitude. Stata Journal 2: 202–222.
Cox, N. J. 2003.
Speaking Stata: Problems with lists. Stata Journal 3: 185–202.
Cox, N. J. and G. M. Longton. 2008.
Distinct observations. Stata Journal 8: 557–568.
The Stata Blog: Not Elsewhere Classified Find us on Facebook Follow us on Twitter LinkedIn Google+ Watch us on YouTube