Note: The following material is based on postings on
Statalist.

Title | Counting distinct strings across a set of variables | |

Author | Nicholas J. Cox, Durham University, UK |

Each observation in my data represents a respondent. Besides the first
variable **id**, which gives an identifier, the
other variables (call them **A** to
**Z**) contain either interesting strings or
missing values indicated by **"."**. I need to
create a variable **nvals** that counts the
number of unique strings found for any given respondent in
**A** to **Z**.

The first step is clearly to find out if a command or a function or an
egen function has
been written to do precisely this.
search and
**findit** are the
tools, but using keywords such as “unique strings” or
“distinct strings” yielded nothing. Those familiar with
Stata’s built-in functions would probably guess, as I did, that this
is a little too esoteric to be matched by a built-in function.
It is more likely an **egen** function was created
to solve this problem; you may be familiar with
**egen** functions like
**rowtotal()** and
**rowmax()**, which operate across variables and
give a result in each observation, but again no such function is
evident.

The problem of counting distinct values within variables has often
been addressed; see, in particular, the
*FAQ:
How do I compute the number of distinct observations?*

This FAQ suggests that we can attack this problem through the sequence

reshapecalculate number of distinct observationsreshape

otherwise known as the Stata two-step. For more information, see
[D] **reshape**
and also another
*
FAQ: I am having problems with the reshape command. Can you give further
guidance?*

In what follows, I am going to assume you know about foreach and forvalues and how to use references to local macros with such structures. Both the manual entries for those structures and the help are a little terse, so some long-winded explanations in Cox (2002) and Cox (2003) may be useful.

First, we rename variables so that they have a common prefix

.rename (A-Z) S_=

Then we **reshape** to long:

. reshape long S_ , i(id) string

Now our count of distinct strings could be

. by id S_, sort: gen nvals = _n == 1 . by id: replace nvals = sum(nvals) . by id: replace nvals = nvals[_N]

If missing values indicated by **"."** are of no
interest, we should exclude them. (In another problem, what is deemed
missing by the user might be empty strings, or strings that are only spaces.
Stata’s definition of missing with strings is the empty string
**""**.) One way of excluding **"."** from the count is to take
special action with those deemed missing. Instead of using the previous
block of code, we type

. by id S_, sort: gen nvals = (_n == 1) * (S_ != ".")

The expression **S_ != "."** evaluates to 1 whenever the string variable
**S_** is not **"."** and to 0 when it is **"."**. So we
do not add any occurrence of **"."** to the count. More precisely,
we add 0 whenever **S_ == "."**, but that has the same consequence. The
rest of the block is the same:

. by id: replace nvals = sum(nvals) . by id: replace nvals = nvals[_N]

Now we **reshape** back

. reshape wide S_, i(id) string

and then **nvals** is an extra variable in the
dataset. The prefix we added at the outset needs removing:

. rename S_* *

In a way, this solution is quite drastic: we restructure the whole dataset twice for the sake of one new variable. This raises the question of whether it can be done in place, and indeed it can. If you were doing this by hand for a small dataset, how would you do it? You would loop over observations:

for each observation { compile a list of all the values of the string variables count the number of distinct values (not".") in the list }

Experienced Stata users have often seen this advice: Do not loop over
observations, as it is usually not the best way to do something. Here,
however, is an example running contrary to that advice. The simple algorithm
above is a good one; our only problem is translating it to Stata code. How
do we loop over observations? You know that
**_N** is Stata’s special built-in for
the number of observations, so you could type

. local N = _N . forval i = 1/`N' { . ... . }

but Stata 7 introduced a way of doing this on the fly, which was documented
beginning with Stata 8 (see
macro or [U] **18.3.7 Macro increment and decrement functions**):

. forval i = 1/`=_N' { . ... . }

That is, given the **` '**, Stata looks inside, sees

=_N

and evaluates the expression, which here is just
**_N**.

Now we need a way to build up a list of all the values in
**A** to **Z** in
an observation. The structure is going to look like this:

. forval i = 1/`=_N' { . foreach v of var A-Z { . ... . } . }

To compile a list of all the string values for a given observation, we could loop over

. local all `"`all' `"`=`v'[`i']'"'"'

If you nod with recognition when you see this, you really shouldn’t
need this commentary because you know enough Stata to write it. It deserves
a lot of explanation. First, understand, or recall, that compound double
quotes start with **`"** and end with **"'**. They are Stata’s
solution to a nasty little problem: as the beginning double quote **"**
is the same symbol as the end double quote **"** how does Stata tell the
difference between nested quoted strings

"a"b"c"

in which the quoted string **"b"** is embedded within a string that
starts with **"a** and ends with **c"**, and separate quoted strings

"a"b"c"

in which the quoted strings **"a"** and **"c"** are separate and
**b** is an extra character between them? You can usually tell the
difference because you (should) know what you mean, but Stata can’t tell
the difference. The answer lies instead in using **`"** and **"'** as
delimiters so that Stata looks for the matching delimiter to see what the
building blocks are, just as with, say, something with the structure **(( )
( ))**, which is entirely familiar from elementary mathematics.

So we are going to look inside each value of each string variable.
Observations are referred to within the outer **forval** loop by the
local macro **i** and variables within the inner **foreach** group by
the local macro **v**, so we want generically to add **`v'[`i']** to
the list. We do this on the fly using the same technique as before:

`=`v'[`i']'

That is, the outermost **` '** contain within them

=`v'[`i']

and the **=** indicates that there is an expression to be evaluated,
namely, the current value of the current string variable. We must bundle the
result in compound double quotes (at worst, it also might contain yet
another quotation mark):

`"`=`v'[`i']'"'

The basic idea is that every time we pick up another string value we add it to our list, which is where we came in a few paragraphs back:

. local all `"`all' `"`=`v'[`i']'"'"'

Think of this the first time through. The local macro **all** is at that
point undefined, so **`all'** evaluates to nothing or the empty string
**""**. So **all** is born as

`" `"`=`v'[`i']'"'"'

Concretely, if the first value of the first variable **A[1]** is
**"frog"**, **all** is born as

`" `"frog"'"'

where again the outermost delimiters **`" "'** are needed so that
everything remains straight, given the ambiguity if we were to nest **"
"**.

The code so far is

. forval i = 1/`=_N' { . foreach v of var A-Z { . local all `"`all' `"`=`v'[`i']'"'"' . } . }

We need some way of getting the number of unique (meaning distinct) strings
in the local macro **all**. That is much harder than anything yet, but
some customized machinery exists for the job. The tools at
**help
macrolists** or [P] **macro lists** are invaluable
here. We first remove any duplicates,

. local all : list uniq all

and then count the number of strings in that resulting list

. local nall : list sizeof all

However, we have to put this number into a variable, somehow. The standard device here is to generate a variable outside all the loops and then to replace it, observation by observation, with our result inside the loops.

. gen nvals = 0 . quietly forval i = 1/`=_N' { . foreach v of var A-Z { . local all `"`all' `"`=`v'[`i']'"'"' . } . local all : list uniq all . replace nvals = `: list sizeof all' in `i' . local all . }

Some flourishes, variously cosmetic and crucial, have been added here.

- The whole thing will be rather noisy, with lots of
**replace**s producing lots of little messages. Slapping a quietly on the outermost loop will fix that. - We can do the counting using
**list sizeof**from**macro lists**on the fly using the**`: '**construct. More at [U]**18.3.7**. - Most importantly, we have to set the local macro
**all**back to empty; otherwise, we just accumulate results from observation to observation.

This still leaves one problem untouched: how to avoid counting **"."** as
another value. One way is just to avoid it when seen:

. gen nvals = 0 . quietly forval i = 1/`=_N' { . foreach v of var A-Z { . if `v'[`i'] != "." { . local all `"`all' `"`=`v'[`i']'"'"' . } . } . local all : list uniq all . replace nvals = `: list sizeof all' in `i' . local all . }

In another problem, we might prefer not to count empty strings or strings containing only spaces. We should just add that constraint to the if condition:

if `v'[`i'] != "." & trim(`v'[`i']) != ""

- Cox, N. J. 2002.
- Speaking Stata: How to face lists with fortitude. Stata Journal 2: 202–222.

- Cox, N. J. 2003.
- Speaking Stata: Problems with lists. Stata Journal 3: 185–202.

- Cox, N. J. and G. M. Longton. 2008.
- Distinct observations.
*Stata Journal*8: 557–568.