The following material is based on postings on
Statalist.
How do I count the number of distinct strings across a set of variables?
|
Title
|
|
Counting distinct strings across a set of variables
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
July 2004; minor revisions November 2006
|
1. The problem
Each observation in my data represents a respondent. Besides the first
variable id, which gives an identifier, the
other variables (call them A to
Z) contain either interesting strings or
missing values indicated by ".". I need to
create a variable nvals that counts the
number of unique strings found for any given respondent in
A to Z.
2. Has someone else solved the problem?
The first step is clearly to find out if a command or a function or an
egen function has
been written to do precisely this.
search and
findit are the
tools, but using keywords such as “unique strings” or
“distinct strings” yielded nothing. Those familiar with
Stata’s built-in functions would probably guess, as I did, that this
is a little too esoteric to be matched by a built-in function.
It is more likely an egen function was created
to solve this problem; you may be familiar with
egen functions like
rowtotal() and
rowmax(), which operate across variables and
give a result in each observation, but again no such function is
evident.
3. A double reshape solution
The problem of counting distinct values within variables has often
been addressed; see, in particular, the FAQ
How do I compute the number of distinct observations?
This FAQ suggests that we can attack this problem through the sequence
reshape
calculate number of distinct observations
reshape
otherwise known as the Stata two-step. For more information, see [D]
reshape
and also another FAQ
I am having problems with the reshape command. Can you give further
guidance?
In what follows, I am going to assume you know about
foreach and
forvalues and
how to use references to local macros with such structures. Both the manual
entries for those structures and the online help are a little terse, so some
long-winded explanations in Cox (2002) and Cox (2003) may be
useful.
First, we rename
variables so that they have a common prefix
. foreach v of var A-Z {
. rename `v' S_`v'
. }
Then we reshape to long:
. reshape long S_ , i(id) string
Now our count of distinct strings could be
. by id S_, sort: gen nvals = _n == 1
. by id: replace nvals = sum(nvals)
. by id: replace nvals = nvals[_N]
If missing values indicated by "." are of no
interest, we should exclude them. (In another problem, what is deemed
missing by the user might be empty strings, or strings that are only spaces.
Stata’s definition of missing with strings is the empty string
"".) One way of excluding "." from the count is to take
special action with those deemed missing. Instead of using the previous
block of code, we type
. by id S_, sort: gen nvals = (_n == 1) * (S_ != ".")
The expression S_ != "." evaluates to 1 whenever the string variable
S_ is not "." and to 0 when it is ".". So we
do not add any occurrence of "." to the count. More precisely,
we add 0 whenever S_ == ".", but that has the same consequence. The
rest of the block is the same:
. by id: replace nvals = sum(nvals)
. by id: replace nvals = nvals[_N]
Now we reshape back
. reshape wide S_, i(id) string
and then nvals is an extra variable in the
dataset. The prefix we added at the outset needs removing:
. renpfix S_
4. A looping over observations solution
In a way, this solution is quite drastic: we restructure the whole dataset
twice for the sake of one new variable. This raises the question of whether
it can be done in place, and indeed it can. If you were doing this by hand
for a small dataset, how would you do it? You would loop over observations:
for each observation {
compile a list of all the values of the string variables
count the number of distinct values (not ".") in the list
}
Experienced Stata users have often seen this advice: Do not loop over
observations, as it is usually not the best way to do something. Here,
however, is an example running contrary to that advice. The simple algorithm
above is a good one; our only problem is translating it to Stata code. How
do we loop over observations? You know that
_N is Stata’s special built-in for
the number of observations, so you could type
. local N = _N
. forval i = 1/`N' {
. ...
. }
but Stata 7 introduced a way of doing this on the fly, which was documented
beginning with Stata 8 (see
help
macro or [U] 18.3.7 Macro increment and decrement functions):
. forval i = 1/`=_N' {
. ...
. }
That is, given the ` ', Stata looks inside, sees
=_N
and evaluates the expression, which here is just
_N.
Now we need a way to build up a list of all the values in
A to Z in
an observation. The structure is going to look like this:
. forval i = 1/`=_N' {
. foreach v of var A-Z {
. ...
. }
. }
To compile a list of all the string values for a given observation, we could
loop over
. local all `"`all' `"`=`v'[`i']'"'"'
If you nod with recognition when you see this, you really shouldn’t
need this commentary because you know enough Stata to write it. It deserves
a lot of explanation. First, understand, or recall, that compound double
quotes start with `" and end with "'. They are Stata’s
solution to a nasty little problem: as the beginning double quote "
is the same symbol as the end double quote " how does Stata tell the
difference between nested quoted strings
"a"b"c"
in which the quoted string "b" is embedded within a string that
starts with "a and ends with c", and separate quoted strings
"a"b"c"
in which the quoted strings "a" and "c" are separate and
b is an extra character between them? You can usually tell the
difference because you (should) know what you mean, but Stata can’t tell
the difference. The answer lies instead in using `" and "' as
delimiters so that Stata looks for the matching delimiter to see what the
building blocks are, just as with, say, something with the structure (( )
( )), which is entirely familiar from elementary mathematics.
So we are going to look inside each value of each string variable.
Observations are referred to within the outer forval loop by the
local macro i and variables within the inner foreach group by
the local macro v, so we want generically to add `v'[`i'] to
the list. We do this on the fly using the same technique as before:
`=`v'[`i']'
That is, the outermost ` ' contain within them
=`v'[`i']
and the = indicates that there is an expression to be evaluated,
namely, the current value of the current string variable. We must bundle the
result in compound double quotes (at worst, it also might contain yet
another quotation mark):
`"`=`v'[`i']'"'
The basic idea is that every time we pick up another string value we add
it to our list, which is where we came in a few paragraphs back:
. local all `"`all' `"`=`v'[`i']'"'"'
Think of this the first time through. The local macro all is at that
point undefined, so `all' evaluates to nothing or the empty string
"". So all is born as
`" `"`=`v'[`i']'"'"'
Concretely, if the first value of the first variable A[1] is
"frog", all is born as
`" `"frog"'"'
where again the outermost delimiters `" "' are needed so that
everything remains straight, given the ambiguity if we were to nest "
". Note an important nuance here. We do not start
. local all = ...
because forcing Stata to evaluate the expression beyond = could run
into a problem where the results cannot be longer than the maximum number
of characters in a string expression that can be evaluated. That can be as
low as 80, depending on your version of Stata; see
help limits.
Be warned: this could bite you. It is much safer just to copy, not to
evaluate.
The code so far is
. forval i = 1/`=_N' {
. foreach v of var A-Z {
. local all `"`all' `"`=`v'[`i']'"'"'
. }
. }
We need some way of getting the number of unique (meaning distinct) strings
in the local macro all. That is much harder than anything yet, but
some customized machinery exists for the job. The tools at
help
macrolists or [P] macro lists are invaluable
here. We first remove any duplicates,
. local all : list uniq all
and then count the number of strings in that resulting list
. local nall : list sizeof all
However, we have to put this number into a variable, somehow. The
standard device here is to
generate a
variable outside all the loops and then to
replace it,
observation by observation, with our result inside the loops.
. gen nvals = 0
. quietly forval i = 1/`=_N' {
. foreach v of var A-Z {
. local all `"`all' `"`=`v'[`i']'"'"'
. }
. local all : list uniq all
. replace nvals = `: list sizeof all' in `i'
. local all
. }
Some flourishes, variously cosmetic and crucial, have been added here.
- The whole thing will be rather noisy, with lots of
replaces
producing lots of little messages. Slapping a
quietly on
the outermost loop will fix that.
- We can do the counting using list sizeof from
macro lists
on the fly using the `: ' construct. More at [U]
18.3.7.
- Most importantly, we have to set the local macro
all back to empty; otherwise, we just
accumulate results from observation to observation.
This still leaves one problem untouched: how to avoid counting "." as
another value. One way is just to avoid it when seen:
. gen nvals = 0
. quietly forval i = 1/`=_N' {
. foreach v of var A-Z {
. if `v'[`i'] != "." {
. local all `"`all' `"`=`v'[`i']'"'"'
. }
. }
. local all : list uniq all
. replace nvals = `: list sizeof all' in `i'
. local all
. }
In another problem, we might prefer not to count empty strings
or strings containing only spaces. We should just add that constraint to the
if condition:
if `v'[`i'] != "." & trim(`v'[`i']) != ""
References
- Cox, N. J. 2002.
-
Speaking Stata: How to face lists with fortitude.
Stata Journal 2: 202–222.
- Cox, N. J. 2003.
-
Speaking Stata: Problems with lists.
Stata Journal 3: 185–202.
|