[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
st: counting unique values across a set of variables

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	st: counting unique values across a set of variables
Date	Tue, 25 May 2004 15:12:47 +0100
C.M. Hsieh posted a question yesterday. I had a couple of attempts 
at it. It's a really nice question for illustrating some points of 
Stata technique. Here is a reprise, with second thoughts built in and
some lengthier explanations. The point is not so much that a
solution is written up just in case you ever get the same problem,
but that some tips and tricks are made more transparent for
attacking problems in nearby territory. 

First, I am going to change CM's terminology and notation
slightly: 

	Each observation in my data represents a respondent.
	Besides the first variable -id- giving an identifier, the
	other variables (call them -A- to -Z-) contain either
	interesting strings or missing values indicated by "."  I
	need to create a variable -nvals- that counts the number
	of unique strings found for any given respondent in -A- to
	-Z-. 
 
The first step is clearly to find out if a command or a function
or an -egen- function has been written to do precisely this. You
may be in luck, or out of it. -search- and -findit- are the tools,
but using keywords like "unique strings" or "distinct strings"
yielded nothing. Those familiar with Stata's built-in functions
would probably guess, as I did, that this is a little too esoteric
to be matched by a built-in. A -egen- function is more likely: you
may be familiar with -egen- functions like -rsum()- and -rmax()-
which operate across variables and give a single result in each
observation, but again no such function is evident. 

However, the problem of counting distinct values within variables
has often been addressed: see in particular the FAQ 

How do I compute the number of distinct observations?
http://www.stata.com/support/faqs/data/distinct.html 

This suggests that we can attack this through a 

	-reshape- 
	calculate number of distinct observations  
	-reshape- 

sequence, otherwise known as the Stata two-step. [R] reshape and
also another FAQ 

I am having problems with the reshape command. Can you give 
further guidance?
http://www.stata.com/support/faqs/data/reshape3.html

may be helpful here. 

In what follows, I am going to assume that you know about
-foreach- and -forval- and how to use references to local macros
with such structures. Both the manual entries for those structures
and the on-line help are a little terse, so some long-winded
explanations in Stata Journal 2(2): 202--222 (2002) and 3(2):
185--202 (2003) may be useful. 

First we -rename- variables, so that they have a common prefix,
say 

foreach v of var A-Z { 
	rename `v' S_`v' 
} 

Then we -reshape- to long: 

reshape long S_ , i(id) string 

Now our count of distinct strings could be 

bysort id S_ : gen nvals = _n == 1 
by id : replace nvals = sum(nvals) 
by id : replace nvals = nvals[_N] 

Not so fast, however! If missing values indicated by "." are of no
interest, we should exclude them. (In another problem, what is
deemed missing by the user might be empty strings, or strings that
are only spaces. Stata's definition of missing with strings is the
empty string "".) One way of excluding "." from the count is to
take special action with those deemed missing. Instead of the
previous block of code, we go 

bysort id S_ : gen nvals = (_n == 1) * (S_ != ".")  

The expression 

	S_ != "."

evaluates to 1 whenever the string variable -S_- is not "." and to
0 when it is ".". So we don't add any occurrence of "." to the
count. More precisely, we add 0 whenever -S_ == "."-, but that has
the same consequence. The rest of the block is the same:  

by id : replace nvals = sum(nvals) 
by id : replace nvals = nvals[_N] 

Now we -reshape- back 

reshape wide S_ , i(id) string 

and then -nvals- is an extra variable in the dataset. The prefix
we added at the outset needs removing: 

renpfix S_ 

In a way this solution is quite drastic: we restructure the whole
dataset twice for the sake of one new variable. This raises the
question of whether it can be done in place, and it can. If you
were doing this by hand for a small dataset, how would you do it? 
You would loop over observations: 

for each observation { 
	compile a list of all the values of the string variables 
	count the number of distinct values (not ".") in the list
} 

Experienced Stata users have often seen the advice: Don't loop
over observations, as it is usually not the best way to do
something. Here, however, is an example running contrary to that
advice. The simple algorithm above is a good one: our only problem
is translating it to Stata code. How do we loop over observations?
You know that _N is Stata's special built-in for the number of
observations, so you could go 

local N = _N 
forval i = 1/`N' { 
	...
} 

but Stata 7 introduced a way of doing this on the fly, which was
documented in Stata 8 (e.g. see -help macro- or [U] 21.3.7):

forval i = 1/`=_N' { 
	...
}

That is, given the ` ', Stata looks inside, sees 

=_N 

and evaluates the expression, here just _N. 

Now we need a way of building up a list of all the values in -A-
to -Z- in an observation. The structure is going to look like 

forval i = 1/`=_N' { 
	foreach v of var A-Z { 
		...
	} 	
}

To compile a list of all the string values for a given
observation, we could loop over 

local all `"`all' `"`=`v'[`i']'"'"' 

If you nod with recognition when you see this, you really
shouldn't be reading this commentary, as you know enough Stata to
write it.  It deserves a lot of explanation. First, know, or
recall, that compound double quotes start with `" and end with "'.
They are Stata's solution to a nasty little problem: as the
beginning double quote " is the same symbol as the end double
quote " how is Stata to tell the difference between nested quoted
strings 

"a"b"c"

-- in which the quoted string "b" is embedded within a string that
starts with "a and ends with c" -- and separate quoted strings 

"a"b"c"

-- in which the quoted strings "a" and "c" are separate, and b is
an extra character between them?  Now you can usually tell the
difference, as you (should) know what you mean: the issue is how
can Stata tell the difference. The answer lies in using `" and "'
as delimiters, so that Stata looks for the matching delimiter to
see what are the building blocks, just as with (say) something
with the structure (( ) ( )), as is entirely familiar to you from 
elementary mathematics. 

So we are going to look inside each value of each string variable.
Observations are referred to within the outer -forval- loop by the
local macro i, and variables within the inner -foreach- group by
the local macro v, so we want generically to add `v'[`i'] to the
list. We do this on the fly using the same technique as before 

	`=`v'[`i']'

That is, the outermost ` ' contain within them 

	=`v'[`i'] 

and the = flags that there is an expression to be evaluated,
namely the current value of the current string variable. We must
bundle the result in compound double quotes (at worst, it also
might contain yet another quote mark)

	`"`=`v'[`i']'"' 	

The basic idea is that every time we pick up another string value
we add it to our list, which is where we came in a few paragraphs
back: 

local all `"`all' `"`=`v'[`i']'"'"' 

Think of this the first time through. The local macro -all- is at
that point undefined, so `all' evaluates to nothing, or the empty
string "".  So -all- is born as 

`" `"`=`v'[`i']'"'"' 

Concretely, if the first value of the first variable -A[1]- is 
"frog", -all- is born as 

`" `"frog"'"' 

where again the outermost delimiters `" "' are needed so that
everything remains straight, given the ambiguity if we were to
nest " ". Note a nuance here, but a important one. We do not start 

local all = 

because forcing Stata to evaluate the expression beyond = could
run into the problem that the results cannot be longer than 80
characters. Be warned: this could bite you. It is much safer just
to copy, not to evaluate. 

The code so far is 

forval i = 1/`=_N' { 
	foreach v of var A-Z { 
		local all `"`all' `"`=`v'[`i']'"'"' 
	} 
}

We need some way of getting the number of unique (meaning
distinct) strings in the local macro all. This is much harder than
anything yet, but some customised machinery exists for the job.
The tools at -help macrolists- or [P] macrolists are invaluable
here. We first remove any duplicates 

		local all : list uniq all 

and then count the number of strings in that resulting list  

		local nall : list sizeof all 

However, we have got to put this number into a variable somehow.
The standard device here is to -generate- a variable outside all
the loops, and then to -replace- it observation by observation
with our result inside the loop. 

gen nvals = 0 
quietly forval i = 1/`=_N' { 
	foreach v of var A-Z { 
		local all `"`all' `"`=`v'[`i']'"'"' 
	} 
	local all : list uniq all 
	replace nvals = `: list sizeof all' in `i' 
	local all 
}

Some flourishes, variously cosmetic and crucial, have been added
here. 

* The whole thing will be rather noisy, with lots of -replace-s 
producing lots of little messages. Slapping a -quietly- on the
outermost loop will fix that. 

* We can do the counting using -list sizeof- from -macrolists- 
on the fly using the `: ' construct. More at [U] 21.3.7. 

* Most importantly, we have to set the local macro all back 
to empty. Otherwise we just accumulate results from observation to
observation. 

This still leaves one problem untouched: how to avoid counting 
"." as another value. One way is just to avoid it when seen: 

gen nvals = 0 
quietly forval i = 1/`=_N' { 
	foreach v of var A-Z { 
		if `v'[`i'] != "." { 
			local all `"`all' `"`=`v'[`i']'"'"' 
		} 	
	} 
	local all : list uniq all 
	replace nvals = `: list sizeof all' in `i' 
	local all 
}

Nick 
[email protected] 


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Follow-Ups:
- st: margin and conformability error after constrained mlogit
  - From: Adrian Gonzalez-Gonzalez <[email protected]>
Prev by Date: Re: st: Top 10 Tricks in Stata (summary to date)
Next by Date: st: Robust standard errors
Previous by thread: Re: st: Re: multiple )))brackets, is there a more efficientway?
Next by thread: st: margin and conformability error after constrained mlogit
Index(es):
- Date
- Thread