[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
"Nick Cox" <n.j.cox@durham.ac.uk> |

To |
<statalist@hsphsun2.harvard.edu> |

Subject |
st: counting unique values across a set of variables |

Date |
Tue, 25 May 2004 15:12:47 +0100 |

C.M. Hsieh posted a question yesterday. I had a couple of attempts at it. It's a really nice question for illustrating some points of Stata technique. Here is a reprise, with second thoughts built in and some lengthier explanations. The point is not so much that a solution is written up just in case you ever get the same problem, but that some tips and tricks are made more transparent for attacking problems in nearby territory. First, I am going to change CM's terminology and notation slightly: Each observation in my data represents a respondent. Besides the first variable -id- giving an identifier, the other variables (call them -A- to -Z-) contain either interesting strings or missing values indicated by "." I need to create a variable -nvals- that counts the number of unique strings found for any given respondent in -A- to -Z-. The first step is clearly to find out if a command or a function or an -egen- function has been written to do precisely this. You may be in luck, or out of it. -search- and -findit- are the tools, but using keywords like "unique strings" or "distinct strings" yielded nothing. Those familiar with Stata's built-in functions would probably guess, as I did, that this is a little too esoteric to be matched by a built-in. A -egen- function is more likely: you may be familiar with -egen- functions like -rsum()- and -rmax()- which operate across variables and give a single result in each observation, but again no such function is evident. However, the problem of counting distinct values within variables has often been addressed: see in particular the FAQ How do I compute the number of distinct observations? http://www.stata.com/support/faqs/data/distinct.html This suggests that we can attack this through a -reshape- calculate number of distinct observations -reshape- sequence, otherwise known as the Stata two-step. [R] reshape and also another FAQ I am having problems with the reshape command. Can you give further guidance? http://www.stata.com/support/faqs/data/reshape3.html may be helpful here. In what follows, I am going to assume that you know about -foreach- and -forval- and how to use references to local macros with such structures. Both the manual entries for those structures and the on-line help are a little terse, so some long-winded explanations in Stata Journal 2(2): 202--222 (2002) and 3(2): 185--202 (2003) may be useful. First we -rename- variables, so that they have a common prefix, say foreach v of var A-Z { rename `v' S_`v' } Then we -reshape- to long: reshape long S_ , i(id) string Now our count of distinct strings could be bysort id S_ : gen nvals = _n == 1 by id : replace nvals = sum(nvals) by id : replace nvals = nvals[_N] Not so fast, however! If missing values indicated by "." are of no interest, we should exclude them. (In another problem, what is deemed missing by the user might be empty strings, or strings that are only spaces. Stata's definition of missing with strings is the empty string "".) One way of excluding "." from the count is to take special action with those deemed missing. Instead of the previous block of code, we go bysort id S_ : gen nvals = (_n == 1) * (S_ != ".") The expression S_ != "." evaluates to 1 whenever the string variable -S_- is not "." and to 0 when it is ".". So we don't add any occurrence of "." to the count. More precisely, we add 0 whenever -S_ == "."-, but that has the same consequence. The rest of the block is the same: by id : replace nvals = sum(nvals) by id : replace nvals = nvals[_N] Now we -reshape- back reshape wide S_ , i(id) string and then -nvals- is an extra variable in the dataset. The prefix we added at the outset needs removing: renpfix S_ In a way this solution is quite drastic: we restructure the whole dataset twice for the sake of one new variable. This raises the question of whether it can be done in place, and it can. If you were doing this by hand for a small dataset, how would you do it? You would loop over observations: for each observation { compile a list of all the values of the string variables count the number of distinct values (not ".") in the list } Experienced Stata users have often seen the advice: Don't loop over observations, as it is usually not the best way to do something. Here, however, is an example running contrary to that advice. The simple algorithm above is a good one: our only problem is translating it to Stata code. How do we loop over observations? You know that _N is Stata's special built-in for the number of observations, so you could go local N = _N forval i = 1/`N' { ... } but Stata 7 introduced a way of doing this on the fly, which was documented in Stata 8 (e.g. see -help macro- or [U] 21.3.7): forval i = 1/`=_N' { ... } That is, given the ` ', Stata looks inside, sees =_N and evaluates the expression, here just _N. Now we need a way of building up a list of all the values in -A- to -Z- in an observation. The structure is going to look like forval i = 1/`=_N' { foreach v of var A-Z { ... } } To compile a list of all the string values for a given observation, we could loop over local all `"`all' `"`=`v'[`i']'"'"' If you nod with recognition when you see this, you really shouldn't be reading this commentary, as you know enough Stata to write it. It deserves a lot of explanation. First, know, or recall, that compound double quotes start with `" and end with "'. They are Stata's solution to a nasty little problem: as the beginning double quote " is the same symbol as the end double quote " how is Stata to tell the difference between nested quoted strings "a"b"c" -- in which the quoted string "b" is embedded within a string that starts with "a and ends with c" -- and separate quoted strings "a"b"c" -- in which the quoted strings "a" and "c" are separate, and b is an extra character between them? Now you can usually tell the difference, as you (should) know what you mean: the issue is how can Stata tell the difference. The answer lies in using `" and "' as delimiters, so that Stata looks for the matching delimiter to see what are the building blocks, just as with (say) something with the structure (( ) ( )), as is entirely familiar to you from elementary mathematics. So we are going to look inside each value of each string variable. Observations are referred to within the outer -forval- loop by the local macro i, and variables within the inner -foreach- group by the local macro v, so we want generically to add `v'[`i'] to the list. We do this on the fly using the same technique as before `=`v'[`i']' That is, the outermost ` ' contain within them =`v'[`i'] and the = flags that there is an expression to be evaluated, namely the current value of the current string variable. We must bundle the result in compound double quotes (at worst, it also might contain yet another quote mark) `"`=`v'[`i']'"' The basic idea is that every time we pick up another string value we add it to our list, which is where we came in a few paragraphs back: local all `"`all' `"`=`v'[`i']'"'"' Think of this the first time through. The local macro -all- is at that point undefined, so `all' evaluates to nothing, or the empty string "". So -all- is born as `" `"`=`v'[`i']'"'"' Concretely, if the first value of the first variable -A[1]- is "frog", -all- is born as `" `"frog"'"' where again the outermost delimiters `" "' are needed so that everything remains straight, given the ambiguity if we were to nest " ". Note a nuance here, but a important one. We do not start local all = because forcing Stata to evaluate the expression beyond = could run into the problem that the results cannot be longer than 80 characters. Be warned: this could bite you. It is much safer just to copy, not to evaluate. The code so far is forval i = 1/`=_N' { foreach v of var A-Z { local all `"`all' `"`=`v'[`i']'"'"' } } We need some way of getting the number of unique (meaning distinct) strings in the local macro all. This is much harder than anything yet, but some customised machinery exists for the job. The tools at -help macrolists- or [P] macrolists are invaluable here. We first remove any duplicates local all : list uniq all and then count the number of strings in that resulting list local nall : list sizeof all However, we have got to put this number into a variable somehow. The standard device here is to -generate- a variable outside all the loops, and then to -replace- it observation by observation with our result inside the loop. gen nvals = 0 quietly forval i = 1/`=_N' { foreach v of var A-Z { local all `"`all' `"`=`v'[`i']'"'"' } local all : list uniq all replace nvals = `: list sizeof all' in `i' local all } Some flourishes, variously cosmetic and crucial, have been added here. * The whole thing will be rather noisy, with lots of -replace-s producing lots of little messages. Slapping a -quietly- on the outermost loop will fix that. * We can do the counting using -list sizeof- from -macrolists- on the fly using the `: ' construct. More at [U] 21.3.7. * Most importantly, we have to set the local macro all back to empty. Otherwise we just accumulate results from observation to observation. This still leaves one problem untouched: how to avoid counting "." as another value. One way is just to avoid it when seen: gen nvals = 0 quietly forval i = 1/`=_N' { foreach v of var A-Z { if `v'[`i'] != "." { local all `"`all' `"`=`v'[`i']'"'"' } } local all : list uniq all replace nvals = `: list sizeof all' in `i' local all } Nick n.j.cox@durham.ac.uk * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**st: margin and conformability error after constrained mlogit***From:*Adrian Gonzalez-Gonzalez <gonzalez-gonzale.1@osu.edu>

- Prev by Date:
**Re: st: Top 10 Tricks in Stata (summary to date)** - Next by Date:
**st: Robust standard errors** - Previous by thread:
**Re: st: Re: multiple )))brackets, is there a more efficientway?** - Next by thread:
**st: margin and conformability error after constrained mlogit** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |