Re: st: find categorical variables

Thu, 22 Mar 2012 09:53:54 +0000

** finding categorical variables ************
sysuse auto,clear
findname, all(@ == int(@)) local(catvarlist)
distinct `catvarlist'
***************************************

              |        Observations
              |      total   distinct
--------------+----------------------
        price |         74         74
          mpg |         74         21
        rep78 |         69          5
        trunk |         74         18
       weight |         74         64
       length |         74         47
         turn |         74         18
 displacement |         74         31
      foreign |         74          2

On 22/03/2012 09:27, Nick Cox wrote:

The assumption is fallacious. As one of many counter-examples, consider a variable that is 50% 0s and 50% 1s. The mean would be 0.5. The same example shows that the median of integers need not be integer either; while the mode of integers must be integer, the mode is not always well defined, at least with the usual naive definitions of mode. In any case the converse does not follow for any of these measures, i.e. an integer-valued summary implies very little of consequence.

The problem is related to that in a concurrent thread, started by Bert Jung in

http://www.stata.com/statalist/archive/2012-03/msg00731.html

and with no closure yet, on distinguishing binary from continuous variables. The discussion in that thread is broken, so use e.g.

http://www.stata.com/statalist/archive/2012-03/

as a catalogue of contributions.

Stata clearly has no precise notion of categorical variables; it has a precise notion of factor variables, but nothing makes the use of factor variables compulsory for what any researcher might call categorical variables. Categoricity, like beauty, is in the eye of the beholder.

So, you need a definition, and whatever definition you use will not be fail-safe, but it is easy enough to find all variables that are integer-valued. With -findname- (SSC, SJ) you can do this

findname, all(@ == int(@))

(-floor()- or -ceil()- would work fine instead of -int()-). The criterion for an integer used here is that it rounds to itself.

I also posted yesterday a signal that -distinct- (SSC) has just been updated in a way that makes it more useful in this territory.

http://www.stata.com/statalist/archive/2012-03/msg00893.html

-distinct- lets you find the number of distinct values. I know that many users work with categorical variables with many, many categories, so that need not be part of your criterion.

As in the thread cited above, note also the possibility that categorical variables are held in string form.
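The 0/1 counter-example can be checked directly. What follows is a minimal sketch, not code from the thread: the variable x is made up for illustration and holds five 0s and five 1s. -confirm integer number- rejects the mean of 0.5, so a return code of 7 at that step says nothing about whether the values themselves are integers.

```stata
clear
set obs 10
* hypothetical variable: alternating 1s and 0s, so 50% of each
generate byte x = mod(_n, 2)
summarize x, meanonly
display r(mean)
* the mean is not an integer even though every value is
capture confirm integer number `r(mean)'
display _rc
```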
Nick

On Thu, Mar 22, 2012 at 9:02 AM, Jakob Petersen <jpeterb@essex.ac.uk> wrote:

Looking for a way to distinguish categorical from other types of variables. The following code is based on the assumption that the mean of a variable of all integers would be an integer, but _rc seems to take the value 7 here regardless - any ideas? Many thanks in advance.

** finding categorical variables ******
sysuse auto,clear
local catvarlist
foreach v of var * {
    qui su `v',mean
    cap confirm integer number `r(mean)'
    if _rc==0 {
        local catvarlist `catvarlist' `v'
    }
}
di "`catvarlist'"
***************************************
*adapted from: http://www.stata-journal.com/article.html?article=dm0048

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
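One way to repair the loop in the original question without -findname- is to test the values rather than the mean. This is only a sketch under the same rounds-to-itself criterion. The -capture assert- approach is an assumption of this example, not something posted in the thread, and string variables are simply skipped.

```stata
sysuse auto, clear
local catvarlist
foreach v of varlist * {
    * skip string variables such as make
    capture confirm numeric variable `v'
    if _rc continue
    * keep the variable only if every value rounds to itself
    capture assert `v' == int(`v')
    if _rc == 0 local catvarlist `catvarlist' `v'
}
display "`catvarlist'"
```

On the auto data this should pick out the same variables as the -findname- call. Missing values pass the test, because int(.) is itself missing, so a variable such as rep78 is not excluded for having gaps.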

--
Kind regards,
Jakob

