



Re: st: find categorical variables


From   Jakob Petersen <jpeterb@essex.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: find categorical variables
Date   Thu, 22 Mar 2012 09:53:54 +0000

Thanks a lot. As for what counts as a categorical variable: in a large dataset, I would expect far fewer categories than observations (so price, weight and length would not qualify), although it would be difficult to set a general threshold. Jakob

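** finding categorical variables ************
* -findname- and -distinct- are user-written (SSC)
sysuse auto, clear
* keep the variables whose values all equal their integer part
findname, all(@ == int(@)) local(catvarlist)
* count the distinct values of each integer-valued variable
distinct `catvarlist'
***************************************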

              |        Observations
              |      total   distinct
--------------+----------------------
        price |         74         74
          mpg |         74         21
        rep78 |         69          5
        trunk |         74         18
       weight |         74         64
       length |         74         47
         turn |         74         18
 displacement |         74         31
      foreign |         74          2
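
One way to act on the "fewer categories than observations" idea is to keep only the integer-valued variables whose number of distinct values falls at or below some cutoff. The following is a minimal sketch in base Stata, not something proposed in the thread: the cutoff of 10 and the macro names `intvars` and `maxlevels` are assumptions chosen for illustration, and the variable list is simply copied from the -distinct- output above.

```stata
sysuse auto, clear
* integer-valued variables, as listed in the -distinct- output above
local intvars price mpg rep78 trunk weight length turn displacement foreign
* hypothetical cutoff: at most this many distinct values counts as "categorical"
local maxlevels 10
local catvarlist
foreach v of local intvars {
    * -tabulate- leaves the number of distinct nonmissing values in r(r)
    quietly tabulate `v'
    if r(r) <= `maxlevels' local catvarlist `catvarlist' `v'
}
display "`catvarlist'"
```

Going by the counts in the table, a cutoff of 10 would retain only rep78 and foreign. A different cutoff would give a different list, which is the difficulty with setting a general threshold.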



On 22/03/2012 09:27, Nick Cox wrote:
The assumption is fallacious. As one of many counter-examples,
consider a variable that is 50% 0s and 50% 1s. The mean would be 0.5.
The same example shows that the median of integers need not be integer
either; while the mode of integers must be integer, the mode is not
always well defined, at least with the usual naive definitions of
mode. In any case the converse does not follow for any of these
measures, i.e. an integer-valued summary implies very little of
consequence.
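
The counter-example can be checked directly. This is a small sketch using a made-up variable `x` that is half 0s and half 1s; it also suggests why _rc was 7 for every variable in the original loop, since a mean is rarely an integer even when every value is.

```stata
clear
set obs 4
* x takes the values 1, 0, 1, 0
generate byte x = mod(_n, 2)
summarize x, meanonly
* the mean of an all-integer variable need not be an integer
display r(mean)
* -confirm integer number- rejects .5, so _rc is 7
capture confirm integer number `r(mean)'
display _rc
```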

The problem is related to that in a concurrent thread, started by Bert Jung in

http://www.stata.com/statalist/archive/2012-03/msg00731.html

and with no closure yet, on distinguishing binary from continuous
variables. The discussion in that thread is broken, so use e.g.

http://www.stata.com/statalist/archive/2012-03/

as a catalogue of contributions.

Stata clearly has no precise notion of categorical variables; it has a
precise notion of factor variables, but nothing makes the use of
factor variables compulsory for what any researcher might call
categorical variables. Categoricity, like beauty, is in the eye of the
beholder.

So, you need a definition, and whatever definition you use will not be
fail-safe, but it is easy enough to find all variables that are
integer-valued. With -findname- (SSC, SJ) you can do this

findname, all(@ == int(@))

(-floor()- or -ceil()- would work fine instead of -int()-). The
criterion for an integer used here is that it rounds to itself.
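
For readers without -findname- installed, the same criterion can be sketched in base Stata with -ds- and -assert-. This is an illustrative equivalent under that assumption, not the -findname- implementation; the macro name `intvars` is made up here.

```stata
sysuse auto, clear
local intvars
* loop over numeric variables only; string variables cannot pass int()
ds, has(type numeric)
foreach v of varlist `r(varlist)' {
    * -assert- fails (nonzero _rc) if any value differs from its integer part
    capture assert `v' == int(`v')
    if _rc == 0 local intvars `intvars' `v'
}
display "`intvars'"
```

Missing values pass the check because int(.) is itself missing, which is why a variable such as rep78, with some missing observations, still counts as integer-valued.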

I also posted yesterday a signal that -distinct- (SSC) has just been
updated in a way that makes it more useful in this territory.

http://www.stata.com/statalist/archive/2012-03/msg00893.html

-distinct- lets you find the number of distinct values. I know that
many users work with categorical variables with many, many categories,
so that need not be part of your criterion.

As in the thread cited above, note also the possibility that
categorical variables are held in string form.
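
String variables can be screened in a similar way by counting their distinct values. A minimal sketch in base Stata, assuming -tabulate- is acceptable for the counting (it has a limit on the number of rows, so variables with very many levels may need -distinct- instead):

```stata
sysuse auto, clear
* list string variables and report how many distinct values each holds
ds, has(type string)
foreach v of varlist `r(varlist)' {
    quietly tabulate `v'
    display "`v': " r(r) " distinct values"
}
```

Whether a string variable with that many distinct values is "categorical" remains a matter of definition, as noted above.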

Nick

On Thu, Mar 22, 2012 at 9:02 AM, Jakob Petersen <jpeterb@essex.ac.uk> wrote:

I am looking for a way to distinguish categorical from other types of variables.
The following code is based on the assumption that the mean of an all-integer
variable would itself be an integer, but _rc seems to take the value 7 here
regardless. Any ideas? Many thanks in advance.

** finding categorical variables ******
sysuse auto,clear
local catvarlist
foreach v of var * {
    qui su `v', mean
    cap confirm integer number `r(mean)'
    if _rc==0 {
        local catvarlist `catvarlist' `v'
    }
}
di "`catvarlist'"
***************************************
*adapted from: http://www.stata-journal.com/article.html?article=dm0048

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



--

Kind regards,
Jakob


/*
Jakob Petersen
Institute for Social and Economic Research (ISER)
University of Essex, Colchester, Essex CO4 3SQ, UK
T: +44 (01206) 873683
E: jpeterb@essex.ac.uk
www.iser.essex.ac.uk

Understanding Society User Support / BHPS User Support
http://data.understandingsociety.org.uk/documentation/support

From 2012 we have moved to an on-line system for user support. This will help us deal more effectively with user requests and also make the most of users' experience to build a knowledge database for the UKHLS and BHPS data sets. We are always interested in receiving feedback, e.g. if there is anything we could do to improve access to, or the content of, the data documentation/user guide.
*/


