Statalist



st: generating composite categorical variables


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: generating composite categorical variables
Date   Tue, 3 Jul 2007 21:29:38 +0100

In a couple of recent threads a rather poor method 
of creating composite categorical variables has been 
used. I want to discourage this method and suggest
much better ones. 

What do I mean by a composite categorical variable? 
===================================================

If you have two or more categorical variables, 
you may want to create a single variable that 
takes on all the possible joint values. 

The canonical example for Stata users is 

. sysuse auto 

. groups foreign rep78

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

Here I throw in a gratuitous advertisement 
for -groups-, available from SSC. A 
fortuitous detail is that the default separation
for -groups- (which is just a wrapper for -list-)
is one separation line after 5, just right for
this example. 

-foreign- and -rep78- could be used jointly 
to define a composite variable, with values
"Domestic 1", "Domestic 2" and so forth. 

Or if you used the underlying values, not
the value labels, the values could be "0 1", 
"0 2", and so forth. ("Domestic" is the value
label for 0.)  

In this example I jumped
into expressing these joint values as strings. 

You should wonder whether that is inevitable, 
or possible and a good idea, or possible and 
a bad idea. Good question, and we'll get to 
it shortly. 

A bad method
============

The bad method is 

. tostring foreign rep78, generate(Foreign Rep78) 
. gen both = Foreign + Rep78 

Naturally, there are endless small variations
on this. A small but useful improvement is
to insert a space or other punctuation: 

. gen both = Foreign + " " + Rep78 

Calling this bad is a little stark. But it
is not especially good. 

-tostring- is really for correcting mistakes, 
usually attributable to you or to Microsoft or 
to misguided collaborators. 

Some variable that should be string is in fact numeric. 
You need to correct that mistake. -tostring- is 
a safe way of doing that. 

That intended purpose does not stop it being useful for things
for which it was not intended. Only the other day
I opened a big box from StataCorp with my keys because
I had left my Swiss Army knife at home. The keys were
not intended to rip through strong tape, but they worked
fine. In this case, however, the proper tool, the analogue of
the Swiss Army knife, is just as close to hand as -tostring-. 

Beyond that question of style or taste, there are 
two specific disadvantages to this method: 

1. This method needs two lines, and you can do it in one. 
That is a little deal. 

2. This could lose information, especially for variables
with value labels (-tostring- converts the underlying numbers, 
so "Domestic" becomes "0"), or with non-integer values. That is, 
potentially, a big deal. 

#2 may suggest using -decode- instead, but my suggestions differ. 
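
To see disadvantage #2 concretely, here is a minimal sketch 
with the auto data. It only repeats the commands shown above 
(with the separating space) and then looks at the result; the 
variable names Foreign, Rep78 and both are the ones already used. 

```stata
* Sketch only: reproduce the -tostring- approach and inspect the result
sysuse auto, clear
tostring foreign rep78, generate(Foreign Rep78)
gen both = Foreign + " " + Rep78

* foreign still displays its value labels; both holds only the raw codes
list make foreign rep78 both in 1/5
tabulate both
```

The composite holds values such as "0 3" where "Domestic 3" 
was presumably wanted: the value label attached to -foreign- 
never reaches the string. 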

A better method: egen, group()
==============================

My favourite method is this:

. egen both = group(foreign rep78), label 

If you tried it, you would find a new numeric variable,
with integer values from 1 upwards, and value labels defined 
and attached. 

Note particularly the -label- option, which many people
forget (or perhaps never knew about). 

This has at least five advantages, and no disadvantages that 
I can think of. 

1. One line. 

2. No loss of information. Observations that are 
identical on the arguments are identical on the results. 
Value labels are used, not ignored. 

3. The label is useful -- indeed essential -- for tables
and graphs to make sense. 

4. Efficient storage. 

5. Extends readily to three or more variables. 
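
A short sketch of what that looks like with the auto data. 
The second -egen- call, with the -missing- option and the 
name both_m, is an addition for illustration and not part of 
the recipe above: -rep78- is missing in 5 observations, and 
by default -group()- returns missing for any observation that 
is missing on one of its arguments. 

```stata
* Sketch only: composite variable via egen, group() with value labels
sysuse auto, clear
egen both = group(foreign rep78), label

* same variable, shown with and without its value labels
tabulate both
tabulate both, nolabel

* optional: treat missing values as categories of their own
egen both_m = group(foreign rep78), label missing
tabulate both_m, missing
```

Whether missings deserve their own categories depends on the 
analysis; the default is often what you want for tables of 
observed combinations. 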

Another better method: egen, concat()
=====================================

. egen both = concat(foreign rep78), decode p(" ") 

This creates a string variable, so it is less efficient 
for data storage. Compared with -tostring-, the 
advantages are: 

1. One line. 

2. You can mix numeric and string arguments. -concat()-
will figure out what is needed. 

3. You can use the -decode- option to get value labels
shown on the fly. 

4. You can specify punctuation as separator, here 
a single blank. 

5. Extends to three or more variables. 
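
A minimal sketch of points 2 to 5 with the auto data. The 
variable name three and the " | " separator are arbitrary 
choices for illustration. 

```stata
* Sketch only: string composites via egen, concat()
sysuse auto, clear

* two numeric arguments; -decode- shows value labels where they exist
egen both = concat(foreign rep78), decode punct(" ")

* string (make) and numeric (foreign, rep78) arguments mixed freely
egen three = concat(make foreign rep78), decode punct(" | ")

list make both three in 1/5
```

-rep78- has no value labels, so its numeric values are shown 
as they are, while -foreign- appears as "Domestic" or "Foreign". 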

A general comment
=================

On the whole, an integer-valued numeric variable with 
value labels defined and attached is the best arrangement
for any categorical variable. 
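
One reason, sketched here with the auto data: a labelled 
numeric variable can be passed straight to tabulation and 
graph commands, and the labels appear in the output. The 
choice of -mpg- as the variable summarized is arbitrary. 

```stata
* Sketch only: using a labelled composite variable downstream
sysuse auto, clear
egen both = group(foreign rep78), label

* means of mpg by composite category, rows labelled "Domestic 1" etc.
tabulate both, summarize(mpg)

* the same categories label the graph
graph dot (mean) mpg, over(both)
```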

A personal comment
==================

I was the original author of -tostring- and -egen, concat()-, 
but -egen, group()- is still the best solution for this problem. 

Nick 
n.j.cox@durham.ac.uk 



