# st: generating composite categorical variables

From: "Nick Cox"; Subject: st: generating composite categorical variables; Date: Tue, 3 Jul 2007 21:29:38 +0100

```
In a couple of recent threads a rather poor method
of creating composite categorical variables has been
used. I want to discourage this method and suggest
much better ones.

What do I mean by a composite categorical variable?
===================================================

If you have two or more categorical variables,
you may want to create a single variable that
takes on all the possible joint values.

The canonical example for Stata users is

. sysuse auto

. groups foreign rep78

+------------------------------------+
|  foreign   rep78   Freq.   Percent |
|------------------------------------|
| Domestic       1       2      2.90 |
| Domestic       2       8     11.59 |
| Domestic       3      27     39.13 |
| Domestic       4       9     13.04 |
| Domestic       5       2      2.90 |
|------------------------------------|
|  Foreign       3       3      4.35 |
|  Foreign       4       9     13.04 |
|  Foreign       5       9     13.04 |
+------------------------------------+

Here I throw in a gratuitous advertisement
for -groups-, available from SSC. A
fortuitous detail is that the default separation
for -groups- (which is just a wrapper for -list-)
is one separation line after 5, just right for
this example.
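
If -groups- is not yet installed, it can be
fetched from SSC like this:

. ssc install groups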

-foreign- and -rep78- could be used jointly
to define a composite variable, with values
"Domestic 1", "Domestic 2" and so forth.

Or if you used the underlying values, not
the value labels, the values could be "0 1",
"0 2", and so forth. ("Domestic" is the value
label for 0.)
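
One way to see both the value labels and the
underlying values, sketched with the same data:

. sysuse auto, clear
. tabulate foreign
. * same table, but showing 0 and 1 rather than the labels
. tabulate foreign, nolabel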

In this example I jumped
into expressing these joint values as strings.

You should wonder whether that is inevitable,
or possible and a good idea, or possible and
a bad idea. Good question, and we'll get to
it shortly.

A bad method
============

The bad method is

. tostring foreign rep78, generate(Foreign Rep78)
. gen both = Foreign + Rep78

Naturally, there are endless small variations
on this. A small but useful improvement is
to insert a space or other punctuation:

. gen both = Foreign + " " + Rep78

Calling this bad is a little stark. But it
is not especially good.

-tostring- is really for correcting mistakes,
usually attributable to you or to Microsoft or
to misguided collaborators.

Some variable that should be string is in fact numeric.
You need to correct that mistake. -tostring- is
a safe way of doing that.

That intended purpose does not stop it being useful for things
for which it was not intended. Only the other day
I opened a big box from StataCorp with my keys because
I had left my Swiss Army knife at home. The keys were
not intended to rip through strong tape, but they worked
fine. In this case, however, the analogue for
the Swiss Army knife is just as near as -tostring-.

Beyond that question of style or taste, there are
two specific disadvantages to this method:

1. This method needs two lines, and you can do it in one.
That is a little deal.

2. This could lose information, especially for variables
with value labels, or with non-integer values. That is,
potentially, a big deal.
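
A minimal sketch of point 2, assuming the auto data
as above (Foreign, Rep78 and both are just the names
used in this example):

. sysuse auto, clear
. tostring foreign rep78, generate(Foreign Rep78)
. gen both = Foreign + " " + Rep78
. * value labels are ignored: expect "0 1", not "Domestic 1"
. tabulate both

Without the separator things can get worse. In general,
values 1 and 12 would paste together as "112", and so
would values 11 and 2.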

#2 may suggest using -decode- instead, but my suggestions differ.

A better method: egen, group()
==============================

My favourite method is this:

. egen both = group(foreign rep78), label

If you tried it, you would find a new numeric variable,
with integer values 1 up, and value labels defined
and attached.

Note particularly the -label- option, which many people
forget (or perhaps never knew about).

This has at least five advantages, and no disadvantages that
I can think of.

1. One line.

2. No loss of information. Observations that are
identical on the arguments are identical on the results.
Value labels are used, not ignored.

3. The label is useful -- indeed essential -- for tables
and graphs to make sense.

4. Efficient storage.

5. Extends readily to three or more variables.
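
A short sketch for checking what was created, again
assuming the auto data. The names both, both2 and three
are illustrative only:

. sysuse auto, clear
. egen both = group(foreign rep78), label
. tabulate both
. * the underlying integers, 1 up
. tabulate both, nolabel
. * -missing- treats missing values as a category of their own
. egen both2 = group(foreign rep78), label missing
. * three variables work the same way
. egen three = group(foreign rep78 headroom), label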

Another better method: egen, concat()
=====================================

. egen both = concat(foreign rep78), decode p(" ")

This creates a string variable, so is less efficient
for data storage. Compared with the -tostring- method,
it has these advantages:

1. One line.

2. You can mix numeric and string arguments. -concat()-
will figure out what is needed.

3. You can use the -decode- option to get value labels
shown on the fly.

4. You can specify punctuation as separator, here
a single blank.

5. Extends to three or more variables.
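
A sketch of points 2 to 4, assuming the auto data. The
names both and mixed are illustrative only:

. sysuse auto, clear
. egen both = concat(foreign rep78), decode p(" ")
. * make is string, foreign is numeric with value labels
. egen mixed = concat(make foreign), decode p(", ")
. list make both mixed in 1/5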

A general comment
=================

On the whole, an integer-valued numeric variable with
value labels defined and attached is the best arrangement
for any categorical variable.
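
As a sketch of that arrangement built by hand: the
variable heavy, its labels and the cutoff of 3000 are
all hypothetical, chosen only for illustration.

. sysuse auto, clear
. * 0/1 indicator, left missing if weight is missing
. gen byte heavy = weight > 3000 if !missing(weight)
. label define heavy 0 "Light" 1 "Heavy"
. label values heavy heavy
. tabulate heavy foreign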

A personal comment
==================

I was the original author of -tostring- and -egen, concat()-,
but -egen, group()- is still the best solution for this problem.

Nick
n.j.cox@durham.ac.uk

```