*The following question and answer is based on an exchange that started on
Statalist.*

Title | Expanding datasets to all possible pairs | |

Author | Nicholas J. Cox, Durham University, UK | |

Date | July 2001; updated February 2003; minor revisions September 2005 |

I have a dataset containing a group variable, an individual identifier variable, and various descriptive variables. I would like to produce a new dataset based on all possible pairs of identifiers within groups; that is, the observations in a pair must be from the same group.

Let us phrase this in terms of a group variable **group**, an individual
identifier variable **id**, and descriptive variables **a** and
**b**. We will assume that **id** is unique, at least within
**group**.

Here is an example with concocted data:

. list+----------------------+ | group id a b | |----------------------| 1. | 1 9 23 98 | 2. | 1 11 62 78 | 3. | 2 8 8 65 | 4. | 2 12 56 39 | 5. | 2 15 72 1 | |----------------------| 6. | 2 25 26 88 | 7. | 2 35 22 66 | 8. | 3 9 48 21 | 9. | 3 15 63 64 | 10. | 3 16 19 37 | |----------------------| 11. | 3 22 88 87 | 12. | 3 27 71 60 | 13. | 3 32 37 64 | 14. | 4 4 38 18 | 15. | 4 20 28 11 | |----------------------| 16. | 5 14 84 65 | +----------------------+

Consider each group *i*, in which there are *n _{i}*
observations, so that each observation can in turn be paired with each
observation within the group, yielding

We need first to expand the dataset, so that each observation is replaced by
*n _{i}* copies of itself. The Stata command for this is

. sort group . by group: gen gfreq = _N

In Stata 7 and later, this can be expressed as

. by group, sort: gen gfreq = _N

after which we can type

. expand gfreq

Now the trick needed is to work on this expanded dataset so that pairs are identified properly. Focus on group 1 in our original data:

group id a b 1. 1 9 23 98 2. 1 11 62 78

In this group, we have **id**s of 9 and 11, so the pairs are 9 and 9, 9
and 11, 11 and 9, and 11 and 11. We need to generate the paired **id**,
which we will call **id2**. The first step is to re-sort the data, as
**expand** just adds extra observations at the end of the dataset.
Because of this, they are no longer sorted by **group**, and we need to
break ties by sorting on **id**:

. sort group id

After expansion and sorting, this group is now

. list if group == 1+------------------------------+ | group id a b gfreq | |------------------------------| 1. | 1 9 23 98 2 | 2. | 1 9 23 98 2 | 3. | 1 11 62 78 2 | 4. | 1 11 62 78 2 | +------------------------------+

Again, in Stata 7 and later, we use

. by group id: gen numid2 = _n

Here we are numbering the copies of each original observation. Focus on the
results of that for **group** 1:

. list if group == 1+---------------------------------------+ | group id a b gfreq numid2 | |---------------------------------------| 1. | 1 9 23 98 2 1 | 2. | 1 9 23 98 2 2 | 3. | 1 11 62 78 2 1 | 4. | 1 11 62 78 2 2 | +---------------------------------------+

For **id2**, one acceptable solution is values of 9, 11, 9, 11; one way
of getting that is by selecting the 2nd, 4th, 2nd, 4th values of **id**
within this group. This method is mentioned because it generalizes easily to
selecting observations subscripted by values of **gfreq * numid2** within
each group.

Let us go through that again. Here's the Stata command we need:

. by group: gen id2 = id[gfreq * numid2]

For **group** 1, **gfreq * numid2** takes on values in turn of 2 * 12,
2 * 24, 2 * 12, and 2 * 24. We could put those values in a new variable, but
that is not necessary because Stata can handle expressions within subscripts.
Because **generate** is being executed under **by group:**, subscripts
are interpreted within each distinct value of **group**. Hence
**id[2]** in this instance is interpreted successively as the second
value of **id**, within each of groups 1 to 5. See
[U] **13.7 Explicit
subscripting** for more explanation.

Having done that, it is easy to see how the same idea can be applied to
each descriptive variable. Let us suppose that **a2** and **b2** are
not in use as variable names. Then we can do the remaining assignments with
the following commands:

. by group: gen a2 = a[gfreq * numid2] . by group: gen b2 = b[gfreq * numid2]

Two detailed points remain. Suppose that we are not interested in pairs for
which **id** and **id2** are identical:

. drop if id == id2

Suppose that we do not want both (*j*, *k*) and
(*k*,*j*):

. drop if id > id2

(or use <).

A note of caution: expansion can result in much larger datasets. You may need to worry about memory. For example, if each group frequency is 10, then each group expands to 45, 90, or 100, depending on the precise definition of pair, and much larger expansion factors are clearly possible.