|
The following question and answer is based on an exchange that started on
Statalist.
How do I produce a dataset based on all possible pairs of identifiers
within each group?
|
Title
|
|
Expanding datasets to all possible pairs
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
July 2001; updated February 2003; minor revisions September 2005
|
Question
I have a dataset containing a group variable, an individual identifier
variable, and various descriptive variables. I would like to produce a new
dataset based on all possible pairs of identifiers within groups; that is,
the observations in a pair must be from the same group.
Answer
Let us phrase this in terms of a group variable group, an individual
identifier variable id, and descriptive variables a and
b. We will assume that id is unique, at least within
group.
Here is an example with concocted data:
. list
+----------------------+
| group id a b |
|----------------------|
1. | 1 9 23 98 |
2. | 1 11 62 78 |
3. | 2 8 8 65 |
4. | 2 12 56 39 |
5. | 2 15 72 1 |
|----------------------|
6. | 2 25 26 88 |
7. | 2 35 22 66 |
8. | 3 9 48 21 |
9. | 3 15 63 64 |
10. | 3 16 19 37 |
|----------------------|
11. | 3 22 88 87 |
12. | 3 27 71 60 |
13. | 3 32 37 64 |
14. | 4 4 38 18 |
15. | 4 20 28 11 |
|----------------------|
16. | 5 14 84 65 |
+----------------------+
Consider each group i, in which there are ni
observations, so that each observation can in turn be paired with each
observation within the group, yielding ni *
ni pairs. We will return later to consider what to do if
the ni pairs in which each observation is paired with
itself are not wanted, as that is easy to fix. We will also return to
removing duplicate pairs if observation j with observation k
and observation k with observation j are not both wanted.
We need first to expand the dataset, so that each observation is replaced by
ni copies of itself. The Stata command for this is
expand; see [D]
expand. To obtain the group frequencies ni, we type
. sort group
. by group: gen gfreq = _N
In Stata 7 and later, this can be expressed as
. by group, sort: gen gfreq = _N
after which we can type
. expand gfreq
Now the trick needed is to work on this expanded dataset so that pairs are
identified properly. Focus on group 1 in our original data:
group id a b
1. 1 9 23 98
2. 1 11 62 78
In this group, we have ids of 9 and 11, so the pairs are 9 and 9, 9
and 11, 11 and 9, and 11 and 11. We need to generate the paired id,
which we will call id2. The first step is to re-sort the data, as
expand just adds extra observations at the end of the dataset.
Because of this, they are no longer sorted by group, and we need to
break ties by sorting on id:
. sort group id
After expansion and sorting, this group is now
. list if group == 1
+------------------------------+
| group id a b gfreq |
|------------------------------|
1. | 1 9 23 98 2 |
2. | 1 9 23 98 2 |
3. | 1 11 62 78 2 |
4. | 1 11 62 78 2 |
+------------------------------+
Again, in Stata 7 and later, we use
. by group id: gen numid2 = _n
Here we are numbering the copies of each original observation. Focus on the
results of that for group 1:
. list if group == 1
+---------------------------------------+
| group id a b gfreq numid2 |
|---------------------------------------|
1. | 1 9 23 98 2 1 |
2. | 1 9 23 98 2 2 |
3. | 1 11 62 78 2 1 |
4. | 1 11 62 78 2 2 |
+---------------------------------------+
For id2, one acceptable solution is values of 9, 11, 9, 11; one way
of getting that is by selecting the 2nd, 4th, 2nd, 4th values of id
within this group. This method is mentioned because it generalizes easily to
selecting observations subscripted by values of gfreq * numid2 within
each group.
Let us go through that again. Here's the Stata command we need:
. by group: gen id2 = id[gfreq * numid2]
For group 1, gfreq * numid2 takes on values in turn of 2 * 12,
2 * 24, 2 * 12, and 2 * 24. We could put those values in a new variable, but
that is not necessary because Stata can handle expressions within subscripts.
Because generate is being executed under by group:, subscripts
are interpreted within each distinct value of group. Hence
id[2] in this instance is interpreted successively as the second
value of id, within each of groups 1 to 5. See [U] 13.7 Explicit
subscripting for more explanation.
Having done that, it is easy to see how the same idea can be applied to
each descriptive variable. Let us suppose that a2 and b2 are
not in use as variable names. Then we can do the remaining assignments with
the following commands:
. by group: gen a2 = a[gfreq * numid2]
. by group: gen b2 = b[gfreq * numid2]
Two detailed points remain. Suppose that we are not interested in pairs for
which id and id2 are identical:
. drop if id == id2
Suppose that we do not want both (j, k) and
(k,j):
. drop if id > id2
(or use <).
A note of caution: expansion can result in much larger datasets. You may
need to worry about memory. For example, if each group frequency is 10, then
each group expands to 45, 90, or 100, depending on the precise definition of
pair, and much larger expansion factors are clearly possible.
|