Re: st: Requirement(s) for a match merge

From   "louis boakye-yiadom" <[email protected]>
To   [email protected]
Subject   Re: st: Requirement(s) for a match merge
Date   Wed, 30 Mar 2005 17:48:48 +0000

Thanks so much, David. Your detailed comment has been immensely useful.


From: David Kantor <[email protected]>

At 04:07 PM 3/30/2005 +0000, Louis Boakye-Yiadom wrote:
Dear All,
Is it true that "for a match merge to work, the identifier or identifiers must uniquely identify each observation"? I found this statement in sample lecture NC 101 (one of StataCorp's NetCourses), but I thought that this requirement (of the id uniquely identifying each observation) is often desirable, but not necessary in all cases. Any insights will be appreciated. Thank you.

As I understand it, it's not required.

But it's a good idea for the identifier(s) to uniquely identify observations in at least one of the files. In nearly all instances, it is essential to adhere to this rule in order to get meaningful results.

Sometimes it/they uniquely identify observations in both files. That is easy to understand. (And you can use the -unique- option.)

Often, you may have unique identification in one file but not the other. Then the observations in the file of unique identification get spread out over multiple observations in the other. Typically, this is to bring in information about non-key attributes. For example, in a file of persons, you may merge in information about their families. (And you can use the -uniqusing- or -uniqmaster- option, depending on which direction you are going.)

-merge- will still work if neither file is uniquely identified by the identifier(s), but it is rare that you would want to do that; it usually leads to meaningless pairings. So you need to be careful about what you are doing and why. In ten years of merging, I have done it once (to produce something for clerical inspection).

When this situation occurs, the matchings proceed one-to-one in the order that observations appear, until one side or the other runs out of observations. Then "spreading" occurs on the remainder. (This is a generalization of the one-to-many or many-to-one matching described above.) Suppose that the in-memory file has 4 observations with a particular value in the matching identifier, and that the using file has 6 observations with that same value in the matching identifier. Then, for these observations, the first four will be paired in the order received, and the final observation in the in-memory file will also be paired with the other two in the using file.
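
The pairing rule in the paragraph above can be sketched in a few lines. This is a plain-Python simulation, not Stata code; the function name pair_within_key and the labels m1..m4 and u1..u6 are hypothetical, chosen only to reproduce the 4-versus-6 example.

```python
def pair_within_key(master, using):
    """Pair two lists of observations that share one identifier value.

    Observations are paired one-to-one in the order given. When the
    shorter side runs out, its last observation is repeated ("spread")
    over the remaining observations of the longer side.
    """
    if not master or not using:
        return []  # unmatched identifier values are outside this sketch
    n = max(len(master), len(using))
    return [
        (master[min(i, len(master) - 1)], using[min(i, len(using) - 1)])
        for i in range(n)
    ]


# Hypothetical labels: 4 in-memory observations, 6 using observations,
# all sharing the same value of the matching identifier.
master = ["m1", "m2", "m3", "m4"]
using = ["u1", "u2", "u3", "u4", "u5", "u6"]

pairs = pair_within_key(master, using)
for p in pairs:
    print(p)
```

The first four pairs are one-to-one, and m4 then appears again with u5 and u6. With a single observation on one side, the same function reduces to ordinary one-to-many spreading.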

Understand that in this situation, the pairings are probably meaningless; they share a value in the matching identifier, but there is no particular reason that observation a got paired with observation b. Furthermore, unless you impose stable sorting, the resulting pairings are not reproducible.

(A weaker condition for getting "meaningful" pairings is that the identifier be unique in one or the other file for any particular value(s) in the matching identifier(s) -- but not necessarily the same file in every case. While this leads to possibly meaningful pairings, which are also reproducible, it is a contrived situation that wouldn't naturally occur -- as far as I can see.)
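
One way to check this weaker condition before merging is to count, for each identifier value, how often it occurs in each file, and flag the values that are duplicated on both sides. The sketch below is plain Python rather than Stata; the function name ambiguous_keys and the sample identifier lists are hypothetical.

```python
from collections import Counter


def ambiguous_keys(master_keys, using_keys):
    """Return identifier values that are duplicated in BOTH files.

    An empty result means every shared value is unique in at least one
    of the two files, which is the weaker condition described above.
    """
    m = Counter(master_keys)
    u = Counter(using_keys)
    return sorted(k for k in m.keys() & u.keys() if m[k] > 1 and u[k] > 1)


# Value 1 is duplicated only in the first file; values 2 and 3 only in
# the second. The side holding the unique observation differs by value,
# yet no value is duplicated on both sides.
ok = ambiguous_keys([1, 1, 2, 3], [1, 2, 2, 3, 3])

# Here value 1 is duplicated in both files, so its pairings would be
# arbitrary.
bad = ambiguous_keys([1, 1, 2], [1, 1, 2])

print(ok, bad)
```

An empty list for the first call and a list containing only the value 1 for the second separate the contrived-but-meaningful case from the one where pairings depend on observation order.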

I hope this has been useful.
-- David

David Kantor
Institute for Policy Studies
Johns Hopkins University
[email protected]
