Re: st: Requirement(s) for a match merge
Thanks so much, David. Your detailed comment has been immensely useful.
From: David Kantor <firstname.lastname@example.org>
At 04:07 PM 3/30/2005 +0000, Louis Boakye-Yiadom wrote:
Is it true that "for a match merge to work, the identifier or identifiers
must uniquely identify each observation"? I found this statement in sample
lecture NC 101 (one of StataCorp's NetCourses), but I thought that this
requirement (of the id uniquely identifying each observation) is often
desirable, but not necessary in all cases. Any insights will be
appreciated. Thank you.
As I understand it, it's not required.
But it's a good idea for the identifier(s) to uniquely identify
observations in at least one of the files. In nearly all instances, it is
essential to adhere to this rule in order to get meaningful results.
Sometimes it/they uniquely identify observations in both files. That is
easy to understand. (And you can use the -uniq- option.)
Often, you may have unique identification in one file but not the other.
Then the observations in the file of unique identification get spread out
over multiple observations in the other. Typically, this is to bring in
information about non-key attributes. For example, in a file of persons,
you may merge in information about their families. (And you can use the
-uniqu- or -uniqm- option, depending on which direction you are going.)
-merge- will still work even if neither file is uniquely identified by the
identifier(s), but it is rare that you would want to do that; it usually
leads to meaningless pairings. So you need to be careful about what you
are doing and why. In ten years of merging, I have done it once (to
produce something for clerical inspection).
When this situation occurs, the matchings proceed one-to-one in the order
that observations appear, until one side or the other runs out of
observations. Then "spreading" occurs on the remainder. (This is a
generalization of the one-to-many or many-to-one matching described above.)
Suppose that the in-memory file has 4 observations with a particular
value in the matching identifier, and that the using file has 6
observations with that same value in the matching identifier. Then, for
these observations, the first four will be paired in the order received,
and the final observation in the in-memory file will also be paired with
the other two in the using file.
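To make the 4-versus-6 example concrete, here is a minimal sketch in Python
(not Stata) of the pairing rule as described in this message. The function
name pair_within_key and the labels m1..m4 and u1..u6 are made up for
illustration; the sketch only models observations that share one value of
the identifier, and it is not a description of how -merge- is implemented.

```python
def pair_within_key(master, using):
    """Pair observations sharing one identifier value, in the order received.

    Observations are paired one-to-one until the shorter side runs out;
    the last observation of the shorter side is then spread over the
    remaining observations of the longer side.
    """
    if not master or not using:
        # This sketch covers only matched identifier values.
        return []
    n = max(len(master), len(using))
    pairs = []
    for i in range(n):
        m = master[min(i, len(master) - 1)]  # reuse last master obs if exhausted
        u = using[min(i, len(using) - 1)]    # reuse last using obs if exhausted
        pairs.append((m, u))
    return pairs


# Hypothetical labels: 4 in-memory observations, 6 using observations.
master = ["m1", "m2", "m3", "m4"]
using = ["u1", "u2", "u3", "u4", "u5", "u6"]
pairs = pair_within_key(master, using)
for m, u in pairs:
    print(m, u)
```

The last three lines printed all involve m4 (with u4, u5, and u6), which is
the "spreading" on the remainder. If the order of the observations within
either file changes, so do the pairings.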
Understand that in this situation, the pairings are probably meaningless;
they share a value in the matching identifier, but there is no particular
reason that observation a got paired with observation b. Furthermore,
unless you impose stable sorting, the resulting pairings are not reproducible.
(A weaker condition for getting "meaningful" pairings is that the
identifier be unique in one or the other file for any particular value(s)
in the matching identifier(s) -- but not necessarily the same file in every
case. While this leads to possibly meaningful pairings, which are also
reproducible, it is a contrived situation that wouldn't naturally occur --
as far as I can see.)
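One way to see which situation you are in before merging is to count, for
each value of the identifier, how many observations each file has. The
following is a rough Python sketch under that assumption; classify_keys and
the sample id lists are hypothetical, and the same counts could be obtained
within Stata by other means.

```python
from collections import Counter


def classify_keys(master_ids, using_ids):
    """Label each identifier value present in both files by how it would match."""
    m_counts = Counter(master_ids)
    u_counts = Counter(using_ids)
    labels = {}
    for key in sorted(set(m_counts) & set(u_counts)):
        if m_counts[key] == 1 and u_counts[key] == 1:
            labels[key] = "one-to-one"
        elif m_counts[key] == 1:
            labels[key] = "one-to-many"
        elif u_counts[key] == 1:
            labels[key] = "many-to-one"
        else:
            # Neither file is unique for this value: pairings depend on order.
            labels[key] = "many-to-many"
    return labels


# Made-up ids: neither file is unique overall, yet every value is unique
# in at least one file, so the weaker condition holds.
weak = classify_keys([1, 2, 2], [1, 1, 2])
print(weak)

# Made-up ids where value 3 is repeated in both files.
risky = classify_keys([1, 2, 2, 3, 3], [1, 2, 3, 3, 3])
print(risky)
```

Any value labelled "many-to-many" is one where the pairings are determined
only by the order of the observations.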
I hope this has been useful.
Institute for Policy Studies
Johns Hopkins University