Socrates asks some tough questions, as Plato also said.
(Sorry, couldn't resist.)
-fndmtch2- is mine and on SSC and dates from 2000. Its ugly name
reflects the fact that some Stata users were still working
with platforms limited to 8.3 or filename.ext filenames,
and also that there is a -fndmtch- too.
Socrates found a bug. Thanks for pointing it out. I got rid
of it by rewriting the program from scratch, almost. The
original program used a grotesque backward logic that produces
the right answer for examples given, but falls over
for Socrates' example, which isn't exotic. In retrospect
that bug is shocking, but Stata is much more powerful now than
it was in 2000, and I have more experience, but fewer brain cells.
The net result is still positive. Anyway, in the original
I evidently made a hidden assumption which just isn't true
in general.
I have a -findmatch- now that produces correct answers
in this case, and in previous ones too. I'll send it to Kit
Baum. But more interesting, and probably more useful, is
to talk about a direct attack on Socrates' problem
so that he gets to see how to do it himself.
The "find a match" problem here has this flavour: for
each distinct value of -var1-, how many observations have
that same value in -var2-? The matches can be anywhere in the
dataset, unless you want to slap on -if- or -in- restrictions.
There is going to be a loop over the distinct values
in my solutions. Each time round the loop I am going
to do a -count-, and put the result into a variable
in the right place(s). To do that I need to have a
variable to put it in.

    gen long count = 0

initialises a counter variable. The -long- is paranoid,
just in case the counts get really big. Initialising
it to missing is another good way.
For toy examples, I can use -levelsof- confidently.
In Socrates' case, -var1- and -var2- are both string,
so let's focus on that situation.

    levelsof var1, local(levels)

puts the distinct values into a local macro.

    quietly foreach l of local levels {
        count if `"`l'"' == var2
        replace count = r(N) if var1 == `"`l'"'
    }

That's a first solution. I slapped on compound
double quotes `" "' just in case there are double
quotes lurking in the strings. That's paranoid too,
but does no harm. Just because you're paranoid
doesn't mean the data aren't trying to get you.
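
As a minimal sketch of that first solution in action, here it is run on the three observations quoted at the end of this message. The lower-case variable names -firms-, -inter- and -owner- are my assumption for illustration; substitute whatever names your dataset actually uses.

```stata
clear
* toy data: the three observations from the quoted question
input str1 firms str1 inter str1 owner
c r g
c r t
b t r
end

* counter variable, one count per observation
gen long count = 0

* distinct values of owner go into the local macro levels
levelsof owner, local(levels)

quietly foreach l of local levels {
    * how many observations have this owner value in inter?
    count if `"`l'"' == inter
    * store that count wherever owner takes this value
    replace count = r(N) if owner == `"`l'"'
}

list, noobs
```

On these data -count- comes out as 0, 1 and 2: "g" never occurs in -inter-, "t" occurs once, and "r" occurs twice.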
Now this pivots on both variables being string. Also,
in an industrial-strength solution, you wouldn't want
to rely on all the distinct values fitting into a macro,
so -levelsof- may be set on one side. One thing we
can always do is map the distinct values to successive
integers:

    egen group = group(var1)
    su group, meanonly
    local ngroup = r(max)

-egen, group()- maps the distinct values of -var1- to the
integers 1,...,#groups; and we can retrieve #groups by a
-summarize- and then peeking at the saved results.
Initialise as before:

    gen long count = 0

Another variable will come in useful, holding the
observation numbers:

    gen long obs = _n

    qui forval i = 1/`ngroup' {
        su obs if group == `i', meanonly
        local first = r(min)
        count if var1[`first'] == var2
        replace count = r(N) if group == `i'
    }

The loop uses a look-up technique. When we
are focusing on -group == 1-, for example, how
do we know what value of -var1- we are dealing with?
(By construction, -var1- is constant for each
distinct value of -group- -- we set up a one-to-one
mapping -- but what is that constant?) Notice that
it is not general enough to go

    su var1 if group == `i'

and look at the saved results, because in general
-var1- could be a string (and it is in Socrates'
example). We have to be one step more devious.
We just need to find the observation number for any
observation in a particular group, and then we can
get at the corresponding value of -var1-. That
is where the -obs- variable comes in useful.
There are two saved results that will work, the
minimum or the maximum, and you can choose. (The
mean won't work in general: consider, for example,
a group with just two representatives, in observation
8 and observation 10: the mean at 9 does not
identify a representative.)
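
To see the look-up on its own, here is a small sketch that only reports which string value sits behind each group number. The data are made up for illustration, and the variable name -owner- is an assumption.

```stata
clear
* made-up data: "t" occurs twice, in observations 2 and 4
input str1 owner
g
t
r
t
end

egen group = group(owner)
gen long obs = _n

su group, meanonly
local ngroup = r(max)

forval i = 1/`ngroup' {
    * smallest observation number within this group
    su obs if group == `i', meanonly
    local first = r(min)
    * subscript owner with that observation number to recover the string
    di "group `i': owner = " owner[`first'] ", first seen in observation `first'"
}
```

-egen, group()- numbers the groups in sorted order, so here "g", "r" and "t" become groups 1, 2 and 3, first seen in observations 1, 3 and 2. Using r(max) in place of r(min) would point at observation 4 for "t", which serves just as well.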
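So here is some code for Socrates' example. (I have typed
the variable names in lower case. Stata is case-sensitive,
so use -Owner- and -Inter- if that is how they are spelled
in your data.)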

    egen group = group(owner)
    su group, meanonly
    local ngroup = r(max)
    gen long match = 0
    gen long obs = _n

    qui forval i = 1/`ngroup' {
        su obs if group == `i', meanonly
        local first = r(min)
        count if owner[`first'] == inter
        replace match = r(N) if group == `i'
    }

Nick
n.j.cox@durham.ac.uk
Socrates Mokkas
> I seem to have a problem with the command fndmtch2.
> My data is a huge sample of companies. They have the form of:
>
> Firms Inter Owner match
> c r g 0
> c r t 1
> b t r 1
>
> I want find whether companies that are "Owners" are included
> in the category of "Inter" also. I run the command fndmtch2
> which gives me the variable "match"
> The command I run is:
> fndmtch2 Owner Inter, generate(match3) count miss
>
> What I do not understand is why isn't match=2 for the case of
> the 3rd observation since the element of "r" can be met twice
> (1st and 2nd observation) in the "Inter" variable. Thank you
> very much!