Statalist The Stata Listserver



st: RE: Question about fndmtch2


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Question about fndmtch2
Date   Fri, 3 Nov 2006 11:54:22 -0000

Socrates asks some tough questions, as Plato also said. 
(Sorry, couldn't resist.) 

-fndmtch2- is mine and on SSC and dates from 2000. Its ugly name 
reflects the fact that some Stata users were still working 
with platforms limited to 8.3 or filename.ext filenames, 
and also that there is a -fndmtch- too. 

Socrates found a bug. Thanks for pointing it out. I got rid 
of it by rewriting the program from scratch, almost. The 
original program used a grotesque backward logic that produces 
the right answer for examples given, but falls over 
for Socrates' example, which isn't exotic. In retrospect
that bug is shocking, but Stata is much more powerful now than
it was in 2000, and I have more experience, but fewer brain cells.  
The net result is still positive. Anyway, in the original 
I evidently made a hidden assumption which just isn't true 
in general. 

I have a -findmatch- now that produces correct answers
in this case, and in previous ones too. I'll send it to Kit
Baum. But more interesting, and probably more useful, is 
to talk about a direct attack on Socrates' problem 
so that he gets to see how to do it himself. 

The "find a match" problem here has this flavour: for 
each distinct value of -var1-, how many observations 
have that same value in -var2-? They can be anywhere in the dataset,
unless you want to slap on -if- or -in- restrictions. 

There is going to be a loop over the distinct values
in my solutions. Each time round the loop I am going
to do a -count-, and put the result into a variable
in the right place(s). To do that I need to have a 
variable to put it in. 

gen long count = 0 

initialises a counter variable. The -long- is paranoid, 
just in case the counts get really big. Initialising 
it to missing is another good way. 

For toy examples, I can use -levelsof- confidently. 
In Socrates' case, -var1- and -var2- are both string, 
so let's focus on that situation. 

levelsof var1, local(levels) 

puts the distinct values into a local macro. 

quietly foreach l of local levels { 
	count if `"`l'"' == var2 
	replace count = r(N) if var1 == `"`l'"' 
} 

That's a first solution. I slapped on compound
double quotes `" "' just in case there are double 
quotes lurking in the strings. That's paranoid too, 
but does no harm. Just because you're paranoid
doesn't mean the data aren't trying to get you. 
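
Here is a minimal sketch of that first solution run on made-up data, 
so that the result can be checked by eye. The four observations are 
invented purely for illustration; only the names -var1-, -var2- and 
-count- come from the discussion above. 

```stata
clear
input str1 var1 str1 var2
a b
b b
c a
a d
end

* counter variable, one value per observation
gen long count = 0

* put the distinct values of var1 into a local macro
levelsof var1, local(levels)

quietly foreach l of local levels {
	* number of observations with this value in var2
	count if `"`l'"' == var2
	* store that number wherever var1 takes this value
	replace count = r(N) if var1 == `"`l'"'
}

list var1 var2 count
```

In these data "a" occurs once in -var2-, "b" occurs twice and "c" 
not at all, so -count- should come out as 1, 2, 0, 1. 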

Now this pivots on both variables being string. Also, 
in an industrial-strength solution, you wouldn't want
to rely on all the distinct values fitting into a macro, 
so -levelsof- may be set on one side. One thing we 
can always do is map the distinct values to successive
integers: 

egen group = group(var1) 
su group, meanonly 
local ngroup = r(max) 

-egen, group()- maps the distinct values of -var1- to the 
integers 1,...,#groups; and we can retrieve #groups by a 
-summarize- and then peeking at the saved results. 
Initialise as before: 

gen long count = 0 

Another variable will come in useful, holding the 
observation numbers: 

gen long obs = _n 

qui forval i = 1/`ngroup' { 
	su obs if group == `i', meanonly 
	local first = r(min) 
	count if var1[`first'] == var2 
	replace count = r(N) if group == `i' 
} 

The loop uses a look-up technique. When we 
are focusing on -group == 1-, for example, how do 
we know what value of -var1- we are dealing with? 
(By construction, -var1- is constant for each 
distinct value of -group- -- we set up a one-to-one
mapping -- but what is that constant?) Notice that 
it is not general enough to go 

	su var1 if group == `i' 

and look at the saved results, because in general
-var1- could be a string (and it is in Socrates' 
example). We have to be one step more devious. 
We just need to find the observation number for any 
observation in a particular group, and then we can 
get at the corresponding value of -var1-. That 
is where the -obs- variable comes in useful. 
There are two saved results that will work, the
minimum or the maximum, and you can choose. (The 
mean won't work in general: consider, for example, 
a group with just two representatives, in observation
8 and observation 10: the mean at 9 does not 
identify a representative.) 
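
The look-up on its own can be sketched on a small made-up dataset. 
The five values of -var1- below are invented for illustration, and 
the -display- line is only there to show what the subscript picks up; 
it is not part of the solution itself. 

```stata
clear
input str1 var1
a
b
a
c
b
end

egen group = group(var1)
gen long obs = _n
su group, meanonly
local ngroup = r(max)

forval i = 1/`ngroup' {
	* smallest observation number within this group
	su obs if group == `i', meanonly
	local first = r(min)
	* subscripting var1 with that number gives the group's value
	di "group `i': observation `first', var1 = " var1[`first']
}
```

With these data the groups 1, 2, 3 correspond to "a", "b", "c", 
first seen in observations 1, 2 and 4 respectively. Using r(max) 
instead would pick observations 3, 5 and 4, with the same values 
of -var1-. 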

So here is some code for Socrates' example: 

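* variable names are case-sensitive: Owner and Inter, as in the data 
egen group = group(Owner) 
su group, meanonly 
local ngroup = r(max) 
gen long match = 0 
gen long obs = _n 
qui forval i = 1/`ngroup' { 
	* find one observation that represents this group 
	su obs if group == `i', meanonly 
	local first = r(min) 
	* count the observations in which Inter equals that Owner 
	count if Owner[`first'] == Inter  
	replace match = r(N) if group == `i' 
} 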
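
As a check, here is a sketch that types in the three observations 
from Socrates' listing and runs the same steps. It assumes the 
variable names are spelled as in that listing (Firms, Inter, Owner; 
Stata variable names are case-sensitive) and that no variable 
called -match- exists beforehand. 

```stata
clear
input str1 Firms str1 Inter str1 Owner
c r g
c r t
b t r
end

egen group = group(Owner)
su group, meanonly
local ngroup = r(max)
gen long match = 0
gen long obs = _n
qui forval i = 1/`ngroup' {
	su obs if group == `i', meanonly
	local first = r(min)
	count if Owner[`first'] == Inter
	replace match = r(N) if group == `i'
}

list Owner Inter match
```

The third observation should show -match- equal to 2, because "r" 
occurs twice in Inter; the first two observations should show 0 and 1. 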

Nick 
n.j.cox@durham.ac.uk 

Socrates Mokkas
 
> I seem to have a problem with the command fndmtch2.
> My data is a huge sample of companies. They have the form of:
> 
> Firms	Inter	Owner	match
> c	r	g	0
> c	r	t	1
> b	t	r	1
> 
> I want find whether companies that are "Owners" are included 
> in the category 
> of "Inter" also. I run the command fndmtch2 which gives me 
> the variable "match"
> The command I run is:
> fndmtch2  Owner Inter, generate(match3) count miss
> 
> What I do not understand is why isn't match=2 for the case of 
> the 3rd observation since the element of "r" can be met twice 
> (1st and 2nd observation) in the "Inter" variable. Thank you 
> very much!




