# Re: st: STATA code for identifying similar observations within groupid

From: n j cox
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: STATA code for identifying similar observations within groupid
Date: Sun, 01 Jun 2008 17:51:46 +0100

Please note that you are asked not to send attachments to the list and not to include copies of irrelevant previous mailings.

The same kinds of problems came up a few days ago. For example, see

<http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0805/Author/article-1121.html>

Dummy for inventors on the same patent all from the same country:

bysort pub_nbr (inv_cou) : gen byte same_inv_cou = inv_cou[1] == inv_cou[_N]
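
To see what this does, here is a minimal sketch on a toy dataset. The patent numbers are made up for illustration; only -pub_nbr- and -inv_cou- are names from your data. Within each patent the observations are sorted on -inv_cou-, so the first and last values are equal if and only if all values are equal. Note that the -by- prefix and the command it applies to must be on one line.

```stata
* Toy data: the patent numbers are hypothetical
clear
input str9 pub_nbr str2 inv_cou
"EP0000001" "HU"
"EP0000001" "HU"
"EP0000002" "DE"
"EP0000002" "PL"
"EP0000002" "DE"
end

* Sort on inv_cou within pub_nbr, then compare first and last values
bysort pub_nbr (inv_cou) : gen byte same_inv_cou = inv_cou[1] == inv_cou[_N]

list, sepby(pub_nbr)
```

An empty -inv_cou- would sort first and count as a country of its own, so check for empty strings beforehand.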

Dummy for all Eastern European:

You need a variable in_EE, 1 for in EE and 0 otherwise. Then it's almost the same idea:

bysort pub_nbr (in_EE) : gen byte all_in_EE = in_EE[1] == in_EE[_N] & in_EE[1] == 1

Or alternatively

bysort pub_nbr (in_EE) : gen byte all_in_EE = in_EE[1] == 1 & in_EE[_N] == 1
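
A minimal sketch of the whole step, assuming for illustration that -inlist()- is used to build the indicator. Only "HU" and "PL" are listed here, as a stand-in for your full list of 21 EE codes, and the patent numbers are made up.

```stata
* Toy data: the patent numbers are hypothetical
clear
input str9 pub_nbr str2 inv_cou
"EP0000001" "HU"
"EP0000001" "PL"
"EP0000002" "DE"
"EP0000002" "PL"
end

* Illustrative subset only: extend to the full list of EE codes
gen byte in_EE = inlist(inv_cou, "HU", "PL")

* After sorting on in_EE, the first value is the minimum and the last the maximum
bysort pub_nbr (in_EE) : gen byte all_in_EE = in_EE[1] == 1 & in_EE[_N] == 1

list, sepby(pub_nbr)
```

-inlist()- accepts only a limited number of string arguments (see its help), so a long list of codes needs several calls joined by |.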

At least one country is in OECD:

You need a variable in_OECD, 1 for in OECD and 0 otherwise. Then

bysort pub_nbr (in_OECD) : gen byte any_in_OECD = in_OECD[_N] == 1
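
The sorting trick relies on the indicators having no missing values: numeric missings sort last, so a missing in the last position would make the comparison with 1 false even if some other inventor has 1. As an alternative sketch, not the method above, -egen- with min() and max() ignores missings. The toy values below are made up.

```stata
* Toy data: patent numbers and indicator values are hypothetical
clear
input str9 pub_nbr byte in_EE byte in_OECD
"EP0000001" 1 0
"EP0000001" 1 0
"EP0000002" 0 1
"EP0000002" 1 0
end

* Minimum of a 0/1 indicator is 1 only if all values are 1
egen byte all_in_EE = min(in_EE), by(pub_nbr)

* Maximum of a 0/1 indicator is 1 if any value is 1
egen byte any_in_OECD = max(in_OECD), by(pub_nbr)

list, sepby(pub_nbr)
```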

I am not clear about your fourth definition. It should yield to a similar technique.
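
One possible reading, and it is only a guess, is a dummy for the applicant's country being in the OECD. A minimal sketch under that assumption: the codes "DE", "MX" and "KR" are the OECD examples from your post and stand in for the full list, while "XX" and the patent numbers are made up.

```stata
* Toy data: patent numbers and the code "XX" are hypothetical
clear
input str9 pub_nbr str2 app_cou
"EP0000001" "KR"
"EP0000001" "KR"
"EP0000002" "XX"
end

* Illustrative subset only: extend to the full list of OECD codes
gen byte app_in_OECD = inlist(app_cou, "DE", "MX", "KR")

list, sepby(pub_nbr)
```

If -app_cou- can differ across the observations for one patent, the -bysort- approach used for the inventor dummies applies here too.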

How do you get these extra variables? See the linked FAQs

How do I select a subset of observations using a complicated criterion?
<http://www.stata.com/support/faqs/data/selectid.html>

How do you define group characteristics in your data in order to create subsets?
<http://www.stata.com/support/faqs/data/characteristics.html>

You may find the -merge- method easiest for your set-up.
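
A minimal sketch of the -merge- approach, assuming a lookup dataset with one observation per country code. The lookup contents are an illustrative subset and the patent numbers are made up. The m:1 syntax needs Stata 11 or later; in older versions the older -merge- syntax applies, with both datasets sorted on -inv_cou-.

```stata
* Lookup dataset: one observation per country code (illustrative subset)
clear
input str2 inv_cou byte in_EE
"DE" 0
"HU" 1
"PL" 1
end
tempfile lookup
save `lookup'

* Toy inventor-level data: the patent numbers are hypothetical
clear
input str9 pub_nbr str2 inv_cou
"EP0000001" "HU"
"EP0000001" "PL"
"EP0000002" "DE"
"EP0000002" "PL"
end

* Many inventors map to one lookup row per country code
merge m:1 inv_cou using `lookup', keep(master match) nogenerate

list, sepby(pub_nbr)
```

Country codes absent from the lookup end up with missing -in_EE-, so it is worth checking for those before creating the dummies. An -in_OECD- column can be added to the lookup in the same way.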

Nick
n.j.cox@durham.ac.uk

Chirantan Chatterjee [edited]

I am working on a dataset of European Patents, patents that have at least one inventor (variable -inv_cou-) belonging to an Eastern European country.

There are 21 such EE countries, identified with International Patent Classification codes. Thus for the patent EP1701504, there are 5 inventors: 4 German, identified by "DE" in -inv_cou-, and one Polish, identified by "PL". Apart from EE countries, inventors for a multi-inventor patent also come from OECD countries, again identified by IPC codes: DE for Germany, MX for Mexico, KR for Korea and so on.

The observations are not uniquely identified by the patent identifier, pub_nbr or publication number. Thus for patent EP1701504, EP1701504 is the value under pub_nbr which is stacked one upon another for each of its 5 inventors. There are some other characteristics too for a patent that come in the dataset.

Here is a shortened sketch for the data, for patent EP0000287, identified by pub_nbr, the patent identifier: it has two inventors stacked one upon another.

pub_nbr inv_name inv_city inv_cou inv_total app_city app_cou app_name
EP0000287 Szabó, Sándor Budapest XIHU 5 Budapest HU AUTäIPARI
EP0000287 Vad, László Visegrád HU 5 Budapest HU Ikarus

My objective is to create dummy variables telling me whether the inventors that created the patent are:

a. Located in the same country.

b. Resident in multiple countries, but all of the countries are EE countries. (Have the EE country code set)

c. Resident in multiple countries, and at least one of the countries is an OECD member state. (Have the OECD country code set)

d. When the patent applicant is located in an OECD country, app_cou identifies applicant country like for inventor countries as you will see in the attached sample of the data.

What is the best way to create each of the four dummy variables?
