# Re: st: Merge Problem

 From "Eric G. Wruck" To statalist@hsphsun2.harvard.edu Subject Re: st: Merge Problem Date Thu, 4 Aug 2005 00:58:42 -0400

```I kind of hacked away at one way to do this.  I assumed what you wanted was a larger data set -- one that had, for example,

id1   id2

112   5100
112   5101
112   5102
.     .
.     .
112   5199

If this is correct, one approach would call for use of the expandcl statement.

Here you have a cluster whenever you have a dash in id2.  So

gen cluster = 1 if strpos(id2,"-")

Then you'll need to know how many observations to add/expand.  Here

gen expand = 10^(4 - (strpos(id2,"-") - 1)) if cluster

I also save the observation number as a new variable n

gen n = _n

And now

expandcl expand if cluster == 1, cluster(n) generate(newcl)

But you're still not there since you need to create new id2 variables.  Here you'll want something like this:

gen id3 = substr(id2,1,strpos(id2,"-")-1) * expand if cluster

and then

sort id1 id2 cluster

by id1 id2 cluster: gen ct = sum(cluster) - 1

replace id3 = id3 + ct if cluster == 1

replace id3 = real(id2) if cluster ~= 1

And now your id3 should be what you wanted for id2.

+-------------------+
| id1    id2    id3 |
|-------------------|
1. | 111   4569   4569 |
2. | 111   4574   4574 |
3. | 112   51--   5100 |
4. | 112   51--   5101 |
5. | 112   51--   5102 |
|-------------------|
6. | 112   51--   5103 |
7. | 112   51--   5104 |
8. | 112   51--   5105 |
9. | 112   51--   5106 |
10. | 112   51--   5107 |
|-------------------|
11. | 112   51--   5108 |
12. | 112   51--   5109 |
13. | 112   51--   5110 |
14. | 112   51--   5111 |
15. | 112   51--   5112 |
+-------------------+

Now, if you have an embedded dash, such as 3-75, then the task is somewhat more complicated.

Let me know how you fare.

Eric

>Would someone kindly help me with this problem?
>
>I'm trying to join two datasets. The matching is one to many. Let's call
>identifiers in the two datasets id1 and id2.  I have a concordance of the
>form (all variables are string):
>
>id1	id2
>111	4569
>111	4574
>112	51--
>112	5689
>113	411-
>113	4145
>113	4310
>
>The problem is where I have an entry for id2 like 51--. This means all
>four digits that start with 51 should be matched to the id1 value of 112.
>Likewise, 411- means all four digits starting with 411 correspond to 113.
>
>If I had just one entry like that, then I could go into dataset2 and
>create another id variable for which all id2s starting with 51 or 411 have
>the same value, and come back and make the appropriate change in the
>concordance. But I have many cases like this where many id2s match with
>id1. Would there be an easy way to take care of this? Perhaps some way to
>automate the creation of an alternative id2 in the second dataset?
>
>Thank you very much for your help.
>
>Jason

--

===================================================

Eric G. Wruck
Econalytics
Columbus, OH  43209

ph:      614.231.5034
cell:    614.330.8846
eFax:    614.573.6639
eMail:   ewruck@econalytics.com
website: http://www.econalytics.com

====================================================
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```