Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Merge Problem


From   "Eric G. Wruck" <ewruck@econalytics.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Merge Problem
Date   Thu, 4 Aug 2005 00:58:42 -0400

I kind of hacked away at one way to do this.  I assumed what you wanted was a larger data set -- one that had, for example,

id1   id2

112   5100
112   5101
112   5102
.     .
.     .
112   5199

If this is correct, one approach would call for use of the expandcl statement.

Here you have a cluster whenever you have a dash in id2.  So

gen cluster = 1 if strpos(id2,"-")

Then you'll need to know how many observations to add/expand.  Here

gen expand = 10^(4 - (strpos(id2,"-") - 1)) if cluster

I also save the observation number as a new variable n

gen n = _n

And now

expandcl expand if cluster == 1, cluster(n) generate(newcl)

But you're still not there since you need to create new id2 variables.  Here you'll want something like this:

gen id3 = substr(id2,1,strpos(id2,"-")-1) * expand if cluster

and then

sort id1 id2 cluster

by id1 id2 cluster: gen ct = sum(cluster) - 1

replace id3 = id3 + ct if cluster == 1

replace id3 = real(id2) if cluster ~= 1


And now your id3 should be what you wanted for id2.

      +-------------------+
     | id1    id2    id3 |
     |-------------------|
  1. | 111   4569   4569 |
  2. | 111   4574   4574 |
  3. | 112   51--   5100 |
  4. | 112   51--   5101 |
  5. | 112   51--   5102 |
     |-------------------|
  6. | 112   51--   5103 |
  7. | 112   51--   5104 |
  8. | 112   51--   5105 |
  9. | 112   51--   5106 |
 10. | 112   51--   5107 |
     |-------------------|
 11. | 112   51--   5108 |
 12. | 112   51--   5109 |
 13. | 112   51--   5110 |
 14. | 112   51--   5111 |
 15. | 112   51--   5112 |
     +-------------------+

Now, if you have an embedded dash, such as 3-75, then the task is somewhat more complicated.

Let me know how you fare.

Eric




>Would someone kindly help me with this problem?
>
>I'm trying to join two datasets. The matching is one to many. Let's call
>identifiers in the two datasets id1 and id2.  I have a concordance of the
>form (all variables are string):
>
>id1	id2
>111	4569
>111	4574
>112	51--
>112	5689
>113	411-
>113	4145
>113	4310
>
>The problem is where I have an entry for id2 like 51--. This means all
>four digits that start with 51 should be matched to the id1 value of 112.
>Likewise, 411- means all four digits starting with 411 correspond to 113.
>
>If I had just one entry like that, then I could go into dataset2 and
>create another id variable for which all id2s starting with 51 or 411 have
>the same value, and come back and make the appropriate change in the
>concordance. But I have many cases like this where many id2s match with
>id1. Would there be an easy way to take care of this? Perhaps some way to
>automate the creation of an alternative id2 in the second dataset?
>
>Thank you very much for your help.
>
>Jason


-- 

===================================================

       Eric G. Wruck
       Econalytics
       2535 Sherwood Road
       Columbus, OH  43209

       ph:      614.231.5034
       cell:    614.330.8846
       eFax:    614.573.6639
       eMail:   ewruck@econalytics.com
       website: http://www.econalytics.com

====================================================
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index