Statalist The Stata Listserver

[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Merging of two datasets

From (William Gould, Stata)
Subject   Re: st: Merging of two datasets
Date   Tue, 08 May 2007 15:39:53 -0500

Tobi Brütsch <> writes, 

> I have to match two datasets from two diffrend sources but about the same
> issue:
>       Dataset1 (+/- 5000 observations): 
>       date            Firm     recommendation  brokerID 
>       15.03.2002      ABB             2        LEHM
>       11.01.2005      ABB             5        HESLB 
>       01.07.2005      ABB             2        JBCOB 
>       06.08.2004      ABB             3        MORGAN 
>       Dataset2 (+/- 2500 observations): 
>       Date             firm                brokerID        recommendation 
>       04aug2003        ABB                 MORGAN               4 
>       13nov2002        ABB                 WAREURO              4 
>       22mar2005        ABB                 JYSKE                4 
>       25jan2005        ABB                 SOGENED              4 
>       27jan2004        ABB                 PARIBEU              3 
> [...]
> [...] i want to analyse if the recommendations of Dataset 2 are:
>     1. the same like these in set 1 (should be for some but I think not for
>        all)
>     2. The recommendations in set 2 are published with a certain lag,
>        perhabs 5-10 days after the same recommendation in set1.

> Now i don't know to merge the sets. [...]

I think the way to proceed is not to -merge- the two datasets, but to 
-append- them.

        . use Dataset1
        . gen byte source = 1 
        . append using Dataset2
        . replace source = 2 if source==.
        . save combined

It appears that the date variable in Dataset2 is a %td value, but I'm 
unsure about the value in Dataset1.  Make sure both are %td dates BEFORE
appending, and make sure that the variables are named the same in each.

Now with the datasets combined, we can start to work on a fuzzy merge.
This is more a process than a procedure.

For instance, let's first get rid of broker/firm combinations that are 
unique to one dataset or the other, or, said in our new terms, that are 
to unique to source==1 or source==2:

        . use combined
        . sort firm brokerID date
        . by firm brokerID: gen unique = (source==source[1])
        . by firm brokerID: replace unique = sum(unique)
        . by firm brokerID: gen todrop = (unique==_N) if _n==_N
        . by firm brokerID: replace todrop = todrop[_N]
        . drop if todrop
        . drop unique

Excuse me for being so verbose in my code; if I worked at it I could 
reduce it to just one or two lines.  I wrote the code verbosely, however, 
because that was how I thought about it as I was writing it and, really, 
it is easier to understand.  Everything above hinges on

        . by firm brokerID: gen unique = (source==source[1])

That variable takes on value 0 or 1; it is 1 if the source is the same as 
the source was in observation 1, and 0 otherwise.  Thus, if the sum(source)
== _N, the source never changed.  I want to get rid of all the observions for
the firm/brokerID combination in that case.

The dataset is probably worth saving at this point.

After this first step, by advice is less specific.  Tobi could do the

        . sort firm brokerID recommendation date source

Then if he listed the data, it would be easy to see how long ago each 
firm/brokerID made the same recommendation.  (By the way, I included source
at the end of the -sort- just in case the same recommendation was made twice
on the same day.  I wanted to put the records in a deterministic order, even
in that case.)

Rather than listing, Tobi coiuld generate a variable recording how long 
ago the same recommendation was made:

        . by firm brokerID recommendation: gen lastrec = date - date[_n-1]

lastrec would require days from last recommendation, but it would not
necessarily be from different sources; the firm/brokerID might have made the
same recommendation in the same source.  Tobi will have to think about what
that means, if it occurs.

We could make sure it was from different sources:

        . by firm brokerID rec: gen n=_n if source==1
        . by firm brokerID rec: replace n = n[_n-1] if n==.
        . by firm brokerID rec: gen lobs1 = n[_n-1] 
        . drop n

If I coded the above right, and if I'm thinking correctly, I just put in 
lobs1 the observation number (within firm brokerID rec) of the last 
observation made by the firm/brokerid at source==1.

We could now do the same for source==2:

        . by firm brokerID rec: gen n= _n if source==1
        . by firm brokerID rec: replace n = n[_n-1] if n==.
        . by firm brokerID rec: gen lobs2 = n[_n-1] 
        . drop n

With lobs1 and lobs2, Tobi could now get the last recommendation made 
by the other source:

        . by firm brokerID rec: gen lastrec = date-date[lobs2] if source==1
        . by firm brokerID rec: replace lastrec = date-date[lobs1] if source==2

Continuing like this, Tobi should be able come up with operation definitions
of what it means to make the same recommendations, and then count them.

-- Bill
*   For searches and help try:

© Copyright 1996–2017 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index