Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Merge by range of values


From   Phil Schumm <pschumm@uchicago.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Merge by range of values
Date   Mon, 13 Jun 2011 19:57:26 -0500

On Jun 13, 2011, at 4:49 PM, Jeremy A. Grey wrote:
I am trying to find a way to merge data sets according to a range of values, sort of a combination of m:1 merge and inrange().

In one data set, each observation represents a subject with the individual's value for variable X.

In another data set, each observation represents a range of values for variable X. The start and end values of the range are separate variables, such as start_X and end_X. The remaining variables contain the values of Y and Z for all values of X within that range.

Is there a way to merge the Y and Z data from the second data set into the first by comparing the value of X to the range specified by start_X and end_X?

I thought of transforming the second data set in order to create new variables, such as start_X_1, end_X_1, Y_1, Z_1, start_X_2, end_X_2, Y_2, Z_2, etc., adding those data to each observation in the first dataset, and using a loop and inrange() in order to compute Y and Z for each subject, but there are about 3,000,000 different ranges of X in the second data set, so this is impractical.


There are (at least) two ways to approach this. The first is only viable for small datasets, though it is worth knowing about. I'll take for granted that the intervals in your second dataset are non- overlapping; you should verify this, and if they are not, then you'll need to decide how to handle this. Also, I am ignoring the issue of numerical precision on the boundaries of your intervals; if the boundaries of your intervals are non-integer values, then you'll need to consider this issue as well.

Here is the first approach:


    use dataset1
    cross using dataset2
    keep if inrange(x,start_x,end_x)


Note that any records in the first dataset that do not have a corresponding interval in the second will be excluded from the result. A lower-memory variant of this is to work initially with only the variables x, start_x and end_x; once you've created your mapping, you can then merge your datasets in two steps (i.e., merge the mapping onto the first dataset, and then merge the result onto the second).

If your dataset is too large for this approach (as it sounds like it is in this case), then an alternative is the following:


    use dataset1
    merge 1:1 _n using dataset2, keepusing(start_x end_x) nogen

    gen start = .
    gen end = .
    forv i=1/`c(N)' {
        if mi(start_x[`i']) continue, break
replace start = start_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
        replace end = end_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
    }

    drop start_x end_x
    ren start start_x
    ren end end_x
    merge m:1 start_x end_x using dataset2


Unlike the first approach, this approach will retain all records (including those from the first dataset without a corresponding interval in the second, and those in the second without a matching observation in the first); you may use the -keep()- option on the second merge command to exclude one or both of these if you wish.

Note that this second approach does something you should in general avoid doing; that is, using Stata code to loop manually through the observations in a large dataset. I did this only to illustrate the technique; moving the loop into Mata would speed it up considerably.


-- Phil

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index