Re: st: Merge by range of values

From	Phil Schumm <[email protected]>
To	[email protected]
Subject	Re: st: Merge by range of values
Date	Mon, 13 Jun 2011 19:57:26 -0500

On Jun 13, 2011, at 4:49 PM, Jeremy A. Grey wrote:

I am trying to find a way to merge data sets according to a range of values, sort of a combination of m:1 merge and inrange().
In one data set, each observation represents a subject with the individual's value for variable X.
In another data set, each observation represents a range of values for variable X. The start and end values of the range are separate variables, such as start_X and end_X. The remaining variables contain the values of Y and Z for all values of X within that range.
Is there a way to merge the Y and Z data from the second data set into the first by comparing the value of X to the range specified by start_X and end_X?
I thought of transforming the second data set in order to create new variables, such as start_X_1, end_X_1, Y_1, Z_1, start_X_2, end_X_2, Y_2, Z_2, etc., adding those data to each observation in the first dataset, and using a loop and inrange() in order to compute Y and Z for each subject, but there are about 3,000,000 different ranges of X in the second data set, so this is impractical.

There are (at least) two ways to approach this. The first is only viable for small datasets, though it is worth knowing about. I'll take for granted that the intervals in your second dataset are non-overlapping; you should verify this, and if they are not, then you'll need to decide how to handle this. Also, I am ignoring the issue of numerical precision on the boundaries of your intervals; if the boundaries of your intervals are non-integer values, then you'll need to consider this issue as well.
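
One way to check the non-overlap assumption is sketched below. It assumes the intervals are closed on both ends (as -inrange()- treats them), that start_x and end_x contain no missing values, and that the file and variable names match those used in the rest of this message:

    use dataset2, clear
    sort start_x end_x
    * every interval should be well-formed
    assert start_x <= end_x
    * every interval should begin strictly after the previous one ends
    assert start_x > end_x[_n-1] if _n > 1

If the second -assert- fails, at least two intervals overlap or share a boundary value, and a value of x could then match more than one interval.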


Here is the first approach:


    use dataset1
    cross using dataset2
    keep if inrange(x,start_x,end_x)

Note that any records in the first dataset that do not have a corresponding interval in the second will be excluded from the result. A lower-memory variant of this is to work initially with only the variables x, start_x and end_x; once you've created your mapping, you can then merge your datasets in two steps (i.e., merge the mapping onto the first dataset, and then merge the result onto the second).
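
A rough sketch of that lower-memory variant follows. The tempfile names (ranges, mapping) are made up for illustration, the code assumes that start_x and end_x together uniquely identify the records in dataset2, and it must be run from a single do-file so that the tempfiles persist between steps:

    * intervals only
    use start_x end_x using dataset2, clear
    tempfile ranges
    save `ranges'

    * distinct values of x only, crossed with the intervals
    use x using dataset1, clear
    duplicates drop
    cross using `ranges'
    keep if inrange(x,start_x,end_x)
    tempfile mapping
    save `mapping'

    * step 1: attach the matching interval to each subject
    use dataset1, clear
    merge m:1 x using `mapping', keep(match) nogen

    * step 2: bring in the remaining variables by interval
    merge m:1 start_x end_x using dataset2, keep(match) nogen

The -keep(match)- options reproduce the behavior of the simple version, in which unmatched records are dropped. Note that -cross- still forms every pairing of a distinct x with an interval before the -keep if-, so this only reduces the width of the intermediate dataset, not its number of rows.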

If your dataset is too large for this approach (as it sounds like it is in this case), then an alternative is the following:



    use dataset1
    merge 1:1 _n using dataset2, keepusing(start_x end_x) nogen

    gen start = .
    gen end = .
    forv i=1/`c(N)' {
        if mi(start_x[`i']) continue, break
        replace start = start_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
        replace end = end_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
    }

    drop start_x end_x
    ren start start_x
    ren end end_x
    merge m:1 start_x end_x using dataset2

Unlike the first approach, this approach will retain all records (including those from the first dataset without a corresponding interval in the second, and those in the second without a matching observation in the first); you may use the -keep()- option on the second merge command to exclude one or both of these if you wish.
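
For example, to keep every subject but discard intervals that no subject falls into, the final merge could be written as shown here. The -drop- line is an addition that assumes x is never missing in dataset1: if dataset2 has more observations than dataset1, the earlier 1:1 merge by _n leaves extra rows with missing x, and this removes them before the final merge.

    drop if mi(x)
    merge m:1 start_x end_x using dataset2, keep(master match)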

Note that this second approach does something you should in general avoid doing; that is, using Stata code to loop manually through the observations in a large dataset. I did this only to illustrate the technique; moving the loop into Mata would speed it up considerably.
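
One possible Mata version is sketched below. It is not a line-for-line translation of the loop above: it sorts both the intervals and the values of x, and then makes a single pass through them, which is only valid if the intervals are non-overlapping. It also assumes that dataset2 is non-empty with no missing values in start_x or end_x, and that start_x and end_x uniquely identify its records. The Mata names (R, S, p, and so on) are arbitrary.

    * load the intervals into Mata, sorted by their lower bound
    use start_x end_x using dataset2, clear
    sort start_x
    mata: R = st_data(., ("start_x", "end_x"))

    use dataset1, clear
    gen double start_x = .
    gen double end_x = .

    mata:
    x = st_data(., "x")
    p = order(x, 1)             // visit x in ascending order
    S = J(rows(x), 2, .)        // matched interval for each record
    j = 1
    for (k = 1; k <= rows(x); k++) {
        i = p[k]
        if (missing(x[i])) break            // missing x sorts last
        // skip intervals that end before this value of x
        while (j < rows(R) & R[j,2] < x[i]) j++
        if (x[i] >= R[j,1] & x[i] <= R[j,2]) S[i, .] = R[j, .]
    }
    st_store(., ("start_x", "end_x"), S)
    end

    merge m:1 start_x end_x using dataset2

Because the interval index j only ever moves forward, the work after sorting is proportional to the number of records plus the number of intervals, rather than to their product.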



-- Phil
