Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: merge by range of values using mata


From   Luke Miner <lminer@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: merge by range of values using mata
Date   Wed, 7 Mar 2012 17:42:37 +0000

Like the thread posted below, I'm trying to merge datasets by ranges.


In dataset A each observation has a unique value of X


In dataset B, each observation contains a range of X values,
classified by start_X and end_X. The remaining variable, Y, in dataset
B contain values for all Xs that fall within this range.


I want to merge the Y values using the X ranges.


I saw the approach suggested below on the list. However, I have
millions of observations and it is prohibitively slow. I've been
trying to convert this into mata with no luck. Any suggestions on mata
code?




> On Jun 13, 2011, at 4:49 PM, Jeremy A. Grey wrote:
>
> I am trying to find a way to merge data sets according to a range
> of values, sort of a combination of m:1 merge and inrange().
>
> In one data set, each observation represents a subject with
> the individual's value for variable X.
>
> In another data set, each observation represents a range of values for
> variable X. The start and end values of the range are separate variables,
> such as start_X and end_X. The remaining variables contain the values of Y
> and Z for all values of X within that range.
>
> Is there a way to merge the Y and Z data from the second data set into the
> first by comparing the value of X to the range specified by start_X and
> end_X?
>
> I thought of transforming the second data set in order to create
> new variables, such as start_X_1, end_X_1, Y_1, Z_1, start_X_2,
> end_X_2, Y_2, Z_2, etc., adding those data to each observation in the
> first dataset, and using a loop and inrange() in order to compute Y and
> Z for each subject, but there are about 3,000,000 different ranges of X in
> the second data set, so this is impractical.
>
> There are (at least) two ways to approach this. The first is only viable
> for small datasets, though it is worth knowing about. I'll take for granted
> that the intervals in your second dataset are non- overlapping; you should
> verify this, and if they are not, then you'll need to decide how to handle
> this. Also, I am ignoring the issue of numerical precision on the boundaries
> of your intervals; if the boundaries of your intervals are non-integer
> values, then you'll need to consider this issue as well.
>
> Here is the first approach:
>
>
>     use dataset1
>     cross using dataset2
>     keep if inrange(x,start_x,end_x)
>
>
> Note that any records in the first dataset that do not have
> a corresponding interval in the second will be excluded from the result. A
> lower-memory variant of this is to work initially with only the variables x,
> start_x and end_x; once you've created your mapping, you can then merge your
> datasets in two steps (i.e., merge the mapping onto the first dataset, and
> then merge the result onto the second).
>
> If your dataset is too large for this approach (as it sounds like it is in
> this case), then an alternative is the following:
>
>     use dataset1
>     merge 1:1 _n using dataset2, keepusing(start_x end_x) nogen
>
>     gen start = .
>     gen end = .
>     forv i=1/`c(N)' {
>         if mi(start_x[`i']) continue, break
>
> replace start = start_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
>
>         replace end = end_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
>     }

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index