Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Luke Miner <lminer@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
st: merge by range of values using mata |

Date |
Wed, 7 Mar 2012 17:42:37 +0000 |

Like the thread posted below, I'm trying to merge datasets by ranges. In dataset A each observation has a unique value of X In dataset B, each observation contains a range of X values, classified by start_X and end_X. The remaining variable, Y, in dataset B contain values for all Xs that fall within this range. I want to merge the Y values using the X ranges. I saw the approach suggested below on the list. However, I have millions of observations and it is prohibitively slow. I've been trying to convert this into mata with no luck. Any suggestions on mata code? > On Jun 13, 2011, at 4:49 PM, Jeremy A. Grey wrote: > > I am trying to find a way to merge data sets according to a range > of values, sort of a combination of m:1 merge and inrange(). > > In one data set, each observation represents a subject with > the individual's value for variable X. > > In another data set, each observation represents a range of values for > variable X. The start and end values of the range are separate variables, > such as start_X and end_X. The remaining variables contain the values of Y > and Z for all values of X within that range. > > Is there a way to merge the Y and Z data from the second data set into the > first by comparing the value of X to the range specified by start_X and > end_X? > > I thought of transforming the second data set in order to create > new variables, such as start_X_1, end_X_1, Y_1, Z_1, start_X_2, > end_X_2, Y_2, Z_2, etc., adding those data to each observation in the > first dataset, and using a loop and inrange() in order to compute Y and > Z for each subject, but there are about 3,000,000 different ranges of X in > the second data set, so this is impractical. > > There are (at least) two ways to approach this. The first is only viable > for small datasets, though it is worth knowing about. I'll take for granted > that the intervals in your second dataset are non- overlapping; you should > verify this, and if they are not, then you'll need to decide how to handle > this. Also, I am ignoring the issue of numerical precision on the boundaries > of your intervals; if the boundaries of your intervals are non-integer > values, then you'll need to consider this issue as well. > > Here is the first approach: > > > use dataset1 > cross using dataset2 > keep if inrange(x,start_x,end_x) > > > Note that any records in the first dataset that do not have > a corresponding interval in the second will be excluded from the result. A > lower-memory variant of this is to work initially with only the variables x, > start_x and end_x; once you've created your mapping, you can then merge your > datasets in two steps (i.e., merge the mapping onto the first dataset, and > then merge the result onto the second). > > If your dataset is too large for this approach (as it sounds like it is in > this case), then an alternative is the following: > > use dataset1 > merge 1:1 _n using dataset2, keepusing(start_x end_x) nogen > > gen start = . > gen end = . > forv i=1/`c(N)' { > if mi(start_x[`i']) continue, break > > replace start = start_x[`i'] if inrange(x,start_x[`i'],end_x[`i']) > > replace end = end_x[`i'] if inrange(x,start_x[`i'],end_x[`i']) > } * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: merge by range of values using mata***From:*Robert Picard <picard@netbox.com>

- Prev by Date:
**[no subject]** - Next by Date:
**st: Figure out in which do-file an error occurred** - Previous by thread:
**[no subject]** - Next by thread:
**Re: st: merge by range of values using mata** - Index(es):