Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Robert Picard <picard@netbox.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: merge by range of values using mata |

Date |
Wed, 7 Mar 2012 15:09:24 -0500 |

If the ranges in B do not overlap, then here is a simple and very fast way to doing this. Robert *----------- begin example ------------- version 12 set seed 1234 clear * Create observations with X between 0 and 100 set obs 20 gen idX = _n gen X = runiform() * 100 list tempfile A save "`A'" * Create non-overlapping ranges for X clear set obs 5 gen idB = _n gen start_X = (_n-1) * 15 gen end_X = start_X + 10 gen Y = runiform() * 10 list * Expand each range and define values for X expand 2 sort idB by idB: gen X = start_X if _n == 1 by idB: replace X = end_X if _n == 2 list, sepby(idB) * Create a trick id (see below) by idB: gen idB2 = cond(_n==1,idB,-idB) * Append observations from A and order by X append using "`A'" sort X gen idrange = sum(idB2) list, noobs sepby(idrange) * Target obs from A that fall within an idrange keep if idrange != 0 sort idrange Y by idrange: replace Y = Y[1] list, noobs sepby(idrange) drop if mi(idX) list, noobs sepby(idrange) *------------ end example -------------- On Wed, Mar 7, 2012 at 12:42 PM, Luke Miner <lminer@gmail.com> wrote: > Like the thread posted below, I'm trying to merge datasets by ranges. > > > In dataset A each observation has a unique value of X > > > In dataset B, each observation contains a range of X values, > classified by start_X and end_X. The remaining variable, Y, in dataset > B contain values for all Xs that fall within this range. > > > I want to merge the Y values using the X ranges. > > > I saw the approach suggested below on the list. However, I have > millions of observations and it is prohibitively slow. I've been > trying to convert this into mata with no luck. Any suggestions on mata > code? > > > > >> On Jun 13, 2011, at 4:49 PM, Jeremy A. Grey wrote: >> >> I am trying to find a way to merge data sets according to a range >> of values, sort of a combination of m:1 merge and inrange(). >> >> In one data set, each observation represents a subject with >> the individual's value for variable X. >> >> In another data set, each observation represents a range of values for >> variable X. The start and end values of the range are separate variables, >> such as start_X and end_X. The remaining variables contain the values of Y >> and Z for all values of X within that range. >> >> Is there a way to merge the Y and Z data from the second data set into the >> first by comparing the value of X to the range specified by start_X and >> end_X? >> >> I thought of transforming the second data set in order to create >> new variables, such as start_X_1, end_X_1, Y_1, Z_1, start_X_2, >> end_X_2, Y_2, Z_2, etc., adding those data to each observation in the >> first dataset, and using a loop and inrange() in order to compute Y and >> Z for each subject, but there are about 3,000,000 different ranges of X in >> the second data set, so this is impractical. >> >> There are (at least) two ways to approach this. The first is only viable >> for small datasets, though it is worth knowing about. I'll take for granted >> that the intervals in your second dataset are non- overlapping; you should >> verify this, and if they are not, then you'll need to decide how to handle >> this. Also, I am ignoring the issue of numerical precision on the boundaries >> of your intervals; if the boundaries of your intervals are non-integer >> values, then you'll need to consider this issue as well. >> >> Here is the first approach: >> >> >> use dataset1 >> cross using dataset2 >> keep if inrange(x,start_x,end_x) >> >> >> Note that any records in the first dataset that do not have >> a corresponding interval in the second will be excluded from the result. A >> lower-memory variant of this is to work initially with only the variables x, >> start_x and end_x; once you've created your mapping, you can then merge your >> datasets in two steps (i.e., merge the mapping onto the first dataset, and >> then merge the result onto the second). >> >> If your dataset is too large for this approach (as it sounds like it is in >> this case), then an alternative is the following: >> >> use dataset1 >> merge 1:1 _n using dataset2, keepusing(start_x end_x) nogen >> >> gen start = . >> gen end = . >> forv i=1/`c(N)' { >> if mi(start_x[`i']) continue, break >> >> replace start = start_x[`i'] if inrange(x,start_x[`i'],end_x[`i']) >> >> replace end = end_x[`i'] if inrange(x,start_x[`i'],end_x[`i']) >> } > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**st: merge by range of values using mata***From:*Luke Miner <lminer@gmail.com>

- Prev by Date:
**Re: st: RE: Figure out in which do-file an error occurred** - Next by Date:
**st: RE: RE: Use of aweights command in xtivreg2** - Previous by thread:
**st: merge by range of values using mata** - Next by thread:
**st: Figure out in which do-file an error occurred** - Index(es):