Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: merge by range of values using mata


From   Robert Picard <picard@netbox.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: merge by range of values using mata
Date   Wed, 7 Mar 2012 15:09:24 -0500

If the ranges in B do not overlap, then here is a simple and very fast
way to doing this. Robert

*----------- begin example -------------
version 12

set seed 1234
clear

* Create observations with X between 0 and 100
set obs 20
gen idX = _n
gen X = runiform() * 100
list
tempfile A
save "`A'"

* Create non-overlapping ranges for X
clear
set obs 5
gen idB = _n
gen start_X = (_n-1) * 15
gen end_X = start_X + 10
gen Y = runiform() * 10
list

* Expand each range and define values for X
expand 2
sort idB
by idB: gen X = start_X if _n == 1
by idB: replace X = end_X if _n == 2
list, sepby(idB)

* Create a trick id (see below)
by idB: gen idB2 = cond(_n==1,idB,-idB)

* Append observations from A and order by X
append using "`A'"
sort X
gen idrange = sum(idB2)
list, noobs sepby(idrange)

* Target obs from A that fall within an idrange
keep if idrange != 0
sort idrange Y
by idrange: replace Y = Y[1]
list, noobs sepby(idrange)
drop if mi(idX)
list, noobs sepby(idrange)

*------------ end example --------------


On Wed, Mar 7, 2012 at 12:42 PM, Luke Miner <lminer@gmail.com> wrote:
> Like the thread posted below, I'm trying to merge datasets by ranges.
>
>
> In dataset A each observation has a unique value of X
>
>
> In dataset B, each observation contains a range of X values,
> classified by start_X and end_X. The remaining variable, Y, in dataset
> B contain values for all Xs that fall within this range.
>
>
> I want to merge the Y values using the X ranges.
>
>
> I saw the approach suggested below on the list. However, I have
> millions of observations and it is prohibitively slow. I've been
> trying to convert this into mata with no luck. Any suggestions on mata
> code?
>
>
>
>
>> On Jun 13, 2011, at 4:49 PM, Jeremy A. Grey wrote:
>>
>> I am trying to find a way to merge data sets according to a range
>> of values, sort of a combination of m:1 merge and inrange().
>>
>> In one data set, each observation represents a subject with
>> the individual's value for variable X.
>>
>> In another data set, each observation represents a range of values for
>> variable X. The start and end values of the range are separate variables,
>> such as start_X and end_X. The remaining variables contain the values of Y
>> and Z for all values of X within that range.
>>
>> Is there a way to merge the Y and Z data from the second data set into the
>> first by comparing the value of X to the range specified by start_X and
>> end_X?
>>
>> I thought of transforming the second data set in order to create
>> new variables, such as start_X_1, end_X_1, Y_1, Z_1, start_X_2,
>> end_X_2, Y_2, Z_2, etc., adding those data to each observation in the
>> first dataset, and using a loop and inrange() in order to compute Y and
>> Z for each subject, but there are about 3,000,000 different ranges of X in
>> the second data set, so this is impractical.
>>
>> There are (at least) two ways to approach this. The first is only viable
>> for small datasets, though it is worth knowing about. I'll take for granted
>> that the intervals in your second dataset are non- overlapping; you should
>> verify this, and if they are not, then you'll need to decide how to handle
>> this. Also, I am ignoring the issue of numerical precision on the boundaries
>> of your intervals; if the boundaries of your intervals are non-integer
>> values, then you'll need to consider this issue as well.
>>
>> Here is the first approach:
>>
>>
>>     use dataset1
>>     cross using dataset2
>>     keep if inrange(x,start_x,end_x)
>>
>>
>> Note that any records in the first dataset that do not have
>> a corresponding interval in the second will be excluded from the result. A
>> lower-memory variant of this is to work initially with only the variables x,
>> start_x and end_x; once you've created your mapping, you can then merge your
>> datasets in two steps (i.e., merge the mapping onto the first dataset, and
>> then merge the result onto the second).
>>
>> If your dataset is too large for this approach (as it sounds like it is in
>> this case), then an alternative is the following:
>>
>>     use dataset1
>>     merge 1:1 _n using dataset2, keepusing(start_x end_x) nogen
>>
>>     gen start = .
>>     gen end = .
>>     forv i=1/`c(N)' {
>>         if mi(start_x[`i']) continue, break
>>
>> replace start = start_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
>>
>>         replace end = end_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
>>     }
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index