Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: conditional merging


From   Robert Picard <picard@netbox.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: conditional merging
Date   Wed, 7 Nov 2012 12:28:43 -0800

Here is one way to find the most recent inspection
condition for each home sale.

*----------- begin example -------------
clear
input home_id inspection_year condition
 50121 2002 4
 50121 2006 4
 50121 2011 3
 50681 2004 2
 50681 2010 3
 51040 2006 2
 51040 2010 2
 51040 2011 3
end
tempfile condition
qui save "`condition'"

clear
input home_id sale_year
 50121 2007
 50121 2011
 50681 2008
 51040 2003
 51040 2010
end

append using "`condition'"

gen year = cond(mi(sale_year), inspection_year, sale_year)
sort home_id year inspection_year
gen cond_sale = condition
by home_id: replace cond_sale = condition[_n-1] if mi(cond_sale)
list, noobs sepby(home_id)

keep if !mi(sale_year)
keep home_id sale_year cond_sale
list
*----------- end example -------------

On Wed, Nov 7, 2012 at 11:11 AM, Nick Cox <njcoxstata@gmail.com> wrote:
> I am not planning to implement weights. The point about
> nearest-neighbour as I define it is that unknown points get
> interpolated with the value of the nearest neighbour with a known
> value. I've got to think about ways of handling cases in which two
> neighbours tie for nearest.
>
> On Wed, Nov 7, 2012 at 7:03 PM, Ben Hoen <bhoen@lbl.gov> wrote:
>> I see.  I like the nearest neighbor approach in that one could calculate
>> separately a weight of the "interpolation" such that as one interpolated
>> values "further" (in time) away from the "known" values their weight would
>> decrease.
>>
>> Thanks for those insights.  As always, very interesting & helpful.
>>
>> I will see if anyone comes forward with a merge idea.
>>
>> Best,
>>
>> Ben
>>
>> Ben Hoen
>> LBNL
>> Office: 845-758-1896
>> Cell: 718-812-7589
>>
>>
>> -----Original Message-----
>> From: owner-statalist@hsphsun2.harvard.edu
>> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
>> Sent: Wednesday, November 07, 2012 1:25 PM
>> To: statalist@hsphsun2.harvard.edu
>> Subject: Re: st: conditional merging
>>
>> I will split this into two:
>>
>> 0. Interpolation. Carry-forward is crude but has the advantage that
>> only legitimate values that occur can be carried forward.
>> I decided this morning to write a nearest-neighbour interpolation
>> program, which would have the same characteristic, except that the
>> nearest neighbour could be later as well as before.
>> The program would just be an analogue of -ipolate- and therefore not
>> assume spacing in time, but would assume position in one dimension
>> (not two).
>>
>> 1. Merging. I am not a merge-master. There should be others on this
>> list who merge day in, day out and can give better advice.
>>
>> Nick
>>
>> On Wed, Nov 7, 2012 at 3:37 PM, Ben Hoen <bhoen@lbl.gov> wrote:
>>> Thanks Nick.
>>>
>>> I am not sure there is a standard way that these "condition" values trend
>>> over time across the whole dataset, and therefore interpolating them might
>>> not be appropriate.  Moreover, for each home, there might not be many data
>>> points.  Finally, the values that are allowable for condition are discreet
>>> (non-continuous), and therefore would complicate a linear, cubic, cubic
>>> spline process (though, of course that could be dealt with by using
>> .=int(x)
>>> ).  Would the interpolation allow me to take into account all of these
>>> characteristics?
>>>
>>> For, in part, this reason, I was hoping to find some way to execute a
>>> "conditional merge" (again, my words).  Additionally, the process of
>>> learning how one might do it with this "condition" data, could be applied
>> to
>>> extracting other characteristic data that are also only present
>> sporadically
>>> across time (e.g., size of the home) but that also might periodically
>> change
>>> (e.g., the home might be added to).
>>>
>>> Is there a way to use if/then statements in a merge process?
>> Nick Cox
>>
>>> Carry forward can be as little as one line of code: see
>>>
>>> FAQ     . . . . . . . . . . . . . . . . . . . . . . . Replacing missing
>>> values
>>>         . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J.
>>> Cox
>>>         2/03    How can I replace missing values with previous or
>>>                 following nonmissing values?
>>>
>> http://www.stata.com/support/faqs/data-management/replacing-missing-values/
>>>
>>> I don't see that this is an imputation problem at all. It calls for
>>> interpolation. Indeed, have you considered some kind of interpolation,
>>> say linear, cubic, cubic spline?
>>>
>>> On Tue, Nov 6, 2012 at 7:33 PM, Ben Hoen <bhoen@lbl.gov> wrote:
>>>
>>>> I have two files sales.dta and condition.dta.  sales.dta has two
>> variables
>>>> (home_id saleyear), and condition.dta has three variables (home_id
>>>> inspection_year condition).  The variable inspection_year can take the
>>> vales
>>>> of 2000-2011 for any home but for many homes only some years are present
>>> (in
>>>> many years the home was not inspected. Therefore a sample of the data
>>> might
>>>> look like:
>>>>
>>>> home_id inspection_year condition
>>>> 50121           2002                    4
>>>> 50121           2006                    4
>>>> 50121           2011                    3
>>>> 50681           2004                    2
>>>> 50681           2010                    3
>>>> 51040           2006                    2
>>>> 51040           2010                    2
>>>> 51040           2011                    3
>>>>
>>>> I would like to populate the sales.dta file with the condition of the
>>> parcel
>>>> in the inspection_year that is the closest to, but not following the
>>>> sale_year.
>>>>
>>>> So, for example, the following dataset would result
>>>>
>>>> home_id sale_year       condition
>>>> 50121           2007            4
>>>> 50121           2011            3
>>>> 50681           2008            2
>>>> 51040           2003            .
>>>> 51040           2010            3
>>>>
>>>> I know I am not the first person to have this problem, but could not find
>>>> threads on this.  Maybe I am using the wrong search terms.  Any help
>> would
>>>> be greatly appreciated.
>>>>
>>>> (As I wrote this I realized one not as elegant work-around would be to
>>>> fill-in missing data for each missing year in the condition.dta file,
>>>> potentially using the user-written "carryforward" or even imputing the
>>> data
>>>> using, e.g., mi impute, and then matching home_id sale_year to home_id
>>>> inspection_year.)
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index