Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified


From   Joseph Monte <hmjc66@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified
Date   Tue, 18 Sep 2012 18:40:23 +0100

Dear Austin and Gordon,

Thanks for your suggestions.

Sorry I was not clear in my email. I have a counter variable called
"identifier" which is unique for each observation. For each
"identifier" within a certain "region" (I have approx 4000 identifiers
and 40 regions), I basically want the average value of "var2" for the
five closest values of "var1" based on value. I tried the following
code below and it seems to "almost" solve my problem (although it is
most likely inefficient since I am creating 16 variables and then
using -reshape- to get what I want). My concern is that it does not
handle the possibility that "var1" may be the same with a given
"region". There are 3 instances when "var1" has the same value twice
within a given "region". In this case, I want the average "var2" to be
taken of the two observations. I am hence using values from 1 to 8 in
the loop below to capture the possibility that all three duplicates
may occur for a given observation (I would normally use 1 to 5 if
there were no duplicates of "var1"). I am not clear on how to capture
the average value of "var2" if "var1" is the same within a given
"region" and would appreciate any help.

-----------------------begin code ----------------------
forv i=1/8 {
	bys region (var1): gen diffvar1minus`i'=var1-var1[_n-`i']
	bys region (var1): gen diffvar1plus`i'=abs(var1-var1[_n+`i'])
	bys region (var1): gen var2minus`i'=var2[_n-`i']
	bys region (var1): gen var2plus`i'=var2[_n+`i']
	}
reshape long diffvar1 var2, i(identifier) j(minusplus5, string)
drop if diffvar1==.
bys identifier (diffvar1): gen id1=_N
drop if id1<5
bys identifier (diffvar1): gen id=_n	
keep if id<=5
drop id id1
bys identifier: egen avgvar2=mean(var2)
-------------------------end code----------------------------

Thanks,

Joe


On Tue, Sep 18, 2012 at 3:47 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> Joseph Monte <hmjc66@gmail.com>:
> The best way to approach this depends on the data size and structure.
> If you have easy data like below, you can -cross- and compute
> directly; for a large dataset, you may want to loop over observations
> (cf. e.g. http://www.stata.com/statalist/archive/2007-10/msg00346.html).
> To loop over observations and sort repeatedly by distance based on one
> or more variables, it will behoove you to create a numeric id
> corresponding to the obs number at the outset, so you can re-sort when
> you are done with each iteration of the loop, which will make it easy
> to refer to a specific observation.  Something like:
>
> clear all
> input str1 reg v1 v2
> A  3.29515    47
> A  5.39742    38
> A  7.94641    43
> A  11.25495   235
> A  22.35908   61
> A  27.19206   76
> A  41.03306   66
> A  45.56846   89
> A  53.63861   116
> A  73.2925    76
> A  104.3025   63
> A  229.7772   74
> A  634.0973   61
> A  1053.78    80
> A  1163.681   47
> B  2.339128   55
> B  2.378151   46
> B  9.831361   47
> B  15.83442   57
> B  16.48956   42
> B  28.70144   44
> B  56.01777   29
> B  113.9736   103
> B  178.731    47
> B  340.715    103
> C  0.5892565  44
> C  2.016974   37
> C  3.041719   76
> C  4.009228   80
> C  5.856674   51
> C  7.587287   188
> C  8.827202   66
> C  11.53763   48
> C  11.67932   152
> C  11.86612   51
> C  12.95344   84
> C  14.85097   63
> C  17.12918   47
> C  17.74263   67
> C  17.97567   75
> C  20.60005   84
> C  22.13938   44
> C  28.99966   44
> C  31.23538   55
> C  31.52542   36
> end
> g long id=_n
> g double m=.
> forv i=1/`=_N' {
>  sort id
>  g d=(v1-v1[`i'])^2
>  g noti=_n==`i'
>  loc mr=reg[`i']
>  bys noti reg (d): g f5=(_n<6) if reg=="`mr'"&noti==0
>  qui count if f5==1
>  if r(N)==5 {
>   su v2 if f5==1, mean
>   replace m=r(mean) if id==`i'
>   }
>  drop d noti f5
>  }
> sort id
> list, noo
>
>
> On Mon, Sep 17, 2012 at 12:34 PM, Joseph Monte <hmjc66@gmail.com> wrote:
>> Dear Statalisters,
>>
>> The data below shows three variables:- region, var1 and var2. For each
>> observation in a given region, I want the 5 closest observations based
>> on var1 (not counting the observation in question). I basically need
>> the average value of var2 for the 5 observations that are identified.
>> I don't have any missing values in my data for all three variables
>> below. I can also confirm that I have a few regions with less than 6
>> observations each; hence these regions will be ignored. I am using
>> Stata 12.
>>
>> Thanks,
>>
>> Joe
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index