Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.

# Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified

 From Austin Nichols To statalist@hsphsun2.harvard.edu Subject Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified Date Tue, 18 Sep 2012 10:47:50 -0400

Joseph Monte <hmjc66@gmail.com>:
The best way to approach this depends on the data size and structure.
If you have easy data like below, you can -cross- and compute
directly; for a large dataset, you may want to loop over observations
(cf. e.g. http://www.stata.com/statalist/archive/2007-10/msg00346.html).
To loop over observations and sort repeatedly by distance based on one
or more variables, it will behoove you to create a numeric id
corresponding to the obs number at the outset, so you can re-sort when
you are done with each iteration of the loop, which will make it easy
to refer to a specific observation.  Something like:

clear all
input str1 reg v1 v2
A  3.29515    47
A  5.39742    38
A  7.94641    43
A  11.25495   235
A  22.35908   61
A  27.19206   76
A  41.03306   66
A  45.56846   89
A  53.63861   116
A  73.2925    76
A  104.3025   63
A  229.7772   74
A  634.0973   61
A  1053.78    80
A  1163.681   47
B  2.339128   55
B  2.378151   46
B  9.831361   47
B  15.83442   57
B  16.48956   42
B  28.70144   44
B  56.01777   29
B  113.9736   103
B  178.731    47
B  340.715    103
C  0.5892565  44
C  2.016974   37
C  3.041719   76
C  4.009228   80
C  5.856674   51
C  7.587287   188
C  8.827202   66
C  11.53763   48
C  11.67932   152
C  11.86612   51
C  12.95344   84
C  14.85097   63
C  17.12918   47
C  17.74263   67
C  17.97567   75
C  20.60005   84
C  22.13938   44
C  28.99966   44
C  31.23538   55
C  31.52542   36
end
g long id=_n
g double m=.
forv i=1/`=_N' {
sort id
g d=(v1-v1[`i'])^2
g noti=_n==`i'
loc mr=reg[`i']
bys noti reg (d): g f5=(_n<6) if reg=="`mr'"&noti==0
qui count if f5==1
if r(N)==5 {
su v2 if f5==1, mean
replace m=r(mean) if id==`i'
}
drop d noti f5
}
sort id
list, noo

On Mon, Sep 17, 2012 at 12:34 PM, Joseph Monte <hmjc66@gmail.com> wrote:
> Dear Statalisters,
>
> The data below shows three variables:- region, var1 and var2. For each
> observation in a given region, I want the 5 closest observations based
> on var1 (not counting the observation in question). I basically need
> the average value of var2 for the 5 observations that are identified.
> I don't have any missing values in my data for all three variables
> below. I can also confirm that I have a few regions with less than 6
> observations each; hence these regions will be ignored. I am using
> Stata 12.
>
> Thanks,
>
> Joe
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/