Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Joseph Monte <hmjc66@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified |

Date |
Tue, 18 Sep 2012 18:40:23 +0100 |

Dear Austin and Gordon, Thanks for your suggestions. Sorry I was not clear in my email. I have a counter variable called "identifier" which is unique for each observation. For each "identifier" within a certain "region" (I have approx 4000 identifiers and 40 regions), I basically want the average value of "var2" for the five closest values of "var1" based on value. I tried the following code below and it seems to "almost" solve my problem (although it is most likely inefficient since I am creating 16 variables and then using -reshape- to get what I want). My concern is that it does not handle the possibility that "var1" may be the same with a given "region". There are 3 instances when "var1" has the same value twice within a given "region". In this case, I want the average "var2" to be taken of the two observations. I am hence using values from 1 to 8 in the loop below to capture the possibility that all three duplicates may occur for a given observation (I would normally use 1 to 5 if there were no duplicates of "var1"). I am not clear on how to capture the average value of "var2" if "var1" is the same within a given "region" and would appreciate any help. -----------------------begin code ---------------------- forv i=1/8 { bys region (var1): gen diffvar1minus`i'=var1-var1[_n-`i'] bys region (var1): gen diffvar1plus`i'=abs(var1-var1[_n+`i']) bys region (var1): gen var2minus`i'=var2[_n-`i'] bys region (var1): gen var2plus`i'=var2[_n+`i'] } reshape long diffvar1 var2, i(identifier) j(minusplus5, string) drop if diffvar1==. bys identifier (diffvar1): gen id1=_N drop if id1<5 bys identifier (diffvar1): gen id=_n keep if id<=5 drop id id1 bys identifier: egen avgvar2=mean(var2) -------------------------end code---------------------------- Thanks, Joe On Tue, Sep 18, 2012 at 3:47 PM, Austin Nichols <austinnichols@gmail.com> wrote: > Joseph Monte <hmjc66@gmail.com>: > The best way to approach this depends on the data size and structure. > If you have easy data like below, you can -cross- and compute > directly; for a large dataset, you may want to loop over observations > (cf. e.g. http://www.stata.com/statalist/archive/2007-10/msg00346.html). > To loop over observations and sort repeatedly by distance based on one > or more variables, it will behoove you to create a numeric id > corresponding to the obs number at the outset, so you can re-sort when > you are done with each iteration of the loop, which will make it easy > to refer to a specific observation. Something like: > > clear all > input str1 reg v1 v2 > A 3.29515 47 > A 5.39742 38 > A 7.94641 43 > A 11.25495 235 > A 22.35908 61 > A 27.19206 76 > A 41.03306 66 > A 45.56846 89 > A 53.63861 116 > A 73.2925 76 > A 104.3025 63 > A 229.7772 74 > A 634.0973 61 > A 1053.78 80 > A 1163.681 47 > B 2.339128 55 > B 2.378151 46 > B 9.831361 47 > B 15.83442 57 > B 16.48956 42 > B 28.70144 44 > B 56.01777 29 > B 113.9736 103 > B 178.731 47 > B 340.715 103 > C 0.5892565 44 > C 2.016974 37 > C 3.041719 76 > C 4.009228 80 > C 5.856674 51 > C 7.587287 188 > C 8.827202 66 > C 11.53763 48 > C 11.67932 152 > C 11.86612 51 > C 12.95344 84 > C 14.85097 63 > C 17.12918 47 > C 17.74263 67 > C 17.97567 75 > C 20.60005 84 > C 22.13938 44 > C 28.99966 44 > C 31.23538 55 > C 31.52542 36 > end > g long id=_n > g double m=. > forv i=1/`=_N' { > sort id > g d=(v1-v1[`i'])^2 > g noti=_n==`i' > loc mr=reg[`i'] > bys noti reg (d): g f5=(_n<6) if reg=="`mr'"¬i==0 > qui count if f5==1 > if r(N)==5 { > su v2 if f5==1, mean > replace m=r(mean) if id==`i' > } > drop d noti f5 > } > sort id > list, noo > > > On Mon, Sep 17, 2012 at 12:34 PM, Joseph Monte <hmjc66@gmail.com> wrote: >> Dear Statalisters, >> >> The data below shows three variables:- region, var1 and var2. For each >> observation in a given region, I want the 5 closest observations based >> on var1 (not counting the observation in question). I basically need >> the average value of var2 for the 5 observations that are identified. >> I don't have any missing values in my data for all three variables >> below. I can also confirm that I have a few regions with less than 6 >> observations each; hence these regions will be ignored. I am using >> Stata 12. >> >> Thanks, >> >> Joe > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified***From:*Joseph Monte <hmjc66@gmail.com>

**Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified***From:*Austin Nichols <austinnichols@gmail.com>

- Prev by Date:
**Re: st: nl: missing standar error** - Next by Date:
**st: changing bar order on twoway bar chart** - Previous by thread:
**Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified** - Next by thread:
**st: Using Global Variables to Specify directories in dictionaries** - Index(es):