# Re: st: question related to collapse

From: David Kantor
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: question related to collapse
Date: Thu, 04 Dec 2008 20:11:25 -0500

At 06:36 PM 12/4/2008, Laura Grigolon wrote:

Dear Statalister,

I have a dataset with several variables, among which a discrete variable X that looks as follows.
```
-------------------
        X
obs1    60
obs2    60
obs3    60
obs4    70
obs5    71
obs6    71
obs7    71
obs8    71
obs9    71
obs10   71
-------------------
```
My final purpose is to treat adjacent observations for which the variable X does not change by more than 10% as the same observation. In other words, I would like to collapse the dataset by X, but whenever the distance between two or more adjacent observations in X is less than 10%, I would like to collapse by a median of X. Before collapsing I tried to generate a median of X whenever the difference within X is less than 10%, and then collapse by X, but I am not succeeding. Is this the right approach? Is there a way of collapsing that meets my requirement?

Laura

I don't have a solution, but I'll alert you to some potential problems that I can see. There may be some ambiguity in how your problem is defined. Suppose you have this sequence of values:
```
60, 65, 70
65 is within 10% of 60; 70 is within 10% of 65; but 70 is not within 10% of 60.
```
So does this define a cluster of "close" values? Does the 70 get put together with 60 by virtue of being linked through a 65? If so, then the clusters of close values would be, in part, determined by the order of the data. Is that what you have in mind?
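For anyone who wants to check the arithmetic, here is a small sketch using nothing but `display`. It assumes "within 10%" means the absolute relative difference, measured against the first value of each pair, is below .1.

```stata
* Each line prints 1 if the second value is within 10% of the first, else 0
* 65 against 60: prints 1
display abs(65/60 - 1) < .1
* 70 against 65: prints 1
display abs(70/65 - 1) < .1
* 70 against 60: prints 0
display abs(70/60 - 1) < .1
```

So 60 and 70 would end up in the same cluster only through the 65 that sits between them.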
Another example:
```
901, 1000 -- no, 1000 is not within 10% of 901.
1000, 901 -- yes, 901 is within 10% of 1000.
```
Or generally, if a is within 10% of b, it is not always the case that b is within 10% of a. Again, the order matters.
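The asymmetry is easy to see numerically. The first two lines below just redo the 901/1000 comparison. The last two show one possible symmetric measure (dividing by the larger of the two values); that variant is my own assumption for illustration, not something the question asks for.

```stata
* Difference measured against the first value of each pair
* 1000 against 901: roughly .11, so not within 10%
display abs(1000/901 - 1)
* 901 against 1000: .099, so within 10%
display abs(901/1000 - 1)

* Assumed symmetric variant: divide by the larger value, so order is irrelevant
display abs(901 - 1000)/max(901, 1000)
display abs(1000 - 901)/max(1000, 901)
```

Whether a symmetric rule is appropriate depends on what "does not change by more than 10%" is meant to capture.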
So you need to ask, do you want the order to matter, and do you want to allow "linking" as in the 60,65,70 example? I believe that you do, since you mentioned "adjacent". (And maybe you want to have sorted the values first -- or maybe not, in which case there may be some existing natural order.)
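If so, then you can do something like this (untested):
```
* 1 if X is within 10% of the preceding observation, 0 otherwise (always 0 for the first row)
gen byte w10pct = abs(X/X[_n-1] -1) < .1 & _n >1
* running count of the 0s: the id goes up by one each time a new cluster starts
gen int cluster_id = sum(w10pct ==0)
```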
This way, cluster_id takes a new value every time a value of X occurs that is >= 10% different from the predecessor. You can then take a mean or median or whatever you want -- by cluster_id -- using egen.
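To make that concrete, here is a rough sketch on the ten values from the question, in the order given. The names `X_med` and `n_obs` are made up for the example, and like the lines above this is untested against the real dataset.

```stata
* Toy data copied from the question
clear
input X
60
60
60
70
71
71
71
71
71
71
end

* Flag rows within 10% of their predecessor, then number the clusters
gen byte w10pct = abs(X/X[_n-1] - 1) < .1 & _n > 1
gen int cluster_id = sum(w10pct == 0)

* Median of X within each cluster (X_med is a hypothetical name)
egen X_med = median(X), by(cluster_id)

* One row per cluster; n_obs (hypothetical) counts the rows that were merged
collapse (first) X_med (count) n_obs = X, by(cluster_id)
list
```

On these rows I would expect two clusters: the three 60s with median 60, and the 70 plus the six 71s with median 71. In a real dataset the other variables would need their own statistics in the `collapse` call.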
If, on the other hand, you don't want the order of data to matter, then you need to find some other way to group the X values into clusters. (Maybe sort, and then apply the algorithm described above.)
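A small sketch of how much the order can matter, using the values 60, 70, 65 entered in that (unsorted) order. The names `orig_order`, `cl_asis` and `cl_sorted` are hypothetical; the rule is the same predecessor comparison as before.

```stata
clear
input X
60
70
65
end

* Remember the original order so it can be restored afterwards
gen long orig_order = _n

* Clusters using the data as entered
gen byte w = abs(X/X[_n-1] - 1) < .1 & _n > 1
gen int cl_asis = sum(w == 0)
drop w

* Clusters after sorting on X
sort X
gen byte w = abs(X/X[_n-1] - 1) < .1 & _n > 1
gen int cl_sorted = sum(w == 0)
drop w

sort orig_order
list
```

As entered, 70 starts a second cluster and 65 joins it; after sorting, all three values link into a single cluster. Sorting removes the dependence on input order, but not the linking effect.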
```
HTH
--David

```
P.S., there is an interesting phenomenon here, with the order and linking effects, particularly if the X are sorted. You seem to want to seek a middle value of a cluster of values. And hopefully, the values will be within 10% of that middle value. On the other hand, the detection of the cluster is based on its leading value (lowest, if data are sorted).
Another possibility is that you would want to avoid linking. In that case, a new cluster would still start whenever a value differs by more than 10%, but instead of comparing each value with its predecessor you would test how close it is to the leading value of the current cluster. It's getting complicated.
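One way this no-linking variant might be sketched (untested on real data, and only one of several possible readings) relies on the fact that `replace` works through the observations in order, so each row can see the leading value already assigned to the row before it. The variable name `leader` is hypothetical, and missing values of X are not handled.

```stata
clear
input X
60
65
70
75
80
end
sort X

* Start with every value as its own leader, then carry the previous leader
* forward whenever the current value is within 10% of it
gen double leader = X
replace leader = leader[_n-1] if _n > 1 & abs(X/leader[_n-1] - 1) < .1

* A new cluster starts wherever the leader changes
gen int cluster_id = sum(leader != leader[_n-1])
list
```

With these five values I would expect three clusters: 60 and 65, then 70 and 75, then 80. The predecessor rule would instead link all five into one cluster, since each step is under 10%.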
```
Good luck.
--David
```