
Re: st: question related to collapse

From	David Kantor <[email protected]>
To	[email protected]
Subject	Re: st: question related to collapse
Date	Thu, 04 Dec 2008 20:11:25 -0500

At 06:36 PM 12/4/2008, Laura Grigolon wrote:

Dear Statalister,
I have a dataset with several variables, among which a discrete variable X that looks as follows.
-------------------
        X
obs1    60
obs2    60
obs3    60
obs4    70
obs5    71
obs6    71
obs7    71
obs8    71
obs9    71
obs10   71
--------------------
My final purpose is to treat adjacent observations for which the variable X does not change by more than 10% as the same observation. In other words, I would like to collapse the dataset by X, but whenever the distance between two or more adjacent observations in X is less than 10%, I would like to collapse by a median of X. Before collapsing I tried to generate a median of X whenever the difference within X is less than 10%, and then collapse by X, but I am not succeeding. Is this the right approach? Is there a way of collapsing specifying my requirement?
Thank you in advance,
Laura

I don't have a solution, but I'll alert you to some potential problems that I can see. There may be some ambiguity in how your problem is defined. Suppose you have this sequence of values:

60, 65, 70
65 is within 10% of 60; 70 is within 10% of 65; but 70 is not within 10% of 60.

So does this define a cluster of "close" values? Does the 70 get put together with 60 by virtue of being linked through a 65? If so, then the clusters of close values would be, in part, determined by the order of the data. Is that what you have in mind?
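To make the arithmetic concrete, here is a quick check of the three relative differences in the 60, 65, 70 example. It is only a sketch using display; nothing in it depends on your data.

```stata
* relative difference of each value from an earlier one
* 65 vs 60: about .083, below .1
display abs(65/60 - 1)
* 70 vs 65: about .077, below .1
display abs(70/65 - 1)
* 70 vs 60: about .167, not below .1
display abs(70/60 - 1)
```

So each neighbour passes the test, but the two ends of the chain do not.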

Another example:
901, 1000 -- no, 1000 is not within 10% of 901.
1000, 901 -- yes, 901 is within 10% of 1000.

Or generally, if a is within 10% of b, it is not always the case that b is within 10% of a.

Again, the order matters.
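The asymmetry in the 901/1000 example can be checked directly. The last line is an assumption of mine, not something from the original question: dividing by the larger of the two values gives a measure that does not depend on which value comes first.

```stata
* 1000 relative to 901: about .1099, so not within 10%
display abs(1000/901 - 1)
* 901 relative to 1000: about .099, so within 10%
display abs(901/1000 - 1)
* a symmetric alternative (assumption): scale by the larger value
display abs(1000 - 901)/max(1000, 901)
```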

So you need to ask, do you want the order to matter, and do you want to allow "linking" as in the 60,65,70 example? I believe that you do, since you mentioned "adjacent". (And maybe you want to have sorted the values first -- or maybe not, in which case there may be some existing natural order.)

If so, then you can do something like this (untested):
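* flag values within 10% of the preceding observation (the first observation is never flagged)
gen byte w10pct = abs(X/X[_n-1] - 1) < .1 & _n > 1
* running count of unflagged values; long rather than int so large datasets do not overflow
gen long cluster_id = sum(w10pct == 0)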

This way, cluster_id takes a new value every time a value of X occurs that is >= 10% different from the predecessor. You can then take a mean or median or whatever you want -- by cluster_id -- using egen.
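For what it is worth, here is how that might look on the ten values from the question. This is a sketch on toy data only; medX is a name I made up, and the final collapse assumes you want one row per cluster holding the median of X.

```stata
clear
input X
60
60
60
70
71
71
71
71
71
71
end
* flag values within 10% of the preceding observation
gen byte w10pct = abs(X/X[_n-1] - 1) < .1 & _n > 1
* new cluster id each time the flag is 0
gen long cluster_id = sum(w10pct == 0)
* median of X within each cluster (medX is a hypothetical name)
egen medX = median(X), by(cluster_id)
list X cluster_id medX, sepby(cluster_id)
* reduce to one row per cluster
collapse (median) X, by(cluster_id)
list
```

With these values the three 60s form one cluster and the 70 joins the 71s, because 70 is more than 10% above 60 but 71 is within 10% of 70.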

If, on the other hand, you don't want the order of data to matter, then you need to find some other way to group the X values into clusters. (Maybe sort, and then apply the algorithm described above.)
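A small sketch of how much the order can matter, using the values 60, 70, 65 in that order and then sorted. The variable names (near_asis, id_asis, near_sorted, id_sorted) are made up for the illustration.

```stata
clear
input X
60
70
65
end
* clusters in the order given
gen byte near_asis = abs(X/X[_n-1] - 1) < .1 & _n > 1
gen long id_asis = sum(near_asis == 0)
* clusters after sorting on X
sort X
gen byte near_sorted = abs(X/X[_n-1] - 1) < .1 & _n > 1
gen long id_sorted = sum(near_sorted == 0)
list X id_asis id_sorted
```

In the original order, 70 starts a second cluster and 65 joins it. After sorting, 60, 65 and 70 are chained into a single cluster, so sorting removes the dependence on the original order but not the linking.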


HTH
--David

P.S., there is an interesting phenomenon here, with the order and linking effects, particularly if the X are sorted. You seem to want to seek a middle value of a cluster of values. And hopefully, the values will be within 10% of that middle value. On the other hand, the detection of the cluster is based on its leading value (lowest, if data are sorted).

Another possibility is that you would want to avoid linking. In that case, the clusters should be determined whenever a value differs from its predecessor by more than 10%. But then you would test how close subsequent values are to that leading value. It's getting complicated.
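One possible reading of the no-linking idea, sketched on the 60, 65, 70 example: compare each value with the leading value of the current cluster rather than with its immediate predecessor. The variable name leader is made up, and the sketch assumes X has no missing values.

```stata
clear
input X
60
65
70
end
* leader holds the leading value of the current cluster (hypothetical name)
gen double leader = X
* replace works through the observations in order, so leader[_n-1] is already updated
replace leader = leader[_n-1] if _n > 1 & abs(X/leader[_n-1] - 1) < .1
* new cluster id each time the leading value changes
gen long cluster_id = sum(leader != leader[_n-1])
list
```

Here 65 stays with 60, but 70 is compared with 60 rather than with 65, so it starts a cluster of its own.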


Good luck.
--David
