Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Nick Cox <n.j.cox@durham.ac.uk>
To: "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject: RE: st: extract values from kdensity graphic
Date: Fri, 4 May 2012 16:26:15 +0100

Just on the first sentence: I didn't say that or mean it.

I see no relevance to looking at ranked data here. The whole point of Michael's problem is to group on measured values.

The kernel estimates of the densities depend on neighbouring data points too within overlapping windows. I see no puzzle there.

The heart of the problem is: Do grouping methods agree, one of those grouping methods being intuitive, judgment or subjectively chosen from looking at simple graphs?

More positively, I like your emphasis on gaps.

I would want to hear much more about how the data are generated before further suggestions on grouping can be made.

Nick
n.j.cox@durham.ac.uk

Seed, Paul

Dear Nick,

I accept that x is not original data, and it would actually be better to use the rank of the observations, which as it happens is given in sampling_event. I like plotting against rank in this case because it allows me to see whether the 2 methods agree, without being distracted by how they fit to the true data.

When I look at Nick's plot, I see 5 distinct groups, which confirms (to me) that my groups agree well with the data but I do not see why some observations are given higher densities than others, nor how the densities relate to the groups, so I am unsure whether my groups agree with -kdensity-

Suppose I plot a slightly improved version of my earlier graph:

graph twoway (line d sampling_event, lpattern(-) lwidth(thin) ) ///
    (connected d sampling_event if group == 1, msymbol(O) ) ///
    (connected d sampling_event if group == 2, msymbol(D) ) ///
    (connected d sampling_event if group == 3, msymbol(T) ) ///
    (connected d sampling_event if group == 4, msymbol(S) ) ///
    (connected d sampling_event if group == 5, msymbol(X) ///
    legend(order (2 "Group 1" 3 "Group 2" 4 "Group 3" 5 "Group 4" 6 "Group 5" ) ) )

It is clear (to me) that the 5 areas of higher density from -kdensity- do not correspond well to the 5 groups I found.
Group 2 has 2 peaks in it, one spilling over into group 1; group 4 (one isolated observation) has none, and the peaks in group 1 and 5 are right on the edges.

It is, of course, up to Mike to decide what method (if any) suits his real data and his real problem best.

Best Wishes as ever

Paul

------------------------------

Date: Thu, 3 May 2012 17:42:17 +0100
From: Nick Cox <njcoxstata@gmail.com>
Subject: Re: st: extract values from kdensity graphic

-x- is by construction equally spaced and in any case not the original data. I suggest that a fairer graph is

graph twoway (connected d size if group == 1) ///
    (connected d size if group == 2) ///
    (connected d size if group == 3) ///
    (connected d size if group == 4) ///
    (connected d size if group == 5)

which shows that your method based on gaps agrees well with the kernel density default -- in this example.

Nick

On Thu, May 3, 2012 at 5:24 PM, Seed, Paul <paul.seed@kcl.ac.uk> wrote:
> Dear Statalist,
>
> As Nick points out, this is becoming quite a complex problem.
> I actually would not use -kdensity-, as it does
> not capture the essential features of Mike's original data set.
>
> A simpler approach is to look at the differences between successive values,
> and declare a new group whenever the gap is large (for a suitable value
> of "large"). This can be quite easily done in version 8.
>
>
> ***** Begin example **********
>
> * Enter Mike's data set
> set more off
> clear
> input sampling_event size
> 1 94.74
> 2 94.89
> 3 94.95
> 4 94.97
> 5 95
> 6 95.05
> 7 95.08
> 8 96.11
> 9 96.22
> 10 96.24
> 11 96.27
> 12 96.27
> 13 96.27
> 14 96.32
> 15 96.34
> 16 97.19
> 17 97.26
> 18 97.26
> 19 97.32
> 20 97.34
> 21 97.39
> 22 98.41
> 23 100.62
> 24 100.69
> 25 100.69
> 26 100.76
> 27 100.76
> 28 100.76
> 29 100.84
> 30 100.91
> end
> list
> twoway (scatter size sampling_event)
>
> * Identify groups
> sort size
> gen step = size - size[_n-1]
>
> * Use -stem- to quickly assess the step sizes
> stem step
> * In the example, steps are all <=0.1 or >= 0.85
> * I declare a new group for any step > 0.5
> * I could change this depending on the data set
>
> gen group = step > 0.5
> replace group = sum(group)
>
> * Check groups are well defined
> bys group : su size
>
> * Graph the various groups in different colours
> graph twoway (connected size sampling_event if group == 1) ///
>     (connected size sampling_event if group == 2) ///
>     (connected size sampling_event if group == 3) ///
>     (connected size sampling_event if group == 4) ///
>     (connected size sampling_event if group == 5)
> * That looks good
>
> * Now try out -kdensity-; pick up the plotted values in x and d
> kdensity size , w(0.1) n(30) gen(x d)
>
> graph twoway (connected d x if group == 1) ///
>     (connected d x if group == 2) ///
>     (connected d x if group == 3) ///
>     (connected d x if group == 4) ///
>     (connected d x if group == 5)
> * kdensity just does not seem to capture the groups I see in the simple scatter plot.
>
>
> ********** End example **************

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
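
A note on the grouping step in the example above: -gen group = step > 0.5- starts group 1 at the first observation only because Stata treats the missing value of step there as larger than any number. The sketch below makes that explicit and, in the spirit of Nick's point that -x- is an equally spaced grid rather than the data, evaluates the density at the observed sizes instead. It assumes the example data are already in memory and is run as a do-file. The names threshold, newgroup, gapgroup and d_at_size are illustrative and not from the thread, and using at(size) for the comparison is an assumption, not something either poster proposed.

```stata
* Sketch only: assumes sampling_event and size from the example are in memory
local threshold 0.5
sort size

* Start a new group at the first observation or after a gap above the threshold
generate byte newgroup = (_n == 1) | (size - size[_n-1] > `threshold')
generate gapgroup = sum(newgroup)

* Kernel density evaluated at the observed sizes, not on an equally spaced grid
* (w() is the bandwidth option as used in the thread; newer releases document bwidth())
kdensity size, w(0.1) at(size) generate(d_at_size) nograph

* Range of the estimated density within each gap-based group
tabstat d_at_size, by(gapgroup) statistics(n min max)
```

Changing the threshold local is the only edit needed to try a different definition of a "large" gap.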

**Follow-Ups**: **RE: st: extract values from kdensity graphic** *From:* mcross@exemail.com.au

**References**: **Re: st: extract values from kdensity graphic** *From:* "Seed, Paul" <paul.seed@kcl.ac.uk>
