
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: st: extract values from kdensity graphic


From   Nick Cox <[email protected]>
To   "'[email protected]'" <[email protected]>
Subject   RE: st: extract values from kdensity graphic
Date   Fri, 4 May 2012 16:26:15 +0100

Just on the first sentence: I didn't say that or mean it. I see no relevance to looking at ranked data here. The whole point of Michael's problem is to group on measured values. 

The kernel estimates of the densities depend on neighbouring data points too within overlapping windows. I see no puzzle there. 
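
To illustrate the point about neighbouring data points, here is a minimal sketch that computes a kernel estimate by hand at each observed value and compares it with -kdensity-. It uses an illustrative subset of the sizes posted in this thread, a bandwidth of 0.1, and a Gaussian kernel chosen only because it is easy to write down (the -kdensity- default is Epanechnikov). The variable names fhat and fk are made up for the example.

```stata
* Illustrative subset of the sizes posted in this thread
clear
input double size
94.74
94.89
94.95
96.11
96.22
98.41
end

* Bandwidth (assumption: same value as w(0.1) used in the thread)
local h 0.1

* Manual Gaussian kernel estimate at each observed value:
* fhat(x0) = (1/(n*h)) * sum over i of normalden((x0 - size[i])/h)
gen double fhat = .
forvalues i = 1/`=_N' {
    tempvar k
    quietly gen double `k' = normalden((size[`i'] - size)/`h')
    quietly summarize `k', meanonly
    quietly replace fhat = r(mean)/`h' in `i'
    drop `k'
}

* The same quantity from -kdensity-, evaluated at the observed sizes
kdensity size, kernel(gaussian) bwidth(`h') at(size) generate(fk) nograph
list size fhat fk
```

Every observation within a few bandwidths of a point contributes to the estimate there, so values with close neighbours get higher densities than isolated ones such as 98.41.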

The heart of the problem is: Do grouping methods agree, when one of those methods is intuitive, based on judgment, or chosen subjectively from looking at simple graphs? 

More positively, I like your emphasis on gaps. I would want to hear much more about how the data are generated before further suggestions on grouping can be made. 

Nick 
[email protected] 

Seed, Paul

Dear Nick, 

I accept that x is not original data, and it would 
actually be better to use the rank of the observations, 
which as it happens is given in sampling_event.

I like plotting against rank in this case because it allows me to 
see whether the 2 methods agree, without being distracted by 
how they fit to the true data.

When I look at Nick's plot, I see 5 distinct groups, 
which confirms (to me) that my groups agree well with the data.
But I do not see why some observations are given higher densities than others, 
nor how the densities relate to the groups, 
so I am unsure whether my groups agree with -kdensity-.

Suppose I plot a slightly improved version of my earlier graph:
		 
graph twoway (line d sampling_event, lpattern(-) lwidth(thin)) ///
	(connected d sampling_event if group == 1, msymbol(O)) ///
	(connected d sampling_event if group == 2, msymbol(D)) ///
	(connected d sampling_event if group == 3, msymbol(T)) ///
	(connected d sampling_event if group == 4, msymbol(S)) ///
	(connected d sampling_event if group == 5, msymbol(X)), ///
	legend(order(2 "Group 1" 3 "Group 2" 4 "Group 3" 5 "Group 4" 6 "Group 5"))

It is clear (to me) that the 5 areas of higher density from -kdensity- do not correspond 
well to the 5 groups I found. Group 2 has 2 peaks in it, one spilling over into group 1; 
group 4 (one isolated observation) has none, and the peaks in group 1 and 5 are right 
on the edges.

It is, of course, up to Mike to decide what method (if any) suits his 
real data and his real problem best.

Best Wishes as ever

Paul 

------------------------------

Date: Thu, 3 May 2012 17:42:17 +0100
From: Nick Cox <[email protected]>
Subject: Re: st: extract values from kdensity graphic

-x- is by construction equally spaced and in any case not the original data.

I suggest that a fairer graph is

graph twoway (connected d size if group == 1) ///
         (connected d size if group == 2) ///
         (connected d size if group == 3) ///
         (connected d size if group == 4) ///
         (connected d size if group == 5)

which shows that your method based on gaps agrees well with the kernel
density default -- in this example.
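
Since -x- from gen(x d) is an equally spaced grid, one way to pair each density value with an observed size is the at() option of -kdensity-. The following is only a sketch under stated assumptions, not necessarily how the graph above was produced: it uses an illustrative subset of the posted sizes, the bandwidth 0.1 from the thread, and a made-up variable name d_obs.

```stata
* Illustrative subset of the sizes posted in this thread
clear
input double size
94.74
94.97
95.08
96.11
96.27
96.34
97.19
97.39
end

sort size

* Density evaluated at the observed sizes rather than on an equally spaced grid
kdensity size, bwidth(0.1) at(size) generate(d_obs) nograph

* Gap-based groups as in the quoted example: a new group starts at the
* first observation and after any step larger than 0.5
gen byte newgroup = (_n == 1) | (size - size[_n-1] > 0.5)
gen group = sum(newgroup)

* This subset has 3 groups
twoway (connected d_obs size if group == 1) ///
       (connected d_obs size if group == 2) ///
       (connected d_obs size if group == 3)
```

With at(size), each density value belongs to an actual observation, so plotting it by group is a like-for-like comparison.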

Nick

On Thu, May 3, 2012 at 5:24 PM, Seed, Paul <[email protected]> wrote:
> Dear Statalist,
>
> As Nick points out, this is becoming quite a complex problem.
> I actually would not use -kdensity-, as it does
> not capture the essential features of Mike's original data set.
>
> A simpler approach is to look at the differences between successive values,
> and declare a new group whenever the gap is large (for a suitable value
> of "large").  This can be quite easily done in version 8.
>
>
> ***** Begin example **********
>
> * Enter Mike's data set
> set more off
> clear
> input sampling_event size
> 1 94.74
> 2 94.89
> 3 94.95
> 4 94.97
> 5 95
> 6 95.05
> 7 95.08
> 8 96.11
> 9 96.22
> 10 96.24
> 11 96.27
> 12 96.27
> 13 96.27
> 14 96.32
> 15 96.34
> 16 97.19
> 17 97.26
> 18 97.26
> 19 97.32
> 20 97.34
> 21 97.39
> 22 98.41
> 23 100.62
> 24 100.69
> 25 100.69
> 26 100.76
> 27 100.76
> 28 100.76
> 29 100.84
> 30 100.91
> end
> list
> twoway (scatter size sampling_event)
>
> * Identify groups
> sort size
> gen step = size -size[_n-1]
>
> * Use -stem- to quickly assess the step sizes
> stem step
> * In the example, steps are all <= 0.15 or >= 0.85
> * I declare a new group for any step > 0.5
> * I could change this depending on the data set
>
> gen group = step >0.5
> replace group = sum(group)
>
> * Check groups are well defined
> bys group : su size
>
> * Graph the various groups in different colours
> graph twoway (connected size sampling_event if group == 1) ///
>        (connected size sampling_event if group == 2) ///
>        (connected size sampling_event if group == 3) ///
>        (connected size sampling_event if group == 4) ///
>        (connected size sampling_event if group == 5)
> * That looks good
>
> * Now try out -kdensity-; pick up the plotted values in x and d
> kdensity size , w(0.1) n(30) gen(x d)
>
> graph twoway (connected d x if group == 1) ///
>        (connected d x if group == 2) ///
>        (connected d x if group == 3) ///
>        (connected d x if group == 4) ///
>        (connected d x if group == 5)
> * kdensity just does not seem to capture the groups I see in the simple scatter plot.
>
>
> ********** End example **************
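
A note on why the groups in the example are numbered from 1: step is missing in the first observation, and Stata treats missing as larger than any number, so step > 0.5 is true there. The sketch below, on an illustrative subset of the posted sizes and with made-up variable names, spells that out and checks that an explicit version gives the same flags.

```stata
* Illustrative subset of the sizes posted in this thread
clear
input double size
94.74
95.08
96.11
96.34
97.19
97.39
98.41
100.62
100.91
end

sort size
gen double step = size - size[_n-1]

* step is missing in the first observation; missing counts as larger than
* any number in Stata, so (step > 0.5) is true there
assert missing(step[1])
gen byte flag_implicit = step > 0.5

* The same flag with the first observation handled explicitly
gen byte flag_explicit = (_n == 1) | (step > 0.5 & !missing(step))
assert flag_implicit == flag_explicit

* Running sum of the flags numbers the groups 1, 2, ...
gen group = sum(flag_explicit)
tabulate group
```

The explicit form is longer, but it does not rely on the ordering of missing values and so may be easier to adapt if the threshold rule changes.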


