Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: extract values from kdensity graphic

From: "Seed, Paul"; To: "statalist@hsphsun2.harvard.edu"; Subject: Re: st: extract values from kdensity graphic; Date: Fri, 4 May 2012 15:24:12 +0100

```Dear Nick,

I accept that x is not original data, and it would
actually be better to use the rank of the observations,
which as it happens is given in sampling_event.

I like plotting against rank in this case because it allows me to
see whether the 2 methods agree, without being distracted by
how they fit to the true data.

When I look at Nick's plot, I see 5 distinct groups,
which confirms (to me) that my groups agree well with the data.
But I do not see why some observations are given higher densities than others,
nor how the densities relate to the groups,
so I am unsure whether my groups agree with -kdensity-.

Suppose I plot a slightly improved version of my earlier graph:

graph twoway (line d sampling_event, lpattern(-) lwidth(thin) ) ///
(connected d sampling_event if group == 1, msymbol(O) ) ///
(connected d sampling_event if group == 2, msymbol(D) ) ///
(connected d sampling_event if group == 3, msymbol(T) ) ///
(connected d sampling_event if group == 4, msymbol(S) ) ///
(connected d sampling_event if group == 5, msymbol(X) ), ///
legend(order(2 "Group 1" 3 "Group 2" 4 "Group 3" 5 "Group 4" 6 "Group 5") )

It is clear (to me) that the 5 areas of higher density from -kdensity- do not correspond
well to the 5 groups I found. Group 2 has 2 peaks in it, one spilling over into group 1;
group 4 (one isolated observation) has none, and the peaks in group 1 and 5 are right
on the edges.

It is, of course, up to Mike to decide what method (if any) suits his
real data and his real problem best.

Best Wishes as ever

Paul

------------------------------

Date: Thu, 3 May 2012 17:42:17 +0100
From: Nick Cox <njcoxstata@gmail.com>
Subject: Re: st: extract values from kdensity graphic

-x- is by construction equally spaced and in any case not the original data.

I suggest that a fairer graph is

graph twoway (connected d size if group == 1) ///
(connected d size if group == 2) ///
(connected d size if group == 3) ///
(connected d size if group == 4) ///
(connected d size if group == 5)

which shows that your method based on gaps agrees well with the kernel
density default -- in this example.

Nick

On Thu, May 3, 2012 at 5:24 PM, Seed, Paul <paul.seed@kcl.ac.uk> wrote:
> Dear Statalist,
>
> As Nick points out, this is becoming quite a complex problem.
> I actually would not use -kdensity-, as it does
> not capture the essential features of Mike's original data set.
>
> A simpler approach is to look at the differences between successive values,
> and declare a new group whenever the gap is large (for a suitable value
> of "large").  This can be quite easily done in version 8.
>
>
> ***** Begin example **********
>
> * Enter Mike's data set
> set more off
> clear
> input sampling_event size
> 1 94.74
> 2 94.89
> 3 94.95
> 4 94.97
> 5 95
> 6 95.05
> 7 95.08
> 8 96.11
> 9 96.22
> 10 96.24
> 11 96.27
> 12 96.27
> 13 96.27
> 14 96.32
> 15 96.34
> 16 97.19
> 17 97.26
> 18 97.26
> 19 97.32
> 20 97.34
> 21 97.39
> 22 98.41
> 23 100.62
> 24 100.69
> 25 100.69
> 26 100.76
> 27 100.76
> 28 100.76
> 29 100.84
> 30 100.91
> end
> list
> twoway (scatter size sampling_event)
>
> * Identify groups
> sort size
> gen step = size - size[_n-1]
>
> * Use -stem- to quickly assess the step sizes
> stem step
> * In the example, steps are all <=0.1 or >= 0.85
> * I declare a new group for any step > 0.5
> * I could change this depending on the data set
>
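> * step is missing in the first observation; Stata treats missing as larger
> * than any number, so the first observation also gets group = 1 here
> gen group = step > 0.5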
> replace group = sum(group)
>
> * Check groups are well defined
> bys group : su size
>
> * Graph the various groups in different colours
> graph twoway (connected size sampling_event if group == 1) ///
>        (connected size sampling_event if group == 2) ///
>        (connected size sampling_event if group == 3) ///
>        (connected size sampling_event if group == 4) ///
>        (connected size sampling_event if group == 5)
> * That looks good
>
> * Now try out -kdensity-; pick up the plotted values in x and d
> kdensity size , w(0.1) n(30) gen(x d)
>
> graph twoway (connected d x if group == 1) ///
>        (connected d x if group == 2) ///
>        (connected d x if group == 3) ///
>        (connected d x if group == 4) ///
>        (connected d x if group == 5)
> * kdensity just does not seem to capture the groups I see in the simple scatter plot.
>
>
> ********** End example **************
>

```
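Both posts turn on the fact that -x- from `gen(x d)` is an equally spaced grid rather than the observed sizes. One possible way to relate the densities to the groups is to evaluate the kernel density at the observations themselves, using the `at()` option of -kdensity-, and then plot against rank. The following is only a sketch under stated assumptions: `d_obs` is a name introduced here, `bwidth(0.1)` is taken to be the current spelling of the `w(0.1)` used in the thread, and the 0.5 threshold is the one chosen in the quoted example.

```stata
* Sketch only: d_obs is a hypothetical variable name, not from the thread.
clear
input sampling_event size
1 94.74
2 94.89
3 94.95
4 94.97
5 95
6 95.05
7 95.08
8 96.11
9 96.22
10 96.24
11 96.27
12 96.27
13 96.27
14 96.32
15 96.34
16 97.19
17 97.26
18 97.26
19 97.32
20 97.34
21 97.39
22 98.41
23 100.62
24 100.69
25 100.69
26 100.76
27 100.76
28 100.76
29 100.84
30 100.91
end

* Gap-based groups as in the quoted example: new group when step > 0.5.
* The missing first step is excluded explicitly instead of relying on
* missing values comparing as larger than any number.
sort size
gen step = size - size[_n-1]
gen byte group = 1 + sum(step > 0.5 & !missing(step))

* Density estimated at the observed sizes, not on an equally spaced grid
kdensity size, at(size) generate(d_obs) bwidth(0.1) nograph

* Density against rank (sampling_event is the rank in this data set)
graph twoway (line d_obs sampling_event, lpattern(dash) lwidth(thin)) ///
    (scatter d_obs sampling_event if group == 1, msymbol(O)) ///
    (scatter d_obs sampling_event if group == 2, msymbol(D)) ///
    (scatter d_obs sampling_event if group == 3, msymbol(T)) ///
    (scatter d_obs sampling_event if group == 4, msymbol(S)) ///
    (scatter d_obs sampling_event if group == 5, msymbol(X)) ///
    , legend(order(2 "Group 1" 3 "Group 2" 4 "Group 3" 5 "Group 4" 6 "Group 5"))
```

With this layout each marker carries the density at an actual observation, so any disagreement between the gap-based groups and the kernel estimate should show up as peaks or troughs within a single marker type. Whether that comparison is the right one for the real data remains a judgment call.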