Re: st: extract values from kdensity graphic


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: extract values from kdensity graphic
Date   Wed, 2 May 2012 15:42:49 -0400

Mike <mcross@exemail.com.au>:
I agree with all Nick's points, but it is true that if you have chosen
the kernel width correctly, gaps in the data have estimated density
zero and this fact makes it easy to identify boundaries between groups
of data.  The choice of a different kernel width will produce
different group boundaries, of course.

clear
input sampling_event size
1 94.74
2 94.89
3 94.95
4 94.97
5 95
6 95.05
7 95.08
8 96.11
9 96.22
10 96.24
11 96.27
12 96.27
13 96.27
14 96.32
15 96.34
16 97.19
17 97.26
18 97.26
19 97.32
20 97.34
21 97.39
22 98.41
23 100.62
24 100.69
25 100.69
26 100.76
27 100.76
28 100.76
29 100.84
30 100.91
end
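* bin size into bars of width .1: h = bar height, z = bar midpoint
twoway__histogram_gen size, width(.1) gen(h z)
* kernel density without a graph: x = grid points, f = estimated density
kdensity size, bw(.1) g(x f) nogr
* number each run of zero density, then reset to 0 where density is positive
g trough=sum(f==0&f[_n-1]!=0)
replace trough=0 if f>0
su trough, mean
* collect the median grid point of each zero-density run as a cutpoint
loc m
forv i=1/`r(max)' {
 qui su x if trough==`i', d
 loc m `m' `r(p50)'
}
* histogram bars plus density curve, with vertical lines at the cutpoints
tw bar h z, barw(.1)||line f x, xli(`m')
di "`m'"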
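
The cutpoints collected in m can be turned into a grouping variable. What follows is a minimal sketch, not part of the original post: it assumes the code above has just been run in the same session or do-file, so that the local macro m still holds the cutpoints, and the variable name group is hypothetical.

```stata
* Sketch only: assign each observation to a group using the cutpoints in local m.
* Assumes local m from the code above is still defined; group is a made-up name.
g int group = 1
foreach c of local m {
    * every cutpoint lying below an observation moves it up one group
    replace group = group + 1 if size > `c' & !missing(size)
}
tab group
```

Because m depends on the bandwidth, rerunning with a different bw() will generally give a different grouping.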


On Wed, May 2, 2012 at 12:49 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> That problem is several orders of magnitude more difficult than what
> you originally asked.
>
> -kdensity- says nothing directly about the number of groups that
> really or notionally exist. If you are counting modes, that is
> evidence, but the number of modes is dependent on what kernel type and
> what kernel width are chosen and where you estimate the density
> function. Also, if the data are skewed, it may be a good idea to
> estimate the density on a transformed scale.
>
> You should never conclude anything from kernel density estimation
> without a sensitivity analysis on kernel type, width and where
> estimated. Know that the defaults for -kdensity- are pretty arbitrary.
>
> I would have said that for your original problem. I intensify this
> advice on now being told that you are trying to identify hundreds of
> modes in the real problem.
>
> If you persist in this you can look for troughs as local minima,
> i.e. values less than those on either side in a sorted set of values.
>
> By contrast, cluster analysis methods have scope to address the
> question of how many groups exist. But they aren't likely to be
> practical for identifying hundreds of classes.
>
> My -round()- suggestion was a little flippant. Your example is one
> where five groups appear to exist unequivocally and many methods will
> find them. -round(, 1)- was one but I do agree that it is not a good
> method generally.
>
> Nick
>
> On Wed, May 2, 2012 at 5:27 PM,  <mcross@exemail.com.au> wrote:
>> Many thanks Nick,
>>
>> -group1d- doesn't suit my application (versions of Stata aside) as I don't
>> want to have to specify the number of groups. I really like the kdensity
>> plot because it automatically determines the number of groups (which are
>> in the hundreds for my real data sets).
>>
>> Unfortunately -round- often fails to group sizes appropriately in my full
>> data sets too, as the clusters don't always align with the rounding units.
>>
>> The kdensity plot shows exactly what I want, but alas I can't extract its
>> data (trough coordinates).
>>
>> Any more thoughts from the list?
>>
>> Mike.
>>
>>
>>
>>
>> Another way of looking at these data is to apply -group1d- (SSC). In fact
>> Mike cannot do that himself because it needs Stata 9, but he can use the
>> results. With a least-squares criterion explained in the help and
>> references given, -group1d- yields as the best 5 groups
>>
>> Group Size    First            Last           Mean      SD
>>  5       8   23   100.62      30   100.91   100.75    0.09
>>  4       1   22    98.41      22    98.41    98.41    0.00
>>  3       6   16    97.19      21    97.39    97.29    0.06
>>  2       8    8    96.11      15    96.34    96.25    0.07
>>  1       7    1    94.74       7    95.08    94.95    0.11
>>
>> In fact, just about any method of cluster analysis should find the same
>> groups if they are genuine, e.g. -cluster kmeans-. Then use whatever
>> summary you prefer.
>>
>> Details follow for -group1d-.
>>
>> . sort size
>>
>> . group1d size, max(7)
>>
>>  Partitions of 30 data up to 7 groups
>>
>>  1 group:  sum of squares 143.60
>>  Group Size    First            Last           Mean      SD
>>  1      30    1    94.74      30   100.91    97.43    2.19
>>
>>  2 groups: sum of squares 23.00
>>  Group Size    First            Last           Mean      SD
>>  2       9   22    98.41      30   100.91   100.49    0.74
>>  1      21    1    94.74      21    97.39    96.12    0.93
>>
>>  3 groups: sum of squares 6.62
>>  Group Size    First            Last           Mean      SD
>>  3       8   23   100.62      30   100.91   100.75    0.09
>>  2      15    8    96.11      22    98.41    96.81    0.66
>>  1       7    1    94.74       7    95.08    94.95    0.11
>>
>>  4 groups: sum of squares 1.26
>>  Group Size    First            Last           Mean      SD
>>  4       8   23   100.62      30   100.91   100.75    0.09
>>  3       7   16    97.19      22    98.41    97.45    0.40
>>  2       8    8    96.11      15    96.34    96.25    0.07
>>  1       7    1    94.74       7    95.08    94.95    0.11
>>
>>  5 groups: sum of squares 0.20
>>  Group Size    First            Last           Mean      SD
>>  5       8   23   100.62      30   100.91   100.75    0.09
>>  4       1   22    98.41      22    98.41    98.41    0.00
>>  3       6   16    97.19      21    97.39    97.29    0.06
>>  2       8    8    96.11      15    96.34    96.25    0.07
>>  1       7    1    94.74       7    95.08    94.95    0.11
>>
>>  6 groups: sum of squares 0.14
>>  Group Size    First            Last           Mean      SD
>>  6       8   23   100.62      30   100.91   100.75    0.09
>>  5       1   22    98.41      22    98.41    98.41    0.00
>>  4       6   16    97.19      21    97.39    97.29    0.06
>>  3       8    8    96.11      15    96.34    96.25    0.07
>>  2       5    3    94.95       7    95.08    95.01    0.05
>>  1       2    1    94.74       2    94.89    94.81    0.08
>>
>>  7 groups: sum of squares 0.10
>>  Group Size    First            Last           Mean      SD
>>  7       2   29   100.84      30   100.91   100.88    0.04
>>  6       6   23   100.62      28   100.76   100.71    0.05
>>  5       1   22    98.41      22    98.41    98.41    0.00
>>  4       6   16    97.19      21    97.39    97.29    0.06
>>  3       8    8    96.11      15    96.34    96.25    0.07
>>  2       5    3    94.95       7    95.08    95.01    0.05
>>  1       2    1    94.74       2    94.89    94.81    0.08
>>
>>  Groups     Sums of squares
>>    1          143.60
>>    2           23.00
>>    3            6.62
>>    4            1.26
>>    5            0.20
>>    6            0.14
>>    7            0.10
>>
>>
>> On Wed, May 2, 2012 at 9:34 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>> In practice,
>>
>> gen sizer = round(size)
>>
>> is a simpler way of degrading your data. Check by
>>
>> scatter sizer size
>>
>> Nick
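
Nick's suggestion of looking for troughs as local minima can be sketched as follows. This is an illustration only, not code from the thread: it assumes the variables x and f created by -kdensity ..., g(x f)- in the code at the top of this message, with x in ascending order, and the name localmin is hypothetical.

```stata
* Sketch only: flag grid points where the density is lower than at both neighbours.
* Assumes x and f come from -kdensity ..., g(x f)-; localmin is a made-up name.
* The if qualifier skips the first and last grid points, which lack a neighbour.
g byte localmin = f < f[_n-1] & f < f[_n+1] if _n > 1 & !missing(f[_n+1])
list x f if localmin == 1
```

The strict inequalities ignore flat stretches, such as a run of grid points where the density is exactly zero, so this is most useful with wider bandwidths where the density dips between groups without reaching zero.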
