Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: extract values from kdensity graphic


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: extract values from kdensity graphic
Date   Wed, 2 May 2012 17:49:16 +0100

That problem is several orders of magnitude more difficult than what
you originally asked.

-kdensity- says nothing directly about the number of groups that
really or notionally exist. If you are counting modes, that is
evidence, but the number of modes is dependent on what kernel type and
what kernel width are chosen and where you estimate the density
function. Also, if the data are skewed, it may be a good idea to
estimate the density on a transformed scale.

You should never conclude anything from kernel density estimation
without a sensitivity analysis on kernel type, kernel width and the
points at which the density is estimated. Note that the defaults for
-kdensity- are pretty arbitrary.
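
A minimal sketch of such a sensitivity check on kernel width, assuming
-size- is in memory as in the example data quoted below. The four
widths and the graph names kd1-kd4 are arbitrary illustrations, not
recommendations, and width() follows the spelling used in this thread.

```stata
* Sketch only: redraw the estimate for several widths and compare by eye.
local i = 0
foreach w of numlist 0.05 0.1 0.2 0.4 {
    local i = `i' + 1
    * One named graph per width, so that all can be combined afterwards.
    kdensity size, width(`w') title("width `w'") name(kd`i', replace)
}
* Put the four panels on a common x axis for comparison.
graph combine kd1 kd2 kd3 kd4, xcommon
```

If the number of modes changes from panel to panel, it is a property of
the settings as much as of the data. The same loop could be repeated
over kernel types and over n().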

I would have said that for your original problem. I intensify this
advice on now being told that you are trying to identify hundreds of
modes in the real problem.

If you persist in this, you can look for troughs as local minima of the
estimated density, i.e. points at which the density is less than the
values on either side once the estimates are sorted by the points at
which they were calculated.
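
A minimal sketch of that local-minimum search, assuming -size- is in
memory as in the example data quoted below. The names xgrid, dens and
trough and the 500-point grid are illustrative only. The padding step
assumes that -kdensity- will not evaluate the density at more points
than there are observations, which is worth checking in your version.

```stata
* Sketch only. xgrid, dens, trough and the 500-point grid are illustrative.
preserve
local npts = 500
local last = `npts' - 1
* Assumption: n() is capped at the number of observations, so pad first.
if _N < `npts' {
    set obs `npts'
}
* Store the grid and the estimated density instead of drawing the graph.
kdensity size, width(0.1) n(`npts') generate(xgrid dens) nograph
* xgrid is ascending, so neighbours in observation order are neighbours
* on the grid. "<=" on the left catches flat troughs (the default kernel
* gives exact zeros in wide gaps) and flags their right-hand edge; it can
* also flag a flat shoulder, so inspect the results.
generate byte trough = dens <= dens[_n-1] & dens < dens[_n+1] in 2/`last'
list xgrid dens if trough == 1
count if trough == 1
local ntrough = r(N)
restore
```

Troughs found this way move with the width, the kernel and the grid, so
repeat with other settings before believing any of them.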

By contrast, cluster analysis methods do have scope to address the
question of how many groups exist. But they aren't likely to be
practical for identifying hundreds of classes.

My -round()- suggestion was a little flippant. Your example is one
where five groups appear to exist unequivocally and many methods will
find them. -round(, 1)- was one but I do agree that it is not a good
method generally.

Nick

On Wed, May 2, 2012 at 5:27 PM,  <[email protected]> wrote:
> Many thanks Nick,
>
> -group1d- doesn't suit my application (versions of Stata aside) as I don't
> want to have to specify the number of groups. I really like the kdensity
> plot because it automatically determines the number of groups (which are
> in the hundreds for my real data sets).
>
> Unfortunately -round- often fails to group sizes appropriately in my full
> data sets too, as the clusters don't always align with the rounding units.
>
> The kdensity plot shows exactly what I want, but alas I can't extract its
> data (trough coordinates).
>
> Any more thoughts from the list?
>
> Mike.
>
>
>
>
> Another way of looking at these data is to apply -group1d- (SSC). In fact
> Mike cannot do that himself because it needs Stata 9, but he can use the
> results. With a least-squares criterion explained in the help and
> references given, -group1d- yields as the best 5 groups
>
> Group Size    First            Last           Mean      SD
>  5       8   23   100.62      30   100.91   100.75    0.09
>  4       1   22    98.41      22    98.41    98.41    0.00
>  3       6   16    97.19      21    97.39    97.29    0.06
>  2       8    8    96.11      15    96.34    96.25    0.07
>  1       7    1    94.74       7    95.08    94.95    0.11
>
> In fact, just about any method of cluster analysis should find the same
> groups if they are genuine, e.g. -cluster kmeans-. Then use whatever
> summary you prefer.
>
> Details follow for -group1d-.
>
> . sort size
>
> . group1d size, max(7)
>
>  Partitions of 30 data up to 7 groups
>
>  1 group:  sum of squares 143.60
>  Group Size    First            Last           Mean      SD
>  1      30    1    94.74      30   100.91    97.43    2.19
>
>  2 groups: sum of squares 23.00
>  Group Size    First            Last           Mean      SD
>  2       9   22    98.41      30   100.91   100.49    0.74
>  1      21    1    94.74      21    97.39    96.12    0.93
>
>  3 groups: sum of squares 6.62
>  Group Size    First            Last           Mean      SD
>  3       8   23   100.62      30   100.91   100.75    0.09
>  2      15    8    96.11      22    98.41    96.81    0.66
>  1       7    1    94.74       7    95.08    94.95    0.11
>
>  4 groups: sum of squares 1.26
>  Group Size    First            Last           Mean      SD
>  4       8   23   100.62      30   100.91   100.75    0.09
>  3       7   16    97.19      22    98.41    97.45    0.40
>  2       8    8    96.11      15    96.34    96.25    0.07
>  1       7    1    94.74       7    95.08    94.95    0.11
>
>  5 groups: sum of squares 0.20
>  Group Size    First            Last           Mean      SD
>  5       8   23   100.62      30   100.91   100.75    0.09
>  4       1   22    98.41      22    98.41    98.41    0.00
>  3       6   16    97.19      21    97.39    97.29    0.06
>  2       8    8    96.11      15    96.34    96.25    0.07
>  1       7    1    94.74       7    95.08    94.95    0.11
>
>  6 groups: sum of squares 0.14
>  Group Size    First            Last           Mean      SD
>  6       8   23   100.62      30   100.91   100.75    0.09
>  5       1   22    98.41      22    98.41    98.41    0.00
>  4       6   16    97.19      21    97.39    97.29    0.06
>  3       8    8    96.11      15    96.34    96.25    0.07
>  2       5    3    94.95       7    95.08    95.01    0.05
>  1       2    1    94.74       2    94.89    94.81    0.08
>
>  7 groups: sum of squares 0.10
>  Group Size    First            Last           Mean      SD
>  7       2   29   100.84      30   100.91   100.88    0.04
>  6       6   23   100.62      28   100.76   100.71    0.05
>  5       1   22    98.41      22    98.41    98.41    0.00
>  4       6   16    97.19      21    97.39    97.29    0.06
>  3       8    8    96.11      15    96.34    96.25    0.07
>  2       5    3    94.95       7    95.08    95.01    0.05
>  1       2    1    94.74       2    94.89    94.81    0.08
>
>  Groups     Sums of squares
>    1          143.60
>    2           23.00
>    3            6.62
>    4            1.26
>    5            0.20
>    6            0.14
>    7            0.10
>
>
> On Wed, May 2, 2012 at 9:34 AM, Nick Cox <[email protected]> wrote:
> In practice,
>
> gen sizer = round(size)
>
> is a simpler way of degrading your data. Check by
>
> scatter sizer size
>
> Nick
>
> On Wed, May 2, 2012 at 9:16 AM,  <[email protected]> wrote:
> * Hi Statalist,
> * I'm a beginner using version 8.
> * The following measurements were collected by a machine in my lab...
> clear
> input sampling_event size
> 1 94.74
> 2 94.89
> 3 94.95
> 4 94.97
> 5 95
> 6 95.05
> 7 95.08
> 8 96.11
> 9 96.22
> 10 96.24
> 11 96.27
> 12 96.27
> 13 96.27
> 14 96.32
> 15 96.34
> 16 97.19
> 17 97.26
> 18 97.26
> 19 97.32
> 20 97.34
> 21 97.39
> 22 98.41
> 23 100.62
> 24 100.69
> 25 100.69
> 26 100.76
> 27 100.76
> 28 100.76
> 29 100.84
> 30 100.91
> end
> list
> twoway (scatter size sampling_event)
>
> * My aim is to class these size values into categories (5 categories in
> * the example shown).
> * kdensity will generate the following graphic...
>
> kdensity size , w(0.1) n(30)
>
> * The troughs of this graphic are a good way to define the bounds of
> * each category.
> * Category_4, for example would include all size values larger than 98
> * and less than 99.
> * I'd like to extract these trough points as a kdensity post-estimation
> * and output them as a new variable.
> * Is this possible?
> * Look forward to any advice the list has to offer.

