

From: Joerg Luedicke <joerg.luedicke@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: testing for bimodality in survey data
Date: Thu, 10 Nov 2011 11:14:53 -0500

I agree with Nick's general remarks on bimodality. Dana, we do not have any background information on your project, but, generally speaking, merely "testing" whether a distribution is bimodal or not looks like an empty exercise to me. What matters is understanding the nature of your distribution, i.e., how it is shaped and what might explain that shape.

To illustrate this, consider the measurement of height, sampled from a population of male and female humans. If we take the height measurement from the nhanes data and plot a kernel density, the distribution does not look exactly bimodal at first sight. That is because the difference between the means of the two distributions (for men and women) is too small relative to their variances. So, for the sake of argument/illustration, let's just make the sampled men a bit taller (b). We can then estimate the density (c) and clearly find evidence that the distribution of height is bimodal, indicating a mixture of two normal distributions (as we assume at this point, one for male and one for female heights). We can now go ahead and fit a two-component mixture model to estimate the parameters of the mixed distributions (d). In this case the results look very reasonable, and if we crosstab the actual gender with the classification (d), we see that around 95% of men and women are correctly classified. In addition, we could plot the probability density function for the bimodal distribution, using the parameters that we estimated with the mixture model (e).

So all this seems to make a lot of sense, and we can conclude that the distribution at hand is bimodal and that the bimodality is caused by a mixture of two Gaussian distributions, one for males and one for females. Now, in reality, solutions are not always as clear and there is often not as much previous knowledge available.
However, given this example, I wonder how it would help to formally "test" whether the distribution is bimodal or not (whatever the null hypothesis of such a "test" would be, I do not know). Given our example, I guess any such test would give you a green light regarding bimodality because of the clear-cut solution. However, we already knew it was bimodal and could also explain it reasonably, so the "test" would not add much information. Likewise, what if the "test" gave a red light? Clearly, the test result would not make a lot of sense. Something similar applies to small differences between the modes/means. If there is not much of a bimodal distribution to detect, and there is neither a theory nor previous knowledge that would lead to a reasonable expectation of bimodality, and now the test says "green", what would that mean? On the other hand, if you have a strong theory/knowledge from which you expect a bimodal distribution and you can also detect it in the data, but the test says "red", well, I would still go with the theoretically guided solution. Often it is simply better to do more data exploration and checking than "testing".
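The curve drawn in step (e) of the example can be written out explicitly. As a sketch, assuming the two-component normal mixture fitted in step (d), and with symbols introduced here only for illustration (they do not appear in the post or in the -fmm- output):

```latex
f(x) = \pi_1\,\phi(x;\,\mu_1,\sigma_1) + \pi_2\,\phi(x;\,\mu_2,\sigma_2),
\qquad \pi_1 + \pi_2 = 1,\quad \pi_1,\pi_2 \ge 0
```

Here \phi(x; \mu, \sigma) is the normal density with mean \mu and standard deviation \sigma, which is what Stata's normalden(x, m, s) returns. In the example code, \pi_k corresponds to e(pi1_est) and e(pi2_est), \mu_k to par[1,1] and par[1,2], and \sigma_k to e(sigma1_est) and e(sigma2_est). In the special case of equal weights and a common \sigma, such a mixture has two modes only when |\mu_1 - \mu_2| > 2\sigma, which is consistent with the unshifted heights not looking bimodal.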
/*Example*/

//(a) data
webuse nhanes2, clear

//(b) let's make men a bit taller, for the sake of argument
clonevar height2=height
replace height2=height2+10 if sex==1

//(c) inspecting the distribution using an adaptive kernel estimate
//and saving 1000 grid points at which the density is evaluated
kdens height2, adapt n(1000) g(den grid)

//(d) now we can fit a mixture model, assuming bimodality
fmm height2, comp(2) mix(normal)
fmmlc, savec
tab sex _class_1, row

//(e) and can plot a probability density function using the parameters
//and mixing probabilities as estimated from the ML fit (using the
//grid that we saved earlier)
mat par=e(b)
gen mlmix=(e(pi1_est)*normalden(grid, par[1,1], e(sigma1_est)))+ ///
    (e(pi2_est)*normalden(grid, par[1,2], e(sigma2_est)))
line mlmix grid, title("ML fit") ytitle("Density")

/*End*/

Joerg

On Thu, Nov 10, 2011 at 3:37 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> Joerg has in effect already answered your question. Bimodality implies some generating process that is bimodal, so should you want to investigate it formally, it is arguably best to think up a model with that kind of behaviour as one possibility and then estimate its parameters.
>
> For the most part, I have found that bimodality is convincing if and only if (a) it shows up consistently on density estimates with a range of kernels and a range of kernel widths and (b) there is some substantive expectation of a mix of two kinds (males and females, whatever).
>
> Nick
> n.j.cox@durham.ac.uk
>
> Dana Shills
>
> Thank you Joerg. That was very helpful. If I understand this correctly, once you have the kdens plot you can visually see if there are two modes. So there is no statistical test that confirms the number of modes in the distribution?
>
>> From: joerg.luedicke@gmail.com
>>
>> I am not sure what "testing" is supposed to mean in this context, but if you want to explore the possibility of a multimodal distribution you could indeed go for a non-parametric density estimation. I recommend using Ben Jann's -kdens- (available from SSC, -findit kdens-), which is a quite powerful package and supports probability weights. I would also recommend using an adaptive kernel estimate, as this is usually the best kernel estimate when dealing with multimodal data (at least in my experience). What you could do in addition is checking whether the multimodality is due to distributional mixtures (which is often the case when you find more than one mode). For example, say you find your distribution being bimodal, you could fit a 2-component mixture model to estimate the underlying parameters of the mixed distributions via maximum likelihood (to do this you could use -fmm- which is also available from SSC; if the model does not converge make sure you provide starting values; for Gaussian mixtures you could use the modes from the kernel estimate and guess the variance). You could also check how well the (in this case) 2 distributions can be separated with using an entropy measure which you could calculate with -fmmlc-, also available from SSC.
>>
>> On Wed, Nov 9, 2011 at 1:50 PM, Dana Shills <shills52@hotmail.com> wrote:
>> > I am using survey data on firms in Ghana. The survey methodology uses stratified random sampling and I have the probability weights. I want to be able to plot a distribution of firm sizes (incorporating the weights) and test for bimodality in the firm size distribution. I looked at the "adgakern" program but I don't think it allows for survey weights. Could someone please point me to what commands I should be looking at?
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/

