

From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: query on testing uniform distributions
Date: Tue, 1 Nov 2011 08:53:01 +0000

Maarten has as usual discussed this very thoroughly and leaves only scope for some extra details.

If the data span a long series of years, then every day of the year might have about the same chance of being an employment start date. I am guessing wildly at situations in which few people start work at weekends. But at least in Britain there are jobs in which starting on Mondays or (usually different) starting on the first day of a month are standard. Sergio should know much more about the country being studied and its practices. I'd expect the number of years spanned to be more limiting than the number of distinct people in the sample. Also what about 29 February?

All that said, it is all too likely with a sample size of about 3 million that statistically significant results might be scientifically insignificant. The main role of significance tests used to be to stop researchers making fools of themselves by overinterpreting very small samples. Now some researchers want to use significance tests to check for structure in very large samples. Often, there are better ways of doing that which use all of the information available. The desire to reduce all to a single badness-of-fit test or measure sometimes has to be resisted.

The expected frequencies here are easy to calculate, so I'd move to Pearson residuals,

(observed - expected) / sqrt(expected)

and also plot those against day of year to see what fine structure there is. The quantile plot is a good starting point, but needs to be followed up.

Nick

On Tue, Nov 1, 2011 at 8:27 AM, Maarten Buis <maartenlbuis@gmail.com> wrote:
> --- Sergio wrote me privately:
>> I hope you can help me with the following query.
>
> Such questions should be put to Statalist and not to its members
> privately. This is not a silly rule, there are good reasons for it,
> which are listed here:
> <http://www.stata.com/support/faqs/res/statalist.html#private>.
>
>> I have read your suggestions on testing whether
>> observed data follow a uniform distribution:
> <http://www.stata.com/statalist/archive/2010-10/msg00146.html>
>>
>> and I am a bit puzzled by the results I obtain when
>> applying your syntax.
>>
>> I observe the dates people start their employment spells
>> over each tax year and I want to check if these dates are
>> distributed uniformly over the year. The dates are in
>> numeric format so I observe 364 different numbers for
>> each tax year.
>>
>> If I use the syntax you suggest:
>>
>> egen n = count(employment_start_dates)
>> egen i = rank(employment_start_dates)
>> gen hazen = (i - 0.5) / n
>> drop n i
>>
>> quantile hazen , aspect(1) name(quantile, replace)
>
> This graph tests whether the variable hazen is uniformly distributed,
> which is trivially the case since it is only based on the rank. I used
> that graph to spot ties, not to check whether the variable of interest
> (in your case employment_start_dates) is uniformly distributed. I
> suspect that in your case you would see 365 little horizontal plateaus
> on the 45 degree line. This may well be too subtle to easily see in
> that graph, but given your sample size of almost 3 million
> observations, I suspect that these ties might matter for your test. If
> you want to graphically test whether your variable of interest is
> uniformly distributed you would type in Stata: -quantile
> employment_start_dates, aspect(1)-.
>
>> In my case this graph shows values which lie exactly on
>> the 45 degree line (a histogram also shows data are more
>> or less uniformly distributed).
>> However, the output I get
>> with the ksmirnov test is
>>
>> ksmirnov hazen=hazen
>>
>> One-sample Kolmogorov-Smirnov test against theoretical
>> distribution
>> hazen
>>
>> Smaller group       D       P-value  Corrected
>> ----------------------------------------------
>> hazen:              0.0081    0.000
>> Cumulative:        -0.0081    0.000
>> Combined K-S:       0.0081    0.000      0.000
>>
>> Note: ties exist in dataset;
>>       there are 365 unique values out of 2887994 observations.
>>
>> I understand this means I reject Ho and therefore the finding
>> is that my data do not follow a uniform distribution. Can the
>> ksmirnov tests and the quantile plot produce totally opposite
>> results as in my case? Should the case of discrete values
>> (my case) be treated differently from the continuous case
>> you talk about? Here, I am assuming I have applied your
>> syntax correctly. Many thanks for your help, very much
>> appreciated.
>
> As I said above the graph does not test the same thing as the test, so
> it can easily be that the two lead to different conclusions. Moreover,
> there are two types of uniform distribution: a discrete and a
> continuous uniform distribution. For example the results of throwing a
> six sided die would follow a discrete uniform distribution, while the
> -runiform()- function in Stata produces draws from a continuous
> uniform distribution. The syntax you used tested against a continuous
> uniform distribution. However, in your case, you would have a discrete
> uniform distribution with 365 possible values. In what I would call
> normal size samples (say 1,000 to 10,000 observations) I would suspect
> that a continuous uniform distribution would be a perfectly acceptable
> approximation, but in your case it might make a difference.
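
The Pearson residuals suggested in the reply, together with a test against the discrete (rather than continuous) uniform distribution, might be sketched in Stata as follows. This is only a rough sketch under assumptions not stated in the thread: that employment_start_dates is a Stata daily date, that day of year is counted on the calendar year rather than the tax year, that 29 February is simply set aside so that the null is a discrete uniform on 365 days, and that every one of those days occurs at least once. The names dayofyear, observed, expected, pearson, n and chisq are made up for illustration.

```stata
* Sketch only: assumes employment_start_dates is a Stata daily date
preserve

* assumption: set 29 February aside so that 365 equally likely days remain
drop if month(employment_start_dates) == 2 & day(employment_start_dates) == 29

* day of year 1..365, mapped onto a non-leap reference year (2011)
* so that the same calendar day gets the same number in every year
gen dayofyear = doy(mdy(month(employment_start_dates), ///
    day(employment_start_dates), 2011))

* one observation per day, with its observed frequency
* (days that never occur would be absent, hence the assumption above)
contract dayofyear, freq(observed)

* expected frequency under a discrete uniform on 365 days
egen n = total(observed)
gen expected = n / 365

* Pearson residuals: (observed - expected) / sqrt(expected)
gen pearson = (observed - expected) / sqrt(expected)

* look for fine structure across the year
scatter pearson dayofyear, yline(0) ///
    ytitle("Pearson residual") xtitle("Day of year")

* Pearson chi-square as the sum of squared residuals, 365 - 1 = 364 d.f.
egen chisq = total(pearson^2)
display "chi2(364) = " chisq[1] "   P = " chi2tail(364, chisq[1])

restore
```

With almost 3 million observations the expected frequency is roughly 7,900 per day, so the overall test is very likely to reject whatever the practical importance of the departures. The plot of residuals against day of year is the more informative part: spikes at the first of each month, or a weekly rhythm, would show up there but not in a single P-value.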
