st: query on testing uniform distributions

From	Maarten Buis <[email protected]>
To	[email protected]
Subject	st: query on testing uniform distributions
Date	Tue, 1 Nov 2011 09:27:33 +0100

--- Sergio wrote me privately:
> I hope you can help me with the following query.

Such questions should be asked on Statalist and not of its members
privately. This is not a silly rule; there are good reasons for it,
which are listed here:
<http://www.stata.com/support/faqs/res/statalist.html#private>.

> I have read your suggestions on testing whether
> observed data follow a uniform distribution:
<http://www.stata.com/statalist/archive/2010-10/msg00146.html>
>
> and I am a bit puzzled by the results I obtain when
> applying your syntax.
>
> I observe the dates people start their employment spells
> over each tax year and I want to check if these dates are
> distributed uniformly over the year. The dates are in
> numeric format so I observe 364 different numbers for
> each tax year.
>
> If I use the syntax you suggest:
>
> egen n = count(employment_start_dates)
> egen i = rank(employment_start_dates)
> gen hazen = (i - 0.5) / n
> drop n i
>
> quantile hazen , aspect(1) name(quantile, replace)

This graph checks whether the variable hazen is uniformly distributed,
which is trivially the case since it is based only on the ranks. I used
that graph to spot ties, not to check whether the variable of interest
(in your case employment_start_dates) is uniformly distributed. I
suspect that in your case you would see 365 little horizontal plateaus
on the 45 degree line. These may well be too subtle to see easily in
that graph, but given your sample size of almost 3 million
observations, I suspect that these ties might matter for your test. If
you want to check graphically whether your variable of interest is
uniformly distributed, you would type in Stata: -quantile
employment_start_dates, aspect(1)-.
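
A minimal sketch of that check, assuming the variable is named
employment_start_dates as in the question. The graph name q_dates and
the helper variable first are made up for illustration. The second part
counts the distinct values, which gives a rough idea of how heavily
tied the data are.

```stata
* Quantile plot of the variable of interest itself: points on the
* 45 degree line are what a uniform distribution would produce
quantile employment_start_dates, aspect(1) name(q_dates, replace)

* Flag one observation per distinct date and count the flags;
* "first" is a hypothetical helper variable
egen byte first = tag(employment_start_dates)
count if first
display "distinct values: " r(N) " out of " _N " observations"
drop first
```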

> In my case this graph shows values which lie exactly on
> the 45 degree line (a histogram also shows data are more
> or less uniformly distributed). However, the output I get
> with the ksmirnov test is
>
> ksmirnov hazen=hazen
>
> One-sample Kolmogorov-Smirnov test against theoretical
> distribution
>          hazen
>
> Smaller group       D       P-value  Corrected
> ----------------------------------------------
> hazen:              0.0081    0.000
> Cumulative:        -0.0081    0.000
> Combined K-S:       0.0081    0.000      0.000
>
> Note: ties exist in dataset;
>       there are 365 unique values out of 2887994 observations.
>
> I understand this means I reject Ho and therefore the finding
> is that my data do not follow a uniform distribution. Can the
> ksmirnov tests and the quantile plot produce totally opposite
> results as in my case? Should the case of discrete values
> (my case) be treated differently from the continuous case
> you talk about? Here, I am assuming I have applied your
> syntax correctly. Many thanks for your help, very much
> appreciated.

As I said above, the graph does not check the same thing as the test,
so the two can easily lead to different conclusions. Moreover, there
are two types of uniform distribution: a discrete and a continuous
uniform distribution. For example, the results of throwing a six-sided
die would follow a discrete uniform distribution, while the
-runiform()- function in Stata produces draws from a continuous
uniform distribution. The syntax you used tests against a continuous
uniform distribution. However, in your case, you would have a discrete
uniform distribution with 365 possible values. In what I would call
normal-sized samples (say 1,000 to 10,000 observations) I would suspect
that a continuous uniform distribution would be a perfectly acceptable
approximation, but with almost 3 million observations it might make a
difference.
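
One way to get a feel for how much the discreteness matters is a small
simulation. The sketch below is not based on the original data: the
seed and the variable names u and day are assumptions, and the number
of observations is simply copied from the output quoted above. It
draws one continuous and one discrete uniform variable and tests each
against a continuous uniform distribution.

```stata
* Hypothetical simulation; nothing here uses the real data
clear
set seed 20111101
set obs 2887994

* Continuous uniform draws on (0,1), tested against F(u) = u
generate double u = runiform()
ksmirnov u = u

* Discrete uniform draws on the integers 1, ..., 365, tested against
* the continuous uniform CDF on (0,365], F(x) = x/365
generate int day = ceil(365 * runiform())
ksmirnov day = day / 365
```

Comparing the two outputs shows what the test does when the only
departure from the continuous uniform distribution is that the values
come in 365 discrete steps.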

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany


http://www.maartenbuis.nl
--------------------------

Follow-Ups:
- st: RE: query on testing uniform distributions
  - From: Sergio Salis <[email protected]>
- Re: st: query on testing uniform distributions
  - From: Nick Cox <[email protected]>
