Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: approximate quantiles in Stata


From   David Hoaglin <[email protected]>
To   [email protected]
Subject   Re: st: approximate quantiles in Stata
Date   Sun, 25 Aug 2013 09:28:58 -0400

Laszlo,

In line with the information you received from the CS people, running
time for a good sorting algorithm is O(N x log N) (assuming the data
are not the worst case).  If Stata is sorting the entire sample in
preparation for extracting quantiles, and N is "tens of millions,"
that probably explains the long execution times.

The earlier work that I had in mind was by John Chambers (1971, 1977)
on "partial sorting."  From a quick look at the blog post that you
cited, I think those folks are doing something similar.  I don't know
whether they connect with that earlier work.

If you are dealing with unstructured samples, you should be able to
get a good approximation of "vingtiles" from a subsample of manageable
size.  (If you know that a sample has structure, you can stratify on
that structure and then subsample.)  If you're prepared to consider
your large sample as a population, then you would be estimating its
quantiles from a sample, but not necessarily a small sample.  What
matters is the size of the sample (n), not (usually) the fraction of
the population (n/N), because the variance of a sample quantile is
proportional to 1/n.  Thus, a sample size of 10,000 or even 1,000
might be adequate, depending on how accurate you want the estimates to
be.

You mentioned 20 bins, but it is often useful to select a set of
quantiles that puts more emphasis on the tails of the sample and
distribution.  One such set, from Exploratory Data Analysis, consists
of the median and the pairs (lower and upper) whose tail areas are
(roughly) powers of 1/2.  The tails are often "where the action is."
More on this topic in another message.

David Hoaglin


Chambers, J.M. (1971). Algorithm 410: Partial sorting.  Communications
of the A.C.M. 14:357-358.

Chambers, J.M. (1977).  Computational Methods for Data Analysls.  New
York: Wiley.

On Sat, Aug 24, 2013 at 7:42 AM, László Sándor <[email protected]> wrote:
> Thanks, David,
>
> I think the typical use case is about tens of millions of
> observations. (And as I think it matters for precision, the typical
> case is about 20 bins, or vingtiles.)
>
> FWIW, I also tried to profile -xtile- with maximum number of
> observations possible. With Stata 13 running on 64 cores, it took 7
> hours to generate vingtiles.
>
> Laszlo

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index