The Stata listserver

st: RE: Re: Kernel density estimation in a large dataset


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Re: Kernel density estimation in a large dataset
Date   Tue, 16 Nov 2004 18:34:04 -0000

Call me awkward, but if you have 20,000 observations 
I am not clear how much you are going to gain from 
kernel density estimation as compared with 
a fine-scale histogram. 
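
As a rough illustration of the histogram alternative, here is a minimal sketch. The data are simulated stand-ins for the poster's variable x, and the choice of 200 bins is an arbitrary assumption, not something recommended in this thread.

```stata
* Sketch only: simulated data stand in for the poster's 20,000 observations.
clear
set seed 12345
set obs 20000
generate double x = rnormal()   // in Stata 8: invnorm(uniform())

* A fine-scale histogram on the density scale; bin(200) is an arbitrary choice.
histogram x, bin(200) density
```

With this many observations the bin heights are already fairly smooth in the middle of the distribution, so the picture can be compared directly with a kernel estimate.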

In addition, sampling is going to make any problems 
where densities are low much worse. (The motivation 
is that you have far more values than you need in 
the middle of the distribution; OK, we buy that 
readily.)

Linear interpolation will at best disguise how 
much you are messing things up in those areas, 
usually one or both tails, not fix it. 

In the old days, one waited overnight, or a few days, 
for the output to arrive. Not so long ago, you 
went for a cup of coffee (or a do-nut, in some 
cases). 

I don't know why EViews is so much faster, but 
10 minutes is not outrageous for anything really 
interesting. You could read the manual or 
the Stata Journal in the meantime. 

Of course, if you want 100 (1000 ...) of these, there is 
a problem. 

Nick 
n.j.cox@durham.ac.uk 

Eva Poen
 
> Thanks a lot for this suggestion. I am not sure whether I need equally
> spaced intervals for the density estimate (this seems to be standard).
> I ended up doing
> 
> kdensity x, n(1000) gen(grid dens)
> 
> sort grid
> gen density =.
> forvalues i = 2/19426 {
>    qui count if grid < x[`i']
>    qui replace density = (dens[r(N)] + dens[r(N)+1])/2 in `i'
> }
> 
> While this approach works, it turns out that it takes nearly as long
> as computing the density for all observations in the first place. In
> the meantime, I tried this in EViews (with exactly the same data,
> bandwidth and N) and found that density estimation and interpolation
> take about 3 seconds (!) in EViews, while Stata takes about 10 minutes
> overall. I was very surprised by this huge difference in speed.
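
The loop above issues one -count- and one -replace- per observation, and it averages the two neighbouring grid densities rather than interpolating linearly between them. A possible loop-free variant is sketched below: estimate on the 1000-point grid, append the grid points as extra observations, and let -ipolate- fill in the data points in one pass. This is only a sketch under stated assumptions: the data are simulated, and the names d, isgrid and density are hypothetical helpers, not anything used in the thread.

```stata
* Sketch only: simulated data stand in for the poster's variable x.
clear
set seed 12345
set obs 20000
generate double x = rnormal()   // in Stata 8: invnorm(uniform())

* Density on a 1000-point grid; grid and dens fill the first 1000 observations.
kdensity x, n(1000) generate(grid dens) nograph

* Append 1000 new observations that hold the grid points and their densities.
local N = _N
local newN = `N' + 1000
set obs `newN'
generate byte isgrid = _n > `N'
replace x = grid[_n - `N'] if isgrid
generate double d = dens[_n - `N'] if isgrid

* Linear interpolation of d on x fills in the original observations.
ipolate d x, generate(density)

* Drop the helper observations and variables.
drop if isgrid
drop d isgrid
```

Any observation lying outside the range of the grid is left missing, because -ipolate- does not extrapolate unless its epolate option is specified.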

Nichols, Austin wrote:
> 
> > You could
> > . sort x
> > . gen y=x if mod(_n,20)==0 | _n==1 | _n==_N
> > . kdensity x, at(y) gen(xdens)
> > . ipolate xdens x, gen(f)

Eva Poen [mailto:eva.poen@unisg.ch]

> > I want to do Kernel density estimation and local polynomial regression
> > on a dataset with 20'000 observations using Stata 8.2. Computations
> > using all observations as a grid, like in
> >
> > - kdensity x, at(x) gen(xdens) -
> >
> > take quite a long time (between 10 and 15 minutes each). So I would
> > like to use a grid of, say, 1000 points, but still have density
> > estimates for all my observations. That is, I want to have a variable
> > xdens which contains in observation i
> >
> > - the exact estimated density if x[i] happens to be a grid point
> > - the linear interpolation of the two densities estimated at the
> > closest grid points to the left and right of x[i]
> >
> > for all 20'000 observations. I was told that this is the default
> > behaviour in EViews, but I have really no clue how to best implement
> > this in Stata.
> >
