
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: RE: Re: Kernel density estimation in a large dataset
Date: Tue, 16 Nov 2004 18:34:04 -0000

Call me awkward, but if you have 20,000 observations
I am not clear how much you are going to gain from
kernel density estimation as compared with a fine-scale
histogram.

In addition, sampling -- because you have far more values
than you need in the middle of the distribution; OK, we
buy that readily -- is going to make any problems where
densities are low much worse. Linear interpolation
will at best disguise how much you are messing things up
in those areas, usually one or both tails, not fix it.

In the old days, one waited overnight, or a few days,
for the output to arrive. Not so long ago, you
went for a cup of coffee (or a do-nut, in some cases).
I don't know why Eviews is so much faster,
but 10 minutes is not outrageous for anything
really interesting. You could read the manual
or the Stata Journal in the meantime.

Of course, if you want 100 (1000 ...) of these, there
is a problem.

Nick
n.j.cox@durham.ac.uk

Eva Poen

> Thanks a lot for this suggestion. I am not sure whether I need
> equally spaced intervals for the density estimate (this seems to be
> standard). I ended up doing
>
> kdensity x, n(1000) gen(grid dens)
>
> sort grid
> gen density =.
> forvalues i = 2/19426 {
>   qui count if grid < x[`i']
>   qui replace density = (dens[r(N)] + dens[r(N)+1])/2 in `i'
> }
>
> While this approach works, it turns out that it takes nearly as long
> as computing the density for all observations in the first place. In
> the meantime, I tried this in EViews (with exactly the same data,
> bandwidth and N) and found that density estimation and interpolation
> take about 3 seconds (!) in EViews, while Stata takes about 10
> minutes overall. I was very surprised by this huge difference in
> speed.

Nichols, Austin wrote:

> > You could
> > . sort x
> > . gen y=x if mod(_n,20)==0 | _n==1 | _n==_N
> > . kdensity x, at(y) gen(xdens)
> > . ipolate xdens x, gen(f)

Eva Poen [mailto:eva.poen@unisg.ch]

> > I want to do Kernel density estimation and local polynomial
> > regression on a dataset with 20'000 observations using Stata 8.2.
> > Computations using all observations as a grid, like in
> >
> > - kdensity x, at(x) gen(xdens) -
> >
> > take quite a long time (between 10 and 15 minutes each). So I would
> > like to use a grid of, say, 1000 points, but still have density
> > estimates for all my observations. That is, I want to have a
> > variable xdens which contains in observation i
> >
> > - the exact estimated density if x[i] happens to be a grid point
> > - the linear interpolation of the two densities estimated at the
> >   closest grid points to the left and right of x[i]
> >
> > for all 20'000 observations. I was told that this is the default
> > behaviour in EViews, but I have really no clue how to best
> > implement this in Stata.
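
For readers who want to see the grid-plus-interpolation idea from this thread in isolation, here is a minimal sketch in Python rather than Stata, using only the standard library. It is not the posters' code and does not reproduce what kdensity or EViews do internally. The Gaussian kernel, the fixed bandwidth h = 0.3, the simulated data, and the function names (gaussian_kde_at, even_grid, interpolate) are all assumptions made for illustration. It estimates the density on an evenly spaced grid, interpolates linearly to every observation, and compares the result with the exact estimate.

```python
import bisect
import math
import random


def gaussian_kde_at(points, data, h):
    """Exact Gaussian kernel density estimate at each value in points."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((p - d) / h) ** 2) for d in data)
        for p in points
    ]


def even_grid(lo, hi, m):
    """m evenly spaced points from lo to hi (m >= 2)."""
    step = (hi - lo) / (m - 1)
    return [lo + k * step for k in range(m)]


def interpolate(xs, grid, dens):
    """Linearly interpolate dens (given on a sorted grid) at each x in xs.

    A value that coincides with a grid point gets that point's density.
    Values outside the grid are clamped to the nearest end point.
    """
    out = []
    for x in xs:
        j = bisect.bisect_left(grid, x)  # grid[j-1] < x <= grid[j]
        if j == 0:
            out.append(dens[0])
        elif j >= len(grid):
            out.append(dens[-1])
        else:
            x0, x1 = grid[j - 1], grid[j]
            w = (x - x0) / (x1 - x0)
            out.append((1.0 - w) * dens[j - 1] + w * dens[j])
    return out


# Illustration on simulated data (hypothetical; not the data in the thread).
random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
h = 0.3  # assumed bandwidth, chosen only for the example

grid = even_grid(min(data), max(data), 100)
dens = gaussian_kde_at(grid, data, h)      # 100 kernel sums
approx = interpolate(data, grid, dens)     # cheap lookup per observation
exact = gaussian_kde_at(data, data, h)     # 1000 kernel sums, for comparison

abs_err = [abs(a - e) for a, e in zip(approx, exact)]
rel_err = [err / e for err, e in zip(abs_err, exact)]
print("max absolute error:", max(abs_err))
print("max relative error:", max(rel_err))
```

The expensive step is the kernel sum, which is done once per grid point instead of once per observation; the interpolation is a binary search per observation. Comparing the absolute and relative errors is one way to check the concern raised above about regions where the density is low.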

