Stas et al.--
The point of using ln f(x) is not that I can't estimate F(x). I could do
so either with -kdens- and pweights, then summing (as in my example code),
or with _n/_N, or maybe (_n-1/2)/_N, as Stas suggests (though this needs
fixing up for weighted data). The point is this: supposing the observed x
comes from either the lognormal or the Pareto distribution, which
distribution seems more likely given the observed statistics? I have no
algebraic expression for F if x is lognormally distributed (this is what
I meant by "can't write down F for the lognormal" in my prior post),
whereas ln f is an explicit polynomial in ln x for both distributions. So
it makes sense to use the following two facts to derive a
regression-based test:
For Pareto:
ln f(x) = (-a - 1) ln x + a ln k + ln a.
For lognormal:
ln f(x) = -(ln x)^2/(2\sigma^2) + (\mu/\sigma^2 - 1) ln x - ln(\sigma\sqrt{2\pi}) - \mu^2/(2\sigma^2).
implemented like so:
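* install -kdens- from SSC if it is not already present
cap ssc install kdens
use http://www2.bc.edu/~gottscha/mobility.dta, clear
g lnc2=ln(c2)
* weighted kernel density of ln(c2): fx holds the estimate, lnx the grid
kdens lnc2 [pw=wt], g(fx lnx) norm n(`=_N')
g lnfx=ln(fx)
g ln2x=lnx^2
* regress the log density on ln x and (ln x)^2, with robust SEs
reg lnfx lnx ln2x, r
di "significant coef on ln2x rejects Pareto"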
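One detail about the code above: -kdens- is applied to lnc2, so fx estimates the density of y = ln x rather than the density of x itself. A quick change-of-variables check (my own working, with g denoting the density of y, a symbol not used earlier in the thread) suggests this only shifts the coefficient on ln x by one and leaves the coefficient on (ln x)^2 unchanged, so the test on ln2x should be unaffected:

```latex
\begin{align*}
g(y) &= f(e^{y})\,e^{y}, \qquad y = \ln x \\
\ln g(y) &= \ln f(x) + \ln x \\
\text{Pareto } (y \ge \ln k):\quad \ln g(y) &= -a\,y + a\ln k + \ln a \\
\text{lognormal}:\quad \ln g(y) &= -\frac{y^{2}}{2\sigma^{2}} + \frac{\mu}{\sigma^{2}}\,y
  - \ln\!\bigl(\sigma\sqrt{2\pi}\bigr) - \frac{\mu^{2}}{2\sigma^{2}}
\end{align*}
```

Under the Pareto the log density of ln x is linear in ln x, and under the lognormal it is quadratic, exactly as for ln f(x).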
I understand there are tests of normality, and tests of equality of
distributions, but I am under the impression that their power tends to be
too closely tied to sample size (if you will forgive the slang), as in the
\chi^2 test of goodness-of-fit. Without a more nuanced argument, or some
extensive simulations suggesting otherwise, it is not clear to me why I
should not like the implementation I have outlined above.
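As a starting point for the kind of simulation mentioned above, here is a minimal sketch that applies the same regression to simulated Pareto and lognormal draws. It is only a sketch under stated assumptions: it uses the built-in -kdensity- instead of -kdens-, it ignores weights, and the sample size, seed, parameter values (a=2, k=1, mu=0, sigma=1) and variable names are all made up for illustration. The kernel grid points are not independent observations, so the reported standard errors should not be taken at face value. A real study would repeat this over many replications and compare rejection rates.

```stata
* Hypothetical one-replication sketch; all names and values are assumptions
clear
set seed 12345
set obs 2000

* Pareto(a=2, k=1) via the inverse CDF: x = k*(1-u)^(-1/a)
g double u  = runiform()
g double xp = (1-u)^(-1/2)

* lognormal(mu=0, sigma=1)
g double xl = exp(rnormal())

foreach v in xp xl {
    g double ln_`v' = ln(`v')
    * built-in kernel density of ln x on a 200-point grid (unweighted)
    kdensity ln_`v', generate(at_`v' f_`v') n(200) nograph
    * log density; grid points with zero estimated density become missing
    g double lnf_`v' = ln(f_`v')
    g double at2_`v' = at_`v'^2
    * same regression as above: log density on ln x and (ln x)^2
    reg lnf_`v' at_`v' at2_`v', r
    di "`v': t on squared term = " _b[at2_`v']/_se[at2_`v']
}
```

The Pareto case has a hard lower bound at ln k, where a kernel estimate is biased, so some curvature may show up there even when the true log density is linear. That may be one of the things such a simulation would need to look at.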
If someone can show me a better bit of code that tests lognormality
versus Pareto for the pweighted example data lnc2 above, perhaps using
F instead of f, and not using a kernel estimator, I am happy to change
my mind. Is there a simple test statistic based on the quantile plots
that can test what my code does? My impression is that the lognormal
and Pareto families can look quite close for some parameter values, and
be hard to distinguish, which I suppose is a problem for any test,
including mine.
On 3/6/07, Stas Kolenikov <skolenik@gmail.com> wrote:
> On 3/6/07, Austin Nichols <austinnichols@gmail.com> wrote:
> > Stas, Patrick, et al.--
> > The rationale for using ln(f(x)) instead of ln(1-F) is that I can
> > write down ln(f(x)) for both the Pareto and lognormal families, and I
> > can't write down F for the lognormal.
>
> hmm... norm( (ln(x) - mu)/sigma ), in Stata's probability distributions slang?
>
> There's a wealth of theory and tests behind various versions of
> quantile plots (NJC mentioned some of those and their implementations
> in Stata), and I tend to think those are more reputable than tests
> based on kernel estimates, for which you have non-parametric
> convergence rates, and need to worry about the optimal bandwidths. So
> the theory and inference is moderately ugly there.