# Re: st: Pareto v. lognormal

From: "Stas Kolenikov"
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Pareto v. lognormal
Date: Wed, 7 Mar 2007 09:20:43 -0600

Well, to be fair to the lognormal distribution, you would need to test
not only the coefficient on the squared ln x, but also the non-linear
equality involving the three regression parameters, as there are only
two parameters in the lognormal -- essentially, expressing mu and sigma
from two of the coefficients and substituting them into the third. See
a paper in an early Stata Journal about the sensitivity of the Wald test
to the specific forms those non-linear restrictions can take -- there
are problems.
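
To spell the restriction out: write the regression as ln f(x) = b0 + b1 ln x + b2 (ln x)^2. Matching terms with the lognormal log-density, ln f(x) = -(ln x)^2/(2 sigma^2) + (mu/sigma^2 - 1) ln x - ln(sqrt(2 pi) sigma) - mu^2/(2 sigma^2), gives sigma^2 = -1/(2 b2) and mu = -(b1 + 1)/(2 b2), so the intercept has to satisfy b0 = -0.5 ln(-pi/b2) + (b1 + 1)^2/(4 b2). A rough sketch of how that could be checked with -testnl- follows. It runs on simulated lognormal draws, uses the official -kdensity- rather than -kdens-, and the seed, sample size and variable names are all made up for illustration. The standard errors are the naive ones, and this is just one of several algebraically equivalent ways to write the restriction, which is exactly where the Wald test gets touchy.

```stata
* Sketch only: simulated data, unweighted, naive standard errors
clear
set seed 20070307
set obs 2000
* ln x ~ N(mu = 1, sigma = 0.5), so x is lognormal
generate lnx = rnormal(1, 0.5)
* kernel density of ln x, evaluated at the observed points
kdensity lnx, at(lnx) generate(gy) nograph
* change of variables: f(x) = g(ln x)/x, so ln f(x) = ln g(ln x) - ln x
generate lnfx = ln(gy) - lnx
generate ln2x = lnx^2
regress lnfx lnx ln2x, robust
* Pareto implies no curvature in ln x
test ln2x
* lognormal implies the intercept is a function of the two slopes
testnl _b[_cons] = -0.5*ln(-_pi/_b[ln2x]) + (_b[lnx] + 1)^2/(4*_b[ln2x])
```
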

And, as you noted, the standard errors are difficult to get --
whatever comes out of the -regress- command is obviously too small.
The neighboring observations become dependent, as the kernel density
estimate at each point spans a few, sometimes dozens of, observations,
so you need something like -newey- at the very least even to get
started -- but even then the sampling variability in \hat f(x) is
undercounted. I don't really know how well the bootstrap meshes with the
kernel methods; they have different rates of convergence, etc. -- I
wouldn't be so sure it is easy even to show that the bootstrap will
give consistent variance estimates.
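
For what it is worth, the -newey- idea could look something like the sketch below. Everything in it is an assumption for illustration: the data are simulated, the observations are ordered by ln x and that ordering is declared as "time" so that -newey- has something to work with, and the lag of 20 is an arbitrary guess at how many neighbors the kernel window covers. It only addresses the dependence between neighboring points, not the extra variability from estimating the density in the first place.

```stata
* Sketch only: HAC standard errors along the sorted ln x axis
clear
set seed 20070307
set obs 2000
generate lnx = rnormal(1, 0.5)
kdensity lnx, at(lnx) generate(gy) nograph
generate lnfx = ln(gy) - lnx
generate ln2x = lnx^2
* order the points by ln x and treat the rank as a time index
sort lnx
generate t = _n
tsset t
* naive heteroskedasticity-robust standard errors, for comparison
regress lnfx lnx ln2x, robust
* Newey-West standard errors; lag(20) is a hypothetical choice
newey lnfx lnx ln2x, lag(20)
```

Comparing the two sets of standard errors on the ln2x coefficient gives a feel for how much the naive ones understate the uncertainty.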

The statistics based on quantile plots, as far as I can recall, are
the Shapiro-Wilk test and its various modifications, which look at
whether the data lie on a straight line on the quantile plot or not.
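
In Stata terms, a minimal unweighted sketch of that route would be to apply the official -swilk- and -sfrancia- commands to ln x, since lognormality of x is normality of ln x. The data below are simulated and the variable name is made up. As far as I know neither command takes pweights, so this does not solve the weighted problem.

```stata
* Sketch only: normality tests on ln x, simulated and unweighted
clear
set seed 20070307
set obs 1000
generate lnx = rnormal(1, 0.5)
* Shapiro-Wilk W test
swilk lnx
* Shapiro-Francia W' test, one of the modifications
sfrancia lnx
* the quantile plot that both tests summarize
qnorm lnx
```
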

On 3/7/07, Austin Nichols <austinnichols@gmail.com> wrote:
> Stas et al.--
> The point of using ln f(x) is not that I can't estimate F(x) using
> either -kdens- and pweights, then summing (as in my example code), or
> using _n/_N as an estimate of F, or maybe (_n-1/2)/_N, as Stas
> suggests (though this needs fixing up for weighted data). The point
> is, supposing the observed x comes from either the lognormal or Pareto
> distribution, which distribution seems more likely given observed
> statistics?  Given I have no algebraic expression for F if x is
> lognormally distributed (this is what I meant by "can't write down F
> for the lognormal" in my prior post), whereas ln f is an explicit
> polynomial in ln x for both distributions, it makes sense to use the
> following two facts to derive a regression-based test:
>
> For Pareto:
>  ln f(x) = (-a - 1) ln x + a ln k + ln a.
> For lognormal:
>  ln f(x) = -(ln x)^2/(2σ^2) + (μσ^(-2) - 1) ln x - ln(√(2π) σ) - μ^2/(2σ^2).
>
> implemented like so:
>
> cap ssc install kdens
> use http://www2.bc.edu/~gottscha/mobility.dta, clear
> g lnc2=ln(c2)
> kdens lnc2 [pw=wt], g(fx lnx) norm n(`=_N')
> g lnfx=ln(fx)
> g ln2x=lnx^2
> reg lnfx lnx ln2x, r
> di "significant coef on ln2x rejects Pareto"
>
> I understand there are tests of normality, and tests of equality of
> distributions, but I am under the impression that they tend to have
> power too closely related to sample size (if you will forgive the
> slang) as in the \chi^2 test of goodness-of-fit... and it is not clear
> to me without a more nuanced argument, or some extensive simulations
> suggesting otherwise, why I should not like the implementation I have
> outlined above.
>
> If someone can show me a better bit of code that tests lognormality
> versus Pareto for the pweighted example data lnc2 above, perhaps using
> F instead of f, and not using a kernel estimator, I am happy to change
> my mind...  is there a simple test statistic based on the quantile
> plots that can test what my code does?  My impression is that the
> lognormal and Pareto families can look quite close for some parameter
> values, and be hard to distinguish, which I suppose is a problem for
> any test, including mine.
>
> On 3/6/07, Stas Kolenikov <skolenik@gmail.com> wrote:
> > On 3/6/07, Austin Nichols <austinnichols@gmail.com> wrote:
> > > Stas, Patrick, et al.--
> > > The rationale for using ln(f(x)) instead of ln(1-F) is that I can
> > > write down ln(f(x)) for both the Pareto and lognormal families, and I
> > > can't write down F for the lognormal.
> >
> > hmm... norm( (ln(x) - mu)/sigma ), in Stata's probability distributions slang?
> >
> > There's a wealth of theory and tests behind various versions of
> > quantile plots (NJC mentioned some of those and their implementations
> > in Stata), and I tend to think those are more reputable than tests
> > based on kernel estimates, for which you have non-parametric
> > convergence rates, and need to worry about the optimal bandwidths. So
> > the theory and inference is moderately ugly there.
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

--
Stas Kolenikov
http://stas.kolenikov.name