Statalist The Stata Listserver

st: RE: RE: RE: RE: RE: Dependent variable [with zero mass point]

From   "Feiveson, Alan H. (JSC-SK311)"
Subject   st: RE: RE: RE: RE: RE: Dependent variable [with zero mass point]
Date   Wed, 6 Sep 2006 09:14:23 -0500

Nick - I have no quarrel with your thoughts on absolutely continuous
distributions that have infinite range in both directions (by the way,
a skew-normal is another alternative). However, if there is a mass at zero
(or at any other discrete point or points), these models would not apply.
In those cases a separate model explaining the mass at zero is needed. One
can then use one of your proposed models to explain the distribution
conditional on a non-zero value. From what I have seen on Statalist, it
sounds as if the Heckman model might be useful for some of these situations.
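
A minimal sketch of that two-part idea, on simulated data. The variable
names y, x1 and nonzero are hypothetical, and a plain -logit- plus -regress-
on the non-zeros is used here only as the simplest stand-in; it is not
-heckman- and makes no correction for selection.

```stata
* Simulate a response with a mass point at zero and values of both signs.
clear
set obs 500
set seed 12345
generate double x1 = rnormal()
generate double y = cond(runiform() < invlogit(x1), x1 + rnormal(), 0)

* Part 1: model the probability of a non-zero response.
generate byte nonzero = (y != 0)
logit nonzero x1

* Part 2: model the response conditional on its being non-zero.
regress y x1 if nonzero
```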

Al F.

-----Original Message-----
On Behalf Of Nick Cox
Sent: Wednesday, September 06, 2006 8:08 AM
Subject: st: RE: RE: RE: RE: Dependent variable [with zero mass point]

Reminds me that I wrote a few expository paragraphs a while back on one
device for variables with values of both signs. 
Here is a slightly edited reprise of part of transint.hlp (downloadable
from SSC as -transint-): 

Transformations for variables that are both positive and negative 

    Most of the literature on transformations focuses on one or both of two related
    situations: the variable concerned is strictly positive; or it is zero or positive. If
    the first situation does not hold, some transformations do not yield real number
    results (notably, logarithms and reciprocals); if the second situation does not hold,
    then some other transformations do not yield real number results or more generally do
    not appear useful (notably, cube roots, square roots or squares).

    However, in some situations response variables in particular can be both positive and
    negative. This is common whenever the response is a balance, change, difference or
    derivative. Although such variables are often skew, the most awkward property that may
    invite transformation is heavy (long or fat) tails, high kurtosis in one terminology.
    Zero usually has a strong substantive meaning, so that we wish to preserve the
    distinction between negative, zero and positive values. (Note that Celsius or
    Fahrenheit temperatures do not really qualify here, as their zero points are
    statistically arbitrary, for all the importance of whether water melts or freezes.)

    In these circumstances, experience with right-skewed and strictly positive variables
    might suggest looking for a transformation that behaves like ln x when x is positive
    and like -ln(-x) when x is negative.  This still leaves the problem of what to do with
    zeros. In addition, it is clear from any sketch that (in Stata terms)

        cond(x <= 0, -ln(-x), ln(x))

    would be useless, as it is undefined at zero and runs off to infinity on both sides
    of it. One way forward is to use

        -ln(-x + 1)    if x <= 0,
        ln(x + 1)      if x > 0.

    This can also be written

        sign(x) ln(|x| + 1)

    where sign(x) is 1 if x > 0, 0 if x == 0 and -1 if x < 0.  This function passes through
    the origin, behaves like x for small x, positive and negative, and like sign(x)
    ln(abs(x)) for large |x|.  The gradient is steepest, at 1, at x = 0, so the
    transformation pulls in extreme values relative to those near the origin.  It has
    recently been dubbed the neglog transformation (Whittaker et al. 2005).  An earlier
    reference is John and Draper (1980).  In Stata language, this could be

        cond(x <= 0, -ln(-x + 1), ln(x + 1))

    or

        sign(x) * ln(abs(x) + 1)

    The inverse transformation is

        cond(t <= 0, 1 - exp(-t), exp(t) - 1)
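
    As a quick check that the transformation and its inverse agree, here is a minimal
    do-file sketch on simulated data. The variable names x, t and xback are illustrative
    assumptions, as is the choice of a cubed normal to get heavy tails of both signs.

```stata
* Simulated heavy-tailed variable with both signs and some exact zeros.
clear
set obs 1000
set seed 2006
generate double x = 10 * rnormal()^3
replace x = 0 in 1/50

* neglog transformation and its inverse, as given above.
generate double t = sign(x) * ln(abs(x) + 1)
generate double xback = cond(t <= 0, 1 - exp(-t), exp(t) - 1)

* The round trip should reproduce x up to rounding error,
* and zero should map to exactly zero.
assert reldif(x, xback) < 1e-12
assert t == 0 if x == 0
```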

    A suitable generalisation to powers p other than 0 is

        -[(-x + 1)^p - 1] / p    if x <= 0, 
          [(x + 1)^p - 1] / p    if x > 0. 
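
    A minimal sketch of this power family in Stata, for one illustrative choice p = 0.5.
    The local macro p and the variable name tp are assumptions made for the example.

```stata
* Integers from -10 to 10 as a small test variable.
clear
set obs 21
generate double x = _n - 11

* Signed power transformation for a nonzero power p.
local p = 0.5
generate double tp = cond(x <= 0, -((-x + 1)^`p' - 1) / `p', ((x + 1)^`p' - 1) / `p')

* Spot check: x = 3 gives (4^0.5 - 1)/0.5 = 2, and x = -3 gives -2.
assert abs(tp - 2) < 1e-12 if x == 3
assert abs(tp + 2) < 1e-12 if x == -3
```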

    Transformations that affect skewness as well as heavy tails in variables that are both
    positive and negative were discussed by Yeo and Johnson (2000).

    Another possibility in this terrain is to apply the inverse hyperbolic function arsinh
    (also known as arg sinh, sinh^-1 and arcsinh).  This is the inverse of the sinh
    function, which in turn is defined as

        sinh(x) = (exp(x) - exp(-x)) / 2. 

    The sinh and arsinh functions can be computed in Mata as sinh(x) and asinh(x) and in
    Stata as (exp(x) - exp(-x))/2 and ln(x + sqrt(x^2 + 1)).

    The arsinh function also passes through the origin and is steepest at the origin.
    For large |x| it behaves like sign(x) ln(|2x|).  So in practice neglog(x) and arsinh(x)
    have loosely similar effects. See also Johnson (1949).
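
    To see the two side by side, here is a small sketch on a handful of values. The
    variable names neglog and arsinh are illustrative. One departure from the formula
    above, made on my own account: arsinh is computed as sign(x) * ln(abs(x) + sqrt(x^2 + 1)),
    which is the same function (arsinh is odd) but avoids subtracting two nearly equal
    numbers when x is large and negative.

```stata
* A few values of both signs, including zero.
clear
input double x
-1000
-10
0
10
1000
end

generate double neglog = sign(x) * ln(abs(x) + 1)
generate double arsinh = sign(x) * ln(abs(x) + sqrt(x^2 + 1))
list x neglog arsinh, noobs

* For large |x|, |arsinh(x)| is close to ln(2|x|).
assert abs(abs(arsinh) - ln(2 * abs(x))) < 1e-6 if abs(x) == 1000
```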


    John, J.A. and N.R. Draper. 1980.  An alternative family of transformations.
        Applied Statistics 29: 190-197.

    Johnson, N.L. 1949.  Systems of frequency curves generated by methods of translation.
        Biometrika 36: 149-176.

    Whittaker, J., J. Whitehead and M. Somers. 2005.  The neglog transformation and
        quantile regression for the analysis of a large credit scoring database.
        Applied Statistics 54: 863-878.

    Yeo, I. and R.A. Johnson. 2000.  A new family of power transformations to improve
        normality or symmetry.  Biometrika 87: 954-959.



Nick Cox
> It may be that the zeros represent a _qualitatively_ different subset 
> deserving separate modelling. But that would be substantive knowledge 
> and isn't given here.
Feiveson, Alan
> > I would think that you would need distinct models for the 
> > probability of a zero and for the conditional non-zero 
> > distribution. Perhaps something like -heckman- might work. Even if 
> > the zero is not important, you can't use something like a normal 
> > distribution to model the variable unconditionally.
> Nick Cox
> > Without more known about the underlying science, it is difficult to
> > comment.
> > 
> > But one answer is that you don't necessarily need to do anything
> > special. It is the conditional distribution of response given 
> > predictors that is the stochastic side of modelling, not the
> > unconditional distribution. Besides, a spike near the middle is not
> > much of a pathology compared with one at an extreme. 
> Francesca Gagliardi
> > > I would be grateful if anyone could give me suggestions on how to
> > > deal with a dependent variable that has a mass point at zero and is 
> > > continuously distributed over negative and positive values. In such
> > > a case, which is the most appropriate model to estimate?
