Statalist The Stata Listserver

st: RE: RE: RE: RE: Dependent variable [with zero mass point]

From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: RE: RE: RE: Dependent variable [with zero mass point]
Date   Wed, 6 Sep 2006 14:07:52 +0100

Reminds me that I wrote a few expository paragraphs 
a while back on one device for variables with values of both signs. 
Here is a slightly edited reprise of part of transint.hlp
(downloadable from SSC as -transint-): 

Transformations for variables that are both positive and negative 

    Most of the literature on transformations focuses on one or both of two related
    situations: the variable concerned is strictly positive; or it is zero or positive. If
    the first situation does not hold, some transformations do not yield real number
    results (notably, logarithms and reciprocals); if the second situation does not hold,
    then some other transformations do not yield real number results or more generally do
    not appear useful (notably, cube roots, square roots or squares).

    However, in some situations response variables in particular can be both positive and
    negative. This is common whenever the response is a balance, change, difference or
    derivative. Although such variables are often skew, the most awkward property that may
    invite transformation is heavy (long or fat) tails, high kurtosis in one terminology.
    Zero usually has a strong substantive meaning, so that we wish to preserve the
    distinction between negative, zero and positive values. (Note that Celsius or
    Fahrenheit temperatures do not really qualify here, as their zero points are
    statistically arbitrary, for all the importance of whether water melts or freezes.)

    In these circumstances, experience with right-skewed and strictly positive variables
    might suggest looking for a transformation that behaves like ln(x) when x is positive
    and like -ln(-x) when x is negative.  This still leaves the problem of what to do with
    zeros. In addition, it is clear from any sketch that (in Stata terms)

        cond(x <= 0, -ln(-x), ln(x))

    would be useless. One way forward is to use

        -ln(-x + 1)    if x <= 0, 
         ln(x + 1)     if x > 0.  

    This can also be written

        sign(x) ln(|x| + 1)

    where sign(x) is 1 if x > 0, 0 if x == 0 and -1 if x < 0.  This function passes through
    the origin, behaves like x for small |x|, positive and negative, and like sign(x)
    ln(|x|) for large |x|.  The gradient, 1 / (|x| + 1), is steepest at x = 0, where it
    equals 1, so the transformation pulls in extreme values relative to those near the
    origin.  It has
    recently been dubbed the neglog transformation (Whittaker et al. 2005).  An earlier
    reference is John and Draper (1980).  In Stata language, this could be

        cond(x <= 0, -ln(-x + 1), ln(x + 1))

    or

        sign(x) * ln(abs(x) + 1) 

    The inverse transformation is

        cond(t <= 0, 1 - exp(-t), exp(t) - 1)
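
    As a cross-check outside Stata, here is a minimal Python sketch of neglog and its
    inverse. It is not part of the original help file; the function names are
    illustrative only, and neglog_cond simply mirrors the cond() expression above.

```python
import math

def neglog(x):
    """sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)

def neglog_cond(x):
    """Same transformation written like the Stata cond() expression."""
    return -math.log(-x + 1.0) if x <= 0 else math.log(x + 1.0)

def neglog_inverse(t):
    """Inverse of neglog: sign(t) * (exp(|t|) - 1)."""
    return math.copysign(math.expm1(abs(t)), t)

# Round trip: applying the inverse should recover x up to rounding error.
for x in (-1000.0, -2.5, 0.0, 2.5, 1000.0):
    t = neglog(x)
    print(x, t, neglog_inverse(t))
```

    The sketch uses log1p and expm1 rather than log(1 + a) and exp(a) - 1 only to keep
    accuracy for values very near zero; the transformation itself is unchanged.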

    A suitable generalisation to powers p other than 0 is

        -[(-x + 1)^p - 1] / p    if x <= 0, 
          [(x + 1)^p - 1] / p    if x > 0. 
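
    The power family can be sketched in Python in the same way. Again this is only an
    illustration under the definitions above, and neglog_power is a hypothetical name.
    The loop shows that as p approaches 0 the result approaches ln(|x| + 1) with the
    sign of x, which is why neglog plays the role of the power 0.

```python
import math

def neglog_power(x, p):
    """sign(x) * ((|x| + 1)**p - 1) / p, for p != 0."""
    if p == 0:
        raise ValueError("p must be nonzero; use the logarithmic form for p = 0")
    return math.copysign(((abs(x) + 1.0) ** p - 1.0) / p, x)

# With p = 1 the transformation is the identity; small p approaches neglog.
x = 5.0
for p in (1.0, 0.5, 0.1, 0.001):
    print(p, neglog_power(x, p), neglog_power(-x, p))
print("log form:", math.log(abs(x) + 1.0))
```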

    Transformations that affect skewness as well as heavy tails in variables that are both
    positive and negative were discussed by Yeo and Johnson (2000).

    Another possibility in this terrain is to apply the inverse hyperbolic sine function arsinh
    (also known as arg sinh, sinh^-1 and arcsinh).  This is the inverse of the sinh
    function, which in turn is defined as

        sinh(x) = (exp(x) - exp(-x)) / 2. 

    The sinh and arsinh functions can be computed in Mata as sinh(x) and asinh(x) and in
    Stata as (exp(x) - exp(-x))/2 and ln(x + sqrt(x^2 + 1)).

    The arsinh function also passes through the origin and is steepest at the origin.
    For large |x| it behaves like sign(x) ln(|2x|).  So in practice neglog(x) and arsinh(x)
    have loosely similar effects. See also Johnson (1949).
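
    A short Python sketch can confirm these statements numerically. It is an
    illustration only: arsinh_explicit is a hypothetical name for the logarithmic
    formula quoted above, compared here with Python's built-in math.asinh, and neglog
    is redefined so that the block stands alone.

```python
import math

def arsinh_explicit(x):
    """ln(x + sqrt(x^2 + 1)), the explicit formula for arsinh."""
    return math.log(x + math.sqrt(x * x + 1.0))

def neglog(x):
    """sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)

# Columns: x, arsinh by formula, arsinh built in, neglog, arsinh - neglog.
for x in (-1000.0, -10.0, -1.0, 0.0, 1.0, 10.0, 1000.0):
    a = math.asinh(x)
    print(x, arsinh_explicit(x), a, neglog(x), a - neglog(x))
```

    For large |x| the difference between the two settles near sign(x) ln(2), which
    follows from ln(|2x|) versus ln(|x| + 1). One caution, which is an observation
    about floating-point arithmetic and not something from the help file: for very
    large negative x the explicit formula subtracts nearly equal numbers, so applying
    it to |x| and restoring the sign afterwards is numerically safer.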


    John, J.A. and N.R. Draper. 1980.  An alternative family of transformations.  Applied
        Statistics 29: 190-197.

    Johnson, N.L. 1949.  Systems of frequency curves generated by methods of translation.
        Biometrika 36: 149-176.

    Whittaker, J., J. Whitehead and M. Somers. 2005.  The neglog transformation and
        quantile regression for the analysis of a large credit scoring database.  Applied
        Statistics 54: 863-878.

    Yeo, I. and R.A. Johnson. 2000.  A new family of power transformations to improve
        normality or symmetry.  Biometrika 87: 954-959.


[email protected] 

Nick Cox
> It may be that the zeros represent a _qualitatively_ 
> different subset deserving separate modelling. But that would 
> be substantive knowledge and isn't given here. 
Feiveson, Alan
> > I would think that you would need distinct models for the
> > probability of a zero and for the conditional non-zero
> > distribution. Perhaps something like -heckman- might work. Even
> > if the zero is not important, you can't use something like a
> > normal distribution to model the variable unconditionally.
> Nick Cox
> > Without more known about the underlying science, it is difficult to
> > comment.
> > 
> > But one answer is that you don't necessarily need to do anything
> > special. It is the conditional distribution of response given
> > predictors that is the stochastic side of modelling, not the
> > unconditional distribution. Besides, a spike near the middle is not
> > much of a pathology compared with one at an extreme.
> Francesca Gagliardi
> > > I would be grateful if anyone could give me suggestions on how to
> > > deal with a dependent variable that has a mass point at zero and is
> > > continuously distributed over negative and positive values. In such
> > > a case, which is the most appropriate model to estimate?
