# st: RE: RE: RE: RE: RE: Dependent variable [with zero mass point]

 From "Feiveson, Alan H. (JSC-SK311)" <[email protected]> To <[email protected]> Subject st: RE: RE: RE: RE: RE: Dependent variable [with zero mass point] Date Wed, 6 Sep 2006 09:14:23 -0500

```Nick - I have no quarrel with your thoughts on absolutely continuous
distributions that have infinite range in both directions (by the way,
a skew-normal is another alternative). However, if there is a mass at
zero (or at any other discrete point or points), these models would not
apply. In those cases a separate model explaining the mass at zero is
needed. One can then use one of your proposed models to explain the
distribution conditional on a non-zero value. From what I have seen on
Statalist, it sounds as if the Heckman model might be useful for some of
these situations.

Al F.

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Nick Cox
Sent: Wednesday, September 06, 2006 8:08 AM
To: [email protected]
Subject: st: RE: RE: RE: RE: Dependent variable [with zero mass point]

Reminds me that I wrote a few expository paragraphs a while back on one
device for variables with values of both signs (downloadable
from SSC as -transint-):

-------------------------------------
Transformations for variables that are both positive and negative

Most of the literature on transformations focuses on one or both of two
related situations: the variable concerned is strictly positive; or it
is zero or positive. If the first situation does not hold, some
transformations do not yield real number results (notably, logarithms
and reciprocals); if the second situation does not hold, then some other
transformations do not yield real number results or more generally do
not appear useful (notably, cube roots, square roots or squares).

However, in some situations response variables in particular can be both
positive and negative. This is common whenever the response is a
balance, change, difference or derivative. Although such variables are
often skew, the most awkward property that may invite transformation is
heavy (long or fat) tails, high kurtosis in one terminology. Zero
usually has a strong substantive meaning, so that we wish to preserve
the distinction between negative, zero and positive values. (Note that
Celsius or Fahrenheit temperatures do not really qualify here, as their
zero points are statistically arbitrary, for all the importance of
whether water melts or freezes.)

In these circumstances, experience with right-skewed and strictly
positive variables might suggest looking for a transformation that
behaves like ln(x) when x is positive and like -ln(-x) when x is
negative. This still leaves the problem of what to do with zeros. In
addition, it is clear from any sketch that (in Stata terms)

cond(x <= 0, -ln(-x), ln(x))

would be useless. One way forward is to use

-ln(-x + 1)    if x <= 0,
ln(x + 1)      if x > 0.

This can also be written

sign(x) ln(|x| + 1)

where sign(x) is 1 if x > 0, 0 if x == 0 and -1 if x < 0. This function
passes through the origin, behaves like x for small x, positive and
negative, and like sign(x) ln(|x|) for large |x|. The gradient is
steepest, at 1, at x = 0, so the transformation pulls in extreme values
relative to those near the origin. It has recently been dubbed the
neglog transformation (Whittaker et al. 2005). An earlier reference is
John and Draper (1980). In Stata language, this could be

cond(x <= 0, -ln(-x + 1), ln(x + 1))

or

sign(x) * ln(abs(x) + 1)

The inverse transformation is

cond(t <= 0, 1 - exp(-t), exp(t) - 1)

A suitable generalisation for powers other than 0 is

-[(-x + 1)^p - 1] / p    if x <= 0,
[(x + 1)^p - 1] / p    if x > 0.

Transformations that affect skewness as well as heavy tails in variables
that are both positive and negative were discussed by Yeo and Johnson
(2000).

Another possibility in this terrain is to apply the inverse hyperbolic
function arsinh (also known as arg sinh, sinh^-1 and arcsinh). This is
the inverse of the sinh function, which in turn is defined as

sinh(x) = (exp(x) - exp(-x)) / 2.

The sinh and arsinh functions can be computed in Mata as sinh(x) and
asinh(x) and in Stata as (exp(x) - exp(-x))/2 and ln(x + sqrt(x^2 + 1)).

The arsinh function too passes through the origin and is steepest at the
origin.
For large |x| it behaves like sign(x) ln(|2x|). So in practice neglog(x)
and arsinh(x) behave similarly.

References

John, J.A. and N.R. Draper. 1980. An alternative family of
transformations. Applied Statistics 29: 190-197.

Johnson, N.L. 1949. Systems of frequency curves generated by methods of
translation. Biometrika 36: 149-176.

Whittaker, J., J. Whitehead and M. Somers. 2005. The neglog
transformation and quantile regression for the analysis of a large
credit scoring database. Applied Statistics 54: 863-878.

Yeo, I. and R.A. Johnson. 2000. A new family of power transformations to
improve normality or symmetry. Biometrika 87: 954-959.

-----------------------------------------------------

Nick
[email protected]

Nick Cox
>
> It may be that the zeros represent a _qualitatively_ distinct subset deserving
> separate modelling. But that would be substantive knowledge and isn't
> given here.

Feiveson, Alan

> > I would think that you would need distinct models for the
> > probability of a zero and for the conditional non-zero
> > distribution. Perhaps something like -heckman- might work. Even if
> > the zero is not important, you can't use something like a normal
> > distribution to model the variable unconditionally.
>
> Nick Cox
>
> > Without more known about the underlying science, it is difficult to
> > comment.
> >
> > But one answer is that you don't necessarily need to do anything
> > special. It is the conditional distribution of response given
> > predictors that is the stochastic side of modelling, not the
> > unconditional distribution. Besides, a spike near the middle is not
> > much of a pathology compared with one at an extreme.
>
> Francesca Gagliardi
>
> > > I would be grateful if anyone could give me suggestions on how to
> > > deal with a dependent variable that has a mass point at zero and
> > > is continuously distributed over negative and positive values. In
> > > such a case, which is the most appropriate model to estimate?

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

```
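
The neglog and arsinh transformations discussed in the quoted text can be tried out with a short do-file. What follows is only a minimal sketch under stated assumptions: the seven values of `x` are made up for illustration, and the variable names `neglog`, `neglog2`, `back` and `arsinh` are hypothetical, not taken from -transint-. It checks that the piecewise form agrees with the `sign()` form, that the inverse recovers `x`, and that arsinh and neglog differ by roughly sign(x) ln 2 for large |x|.

```stata
* Illustrative data only: a few values of both signs, plus zero
clear
input double x
-1000
-10
-1
0
1
10
1000
end

* neglog in its sign() form: sign(x) * ln(|x| + 1)
generate double neglog = sign(x) * ln(abs(x) + 1)

* the piecewise form should give the same values
generate double neglog2 = cond(x <= 0, -ln(-x + 1), ln(x + 1))
assert reldif(neglog, neglog2) < 1e-12

* the inverse transformation should recover x
generate double back = cond(neglog <= 0, 1 - exp(-neglog), exp(neglog) - 1)
assert reldif(back, x) < 1e-10

* arsinh via the closed form quoted in the text
generate double arsinh = ln(x + sqrt(x^2 + 1))

* for large |x| the two transforms differ by about sign(x) * ln(2)
assert abs(arsinh - neglog - sign(x) * ln(2)) < 0.01 if abs(x) >= 1000

list x neglog arsinh, clean noobs
```

Both transformed variables are zero at x = 0 and keep the sign of x, so the distinction between negative, zero and positive values survives. The closed form for arsinh loses some precision for large negative x because of cancellation; that is harmless at this scale but worth bearing in mind for extreme values.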