Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Reconcile Log Transformed with Untransformed Results

From	Erasmo Giambona <[email protected]>
To	[email protected]
Subject	Re: st: Reconcile Log Transformed with Untransformed Results
Date	Wed, 3 Mar 2010 10:20:08 +0100
Thank you all very much!
Erasmo

On Thu, Feb 25, 2010 at 11:09 PM, Nick Cox <[email protected]> wrote:
> I agree with Austin, but a little more can be said. Extracts follow from -transint.hlp- which is part of -transint- (SSC), updated slightly. I don't think any of this rescues Erasmo from his predicament, as he wants to keep all the properties of logarithms even though some of the requisite assumptions do not apply to his situation.
>
> % start extract
>
>    Most of the literature on transformations focuses on one or both of two related
>    situations: the variable concerned is strictly positive; or it is zero or positive.
>    If the first situation does not hold, some transformations do not yield real number
>    results (notably, logarithms and reciprocals); if the second situation does not
>    hold, then some other transformations do not yield real number results or more
>    generally do not appear useful (notably, square roots or squares).
>
>    However, in some situations response variables in particular can be both positive
>    and negative. This is common whenever the response is a balance, change, difference
>    or derivative. Although such variables are often skew, the most awkward property
>    that may invite transformation is heavy (long or fat) tails, high kurtosis in one
>    terminology.  Zero usually has a strong substantive meaning, so that we wish to
>    preserve the distinction between negative, zero and positive values. (Note that
>    Celsius or Fahrenheit temperatures do not really qualify here, as their zero points
>    are statistically arbitrary, for all the importance of whether water melts or
>    freezes.)
>
>    In these circumstances, experience with right-skewed and strictly positive variables
>    might suggest looking for a transformation that behaves like ln x when x is positive
>    and like -ln(-x) when x is negative.  This still leaves the problem of what to do
>    with zeros. In addition, it is clear from any sketch that (in Stata terms)
>
>        cond(x <= 0, -ln(-x), ln(x))
>
>    would be useless. One way forward is to use
>
>        -ln(-x + 1)    if x <= 0,
>        ln(x + 1)     if x > 0.
>
>    This can also be written
>
>        sign(x) ln(|x| + 1)
>
>    where sign(x) is 1 if x > 0, 0 if x == 0 and -1 if x < 0.  This function passes
>    through the origin, behaves like x for small x, positive and negative, and like
>    sign(x) ln(abs(x)) for large |x|.  The gradient is steepest at 1 at x = 0, so the
>    transformation pulls in extreme values relative to those near the origin.  It has
>    recently been dubbed the neglog transformation (Whittaker et al. 2005).  An earlier
>    reference is John and Draper (1980).  In Stata language, this could be
>
>        cond(x <= 0, -ln(-x + 1), ln(x + 1))
>
>    or
>
>        sign(x) * ln(abs(x) + 1)
>
>    The inverse transformation is
>
>        cond(t <= 0, 1 - exp(-t), exp(t) - 1)
>
>    A suitable generalisation to powers other than 0 is
>
>        -[(-x + 1)^p - 1] / p    if x <= 0,
>          [(x + 1)^p - 1] / p    if x > 0.
>
>    Transformations that affect skewness as well as heavy tails in variables that are
>    both positive and negative were discussed by Yeo and Johnson (2000).
>
>    Another possibility in this terrain is to apply the inverse hyperbolic function
>    arsinh (also known as arg sinh, sinh^-1 and arcsinh).  This is the inverse of the
>    sinh function, which in turn is defined as
>
>        sinh(x) = (exp(x) - exp(-x)) / 2.
>
>    The sinh and arsinh functions can be computed in Mata and Stata as sinh(x) and
>    asinh(x).
>
>    The arsinh function also passes through the origin and is steepest at the
>    origin.  For large |x| it behaves like sign(x) ln(|2x|).  So in practice neglog(x)
>    and arsinh(x) have loosely similar effects. See also Johnson (1949).
>
> % end extract
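
Editor's sketch, not part of the original posts: the neglog and arsinh formulas quoted in the extract can be tried on a few made-up values on both sides of zero. The variable names (x, neglog, arsinh, back, gap) and the values are assumptions for illustration only; they are not Erasmo's data. The check confirms that the quoted inverse recovers x, and the last column shows the difference between the two transformations, which approaches ln(2) as |x| grows.

```stata
* Illustrative values only: -100, -75, ..., 100
clear
set obs 9
generate double x = (_n - 5) * 25

* neglog and arsinh as written in the extract
generate double neglog = sign(x) * ln(abs(x) + 1)
generate double arsinh = asinh(x)

* apply the inverse of neglog and confirm the round trip recovers x
generate double back = cond(neglog <= 0, 1 - exp(-neglog), exp(neglog) - 1)
assert reldif(x, back) < 1e-10

* both pass through the origin; for large |x| they differ by about ln(2)
generate double gap = abs(arsinh) - abs(neglog)
list x neglog arsinh gap, noobs
```

Neither transformation restores the elasticity interpretation of a log-log regression, which is the point made in the replies quoted here.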
>
> References
>
>    John, J.A. and N.R. Draper. 1980.  An alternative family of transformations.
>        Applied Statistics 29: 190-197.
>
>    Johnson, N.L. 1949.  Systems of frequency curves generated by methods of
>        translation.  Biometrika 36: 149-176.
>
>    Whittaker, J., J. Whitehead and M. Somers. 2005.  The neglog transformation and
>        quantile regression for the analysis of a large credit scoring database.
>        Applied Statistics 54: 863-878.
>
>    Yeo, I. and R.A. Johnson. 2000.  A new family of power transformations to improve
>        normality or symmetry.  Biometrika 87: 954-959.
>
> Nick
> [email protected]
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Austin Nichols
> Sent: 25 February 2010 21:59
> To: [email protected]
> Subject: Re: st: Reconcile Log Transformed with Untransformed Results
>
> Erasmo Giambona <[email protected]> :
> No need to add "in economic terms" as the result is simply not interpretable.
> To restate my objection from Feb 13:
> a regression of ln(y+1) on ln(x+1) does not estimate
> an elasticity, and a change from -0.45 to +0.4 does not correspond to
> any well-defined percentage point change.  If you are unsure of the
> correct functional form, consider -lpoly- or -fracpoly- or -mkspline-
> or -pspline- (on SSC).
>
> Why not simply estimate a linear regression with OLS and plot your 16
> points as well, both with and without the outlier you don't like?
>
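> sysuse auto, clear
> keep in 1/16
> * rescale so that both variables take values on either side of zero
> replace mpg=mpg/20-1
> replace weight=weight/3300-1
> * raw scale: scatter plus linear fits with and without observation 13
> sc mpg weight ||lfit mpg weight||lfit mpg weight if _n!=13
> * the same plot after adding one and taking logs
> g y=ln(mpg+1)
> g x=ln(weight+1)
> sc y x ||lfit y x||lfit y x if _n!=13, name(why)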
>
> I can't see why you are ever adding one and taking logs--there is no
> justification for it that I have seen.
>
> On Thu, Feb 25, 2010 at 12:32 PM, Erasmo Giambona <[email protected]> wrote:
>> Thanks Tony. Actually, I take the log of 1+y. Yes, I tried glm with a
>> log link and that helps as well. The issue is that I found it
>> difficult to interpret the results in economic terms. All the details
>> are in the previous emails.
>> Erasmo
>>
>> On Thu, Feb 25, 2010 at 6:24 PM, Lachenbruch, Peter
>> <[email protected]> wrote:
>>> Since one of your y's is negative, -0.03, why should taking logs help? Would a glm with a log link help?
>>>
>>> Tony
>>>
>>> Peter A. Lachenbruch
>>> Department of Public Health
>>> Oregon State University
>>> Corvallis, OR 97330
>>> Phone: 541-737-3832
>>> FAX: 541-737-4001
>>>
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Erasmo Giambona
>>> Sent: Thursday, February 25, 2010 4:32 AM
>>> To: [email protected]
>>> Subject: Re: st: Reconcile Log Transformed with Untransformed Results
>>>
>>> Thanks Austin. I have been traveling so it has been difficult to look
>>> into this issue. To answer your question: I am using a two-step
>>> procedure that is sometimes used in monetary policy research. My y is a
>>> coefficient estimated from a panel regression using firm-level data.
>>> This is the first step. y ranges from -0.03 to +0.07 (with mean=0.023,
>>> median=0.024, st dev=0.028, skew=-.37, kurt= 2.52). I have 16 y's, one
>>> per year. In the second step I regress y on x, where x is an annual
>>> interest rate spread ranging from -.95% to 1.15% (with mean=3.96e-07,
>>> median=.0004551, st dev=.6426913, skew=.1102487, kurt= 2.15). The
>>> scatter of y on x clearly shows that y increases with x, but there is
>>> one obs (out of the 16) with a very low x and a very high y. I am
>>> taking the logs to try to reduce the effect of this obs. I thought this
>>> was more parsimonious than the alternative of dropping the obs,
>>> and winsorizing seems unfeasible with 16 obs.
>>>
>>> Any additional thoughts would be appreciated,
>>>
>>> Erasmo
>>>
>>> On Tue, Feb 16, 2010 at 6:11 PM, Austin Nichols <[email protected]> wrote:
>>>> Erasmo Giambona <[email protected]>:
>>>> As I already pointed out, I doubt your estimates correspond to any
>>>> well-defined percentage point change.  Perhaps you can give us a
>>>> better sense of the distributions of the untransformed y and x (and
>>>> what they measure and in what units), and what the scatterplot of y
>>>> against x looks like.  You may also prefer to state your effects in
>>>> terms of standard deviations rather than the interquartile range.
>>>>
>>>> On Tue, Feb 16, 2010 at 9:39 AM, Erasmo Giambona <[email protected]> wrote:
>>>>> Thanks Maarten. In this example, OLS and GLM give very similar
>>>>> economic effects. In fact, 74 cents for the OLS is really 9.52%
>>>>> relative to the mean wage of 7.77. This 9.52% is very much in line
>>>>> with the 9.7% found with GLM. In my case, the coeff. on X for the OLS
>>>>> is 0.0064. Relative to the mean of the LHS variable, 0.02, this is
>>>>> an economic effect of about 28%. With the GLM, using exactly your
>>>>> code, X gets a coefficient of 2.025 or a 102.5% increase in Y. Or
>>>>> perhaps I am misinterpreting this coefficient.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Erasmo
>>>>>
>>>>> On Mon, Feb 15, 2010 at 9:22 AM, Maarten Buis <[email protected]> wrote:
>>>>>> --- On Sun, 14/2/10, Erasmo Giambona wrote:
>>>>>>> I ran the regressions with both RHS and LHS untransformed
>>>>>>> using both OLS and GLM with link(log). With the OLS the
>>>>>>> coeff on X is 0.006 while with the GLM the coefficient is
>>>>>>> 0.700. I find it a bit hard to interpret the GLM coefficient.
>>>>>>
>>>>>> Consider the example below:
>>>>>>
>>>>>> *--------------- begin example -----------------
>>>>>> sysuse nlsw88, clear
>>>>>> gen byte baseline =1
>>>>>>
>>>>>> reg wage grade
>>>>>> glm wage grade baseline,  ///
>>>>>>    link(log) eform nocons
>>>>>> *--------------- end example --------------------
>>>>>>
>>>>>>
>>>>>> The -regress- results are interpreted as follows:
>>>>>> People without education can expect a wage of
>>>>>> -1.96 dollars an hour (substantively we know that
>>>>>> people hardly ever pay for the privilege to work,
>>>>>> so this is a sign of bad model fit), and they get
>>>>>> 74 cents an hour more for every additional year of
>>>>>> education.
>>>>>>
>>>>>> The -glm- results are interpreted as follows:
>>>>>> People without education can expect a wage of
>>>>>> 2.25 dollars an hour, and for every additional
>>>>>> year of education they can expect an increase
>>>>>> of 9.7%.
>>>>>>
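
Editor's sketch, not part of the original posts: the comparison discussed in this exchange (the OLS slope expressed relative to the mean wage, against the percentage change implied by the log-link GLM) can be computed from stored results rather than by hand. This assumes the same nlsw88 example; it fits the GLM with an ordinary constant instead of the baseline variable with nocons, which should give the same coefficient on grade. The thread reports roughly 9.5% and 9.7% for these two quantities.

```stata
sysuse nlsw88, clear

* OLS: slope in dollars per year of education, then as a share of mean wage
regress wage grade
summarize wage if e(sample), meanonly
display "OLS slope as % of mean wage: " %5.2f 100*_b[grade]/r(mean)

* GLM with log link: exp(b) is the multiplicative effect of one more year
glm wage grade, link(log)
display "GLM % change per year:        " %5.2f 100*(exp(_b[grade])-1)
```

With eform, -glm- already reports exp(b), so a reported coefficient of 1.097 corresponds to a 9.7% increase per unit of the regressor, and a reported 2.025 would correspond to a 102.5% increase per unit.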
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
