Adjusted summary statistics for logarithmic regressions
-------------------------------------------------------

        by Richard Goldstein, Qualitas, Brighton, MA
        EMAIL goldst@@harvarda.bitnet

        ^logsumm^ varlist

Because of the types of calculations that must be made, ^if^ and ^in^ are
^NOT^ allowed; instead, ^drop^ the cases that you don't want to use in the
regression (or ^keep^ only those cases that you want).

Choosing a functional form is one of the hardest modeling decisions in
regression analysis, and one of the least amenable to automation.  Probably
the most important criterion is the analyst's substantive, or theoretical,
knowledge of the situation.  However, this is rarely sufficient in itself.
A number of tools have been devised to help analysts choose the appropriate
functional form.  This ado-file presents a number of those tools in one
package for the special, but widely applicable, case of choosing between a
linear and a log-linear form.

The general situation involves a choice among at least the following four
forms:

        1. Linear:            y = b0 + b1*X1
        2. Semi-logarithmic:  log(y) = b0 + b1*X1
        3. Quadratic:         y = b0 + b1*X1 + b2*(X1-squared)
        4. Logarithmic:       log(y) = b0 + b1*log(X1)

This particular ado-file is primarily aimed at helping users to distinguish
between the first two of these forms, but it can also be helpful regarding
the other two (some additional comments appear below).

Although there are many discussions of how to make such a choice in the
statistical literatures of several disciplines, many users just compare the
summary statistics from the two regressions.  However, when the dependent
variable in a linear regression is a logarithmic transform, the summary
statistics are not comparable to the summary statistics from an
untransformed regression.  G.S.
Maddala (1988), Introduction to Econometrics, New York: Macmillan Publishing
Co., puts it this way:

        When comparing the linear with the log-linear forms, we cannot
        compare the R-squared's because R-squared is the ratio of explained
        variance to the total variance and the variances of y and log y are
        different.  Comparing R-squared's in this case is like comparing
        two individuals, A and B, where A eats 65% of a carrot cake and B
        eats 70% of a strawberry cake.  The comparison does not make sense
        because there are two different cakes.  (p. 177)

Many of the other summary statistics, including RMSE and the F-statistic,
are problems for the same reason (a different amount of variation in the
dependent variable), so this program provides these statistics also.  These
summary statistics, as shown in the example below, are provided for five (5)
models: the raw variable model, the semi-log model (log of dependent
variable), the adjusted output from the raw model (adjusted by taking the
logs of the predicted values and the dependent variable and calculating the
summary statistics), and two adjusted versions of the log model.

Note that this presentation is not meant to imply that you should choose
between these functional forms based solely on these summary statistics;
many other things, including substantive knowledge (is a multiplicative or
an additive scale preferable?), need to be taken into account.  Another
thing you may find helpful is a plot showing both the untransformed variable
(using, say, the left scale) and the transformed variable (using, say, the
right scale), with the ^rescale^ and ^rlog^ ^graph^ options.

Two sets of adjusted statistics are provided for the log model: (1) "adj.
exp", an adjustment of the anti-log that takes account of the changing
skewness; and (2) "exp", which is just the anti-log.
Many people re-transform the results from log-transformed equations by just
using the anti-log (exponential); however, if the log transformation is
correct, then this gives you the median rather than the mean (regression
normally gives you an expected, or conditional, mean value).  To get the
mean you must adjust this by using the variance from the regression.  See,
e.g., D.M. Miller (1984), "Reducing Transformation Bias in Curve Fitting",
The American Statistician, 38, pp. 124-6; W.H. Greene (1990), Econometric
Analysis, New York: Macmillan Publishing Company, p. 168; Granger (cited
above), p. 132; or any of the papers cited in ^logdummy.hlp^.

The summary statistics from all five models, including those from the two
regressions that are shown anyway, appear together in a table.  The summary
statistics shown are: R-squared, Adjusted R-squared, the F-value for the
regression, the root mean squared error (RMSE) for the regression, and the
coefficient of variation for the regression.  Also, at the bottom of each
regression output, I provide the Durbin-Watson statistic in unadjusted form;
this is provided since a log transform is often used because of problems
that will cause D-W to fail.

The program automatically transforms the dependent variable for you.  Note
that this ado-file does not transform the right-hand-side, or independent,
variables in any way.  Thus, if you think the real competition is between
the log-transformed model and an untransformed model with a quadratic effect
on the right-hand side, then you will probably need to run this ado-file
twice, once with the quadratic term included on the right, and once without
it.  As a side-benefit you might even find that the log-transformed model
with a quadratic term is best!  Similarly, if you want to compare a model
that is transformed to logs on both the right and left sides, then again you
should probably use this ado-file twice.
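
As a rough illustration of the two re-transformations discussed above, the
following Python sketch fits both models to a few made-up data points and
puts their R-squared values on common scales.  The data, the function names,
and the exact form of the correction (multiplying the anti-log by
exp(s2/2), where s2 is the residual variance of the log regression, in the
spirit of Miller 1984) are assumptions made for illustration; the adjustment
that ^logsumm^ itself applies may differ.

```python
import math

def ols(x, y):
    """Least squares of y on x with a constant; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def r_squared(y, fitted):
    """1 - SSE/SST, with both sums taken on the scale of y."""
    my = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

# Made-up data (NOT nwk.dta): y falls roughly exponentially with x.
x = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y = [13.4, 11.6, 10.5, 9.8, 8.9, 7.2, 6.5, 5.1, 4.9, 3.6]
logy = [math.log(yi) for yi in y]

# Linear model, and its fitted values moved to the log scale.
a0, a1 = ols(x, y)
fit_raw = [a0 + a1 * xi for xi in x]
fit_raw_log = [math.log(f) for f in fit_raw]   # needs positive fitted values

# Semi-log model: regress log(y) on x.
b0, b1 = ols(x, logy)
fit_log = [b0 + b1 * xi for xi in x]
sse_log = sum((l - f) ** 2 for l, f in zip(logy, fit_log))
s2 = sse_log / (len(x) - 2)    # residual variance of the log regression

# Two ways back to the original scale.
fit_exp = [math.exp(f) for f in fit_log]            # plain anti-log (median)
fit_adj = [math.exp(f + s2 / 2) for f in fit_log]   # assumed mean correction

print("linear model, raw scale   :", round(r_squared(y, fit_raw), 4))
print("linear model, log scale   :", round(r_squared(logy, fit_raw_log), 4))
print("log model, log scale      :", round(r_squared(logy, fit_log), 4))
print("log model, 'adj. exp'     :", round(r_squared(y, fit_adj), 4))
print("log model, 'exp'          :", round(r_squared(y, fit_exp), 4))
```

Only values computed on the same scale should be compared: the first with
the last two (original scale), and the second with the third (log scale).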
I also include two other procedures in the output: (1) a "test" of whether
it is possible to reject either the linear or the log-transformed version;
and (2) a simple run of the ^boxcoxg^ transformation ado-file (see its
article and/or help file for more information).  The Godfrey, et al.,
article (cited in the STB article) compares a number of tests and finds that
the PE test, included here, and the Ramsey RESET test, included in STB-2 as
^ramsey^, are among the best tests even when assumptions are violated.

There are other worthwhile things to do, at least two of which are possible
in Stata.  First, and very easy in Stata, is a graph showing both the
transformed and untransformed dependent variable on one graph, with one
y-axis in the untransformed scale and the other in the transformed scale.
Two examples, one of made-up data and one of real data, show this.  The
other procedure requires the use of the Bootstrap; Stata's ado-file for this
was discussed in The Stata News, January 1991, Vol. 7, No. 1, p. 6; use of
the bootstrap to help choose between non-nested models is discussed in B.
Efron (1984), "Comparing Non-Nested Linear Models", _Journal of the American
Statistical Association_, 79: 791-803.

The ^logdummy.ado^ file canNOT be used at the end of a run using this file
since the last regression actually estimated by this ado-file is for the
^boxcoxg^ run.  Thus, to use ^logdummy^, you must actually re-estimate the
log-transformed regression.  (See ^logdummy.hlp^.)

Example using ^nwk.dta^ (Neter, Wasserman, and Kutner. 1989. Applied Linear
Regression Models. 2d ed. Homewood, IL: Irwin.)

^. use nwk^
(Neter, et al., 1989, p. 150)

^. logsumm plasma age^

  Source |       SS       df       MS           Number of obs =      25
---------+------------------------------        F(  1,    23) =   70.21
   Model |  238.056198     1  238.056198        Prob > F      =  0.0000
Residual |  77.9830691    23  3.39056822        R-square      =  0.7532
---------+------------------------------        Adj R-square  =  0.7425
   Total |  316.039267    24  13.1683028        Root MSE      =  1.8413

Variable | Coefficient   Std. Error        t    Prob > |t|          Mean
---------+--------------------------------------------------------------
  plasma |                                                        9.1112
---------+--------------------------------------------------------------
     age |      -2.182     .2604062   -8.379         0.000             2
   _cons |     13.4752     .6378622   21.126         0.000             1
---------+--------------------------------------------------------------
Durbin Watson Statistic = 1.6413435

  Source |       SS       df       MS           Number of obs =      25
---------+------------------------------        F(  1,    23) =  134.02
   Model |  2.77338628     1  2.77338628        Prob > F      =  0.0000
Residual |  .475948075    23  .020693395        R-square      =  0.8535
---------+------------------------------        Adj R-square  =  0.8472
   Total |  3.24933435    24  .135388931        Root MSE      =  .14385

Variable | Coefficient   Std. Error        t    Prob > |t|          Mean
---------+--------------------------------------------------------------
 logdepv |                                                      2.141985
---------+--------------------------------------------------------------
     age |   -.2355159     .0203437  -11.577         0.000             2
   _cons |    2.613017     .0498318   52.437         0.000             1
---------+--------------------------------------------------------------
Durbin Watson Statistic = 1.7528526

Following are some summary statistics for each of the above two models
3 of the 5 sets of statistics are 'adjusted', the other two just repeat
what was shown above for ease of comparison.
The first column shows the unadjusted statistics for the linear model,
just as shown in the first regression above; the second column shows
summary statistics for the same model but this time adjusted by
transforming to logs; the third column repeats the unadjusted figures
from the transformed regression (the second regression above); this is
followed by two sets of adjusted statistics: (1) a less biased
re-transformation than the standard one (see the help file or the STB
article); (2) using the 'standard', biased, re-transformation by just
exponentiating the predicted values from the log model.

             |              Adjusted                 Better   Standard
             |        Raw        Raw        Log   Adj'd Log  Adj'd Log
----------------------------------------------------------------------
R-Square     |     0.7532     0.7981     0.8535     0.7911     0.7945
Adjusted R-SQ|     0.7425     0.7893     0.8472     0.7820     0.7856
F-Value      |      70.21      90.93     134.02      87.09      88.93
RMSE         |     1.8413     0.1689     0.1439     1.6944     1.7037
CV (*100)    |      20.21       7.88       6.72      18.59      20.00

Results of the MacKinnon-Davidson (PE) test:

The t-statistic (p-value) for test of linearity is         2.068   0.050
The t-statistic (p-value) for test of log-linearity is    -1.114   0.277

Note that it is quite possible that BOTH the above tests might be
significant (non-significant)!!  This means that this test is
indeterminate for this model; in this case, the use of ^boxcoxg^, below,
may be particularly helpful; regardless, you might also want to use
^ramsey.ado^ (STB-2).  If only one test is significant, then we reject
the functional form for which the test is significant and 'accept' the
other form.
Following is a crude look using ^boxcoxg^; if this appears to be
informative, you might want to use ^boxcoxg^ again with a finer grid;
see ^boxcox.hlp^

   lambda         SSE    Log-likelihood
    -3.00      132.62          -61.0932
    -2.50       89.93          -56.2381
    -2.00       61.86          -51.5602
    -1.50       44.01          -47.3064
    -1.00       33.91          -44.0460
    -0.50       30.56          -42.7460
     0.00       34.52          -44.2690
     0.50       48.37          -48.4862
     1.00       77.98          -54.4561
     1.50      135.17          -61.3314
     2.00      243.05          -68.6657
     2.50      446.86          -76.2781
     3.00      835.78          -84.1046

A number of variables are kept, but not saved in your data file; here is the
data after the above estimation, with automatic variable labels.  You may
want to use some of these; for example, comparing quantile graphs of the two
different sets of residuals can be informative.

^. d, d^

Contains data from nwk.dta
  Obs:    25 (max= 28324)               Neter, et al., 1989, p. 150
 Vars:    21 (max=   254)
Width:   108 (max=   510)
  1. age        float  %9.0g
  2. plasma     float  %9.0g
  3. logdepv    float  %9.0g     Log of Original D.V.
  4. yhatr      float  %9.0g     Pred. Values/Untransformed Reg.
  5. yhatl      float  %9.0g     Log of Pred. Values/Untransform
  6. _resr      double %10.0g    Residuals/Untransformed Reg.
  7. _DWr       double %10.0g    D-W/raw regression
  8. _SSEr      float  %9.0g     Log transformed SSE
  9. _SSTr      float  %9.0g     Log transformed SST
 10. yhat       double %10.0g    Pred.Values/Transformed Reg, Lo
 11. _res       double %10.0g    Residuals/Transformed Reg, Logs
 12. stdf       double %10.0g    Forecast Err/Transformed Reg, L
 13. yhata      float  %9.0g     Retransformed, Adj., Pred. Valu
 14. yhate      float  %9.0g     Retransformed, UNadj., Pred. Va
 15. _SSEa      float  %9.0g     Retransformed, Adjusted, SSE
 16. _SSEe      float  %9.0g     Retransformed, UNadjusted, SSE
 17. _SSTa      float  %9.0g     Retransformed, Adjusted, SST
 18. _SSTe      float  %9.0g     Retransformed, UNadjusted, SST
 19. _DW        double %10.0g    D-W from Transformed Reg.
 20. lidiff     float  %9.0g     Difference between Raw and Re-t
 21. lodiff     float  %9.0g     Difference between Log and Logg
Sorted by:
Note:  Data has changed since last save
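
The ^boxcoxg^ grid in the example can be read with the help of a small
sketch.  The printed numbers are consistent with a Box-Cox transform scaled
by the geometric mean of the dependent variable and a log-likelihood
reported as -(n/2)*ln(SSE): at lambda = 1.00 the SSE (77.98) equals the
residual sum of squares of the linear regression, and -(25/2)*ln(77.98) is
about -54.456.  This reading is inferred from the output alone, not from the
^boxcoxg^ source, so the Python code below is an assumption-laden sketch;
the data and function names are made up for illustration.

```python
import math

def ols_sse(x, z):
    """Residual sum of squares from least squares of z on x with a constant."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    b1 = sxz / sxx
    b0 = mz - b1 * mx
    return sum((zi - b0 - b1 * xi) ** 2 for xi, zi in zip(x, z))

def boxcox_scaled(y, lam):
    """Box-Cox transform of y (all y > 0), scaled by the geometric mean so
    that SSEs are comparable across values of lambda."""
    gm = math.exp(sum(math.log(v) for v in y) / len(y))
    if abs(lam) < 1e-12:
        return [gm * math.log(v) for v in y]      # limiting case, lambda = 0
    return [(v ** lam - 1) / (lam * gm ** (lam - 1)) for v in y]

def grid(x, y, lambdas):
    """Return (lambda, SSE, log-likelihood) rows; the log-likelihood is taken
    to be -(n/2)*ln(SSE), an assumption inferred from the table above."""
    n = len(y)
    rows = []
    for lam in lambdas:
        sse = ols_sse(x, boxcox_scaled(y, lam))
        rows.append((lam, sse, -(n / 2) * math.log(sse)))
    return rows

# Made-up data (NOT nwk.dta).
x = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y = [13.4, 11.6, 10.5, 9.8, 8.9, 7.2, 6.5, 5.1, 4.9, 3.6]

for lam, sse, ll in grid(x, y, [i / 2 for i in range(-6, 7)]):
    print(f"{lam:6.2f} {sse:10.2f} {ll:12.4f}")
```

The lambda with the largest log-likelihood (equivalently, the smallest SSE)
is the one the grid favors; a value near 0 points toward the log form and a
value near 1 toward the linear form.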