# Re: st: linear vs log-linear regression: specification test

From: Richard Williams <[email protected]>
To: [email protected]
Subject: Re: st: linear vs log-linear regression: specification test
Date: Tue, 28 Oct 2003 22:16:28 -0500

At 09:28 PM 10/28/2003 -0500, David Miller wrote:

```
However, this seems to be a general issue -- log transform or not log
transform the Y variable -- and I would be interested in hearing any
StataList views on this.  I have heard some (not Stata Listers) say that
if you get a better R2 with the transformed data or a better picture of
a regression plot (whatever that means), you should do it, but I am not
sure that I agree.
```
As you suggest later in your message, I'd want theory to guide me on this, not R^2. The goal is to correctly specify the model, not maximize R^2. But, if theory is totally ambivalent or lacking on this...

```
If there is no underlying specific or rigorous theory underlying the
relationship and the purpose is only prediction, is it then ok to simply
empirically fit the data and take the logs if this seems to result in a
better fit?
```
For prediction purposes, you could take the viewpoint that, so long as it works, what do you care whether the model makes any theoretical sense or not? If you can come up with a formula to predict the winning lottery numbers every time and the variables are rainfall in Idaho and the rating of this week's most watched TV show, I'll be happy to use that formula whether I can make any sense out of it or not.

But, the trick is, "works every time." Approaches that maximize R^2 via variable transformations, stepwise selection procedures, or whatever, may well be capitalizing on chance; while they may have worked for the current data, they may be worthless for the next piece of data that comes in. For example, I once observed a fairly high correlation between id number and another variable of interest, but I don't think I'd want to count on that being a universal truth.

One way of dealing with this, then, is to develop your model with one set of data, and then see how well it works with a second set of data. If you don't conveniently have two data sets, you might try splitting your current data in half.
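The split-half idea can be sketched quickly. This is not from the original post and uses Python/NumPy rather than Stata; the simulated data (where log(y) really is linear in x) and the simple OLS helper are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y is generated on the log scale, so the
# log-linear specification happens to be "true" here.
n = 200
x = rng.uniform(1, 10, n)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.2, n))

# Split the sample in half: develop the model on one half,
# then check its predictions on the other half.
half = n // 2
train, holdout = np.arange(half), np.arange(half, n)

def fit_ols(x, y):
    """Return (intercept, slope) from a simple one-regressor OLS fit."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def mse(y, yhat):
    """Mean squared prediction error."""
    return np.mean((y - yhat) ** 2)

# Linear model: y = a + b*x
b_lin = fit_ols(x[train], y[train])
pred_lin = b_lin[0] + b_lin[1] * x[holdout]

# Log-linear model: log(y) = a + b*x, predictions back-transformed
# (note exp() of the fitted value predicts the conditional median,
# not the mean -- fine for a rough sketch)
b_log = fit_ols(x[train], np.log(y[train]))
pred_log = np.exp(b_log[0] + b_log[1] * x[holdout])

# Compare out-of-sample prediction error on the holdout half
print("linear MSE:    ", mse(y[holdout], pred_lin))
print("log-linear MSE:", mse(y[holdout], pred_log))
```

The point of the exercise is that whichever specification you pick, you judge it by how it predicts data it was *not* fit to, rather than by in-sample R^2.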

Indeed, in principle, you ought to do this for just about any analyses. Most of us probably test all sorts of ideas that get discarded before we report anything. The significance tests for the winners that we do report may therefore be deceptive, because no matter how dumb our ideas were, if we test enough of them something has to come up significant even if it is just by chance. So, testing on a 2nd set of data to make sure everything still comes out the way it did in the first set of data is a great (albeit rarely done) thing to do. But, if there is no theory at all in the first place, then it seems especially important to do some double-checks.
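The "test enough ideas and something comes up significant" point is easy to put numbers on. The arithmetic below is not from the original post; it is a standard multiple-testing illustration, with the 20-test count chosen arbitrarily:

```python
import numpy as np

# If every null hypothesis is true and each test uses a 5% level,
# the chance that at least one of 20 independent tests comes out
# "significant" is 1 - 0.95^20, or about 64%.
analytic = 1 - 0.95 ** 20
print(round(analytic, 3))  # → 0.642

# Monte Carlo check: under a true null, p-values are uniform on [0, 1].
rng = np.random.default_rng(1)
trials = 10_000
pvals = rng.uniform(size=(trials, 20))       # 20 null tests per trial
any_sig = (pvals < 0.05).any(axis=1).mean()  # share of trials with a "hit"
print(round(any_sig, 3))
```

So even with completely dumb ideas, a researcher who quietly tries 20 of them will usually have a "significant" result to report, which is exactly why the holdout check matters.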

-------------------------------------------
Richard Williams, Associate Professor
OFFICE: (574)631-6668, (574)631-6463
FAX: (574)288-4373
HOME: (574)289-5227
EMAIL: [email protected]
WWW (personal): http://www.nd.edu/~rwilliam
WWW (department): http://www.nd.edu/~soc

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
