
From: Richard Williams <[email protected]>
To: [email protected]
Subject: Re: st: linear vs log-linear regression: specification test
Date: Tue, 28 Oct 2003 22:16:28 -0500

At 09:28 PM 10/28/2003 -0500, David Miller wrote:

> However, this seems to be a general issue -- log transform or not log
> transform the Y variable -- and I would be interested in hearing any
> Statalist views on this. I have heard some (not Stata listers) say that
> if you get a better R^2 with the transformed data or a better picture of
> a regression plot (whatever that means), you should do it, but I am not
> sure that I agree.

As you suggest later in your message, I'd want theory to guide me on this, not R^2. The goal is to correctly specify the model, not maximize R^2. But, if theory is totally ambivalent or lacking on this...
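One caution if the choice is made empirically: the R^2 from a regression of log(y) on x is not directly comparable to the R^2 from a regression of y on x, because the dependent variables differ. A rough sketch of a fairer comparison -- predicting on the original scale of y -- using hypothetical simulated data (this glosses over retransformation bias, so it is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
# Hypothetical data generated multiplicatively, so log(y) is linear in x
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.3, 200))

# Model 1: y regressed on x directly
b1, a1 = np.polyfit(x, y, 1)
pred_lin = a1 + b1 * x

# Model 2: log(y) regressed on x, predictions back-transformed.
# The naive back-transform ignores retransformation (smearing) bias,
# so treat this as a rough comparison only.
b2, a2 = np.polyfit(x, np.log(y), 1)
pred_log = np.exp(a2 + b2 * x)

# Compare both models on the ORIGINAL scale of y; the two in-sample
# R^2 values are not comparable because the regressands differ.
def rmse(pred):
    return float(np.sqrt(np.mean((y - pred) ** 2)))

print(rmse(pred_lin), rmse(pred_log))
```

With this multiplicative data-generating process, the logged model should predict better on the original scale; with additively generated data the comparison would typically reverse.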

> If there is no specific or rigorous theory underlying the relationship
> and the purpose is only prediction, is it then OK to simply fit the data
> empirically and take logs if that seems to result in a better fit?

For prediction purposes, you could take the viewpoint that, so long as it works, what do you care whether the model makes any theoretical sense or not? If you can come up with a formula that predicts the winning lottery numbers every time, and the variables are rainfall in Idaho and the rating of this week's most-watched TV show, I'll be happy to use that formula whether I can make any sense of it or not.

But the trick is "works every time." Approaches that maximize R^2 via variable transformations, stepwise selection procedures, or whatever may well be capitalizing on chance; while they may have worked for the current data, they may be worthless for the next piece of data that comes in. For example, I once observed a fairly high correlation between id number and another variable of interest, but I don't think I'd want to count on that being a universal truth.

One way of dealing with this, then, is to develop your model with one set of data, and then see how well it works with a second set of data. If you don't conveniently have two data sets, you might try splitting your current data in half.
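The split-half check described above can be sketched as follows (simulated data; all names hypothetical). The model is developed on one random half and its fit is judged on the other half:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)   # hypothetical data

# Split the data in half: develop on one half, check on the other
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2 :]

b, a = np.polyfit(x[train], y[train], 1)   # fit on the development half
pred = a + b * x[test]                     # predict the held-out half

# Out-of-sample R^2 on the held-out half: a model that merely
# capitalized on chance would score far worse here than in-sample
ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - np.mean(y[test])) ** 2)
holdout_r2 = 1 - ss_res / ss_tot
print(round(holdout_r2, 3))
```

If the specification was only fitting noise, the held-out R^2 collapses toward zero even when the in-sample R^2 looked impressive.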

Indeed, in principle, you ought to do this for just about any analysis. Most of us probably test all sorts of ideas that get discarded before we report anything. The significance tests for the winners that we do report may therefore be deceptive: no matter how dumb our ideas were, if we test enough of them, something has to come up significant just by chance. So, testing on a second set of data to make sure everything still comes out the way it did in the first is a great (albeit rarely done) thing to do. And if there is no theory at all in the first place, it seems especially important to do such double-checks.
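The "something has to come up significant" point is easy to demonstrate with pure noise (simulated, illustrative only -- the |r| cutoff used is an approximation to the usual two-sided p < .05 test):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 200
y = rng.normal(size=n)

# k candidate predictors of pure noise: by chance alone, roughly 5%
# will clear the usual two-sided p < .05 bar (|r| > ~2/sqrt(n) is an
# approximation to the critical correlation at that level)
significant = 0
for _ in range(k):
    x = rng.normal(size=n)
    r = np.corrcoef(x, y)[0, 1]
    if abs(r) > 2 / np.sqrt(n):
        significant += 1

# Roughly 0.05 * k "winners" despite there being no real effect at all
print(significant)
```

Any of those "winners" reported in isolation would look like a finding; tested against a fresh draw of data, almost all of them would vanish.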

-------------------------------------------

Richard Williams, Associate Professor

OFFICE: (574)631-6668, (574)631-6463

FAX: (574)288-4373

HOME: (574)289-5227

EMAIL: [email protected]

WWW (personal): http://www.nd.edu/~rwilliam

WWW (department): http://www.nd.edu/~soc

*

* For searches and help try:

* http://www.stata.com/support/faqs/res/findit.html

* http://www.stata.com/support/statalist/faq

* http://www.ats.ucla.edu/stat/stata/
