RE: st: Systematic Estimations
Nick Cox <firstname.lastname@example.org>
Wed, 25 May 2011 16:34:14 +0100
Balancing the formal and informal, or objective indications and subjective judgement, is much of the difficulty of inference. Some people have solutions that satisfy themselves, but no solution satisfies everyone. Perhaps this is one reason for the unexplained quip quoted by the mathematician Ian Stewart that statistics is a branch of theology. (Stewart, I. 1975. Concepts of modern mathematics. Harmondsworth: Penguin, if I recall correctly.)
Here are some arbitrary but I hope not quite trivial remarks.
1. The idea that the model is specified in advance and the data are then presented to test the model is a nice ideal, but it presumes that the modellers are super-smart and able to think out everything in advance. I don't know about your field, but in mine the modellers don't have a monopoly on smartness, and even if they did, their models are rarely that good.
2. Most models are more or less empirical anyway, although there are many ways of disguising the fact. I've even seen claims that particular functional forms are "intuitive", but it's part of the ritual of many fields to make such claims as a kind of incantation. Economics seems to provide particularly extraordinary examples. Dig deeper, and most functional forms are chosen for the convenience of the modeller.
3. There is a chicken-and-egg question: In what sense can we ever learn from data? The purist idea seems to be: By learning that a model is wrong. But if a model is wrong, then the next model should be revised accordingly. So, the implication is that such acts of revision take place "between publications" as it were? This is absurd in my view.
4. Much of the concern is over the precise etiquette of what you write in a paper. It is common in many fields to spend a great deal of effort thinking about whether to transform, which variables to include, whether and how to model interactions, etc., etc., etc. Then the paper is written up as if the model you ended up with after a lot of work was precisely the one you had in mind all along. This is a kind of hypocrisy but it is also widely taught and practised. The implications for P-values of data snooping are widely but not universally realised to be problematic.
5. The pious view is that judgement is expected on the kind of model you choose while the data should usually be allowed to indicate parameter estimates. In-between situations are thought treacherous. Consider allowing the data to indicate a transformation, often thought to be arbitrary and ad hoc. Then Box and Cox showed that you could estimate the right transformation from the data. So, that's all right then. However, most people who choose transformations don't use Box-Cox; they use judgement, as in "this kind of variable is always better treated on a logarithmic scale".
6. The term "data snooping" is deliberately pejorative. "Learning from the data" sounds like a good idea to me (and it does not imply that it is the only way to learn).
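To make point 5 concrete, here is a minimal sketch of what "estimating the transformation from the data" can look like. It is not taken from any Stata command; it is written in Python with only the standard library so that it is self-contained. It assumes a single strictly positive variable with no covariates, simulated lognormal data, and a simple grid search. The names boxcox_loglik and estimate_lambda are made up for illustration.

```python
import math
import random


def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox parameter lam (constants dropped).

    Assumes every value in y is strictly positive.
    """
    n = len(y)
    logs = [math.log(v) for v in y]
    if abs(lam) < 1e-12:
        z = logs  # lam = 0 is the limiting case: the log transformation
    else:
        z = [(math.exp(lam * lv) - 1.0) / lam for lv in logs]
    mean_z = sum(z) / n
    var_z = sum((v - mean_z) ** 2 for v in z) / n  # ML variance (divisor n)
    # Second term is the log Jacobian of the transformation.
    return -0.5 * n * math.log(var_z) + (lam - 1.0) * sum(logs)


def estimate_lambda(y, lo=-2.0, hi=2.0, steps=400):
    """Pick the lam on an evenly spaced grid that maximises the log-likelihood."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda lam: boxcox_loglik(y, lam))


# Simulated example (an assumption, not real data): lognormal values,
# for which the logarithm (lam = 0) is the "right" transformation.
random.seed(1)
y = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]
lam_hat = estimate_lambda(y)
print("estimated lambda:", round(lam_hat, 2))
```

With data like these the estimate should land close to 0, which is the formal counterpart of the informal judgement "use logs". A regression version would replace the mean of z by fitted values from the model.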
From: email@example.com [mailto:firstname.lastname@example.org] On Behalf Of Maarten Buis
Sent: 25 May 2011 14:06
Subject: Re: st: Systematic Estimations
On Wed, May 25, 2011 at 2:41 PM, Barbara Engels wrote:
> I have a less technical and rather general question. I am a newbie regarding empirical evaluations of time-series. I am dealing with the relation between total factor productivity and research and development expenditure now. There are many variables that could play a role in determining total factor productivity. I have been trying to estimate regressions for quite some time, introducing variables, excluding them again. Estimated coefficients have changed dramatically in value and sign, and so did R^2.
> My question is: Is there any recommendable system of how to pick and drop variables again, making sure that THIS regression equation is better than the OTHER and not the other way around without getting lost in wild estimations?
That is a difficult problem. Some would say that you need to have a
prior theory and stick to it whatever the data may say and should not
"snoop" around in your data. There is some truth in that, but often
that is just not a practical solution. A nice alternative is discussed
in: Edward E. Leamer (1983) Let's Take the Con Out of Econometrics.
The American Economic Review, Vol. 73, No. 1 (Mar., 1983), pp. 31-43.