
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: ambiguity in -if- qualifier


From   Nick Winter <[email protected]>
To   [email protected]
Subject   Re: st: ambiguity in -if- qualifier
Date   Tue, 25 Mar 2014 10:56:11 -0400

You omitted secret option (e): blame the economists.  :-)

On 3/25/2014 5:21 AM, Nick Cox wrote:
I think this example highlights the core of Yu Chen's concern. I
reverse Yu's style and present a plausible example in facetious
manner.

Question. Professor Nobelordie is teaching an advanced econometrics
class, "Testing for heteros{c|k}edasticity under a full moon". He
presents students with a dataset for 1900-2012 but for reasons
compelling to economists tells them to use only data from 1970 on to
build an autoregressive model predicting something of interest.

Students Strict and Weak attempt this problem. Student Strict starts out by

keep if year >= 1970

and then fits her model. Student Weak omits this step but carefully puts

if year >= 1970

on all his statements. They get different results. Explain why, and
apportion blame between

(a) Professor Nobelordie

(b) Student Strict

(c) Student Weak

(d) Stata.

Answer. Student Strict is reasoning "only use data from 1970 on", but
following the -keep-, L1. values are not available for 1970 because
1969 is no longer in the dataset, L2. values are not available for
1970 or 1971 for the same reason, and so on. Student Weak can use
more data (much more if there are several lagged terms in his model).
Provided they keep and show their code, the discrepancy can be
unearthed and explained.
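
The discrepancy can be reproduced with a minimal sketch. The toy series below (a variable y of random noise for 1900-2012) is an assumption made purely for illustration, not the professor's data, and a single lag stands in for whatever model the students fitted.

```stata
* Hypothetical toy series for 1900-2012; y is random noise (illustration only)
clear
set seed 12345
set obs 113
gen year = 1899 + _n
gen y = rnormal()
tsset year

* Student Weak: full dataset stays in memory, -if- goes on the command
regress y L.y if year >= 1970
scalar n_weak = e(N)       // 1970 can still reach back to 1969 for its lag

* Student Strict: drop the earlier years first, then fit
keep if year >= 1970
regress y L.y
scalar n_strict = e(N)     // 1970 now has no lag available, so it drops out

display "Weak N = " n_weak "   Strict N = " n_strict
```

With one lag, Weak's regression uses all 43 years from 1970 to 2012, while Strict's uses 42, because 1970 loses its lagged value once 1969 has been dropped. Each further lag in the model costs Strict one more observation.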

Professor Nobelordie is guilty of a vague instruction, unless the
point of the question was for students to discover the ambiguity the
hard way.

Stata is blameless. It just sits there, trying very hard to do what
it's told. -if- pushes one way, time series operators push another
way.

Nick
[email protected]


On 25 March 2014 00:46, Nick Cox <[email protected]> wrote:
What the -mvsumm- help calls the "weak" interpretation will always be
followed unless you intervene afterwards to -replace- values that use
information outside the -if- restriction (or, equivalently, reduce the
dataset to the observations selected by -if-).

That's much of the point of those comments! The rest of the point is
just to underline that that is what Stata does.


Nick
[email protected]


On 24 March 2014 23:01, Yu Chen, PhD <[email protected]> wrote:
Hi, Nick,
Thank you very much for the explanation. You mentioned in the Remarks
of -mvsumm- (SSC) that there are possibly two interpretations: a weak
interpretation and a strong interpretation. You chose to use the weak
interpretation in developing -mvsumm-.
Do you know whether such a weak interpretation is consistently followed
by Stata in developing its official commands? If some official
commands employ the weak interpretation, but others employ the strong
interpretation, that will be a potential trap for those unaware of the
distinction.
Thank you.

Yu



On Mon, Mar 24, 2014 at 12:06 PM, Nick Cox <[email protected]> wrote:
The reason for your puzzlement is becoming much clearer, so thanks for
providing an example that can be discussed.

Note, however, that your initial word description -- in your first
paragraph -- does not fully match your code example, as your code
example bites for a quite specific reason, which only the code makes
clear.

Naturally, Stata can calculate the previous value of a time series if
the previous observation is present in the dataset, but not otherwise.
(Similar remarks apply to the effects of any time series operator or
subscripting where such imply reaching outside the observations
selected by -if-.)

Said differently, -if- selects observations to be used, but neither
the -if- qualifier nor any other part of the syntax is thereby
prohibited from invoking information in the other part of the data set
whenever -if- selects a strict subset.
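
The same point holds for explicit subscripting. The sketch below uses a made-up five-observation dataset (the variables t and x and their values are hypothetical) to show that -if- limits which observations receive a result, not which observations an expression may read.

```stata
* Hypothetical five-observation dataset (values invented for illustration)
clear
input t x
1 10
2 20
3 30
4 40
5 50
end

* -if- restricts which observations get a value, but x[_n-1] still
* reads observation 2 when the result for observation 3 is computed
gen prev_weak = x[_n-1] if t >= 3

* once the dataset is reduced, the first remaining observation
* has no predecessor, so its previous value is missing
keep if t >= 3
gen prev_strict = x[_n-1]

list t x prev_weak prev_strict
```

For t == 3, prev_weak is 20 while prev_strict is missing; for t == 4 and t == 5 the two variables agree.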

But the problem here is not that Stata is being ambiguous, or
inconsistent, or incorrect, but that users need to ask for what they
want and want what they ask for.

In your example, which we can all agree to be frivolous, you in effect
carry out a regression on part of a panel and **part of what you
calculate depends on values outside the data used**. That's at best
dubious and at worst meaningless, but either way the decision to do
that is yours, not Stata's.

Otherwise put, it's your code that says "use lagged values for part of
the data" and Stata does what it is told to the best of its ability.
It's a robot and you are its instructor, in this example at least.

I agree with you that people need to think about cases like this.
Indeed, if you look at the help file for -mvsumm- (SSC) you will see
"Remarks" written (by me, as it happens) on this very point in 2005.

There are many other examples. Here is another.

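sysuse auto, clear

* _N is 74 here: the full dataset is still in memory
gen mpg2 = mpg/_N if foreign

* after -keep-, _N is 22: only the foreign cars remain
keep if foreign
gen mpg3 = mpg/_N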

-mpg2- and -mpg3- are quite different, as _N is the number of
observations in the current dataset.

The only clear rule needed here is to ask for exactly what you want.

Nick
[email protected]

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

