Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: ambiguity in -if- qualifier


From   "Yu Chen, PhD" <[email protected]>
To   [email protected]
Subject   Re: st: ambiguity in -if- qualifier
Date   Mon, 24 Mar 2014 11:36:28 -0500

Hi, Nick,
Suppose I want to do a regression only on foreign cars, using the
auto.dta data set. I have two possible ways to do that. (1) I can
-drop- the domestic cars at the beginning and then do the regression.
This way the regression is performed only on the foreign cars. (2) I
can use an -if- qualifier in the regression command to restrict the
sample to foreign cars.
Do you think these two methods produce the same results?

Try the code below, and you will see that the results differ.

Code for method (1).
sysuse auto,clear
gen n=_n
tsset n
drop if foreign==0
reg price L.mpg headroom


Code for method (2).
sysuse auto,clear
gen n=_n
tsset n
reg price L.mpg headroom if foreign==1
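
A hedged sketch of why the two runs can differ: time-series operators
such as L. look up the previous value of the -tsset- variable among the
observations still in memory, while -if- only restricts which
observations enter the estimation. Assuming auto.dta as shipped (sorted
by foreign, so the first foreign car directly follows the last domestic
car), method (1) has no lagged mpg available for that first foreign car,
whereas method (2) can still take it from the last domestic car. The
scalar names N1 and N2 are introduced here only for illustration.

```stata
* Method (1): drop domestic cars first, so the lag of the first
* remaining observation has nothing to refer to and is missing
sysuse auto, clear
gen n = _n
quietly tsset n
drop if foreign == 0
quietly reg price L.mpg headroom
scalar N1 = e(N)          // estimation sample size, method (1)

* Method (2): keep all cars in memory and restrict with -if-;
* L.mpg can still be read from an observation outside the -if- sample
sysuse auto, clear
gen n = _n
quietly tsset n
quietly reg price L.mpg headroom if foreign == 1
scalar N2 = e(N)          // estimation sample size, method (2)

* Compare the number of observations actually used by each method
display "Method (1) N = " N1
display "Method (2) N = " N2
```

If the two counts differ, the extra observation in method (2) should be
the first foreign car, whose lagged mpg comes from a domestic car.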


I don't think many people are aware of this issue, so it is important
to have clear rules for the usage of the -if- qualifier.
I also thank Joe for his help.





On Sat, Mar 22, 2014 at 8:09 PM, Nick Cox <[email protected]> wrote:
> Comments below.
>
> Nick
> [email protected]
>
>
> On 23 March 2014 00:44, Yu Chen, PhD <[email protected]> wrote:
>> Hi, Nick,
>> Let me clarify. For any assignment to a new variable, there are two
>> steps. Step 1, the expression should be evaluated; and Step 2, the
>> result of the evaluation is assigned to the new variable. My question
>> is, what is the sample used in each step?
>> For -generate-, Step 1 uses the full sample. In other words, all
>> observations, regardless of whether they meet the -if- condition, can be
>> used. But in Step 2, -generate- uses the subsample that meets the -if-
>> condition.
>
> I don't think this word treatment helps understanding. In your
> -generate- example two things are happening simultaneously:
>
> A. Stata is being instructed to put previous values of -mpg- in a new variable.
>
> B. Stata is being instructed to do that only if -foreign- is 1.
>
> You are surmising that A is done in a Step 1, which is followed by B
> in a Step 2. But it makes just as much sense to imagine that Stata
> works out that the variable should receive non-missing values only
> when -foreign- is 1 and then works out what they should be. Either
> way, the result is the same.
>
>> However, there may exist such commands that use a subsample in Step 1.
>> In other words, before the command does anything, the sample is
>> reduced according to the -if- condition, so all other activities that
>> the command is going to do are on this reduced sample. It seems to me
>> that most commands work this way. But I found that -generate- is an
>> exception. It does not restrict the sample until the last step.
>> I think this is a little confusing. At least, there is no consistency
>> in when to restrict the sample.
>> Thank you.
>
> Sorry, but I don't catch your meaning here at all. You've presumably
> withdrawn your claim about -egen-, so you seem to be offering
> speculation, but no examples that anyone else can discuss.
>
>> On Sat, Mar 22, 2014 at 6:45 PM, Nick Cox <[email protected]> wrote:
>>> I don't think the one precise example here is puzzling in any sense.
>>> Previous values of -mpg- are put in a new variable if and only if
>>> -foreign- is 1. This is calculated observation by observation.
>>>
>>> You allude to different behaviour with -egen-. But the help for -egen- explains
>>>
>>> "Explicit subscripting (using _N and _n), which is commonly used with
>>>     generate, should not be used with egen; see subscripting."
>>>
>>> That may illuminate your puzzlement.
>>>
>>> Nick
>>> [email protected]
>>>
>>>
>>> On 22 March 2014 21:26, Yu Chen, PhD <[email protected]> wrote:
>>>> I think there is some ambiguity in the meaning and usage of the -if-
>>>> qualifier. Generally, the command is performed on a subset that meets
>>>> the -if- condition. However, a command may perform many tasks, and the
>>>> subset for each task is not clear sometimes. For example, for the
>>>> -generate- command, it seems to calculate the result of the expression
>>>> on the full sample first, and then that result is assigned to a
>>>> subsample that meets the -if- condition. However, for the -egen-
>>>> command, the calculation is performed on a subset that meets the -if-
>>>> condition, not the full sample, and then that result is assigned to
>>>> the new variable on that subsample.
>>>>
>>>> For example, see the code below.
>>>>
>>>> sysuse auto
>>>> gen mpg2=mpg[_n-1] if foreign==1
>>>>
>>>> Notice that observation number 53 has a value of 24 for mpg2. This
>>>> indicates that the task of taking a lagged value is performed on the
>>>> full sample first. Otherwise, this value should be missing. But -egen-
>>>> works differently.
>>>>
>>>> There may exist other cases that have similar ambiguities. I would
>>>> suggest that Stata have a clear rule to address this issue. If the
>>>> rule is already out there, please tell me.
>>>> Thank you very much.
>>>>
>>>>