Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Joe Canner <jcanner1@jhmi.edu>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: RE: st: ambiguity in -if- qualifier
Date: Sun, 23 Mar 2014 01:31:12 +0000

Yu,

I think I understand what you're asking and perhaps I can explain it in a different way that might be helpful.

Think about what the purpose of the -generate- command is. As per the documentation, the purpose of -generate- is to "create a new variable". If there is an -if- qualifier, this variable is only created for observations included in the -if- condition. (Well, technically it is created for all observations, but it is missing for every observation not in the -if- condition.) The fact that Stata has to do some calculations to put something into the new variable is irrelevant. From the standpoint of the -generate- statement, it is going to create a variable and put values in it for every observation in the -if- condition, regardless of what it has to do to achieve that goal.

I would also point out that you can't say that Stata is evaluating the right-hand side of a -generate- statement on the entire data set. -generate- is a built-in command, so I can't say for sure, either, but I doubt that this is what it does, as that would be very inefficient. As implied above, I suspect that Stata identifies which observations it needs to use and then only attempts to assign values for those observations. If Stata needs to go outside of the -if- condition to do that, so be it.

Regards,

Joe Canner
Johns Hopkins University School of Medicine

________________________________________
From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [njcoxstata@gmail.com]
Sent: Saturday, March 22, 2014 9:09 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: ambiguity in -if- qualifier

Comments below.

Nick
njcoxstata@gmail.com

On 23 March 2014 00:44, Yu Chen, PhD <profyuchen@gmail.com> wrote:

> Hi, Nick,
> Let me clarify. For any assignment to a new variable, there are two
> steps. Step 1, the expression should be evaluated; and Step 2, the
> result of the evaluation is assigned to the new variable.
> My question is, what is the sample used in each step?
> For -generate-, Step 1 uses the full sample. In other words, all
> observations, regardless of whether they meet the -if- condition, can be
> used. But in Step 2, -generate- uses the subsample that meets the -if-
> condition.

I don't think this word treatment helps understanding. In your -generate- example two things are happening simultaneously:

A. Stata is being instructed to put previous values of -mpg- in a new variable.

B. Stata is being instructed to do that only if -foreign- is 1.

You are surmising that A is done in a Step 1, which is followed by B in a Step 2. But it makes just as much sense to imagine that Stata works out that the variable should receive non-missing values only when -foreign- is 1 and then works out what they should be. Either way, the result is the same.

> However, there may exist such commands that use a subsample in Step 1.
> In other words, before the command does anything, the sample is
> reduced according to the -if- condition, so all other activities that
> the command is going to do are on this reduced sample. It seems to me
> that most commands work this way. But I found that -generate- is an
> exception. It does not restrict the sample until the last step.
> I think this is a little confusing. At least, there is no consistency
> in when to restrict the sample.
> Thank you.

Sorry, but I don't catch your meaning here at all. You've presumably withdrawn your claim about -egen-, so you seem to be offering speculation, but no examples that anyone else can discuss.

> On Sat, Mar 22, 2014 at 6:45 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> I don't think the one precise example here is puzzling in any sense.
>> Previous values of -mpg- are put in a new variable if and only if
>> -foreign- is 1. This is calculated observation by observation.
>>
>> You allude to different behaviour with -egen-.
>> But the help for -egen- explains
>>
>> "Explicit subscripting (using _N and _n), which is commonly used with
>> generate, should not be used with egen; see subscripting."
>>
>> That may illuminate your puzzlement.
>>
>> Nick
>> njcoxstata@gmail.com
>>
>>
>> On 22 March 2014 21:26, Yu Chen, PhD <profyuchen@gmail.com> wrote:
>>> I think there is some ambiguity in the meaning and usage of the -if-
>>> qualifier. Generally, the command is performed on a subset that meets
>>> the -if- condition. However, a command may perform many tasks, and the
>>> subset for each task is not clear sometimes. For example, for the
>>> -generate- command, it seems to calculate the result of the expression
>>> on the full sample first, and then that result is assigned to a
>>> subsample that meets the -if- condition. However, for the -egen-
>>> command, the calculation is performed on a subset that meets the -if-
>>> condition, not the full sample, and then that result is assigned to
>>> the new variable on that subsample.
>>>
>>> For example, see the code below.
>>>
>>> sysuse auto
>>> gen mpg2=mpg[_n-1] if foreign==1
>>>
>>> Notice that observation number 53 has a value of 24 for mpg2. This
>>> indicates that the task of taking a lagged value is performed on the
>>> full sample first. Otherwise, this value should be missing. But -egen-
>>> works differently.
>>>
>>> There may exist other cases that have similar ambiguities. I would
>>> suggest that Stata have a clear rule to address this issue. If the
>>> rule is already out there, please tell me.
>>> Thank you very much.
>>>
>>> Yu Chen
>>> *
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>> * http://www.ats.ucla.edu/stat/stata/
>
>
>
> --
> Yu Chen, Ph.D.
> Assistant Professor of Accounting
> A. R. Sanchez, Jr. School of Business, WHTC 218D
> Texas A&M International University
> 5201 University Boulevard
> Laredo, Texas 78041-1900
> USA
> 956-326-2513 (office)
> 956-326-2479 (fax)
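
The behaviour discussed in this thread can be checked directly. What follows is only a minimal sketch: it assumes the auto data shipped with Stata is in its shipped order (domestic cars first, foreign cars from observation 53 on), and the name mpg3 is made up for the illustration. It shows that the subscript in -generate- refers to the previous observation in the data as they stand, whatever the -if- qualifier says, and that a lag within the foreign cars alone requires restricting the data (or working by group) before subscripting.

```stata
sysuse auto, clear

* The example from the thread: lagged mpg, assigned only where foreign == 1
generate mpg2 = mpg[_n-1] if foreign == 1

* Look at the boundary between domestic and foreign cars
list make foreign mpg mpg2 in 51/55

* mpg2 is missing for every domestic car ...
assert missing(mpg2) if foreign == 0

* ... and equals the previous observation's mpg for every foreign car,
* including observation 53, whose predecessor (observation 52) is domestic
assert mpg2 == mpg[_n-1] if foreign == 1
display mpg2[53]    // the thread reports 24 here

* Contrast: drop the domestic cars first, then take the lag.
* Now the first foreign car has no predecessor, so its lag is missing.
preserve
keep if foreign == 1
generate mpg3 = mpg[_n-1]
assert missing(mpg3) in 1
list make mpg mpg3 in 1/3
restore
```

Both results are consistent with Nick's description that the expression is calculated observation by observation: -if- decides which observations receive a value, while [_n-1] simply points at the preceding row of whatever data are in memory at the time.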