How do I perform multiple operations on data records if a condition is met?
Title: Multiple operations on data records
Author: David Kantor, Johns Hopkins University
Date: October 2001; updated February 2003
Question
I'm a SAS user new to Stata. I have not been able to find any references on
how to perform multiple operations on data records if a condition is met.
For example, if I want to reset var1 and var2 based on CONDITION1 and
CONDITION2, I've so far only been able to use redundant code:
. replace var1 = 1 if CONDITION1 & CONDITION2
. replace var2 = "Y" if CONDITION1 & CONDITION2
In SAS I would write
if CONDITION1 and CONDITION2 then do;
var1 = 1;
var2 = 'Y';
end;
I'd also like to figure out how to nest IF statements. In SAS, I could
write
if CONDITION1 and CONDITION2 then do;
var1 = 1;
var2 = 'Y';
if CONDITION3 then var3 = 100;
end;
Answer
First, you need to understand the distinction between the if
statement and the if qualifier.
The if qualifier is a clause you tack onto a statement or
program call, such as in your example:
. replace var1 = 1 if CONDITION1 & CONDITION2
Assuming that CONDITION1 & CONDITION2 involve variables, this
operation will apply to a subset of the observations (possibly all of them,
possibly none), depending on how the conditions evaluate on the data. Think
of this as a filter that screens which observations the statement applies
to.
If, on the other hand, CONDITION1 & CONDITION2 are constant (do not
depend on variables), then it is still a filter, but you are filtering in
either all or none of the observations. See
http://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/ for more
information. Here it might be better to use an if statement, which
will be explained later.
The repetition of if qualifiers you cited is a common practice
in Stata, and it is usually not considered a problem. If the condition is
complex and you don't want to waste computer time recalculating it for each
statement (or risk not typing it exactly the same in each statement), then
you would want to capture its values in a new variable. You would do
something like the following statements:
. generate byte cond7 = CONDITION1 & CONDITION2
. replace var1 = 1 if cond7
. replace var2 = "Y" if cond7
The if qualifier cannot be nested in the same way as in SAS. In Stata,
the equivalent of your nesting example would be, in addition to the
statements above,
. replace var3 = 100 if cond7 & CONDITION3
(You may want to drop cond7 later. Or, if your code is in a program or
do-file, use a tempvar, and it will be automatically dropped when the
program or do-file exits.)
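As a rough sketch of the tempvar approach in a do-file, the following uses made-up data, with x >= 2, y < 40, and y > 30 standing in for CONDITION1, CONDITION2, and CONDITION3 (all of these are assumptions for illustration only):

```stata
* Sketch only: the data and the conditions are hypothetical stand-ins
clear
input x y
1 10
2 25
3 35
4 45
5 .
end
generate var1 = 0
generate str1 var2 = "N"
generate var3 = 0

* Compute the shared condition once, under a temporary variable name
tempvar cond7
generate byte `cond7' = x >= 2 & y < 40
replace var1 = 1   if `cond7'
replace var2 = "Y" if `cond7'
* The SAS-style nested IF becomes an extra term in the qualifier
replace var3 = 100 if `cond7' & y > 30
```

The temporary variable is dropped when the do-file finishes, so no cleanup step is needed.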
The if statement is something entirely different. It controls
whether a statement or block of statements gets executed. In this
situation, the if keyword is at the beginning of the statement:
if CONDITION4 {
... OTHER STATEMENTS ...
}
The condition controlling it usually does not involve variables. (If it
does and the variable is not subscripted, then the value in the first
observation is taken. It is unlikely that you would really want to do such
a thing, though one might code it by mistake.) You can combine several
statements under an if statement, but the whole block will either be
executed or skipped. I recommend that you see [P] if or
help ifcmd.
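To make the first-observation behavior concrete, here is a small sketch on made-up data (the variable names x, flag, and flag2 are hypothetical). The if statement looks only at x[1], while the if qualifier is evaluated for each observation:

```stata
* Sketch only: three made-up observations
clear
input x
5
1
1
end
generate byte flag = 0

* if statement: only x[1] (which is 5) is examined, so the block
* runs and the replace changes every observation
if x > 3 {
    replace flag = 1
}

* if qualifier: evaluated separately for each observation
generate byte flag2 = 0
replace flag2 = 1 if x > 3
```

Here flag ends up as 1 in all three observations, whereas flag2 is 1 only in the first.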
if statements can be nested.
An if statement can optionally be followed by an else
statement. (The if qualifier, however, does not have a
corresponding else part, although for assigning values there is
something analogous in the cond() function, which is described
below.)
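A minimal sketch of nested if statements with an else part might look like this. The local macros rescale and verbose are hypothetical switches, not anything Stata defines; the point is that the controlling conditions are constants rather than variables:

```stata
* Sketch only: `rescale' and `verbose' are made-up control switches
clear
set obs 5
generate y = _n * 10
local rescale 1
local verbose 1

if `rescale' == 1 {
    * Each command in the block still runs over all observations
    replace y = y / 10
    generate byte rescaled = 1
    if `verbose' == 1 {
        display "y was rescaled"
    }
}
else {
    generate byte rescaled = 0
}
```

Because the local macros must be defined when the if statement is read, run the lines together from a do-file.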
Finally, and this is key to understanding the distinction between the
if statement and the if qualifier, as well as to the
difference between SAS and Stata, be aware that Stata applies each
operation, in turn, to the whole dataset, subject to filtering by if
qualifiers. Thus in the example above involving the if qualifier,
the first replace command is applied to all observations (subject to
the filtering imposed by its if qualifier); then the second
replace command is applied to all observations (subject to the
filtering imposed by its if qualifier). That is why you can't
combine several commands under one if qualifier. SAS does it the
other way: the whole sequence of statements is executed for the first
observation, then the second observation, etc. This is a significant
difference.
(Another way to look at this is to note that any statement that applies to
the whole set of observations involves an implicit loop that steps through
all the observations. In Stata, that loop occurs separately for each
statement. In SAS, it surrounds the whole sequence of statements.)
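One place where this difference shows up is when an earlier command changes a variable that a later condition depends on. The sketch below uses made-up data and hypothetical variable names (x, y, z, small) to illustrate it:

```stata
* Sketch only: three made-up observations
clear
input x
1
2
3
end
generate y = 0
generate z = 0

* Each replace is its own pass over every observation
replace x = x + 10 if x < 3
* By this second pass x is already 11, 12, 3, so nothing qualifies
replace y = 1 if x < 3

* Restore x, then freeze the condition before changing anything
replace x = x - 10 if x > 10
generate byte small = x < 3
replace x = x + 10 if small
replace z = 1 if small
```

A SAS block of the form if x < 3 then do; x = x + 10; y = 1; end; would set y to 1 for the first two records. In the Stata version y stays 0 everywhere, while z, which relies on the stored condition, is 1 for the first two observations.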
I also recommend that you look up the
cond() function,
which can make certain constructs much more efficient. When I was a novice,
I would have written the following code:
. generate byte a = 1 if y <= 20
. replace a = 2 if y > 20 & y <= 30
. replace a = 3 if y > 30 & y <= 40
. replace a = 4 if y > 40 & y <.
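I now do it this way (though some people would debate whether this is
better). Because #delimit works only in a do-file or program, the lines are
shown without the dot prompt, and the delimiter is reset afterward:
#delim ;
generate byte a = cond(y<=20, 1,
                  cond(y<=30, 2,
                  cond(y<=40, 3,
                  cond(y<., 4, . ))));
#delim cr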
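If you want to convince yourself that the two forms give the same result, one possible check is to build both on a few made-up values, including boundary cases and a missing value, and compare them with assert. The variable names a and b are used here only for illustration:

```stata
* Sketch only: made-up values covering the cut points and a missing
clear
input y
10
20
25
40
41
.
end

* Step-by-step version
generate byte a = 1 if y <= 20
replace a = 2 if y > 20 & y <= 30
replace a = 3 if y > 30 & y <= 40
replace a = 4 if y > 40 & y < .

* Nested cond() version, written on one line
generate byte b = cond(y<=20, 1, cond(y<=30, 2, cond(y<=40, 3, cond(y<., 4, .))))

* Stops with an error if the two versions ever disagree
assert a == b
```

Both versions leave the result missing when y is missing, which is why the comparison passes for the last observation as well.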
One more thing: beware of missing values in conditions. A missing value
used as a condition is taken as true, and missing values compare as greater
than any nonmissing number.
See the FAQ
http://www.stata.com/support/faqs/data-management/logical-expressions-and-missing-values/ for details.
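A short sketch on made-up data shows both effects. The variable names big_naive, big_safe, and nonzero are hypothetical:

```stata
* Sketch only: four made-up observations, the last one missing
clear
input y
0
15
45
.
end

* Missing is larger than any number, so y > 40 is true for it
generate byte big_naive = y > 40
* Excluding missing explicitly gives the intended result
generate byte big_safe = y > 40 & y < .

* A missing value used directly as a condition counts as true
generate byte nonzero = 0
replace nonzero = 1 if y
```

Here big_naive is 1 for the missing observation while big_safe is 0, and nonzero is 1 for every observation except the one where y is 0.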