Interaction expansion                                          (STB-20: sg25)
---------------------

        ^xi^ term(s)

        ^xi:^ any_stata_command varlist_with_terms ...

where a term is of the form:

        ^i.^varname                   or   ^I.^varname
        ^i.^varname1^*i.^varname2            ^I.^varname1^*I.^varname2
        ^i.^varname1^*^varname2              ^I.^varname1^*^varname2
        ^i.^varname1^|^varname2              ^I.^varname1^|^varname2


Description
-----------

^xi^ expands terms containing categorical variables into dummy-variable sets
by creating new variables and, in the second syntax, executes the specified
command on the expanded terms.


Description, continued
----------------------

^xi^ provides a convenient way to include dummy or indicator variables when
estimating a model (say with ^regress^, ^logistic^, etc.).  For instance,
assume the categorical variable agegrp contains 1 for ages 20-24, 2 for ages
25-39, 3 for ages 40-44, etc.  Typing

        . ^xi: logistic outcome weight i.agegrp bp^

estimates a logistic regression of outcome on weight, dummies for each agegrp
category, and bp.  If you also had a string variable race containing "white",
"black", and "other", typing

        . ^xi: logistic outcome weight bp i.agegrp i.race^

includes indicator variables for the race group as well.

The "^i.^" indicator variables ^xi^ expands may appear anywhere in the
varlist, so

        . ^xi: logistic outcome i.agegrp weight i.race bp^

would estimate the same model.


Description, continued
----------------------

You can also create interactions of categorical variables; typing

        . ^xi: logistic outcome weight bp i.agegrp*i.race^

estimates a model including indicator variables for all agegrp and race
combinations.  You can interact categorical variables with continuous
variables:

        . ^xi: logistic outcome bp weight i.agegrp*weight i.race^

And, of course, you can include multiple interactions:

        . ^xi: logistic outcome bp weight i.agegrp*weight i.agegrp*i.race^

We will now back up and consider each of ^xi^'s features in detail.

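
As a rough sketch of what ^xi^ automates in the first example above, the
dummies could be built by hand with ^tabulate^'s ^generate()^ option.  The
stub ag is a hypothetical name chosen only for illustration, and agegrp is
assumed here to take exactly four values:

        . ^tabulate agegrp, generate(ag)^
        . ^logistic outcome weight ag2-ag4 bp^
        . ^drop ag1-ag4^

Here ag1, the dummy for the smallest value of agegrp, is left out of the
model, which mirrors ^xi^'s default of omitting the dummy for the smallest
value.  Unlike ^xi^, this approach does not label or clean up the dummies
for you, hence the explicit ^drop^.
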

Indicator variables for simple effects
--------------------------------------

When you type "^i.^varname", ^xi^ internally tabulates varname (which may be
a string or a numeric variable) and creates dummy (or indicator) variables
for each observed value, omitting the dummy for the smallest value.  For
instance, say agegrp takes on the values 1, 2, 3, and 4.  Typing

        . ^xi: logistic outcome i.agegrp^

creates indicator variables named Iagegr_2, Iagegr_3, and Iagegr_4.  (^xi^
chooses the names and tries to make them readable; ^xi^ guarantees that the
names are unique.)  The expanded logistic model then is:

        . ^logistic outcome Iagegr_2 Iagegr_3 Iagegr_4^

Afterwards, you can drop the new variables ^xi^ leaves behind by typing
"^drop I*^" (note capitalization).


Indicator variables for simple effects (continued)
--------------------------------------------------

^xi^ provides the following features when you type "^i.^varname":

    1)  varname may be string or numeric.

    2)  Dummy variables are created automatically.

    3)  By default, the dummy-variable set is identified by dropping the
        dummy corresponding to the smallest value of the variable (how to
        specify otherwise is discussed below).

    4)  The new dummy variables are left in your data set.  You can drop
        them by typing "^drop I*^".  You do not have to do this; each time
        you use the ^xi^ prefix (or command), any previously created
        automatically generated dummies are dropped and new ones created.

    5)  The new dummy variables have variable labels so you can determine
        to what they correspond by typing "^describe^" or "^describe I*^".

    6)  ^xi^ may be used with any Stata command (not just ^logistic^).


Controlling the omitted dummy
-----------------------------

By default, ^i.^varname omits the dummy corresponding to the smallest value
of varname; in the case of a string variable, this is interpreted as dropping
the first in an alphabetical, case-sensitive sort.
^xi^ provides two alternatives to dropping the first:  ^xi^ will drop the
dummy corresponding to the most prevalent value of varname or ^xi^ will let
you choose the particular dummy to be dropped.

To change ^xi^'s behavior to dropping the most prevalent, you type

        . ^global S_XIMODE "prevalent"^

although whether you type "prevalent" inside the quotes or "yes" or anything
else does not matter.  You need to type this command only once per session
and, once typed, it affects the expansion of all categorical variables.  If,
during the session, you want to change the behavior back to the default
drop-the-first rule, you type:

        . ^global S_XIMODE^


Controlling the omitted dummy, continued
----------------------------------------

Once you set S_XIMODE, ^i.^varname omits the dummy corresponding to the most
prevalent value of varname.  Thus, the coefficients on the dummies have the
interpretation of change from the most prevalent group.  E.g.,

        . ^global S_XIMODE "prevalent"^
        . ^xi: regress y i.agegrp^

might create Iagegr_1 through Iagegr_4 and would result in Iagegr_2 being
omitted if agegrp==2 is most common.  The model is then

        y = a + b*Iagegr_1 + c*Iagegr_3 + d*Iagegr_4 + u

Then,

        Pred. y for agegrp==1 = a + b        Pred. y for agegrp==3 = a + c
        Pred. y for agegrp==2 = a            Pred. y for agegrp==4 = a + d

Thus, the model's reported t or z statistics are for a test of whether each
group is different from the most prevalent group.


Controlling the omitted dummy (continued)
-----------------------------------------

Perhaps you wish to omit the dummy for agegrp==3 instead.  You do this by
creating a global macro with the same name as the variable containing:

        . ^global agegrp "xi omit 3"^

Now when you type

        . ^xi: regress y i.agegrp^

Iagegr_3 will be omitted and you will estimate the model:

        y = a + b*Iagegr_1 + c*Iagegr_2 + d*Iagegr_4 + u

If you want to return to the default omission in the future, you type:

        . ^global agegrp^

thus clearing the macro.

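
By analogy with the most-prevalent example, and assuming agegrp takes only
the values 1 through 4, omitting Iagegr_3 as in the model above
(y = a + b*Iagegr_1 + c*Iagegr_2 + d*Iagegr_4 + u) gives

        Pred. y for agegrp==1 = a + b        Pred. y for agegrp==3 = a
        Pred. y for agegrp==2 = a + c        Pred. y for agegrp==4 = a + d

so each reported coefficient measures the difference from the agegrp==3
group.
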

Controlling the omitted dummy (continued)
-----------------------------------------

In summary, ^i.^varname omits the first group by default but if you define

        . ^global S_XIMODE "prevalent"^

then the default behavior changes to that of dropping the most prevalent
group.  Either way, if you define a macro of the form

        . ^global^ varname "^xi omit^ #"

or, if varname is a string,

        . ^global^ varname "^xi omit^ string_literal"

then the specified value will be omitted.

Examples:

        . ^global agegrp "xi omit 1"^
        . ^global race "xi omit White"^       (for race a string variable)
        . ^global agegrp^                     (to restore default)


Categorical variable interactions
---------------------------------

^i.^varname1^*i.^varname2 creates the dummy variables associated with the
interaction of the categorical varname1 with varname2.  The identification
rules -- which categories are omitted -- are the same as for ^i.^varname.
For instance, assume agegrp takes on four values and race takes on three.
Typing

        . ^xi: regress y i.agegrp*i.race^

results in the model:

        y = a + b*Iagegr_2 + c*Iagegr_3 + d*Iagegr_4     (agegrp dummies)
              + e*Irace_2  + f*Irace_3                   (race dummies)
              + g*IaXr_2_2 + h*IaXr_2_3 + i*IaXr_3_2     (agegrp*race dummies)
              + j*IaXr_3_3 + k*IaXr_4_2 + l*IaXr_4_3
              + u


Categorical variable interactions (continued)
---------------------------------------------

That is,

        . ^xi: regress y i.agegrp*i.race^

results in the same model as typing

        . ^xi: regress y i.agegrp i.race i.agegrp*i.race^

While there are lots of other ways the interaction could have been
parameterized, this method has the advantage that one can test the joint
significance of the interactions by typing

        . ^testparm IaXr*^

(see ^help test^).

Returning to the estimation step, whether you specify ^i.agegrp*i.race^ or
^i.race*i.agegrp^ makes no difference other than in the names given to the
interaction terms (in the first case, the names will begin with IaXr; in the
second, IrXa).

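
As a hedged sketch of what this expansion amounts to, the same model could be
assembled by hand.  The names ag, rc, and a2r2 through a4r3 are hypothetical,
chosen only for illustration (they are not the names ^xi^ creates), and
agegrp and race are assumed to take four and three values, respectively:

        . ^tabulate agegrp, generate(ag)^
        . ^tabulate race, generate(rc)^
        . ^generate byte a2r2 = ag2*rc2^
        . ^generate byte a2r3 = ag2*rc3^
        . ^generate byte a3r2 = ag3*rc2^
        . ^generate byte a3r3 = ag3*rc3^
        . ^generate byte a4r2 = ag4*rc2^
        . ^generate byte a4r3 = ag4*rc3^
        . ^regress y ag2-ag4 rc2-rc3 a2r2 a2r3 a3r2 a3r3 a4r2 a4r3^
        . ^testparm a2r2 a2r3 a3r2 a3r3 a4r2 a4r3^

The six products correspond to the six agegrp*race dummies in the model
above, with the first category of each variable omitted, and the final
^testparm^ plays the role of "^testparm IaXr*^".
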

Categorical variable interactions (continued)
---------------------------------------------

You may include multiple interactions simultaneously:

        . ^xi: regress y i.agegrp*i.race i.agegrp*i.sex^

The model estimated is:

        y = a + b*Iagegr_2 + c*Iagegr_3 + d*Iagegr_4     (agegrp dummies)
              + e*Irace_2  + f*Irace_3                   (race dummies)
              + g*IaXr_2_2 + h*IaXr_2_3 + i*IaXr_3_2     (agegrp*race dummies)
              + j*IaXr_3_3 + k*IaXr_4_2 + l*IaXr_4_3
              + m*Isex_2                                 (sex dummy)
              + n*IaXs_2_2 + o*IaXs_3_2 + p*IaXs_4_2     (agegrp*sex dummies)
              + u

Note that the agegrp dummies are (correctly) included only once.


Interactions with continuous variables
--------------------------------------

^i.^varname1^*^varname2 (as distinguished from ^i.^varname1^*i.^varname2,
note the second ^i.^) specifies interactions of a categorical variable with
a continuous variable.  Typing "^xi: regress y i.agegrp*wgt^" results in the
model:

        y = a + b*Iagegr_2 + c*Iagegr_3 + d*Iagegr_4     (agegrp dummies)
              + e*wgt                                    (continuous wgt effect)
              + f*IaXwgt_2 + g*IaXwgt_3 + h*IaXwgt_4     (agegrp*wgt dummies)
              + u

A variation on this notation, using ^|^ rather than ^*^, omits the agegrp
dummies.  Typing "^xi: regress y i.agegrp|wgt^" results in the model:

        y = a + e*wgt                                    (continuous wgt effect)
              + f*IaXwgt_2 + g*IaXwgt_3 + h*IaXwgt_4     (agegrp*wgt dummies)
              + u


Interactions with continuous variables, continued
-------------------------------------------------

That is, typing

        . ^xi: regress y i.agegrp*wgt^

is equivalent to typing:

        . ^xi: regress y i.agegrp i.agegrp|wgt^

Also note that in either case, it is not necessary to specify separately the
wgt variable; it is included automatically.


Interpreting output
-------------------

        . ^xi: regress mpg i.rep78^
        i.rep78           Irep78_1-5   (naturally coded; Irep78_1 omitted)
        (output from regress appears)

Interpretation:  ^i.rep78^ expanded to the dummies Irep78_1, Irep78_2, ...,
Irep78_5.  The numbers on the end are "naturally" coded in the sense that
Irep78_1 corresponds to rep78==1, Irep78_2 to rep78==2, etc.
Finally, the dummy for rep78==1 was omitted.

        . ^xi: regress mpg i.make^
        i.make            Imake_1-74   (Imake_1 for make==AMC Concord omitted)
        (output from regress appears)

Interpretation:  ^i.make^ expanded to Imake_1, Imake_2, ..., Imake_74.  The
coding is not natural because make is a string variable.  Imake_1 corresponds
to one make, Imake_2 another, and so on.  We can find out the coding by
typing "^describe^".  Imake_1 for the AMC Concord was chosen to be omitted.


How ^xi^ names variables
----------------------

The names ^xi^ assigns to the dummy variables it creates are of the form:

        ^I^stub^_^groupid

You may subsequently refer to the entire set of variables by ^I^stub^*^.
For example:

        name       =   ^I^  +  stub   +  ^_^  +  groupid      Entire set
        ------------------------------------------------------------------
        Iagegr_1       I     agegr     _     1            Iagegr*
        Iagegr_2       I     agegr     _     2            Iagegr*
        IaXwgt_1       I     aXwgt     _     1            IaXwgt*
        IaXr_1_2       I     aXr       _     1_2          IaXr*
        IaXr_2_1       I     aXr       _     2_1          IaXr*


^xi^ as a command rather than a command prefix
--------------------------------------------

^xi^ can be used as a command prefix or as a command by itself.  In the
latter form, ^xi^ merely creates the indicator and interaction variables.
Equivalent to typing

        . ^xi: regress y i.agegrp*wgt^

is:

        . ^xi i.agegrp*wgt^
        i.agegrp          Iagegr_1-4   (naturally coded; Iagegr_1 omitted)
        i.agegrp*wgt      IaXwgt_1-4   (coded as above)
        . ^regress y Iagegr* IaXwgt*^


Warnings
--------

  - When you use ^xi^, either as a prefix or a command by itself, ^xi^ first
    drops all previously created interaction variables -- variables starting
    with capital ^I^.  Do not name your variables starting with this letter.

  - ^xi^ creates new variables in your data; most are ^byte^s but
    interactions with continuous variables will have the storage type of the
    underlying continuous variable.  You may get the message "no room to add
    more variables".  If so, you must repartition memory; see ^help memsize^
    or [4] memory.

  - When using ^xi^ with an estimation command, you may get the message
    "matsize too small".
    If so, see ^help matsize^.


Also see
--------

     STB:  sg25 (STB-20)
  Manual:  [4] estimate
 On-line:  ^help^ for any estimation command