| Title | How to add factor-variable support to a command | |
| Author | Theresa Boswell, StataCorp |
Why is my command not working with factor-variable syntax and/or collinear variables in Stata 11?
As of Stata 11, variables are no longer dropped because of collinearity. Instead, these variables are omitted and are labeled with an “o.” operator in the column names of the resulting parameter vector. Also factor variables are supported in most official Stata commands, and community-contributed commands will need to understand this new syntax for factor variables to be accepted.
Various steps are required to allow factor variables to be specified with a command and to correctly handle the new collinearity behavior. The difficulty in doing so will depend on the type of assumptions that are made with the current command’s code. Below we provide many of the basic situations and their solutions.
First, here is a list of programmer’s commands you will find helpful for supporting factor variables and adding the new collinear behavior to your commands:
To familiarize yourself with the new factor-variable syntax, review the factor-variable manual entry in [U] 11.4.3 Factor variables. For information on the new collinearity behavior, review the manual entry [P] _rmcoll.
Once you are familiar with these new changes, you are ready to begin altering your ado-file so that your commands will work in Stata 11 or newer versions and make use of these new additions.
Note: If you do not want to allow factor variables with your command and want the old collinearity behavior, you can use version control (version < 11) on estimation commands so that the old behavior is obtained. For example,
. sysuse auto
(1978 Automobile Data)
. generate weight2 = weight + 2
. logistic foreign mpg turn weight weight2
note: weight2 omitted because of collinearity
Logistic regression Number of obs = 74
LR chi2(3) = 43.00
Prob > chi2 = 0.0000
Log likelihood = -23.535653 Pseudo R2 = 0.4774
------------------------------------------------------------------------------
foreign | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | .835845 .0783689 -1.91 0.056 .6955319 1.004464
turn | .6707935 .1096432 -2.44 0.015 .4869198 .9241028
weight | .9976149 .0011761 -2.03 0.043 .9953125 .9999228
weight2 | (omitted)
------------------------------------------------------------------------------
. matrix list e(b)
e(b)[1,5]
foreign: foreign: foreign: foreign: foreign:
o.
mpg turn weight weight2 _cons
y1 -.17931203 -.39929391 -.0023879 0 24.859261
. version 10.1: logistic foreign mpg turn weight weight2
note: weight2 dropped because of collinearity
Logistic regression Number of obs = 74
LR chi2(3) = 43.00
Prob > chi2 = 0.0000
Log likelihood = -23.535653 Pseudo R2 = 0.4774
------------------------------------------------------------------------------
foreign | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | .835845 .0783681 -1.91 0.056 .6955332 1.004462
turn | .6707935 .109641 -2.44 0.015 .4869229 .9240969
weight | .9976149 .0011761 -2.03 0.043 .9953125 .9999227
------------------------------------------------------------------------------
. matrix list e(b)
e(b)[1,4]
mpg turn weight _cons
y1 -.17931203 -.39929391 -.0023879 24.859261
Note that under version control, we have the old behavior of handling collinearity. Therefore, for authors who simply want to use the old behavior, the solution is as simple as adding version control to any command that checks collinearity.
For those who want their commands to use the features introduced in Stata 11, the following steps are adequate for most cases.
syntax varlist syntax varlist(ts) syntax varlist(numeric) syntax varlist(default=none)to
syntax varlist(fv) syntax varlist(ts fv) syntax varlist(numeric fv) syntax varlist(default=none fv)To check if factor variables were specified in the varlist, you may use the saved macro s(fvops) from syntax, which will be equal to “true” when factor variables are specified and empty otherwise.
syntax varlist(fv)
local fvops = "`s(fvops)'" == "true" | _caller() >= 11
if `fvops' {
// within this loop, you can expand the factor variable list,
// create a local for version control and perform any other
// steps needed for factor operated varlists
}
. sysuse auto, clear (1978 Automobile Data) . local 0 "i(1/4).rep78" . syntax varlist(fv) . display "varlist = `varlist'" varlist = i(1 2 3 4).rep78You will usually want to work with an expanded list. If you would like to expand the varlist into all levels of the factor variables, you may use the fvexpand command. For example,
. fvexpand i(1/4).rep78
. return list
macros:
r(fvops) : "true"
r(varlist) : "1b.rep78 2.rep78 3.rep78 4.rep78"
This command will allow both the
if and
in
operators. Thus if you have marked your estimation sample, it is good
programming to only expand the factor variable to the levels that are
included in your sample. Notice the macro r(fvops) is returned
from the fvexpand command. Thus you could also use this command as a
check if the variable list contains factor-operated variables.
. local list "L.mpg weight i.rep78"
. fvrevar `list', list
. return list
macros:
r(varlist) : "mpg weight rep78"
Notice the list option does not create any temporary variables. A
similar command to fvrevar, list is unopvarlist. Using the list
created above, we have the following:
. unopvarlist `list'
. return list
macros:
r(varlist) : "mpg weight rep78"
program prog1
syntax varlist(numeric fv)
local fvops = "`s(fvops)'" == "true" | _caller() >= 11
if `fvops' {
local vv: di "version " string(max(11,_caller())) ", missing: "
}
end
This sets the macro vv to contain the maximum of 11 or the current
version number. Common commands that will need to be called under version
control are as follows:
_rmcoll `varlist' if `touse'to
`vv' _rmcoll `varlist' if `touse'Please see step 8 for more information on the changes to the _rmcoll command. If you are not sure if passing the version number is needed, try running the command without adding the version number. You will normally see error messages similar to the following if this was needed:
. version 10 . regress mpg weight i.rep78 factor variables not allowed r(101); . regress mpg weight rep78#foreign interactions not allowed r(101); . mat colnames b = `coln' rep78#0b: operator invalid r(198);Once you see an error message similar to these, you can trace your code and see where the version control is needed.
cmd1 depvar [indepvars] [if] [in] [,options]and a user typed the following:
cmd1 i.y x1 x2You would want to give an error message if factor variables are not allowed in the dependent variable. Instead of creating your own error message, you can use the Stata’s default error message. Consider the following:
program cmd1, rclass
syntax varlist(fv ts) [if] [in] , [*]
local fvops = "`s(fvops)'" == "true" | _caller() >= 11
if `fvops' {
local vv: di "version " ///
string(max(11,_caller())) ", missing: "
gettoken lhs rest : varlist
_fv_check_depvar `lhs'
}
end
This will give the following error message:
. cmd1 i.y x1 x2 depvar may not be a factor variable r(198);If you have more than one dependent variable, you can use the k(#) option with _fv_check_depvar to specify how many variables to check.
sysuse auto, clear gen mpg2 = mpg+2 _rmcoll weight mpg mpg2 turnwill return the following:
note: mpg2 omitted because of collinearity
. return list
scalars:
r(k_omitted) = 1
macros:
r(varlist) : "weight mpg o.mpg2 turn"
Because of this, you will no longer be able to use the column count of
e(b) or e(V) to determine the model degrees of freedom. A solution that
will often work is to use the rank of e(V). If the rank is not
already returned as e(rank), you may use the undocumented command
_post_vce_rank to do this.
sysuse auto, clear
gen mpg2 = mpg+2
regress weight mpg mpg2 turn o.trunk
local coln : colnames e(b)
local count 0
foreach var of local coln {
_ms_parse_parts `var'
if !`r(omit)' {
local list `list' `var'
local ++count
}
}
local list : subinstr local list "_cons" ""
display "non-omitted variables are `list'"
display "number of noon-omitted variables is `count'"
_ms_parse_parts will also return other values that can be used to
determine levels of factor variables and interactions, for instance.
sysuse auto, clear gen mpg2 = mpg+2 regress weight mpg mpg2 turn o.trunk _ms_omit_info e(b) return list mat list r(omit)If you only want to know if a matrix contains factor variables or omitted variables, you can use the command _ms_op_info, which will return the scalars r(fvops) and r(tsops) indicating whether the matrix contains factor variables and/or time-series operators in the column names.
program prog2 syntax varlist(fv) _ms_extract_varlist `varlist' local varlist `r(varlist)' noi di "varlist = `varlist'" endIf the specified levels are not contained in e(b), a similar error message as shown below will be given:
. _ms_extract_varlist 6.rep78 level 6 of factor rep78 not found in e(b) r(111);This command can replace loops where a manual check is done to see if the specified variable is contained in e(b) and/or e(V). This can be helpful even if factor variables are not specified.
. sysuse auto, clear
(1978 Automobile Data)
. generate mpg2 = mpg+2
. _rmcoll mpg mpg2 turn
note: mpg2 omitted because of collinearity
. return list
scalars:
r(k_omitted) = 1
macros:
r(varlist) : "mpg o.mpg2 turn"
In Stata 11 or newer versions of Stata,
the sweep order for checking collinearity is stable, and
the list returned by _rmcoll will match the results after estimation
commands that use _rmcoll. This was not always true in previous
versions of Stata. If you have factor-operated variables in your varlist,
you must specify the expand() option along with _rmcoll for
the returned list to be expanded into each level of the factors. Here is
an example showing what you would obtain with and without this option:
. _rmcoll mpg mpg2 i.rep78
note: mpg2 omitted because of collinearity
. return list
scalars:
r(k_omitted) = 2
macros:
r(varlist) : "mpg o.mpg2 i(1 2 3 4 5)b1.rep78"
. _rmcoll mpg mpg2 i.rep78, expand
note: mpg2 omitted because of collinearity
. return list
scalars:
r(k_omitted) = 2
macros:
r(varlist) : "mpg o.mpg2 1b.rep78 2.rep78 3.rep78 4.rep78 5.rep78"
varname#0b: operator invalid r(198);
tempname b V noomit matrix `b' = e(b) matrix `V' = e(V) _ms_omit_info `b' local cols = colsof(`b') matrix `noomit' = J(1,`cols',1) - r(omit)We created a matrix named noomit which will mark columns that should be kept with the number one and zero otherwise. Now we can use the select() function in Mata to return the reduced matrix.
/* reduce matrix V */
mata: newV = select(st_matrix(st_local("V")),(st_matrix(st_local("noomit"))))
mata: newV = select(newV, (st_matrix(st_local("noomit")))')
mata: st_matrix(st_local("V"),newV)
/* reduce matrix b */
mata: newB = select(st_matrix(st_local("b")),(st_matrix(st_local("noomit"))))
mata: st_matrix(st_local("b"),newB)
The function st_matrix() was used to overwrite the temporary
matrices b and V with the reduced matrices obtained in Mata. If you want
the column names to still be in place on the reduced matrix, you will
need to reassign the names. Now that we have the reduced matrices, we
can perform computations that require the reduced form.