Title | How to add factor-variable support to a command | |
Author | Theresa Boswell, StataCorp |
Why is my command not working with factor-variable syntax and/or collinear variables in Stata 11?
As of Stata 11, variables are no longer dropped because of collinearity. Instead, these variables are omitted and are labeled with an “o.” operator in the column names of the resulting parameter vector. Also factor variables are supported in most official Stata commands, and user-written commands will need to understand this new syntax for factor variables to be accepted.
Various steps are required to allow factor variables to be specified with a command and to correctly handle the new collinearity behavior. The difficulty in doing so will depend on the type of assumptions that are made with the current command’s code. Below we provide many of the basic situations and their solutions.
First, here is a list of programmer’s commands you will find helpful for supporting factor variables and adding the new collinear behavior to your commands:
To familiarize yourself with the new factor-variable syntax, review the factor-variable manual entry in [U] 11.4.3 Factor variables. For information on the new collinearity behavior, review the manual entry [P] _rmcoll.
Once you are familiar with these new changes, you are ready to begin altering your ado-file so that your commands will work in Stata 11 or newer versions and make use of these new additions.
Note: If you do not want to allow factor variables with your command and want the old collinearity behavior, you can use version control (version < 11) on estimation commands so that the old behavior is obtained. For example,
. sysuse auto (1978 Automobile Data) . generate weight2 = weight + 2 . logistic foreign mpg turn weight weight2 note: weight2 omitted because of collinearity Logistic regression Number of obs = 74 LR chi2(3) = 43.00 Prob > chi2 = 0.0000 Log likelihood = -23.535653 Pseudo R2 = 0.4774 ------------------------------------------------------------------------------ foreign | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- mpg | .835845 .0783689 -1.91 0.056 .6955319 1.004464 turn | .6707935 .1096432 -2.44 0.015 .4869198 .9241028 weight | .9976149 .0011761 -2.03 0.043 .9953125 .9999228 weight2 | (omitted) ------------------------------------------------------------------------------ . matrix list e(b) e(b)[1,5] foreign: foreign: foreign: foreign: foreign: o. mpg turn weight weight2 _cons y1 -.17931203 -.39929391 -.0023879 0 24.859261 . version 10.1: logistic foreign mpg turn weight weight2 note: weight2 dropped because of collinearity Logistic regression Number of obs = 74 LR chi2(3) = 43.00 Prob > chi2 = 0.0000 Log likelihood = -23.535653 Pseudo R2 = 0.4774 ------------------------------------------------------------------------------ foreign | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- mpg | .835845 .0783681 -1.91 0.056 .6955332 1.004462 turn | .6707935 .109641 -2.44 0.015 .4869229 .9240969 weight | .9976149 .0011761 -2.03 0.043 .9953125 .9999227 ------------------------------------------------------------------------------ . matrix list e(b) e(b)[1,4] mpg turn weight _cons y1 -.17931203 -.39929391 -.0023879 24.859261
Note that under version control, we have the old behavior of handling collinearity. Therefore, for authors who simply want to use the old behavior, the solution is as simple as adding version control to any command that checks collinearity.
For those who want their commands to use the features introduced in Stata 11, the following steps are adequate for most cases.
syntax varlist syntax varlist(ts) syntax varlist(numeric) syntax varlist(default=none)to
syntax varlist(fv) syntax varlist(ts fv) syntax varlist(numeric fv) syntax varlist(default=none fv)To check if factor variables were specified in the varlist, you may use the saved macro s(fvops) from syntax, which will be equal to “true” when factor variables are specified and empty otherwise.
syntax varlist(fv) local fvops = "`s(fvops)'" == "true" | _caller() >= 11 if `fvops' { // within this loop, you can expand the factor variable list, // create a local for version control and perform any other // steps needed for factor operated varlists }
. sysuse auto, clear (1978 Automobile Data) . local 0 "i(1/4).rep78" . syntax varlist(fv) . display "varlist = `varlist'" varlist = i(1 2 3 4).rep78You will usually want to work with an expanded list. If you would like to expand the varlist into all levels of the factor variables, you may use the fvexpand command. For example,
. fvexpand i(1/4).rep78 . return list macros: r(fvops) : "true" r(varlist) : "1b.rep78 2.rep78 3.rep78 4.rep78"This command will allow both the if and in operators. Thus if you have marked your estimation sample, it is good programming to only expand the factor variable to the levels that are included in your sample. Notice the macro r(fvops) is returned from the fvexpand command. Thus you could also use this command as a check if the variable list contains factor-operated variables.
. local list "L.mpg weight i.rep78" . fvrevar `list', list . return list macros: r(varlist) : "mpg weight rep78"Notice the list option does not create any temporary variables. A similar command to fvrevar, list is unopvarlist. Using the list created above, we have the following:
. unopvarlist `list' . return list macros: r(varlist) : "mpg weight rep78"
program prog1 syntax varlist(numeric fv) local fvops = "`s(fvops)'" == "true" | _caller() >= 11 if `fvops' { local vv: di "version " string(max(11,_caller())) ", missing: " } endThis sets the macro vv to contain the maximum of 11 or the current version number. Common commands that will need to be called under version control are as follows:
_rmcoll `varlist' if `touse'to
`vv' _rmcoll `varlist' if `touse'Please see step 8 for more information on the changes to the _rmcoll command. If you are not sure if passing the version number is needed, try running the command without adding the version number. You will normally see error messages similar to the following if this was needed:
. version 10 . regress mpg weight i.rep78 factor variables not allowed r(101); . regress mpg weight rep78#foreign interactions not allowed r(101); . mat colnames b = `coln' rep78#0b: operator invalid r(198);Once you see an error message similar to these, you can trace your code and see where the version control is needed.
cmd1 depvar [indepvars] [if] [in] [,options]and a user typed the following:
cmd1 i.y x1 x2You would want to give an error message if factor variables are not allowed in the dependent variable. Instead of creating your own error message, you can use the Stata’s default error message. Consider the following:
program cmd1, rclass syntax varlist(fv ts) [if] [in] , [*] local fvops = "`s(fvops)'" == "true" | _caller() >= 11 if `fvops' { local vv: di "version " /// string(max(11,_caller())) ", missing: " gettoken lhs rest : varlist _fv_check_depvar `lhs' } endThis will give the following error message:
. cmd1 i.y x1 x2 depvar may not be a factor variable r(198);If you have more than one dependent variable, you can use the k(#) option with _fv_check_depvar to specify how many variables to check.
sysuse auto, clear gen mpg2 = mpg+2 _rmcoll weight mpg mpg2 turnwill return the following:
note: mpg2 omitted because of collinearity . return list scalars: r(k_omitted) = 1 macros: r(varlist) : "weight mpg o.mpg2 turn"Because of this, you will no longer be able to use the column count of e(b) or e(V) to determine the model degrees of freedom. A solution that will often work is to use the rank of e(V). If the rank is not already returned as e(rank), you may use the undocumented command _post_vce_rank to do this.
sysuse auto, clear gen mpg2 = mpg+2 regress weight mpg mpg2 turn o.trunk local coln : colnames e(b) local count 0 foreach var of local coln { _ms_parse_parts `var' if !`r(omit)' { local list `list' `var' local ++count } } local list : subinstr local list "_cons" "" display "non-omitted variables are `list'" display "number of noon-omitted variables is `count'"_ms_parse_parts will also return other values that can be used to determine levels of factor variables and interactions, for instance.
sysuse auto, clear gen mpg2 = mpg+2 regress weight mpg mpg2 turn o.trunk _ms_omit_info e(b) return list mat list r(omit)If you only want to know if a matrix contains factor variables or omitted variables, you can use the command _ms_op_info, which will return the scalars r(fvops) and r(tsops) indicating whether the matrix contains factor variables and/or time-series operators in the column names.
program prog2 syntax varlist(fv) _ms_extract_varlist `varlist' local varlist `r(varlist)' noi di "varlist = `varlist'" endIf the specified levels are not contained in e(b), a similar error message as shown below will be given:
. _ms_extract_varlist 6.rep78 level 6 of factor rep78 not found in e(b) r(111);This command can replace loops where a manual check is done to see if the specified variable is contained in e(b) and/or e(V). This can be helpful even if factor variables are not specified.
. sysuse auto, clear (1978 Automobile Data) . generate mpg2 = mpg+2 . _rmcoll mpg mpg2 turn note: mpg2 omitted because of collinearity . return list scalars: r(k_omitted) = 1 macros: r(varlist) : "mpg o.mpg2 turn"In Stata 11 or newer versions of Stata, the sweep order for checking collinearity is stable, and the list returned by _rmcoll will match the results after estimation commands that use _rmcoll. This was not always true in previous versions of Stata. If you have factor-operated variables in your varlist, you must specify the expand() option along with _rmcoll for the returned list to be expanded into each level of the factors. Here is an example showing what you would obtain with and without this option:
. _rmcoll mpg mpg2 i.rep78 note: mpg2 omitted because of collinearity . return list scalars: r(k_omitted) = 2 macros: r(varlist) : "mpg o.mpg2 i(1 2 3 4 5)b1.rep78" . _rmcoll mpg mpg2 i.rep78, expand note: mpg2 omitted because of collinearity . return list scalars: r(k_omitted) = 2 macros: r(varlist) : "mpg o.mpg2 1b.rep78 2.rep78 3.rep78 4.rep78 5.rep78"
varname#0b: operator invalid r(198);
tempname b V noomit matrix `b' = e(b) matrix `V' = e(V) _ms_omit_info `b' local cols = colsof(`b') matrix `noomit' = J(1,`cols',1) - r(omit)We created a matrix named noomit which will mark columns that should be kept with the number one and zero otherwise. Now we can use the select() function in Mata to return the reduced matrix.
/* reduce matrix V */ mata: newV = select(st_matrix(st_local("V")),(st_matrix(st_local("noomit")))) mata: newV = select(newV, (st_matrix(st_local("noomit")))') mata: st_matrix(st_local("V"),newV) /* reduce matrix b */ mata: newB = select(st_matrix(st_local("b")),(st_matrix(st_local("noomit")))) mata: st_matrix(st_local("b"),newB)The function st_matrix() was used to overwrite the temporary matrices b and V with the reduced matrices obtained in Mata. If you want the column names to still be in place on the reduced matrix, you will need to reassign the names. Now that we have the reduced matrices, we can perform computations that require the reduced form.