How do I modify an ado-file created for previous versions of Stata to support factor variables and the collinearity behavior introduced in Stata 11?
| Title |
|
How to add factor-variable support to a command |
| Author |
Theresa Boswell, StataCorp
|
| Date |
August 2009; minor revisions July 2011 |
Question:
Why is my command not working with factor-variable syntax and/or collinear
variables in Stata 11?
Answer:
As of Stata 11, variables are no longer dropped because of collinearity.
Instead, these variables are omitted and are labeled with an “o.”
operator in the column names of the resulting parameter vector. Also
factor variables are supported in most official Stata commands, and
user-written commands will need to understand this new syntax for
factor variables to be accepted.
Various steps are required to allow factor variables to be
specified with a command and to correctly handle the new collinearity
behavior. The difficulty in doing so will depend on the type of assumptions
that are made with the current command’s code. Below we provide many of the
basic situations and their solutions.
First, here is a list of programmer’s commands you will find helpful
for supporting factor variables and adding the new collinear behavior to your
commands:
To familiarize yourself with the new factor-variable syntax, see
help
fvvarlist and review the factor-variable manual entry in
[U] 11.4.3 Factor variables. For information on the new
collinearity behavior, see help _rmcoll
and review the manual entry [P] _rmcoll.
Once you are familiar with these new changes, you are ready to begin
altering your ado-file so that your commands will work in Stata 11
or newer versions and make use of these new additions.
Note: If you do not want to allow factor variables with your command and
want the old collinearity behavior, you can use version control (version <
11) on estimation commands so that the old behavior is obtained. For
example,
. sysuse auto
(1978 Automobile Data)
. generate weight2 = weight + 2
. logistic foreign mpg turn weight weight2
note: weight2 omitted because of collinearity
Logistic regression Number of obs = 74
LR chi2(3) = 43.00
Prob > chi2 = 0.0000
Log likelihood = -23.535653 Pseudo R2 = 0.4774
------------------------------------------------------------------------------
foreign | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | .835845 .0783689 -1.91 0.056 .6955319 1.004464
turn | .6707935 .1096432 -2.44 0.015 .4869198 .9241028
weight | .9976149 .0011761 -2.03 0.043 .9953125 .9999228
weight2 | (omitted)
------------------------------------------------------------------------------
. matrix list e(b)
e(b)[1,5]
foreign: foreign: foreign: foreign: foreign:
o.
mpg turn weight weight2 _cons
y1 -.17931203 -.39929391 -.0023879 0 24.859261
. version 10.1: logistic foreign mpg turn weight weight2
note: weight2 dropped because of collinearity
Logistic regression Number of obs = 74
LR chi2(3) = 43.00
Prob > chi2 = 0.0000
Log likelihood = -23.535653 Pseudo R2 = 0.4774
------------------------------------------------------------------------------
foreign | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | .835845 .0783681 -1.91 0.056 .6955332 1.004462
turn | .6707935 .109641 -2.44 0.015 .4869229 .9240969
weight | .9976149 .0011761 -2.03 0.043 .9953125 .9999227
------------------------------------------------------------------------------
. matrix list e(b)
e(b)[1,4]
mpg turn weight _cons
y1 -.17931203 -.39929391 -.0023879 24.859261
Note that under version control, we have the old behavior of handling
collinearity. Therefore, for authors who simply want to use the old
behavior, the solution is as simple as adding version control to any command
that checks collinearity.
For those who want their commands to use the features introduced in
Stata 11, the following steps are adequate for most cases.
- If syntax varlist() is used and factor variables are to be allowed in the
varlist, then add fv to the varlist criteria. For example, change
syntax varlist
syntax varlist(ts)
syntax varlist(numeric)
syntax varlist(default=none)
to
syntax varlist(fv)
syntax varlist(ts fv)
syntax varlist(numeric fv)
syntax varlist(default=none fv)
To check if factor variables were specified in the varlist, you may use the
saved macro s(fvops) from syntax, which will be equal to
“true” when factor variables are specified and empty otherwise.
Note: The macro s(tsops) is similarly set for time-series operators.
You will probably want to include code similar to the following after
calling syntax:
syntax varlist(fv)
local fvops = "`s(fvops)'" == "true" | _caller() >= 11
if `fvops' {
// within this loop, you can expand the factor variable list,
// create a local for version control and perform any other
// steps needed for factor operated varlists
}
- syntax varlist(fv) will return the factor variables in a
level-specific but unexpanded list. For example,
. sysuse auto, clear
(1978 Automobile Data)
. local 0 "i(1/4).rep78"
. syntax varlist(fv)
. display "varlist = `varlist'"
varlist = i(1 2 3 4).rep78
You will usually want to work with an expanded list. If you would
like to expand the varlist into all levels of the factor variables, you
may use the fvexpand
command. For example,
. fvexpand i(1/4).rep78
. return list
macros:
r(fvops) : "true"
r(varlist) : "1b.rep78 2.rep78 3.rep78 4.rep78"
This command will allow both the
if and
in
operators. Thus if you have marked your estimation sample, it is good
programming to only expand the factor variable to the levels that are
included in your sample. Notice the macro r(fvops) is returned
from the fvexpand command. Thus you could also use this command as a
check if the variable list contains factor-operated variables.
- Common commands that are used with time-series operators that will need to
be changed for factor variables are as follows:
tsrevar -> fvrevar
tsunab -> fvunab
Note fvrevar will create temporary variables for all levels of the
factors. This will take up similar memory as the xi: prefix required
and is not efficient if not needed. You will notice that most commands
now accept factor variables; thus you usuallly do not need to create temporary
variables. However, this is the most direct method of
allowing the factor-variable syntax in any command.
If you need to replace tsrevar with fvrevar but only want
to create temporary variables for time-series–operated variables, you can
use the tsonly option. Also, if you want to obtain a list of
unoperated variables (base variables), then you can use the list option.
For example,
. local list "L.mpg weight i.rep78"
. fvrevar `list', list
. return list
macros:
r(varlist) : "mpg weight rep78"
Notice the list option does not create any temporary variables. A
similar command to fvrevar, list is unopvarlist. Using the list
created above, we have the following:
. unopvarlist `list'
. return list
macros:
r(varlist) : "mpg weight rep78"
- Many commands will need to be called under version 11 or greater when
factor variables are specified if the command sets the version lower at
the beginning. One solution that works in most situations is to check if
the varlist contains factor variables and then set a macro containing the
version number as follows:
program prog1
syntax varlist(numeric fv)
local fvops = "`s(fvops)'" == "true" | _caller() >= 11
if `fvops' {
local vv: di "version " string(max(11,_caller())) ", missing: "
}
end
This sets the macro vv to contain the maximum of 11 or the current
version number. Common commands that will need to be called under version
control are as follows:
- _rmcoll
- _rmdcoll
- matrix colnames
- matrix rownames
- _vce_parserun
- makecns
- ml
- Any other estimation command called within your program
You will need to pass this version number to the commands if you set the
version to a value less than 11 at the beginning of your program. Once
you have set a local vv to be equal to “version # ”,
you can then change commands as follows:
_rmcoll `varlist' if `touse'
to
`vv' _rmcoll `varlist' if `touse'
Please see step 8 for more information on the changes to the _rmcoll
command. If you are not sure if passing the version number is needed, try
running the command without adding the version number. You will normally
see error messages similar to the following if this was needed:
. version 10
. regress mpg weight i.rep78
factor variables not allowed
r(101);
. regress mpg weight rep78#foreign
interactions not allowed
r(101);
. mat colnames b = `coln'
rep78#0b: operator invalid
r(198);
Once you see an error message similar to these, you can trace your code
and see where the version control is needed.
- Usually, you will not want the dependent variable to contain
factor-operated variables. To check that the dependent variable(s) do not
contain factor variables, use the _fv_check_depvar command. Suppose
that your command is called cmd1 and accepts the syntax
cmd1 depvar [indepvars] [if] [in] [,options]
and a user typed the following:
cmd1 i.y x1 x2
You would want to give an error message if factor variables are not
allowed in the dependent variable. Instead of creating your own error
message, you can use the Stata’s default error message. Consider the
following:
program cmd1, rclass
syntax varlist(fv ts) [if] [in] , [*]
local fvops = "`s(fvops)'" == "true" | _caller() >= 11
if `fvops' {
local vv: di "version " ///
string(max(11,_caller())) ", missing: "
gettoken lhs rest : varlist
_fv_check_depvar `lhs'
}
end
This will give the following error message:
. cmd1 i.y x1 x2
depvar may not be a factor variable
r(198);
If you have more than one dependent variable, you can use the k(#)
option with _fv_check_depvar to specify how many variables to check.
- Collinear variables are no longer dropped from the estimation results.
Instead, omitted collinear variables are marked with the
“o.” operator. For example,
sysuse auto, clear
gen mpg2 = mpg+2
_rmcoll weight mpg mpg2 turn
will return the following:
note: mpg2 omitted because of collinearity
. return list
scalars:
r(k_omitted) = 1
macros:
r(varlist) : "weight mpg o.mpg2 turn"
Because of this, you will no longer be able to use the column count of
e(b) or e(V) to determine the model degrees of freedom. A solution that
will often work is to use the rank of e(V). If the rank is not
already returned as e(rank), you may use the undocumented command
_post_vce_rank to do this.
Another solution is to loop over the elements in e(b) and check the
column names for omitted variables. The command _ms_parse_parts
will help with this. Here is an example:
sysuse auto, clear
gen mpg2 = mpg+2
regress weight mpg mpg2 turn o.trunk
local coln : colnames e(b)
local count 0
foreach var of local coln {
_ms_parse_parts `var'
if !`r(omit)' {
local list `list' `var'
local ++count
}
}
local list : subinstr local list "_cons" ""
display "non-omitted variables are `list'"
display "number of noon-omitted variables is `count'"
_ms_parse_parts will also return other values that can be used to
determine levels of factor variables and interactions, for instance.
The command _ms_omit_info is useful if you have a matrix
containing descriptive column names and you want to know the number of
omitted columns and/or you want to obtain a matrix containing 0/1
indicators for omitted columns. For example,
sysuse auto, clear
gen mpg2 = mpg+2
regress weight mpg mpg2 turn o.trunk
_ms_omit_info e(b)
return list
mat list r(omit)
If you only want to know if a matrix contains factor variables or omitted
variables, you can use the command _ms_op_info, which will return
the scalars r(fvops) and r(tsops) indicating whether the matrix contains
factor variables and/or time-series operators in the column names.
- Say you are working on a postestimation command and would like to
allow a user to specify levels of the factor variable in the varlist. If
you add fv to varlist() in the syntax
statement, a base level will be assumed and the varlist may not match the
names in e(b) and e(V). To get around this, you can use the
_ms_extract_varlist command to check that the specified levels are in
e(b). For example,
program prog2
syntax varlist(fv)
_ms_extract_varlist `varlist'
local varlist `r(varlist)'
noi di "varlist = `varlist'"
end
If the specified levels are not contained in e(b), a similar error
message as shown below will be given:
. _ms_extract_varlist 6.rep78
level 6 of factor rep78 not found in e(b)
r(111);
This command can replace loops where a manual check is done to see if the
specified variable is contained in e(b) and/or e(V). This can be helpful
even if factor variables are not specified.
- Because Stata no longer removes collinear variables from e(b) and e(V), you
will want to begin looking at the column names after estimation commands
to check if variables were omitted. The command _rmcoll returns a
list of the variables with collinear variables marked with the omitted
“o.” notation. Here is an example:
. sysuse auto, clear
(1978 Automobile Data)
. generate mpg2 = mpg+2
. _rmcoll mpg mpg2 turn
note: mpg2 omitted because of collinearity
. return list
scalars:
r(k_omitted) = 1
macros:
r(varlist) : "mpg o.mpg2 turn"
In Stata 11 or newer versions of Stata,
the sweep order for checking collinearity is stable, and
the list returned by _rmcoll will match the results after estimation
commands that use _rmcoll. This was not always true in previous
versions of Stata. If you have factor-operated variables in your varlist,
you must specify the expand() option along with _rmcoll for
the returned list to be expanded into each level of the factors. Here is
an example showing what you would obtain with and without this option:
. _rmcoll mpg mpg2 i.rep78
note: mpg2 omitted because of collinearity
. return list
scalars:
r(k_omitted) = 2
macros:
r(varlist) : "mpg o.mpg2 i(1 2 3 4 5)b1.rep78"
. _rmcoll mpg mpg2 i.rep78, expand
note: mpg2 omitted because of collinearity
. return list
scalars:
r(k_omitted) = 2
macros:
r(varlist) : "mpg o.mpg2 1b.rep78 2.rep78 3.rep78 4.rep78 5.rep78"
- If you call _rmcoll under version 11 or higher and obtain the full
expanded list containing the omitted categories and variables, you can
usually use this list to name the column stripes of your results.
If you do not use version control for this and set the version lower than
11 at the beginning of your program, you will receive an error message
stating that the operator is not valid:
varname#0b: operator invalid
r(198);
- If you need to remove columns and rows corresponding to
omitted variables or levels, then you can use the _ms_omit_info
command, along with Mata. Keep in mind that _ms_omit_info needs
the matrix to contain column names displaying if the variable was
omitted. As an example, let’s assume we need to perform a
Cholesky decomposition (requires the matrix to be full rank) on the
matrix e(V). Let’s also assume we want to reduce e(b) to add
to the example. First, we must create a copy of e(b) and e(V) and then
run _ms_omit_info.
tempname b V noomit
matrix `b' = e(b)
matrix `V' = e(V)
_ms_omit_info `b'
local cols = colsof(`b')
matrix `noomit' = J(1,`cols',1) - r(omit)
We created a matrix named noomit which will mark columns that should be
kept with the number one and zero otherwise. Now we can use the
select() function in Mata to return the reduced matrix.
/* reduce matrix V */
mata: newV = select(st_matrix(st_local("V")),(st_matrix(st_local("noomit"))))
mata: newV = select(newV, (st_matrix(st_local("noomit")))')
mata: st_matrix(st_local("V"),newV)
/* reduce matrix b */
mata: newB = select(st_matrix(st_local("b")),(st_matrix(st_local("noomit"))))
mata: st_matrix(st_local("b"),newB)
The function st_matrix() was used to overwrite the temporary
matrices b and V with the reduced matrices obtained in Mata. If you want
the column names to still be in place on the reduced matrix, you will
need to reassign the names. Now that we have the reduced matrices, we
can perform computations that require the reduced form.
|
FAQs
What's new?
Statistics
Data management
Graphics
Programming Stata
Mata
Resources
Internet capabilities
Stata for Windows
Stata for Unix
Stata for Mac
Technical support
|