Home  /  Resources & support  /  FAQs  /  How to add factor-variable support to a command

How do I modify an ado-file created for previous versions of Stata to support factor variables and the collinearity behavior introduced in Stata 11?

Title   How to add factor-variable support to a command
Author Theresa Boswell, StataCorp

Question:

Why is my command not working with factor-variable syntax and/or collinear variables in Stata 11?

Answer:

As of Stata 11, variables are no longer dropped because of collinearity. Instead, these variables are omitted and are labeled with an “o.” operator in the column names of the resulting parameter vector. Also factor variables are supported in most official Stata commands, and community-contributed commands will need to understand this new syntax for factor variables to be accepted.

Various steps are required to allow factor variables to be specified with a command and to correctly handle the new collinearity behavior. The difficulty in doing so will depend on the type of assumptions that are made with the current command’s code. Below we provide many of the basic situations and their solutions.

First, here is a list of programmer’s commands you will find helpful for supporting factor variables and adding the new collinear behavior to your commands:

To familiarize yourself with the new factor-variable syntax, review the factor-variable manual entry in [U] 11.4.3 Factor variables. For information on the new collinearity behavior, review the manual entry [P] _rmcoll.

Once you are familiar with these new changes, you are ready to begin altering your ado-file so that your commands will work in Stata 11 or newer versions and make use of these new additions.

Note: If you do not want to allow factor variables with your command and want the old collinearity behavior, you can use version control (version < 11) on estimation commands so that the old behavior is obtained. For example,

. sysuse auto
(1978 Automobile Data)

. generate weight2 = weight + 2

. logistic foreign mpg turn weight weight2

note: weight2 omitted because of collinearity

Logistic regression                               Number of obs   =         74
                                                  LR chi2(3)      =      43.00
                                                  Prob > chi2     =     0.0000
Log likelihood = -23.535653                       Pseudo R2       =     0.4774

------------------------------------------------------------------------------
     foreign | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |    .835845   .0783689    -1.91   0.056     .6955319    1.004464
        turn |   .6707935   .1096432    -2.44   0.015     .4869198    .9241028
      weight |   .9976149   .0011761    -2.03   0.043     .9953125    .9999228
     weight2 |  (omitted)
------------------------------------------------------------------------------

. matrix list e(b)

e(b)[1,5]
       foreign:    foreign:    foreign:    foreign:    foreign:
                                                 o.            
           mpg        turn      weight     weight2       _cons
y1  -.17931203  -.39929391   -.0023879           0   24.859261

. version 10.1: logistic foreign mpg turn weight weight2

note: weight2 dropped because of collinearity

Logistic regression                               Number of obs   =         74
                                                  LR chi2(3)      =      43.00
                                                  Prob > chi2     =     0.0000
Log likelihood = -23.535653                       Pseudo R2       =     0.4774

------------------------------------------------------------------------------
     foreign | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |    .835845   .0783681    -1.91   0.056     .6955332    1.004462
        turn |   .6707935    .109641    -2.44   0.015     .4869229    .9240969
      weight |   .9976149   .0011761    -2.03   0.043     .9953125    .9999227
------------------------------------------------------------------------------

. matrix list e(b)

e(b)[1,4]
           mpg        turn      weight       _cons
y1  -.17931203  -.39929391   -.0023879   24.859261

Note that under version control, we have the old behavior of handling collinearity. Therefore, for authors who simply want to use the old behavior, the solution is as simple as adding version control to any command that checks collinearity.

For those who want their commands to use the features introduced in Stata 11, the following steps are adequate for most cases.

  1. If syntax varlist() is used and factor variables are to be allowed in the varlist, then add fv to the varlist criteria. For example, change
    syntax varlist
    syntax varlist(ts)
    syntax varlist(numeric) 
    syntax varlist(default=none)
    
    to
    syntax varlist(fv)
    syntax varlist(ts fv)
    syntax varlist(numeric fv) 
    syntax varlist(default=none fv)
    
    To check if factor variables were specified in the varlist, you may use the saved macro s(fvops) from syntax, which will be equal to “true” when factor variables are specified and empty otherwise.

    Note: The macro s(tsops) is similarly set for time-series operators.

    You will probably want to include code similar to the following after calling syntax:
    syntax varlist(fv)
    local fvops = "`s(fvops)'" == "true" | _caller() >= 11
    if `fvops' {
    	// within this loop, you can expand the factor variable list,
    	// create a local for version control and perform any other
    	// steps needed for factor operated varlists
    }
    
  2. syntax varlist(fv) will return the factor variables in a level-specific but unexpanded list. For example,
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . local 0 "i(1/4).rep78"
    
    . syntax varlist(fv)
    
    . display "varlist = `varlist'"
    varlist = i(1 2 3 4).rep78
    
    You will usually want to work with an expanded list. If you would like to expand the varlist into all levels of the factor variables, you may use the fvexpand command. For example,
    . fvexpand i(1/4).rep78
    
    . return list
    
    macros:
                  r(fvops) : "true"
                r(varlist) : "1b.rep78 2.rep78 3.rep78 4.rep78"
    
    This command will allow both the if and in operators. Thus if you have marked your estimation sample, it is good programming to only expand the factor variable to the levels that are included in your sample. Notice the macro r(fvops) is returned from the fvexpand command. Thus you could also use this command as a check if the variable list contains factor-operated variables.
  3. Common commands that are used with time-series operators that will need to be changed for factor variables are as follows:
    tsrevar -> fvrevar
    tsunab -> fvunab
    Note fvrevar will create temporary variables for all levels of the factors. This will take up similar memory as the xi: prefix required and is not efficient if not needed. You will notice that most commands now accept factor variables; thus you usuallly do not need to create temporary variables. However, this is the most direct method of allowing the factor-variable syntax in any command.

    If you need to replace tsrevar with fvrevar but only want to create temporary variables for time-series–operated variables, you can use the tsonly option. Also, if you want to obtain a list of unoperated variables (base variables), then you can use the list option. For example,
    . local list "L.mpg weight i.rep78"
    
    . fvrevar `list', list
    
    . return list
    
    macros:
                r(varlist) : "mpg weight rep78"
    
    Notice the list option does not create any temporary variables. A similar command to fvrevar, list is unopvarlist. Using the list created above, we have the following:
    . unopvarlist `list'
    
    . return list
    
    macros:
                r(varlist) : "mpg weight rep78"
    
  4. Many commands will need to be called under version 11 or greater when factor variables are specified if the command sets the version lower at the beginning. One solution that works in most situations is to check if the varlist contains factor variables and then set a macro containing the version number as follows:
    	program prog1
    		syntax varlist(numeric fv) 
    		local fvops = "`s(fvops)'" == "true" | _caller() >= 11 
    		if `fvops' { 
    			local vv: di "version " string(max(11,_caller())) ", missing: " 
    		} 
    	end 
    
    This sets the macro vv to contain the maximum of 11 or the current version number. Common commands that will need to be called under version control are as follows:
    • _rmcoll
    • _rmdcoll
    • matrix colnames
    • matrix rownames
    • _vce_parserun
    • makecns
    • ml
    • Any other estimation command called within your program
    You will need to pass this version number to the commands if you set the version to a value less than 11 at the beginning of your program. Once you have set a local vv to be equal to “version # ”, you can then change commands as follows:
    _rmcoll `varlist' if `touse'
    
    to
    	
    `vv' _rmcoll `varlist' if `touse'
    
    Please see step 8 for more information on the changes to the _rmcoll command. If you are not sure if passing the version number is needed, try running the command without adding the version number. You will normally see error messages similar to the following if this was needed:
    . version 10
    
    . regress mpg weight i.rep78
    factor variables not allowed
    r(101);
    
    . regress mpg weight rep78#foreign
    interactions not allowed
    r(101);
    
    . mat colnames b = `coln'
    rep78#0b:  operator invalid
    r(198);
    
    Once you see an error message similar to these, you can trace your code and see where the version control is needed.
  5. Usually, you will not want the dependent variable to contain factor-operated variables. To check that the dependent variable(s) do not contain factor variables, use the _fv_check_depvar command. Suppose that your command is called cmd1 and accepts the syntax
    cmd1 depvar [indepvars] [if] [in] [,options]
    
    and a user typed the following:
    cmd1 i.y x1 x2
    
    You would want to give an error message if factor variables are not allowed in the dependent variable. Instead of creating your own error message, you can use the Stata’s default error message. Consider the following:
    program cmd1, rclass
    	syntax varlist(fv ts) [if] [in] , [*]
    	local fvops = "`s(fvops)'" == "true" | _caller() >= 11 
    	if `fvops' { 
    		local vv: di "version " ///
    		string(max(11,_caller())) ", missing: " 
    		gettoken lhs rest : varlist
    		_fv_check_depvar `lhs'
    	} 
    end
    
    This will give the following error message:
    . cmd1 i.y x1 x2
    depvar may not be a factor variable
    r(198);
    
    If you have more than one dependent variable, you can use the k(#) option with _fv_check_depvar to specify how many variables to check.
  6. Collinear variables are no longer dropped from the estimation results. Instead, omitted collinear variables are marked with the “o.” operator. For example,
    sysuse auto, clear
    gen mpg2 = mpg+2
    _rmcoll weight mpg mpg2 turn
    
    will return the following:
    note: mpg2 omitted because of collinearity
    
    . return list
    
    scalars:
              r(k_omitted) =  1
    
    macros:
                r(varlist) : "weight mpg o.mpg2 turn"
    
    Because of this, you will no longer be able to use the column count of e(b) or e(V) to determine the model degrees of freedom. A solution that will often work is to use the rank of e(V). If the rank is not already returned as e(rank), you may use the undocumented command _post_vce_rank to do this.

    Another solution is to loop over the elements in e(b) and check the column names for omitted variables. The command _ms_parse_parts will help with this. Here is an example:
    sysuse auto, clear
    gen mpg2 = mpg+2
    regress weight mpg mpg2 turn o.trunk
    local coln : colnames e(b)
    local count 0
    foreach var of local coln {
    	_ms_parse_parts `var'
    	if !`r(omit)' {
    		local list `list' `var'
    		local ++count
    	}	
    }
    local list : subinstr local list "_cons" ""
    display "non-omitted variables are `list'"
    display "number of noon-omitted variables is `count'"
    
    _ms_parse_parts will also return other values that can be used to determine levels of factor variables and interactions, for instance.

    The command _ms_omit_info is useful if you have a matrix containing descriptive column names and you want to know the number of omitted columns and/or you want to obtain a matrix containing 0/1 indicators for omitted columns. For example,
    sysuse auto, clear
    gen mpg2 = mpg+2
    regress weight mpg mpg2 turn o.trunk
    _ms_omit_info e(b)
    return list
    mat list r(omit)
    
    If you only want to know if a matrix contains factor variables or omitted variables, you can use the command _ms_op_info, which will return the scalars r(fvops) and r(tsops) indicating whether the matrix contains factor variables and/or time-series operators in the column names.
  7. Say you are working on a postestimation command and would like to allow a user to specify levels of the factor variable in the varlist. If you add fv to varlist() in the syntax statement, a base level will be assumed and the varlist may not match the names in e(b) and e(V). To get around this, you can use the _ms_extract_varlist command to check that the specified levels are in e(b). For example,
    program prog2
    	syntax varlist(fv)
    	_ms_extract_varlist `varlist'
    	local varlist `r(varlist)'
    	noi di "varlist = `varlist'"
    end
    
    If the specified levels are not contained in e(b), a similar error message as shown below will be given:
    . _ms_extract_varlist 6.rep78
    level 6 of factor rep78 not found in e(b)
    r(111);
    
    This command can replace loops where a manual check is done to see if the specified variable is contained in e(b) and/or e(V). This can be helpful even if factor variables are not specified.
  8. Because Stata no longer removes collinear variables from e(b) and e(V), you will want to begin looking at the column names after estimation commands to check if variables were omitted. The command _rmcoll returns a list of the variables with collinear variables marked with the omitted “o.” notation. Here is an example:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . generate mpg2 = mpg+2
    
    . _rmcoll mpg mpg2 turn
    note: mpg2 omitted because of collinearity
    
    . return list
    
    scalars:
              r(k_omitted) =  1
    
    macros:
                r(varlist) : "mpg o.mpg2 turn"
    
    In Stata 11 or newer versions of Stata, the sweep order for checking collinearity is stable, and the list returned by _rmcoll will match the results after estimation commands that use _rmcoll. This was not always true in previous versions of Stata. If you have factor-operated variables in your varlist, you must specify the expand() option along with _rmcoll for the returned list to be expanded into each level of the factors. Here is an example showing what you would obtain with and without this option:
    . _rmcoll mpg mpg2 i.rep78
    note: mpg2 omitted because of collinearity
    
    . return list
    
    scalars:
              r(k_omitted) =  2
    
    macros:
                r(varlist) : "mpg o.mpg2 i(1 2 3 4 5)b1.rep78"
    
    . _rmcoll mpg mpg2 i.rep78, expand
    note: mpg2 omitted because of collinearity
    
    . return list
    
    scalars:
              r(k_omitted) =  2
    
    macros:
                r(varlist) : "mpg o.mpg2 1b.rep78 2.rep78 3.rep78 4.rep78 5.rep78"
    
  9. If you call _rmcoll under version 11 or higher and obtain the full expanded list containing the omitted categories and variables, you can usually use this list to name the column stripes of your results. If you do not use version control for this and set the version lower than 11 at the beginning of your program, you will receive an error message stating that the operator is not valid:
    varname#0b:  operator invalid
    r(198);
    
  10. If you need to remove columns and rows corresponding to omitted variables or levels, then you can use the _ms_omit_info command, along with Mata. Keep in mind that _ms_omit_info needs the matrix to contain column names displaying if the variable was omitted. As an example, let’s assume we need to perform a Cholesky decomposition (requires the matrix to be full rank) on the matrix e(V). Let’s also assume we want to reduce e(b) to add to the example. First, we must create a copy of e(b) and e(V) and then run _ms_omit_info.
    tempname b V noomit
    matrix `b' = e(b)
    matrix `V' = e(V)
    _ms_omit_info `b'
    local cols = colsof(`b')
    matrix `noomit' =  J(1,`cols',1) - r(omit)
    
    We created a matrix named noomit which will mark columns that should be kept with the number one and zero otherwise. Now we can use the select() function in Mata to return the reduced matrix.
    /* reduce matrix V */
    mata: newV = select(st_matrix(st_local("V")),(st_matrix(st_local("noomit"))))
    mata: newV = select(newV, (st_matrix(st_local("noomit")))')
    mata: st_matrix(st_local("V"),newV)
    	
    /* reduce matrix b */
    mata: newB = select(st_matrix(st_local("b")),(st_matrix(st_local("noomit"))))
    mata: st_matrix(st_local("b"),newB)
    
    The function st_matrix() was used to overwrite the temporary matrices b and V with the reduced matrices obtained in Mata. If you want the column names to still be in place on the reduced matrix, you will need to reassign the names. Now that we have the reduced matrices, we can perform computations that require the reduced form.