The Stata listserver

Re: st: Removing the limit to 31 variables from stata -impute- ado


From   jpitblado@stata.com (Jeff Pitblado, StataCorp LP)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Removing the limit to 31 variables from stata -impute- ado
Date   Thu, 13 Nov 2003 13:26:57 -0600

Renzo Comolli <renzo.comolli@yale.edu> asks about the limit on the number of
variables allowed by -impute-:

> I know this behavior is strictly "at my own risk". Anyway I (copied with a
> different name and) removed the limitation to 31 variables in the impute.ado
> It works with no waiting time at all even with 52 variables.
> I wonder whether StataCorp has been too risk averse when they now updated it
> from version 3.1 to version 8 of the ado.

> Anybody had similar experiences of removing the limitation?
> From the explanation in the manual of what -impute- does, it is possible
> that I could get away with so many variables because almost all of them
> where dummies and therefore easy to order. (counting the categorical
> variables before the dummy expansion I am way below 15)

The -impute- command fits its regressions using a best-subset approach, based
on the pattern of missing values in the predictors.  In the worst case,
-impute- must run a regression for each combination of the predictor
variables, depending upon the pattern of missingness.
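
As a rough illustration of the idea (a Python sketch, not the -impute- source;
the function name and the data layout are made up for this example), each
observation can be tagged with the set of predictors it has observed, and one
regression is then needed per distinct set:

```python
def missingness_patterns(rows):
    """Group observation indices by which predictors are non-missing.

    rows: list of tuples, one per observation; None marks a missing value.
    Returns a dict mapping a pattern (tuple of booleans, True = observed)
    to the list of observation indices that share that pattern.
    """
    groups = {}
    for i, row in enumerate(rows):
        pattern = tuple(value is not None for value in row)
        groups.setdefault(pattern, []).append(i)
    return groups


# Hypothetical data: three predictors (x1, x2, x3), four observations.
data = [
    (1.0, 2.0, 3.0),    # all observed
    (1.5, None, 2.5),   # x2 missing
    (0.5, None, 1.0),   # x2 missing
    (None, None, 4.0),  # only x3 observed
]
groups = missingness_patterns(data)
for pattern, obs in groups.items():
    print(pattern, obs)
```

Here only three distinct patterns occur, so three regressions would be needed
rather than the worst case of 2^3 = 8.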

In order to enumerate all best-subset combinations, -impute- looks at the 0's
and 1's in the binary representation of a long integer.  In Stata, a long
integer contains 32 bits, one of which is used for the sign.  Thus each of the
remaining 31 bits is used to identify whether to include a predictor variable
in a given regression, and increasing this limit beyond 31 will not have a
desirable result (even though the modified -impute- will not exit with an
error).

To illustrate how -impute- determines which variables to include in a
regression, suppose there are 3 predictors and that the pattern of missing
values among them requires a regression for each combination.  In this
worst-case scenario, there are 2^3 = 8 regressions to run.  We can determine
which predictors to include in a regression by looking at the binary
representation of the regression index (starting from 0):

                integer (base 10)       integer (binary)
                    0                        000
                    1                        001
                    2                        010
                    3                        011
                    4                        100
                    5                        101
                    6                        110
                    7                        111

If the names of the predictor variables are x1 x2 and x3, we can interpret the
binary number like this:

               x3             x2             x1    
               -------------------------------------
               <digit>        <digit>        <digit>

Thus 001 means include x1, 011 means include x1 and x2, ...
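
The bit-to-variable mapping described above can be sketched in a few lines of
Python.  This is only an illustration of the technique, not the code in
impute.ado; the function name predictors_for_index is hypothetical.

```python
def predictors_for_index(index, names):
    """Return the predictors whose bit is set in `index`.

    Bit 0 (the least significant bit) corresponds to names[0],
    bit 1 to names[1], and so on.
    """
    return [name for bit, name in enumerate(names) if (index >> bit) & 1]


names = ["x1", "x2", "x3"]
for index in range(2 ** len(names)):
    bits = format(index, "03b")  # zero-padded binary, x3 x2 x1 order
    print(index, bits, predictors_for_index(index, names))
```

Index 0 selects no predictors and index 7 selects all three, matching the
table of binary representations.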

Given this implementation, there has to be a limit on how many predictors
-impute- allows: beyond it, the generated -long- variable is automatically
-recast- to a -float- or -double-, which breaks the implementation.
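
To see why a floating-point storage type is a problem for a bit mask, here is
a small Python sketch.  It assumes the recast value behaves like a 4-byte IEEE
754 single-precision number, which carries only 24 significant bits; the
helper as_float32 is made up for this example and has nothing to do with the
-impute- source.

```python
import struct


def as_float32(n):
    """Round-trip an integer through 4-byte IEEE 754 single precision."""
    return int(struct.unpack("f", struct.pack("f", n))[0])


# Up to 2^24 every integer is represented exactly.
print(as_float32(2 ** 24) == 2 ** 24)

# Above that, low-order bits are rounded away, so two different
# predictor subsets can collapse into the same stored value.
print(as_float32(2 ** 24 + 1) == 2 ** 24 + 1)
print(as_float32(2 ** 24 + 1) == as_float32(2 ** 24))
```

The first comparison is true, the second is false, and the third is true: the
bit for x1 has been lost even though no error was raised.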

If the limit is increased, none of the variables beyond the first 31
(possibly fewer) will be used in any of the regressions.

One way to get around this limit would be to add an option to -impute-, say
-nomissings()-, that will take a varlist.  These variables will be assumed
missing-value-free so that they could be present in all regressions.
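
The effect of such an option on the bit budget can be sketched as follows.
This is a hypothetical Python illustration of the proposal only: the option
does not exist yet, and the function name and the choice of 30 complete
variables are assumptions made for the example.

```python
def bits_needed(predictors, nomissing=()):
    """Return (bits, worst_case_regressions) for a list of predictors.

    Predictors listed in `nomissing` are assumed complete, so they enter
    every regression and do not need a bit in the subset index.
    """
    always = set(nomissing)
    free = [p for p in predictors if p not in always]
    return len(free), 2 ** len(free)


# 52 predictors, 30 of which are declared free of missing values.
predictors = ["x%d" % i for i in range(1, 53)]
bits, worst_case = bits_needed(predictors, nomissing=predictors[:30])
print(bits, worst_case)
```

With 30 of the 52 predictors declared complete, only 22 bits are needed, which
fits within the 31 available.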

We will look into adding this as a future update.

--Jeff
jpitblado@stata.com