
From: wgould@stata.com (William Gould, Stata)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Missing values in Mata functions
Date: Thu, 08 Sep 2005 11:38:38 -0500

Ben Jann <ben.jann@soz.gess.ethz.ch> asked

> I have a general question about programming Mata functions:
> How should we deal with missing values?
>
> Up to now, my strategy simply was not to allow missing values.
> That is:
>
> : void myfunction(x) {
> >   if (missing(x)) _error(3351)
> > }
> : myfunction(.)
>          myfunction():  3351  argument has missing values
>              <istmt>:     -  function returned error
> r(3351);

I do not think -if (missing(x)) _error(3351)- is the right way to go in
general.  Here's the current StataCorp theory on missing values:

1.  Mata functions and missing values
-------------------------------------

Let's divide numeric functions into three categories,

    (M)  Mathematical functions
    (S)  Statistical functions
    (U)  Utility functions

Examples of (M) include sqrt(), cholsolve(), eigensystem(), etc.

Examples of (S) include mean(), cross(), etc.

Examples of (U) include st_view(), st_dropvar(), fwrite(), etc.

The distinction between (M) and (S) is a subtle one.  For our purposes, we
will define as in (S) any function that accepts raw data (e.g., a data
matrix) as input.  Any function that is used in statistics but accepts
summary statistics (values calculated from raw data) we will classify as (M).
In this classification, normal() is a mathematical function, not a
statistical function.

2.  Treatment of missing values by (M)
--------------------------------------

In general, one lets the calculation go through and lets Mata handle the
missing-value issues.  For instance, try this in Mata:

    : (1,2\3,.) * (4,5\6,7)

The result will be that the second row is missing, but the first row is
nonmissing.

It is seldom necessary to code any special action for missing values, but if
it is, the action of the function should be to return a result of missing
values of the appropriate dimension.  Try this in Mata,

    : eigenvalues((1,2\3,.))

The result will be 1 x 2, and each element will be missing.
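The two cases just described can be sketched as follows.  This is only an illustration, not StataCorp code: the function names mylogodds() and myinverse() are hypothetical, and the check in myinverse() is there to show the pattern, since luinv() may well deal with missing values on its own.  The definitions are meant to be entered inside mata: ... end.

```mata
// Hypothetical category (M) function: elementwise log odds of p.
// No special action is coded.  ln() and the arithmetic operators return
// missing for missing input, so a missing element of p simply comes back
// as a missing element of the result.
real matrix mylogodds(real matrix p)
{
    return(ln(p) - ln(1 :- p))
}

// Hypothetical category (M) function with a special action coded:
// if any element of A is missing, return missing values of the
// appropriate dimension rather than aborting with an error.
real matrix myinverse(real matrix A)
{
    if (missing(A)) return(J(rows(A), cols(A), .))
    return(luinv(A))
}
```

With these definitions, mylogodds((.5, .)) should return a 1 x 2 result whose second element is missing, and myinverse((1,2\3,.)) should return a 2 x 2 matrix of missing values; in neither case does the caller see an error.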
As an aside, it is of great importance that category (M) functions be
tolerant of missing values, because they often are used with calculated
results, which may themselves have become missing because of numerical
problems or extreme conditions.

3.  Treatment of missing values by (U)
--------------------------------------

Utility functions perform a service, and the numeric values passed to them
usually index something, such as drop variable i, or read from file j.  These
functions should abort on missing value unless they assign a special meaning
to it, such as "drop variable i" meaning drop all variables if i is missing.

4.  Treatment of missing values by (S)
--------------------------------------

These functions act in one of three ways,

    1.  They act the same as (M).

    2.  They disallow missing values.

    3.  They perform casewise deletion of missing values.

All of that is to say, pretty much anything goes.

(3), however, is considered the exception, and StataCorp policy is that, in
coding, DEPENDING ON (as opposed to introducing) behavior (3) in the use of a
function is considered sloppy, barely acceptable programming.  Behavior (3)
was introduced on a few functions (namely -cross()- and -mean()-) because we
knew that some users would not be as careful as we intended to be.

The problem with missing values in data calculations is the multitude of
ways they could be dealt with.  Usually, the easiest (and therefore most
natural) is to engage in casewise deletion.  The problem is, the data matrix
SUPERX is often split into separate matrices X, Y, and Z, and sometimes only
a subset of SUPERX is passed to subfunctions.  There is simply no way the
subfunctions can know which observations to exclude.

For instance, pretend you are writing a program to calculate

    invsym(X'X)*X'y

You write a lovely program, including a part that reports results.  You
decide to report the means of the RHS variables, so you write a subfunction
reportmeans(X).
Problem is, if missing values are automatically excluded, reportmeans(X) does
not have enough information to exclude the appropriate observations, because
y might be missing and X not.

If you think this is a silly example, think again.  Let's pretend you write
regress(X, y) to calculate invsym(X'X)*X'y, presumably in some numerically
stable way.  You go to the work of adding casewise deletion to regress().

Now, at a later date, you are writing another routine, to calculate a form of
multivariate regression of Y = (y1, y2) on X = (X1, X2).  In your normal,
sloppy way, you depend on automatic removal of missing values.  At one point
in the new routine, you need regress(X1, y1), so you code that.  You have
just introduced a hidden bug: the coefficients will not necessarily be
calculated on the same sample as the rest of your statistics.

As a programmer, the best thing to do is to avoid the problem:  Do not depend
on subfunctions handling missing values.

Moreover, excluding missing values is easy.  Category (S) functions, by
definition, accept data matrices as arguments.  Data are usually stored in
the Stata data set.  The functions to access those data, st_data() and
st_view(), make it easy to exclude missing values.  The best approach is to
create a (temporary) to-use variable in the dataset that marks the
observations to be used, and then use that variable in all calls to st_data()
and st_view().  If you set up all of your matrices in this way at the outset,
you will have no hidden bugs.

Another question
----------------

Ben also complained about an earlier question of his being ignored:

> This also raises another question I already posted some days
> ago on statalist: How can we quickly select a submatrix (e.g.
> all non-missing elements of a vector, all rows of a matrix
> that do not contain any missing values) in Mata?

The short answer is that Mata needs a high-speed -select()- function, and we
will write that.  In the meantime, what Ben was doing looked reasonable to
me.
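For readers on a later Mata release, here is a minimal sketch of the two selections Ben asks about.  It assumes a version of Mata that provides select(X, v), which did not exist when this was written, together with rowmissing().  The helper names completerows() and nonmissingof() are hypothetical, and the definitions are meant to be entered inside mata: ... end.

```mata
// Hypothetical helper: the rows of X that contain no missing values.
// rowmissing(X) returns, for each row, the number of missing elements,
// so the comparison yields a column vector of 0/1 flags, and select()
// keeps the rows flagged 1.
real matrix completerows(real matrix X)
{
    return(select(X, rowmissing(X) :== 0))
}

// Hypothetical helper: the nonmissing elements of a vector x.
// Every missing value sorts at or above ".", so x :< . flags the
// nonmissing elements.  The flag vector has the same orientation as x,
// which is how select() knows whether to pick rows or columns.
real vector nonmissingof(real vector x)
{
    return(select(x, x :< .))
}
```

For example, completerows((1,2\3,.\5,6)) should return (1,2\5,6), and nonmissingof((1,.,3)) should return (1,3).  Selecting in the caller like this, before anything is passed to subfunctions, is in keeping with the advice above not to depend on subfunctions handling missing values.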
-- Bill
wgould@stata.com
