Re: st: Missing values in Mata functions

From	[email protected] (William Gould, Stata)
To	[email protected]
Subject	Re: st: Missing values in Mata functions
Date	Thu, 08 Sep 2005 11:38:38 -0500
Ben Jann <[email protected]> asked

> I have a general question about programming Mata functions: 
> How should we deal with missing values? 
>
> Up to now, my strategy simply was not to allow missing values. 
> That is:
>
>  : void myfunction(x) {
>  >         if (missing(x)) _error(3351)
>  > }
>  : myfunction(.)
>              myfunction():  3351  argument has missing values
>                   <istmt>:     -  function returned error
>  r(3351);

I do not think -if (missing(x)) _error(3351)- is the right way to go in
general.  Here's the current StataCorp theory on missing values:


1.  Mata functions and missing values
-------------------------------------

Let's divide numeric functions into three categories, 

      (M)  Mathematical functions 

      (S)  Statistical functions

      (U)  Utility functions

Examples of (M) include sqrt(), cholsolve(), eigensystem(), etc.

Examples of (S) include mean(), cross(), etc.

Examples of (U) include st_view(), st_dropvar(), fwrite(), etc.

The distinction between (M) and (S) is a subtle one.  For our purposes, we
will define as in (S) any function that accepts raw data (e.g., a data matrix)
as input.  Any function that is used in statistics but accepts summary
statistics (values calculated from raw data) we will classify as (M).  In this
classification, normal() is a mathematical function, not a statistical
function.



2.  Treatment of missing values by (M)
--------------------------------------

In general, one lets the calculation go through and lets Mata handle 
the missing-value issues.  For instance, try this in Mata:

        : (1,2\3,.) * (4,5\6,7)

The result will be that the second row is missing, but the first row 
is nonmissing.

It is seldom necessary to code any special action for missing values, 
but if it is, the action of the function should be to return a result 
of missing values of the appropriate dimension.  Try this in Mata, 

	: eigenvalues((1,2\3,.))

The result will be 1 x 2, and each element will be a missing value.

As an aside, it is of great importance that category (M) functions be 
tolerant of missing values, because they often are used with calculated 
results, which may themselves have become missing because of numerical 
problems or extreme conditions. 
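
For the rare case where special action is needed, here is a minimal sketch
of the "return missing of the appropriate dimension" pattern.  The function
name mysolve() and its missing-value policy are assumptions made up for
illustration; lusolve(), missing(), and J() are built-in Mata functions.

```mata
mata:
// Hypothetical category (M) function: solve A*x = b for x.
real colvector mysolve(real matrix A, real colvector b)
{
    // Assumed policy for this sketch: if either input contains a
    // missing value, return a cols(A) x 1 result of missing values
    // rather than aborting with an error.
    if (missing(A) + missing(b) > 0) return(J(cols(A), 1, .))

    return(lusolve(A, b))
}
end
```

A caller can then pass calculated results straight through and check the
returned vector for missing values once, at the end.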


3.  Treatment of missing values by (U)
--------------------------------------

Utility functions perform a service, and the numeric values passed to them 
usually index something, as in "drop variable i" or "read from file j".

These functions should abort on missing value unless they assign a 
special meaning to it, such as "drop variable i" meaning drop all variables
if i is missing.
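
As a sketch of both cases, consider two small utilities.  getrow() and
getcols() are hypothetical names invented for this example; error code 3351
is the one used in Ben's example above.

```mata
mata:
// Hypothetical utility: return row i of X.
// A missing index has no sensible meaning here, so abort.
real rowvector getrow(real matrix X, real scalar i)
{
    if (missing(i)) _error(3351)
    return(X[i, .])
}

// Hypothetical utility: return column j of X.
// Here missing is assigned a special meaning: all columns.
real matrix getcols(real matrix X, real scalar j)
{
    if (missing(j)) return(X)
    return(X[., j])
}
end
```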


4.  Treatment of missing values by (S)
--------------------------------------

These functions act in one of three ways, 

     1.  They act the same as (M).

     2.  They disallow missing values.

     3.  They perform casewise deletion of missing values.

All of that is to say, pretty much anything goes.

(3), however, is considered the exception, and StataCorp policy is that, in
coding, DEPENDING ON (as opposed to introducing) behavior (3) in the use of a
function is considered sloppy, barely acceptable programming.  Behavior (3)
was introduced on a few functions (namely -cross()- and -mean()-) because we
knew that some users would not be as careful as we intended to be.

The problem with missing values in data calculations is the multitude of 
ways they could be dealt with.  Usually, the easiest (and therefore, 
most natural) is to engage in casewise deletion.  The problem is, the data
matrix SUPERX is often split into separate matrices X, Y, and Z and sometimes,
only a subset of SUPERX is passed to subfunctions.  There is simply no way
they can know which observations to exclude.  For instance, pretend you are
writing a program to calculate

                     invsym(X'X)*X'y

You write a lovely program, including a part that reports results.  You 
decide to report the means of the RHS variables, so you write a subfunction
reportmeans(X).  Problem is, if missing values are automatically excluded, 
reportmeans(X) does not have enough information to exclude the appropriate 
observations, because y might be missing and X not.  

If you think this is a silly example, think again.  Let's pretend you 
write 

		regress(X, y) 

to calculate invsym(X'X)*X'y, presumably in some numerically stable way.  You
go to the work of adding casewise deletion to regress().  Now, at a later date,
you are writing another routine, to calculate a form of multivariate
regression of Y = (y1, y2) on X = (X1, X2).  In your normal, sloppy way, you
depend on automatic removal of missing values.  At one point in the new
routine, you need regress(X1, y1), so you code that.  You have just introduced
a hidden bug:  the coefficients will not necessarily be calculated on the same
sample as the rest of your statistics.
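
The sample mismatch can be made concrete with a few made-up numbers.  The
data below are invented for illustration only; rowmissing() and sum() are
built-in Mata functions.

```mata
mata:
// Made-up data: y2 is missing in observation 2, X1 and y1 are complete.
X1 = (1 \ 2 \ 3 \ 4)
y1 = (2 \ 4 \ 6 \ 8)
y2 = (1 \ . \ 3 \ 4)

// Observations a casewise-deleting regress(X1, y1) would use
n_sub  = sum(rowmissing((X1, y1)) :== 0)

// Observations that are complete across everything the outer routine uses
n_full = sum(rowmissing((X1, y1, y2)) :== 0)

n_sub, n_full
end
```

Here n_sub is 4 and n_full is 3, so the coefficients from regress(X1, y1)
would be based on one more observation than the other statistics.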

As a programmer, the best thing to do is to avoid the problem:  Do not depend
on subfunctions handling missing values.  Moreover, excluding missing values
is easy.  Category (S) functions, by definition, accept data matrices as
arguments.  Data are usually stored in the Stata data set.  The functions to
access those data, st_data() and st_view(), make it easy to exclude missing
values.

The best approach is to create a (temporary) to-use variable in the dataset
that marks the observations to be used, and then use that variable in all calls
to st_data() and st_view().  If you set up all of your matrices in this way at
the outset, you will have no hidden bugs.
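
A minimal sketch of that approach follows.  The function name myreg() and
its argument names are assumptions for illustration; it relies on the
optional selectvar argument of st_view(), which selects the observations
for which the named variable is nonzero.

```mata
mata:
// Stata side (sketch, e.g. inside an ado-file):
//     marksample touse
//     mata: b = myreg("`depvar'", "`indepvars'", "`touse'")

// Hypothetical routine: invsym(X'X)*X'y on the marked sample.
real colvector myreg(string scalar yname, string scalar xnames, string scalar touse)
{
    real matrix    X
    real colvector y

    // Both views use the same selection variable, so y and X are
    // guaranteed to contain exactly the same observations.
    st_view(y, ., yname, touse)
    st_view(X, ., tokens(xnames), touse)

    // Anything else computed from X here, such as mean(X), is
    // automatically on the estimation sample.
    return(invsym(cross(X, X)) * cross(X, y))
}
end
```

Because the sample is fixed once, in the dataset, no subfunction has to
guess which observations to exclude.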



Another question
----------------

Ben also complained about an earlier question of his being ignored:

> This also raises another question I already posted some days 
> ago on statalist: How can we quickly select a submatrix (e.g.
> all non-missing elements of a vector, all rows of a matrix 
> that do not contain any missing values) in Mata?

The short answer is that Mata needs a high-speed -select()- function, 
and we will write that.  In the meantime, what Ben was doing looked 
reasonable to me.
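
For readers who did not see the earlier thread, here is one generic way to
keep only the complete rows with a plain loop.  This is a sketch, not a
reproduction of Ben's code; dropmissrows() is a hypothetical name.

```mata
mata:
// Hypothetical helper: return the rows of X that contain no missing values.
real matrix dropmissrows(real matrix X)
{
    real colvector keep
    real scalar    i, n

    keep = J(rows(X), 1, .)         // room for every row index
    n = 0
    for (i = 1; i <= rows(X); i++) {
        if (rowmissing(X[i, .]) == 0) {
            n = n + 1
            keep[n] = i             // remember this complete row
        }
    }
    if (n == 0) return(J(0, cols(X), .))
    return(X[keep[|1 \ n|], .])     // subscript with the kept indices
}
end
```

In Mata releases that include select(), the one-line form
select(X, rowmissing(X) :== 0) should do the same job.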


-- Bill
[email protected]