[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: RE: row mean (mean across columns)

From   "Nick Cox" <>
To   <>
Subject   st: RE: row mean (mean across columns)
Date   Thu, 9 Oct 2008 13:31:08 +0100

This takes some unravelling. 

To cut to my conclusion, Jacob has 

(a) introduced a bug into his code 
(b) identified its effects 
(c) found a way to reverse its effects. 

This he calls a "correction". He is then puzzled not to find an echo of
what he has done in my posting
<> and
asserts that the code therein must contain a bug. 

However, there is no reason either for me or for anyone else to echo
what he has done, so that puzzle evaporates. I am puzzled, however, that
Jacob was not more puzzled that he had to write such roundabout code for
his problem. 

I'll work through this under three headings: 

Jacob's bug 

How to get row means the way Jacob wants them 

A warning about * and _all in programs 

If none of that looks interesting, you should naturally bail out now.
But if just one or two of those headings looks interesting, you should
stick with me, because it's all one story.

Jacob's bug

Look at a subset of Jacob's code. Upstream of this, Jacob has just three
variables in his data, which we can abbreviate -x-, -z-, -w-. (Here and
elsewhere I make some small style edits without comment.) 

	tempvar RowTotalTooMuch
	gen `RowTotalTooMuch' = 0
 	quietly foreach var of varlist * {
 		replace `RowTotalTooMuch' = `RowTotalTooMuch' + `var'

What Jacob wants to do is get a row sum across his variables -x-, -z-,
-w-. He declares a temporary variable, initialises it to 0 and loops
across his variables adding each in turn. 

Is this correct code? No. 

The bug is that the wildcard * _includes_ the temporary variable that
Jacob has just created. That is, * unwraps to 

x z w `RowTotalTooMuch' 

As the loop passes over -x-, -z-, -w-, we get the row total Jacob wants,
but then we keep going and add the row total to itself. 

Jacob is also keeping track of how many variables he has: 

	tempvar RowTotalTooMuch
 	scalar ncols = 0
 	gen `RowTotalTooMuch' = 0 
 	quietly foreach var of varlist * {
 		replace `RowTotalTooMuch' =` RowTotalTooMuch' + `var'
 		scalar ncols = ncols + 1

Result of this bug: Jacob's row total is twice as bug as it should be,
and his count of variables is one more than it should be. Jacob is,
quite correctly, testing his code against a small numerical example he
has worked out independently, so he clearly worked out that he needed to
fix his code. Hence the corrections that follow: 

	scalar nOrigCols = ncols - 1
 	gen `rowtotal' = `RowTotalTooMuch' / 2
 	gen `1' = `rowtotal' / nOrigCols

As already stated, you need not do it that way. In fact this code is
fragile in the following sense: use the same procedure in any other
program in which there are any other temporary variables and it will be
wrong. It is also longer and more roundabout than required. 

How to get row means the way Jacob wants them

Let's back up at look at Jacob's problem directly. He wants row
(observation) means to be reported as missing whenever any value in a
row is missing. 

Here and there I sense some dogma in his comments with the implication
that row means are not defined if any value in the row is not missing.
After all, is a mean from -summarize- not defined if any single value
fed to it is missing? Is a set of means from -egen, mean()- not
definable in the same circumstance? What's so different about row means?

But set that on one side. Let's readily admit that in some problems you
may want to be purist and insist on complete observations. One recipe is
already apparent: 

gen rowmean = 0 
local ncols = 0 
quietly foreach v of var <varlist> { 
	replace rowmean = rowmean + `v' 
	local ncols = `ncols' + 1 
replace rowmean = rowmean / `ncols' 

If there are any missings in the varlist, they will result in the
working mean being replaced by missing, and that will never change once
it's happened in any observation. 
The tricky bit is ensuring that the varlist contains exactly what you

-egen- does not let you do what you want directly. You would need to do
something like this: 

egen rowmean = rowmean(<varlist>) 
egen rowmiss = rowmiss(<varlist>) 
replace rowmean = . if rowmiss > 0 

or the converse using -egen, rownonmiss()-, which is more awkward. 

Or you could do it like this: 

egen rowmean = rowmean(<varlist>) 
replace rowmean = . if missing(x, y, z) 

That's more efficient but also needs some work at automating when there
is a larger varlist. 

But Jacob wants an one-liner to do this and one answer is to write his
own -egen- function. Know that -egen- function -foo- will be defined
within _gfoo.ado which must be along your adopath. In this case, Jacob
should start from _growmean.ado. 

Here is an example, minimally changed: 

*! _gjrowmean 1.0.0 NJC 9 Oct 2008 
*! _growmean 1.0.0  04oct2004
program define _gjrowmean
	version 6, missing
	gettoken type 0 : 0
	gettoken h    0 : 0 
	gettoken eqs  0 : 0

	syntax varlist(min=1) [if] [in] [, BY(string)]
	if `"`by'"' != "" {
		_egennoby rowmean() `"`by'"'

	tempvar NOBS touse g
	mark `touse' `if' `in'
	quietly { 
		gen double `g' = 0 if `touse'
		gen long `NOBS' = 0 if `touse'
		tokenize `varlist'
		while "`1'"!="" {
			replace `g' = `g' + `1' if `touse'
			replace `NOBS' = `NOBS' + (`1'<.) if `touse'
			mac shift 
		gen `type' `h' = `g'/`NOBS' if `touse'

What did I change? There are two different lines of any consequence: 

program define _gjrowmean

			replace `g' = `g' + `1' if `touse'

The code could be made more efficient: 

*! _gjrowmean2 1.0.0 NJC 9 Oct 2008 
*! _growmean 1.0.0  04oct2004
program define _gjrowmean2
	version 6, missing
	gettoken type 0 : 0
	gettoken h    0 : 0 
	gettoken eqs  0 : 0

	syntax varlist(min=1) [if] [in] [, BY(string)]
	if `"`by'"' != "" {
		_egennoby rowmean() `"`by'"'

	tempvar touse g
	mark `touse' `if' `in'
	quietly { 
		gen double `g' = 0 if `touse'
		foreach v of local varlist {
			replace `g' = `g' + `v' if `touse'
		local NOBS : word count `varlist' 
		gen `type' `h' = `g'/`NOBS' if `touse'

Despite the -version- statement, you need version 7 for that to work. 

A warning about * and _all in programs

Jacob also was puzzled about what happens with -egen, rowmean(*)- and
the difference between that and results from -egen, rowmean(_all)-. 

Eva Poen put her finger on the difference, and her comment can be
expanded. When you fire up -egen-, Stata has to deliver on its promise
that it preserves the sort order of your data. What Stata does to ensure
that is, as far as the user is concerned, equivalent to doing this as
the start of your program 

tempvar _sortindex
gen long `_sortindex' = _n 

and this at the end 

sort `_sortindex' 

together with ensuring that Stata does not change its mind about what
(if anything) it's sorted by. 

Thus, at a minimum, * _within_ -egen- will include one temporary
variable created on your behalf. 

That should also be true of _all, except that some years ago I persuaded
StataCorp that no-one would intend that _all included `_sortindex' --
indeed very few Stata users even know that it ever exists. So, there is
work-around code within -egen- to take that out of the varlist implied
by _all. 

However, both I and StataCorp failed to see at the time that the same
should be done for *. Hence, Jacob has tickled a bug in Stata, as Eva

But there's a larger lesson. * and _all within programs (and also
interactively) will include, necessarily, any temporary variables that
exist. So, looping across varlists defined by such wildcards is a source
of bugs. Arguably, they are yours, not Stata's. 


Jacob Wegelin

Given any dataset of all numeric variables, I want to generate a new
variable called myMean, which is the arithmetic mean (the average)
across all the variables. The program below solves this problem. But
surely there is a one-line command that will perform this task in Stata?

The post
appears to contain a bug, in the sense that the row total computed is
not corrected as in my code below.

This should be done in a general manner:

(1) As in the current dataset, the variables will not necessarily be in
a form like a1 to a100.

(2) The number of variables is arbitrary, so I cannot hard-code the
denominator as when myMeanByHand is computed below.

(3) If any value in a row is missing (.), the mean computed must also be
missing, since then the mean across all variables is not defined. (Thus
egen rowtotal is not the answer.)

Here is the code:

/* Generate a toy dataset */
set obs 5
gen x= _n
gen zoo = 20-x
gen whiskey=(_N - x) ^ 2
replace x = . in 2
/* First compute "by hand" with hard-coded denominator and variable
names */
gen myMeanByHand= (x + zoo + whiskey ) / 3
sort x
save tmp, replace
drop myMeanByHand

capture program drop computeMeanAcrossColumns
program define computeMeanAcrossColumns
 	/* Compute arithmetic mean across all columns */
 	tempvar RowTotalTooMuch
 	tempvar rowtotal
 	scalar ncols=0
 	gen `RowTotalTooMuch'=0
 	foreach var of varlist * {
 		quietly replace `RowTotalTooMuch'=`RowTotalTooMuch' +
 		scalar ncols=ncols + 1
 	scalar nOrigCols=ncols-1
 	gen `rowtotal' = `RowTotalTooMuch' / 2
 	gen `1'= `rowtotal' / nOrigCols

computeMeanAcrossColumns "myMean"

/* Check myMean against myMeanByHand */
sort x 
merge x using tmp
assert _merge==3
drop _merge
assert myMean==myMeanByHand
drop myMeanByHand

/* An illustration with egen rowmean */

keep x zoo whiskey
/* The following works for rows with no missing values. It gives a
misleading answer for a row that contains a missing value, since the
average in that row is not defined. */
egen junk=rowmean(_all) 
drop junk

/* A related question: The following gives an incorrect answer. What in
the world is it doing? */
egen junk=rowmean(*) 

Thanks for any insights

*   For searches and help try:

© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index