[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
"Nick Cox" <[email protected]> |

To |
<[email protected]> |

Subject |
st: RE: row mean (mean across columns) |

Date |
Thu, 9 Oct 2008 13:31:08 +0100 |

<> This takes some unravelling. To cut to my conclusion, Jacob has (a) introduced a bug into his code (b) identified its effects (c) found a way to reverse its effects. This he calls a "correction". He is then puzzled not to find an echo of what he has done in my posting <http://www.stata.com/statalist/archive/2008-09/msg00597.html> and asserts that the code therein must contain a bug. However, there is no reason either for me or for anyone else to echo what he has done, so that puzzle evaporates. I am puzzled, however, that Jacob was not more puzzled that he had to write such roundabout code for his problem. I'll work through this under three headings: Jacob's bug How to get row means the way Jacob wants them A warning about * and _all in programs If none of that looks interesting, you should naturally bail out now. But if just one or two of those headings looks interesting, you should stick with me, because it's all one story. Jacob's bug =========== Look at a subset of Jacob's code. Upstream of this, Jacob has just three variables in his data, which we can abbreviate -x-, -z-, -w-. (Here and elsewhere I make some small style edits without comment.) tempvar RowTotalTooMuch gen `RowTotalTooMuch' = 0 quietly foreach var of varlist * { replace `RowTotalTooMuch' = `RowTotalTooMuch' + `var' } What Jacob wants to do is get a row sum across his variables -x-, -z-, -w-. He declares a temporary variable, initialises it to 0 and loops across his variables adding each in turn. Is this correct code? No. The bug is that the wildcard * _includes_ the temporary variable that Jacob has just created. That is, * unwraps to x z w `RowTotalTooMuch' As the loop passes over -x-, -z-, -w-, we get the row total Jacob wants, but then we keep going and add the row total to itself. Jacob is also keeping track of how many variables he has: tempvar RowTotalTooMuch scalar ncols = 0 gen `RowTotalTooMuch' = 0 quietly foreach var of varlist * { replace `RowTotalTooMuch' =` RowTotalTooMuch' + `var' scalar ncols = ncols + 1 } Result of this bug: Jacob's row total is twice as bug as it should be, and his count of variables is one more than it should be. Jacob is, quite correctly, testing his code against a small numerical example he has worked out independently, so he clearly worked out that he needed to fix his code. Hence the corrections that follow: scalar nOrigCols = ncols - 1 gen `rowtotal' = `RowTotalTooMuch' / 2 gen `1' = `rowtotal' / nOrigCols As already stated, you need not do it that way. In fact this code is fragile in the following sense: use the same procedure in any other program in which there are any other temporary variables and it will be wrong. It is also longer and more roundabout than required. How to get row means the way Jacob wants them ============================================= Let's back up at look at Jacob's problem directly. He wants row (observation) means to be reported as missing whenever any value in a row is missing. Here and there I sense some dogma in his comments with the implication that row means are not defined if any value in the row is not missing. After all, is a mean from -summarize- not defined if any single value fed to it is missing? Is a set of means from -egen, mean()- not definable in the same circumstance? What's so different about row means? But set that on one side. Let's readily admit that in some problems you may want to be purist and insist on complete observations. One recipe is already apparent: gen rowmean = 0 local ncols = 0 quietly foreach v of var <varlist> { replace rowmean = rowmean + `v' local ncols = `ncols' + 1 } replace rowmean = rowmean / `ncols' If there are any missings in the varlist, they will result in the working mean being replaced by missing, and that will never change once it's happened in any observation. The tricky bit is ensuring that the varlist contains exactly what you want. -egen- does not let you do what you want directly. You would need to do something like this: egen rowmean = rowmean(<varlist>) egen rowmiss = rowmiss(<varlist>) replace rowmean = . if rowmiss > 0 or the converse using -egen, rownonmiss()-, which is more awkward. Or you could do it like this: egen rowmean = rowmean(<varlist>) replace rowmean = . if missing(x, y, z) That's more efficient but also needs some work at automating when there is a larger varlist. But Jacob wants an one-liner to do this and one answer is to write his own -egen- function. Know that -egen- function -foo- will be defined within _gfoo.ado which must be along your adopath. In this case, Jacob should start from _growmean.ado. Here is an example, minimally changed: *! _gjrowmean 1.0.0 NJC 9 Oct 2008 *! _growmean 1.0.0 04oct2004 program define _gjrowmean version 6, missing gettoken type 0 : 0 gettoken h 0 : 0 gettoken eqs 0 : 0 syntax varlist(min=1) [if] [in] [, BY(string)] if `"`by'"' != "" { _egennoby rowmean() `"`by'"' /* NOTREACHED */ } tempvar NOBS touse g mark `touse' `if' `in' quietly { gen double `g' = 0 if `touse' gen long `NOBS' = 0 if `touse' tokenize `varlist' while "`1'"!="" { replace `g' = `g' + `1' if `touse' replace `NOBS' = `NOBS' + (`1'<.) if `touse' mac shift } gen `type' `h' = `g'/`NOBS' if `touse' } end What did I change? There are two different lines of any consequence: program define _gjrowmean replace `g' = `g' + `1' if `touse' The code could be made more efficient: *! _gjrowmean2 1.0.0 NJC 9 Oct 2008 *! _growmean 1.0.0 04oct2004 program define _gjrowmean2 version 6, missing gettoken type 0 : 0 gettoken h 0 : 0 gettoken eqs 0 : 0 syntax varlist(min=1) [if] [in] [, BY(string)] if `"`by'"' != "" { _egennoby rowmean() `"`by'"' /* NOTREACHED */ } tempvar touse g mark `touse' `if' `in' quietly { gen double `g' = 0 if `touse' foreach v of local varlist { replace `g' = `g' + `v' if `touse' } local NOBS : word count `varlist' gen `type' `h' = `g'/`NOBS' if `touse' } end Despite the -version- statement, you need version 7 for that to work. A warning about * and _all in programs ====================================== Jacob also was puzzled about what happens with -egen, rowmean(*)- and the difference between that and results from -egen, rowmean(_all)-. Eva Poen put her finger on the difference, and her comment can be expanded. When you fire up -egen-, Stata has to deliver on its promise that it preserves the sort order of your data. What Stata does to ensure that is, as far as the user is concerned, equivalent to doing this as the start of your program tempvar _sortindex gen long `_sortindex' = _n and this at the end sort `_sortindex' together with ensuring that Stata does not change its mind about what (if anything) it's sorted by. Thus, at a minimum, * _within_ -egen- will include one temporary variable created on your behalf. That should also be true of _all, except that some years ago I persuaded StataCorp that no-one would intend that _all included `_sortindex' -- indeed very few Stata users even know that it ever exists. So, there is work-around code within -egen- to take that out of the varlist implied by _all. However, both I and StataCorp failed to see at the time that the same should be done for *. Hence, Jacob has tickled a bug in Stata, as Eva surmised. But there's a larger lesson. * and _all within programs (and also interactively) will include, necessarily, any temporary variables that exist. So, looping across varlists defined by such wildcards is a source of bugs. Arguably, they are yours, not Stata's. Nick [email protected] Jacob Wegelin Given any dataset of all numeric variables, I want to generate a new variable called myMean, which is the arithmetic mean (the average) across all the variables. The program below solves this problem. But surely there is a one-line command that will perform this task in Stata? The post http://www.stata.com/statalist/archive/2008-09/msg00597.html appears to contain a bug, in the sense that the row total computed is not corrected as in my code below. This should be done in a general manner: (1) As in the current dataset, the variables will not necessarily be in a form like a1 to a100. (2) The number of variables is arbitrary, so I cannot hard-code the denominator as when myMeanByHand is computed below. (3) If any value in a row is missing (.), the mean computed must also be missing, since then the mean across all variables is not defined. (Thus egen rowtotal is not the answer.) Here is the code: /* Generate a toy dataset */ clear set obs 5 gen x= _n gen zoo = 20-x gen whiskey=(_N - x) ^ 2 replace x = . in 2 /* First compute "by hand" with hard-coded denominator and variable names */ gen myMeanByHand= (x + zoo + whiskey ) / 3 sort x save tmp, replace list drop myMeanByHand capture program drop computeMeanAcrossColumns program define computeMeanAcrossColumns /* Compute arithmetic mean across all columns */ tempvar RowTotalTooMuch tempvar rowtotal scalar ncols=0 gen `RowTotalTooMuch'=0 foreach var of varlist * { quietly replace `RowTotalTooMuch'=`RowTotalTooMuch' + `var' scalar ncols=ncols + 1 } scalar nOrigCols=ncols-1 gen `rowtotal' = `RowTotalTooMuch' / 2 gen `1'= `rowtotal' / nOrigCols end computeMeanAcrossColumns "myMean" /* Check myMean against myMeanByHand */ sort x merge x using tmp assert _merge==3 drop _merge assert myMean==myMeanByHand drop myMeanByHand list /* An illustration with egen rowmean */ keep x zoo whiskey /* The following works for rows with no missing values. It gives a misleading answer for a row that contains a missing value, since the average in that row is not defined. */ egen junk=rowmean(_all) list drop junk /* A related question: The following gives an incorrect answer. What in the world is it doing? */ egen junk=rowmean(*) list Thanks for any insights * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: RE: row mean (mean across columns)***From:*"Eva Poen" <[email protected]>

**References**:**st: row mean (mean across columns)***From:*Jacob Wegelin <[email protected]>

- Prev by Date:
**RE: st: RE: row mean (mean across columns)** - Next by Date:
**st: RE: if not iffing in ado** - Previous by thread:
**Re: st: row mean (mean across columns)** - Next by thread:
**Re: st: RE: row mean (mean across columns)** - Index(es):

© Copyright 1996–2024 StataCorp LLC | Terms of use | Privacy | Contact us | What's new | Site index |