Statalist



st: RE: egen & sum()


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: egen & sum()
Date   Wed, 26 Nov 2008 10:08:26 -0000

-egen, sum()- is just the same as -egen, total()-. 

-egen, sum()- was cloned as -egen, total()- in Stata 9 for precisely the
reason you identify. 

As Svend Juul in particular pointed out in various very entertaining
talks in 2004, it was not a good idea to use -sum()- for cumulative or
running sums in one context and the same name for unqualified sums in
another. Although many experienced users had got used to this, it was
(quite understandably) sometimes puzzling to newer users. 

But -egen, sum()- remains there so as not to break old scripts or
habits. It is just undocumented. 

Various other -egen- functions were renamed at the same time, but
nothing should have broken anybody's code. 

-viewsource _gsum.ado- and -viewsource _gtotal.ado- let you see the
code. 

Another way to understand what's going on is to -set trace on- and see
what calls what. -egen- functions are totally transparent. 
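
A minimal sketch of both suggestions, using the same toy data as the
quoted example below. It assumes the installed Stata still ships
_gsum.ado alongside _gtotal.ado; the variable names and the
-set tracedepth- value are only illustrative choices.

```stata
* Look at the source files behind the two -egen- functions
viewsource _gsum.ado
viewsource _gtotal.ado

* Toy data: a = 1, 2, ..., 10
clear
set obs 10
generate a = _n

* Trace which programs -egen- calls for sum()
set tracedepth 2        // illustrative: limit how deep the trace goes
set trace on
egen b2 = sum(a)
set trace off
```

The trace output is long, so it is usually worth turning tracing off
again straight after the one command of interest.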

It's vital to understand that functions and -egen- functions are
completely separate beasts. -egen- functions are only understood by
-egen- and the only functions -egen- understands are -egen- functions.
That's two absolute rules. I've sometimes wondered whether -egen-
functions should have been called something different, but it's rather
late for that. 
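
The two rules can be sketched as follows. This is only an
illustration: the variable names are made up, and -capture noisily- is
used so that the two commands expected to fail show their error
message without stopping a do-file.

```stata
clear
set obs 10
generate a = _n

* sum() is an ordinary function: with -generate- it gives a running sum
generate running = sum(a)

* total() is an -egen- function: it gives the overall sum in every row
egen overall = total(a)

* Rule 1: -egen- functions are only understood by -egen-
capture noisily generate wrong1 = total(a)
display "return code from -generate- with total(): " _rc

* Rule 2: -egen- does not treat an ordinary function as an -egen- function
* (assumes no user-written _glog.ado is installed)
capture noisily egen wrong2 = log(a)
display "return code from -egen- with log(): " _rc
```

Both failing commands should leave a nonzero return code in _rc.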

Nick 
[email protected] 

Neil Shephard

I've been poring over someone else's Stata code trying to understand it
and have discovered what seems to be inconsistent or undocumented
behaviour.

The do-file has the following line...

egen b = sum(a)

This stood out to me as I thought the current version of -egen- uses the
-total()- function to obtain the combined (as opposed to running) sum of
a variable, so I checked the -man egen- and -man egenmore- pages and sure
enough there is no mention of -sum()- as an -egen- function.

-sum()- is however a [P] function and returns the running sum of the
specified variable.

Thus I would have expected -egen b = sum(a)- to return the running sum
of a, but this is not the case: it behaves as though -sum()- is a
synonym for -total()-, as the following example demonstrates...

. clear

. set obs 10
obs was 0, now 10

. gen a = _n

. gen b = sum(a)

. egen b2 = sum(a)

. egen b3 = total(a)

. list

     +-------------------+
     |  a    b   b2   b3 |
     |-------------------|
  1. |  1    1   55   55 |
  2. |  2    3   55   55 |
  3. |  3    6   55   55 |
  4. |  4   10   55   55 |
  5. |  5   15   55   55 |
     |-------------------|
  6. |  6   21   55   55 |
  7. |  7   28   55   55 |
  8. |  8   36   55   55 |
  9. |  9   45   55   55 |
 10. | 10   55   55   55 |
     +-------------------+

Based on the help pages and documentation I would have expected b2 == b,
as -egen b2 = sum(a)- should be treating -sum()- as described in the
-man sum()- page.  Indeed, even the third example in -man egen- shows
that -sum()- should be used with -generate- and -total()- should be used
with -egen-.

Is there any historical legacy that anyone is aware of -sum()- being a
valid -egen- function that may be lingering around causing this
behaviour?

Should the behaviour or the documentation be modified?

Or have I completely misunderstood things?



