
st: RE: Is there a "running count" command in Stata?

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	st: RE: Is there a "running count" command in Stata?
Date	Thu, 31 Aug 2006 16:45:43 +0100

There is an interesting underlying issue here: what 
exactly is "programming" in Stata? A precise
answer is that a program is whatever is defined 
by whatever follows a -program- statement. (There
is no circularity here, as program the English 
word and -program- the Stata command name belong to
metalanguage and language respectively.) 

OK, enough of that.  

The good news is that this can be done without
ever writing down the Stata command name -program-, 
so the answer is yes. 

The other news looks bad, but isn't so bad really. 
In fact, it is really good news. 

You can do this, but it requires a little more
Stata than you may want at this moment. However, the features
to be used are among the most Stataish of all
Stata features and are very, very useful. 

Using your second list of values (which differs 
slightly from your first) we have 

. l

     +------+
     |    x |
     |------|
  1. |  cd1 |
  2. |  cd2 |
  3. |  cd2 |
  4. |  cd3 |
  5. |  cd1 |
     |------|
  6. |  cd3 |
  7. |  cd4 |
  8. |  cd1 |
  9. |  cd5 |
 10. |  cd3 |
     +------+

We need to tag the first time any value 
occurs. That will need a -sort-, and because
of that we should keep a record of the current
sort order, not least because we will want
to return to it. That means 

. gen order = _n

If your dataset is really big, that should be 

. gen long order = _n

We sort into groups of -x- and ensure that 
within groups of -x- the original sort order 
is followed. Then we tag the very first occurrence 
of each value of -x-. This can all be telescoped into one
statement. 

. bysort x (order) : gen y = _n == 1

There is a FAQ on constructs like those on the right-hand 
side of the assignment:

FAQ     . . . . . . . . . . . . . . . . . . . . . . .  True and false in Stata
        2/03    What is true and false in Stata?
                http://www.stata.com/support/faqs/data/trueorfalse.html
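
As a small illustration of that point, here is a sketch in do-file
form. A true-or-false expression such as _n == 1 evaluates to 1 when
true and 0 when false, so it can be assigned directly. The three-row
dataset and the variable names -first- and -first2- are made up for
the comparison and are not part of the original problem.

```stata
* toy data, assumed only for this illustration
clear
input str3 x
cd1
cd2
cd2
end

gen long order = _n

* the logical expression itself yields 1 (true) or 0 (false)
bysort x (order) : gen byte first = _n == 1

* the same thing spelled out with cond()
by x : gen byte first2 = cond(_n == 1, 1, 0)

* the two versions should agree in every observation
assert first == first2
```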

Now -sort- back to the original order. Then we just need a running
sum of -y-, as the number of distinct values
seen so far is equal to (or even defined as)
the number of first occurrences seen so far. 

. sort order

. replace y = sum(y)
(9 real changes made)

-order- has served its purpose. Bye-bye! 

. drop order

What have we got? 

. l

     +----------+
     |    x   y |
     |----------|
  1. |  cd1   1 |
  2. |  cd2   2 |
  3. |  cd2   2 |
  4. |  cd3   3 |
  5. |  cd1   3 |
     |----------|
  6. |  cd3   3 |
  7. |  cd4   4 |
  8. |  cd1   4 |
  9. |  cd5   5 |
 10. |  cd3   5 |
     +----------+
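
For anyone who wants to replay the whole sequence, the steps above
can be collected into one do-file sketch. The -input- block simply
retypes the ten values listed earlier, and the -assert- lines at the
end are an added check against the listing, not part of the method.

```stata
* recreate the example data
clear
input str3 x
cd1
cd2
cd2
cd3
cd1
cd3
cd4
cd1
cd5
cd3
end

* remember the original order so that we can return to it
gen long order = _n

* tag the first occurrence of each distinct value of -x-
bysort x (order) : gen y = _n == 1

* back to the original order, then take the running sum of the tags
sort order
replace y = sum(y)
drop order

* spot checks against the listing above
assert y == 1 in 1
assert y == 3 in 5
assert y == 5 in 10
list
```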

Now with a little more knowledge we could wrap that 
up into a command, or better an -egen- function. But
in many ways it is better to use the code here and 
understand its logic, which will help 
for that next problem with a similar flavour. 

The key construct here is -by:-. The documentation
for -by:- is scattered around the manuals. A Mickey Mouse
tutorial bringing together the main ideas was given in 

SJ-2-1  pr0004  . . . . . . . . . . Speaking Stata:  How to move step by: step
        Q1/02   SJ 2(1):86-102                         
        explains the use of the by varlist : construct to tackle
        a variety of problems with group structure, ranging from
        simple calculations for each of several groups to more
        advanced manipulations that use the built-in _n and _N

Nick 
[email protected] 

Mingfeng Lin
 
> I have checked the references as well as the Statalist 
> archive, but couldn't
> seem to find a solution to this.  I was trying to find some 
> command that
> works like a "running" count function; that is, it will give 
> the number of
> unique occurrences from 1 to _n.  For example: variable x is 
> as follows: 
>  
> x 
> cd1
> cd2
> cd2
> cd2
> cd1
> cd3
> cd4
> cd1
> cd5
> cd3
>  
> And I was trying to generate a variable -y- such that 
> 
> x              y
> cd1          1
> cd2          2
> cd2          2
> cd3          3
> cd1          3 //there are cd1, cd2 and cd3 before this
> cd3          3
> cd4          4
> cd1          4
> cd5          5 
> cd3          5
>  
> I am still learning Stata and was hoping that this does not have to be
> solved by programming... Is there a command or package that 
> works like this?

