Statalist


Re: st: loop question


From   wgould@stata.com (William Gould, StataCorp LP)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: loop question
Date   Tue, 03 Nov 2009 09:59:13 -0600

Sandu Cojocaru <scojocaru@gmail.com> asked, 

> I'm having trouble generating a variable that for each member i equals
> sum(Cj-Ci) over all Cj>Ci where i and j are members of the same group.
> Here's an example of the data setup - I'm trying to calculate
> `outcome_var'.
> For row 1 outcome_var=0, for row 3 = (200-100)+(300-100) = 300...and so on...
> 
>        group_id       member_id         C        outcome_var
>               1               1       300                  0
>               1               2       200                100
>               1               3       100                300
>               2               1       150                 50
>               2               2       200                  0
>               2               3       100                150
>               2               4        50                300
>               3               1     and so on...

This question has already been answered elegantly by Martin Weiss
<martin.weiss1@gmx.de>.  His answer was, 

>       clear*
>
>       input byte(group_id member_id) C 
>       1 1  300  
>       1 2  200  
>       1 3  100  
>       2 1  150  
>       2 2  200  
>       2 3  100  
>       2 4  50   
>       end
>
>       compress
>       list, noo sepby(group_id)
>
>       bys group_id (C):  /* 
>       */ gen diff=C[_n+1]-C[_n]
>       bys group_id: gen num=_N-_n
>       bys  group_id (num): /* 
>       */ gen outcome_var=sum(diff*num)
>       sort group_id member_id
>
>       drop diff num
>       list, noo sepby(group_id)

I'm about to give a different answer.  Sometimes one needs to create a 
variable that is a complicated combination of values in different 
observations.  There is always a way to do it in Stata, but sometimes 
the solution is elusive and one wishes one could just loop across 
the observations and make the calculation directly, even if that solution 
is inefficient.  I want to show how to do that using Mata.
The basic recipe is

    1.  Enter Mata:

                . mata

                : _


    2.  Create individual Mata variables that are views onto each of the
        relevant Stata variables.  In the above, the relevant Stata variables
        are group_id and C, so create Mata variables of the same
        name:

               : st_view(group_id=.,    ., "group_id")
               : st_view(C=.,           ., "C")
               : _

    3.  Go back to Stata and create the desired new variable, filled 
        with missing values.  Create a view onto that, too.  In this 
        example, the desired new variable is outcome_var:

               : end
               . gen outcome_var = .
               . mata
               : st_view(outcome_var=., ., "outcome_var")

    4.  Loop in Mata to fill in the new variable.

Before showing the solution to Sandu's problem, let me show how 
this works in an easier example.


An easy example
----------------

We want to create new variable newx equal to x+1.  We could do this 
in Stata by typing

        . gen newx = x + 1

Alternatively, we could achieve the same result by typing, 

        . gen newx = .

        . mata

        : st_view(x=.,    ., "x")
        : st_view(newx=., ., "newx")

        : for (i=1; i<=st_nobs(); i++) {
        :        newx[i] = x[i] + 1
        : }

        : end

Try it.  The result after typing all that Mata code will be the same 
as -gen newx = x + 1-.

Note the use of Mata function st_nobs() to obtain the number of 
observations in the dataset.
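
For comparison, the same assignment can be written without an explicit 
loop.  The following is a minimal sketch, not part of the recipe above; 
the five-observation dataset is made up for illustration.  It relies on 
the fact that assigning to newx[., .] writes through the view into the 
Stata variable.

```stata
clear
set obs 5
gen x = _n              // hypothetical data: x = 1, 2, ..., 5
gen newx = .

mata:
st_view(x=.,    ., "x")
st_view(newx=., ., "newx")

// Write through the view in one statement.  Note the subscripts on the
// left: a plain "newx = x :+ 1" would rebind newx to a new Mata matrix
// and leave the Stata variable untouched.
newx[., .] = x :+ 1
end

assert newx == x + 1    // check against what -gen newx = x + 1- would give
```

The explicit loop is still the form that generalizes to the harder 
problems below, which is why the recipe uses it.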


Panel data (by)
---------------

Panel data adds complication.  Pretend we wanted to code the Mata 
equivalent to 

        . by group:  gen newx = x + 1

I know the -by group:- prefix adds nothing to the statement, but 
at this point I want to keep the example simple.

The equivalent Mata code is, 

        . gen newx = .

        . mata

        : st_view(group=., ., "group")
        : st_view(x=.,     ., "x")
        : st_view(newx=.,  ., "newx")

        : obs = panelsetup(group, 1)

        : for (g=1; g<=rows(obs); g++) {
        :        for (i=obs[g,1]; i<=obs[g,2]; i++) {
        :                newx[i] = x[i] + 1
        :        }
        : }

        : end

In the above code, I assume the data are already sorted by group.

Note the line 

        : obs = panelsetup(group, 1)

If we had two groups -- it wouldn't matter if they were numbered 1 and 2
or 6*_pi and 9 -- and we had three observations in the first group 
and five in the second, matrix obs would contain 

        1  3
        4  8

The first row states the observation numbers corresponding to the first 
group (1 to 3); the second row states the observation numbers 
corresponding to the second (4 through 8).  The matrix has two rows 
because there are two groups.  The matrix always has 2 columns. 
See -help mata panelsetup()-.
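
To see this matrix for yourself, here is a small sketch that runs 
panelsetup() on a hand-typed column vector rather than on a view.  The 
group values 6 and 9 are arbitrary and chosen only to show that the 
numbering does not matter; the vector must already be sorted by group.

```stata
mata:
// Hypothetical group identifiers: three observations in the first
// group, five in the second.
group = (6 \ 6 \ 6 \ 9 \ 9 \ 9 \ 9 \ 9)

// Column 1 of group holds the panel identifier.
obs = panelsetup(group, 1)
obs

// One row per group: first and last observation number of each.
assert(obs == (1, 3 \ 4, 8))
end
```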

In the loop that follows, the outer loop (g) loops across the by 
groups.  The inner loop (i) loops across the observations within 
the group.


Putting it all together; the solution to Sandu's problem
--------------------------------------------------------

Here is the solution to Sandu's problem:

        . sort group_id
        . gen outcome_var = . 

        . mata

        : st_view(group_id=.,     ., "group_id")
        : st_view(C=.,            ., "C")
        : st_view(outcome_var=.,  ., "outcome_var")

        : obs = panelsetup(group_id, 1)

        : for (g=1; g<=rows(obs); g++) {
        :         for (i=obs[g,1]; i<=obs[g,2]; i++) {
        :                 sum = 0
        :                 for (j=obs[g,1]; j<=obs[g,2]; j++) {
        :                         if (C[j]>C[i]) sum = sum + (C[j]-C[i])
        :                 }
        :                 outcome_var[i] = sum
        :         }
        : }

        : end

Note the line 

                   if (C[j]>C[i]) sum = sum + (C[j]-C[i])

That line is coded almost exactly as Sandu stated the problem:
He requested the sum(Cj-Ci) over all Cj>Ci where i and j are members 
of the same group.

In the code above, the outer loop (g) loops over group_id.  The next 
loop (i) loops over the members of the group.  The inner loop (j) 
also loops over the members of the group so that we obtain all 
combinations of i and j. 
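
A quick way to check the loop is to run it on Sandu's seven example 
rows and compare with the outcome_var values he listed.  The sketch 
below does that; the variable name expected and the accumulator name s 
are introduced here for illustration only.

```stata
clear
input byte(group_id member_id) C expected
1 1 300   0
1 2 200 100
1 3 100 300
2 1 150  50
2 2 200   0
2 3 100 150
2 4  50 300
end

sort group_id
gen outcome_var = .

mata:
st_view(group_id=.,    ., "group_id")
st_view(C=.,           ., "C")
st_view(outcome_var=., ., "outcome_var")

obs = panelsetup(group_id, 1)

for (g=1; g<=rows(obs); g++) {
        for (i=obs[g,1]; i<=obs[g,2]; i++) {
                s = 0                           // running total for member i
                for (j=obs[g,1]; j<=obs[g,2]; j++) {
                        if (C[j]>C[i]) s = s + (C[j]-C[i])
                }
                outcome_var[i] = s
        }
}
end

assert outcome_var == expected   // stops with an error if any row differs
list, noobs sepby(group_id)
```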

Martin's solution executes more quickly than the above solution.  I
tried both solutions on 5,000 groups, each with 100 members.  Martin's
solution ran in 2.28 seconds.  Mine took 35 seconds!  That's not so
much Mata's fault as mine.  My solution is not clever; I performed
the -if (C[j]>C[i]) sum = sum + (C[j]-C[i])- statement 50,000,000 times!
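
One possible middle ground, offered only as a sketch, is to keep the 
loops over g and i but replace the innermost j loop with a single 
vector expression on the group's values, obtained with 
panelsubmatrix().  I have not timed this variant, so treat any speed 
gain as an assumption.  It also assumes C has no missing values.  The 
simulated data and the names loop_var and vec_var are hypothetical; 
the final assert only checks that the two versions agree.

```stata
clear
set seed 12345
set obs 1000
gen group_id = ceil(_n/10)           // 100 groups of 10, already sorted
gen C = floor(1000*runiform())       // made-up integer values
gen loop_var = .
gen vec_var  = .

mata:
st_view(group_id=., ., "group_id")
st_view(C=.,        ., "C")
st_view(loop_var=., ., "loop_var")
st_view(vec_var=.,  ., "vec_var")

obs = panelsetup(group_id, 1)

for (g=1; g<=rows(obs); g++) {
        c = panelsubmatrix(C, g, obs)        // values of C for group g
        for (i=1; i<=rows(c); i++) {
                // Triple-loop version, as in the text
                s = 0
                for (j=1; j<=rows(c); j++) {
                        if (c[j]>c[i]) s = s + (c[j]-c[i])
                }
                loop_var[obs[g,1]+i-1] = s

                // Vector version: keep only the positive differences
                d = c :- c[i]                // Cj - Ci for every j in group
                vec_var[obs[g,1]+i-1] = colsum(d :* (d :> 0))
        }
}
end

assert loop_var == vec_var
```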

So what?  My solution was not clever and neither did it depend on me 
being clever.  I wonder which one of us had a solution to this problem 
sooner?  I just plugged into the recipe:

    1.  Enter Mata.

    2.  Create individual Mata variables that are views onto each of the
        relevant Stata variables.

    3.  Go back to Stata and create the desired new variable, filled 
        with missing values.  Create a view onto that, too.

    4.  Loop in Mata to fill in the new variable.

The only new code I wrote was for (4), and that read

        : for (g=1; g<=rows(obs); g++) {
        :         for (i=obs[g,1]; i<=obs[g,2]; i++) {
        :                 sum = 0
        :                 for (j=obs[g,1]; j<=obs[g,2]; j++) {
        :                         if (C[j]>C[i]) sum = sum + (C[j]-C[i])
        :                 }
        :                 outcome_var[i] = sum
        :         }
        : }

-- Bill
wgould@stata.com