Statalist



st: RE: cycling through individual indices


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: cycling through individual indices
Date   Wed, 17 Sep 2008 17:56:38 +0100

(You should start new threads, please; not reply to irrelevant previous
ones.) 

I wouldn't rule out the -reshape- solution. A -reshape- may be slow but
once it's done, it's done. 
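
For illustration only, here is a minimal sketch of what the -reshape- route might look like on toy data with 4 periods rather than 624. The variable names month and count1 are made up for this example, and the value being counted (1) is just the one from the question.

```stata
* Toy data: 4 periods instead of 624 (values are invented for illustration)
clear
input ID Agemonth Var_1 Var_2 Var_3 Var_4
1 4 1 1 2 1
2 2 1 2 1 1
3 1 1 1 1 .
end

* One row per person-month; the stub Var_ becomes a single variable
reshape long Var_, i(ID) j(month)

* Discard months outside each person's observation window
drop if month > Agemonth

* Count valid months in which the value 1 occurs, person by person
egen count1 = total(Var_ == 1), by(ID)

* Back to one row per person
bysort ID: keep if _n == 1
keep ID Agemonth count1
list
```

Once the data are long, counts for other values need only a different expression inside total(), with no further reshaping.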

Otherwise I think you can avoid loops over observations. 

-egen, anycount()- is in essence a wrapper for a loop over variables, as
-viewsource _ganycount.ado- will show. So, you too can loop over
variables. You just need to customise your loop. 

gen whatever = 0 
qui forval j = 1/624 { 
	replace whatever = whatever + (`j' <= Agemonth) * (Var_`j' == <magic_number>) 
} 

So (`j' <= Agemonth) evaluates to 1 or 0 as the case may be, and in
particular annihilates invalid terms. Each -replace- acts on all
observations at once, so the operation is automatically vectorised. 
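
As a sketch of how such a loop behaves, here is a self-contained toy version with 4 periods instead of 624, using the loop index `j' in the validity check. The data values and the name count1 are invented for this example, and 1 stands in for the magic number.

```stata
* Toy data: 4 periods instead of 624 (values are invented for illustration)
clear
input ID Agemonth Var_1 Var_2 Var_3 Var_4
1 4 1 1 2 1
2 2 1 2 1 1
3 1 1 1 1 .
end

gen count1 = 0
qui forval j = 1/4 {
	* (`j' <= Agemonth) is 1 while period j lies inside the person's
	* observation window and 0 afterwards, so invalid periods add nothing
	replace count1 = count1 + (`j' <= Agemonth) * (Var_`j' == 1)
}

list ID Agemonth count1
```

Person 2 has Var_3 and Var_4 equal to 1, but with Agemonth of 2 those periods are multiplied by 0 and do not enter the count.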

Another possibility is just to clean up your data first: 

qui forval j = 1/624 { 
	replace Var_`j' = . if `j' > Agemonth 
} 
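
A sketch of that clean-up route on toy data follows, again with 4 periods and invented values. After the invalid periods are set to missing, the -egen, anycount()- call from the question can be used as is. Note that this overwrites the original values, so in practice it may be safer to work on a copy of the data or inside -preserve- and -restore-.

```stata
* Toy data: 4 periods instead of 624 (values are invented for illustration)
clear
input ID Agemonth Var_1 Var_2 Var_3 Var_4
1 4 1 1 2 1
2 2 1 2 1 1
3 1 1 1 1 .
end

* Blank out every period beyond each person's observation window
qui forval j = 1/4 {
	replace Var_`j' = . if `j' > Agemonth
}

* With invalid periods now missing, anycount() counts only valid ones
egen count1 = anycount(Var_*), values(1)

list ID Agemonth count1
```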

Nick 
n.j.cox@durham.ac.uk 

Johannes Geyer

I have Stata 10.1 IC and am trying to create individual-specific sums in a 
large dataset. The problem is a bit complicated: at present I cycle 
through all individuals and variables using the "in" qualifier. I am 
curious whether anyone has an idea how to solve this more efficiently. 
Here is the problem:

The data are in wide format and look like

ID      Agemonth        Var_1   Var_2   ...     Var_623 Var_624
1       532             2       2       ...     14      14
2       345             7       7       ...     14      Mis
3       236             3       3       ...     Mis     Mis
4       267             2       2       ...     12      12

and so forth; there are about 50,000 observations. "Agemonth" indicates
the observation period, which is individual-specific: "1" means January
of the year after the person turned 14, "2" is February, and so forth.
That means, e.g., that "ID" 1 was observed 532 months after the year
he/she turned 14. The suffix of each Var_# variable uses the same time
index. Thus, person 1 was observed from Var_1 until Var_532.
Unfortunately, that does not mean that Var_533 or even Var_623 is
missing; it may have a value, as in the example above.

Var_# takes a number of distinct values, and for each value I need to
count how often it occurs for each person. If I had no invalid
observations I could type

egen sum1 = anycount(Var_*), values(1)

However, that also counts the invalid observations.

I ended up looping over individuals (~50,000) and variables (624),
summing up one by one, but I really doubt that this is the "best"
solution (and hope that it is not):

*******************
#d;

gen sum1 = 0;
sort ID;
gen index = _n;
qui sum index;

forvalues indis = `r(min)'/`r(max)' {;

        di "`indis'";

        forvalues f = 1/624 {;

                if `f' <= Agemonth[`indis'] {;
                        qui replace sum1 = sum1 + (Var_`f' == 1) in `indis';
                        };

                };
};
*******************

Another possibility would be to have the data in long format. However,
since I have so many periods, it takes a while to reshape the data, even
in portions. I tried that with a 10% sample and "reshape" took more than
one hour (maybe I have to ask for a better computer...). 




