"Nick Cox" <n.j.cox@durham.ac.uk>

<statalist@hsphsun2.harvard.edu>

st: RE: cycling through individual indices

Wed, 17 Sep 2008 17:56:38 +0100

(You should start new threads, please; not reply to irrelevant previous ones.)

I wouldn't rule out the -reshape- solution. A -reshape- may be slow but once it's done, it's done.

Otherwise I think you can avoid loops over observations. -egen, anycount()- is in essence a wrapper for a loop over variables, as -viewsource _ganycount.ado- will show. So, you too can loop over variables. You just need to customise your loop.

    gen whatever = 0
    qui forval j = 1/624 {
        replace whatever = whatever + (`j' <= Agemonth) * (Var_`j' == <magic_number>)
    }

So (`j' <= Agemonth) evaluates to 1 or 0 as the case may be, and in particular annihilates invalid terms. This operation is automatically vectorised.

Another possibility is just to clean up your data first:

    qui forval j = 1/624 {
        replace Var_`j' = . if `j' > Agemonth
    }

Nick
n.j.cox@durham.ac.uk

Johannes Geyer

I have Stata 10.1 IC and I am trying to create individual-specific sums in a large dataset. The problem is a bit complicated and I have to cycle through all individuals and variables using the "in" qualifier. I am curious whether anyone has an idea how to solve this problem more efficiently.

Here is the problem: The data are in wide format and look like

    ID   Agemonth   Var_1   Var_2 ...   ... Var_623   Var_624
    1    532        2       2               14        14
    2    345        7       7               14        Mis
    3    236        3       3               Mis       Mis
    4    267        2       2               12        12

and so forth; there are about 50,000 observations.

"Agemonth" indicates the observation period, which is individual specific: "1" means January of the year after the person turned 14, "2" is February and so forth. That means e.g. "ID" 1 was observed 532 months after the year he/she turned 14. The index of the variables indicates the same time index. Thus, person 1 was observed from Var_1 until Var_532. Unfortunately, that does not mean that Var_533 or even Var_623 is missing; it may have a value, as in the example above. Var_# has a number of distinct values and I need to sum them up in each case.
If I had no invalid observations I could type

    egen sum1 = anycount(Var_*), values(1)

However, then I also count invalid observations. I ended up looping through individuals (~50,000) and variables (624), summing up one by one, but I really doubt that this is the "best" solution (and hope that it is not):

*******************
#d;
gen sum1 = 0;
sort ID;
gen index = _n;
qui sum index;
forvalues indis = `r(min)'/`r(max)' {;
    di "`indis'";
    forvalues f = 1/624 {;
        if `f' <=Agemonth in `indis' {;
            qui replace sum1 = sum1 + (Var_`f' == 1) in `indis';
        };
    };
};
*******************

Another possibility would be to have the data in long format. However, since I have so many periods it takes a while to reshape the data, even in portions. I tried that with a 10% sample and "reshape" took more than one hour (maybe I have to ask for a better computer...).

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
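For anyone who wants to try the two loop-over-variables suggestions from the reply before running them on the full data, here is a minimal sketch on made-up toy data. The assumptions are mine, not from the thread: four people, six monthly variables instead of 624, invented values, and the value 1 standing in for <magic_number>. The first loop weights each term by whether month j falls inside the person's observation window; the second blanks out the invalid months and then lets -egen, anycount()- do the counting. The two counts should agree.

```stata
* Toy data (made up): 4 people, 6 months instead of 624
clear
input ID Agemonth Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 6 1 2 1 1 3 1
2 3 1 1 2 1 1 .
3 2 3 1 1 . . .
4 4 1 1 1 1 1 1
end

* Approach 1: loop over variables only; the factor (`j' <= Agemonth)
* is 1 for valid months and 0 for invalid ones, observation by observation
gen sum1 = 0
quietly forvalues j = 1/6 {
    replace sum1 = sum1 + (`j' <= Agemonth) * (Var_`j' == 1)
}

* Approach 2: set months beyond the observation window to missing,
* then count with -egen, anycount()- (this changes the data in memory)
quietly forvalues j = 1/6 {
    replace Var_`j' = . if `j' > Agemonth
}
egen sum2 = anycount(Var_*), values(1)

* Both approaches should give the same count for every person
assert sum1 == sum2
list ID Agemonth sum1 sum2
```

Note that the second approach overwrites values in the data in memory, so it is worth running on a copy (or under -preserve-) if the out-of-window values are still needed.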

