From: Maarten Buis <maartenbuis@yahoo.co.uk>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Splitting a dataset efficiently/run regression repeatedly in subsets
Date: Mon, 15 Nov 2010 15:39:01 +0000 (GMT)

--- On Mon, 15/11/10, Trelle Sven wrote:
> I have a large (simulated) dataset with 400,000
> observations (from overall 50,000 simulations each creating
> 8 observations). I need to perform a linear regression for
> each simulation separately. I noticed the following:
>
> 1) keeping all observations in the dataset and looping
> through the simulations is very inefficient i.e. it takes
> several hours to run e.g.
> * first example starts; run is an ID for simulation
> gen regcoeff = .
> forval s=1/50000 {
>     regress x y if run==`s'
>     replace regcoeff = _b[y] if _n==`s'
> }
> * first example ends

An -in- condition is often quicker than an -if- condition, because
-if- has to be evaluated for every observation on every pass through
the loop, while -in- goes straight to the requested rows. You need to
do more work to make sure that the -in- condition is appropriate (the
data must be sorted so that each simulation occupies a known block of
rows), but that is the price to pay.

> 2) preserving and restoring is even more time-consuming

That makes sense.

> 3) I thought of creating a loop as before but load the data
> at the beginning and then keeping only the data for the
> particular simulation.

Sounds like that would be slow also. Anyhow, before doing all this I
would start with -statsby-, see: -help statsby-.

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------
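
A minimal sketch of the two suggestions in this reply, -statsby- and an -in- range in place of -if-, for readers who want to try them. It is not code from the original thread. The variable names run, x and y come from the quoted example; the toy data, the seed, the count of 100 runs and the locals first and last are assumptions made only for illustration. The -in- loop further assumes that the data are sorted by run and that every run has exactly 8 observations.

```stata
* Toy data standing in for the real file: 100 runs of 8 observations
* each (the original question has 50,000 runs). Assumed for illustration.
clear
set seed 12345
set obs 800
gen long run = ceil(_n/8)
gen y = rnormal()
gen x = 0.5*y + rnormal()

* Suggestion 1: -statsby- runs the regression once per run and leaves
* a dataset with one row per run holding the slope on y.
preserve
statsby regcoeff=_b[y], by(run) clear: regress x y
list in 1/5
restore

* Suggestion 2: loop with -in- instead of -if-.
* The row arithmetic below is only valid if every run has exactly
* 8 observations and the data are sorted by run, so check that first.
bysort run: assert _N == 8

gen regcoeff = .
forvalues s = 1/100 {
    local first = 8*(`s' - 1) + 1    // first row of run `s'
    local last  = 8*`s'              // last row of run `s'
    quietly regress x y in `first'/`last'
    * same convention as the quoted code: slope of run s goes in row s
    quietly replace regcoeff = _b[y] in `s'
}
```

With -statsby- the results end up in their own dataset, one row per run, which is usually easier to work with afterwards than coefficients stored in the first rows of the simulation data.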