
RE: st: Splitting a dataset efficiently/run regression repeatedly in subsets


From   "Trelle Sven" <[email protected]>
To   <[email protected]>
Subject   RE: st: Splitting a dataset efficiently/run regression repeatedly in subsets
Date   Mon, 15 Nov 2010 17:28:33 +0100

> Maarten buis
> Sent: Monday, November 15, 2010 4:39 PM

> > I have a large (simulated) dataset with 400,000 observations (from 
> > overall 50,000 simulations each creating
> > 8 observations). I need to perform a linear regression for each 
> > simulation separately. I noticed the following:
> > 
> > 1) keeping all observations in the dataset and looping through the 
> > simulations is very inefficient i.e. it takes several hours to run 
> > e.g.
> > * first example starts; run is an ID for simulation
> > gen regcoeff = .
> > forval s=1/50000 {
> >     regress x y if run==`s'
> >     replace regcoeff = _b[y] if _n==`s'
> > }
> > * first example ends
> 
> An -in- condition is often quicker than an -if- condition. 
> You need to do more work to make sure that the -in- condition 
> is appropriate, but that is the price to pay.

I will try this. Thanks.
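
A minimal sketch of what the -in- version of the first example might look like. It assumes that every simulation has exactly 8 observations and that run takes the values 1 to 50000 without gaps, so that after sorting, simulation s occupies rows 8*(s-1)+1 through 8*s. The local macro names first and last are made up for illustration.

```stata
* Sketch only: assumes exactly 8 observations per run and run = 1, 2, ..., 50000
sort run
gen regcoeff = .
forvalues s = 1/50000 {
    * rows belonging to simulation `s' after sorting by run
    local first = 8*(`s' - 1) + 1
    local last  = 8*`s'
    quietly regress x y in `first'/`last'
    * as in the first example, store the coefficient in row `s'
    quietly replace regcoeff = _b[y] in `s'
}
```

If the number of observations per run varies, the row ranges would have to be computed from the data instead, which is the extra work mentioned above.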
 
 
> Anyhow, before doing all this I would start with -statsby-,
> see: -help statsby-.

As always, I wasn't 100% precise ...
The -statsby- command is indeed much quicker (thanks for the advice). However, I also need to predict after each regression, and apparently -statsby- cannot do this.

Consequently, I will try 

1) use the "in" condition instead of "if"
2) use statsby and predict by hand from its results
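
A rough sketch of option 2, using the variables x, y and run from the first example. The names b_y, b_cons, coefs and xhat are made up for illustration, and -merge m:1- assumes Stata 11 or later.

```stata
* Sketch only: collect the coefficients per run with -statsby-,
* merge them back and compute the fitted values by hand
tempfile coefs
preserve
statsby b_y=_b[y] b_cons=_b[_cons], by(run) clear: regress x y
save `coefs'
restore
merge m:1 run using `coefs', nogenerate
* linear prediction from -regress x y-: constant plus slope times y
gen double xhat = b_cons + b_y*y
```

This should reproduce what -predict- returns by default after -regress- (the linear prediction); standard errors of the prediction would need additional saved results.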

Sven

