



Re: st: Memory requirements for factor variables


From   Federico Belotti <f.belotti@econometrics.it>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Memory requirements for factor variables
Date   Mon, 3 May 2010 10:54:30 +0200

Partha,

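I don't think there is a way to do that in Stata itself. An alternative could be Mata: build the indicators as a Mata matrix instead of as Stata variables. Clearly, you then have to code the estimator for your econometric model yourself (in an ado-file, say). An example using OLS is below.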

HTH

Federico


******  do *******
clear all
set mem 10m
set more off

set seed 123456

set obs 100000

mata
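// OLS of y on x, indicators for levels 1..cols, and a constant.
// Indicator i is 1 when any of d1-d4 equals i.
real matrix factor_reg(rows,cols,d1,d2,d3,d4,x,y) {

	D = J(rows,cols,0)
	for(i=1;i<=cols;i++) {
		for(j=1;j<=rows;j++) {
			if (d1[j]==i | d2[j]==i | d3[j]==i | d4[j]==i) D[j,i]=1
		}
	}
	// use rows rather than a hard-coded 100000 for the constant term
	X = x,D,J(rows,1,1)
	beta = invsym(X'X)*(X'y)
	// return the coefficients; an unassigned call still displays them
	return(beta)
}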

end

gen x = rnormal()
gen u = rnormal()
gen int d = int(_n/1000)
gen int d1 = int(_n/1100)
gen int d2 = int(_n/1200)
gen int d3 = int(_n/1300)
gen int d4 = int(_n/1400)

sum

gen y = x + u

describe,s

regress y x i.d

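sum d
local K = r(max)   // save now: later commands may overwrite r()

// tomata is a user-written command (ssc install tomata);
// it creates Mata views named after each Stata variable
tomata
mata: factor_reg(st_nobs(),`K',d1,d2,d3,d4,x,y)

forvalues i=1/`K' {
	gen byte Id`i' = (d1==`i' | d2==`i' | d3==`i' | d4==`i')
}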

describe,s

regress y x Id*


exit
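
The example above builds the full indicator matrix D in Mata. Mata stores reals as 8-byte doubles, so for 100,000 rows and 100 columns D alone takes about 80 MB. A possible variation, sketched below, accumulates X'X and X'y over blocks of rows, so that only one block of indicators exists at a time. This is only a sketch under assumptions: the function name factor_reg_blocks, its arguments, and the block size are hypothetical and not part of the example above, and it reads the data through st_view() rather than tomata.

```stata
mata
// Hypothetical helper: OLS of y on x, indicators for levels 1..nlev
// (1 if any variable in dvars equals that level), and a constant.
// X'X and X'y are accumulated block by block.
real matrix factor_reg_blocks(string scalar dvars, string scalar xvar,
                              string scalar yvar, real scalar nlev,
                              real scalar bsize)
{
	real matrix d, x, y, Xb, XX, Xy
	real scalar n, k, lo, hi, i

	st_view(d, ., tokens(dvars))   // views on the Stata data, not copies
	st_view(x, ., xvar)
	st_view(y, ., yvar)

	n  = rows(d)
	k  = nlev + 2                  // x, nlev indicators, constant
	XX = J(k, k, 0)
	Xy = J(k, 1, 0)

	for (lo=1; lo<=n; lo=lo+bsize) {
		hi = min((lo+bsize-1, n))
		Xb = J(hi-lo+1, k, 0)
		Xb[., 1] = x[|lo,1 \ hi,1|]
		Xb[., k] = J(hi-lo+1, 1, 1)
		for (i=1; i<=nlev; i++) {
			// 1 if any of the categorical variables equals i in this row
			Xb[., i+1] = (rowsum(d[|lo,1 \ hi,.|] :== i) :> 0)
		}
		XX = XX + cross(Xb, Xb)
		Xy = Xy + cross(Xb, y[|lo,1 \ hi,1|])
	}
	// same column order as X = x,D,constant in the example above
	return(invsym(XX)*Xy)
}
end

mata: factor_reg_blocks("d1 d2 d3 d4", "x", "y", 100, 1000)
```

With a block size of 1,000 and 100 levels, each block is 1,000 x 102 doubles, a little under 1 MB. The coefficients come back in the order x, levels 1 to nlev, constant. Standard errors are not computed here and would need the residual variance from a second pass over the data.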




-- 
Federico Belotti
Faculty of Economics
Department of Financial and Quantitative Economics
University of Rome Tor Vergata 
tel: +39 06 7259 5624
e-mail: federico.belotti@uniroma2.it
url: http://www.econometrics.it


On 3 May 2010, at 00:29, Partha Deb wrote:

> Hi all,
> 
> I'm working with a large dataset and am running into the limits of RAM on my machine (8G).  I run into this problem when I try to create about 500 indicator variables from a set of categorical variables.  If I had only one categorical variable from which to create the indicators, I would do this directly in my -regress- command.
> 
> regress y x i.D
> 
> The example below shows that using -i.varname- is considerably more memory-efficient as compared to generating the indicators manually before -regress- , i.e. if one does,
> 
> forvalues i=1/100 {
> gen byte ID`i' = (D==`i')
> }
> 
> If I had only one categorical variable to deal with, I would obviously use -i.varname- .  But I need to do something like
> 
> forvalues i=1/100 {
> gen byte ID`i' = (D1==`i' | D2==`i' | D3==`i' | D4==`i')
> }
> 
> How can I achieve this in a more memory-efficient way?  Thanks a lot.  The example do-file and log are below.
> 
> Partha
> 
> ******  do *******
> clear all
> set mem 10m
> set more off
> 
> set seed 123456
> 
> set obs 100000
> 
> gen x = rnormal()
> gen u = rnormal()
> gen int d = int(_n/1000)
> 
> gen y = x + u
> 
> describe,s
> 
> qui regress y x i.d
> 
> sum d
> 
> forvalues i=1/`r(max)' {
> gen byte Id`i' = (d==`i')
> }
> 
> describe,s
> 
> regress y x Id*
> 
> exit
> 
> 
> ******* log **********
> 
> . clear all
> 
> . set mem 10m
> 
> Current memory allocation
> 
>                 current                                 memory usage
> settable          value     description                 (1M = 1024k)
> --------------------------------------------------------------------
> set maxvar         5000     max. variables allowed           1.909M
> set memory           10M    max. data space                 10.000M
> set matsize         400     max. RHS vars in models          1.254M
>                                                         -----------
>                                                             13.163M
> 
> . set more off
> 
> .
> . set seed 123456
> 
> .
> . set obs 100000
> obs was 0, now 100000
> 
> .
> . gen x = rnormal()
> 
> . gen u = rnormal()
> 
> . gen int d = int(_n/1000)
> 
> .
> . gen y = x + u
> 
> .
> . describe,s
> 
> Contains data
>   obs:       100,000
>  vars:             4
>  size:     2,200,000 (82.8% of memory free)
> Sorted by:
>      Note:  dataset has changed since last saved
> 
> .
> . qui regress y x i.d
> 
> .
> . sum d
> 
>     Variable |       Obs        Mean    Std. Dev.       Min        Max
> -------------+--------------------------------------------------------
>            d |    100000      49.501    28.86623          0        100
> 
> .
> . forvalues i=1/`r(max)' {
> 2.         gen byte Id`i' = (d==`i')
> 3. }
> no room to add more variables because of width
> An attempt was made to add a variable that would have increased the memory required to store
> an observation beyond what is currently possible.  You have the following alternatives:
> 
>  1.  Store existing variables more efficiently; see help compress.
> 
>  2.  Drop some variables or observations; see help drop.  (Think of Stata's data area as the
>      area of a rectangle; Stata can trade off width and length.)
> 
>  3.  Increase the amount of memory allocated to the data area using the set memory command;
>      see help memory.
> r(902);
> 
> 
> -- 
> Partha Deb
> Professor of Economics
> Hunter College
> ph:  (212) 772-5435
> fax: (212) 772-5398
> http://urban.hunter.cuny.edu/~deb/
> 
> Emancipate yourselves from mental slavery
> None but ourselves can free our minds.
> 	- Bob Marley
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/








