
# Re: st: Memory requirements for factor variables

From: f.belotti@econometrics.it
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Memory requirements for factor variables
Date: Mon, 3 May 2010 11:03:52 +0200 (CEST)

```
Partha,

I don't think there is a way to do that directly in Stata. An alternative
could be Mata. Of course, you would have to code the estimator for your
econometric model yourself. An example using OLS is below.

HTH

Federico

******  do *******
clear all
set mem 10m
set more off

set seed 123456

set obs 100000

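mata
// OLS of y on x, one indicator per level 1..cols (equal to 1 when any of
// d1-d4 takes that level), and a constant; returns the coefficient vector
real matrix factor_reg(rows,cols,d1,d2,d3,d4,x,y) {

// D[j,i] = 1 if any of d1-d4 equals i for observation j
D = J(rows,cols,0)
for(i=1;i<=cols;i++) {
D[.,i] = (d1:==i) :| (d2:==i) :| (d3:==i) :| (d4:==i)
}
// use rows rather than a hard-coded 100000 for the constant
X = x,D,J(rows,1,1)
beta = invsym(X'X)*(X'y)
return(beta)
}

end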

gen x = rnormal()
gen u = rnormal()
gen int d = int(_n/1000)
gen int d1 = int(_n/1100)
gen int d2 = int(_n/1200)
gen int d3 = int(_n/1300)
gen int d4 = int(_n/1400)

sum

gen y = x + u

describe,s

regress y x i.d

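sum d
* save the maximum now, so the loop below does not depend on r() surviving
local dmax = r(max)

* -tomata- is a user-written command (not part of official Stata); it puts
* the Stata variables into Mata vectors of the same names
tomata
mata: factor_reg(100000,100,d1,d2,d3,d4,x,y)

forvalues i=1/`dmax' {
    gen byte Id`i' = (d1==`i' | d2==`i' | d3==`i' | d4==`i')
}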

describe,s

regress y x Id*

exit

--
Federico Belotti
Faculty of Economics
Department of Financial and Quantitative Economics
University of Rome Tor Vergata
tel: +39 06 7259 5624
e-mail: federico.belotti@uniroma2.it
url: http://www.econometrics.it

> Hi all,
>
> I'm working with a large dataset and am running into the limits of RAM
> on my machine (8G).  I run into this problem when I try to create about
> 500 indicator variables from a set of categorical variables.  If I had
> only one categorical variable from which to create the indicators, I
> would do this directly in my -regress- command.
>
> regress y x i.D
>
> The example below shows that using -i.varname- is considerably more
> memory-efficient as compared to generating the indicators manually
> before -regress- , i.e. if one does,
>
> forvalues i=1/100 {
>     gen byte ID`i' = (D==`i')
> }
>
> If I had only one categorical variable to deal with, I would obviously
> use -i.varname- .  But I need to do something like
>
> forvalues i=1/100 {
>     gen byte ID`i' = (D1==`i' | D2==`i' | D3==`i' | D4==`i')
> }
>
> How can I achieve this in a more memory-efficient way?  Thanks a lot.
> The example do and log are below.
>
> Partha
>
> ******  do *******
> clear all
> set mem 10m
> set more off
>
> set seed 123456
>
> set obs 100000
>
> gen x = rnormal()
> gen u = rnormal()
> gen int d = int(_n/1000)
>
> gen y = x + u
>
> describe,s
>
> qui regress y x i.d
>
> sum d
>
> forvalues i=1/`r(max)' {
>     gen byte Id`i' = (d==`i')
> }
>
> describe,s
>
> regress y x Id*
>
> exit
>
>
> ******* log **********
>
> . clear all
>
> . set mem 10m
>
> Current memory allocation
>
>                     current                                 memory usage
>     settable          value     description                 (1M = 1024k)
>     --------------------------------------------------------------------
>     set maxvar         5000     max. variables allowed           1.909M
>     set memory           10M    max. data space                 10.000M
>     set matsize         400     max. RHS vars in models          1.254M
>                                                             -----------
>                                                                 13.163M
>
> . set more off
>
> .
> . set seed 123456
>
> .
> . set obs 100000
> obs was 0, now 100000
>
> .
> . gen x = rnormal()
>
> . gen u = rnormal()
>
> . gen int d = int(_n/1000)
>
> .
> . gen y = x + u
>
> .
> . describe,s
>
> Contains data
>   obs:       100,000
>  vars:             4
>  size:     2,200,000 (82.8% of memory free)
> Sorted by:
>      Note:  dataset has changed since last saved
>
> .
> . qui regress y x i.d
>
> .
> . sum d
>
>     Variable |       Obs        Mean    Std. Dev.       Min        Max
> -------------+--------------------------------------------------------
>            d |    100000      49.501    28.86623          0        100
>
> .
> . forvalues i=1/`r(max)' {
>   2.         gen byte Id`i' = (d==`i')
>   3. }
> no room to add more variables because of width
>     An attempt was made to add a variable that would have increased the memory required to store
>     an observation beyond what is currently possible.  You have the following alternatives:
>
>      1.  Store existing variables more efficiently; see help compress.
>
>      2.  Drop some variables or observations; see help drop.  (Think of Stata's data area as the
>          area of a rectangle; Stata can trade off width and length.)
>
>      3.  Increase the amount of memory allocated to the data area using the set memory command;
>          see help memory.
> r(902);
>
>
> --
> Partha Deb
> Professor of Economics
> Hunter College
> ph:  (212) 772-5435
> fax: (212) 772-5398
> http://urban.hunter.cuny.edu/~deb/
>
> Emancipate yourselves from mental slavery
> None but ourselves can free our minds.
> 	- Bob Marley
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

```
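A possible refinement of the Mata approach above, offered as a sketch rather than a tested solution: the example builds the full 100000 x 100 indicator matrix in Mata at once, which is about 80 MB of doubles. Since OLS only needs X'X and X'y, those cross-products can be accumulated over blocks of rows, so only one block of indicators is held at a time. The function name `factor_reg_blocks()` and its arguments `nlev` and `blocksize` are hypothetical names introduced here; the data setup mirrors the do-file in the post, and the variables are read with `st_data()` instead of -tomata-.

```stata
* Sketch only: factor_reg_blocks(), nlev and blocksize are hypothetical names,
* not part of the original post or of any Stata package.
clear all
set seed 123456
set obs 100000
gen x = rnormal()
gen y = x + rnormal()
gen int d1 = int(_n/1100)
gen int d2 = int(_n/1200)
gen int d3 = int(_n/1300)
gen int d4 = int(_n/1400)

mata:
// OLS of y on x, indicators for levels 1..nlev (1 if any of d1-d4 equals
// the level), and a constant, accumulating X'X and X'y block by block
real colvector factor_reg_blocks(real scalar nlev, real scalar blocksize,
                                 real colvector d1, real colvector d2,
                                 real colvector d3, real colvector d4,
                                 real colvector x,  real colvector y)
{
    real scalar    n, k, lo, hi, i
    real matrix    XX, Xb
    real colvector Xy, b1, b2, b3, b4

    n  = rows(x)
    k  = nlev + 2                    // x, nlev indicators, constant
    XX = J(k, k, 0)
    Xy = J(k, 1, 0)
    for (lo = 1; lo <= n; lo = lo + blocksize) {
        hi = min((lo + blocksize - 1, n))
        b1 = d1[|lo \ hi|]
        b2 = d2[|lo \ hi|]
        b3 = d3[|lo \ hi|]
        b4 = d4[|lo \ hi|]
        Xb = J(hi - lo + 1, k, 1)    // last column stays 1: the constant
        Xb[., 1] = x[|lo \ hi|]
        for (i = 1; i <= nlev; i++) {
            Xb[., i + 1] = (b1 :== i) :| (b2 :== i) :| (b3 :== i) :| (b4 :== i)
        }
        XX = XX + cross(Xb, Xb)      // add this block's X'X
        Xy = Xy + cross(Xb, y[|lo \ hi|])
    }
    // invsym() zeroes out collinear columns, e.g. levels that never occur
    return(invsym(XX) * Xy)
}

b = factor_reg_blocks(100, 10000,
        st_data(., "d1"), st_data(., "d2"), st_data(., "d3"), st_data(., "d4"),
        st_data(., "x"),  st_data(., "y"))
b[1]    // coefficient on x
end
```

With a block size of 10000, each block of regressors is roughly a tenth the size of the full matrix; a smaller `blocksize` trades speed for memory. The coefficient order matches the example in the post: x first, then the indicators, then the constant.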